# File Sources
Read data from CSV, Parquet, and JSON files.
## Basic Usage
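A minimal file source needs only a `type` and a `path` (the `source:` wrapper key matches the examples later on this page; the path itself is illustrative):

```yaml
source:
  type: file
  path: data/sales.parquet
```

`format` defaults to `parquet`, so it is omitted here.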
## Configuration
| Field | Required | Default | Description |
|---|---|---|---|
| `type` | Yes | - | Must be `file` |
| `path` | Yes | - | File path or cloud URI |
| `format` | No | `parquet` | File format: `csv`, `parquet`, or `json` |
| `options` | No | `{}` | Format-specific options |
## Formats

### Parquet
Apache Parquet is the recommended format for performance:
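A sketch of a Parquet source (the path is illustrative):

```yaml
source:
  type: file
  path: warehouse/orders.parquet
  format: parquet  # optional: parquet is the default
```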
Parquet benefits:
- Columnar storage (efficient for analytics)
- Built-in compression
- Schema preserved
- Fast reads with predicate pushdown
### CSV
Read CSV files with options:
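For example (the option names come from the table below; the path is illustrative):

```yaml
source:
  type: file
  path: data/sales.csv
  format: csv
  options:
    delimiter: ";"
    header: true
```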
#### CSV Options
| Option | Default | Description |
|---|---|---|
| `delimiter` | `,` | Field separator |
| `header` | `true` | First row contains column names |
| `skip_rows` | `0` | Number of rows to skip |
| `null_values` | `[""]` | Values to interpret as null |
| `quote_char` | `"` | Quote character |
#### CSV Examples
Tab-separated file:
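A sketch; the tab `delimiter` is the only change from the defaults:

```yaml
source:
  type: file
  path: data/sales.tsv
  format: csv
  options:
    delimiter: "\t"
```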
No header row:
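A sketch using the `header` option from the table above:

```yaml
source:
  type: file
  path: data/sales_raw.csv
  format: csv
  options:
    header: false
```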
Custom null values:
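A sketch; the sentinel strings are illustrative:

```yaml
source:
  type: file
  path: data/sales.csv
  format: csv
  options:
    null_values: ["", "NA", "N/A", "NULL"]
```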
### JSON
Read JSON Lines (newline-delimited JSON):
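A sketch (the `.jsonl` extension is illustrative; `format: json` is what selects the reader):

```yaml
source:
  type: file
  path: data/events.jsonl
  format: json
```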
**JSON Lines Format**

QuickETL expects JSON Lines input: each line must be a standalone, valid JSON object.
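For example (the field names are illustrative):

```json
{"id": 1, "region": "EU", "amount": 42.5}
{"id": 2, "region": "US", "amount": 17.0}
```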
## Path Patterns
### Local Files
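Relative paths are the simplest case; they are typically resolved against the directory the pipeline runs from (the path is illustrative):

```yaml
source:
  type: file
  path: data/sales.parquet
```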
### Absolute Paths
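An absolute path avoids any working-directory ambiguity (the path is illustrative):

```yaml
source:
  type: file
  path: /var/data/sales.parquet
```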
### Cloud Storage
See Cloud Storage for detailed setup.
```yaml
# S3
source:
  type: file
  path: s3://my-bucket/data/sales.parquet

# GCS
source:
  type: file
  path: gs://my-bucket/data/sales.parquet

# Azure
source:
  type: file
  path: abfs://container@account.dfs.core.windows.net/data/sales.parquet
```
### Variables in Paths
Use variable substitution for dynamic paths:
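A sketch assuming a `${...}` placeholder syntax; both the variable name and the exact syntax are assumptions, so check the Pipeline YAML reference for the supported form:

```yaml
source:
  type: file
  # ${run_date} is a hypothetical variable supplied at run time
  path: data/sales_${run_date}.parquet
```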
### Glob Patterns
**Coming in v0.2**

Glob patterns for reading multiple files are planned for a future release.
## Python API
```python
from quicketl.config.models import FileSource

# Parquet
source = FileSource(path="data/sales.parquet")

# CSV with options
source = FileSource(
    path="data/sales.csv",
    format="csv",
    options={"delimiter": ";", "header": True},
)

# Cloud storage
source = FileSource(path="s3://bucket/data/sales.parquet")
```
## Performance Tips

### Use Parquet
Parquet is significantly faster than CSV for analytical workloads:
| Format | Read Time (1M rows) | File Size |
|---|---|---|
| CSV | ~2.5s | 100 MB |
| Parquet | ~0.3s | 25 MB |
### Column Selection
With Parquet, only the columns you actually reference are read from disk. Apply a `select` transform as early as possible in the pipeline, as sketched below:
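A sketch assuming a `transforms` list with a `select` step; the step schema and the column names are assumptions, so confirm them against the Pipeline YAML reference:

```yaml
transforms:
  # hypothetical select step: keep only the columns downstream steps need
  - type: select
    columns: [order_id, amount, created_at]
```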
### Compression
Parquet files are automatically compressed. For CSV, consider gzipping:
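A sketch, assuming the underlying reader infers gzip compression from a `.gz` extension (many engines do, but verify yours before relying on it):

```yaml
source:
  type: file
  path: data/sales.csv.gz  # assumption: compression inferred from extension
  format: csv
```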
## Troubleshooting

### File Not Found
- Check the file path is correct
- Use absolute paths if relative paths don't work
- Ensure cloud credentials are configured
### CSV Parsing Errors
- Check that the `delimiter` matches your file
- Verify the `header` setting is correct
- Look for inconsistent row lengths
### Encoding Issues
For files with non-UTF-8 encoding:
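A sketch assuming the CSV reader accepts an `encoding` key under `options`; this option is not listed in the table above, so treat it as hypothetical and confirm support in your version:

```yaml
source:
  type: file
  path: data/legacy.csv
  format: csv
  options:
    encoding: "latin-1"  # hypothetical option; verify it is supported
```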
## Related
- Cloud Storage - S3, GCS, Azure setup
- File Sinks - Writing files
- Pipeline YAML - Full configuration reference