File Sinks¶
Write data to Parquet and CSV files.
Basic Usage¶
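For example, a minimal sink that writes a single Parquet file (the output path is illustrative):

```yaml
sink:
  type: file                    # file sink
  path: output/sales.parquet    # local path or cloud URI
```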
Configuration¶
| Field | Required | Default | Description |
|---|---|---|---|
| `type` | Yes | - | Must be `file` |
| `path` | Yes | - | Output path or cloud URI |
| `format` | No | `parquet` | Output format: `parquet`, `csv` |
| `partition_by` | No | `[]` | Columns to partition by |
| `mode` | No | `overwrite` | Write mode: `overwrite`, `append` |
Formats¶
Parquet (Recommended)¶
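Since Parquet is the default, `format` can be omitted; spelled out, a Parquet sink looks like this (path illustrative):

```yaml
sink:
  type: file
  path: output/sales.parquet
  format: parquet   # default, shown here for clarity
```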
Parquet advantages:
- Efficient columnar storage
- Built-in compression (snappy by default)
- Schema preservation
- Fast analytical queries
CSV¶
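For CSV output, set `format: csv` (path illustrative); note the trade-offs in the comparison table under Best Practices:

```yaml
sink:
  type: file
  path: output/sales.csv
  format: csv
```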
Write Modes¶
Overwrite (Default)¶
Replace existing data:
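For example (illustrative path; `mode: overwrite` is the default and can be omitted):

```yaml
sink:
  type: file
  path: output/sales.parquet
  mode: overwrite
```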
Append¶
Add to existing data:
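For example (illustrative path):

```yaml
sink:
  type: file
  path: output/sales/
  mode: append
```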
**Append with Parquet:** Appending creates additional files in the output directory. Consider using partitioning for incremental writes.
Partitioning¶
Partition output by column values:
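For example, partitioning sales output by year and month (the columns must exist in the data; see Common Partitioning Patterns below):

```yaml
sink:
  type: file
  path: output/sales/
  format: parquet
  partition_by: [year, month]
```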
This creates a directory structure:
```
output/sales/
├── year=2025/
│   ├── month=01/
│   │   └── data.parquet
│   └── month=02/
│       └── data.parquet
└── year=2024/
    └── month=12/
        └── data.parquet
```
Common Partitioning Patterns¶
By Date¶
```yaml
# First, derive date parts
transforms:
  - op: derive_column
    name: year
    expr: extract(year from date)
  - op: derive_column
    name: month
    expr: extract(month from date)

sink:
  type: file
  path: output/data/
  partition_by: [year, month]
```
By Region¶
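A sketch that partitions by an existing `region` column (column name illustrative):

```yaml
sink:
  type: file
  path: output/sales/
  partition_by: [region]
```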
Multiple Levels¶
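Partition columns can be nested; directories are created in the order listed (columns illustrative):

```yaml
sink:
  type: file
  path: output/sales/
  partition_by: [region, year, month]
```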
Partitioning Benefits¶
- Query performance: Only read relevant partitions
- Incremental updates: Update specific partitions
- Parallel processing: Process partitions independently
- Data management: Delete old partitions easily
Cloud Storage¶
Write to cloud storage:
```yaml
# S3
sink:
  type: file
  path: s3://my-bucket/output/sales.parquet
```

```yaml
# GCS
sink:
  type: file
  path: gs://my-bucket/output/sales.parquet
```

```yaml
# Azure
sink:
  type: file
  path: abfs://container@account.dfs.core.windows.net/output/sales.parquet
```
With partitioning:
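For example, partitioned Parquet written to an S3 prefix (bucket and columns illustrative):

```yaml
sink:
  type: file
  path: s3://my-bucket/output/sales/
  format: parquet
  partition_by: [year, month]
```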
Variables in Paths¶
Use runtime variables:
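The placeholder syntax below is an assumption (`${var}`-style substitution); check the exact form supported by your QuickETL version:

```yaml
sink:
  type: file
  path: output/${env}/sales.parquet   # ${env} is a hypothetical runtime variable
```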
For daily outputs:
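A dated path keeps each day's run separate (again assuming `${...}` substitution and a `run_date` variable supplied at runtime):

```yaml
sink:
  type: file
  path: output/sales/${run_date}/   # e.g. resolves to output/sales/2025-01-15/
  format: parquet
```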
Python API¶
```python
from quicketl.config.models import FileSink

# Basic
sink = FileSink(path="output/sales.parquet")

# With partitioning
sink = FileSink(
    path="output/sales/",
    format="parquet",
    partition_by=["year", "month"],
)

# CSV
sink = FileSink(
    path="output/sales.csv",
    format="csv",
)
```
Best Practices¶
Use Parquet for Analytics¶
Parquet is significantly better for analytical workloads:
| Aspect | Parquet | CSV |
|---|---|---|
| File size | ~4x smaller | Larger |
| Read speed | ~10x faster | Slower |
| Schema | Preserved | Lost |
| Types | Full support | String only |
Partition Large Datasets¶
For datasets over 1 million rows, use partitioning:
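For example, a large sales table split by date columns (derived as shown under Common Partitioning Patterns):

```yaml
sink:
  type: file
  path: output/sales/
  format: parquet
  partition_by: [year, month]
```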
Use Descriptive Paths¶
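Prefer paths that name the dataset rather than generic labels, for example (illustrative):

```yaml
# Clear: names the dataset and its grain
path: output/daily_sales_by_region/

# Vague
path: output/data/
```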
Include Metadata¶
Consider including run metadata:
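One option, reusing the `derive_column` transform shown earlier, is to stamp each row with a load timestamp (the `current_timestamp` expression is an assumption about the expression language):

```yaml
transforms:
  - op: derive_column
    name: loaded_at
    expr: current_timestamp   # assumes a SQL-like expression language

sink:
  type: file
  path: output/sales/
  partition_by: [year, month]
```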
Troubleshooting¶
Permission Denied¶
- Check write permissions on the output directory
- For cloud storage, verify credentials have write access
- Ensure the output path is writable
Path Not Found¶
QuickETL creates directories automatically. If this error occurs:
- Check the parent path is valid
- Verify cloud bucket exists
Disk Full¶
- Check available disk space
- Use cloud storage for large outputs
- Enable compression (automatic with Parquet)
Related¶
- File Sources - Reading files
- Cloud Storage - Cloud provider setup
- Performance - Optimization tips