# Performance Best Practices

Optimize QuickETL pipeline execution for speed and efficiency.
## Backend Selection

Choose the right backend for your workload:

| Scenario | Recommended Backend |
|---|---|
| Local files < 1GB | DuckDB |
| Local files 1-10GB | DuckDB or Polars |
| Local files > 10GB | Polars (streaming) |
| Distributed processing | Spark |
| Data in warehouse | Snowflake/BigQuery |
| Quick prototyping | DuckDB |
### Quick Comparison

```bash
# Test different backends
time quicketl run pipeline.yml --engine duckdb
time quicketl run pipeline.yml --engine polars
```
## File Format Optimization

### Use Parquet

Parquet is significantly faster than CSV:

```yaml
# Slow: CSV
source:
  type: file
  path: data/input.csv
  format: csv
```

```yaml
# Fast: Parquet
source:
  type: file
  path: data/input.parquet
  format: parquet
```
Why Parquet is faster:
- Columnar storage (read only needed columns)
- Built-in compression
- Type preservation (no parsing)
- Predicate pushdown
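As a small illustration (the pipeline name, paths, and column names below are illustrative), a pipeline that selects a few columns and filters on a Parquet source lets the engine read only those column chunks and, where supported, push the predicate into the scan:

```yaml
name: parquet_column_pruning

source:
  type: file
  path: data/input.parquet
  format: parquet

transforms:
  # Only these columns need to be read from the Parquet file
  - op: select
    columns: [id, date, amount]
  # The predicate can be pushed down into the Parquet scan
  - op: filter
    predicate: date >= '2025-01-01'

sink:
  type: file
  path: output/recent.parquet
  format: parquet
```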
### Convert CSV to Parquet

One-time conversion:

```yaml
name: convert_to_parquet

source:
  type: file
  path: data/large_file.csv
  format: csv

sink:
  type: file
  path: data/large_file.parquet
  format: parquet
```
## Transform Optimization

### 1. Filter Early

Reduce data volume before expensive operations:

```yaml
transforms:
  # Good: Filter first
  - op: filter
    predicate: date >= '2025-01-01' AND status = 'active'

  - op: join        # Joins fewer rows
    right: ...

  - op: aggregate   # Aggregates fewer rows
    group_by: ...
```
Impact: filtering first can reduce processing time by 10-100x, depending on how selective the predicate is.
### 2. Select Early

Only keep columns you need:

```yaml
transforms:
  # Good: Select needed columns early
  - op: select
    columns: [id, date, amount, category]

  # Now operations work with less data
  - op: aggregate
    group_by: [category]
    aggregations:
      total: sum(amount)
```
### 3. Avoid Unnecessary Operations

```yaml
# Bad: Unnecessary sort before aggregate
transforms:
  - op: sort
    by: [{column: date, order: asc}]

  - op: aggregate   # Aggregate doesn't need sorted input
    group_by: [category]
```

```yaml
# Good: Remove unnecessary sort
transforms:
  - op: aggregate
    group_by: [category]
```
### 4. Combine Filters

```yaml
# Less efficient: Multiple filter operations
transforms:
  - op: filter
    predicate: status = 'active'
  - op: filter
    predicate: date >= '2025-01-01'
  - op: filter
    predicate: amount > 0
```

```yaml
# More efficient: Single filter
transforms:
  - op: filter
    predicate: |
      status = 'active'
      AND date >= '2025-01-01'
      AND amount > 0
```
## Join Optimization

### 1. Filter Before Joining

```yaml
transforms:
  # Filter main table first
  - op: filter
    predicate: date >= '2025-01-01'

  # Then join (fewer rows to match)
  - op: join
    right:
      type: file
      path: data/dimension.parquet
      format: parquet
    on: [id]
    how: left
```
### 2. Join Smaller Tables

Put the larger table on the left:

```yaml
# Source is large (1M rows)
source:
  type: file
  path: data/transactions.parquet   # 1M rows

transforms:
  # Join with smaller dimension table (10K rows)
  - op: join
    right:
      type: file
      path: data/products.parquet   # 10K rows
      format: parquet
    on: [product_id]
    how: left
```
### 3. Use Appropriate Join Type

```yaml
# inner: Only matching rows (smallest result)
- op: join
  how: inner

# left: All left rows (may include NULLs)
- op: join
  how: left
```
## Memory Management

### 1. Process in Chunks

For very large files, use a streaming-capable backend such as Polars so rows are processed in chunks instead of being loaded into memory all at once.
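A minimal sketch (paths and column names are illustrative): the pipeline definition stays the same; only the engine you pass at run time changes.

```yaml
# Run with a streaming-capable engine, e.g.:
#   quicketl run pipeline.yml --engine polars
name: process_large_file

source:
  type: file
  path: data/very_large_file.parquet   # illustrative path
  format: parquet

transforms:
  # Filter and select early so the streaming engine keeps less data in memory
  - op: filter
    predicate: date >= '2025-01-01'
  - op: select
    columns: [id, date, amount]

sink:
  type: file
  path: output/filtered.parquet
  format: parquet
```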
### 2. Reduce Column Count
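Dropping unneeded columns as early as possible shrinks the working set. A small sketch reusing the `select` transform from Transform Optimization (column names are illustrative):

```yaml
transforms:
  # Keep only the columns the rest of the pipeline needs;
  # wide text columns are often the largest memory cost.
  - op: select
    columns: [id, date, amount, category]
```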
### 3. Use Appropriate Data Types
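Smaller, precise types (for example, 32-bit integers instead of 64-bit, and real date types instead of strings) reduce memory use; Parquet preserves types, so prefer it over CSV. If your QuickETL version provides a cast-style transform, a sketch might look like the following; note that `op: cast` and its fields are hypothetical here, so check your transform reference:

```yaml
transforms:
  # Hypothetical cast transform: verify the op name and fields against
  # your QuickETL transform reference before relying on this.
  - op: cast
    columns:
      quantity: int32
      order_date: date
```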
## Parallel Execution

### Multiple Independent Pipelines

Run independent pipelines in parallel:

```bash
# Sequential (slow)
quicketl run pipeline1.yml
quicketl run pipeline2.yml
quicketl run pipeline3.yml

# Parallel (fast)
quicketl run pipeline1.yml &
quicketl run pipeline2.yml &
quicketl run pipeline3.yml &
wait
```
### Spark Parallelism

For the Spark backend, tune executor resources through environment variables:

```bash
export SPARK_EXECUTOR_INSTANCES=10
export SPARK_EXECUTOR_CORES=4
export SPARK_EXECUTOR_MEMORY=8g

quicketl run pipeline.yml --engine spark
```
## I/O Optimization

### 1. Use Local Storage

Local SSD is faster than network storage:

```yaml
# Fast: Local SSD
source:
  type: file
  path: /local/ssd/data.parquet
```

```yaml
# Slower: Network mount
source:
  type: file
  path: /mnt/network/data.parquet
```
### 2. Minimize Network Calls

For cloud storage, batch reads so many files are fetched in one scan rather than one request per file:

```yaml
# Efficient: Read all matching files at once
source:
  type: file
  path: s3://bucket/data/*.parquet   # Single glob
  format: parquet
```
### 3. Compress Output

```yaml
sink:
  type: file
  path: output/results.parquet
  format: parquet
  options:
    compression: snappy   # Fast compression
```
## Benchmarking

### Measure Execution Time
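Use your shell's `time` command, as shown in the Quick Comparison above, to capture wall-clock time for a complete pipeline run.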
### JSON Metrics

### Compare Backends

```python
import time

from quicketl import Pipeline

backends = ["duckdb", "polars", "pandas"]

for backend in backends:
    pipeline = Pipeline.from_yaml("pipeline.yml")
    start = time.time()
    pipeline.run(engine=backend)
    duration = time.time() - start
    print(f"{backend}: {duration:.2f}s")
```
## Profiling

### Verbose Output

Verbose output shows timing for each step, which helps identify the slowest transform.
### Python Profiling

```python
import cProfile

from quicketl import Pipeline

pipeline = Pipeline.from_yaml("pipeline.yml")
cProfile.run("pipeline.run()", sort="cumtime")
```
## Common Bottlenecks
| Symptom | Likely Cause | Solution |
|---|---|---|
| Slow start | Large CSV parsing | Use Parquet |
| Memory error | Too much data | Filter early, use Polars |
| Slow joins | Large tables | Filter before join |
| Slow writes | Many small files | Batch writes |
| Network timeout | Cloud storage | Use local cache |
## Performance Checklist
- Using Parquet instead of CSV?
- Filtering early in pipeline?
- Selecting only needed columns?
- Using appropriate backend for data size?
- Joins ordered correctly (large left, small right)?
- No unnecessary transforms?
- Output compressed?
## Related
- Backend Selection - Choose the right backend
- DuckDB - Optimize for DuckDB
- Polars - Optimize for Polars