Performance Best Practices

Optimize QuickETL pipeline execution for speed and efficiency.

Backend Selection

Choose the right backend for your workload:

Scenario                  Recommended Backend
Local files < 1GB         DuckDB
Local files 1-10GB        DuckDB or Polars
Local files > 10GB        Polars (streaming)
Distributed processing    Spark
Data in warehouse         Snowflake/BigQuery
Quick prototyping         DuckDB

Quick Comparison

# Test different backends
time quicketl run pipeline.yml --engine duckdb
time quicketl run pipeline.yml --engine polars

File Format Optimization

Use Parquet

Parquet is significantly faster to read and process than CSV:

# Slow: CSV
source:
  type: file
  path: data/input.csv
  format: csv

# Fast: Parquet
source:
  type: file
  path: data/input.parquet
  format: parquet

Why Parquet is faster:

  • Columnar storage (read only needed columns)
  • Built-in compression
  • Type preservation (no parsing)
  • Predicate pushdown
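
The column-pruning benefit is easy to verify outside QuickETL. A minimal sketch with pyarrow (assuming it is installed and a Parquet file exists at the path shown):

import pyarrow.parquet as pq

# Read every column, then read only the two a downstream step needs.
# Because Parquet is columnar, the second call skips the other columns
# entirely instead of parsing and discarding them.
full = pq.read_table("data/input.parquet")
subset = pq.read_table("data/input.parquet", columns=["id", "amount"])

print(full.nbytes, subset.nbytes)  # the pruned read materializes far less data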

Convert CSV to Parquet

One-time conversion:

name: convert_to_parquet
source:
  type: file
  path: data/large_file.csv
  format: csv
sink:
  type: file
  path: data/large_file.parquet
  format: parquet
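
If pandas is available, a one-off conversion can also be sketched in two lines (hypothetical paths; to_parquet needs a Parquet engine such as pyarrow installed):

import pandas as pd

# One-off conversion outside the pipeline; types are inferred while parsing the CSV
df = pd.read_csv("data/large_file.csv")
df.to_parquet("data/large_file.parquet", index=False)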

Transform Optimization

1. Filter Early

Reduce data volume before expensive operations:

transforms:
  # Good: Filter first
  - op: filter
    predicate: date >= '2025-01-01' AND status = 'active'

  - op: join  # Joins fewer rows
    right: ...

  - op: aggregate  # Aggregates fewer rows
    group_by: ...

Impact: depending on how selective the predicate is, filtering first can cut processing time by one to two orders of magnitude.
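
The same effect can be seen by driving one of the backends directly. A minimal Polars sketch (hypothetical file and column names) in which the predicate is applied during the scan, so the aggregate only ever sees the surviving rows:

import polars as pl

# Lazy query: the filter runs as the file is scanned, before any aggregation
result = (
    pl.scan_parquet("data/input.parquet")
    .filter((pl.col("status") == "active") & (pl.col("date") >= pl.date(2025, 1, 1)))
    .group_by("category")
    .agg(pl.col("amount").sum().alias("total"))
    .collect()
)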

2. Select Early

Only keep columns you need:

transforms:
  # Good: Select needed columns early
  - op: select
    columns: [id, date, amount, category]

  # Now operations work with less data
  - op: aggregate
    group_by: [category]
    aggregations:
      total: sum(amount)
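
A sketch of what this means for the DuckDB backend (hypothetical path; requires the duckdb package): only the columns named in the query are ever read from the Parquet file.

import duckdb

# DuckDB reads just the category and amount columns from the columnar file;
# every other column is skipped entirely.
totals = duckdb.sql(
    "SELECT category, SUM(amount) AS total "
    "FROM 'data/input.parquet' "
    "GROUP BY category"
)
print(totals)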

3. Avoid Unnecessary Operations

# Bad: Unnecessary sort before aggregate
transforms:
  - op: sort
    by: [{column: date, order: asc}]
  - op: aggregate  # Aggregate doesn't need sorted input
    group_by: [category]

# Good: Remove unnecessary sort
transforms:
  - op: aggregate
    group_by: [category]

4. Combine Filters

# Less efficient: Multiple filter operations
transforms:
  - op: filter
    predicate: status = 'active'
  - op: filter
    predicate: date >= '2025-01-01'
  - op: filter
    predicate: amount > 0

# More efficient: Single filter
transforms:
  - op: filter
    predicate: |
      status = 'active'
      AND date >= '2025-01-01'
      AND amount > 0

Join Optimization

1. Filter Before Joining

transforms:
  # Filter main table first
  - op: filter
    predicate: date >= '2025-01-01'

  # Then join (fewer rows to match)
  - op: join
    right:
      type: file
      path: data/dimension.parquet
      format: parquet
    on: [id]
    how: left

2. Join Against Smaller Tables

Put the larger table on the left:

# Source is large (1M rows)
source:
  type: file
  path: data/transactions.parquet  # 1M rows

transforms:
  # Join with smaller dimension table (10K rows)
  - op: join
    right:
      type: file
      path: data/products.parquet  # 10K rows
      format: parquet
    on: [product_id]
    how: left
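
Expressed directly in Polars, the same layout looks roughly like this (hypothetical file names; a sketch of the principle, not QuickETL's internal plan):

import polars as pl

# Large fact table on the left, small dimension table on the right
transactions = pl.scan_parquet("data/transactions.parquet")
products = pl.scan_parquet("data/products.parquet")

enriched = transactions.join(products, on="product_id", how="left").collect()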

3. Use Appropriate Join Type

# inner: Only matching rows (smallest result)
- op: join
  how: inner

# left: All left rows (may include NULLs)
- op: join
  how: left

Memory Management

1. Process in Chunks

For very large files, use streaming-capable backends:

engine: polars  # Supports streaming

source:
  type: file
  path: data/huge_file.parquet
  format: parquet
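
What "streaming-capable" means in practice, sketched with Polars directly (hypothetical file; newer Polars releases may spell the streaming option differently):

import polars as pl

# The lazy scan plus streaming collect processes the file in batches,
# keeping peak memory well below the size of the input.
result = (
    pl.scan_parquet("data/huge_file.parquet")
    .filter(pl.col("amount") > 0)
    .collect(streaming=True)
)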

2. Reduce Column Count

transforms:
  - op: select
    columns: [id, amount, date]  # Only what's needed

3. Use Appropriate Data Types

transforms:
  - op: cast
    columns:
      id: int32      # Instead of int64
      amount: float32  # Instead of float64
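
The saving is easy to measure outside the pipeline. A small pandas sketch on synthetic data, comparing the footprint of 64-bit and 32-bit columns:

import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    "id": np.arange(n, dtype="int64"),
    "amount": np.random.rand(n),          # float64 by default
})

small = df.astype({"id": "int32", "amount": "float32"})

# Each downcast column uses half the memory of its 64-bit counterpart
print(df.memory_usage().sum(), small.memory_usage().sum())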

Parallel Execution

Multiple Independent Pipelines

Run independent pipelines in parallel:

# Sequential (slow)
quicketl run pipeline1.yml
quicketl run pipeline2.yml
quicketl run pipeline3.yml

# Parallel (fast)
quicketl run pipeline1.yml &
quicketl run pipeline2.yml &
quicketl run pipeline3.yml &
wait
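
The same fan-out can be scripted in Python by launching one quicketl process per pipeline (a sketch built on the CLI above; the processes run concurrently, so wall-clock time is roughly that of the slowest pipeline):

import subprocess
from concurrent.futures import ThreadPoolExecutor

pipelines = ["pipeline1.yml", "pipeline2.yml", "pipeline3.yml"]

def run(path):
    # Each call spawns an independent quicketl process; check=True raises on failure
    return subprocess.run(["quicketl", "run", path], check=True)

with ThreadPoolExecutor(max_workers=len(pipelines)) as pool:
    list(pool.map(run, pipelines))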

Spark Parallelism

For Spark backend:

export SPARK_EXECUTOR_INSTANCES=10
export SPARK_EXECUTOR_CORES=4
export SPARK_EXECUTOR_MEMORY=8g

quicketl run pipeline.yml --engine spark

I/O Optimization

1. Use Local Storage

Local SSD is typically much faster than network-mounted storage:

# Fast: Local SSD
source:
  type: file
  path: /local/ssd/data.parquet

# Slower: Network mount
source:
  type: file
  path: /mnt/network/data.parquet

2. Minimize Network Calls

For cloud storage, batch reads:

# Efficient: Read all matching files at once
source:
  type: file
  path: s3://bucket/data/*.parquet  # Single glob
  format: parquet

3. Compress Output

sink:
  type: file
  path: output/results.parquet
  format: parquet
  options:
    compression: snappy  # Fast compression

Benchmarking

Measure Execution Time

time quicketl run pipeline.yml

JSON Metrics

quicketl run pipeline.yml --json | jq '.duration_ms'

Compare Backends

import time
from quicketl import Pipeline

backends = ["duckdb", "polars", "pandas"]

for backend in backends:
    # Rebuild the pipeline each run so no state carries over between backends
    pipeline = Pipeline.from_yaml("pipeline.yml")
    start = time.perf_counter()
    pipeline.run(engine=backend)
    duration = time.perf_counter() - start
    print(f"{backend}: {duration:.2f}s")

Profiling

Verbose Output

quicketl run pipeline.yml --verbose

Shows timing for each step.

Python Profiling

import cProfile
from quicketl import Pipeline

pipeline = Pipeline.from_yaml("pipeline.yml")
cProfile.run("pipeline.run()", sort="cumtime")

Common Bottlenecks

Symptom            Likely Cause          Solution
Slow start         Large CSV parsing     Use Parquet
Memory error       Too much data         Filter early, use Polars
Slow joins         Large tables          Filter before join
Slow writes        Many small files      Batch writes
Network timeout    Cloud storage         Use local cache

Performance Checklist

  • Using Parquet instead of CSV?
  • Filtering early in pipeline?
  • Selecting only needed columns?
  • Using appropriate backend for data size?
  • Joins ordered correctly (large left, small right)?
  • No unnecessary transforms?
  • Output compressed?