Transforms¶
Transforms modify data as it flows through your pipeline. QuickETL provides 12 built-in transform operations.
Overview¶
Transforms are applied in sequence. Each transform takes the output of the previous step as input.
transforms:
  - op: filter              # Step 1: Filter rows
    predicate: amount > 0
  - op: derive_column       # Step 2: Add column (uses filtered data)
    name: tax
    expr: amount * 0.1
  - op: aggregate           # Step 3: Aggregate (uses data with tax column)
    group_by: [category]
    aggs:
      total: sum(amount)
Transform Operations¶
Data Selection¶
| Transform | Purpose | Example |
|---|---|---|
| `select` | Choose columns | Keep only `id`, `name`, `amount` |
| `rename` | Rename columns | Change `cust_id` to `customer_id` |
| `limit` | Limit rows | Take first 1000 rows |
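As a minimal sketch, the selection transforms chain naturally to trim a dataset down to the columns and rows you need (the column names here are illustrative):

transforms:
  - op: select              # keep only the columns we need
    columns: [cust_id, name, amount]
  - op: rename              # standardize column names
    mapping:
      cust_id: customer_id
  - op: limit               # cap the output at 1000 rows
    n: 1000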
Data Filtering¶
| Transform | Purpose | Example |
|---|---|---|
| `filter` | Filter rows | Keep rows where `amount > 100` |
| `dedup` | Remove duplicates | Keep unique `customer_id` |
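A short sketch combining both filtering transforms (the field names are illustrative):

transforms:
  - op: filter              # keep only meaningful rows
    predicate: amount > 100
  - op: dedup               # then drop repeat customers
    columns: [customer_id]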
Data Modification¶
| Transform | Purpose | Example |
|---|---|---|
| `derive_column` | Add computed column | Calculate `total = qty * price` |
| `cast` | Convert types | Change `id` from int to string |
| `fill_null` | Replace nulls | Set null `status` to `"unknown"` |
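The modification transforms are often used together: cast types first, fill defaults, then derive new values. A sketch with illustrative column names:

transforms:
  - op: cast                # normalize types first
    columns:
      qty: float64
      price: float64
  - op: fill_null           # default missing statuses
    columns:
      status: "unknown"
  - op: derive_column       # compute the line total
    name: total
    expr: qty * price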
Data Organization¶
| Transform | Purpose | Example |
|---|---|---|
| `sort` | Order rows | Sort by `amount` descending |
| `aggregate` | Group and summarize | Sum `amount` by `region` |
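A sketch that aggregates and then orders the summary (column names are illustrative):

transforms:
  - op: aggregate           # sum amount per region
    group_by: [region]
    aggs:
      total: sum(amount)
  - op: sort                # then rank regions by total
    by: [total]
    descending: true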
Data Combination¶
| Transform | Purpose | Example |
|---|---|---|
| `join` | Join datasets | Join orders with customers |
| `union` | Stack datasets | Combine daily files |
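A sketch that stacks two extracts and then enriches them (the source and key names are illustrative):

transforms:
  - op: union               # stack the daily extracts
    sources: [day1, day2]
  - op: join                # then enrich with customer attributes
    right: customers
    on: [customer_id]
    how: left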
Quick Reference¶
# Select columns
- op: select
  columns: [id, name, amount]

# Rename columns
- op: rename
  mapping:
    old_name: new_name

# Filter rows
- op: filter
  predicate: amount > 100 AND status = 'active'

# Add computed column
- op: derive_column
  name: total_with_tax
  expr: amount * 1.1

# Convert types
- op: cast
  columns:
    id: string
    amount: float64

# Replace nulls
- op: fill_null
  columns:
    status: "unknown"
    amount: 0

# Remove duplicates
- op: dedup
  columns: [customer_id]

# Sort rows
- op: sort
  by: [amount]
  descending: true

# Join datasets
- op: join
  right: customers
  on: [customer_id]
  how: left

# Aggregate
- op: aggregate
  group_by: [region]
  aggs:
    total: sum(amount)
    count: count(*)

# Combine datasets
- op: union
  sources: [data1, data2]

# Limit rows
- op: limit
  n: 1000
Transform Order Best Practices¶
1. Filter Early¶
Apply filters as early as possible to reduce data volume:
transforms:
  - op: filter              # First: reduce rows
    predicate: date >= '2025-01-01'
  - op: derive_column       # Then: compute on smaller dataset
    name: metric
    expr: complex_calculation
2. Select Before Aggregate¶
Remove unnecessary columns before aggregation:
transforms:
  - op: select              # Remove unused columns
    columns: [category, amount]
  - op: aggregate           # Aggregate on smaller dataset
    group_by: [category]
    aggs:
      total: sum(amount)
3. Derive Before Aggregate¶
Create columns needed for aggregation:
transforms:
  - op: derive_column       # Create column first
    name: net_amount
    expr: amount - discount
  - op: aggregate           # Then aggregate
    group_by: [region]
    aggs:
      total_net: sum(net_amount)
Expression Language¶
Many transforms use SQL-like expressions:
Operators¶
| Type | Operators |
|---|---|
| Arithmetic | +, -, *, / |
| Comparison | =, !=, >, <, >=, <= |
| Logical | AND, OR, NOT |
| Null | IS NULL, IS NOT NULL |
Functions¶
| Category | Functions |
|---|---|
| String | UPPER(), LOWER(), TRIM(), CONCAT() |
| Math | ABS(), ROUND(), FLOOR(), CEIL() |
| Date | EXTRACT(), DATE_TRUNC() |
| Null | COALESCE(), NULLIF() |
| Aggregate | SUM(), AVG(), MIN(), MAX(), COUNT() |
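As an illustration, operators and functions can be combined within a single expression. A sketch using only functions from the table above (the column names are hypothetical):

transforms:
  - op: derive_column
    name: customer_label
    expr: UPPER(TRIM(COALESCE(name, 'unknown')))
  - op: filter
    predicate: ABS(amount) > 0 AND status IS NOT NULL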
See Expression Language for the full reference.
Python API¶
from quicketl import Pipeline  # assumed import path for Pipeline; adjust to your installation
from quicketl.config.transforms import (
    SelectTransform,
    FilterTransform,
    DeriveColumnTransform,
    AggregateTransform,
)

# source and sink are assumed to be defined earlier
pipeline = (
    Pipeline("example")
    .source(source)
    .transform(FilterTransform(predicate="amount > 0"))
    .transform(DeriveColumnTransform(name="tax", expr="amount * 0.1"))
    .transform(AggregateTransform(
        group_by=["category"],
        aggs={"total": "sum(amount)"},
    ))
    .sink(sink)
)
Next Steps¶
Explore all transforms in detail:
- All Transform Operations - Complete reference for all 12 transforms