Transform Operations¶
This page documents all 12 transform operations available in QuickETL.
Quick Reference¶
| Transform | Purpose | Example |
|---|---|---|
| `select` | Choose and reorder columns | `columns: [id, name, amount]` |
| `rename` | Rename columns | `mapping: {old: new}` |
| `filter` | Filter rows | `predicate: amount > 100` |
| `derive_column` | Add computed columns | `expr: amount * 1.1` |
| `cast` | Convert types | `columns: {id: string}` |
| `fill_null` | Replace nulls | `columns: {status: "unknown"}` |
| `dedup` | Remove duplicates | `columns: [customer_id]` |
| `sort` | Order rows | `by: [amount]` |
| `join` | Join datasets | `on: [customer_id]` |
| `aggregate` | Group and summarize | `aggs: {total: sum(amount)}` |
| `union` | Combine datasets | `sources: [data1, data2]` |
| `limit` | Limit rows | `n: 1000` |
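Transforms run in the order listed under a `transforms:` key. As a sketch, a small pipeline chaining several of the operations above (column names are illustrative):

```yaml
transforms:
  - op: filter
    predicate: amount > 0
  - op: derive_column
    name: net_amount
    expr: amount - discount
  - op: aggregate
    group_by: [region]
    aggs:
      total_net: sum(net_amount)
  - op: sort
    by: [total_net]
    descending: true
```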
select¶
Choose and reorder columns in your data.
Usage¶
Parameters¶
| Parameter | Required | Type | Description |
|---|---|---|---|
| `columns` | Yes | `list[str]` | Columns to keep, in order |
Examples¶
```yaml
# Keep specific columns
- op: select
  columns: [id, name, amount]

# Reorder columns
- op: select
  columns: [amount, id, name]

# Select after transforms for final output
- op: select
  columns: [region, total_sales, order_count]
```
Python API¶
```python
from quicketl.config.transforms import SelectTransform

transform = SelectTransform(columns=["id", "name", "amount"])
```
rename¶
Rename columns using a mapping.
Usage¶
Parameters¶
| Parameter | Required | Type | Description |
|---|---|---|---|
| `mapping` | Yes | `dict[str, str]` | Old name → new name mapping |
Examples¶
```yaml
# Single column
- op: rename
  mapping:
    cust_id: customer_id

# Multiple columns
- op: rename
  mapping:
    cust_id: customer_id
    order_amt: amount
    created: created_at
```
Python API¶
```python
from quicketl.config.transforms import RenameTransform

transform = RenameTransform(mapping={"cust_id": "customer_id", "order_amt": "amount"})
```
filter¶
Filter rows based on a SQL-like predicate.
Usage¶
Parameters¶
| Parameter | Required | Type | Description |
|---|---|---|---|
| `predicate` | Yes | `str` | SQL-like boolean expression |
Examples¶
```yaml
# Simple comparison
- op: filter
  predicate: amount > 100

# Multiple conditions
- op: filter
  predicate: amount > 100 AND status = 'active'

# OR conditions
- op: filter
  predicate: region = 'north' OR region = 'south'

# Null handling
- op: filter
  predicate: email IS NOT NULL

# Date filtering with variables
- op: filter
  predicate: created_at >= '${START_DATE}' AND created_at < '${END_DATE}'

# IN operator
- op: filter
  predicate: category IN ('Electronics', 'Home', 'Office')

# Pattern matching
- op: filter
  predicate: name LIKE 'Widget%'
```
Operators¶
| Operator | Description | Example |
|---|---|---|
| `=`, `!=`, `<>` | Equality | `status = 'active'` |
| `>`, `<`, `>=`, `<=` | Comparison | `amount > 100` |
| `AND`, `OR`, `NOT` | Logical | `a > 1 AND b < 10` |
| `IS NULL`, `IS NOT NULL` | Null check | `email IS NOT NULL` |
| `IN`, `NOT IN` | List membership | `region IN ('north', 'south')` |
| `BETWEEN` | Range | `amount BETWEEN 100 AND 500` |
| `LIKE` | Pattern match | `name LIKE 'Widget%'` |
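A predicate is evaluated row by row, and only rows where it is true are kept. For intuition, a plain-Python sketch (not QuickETL code; the sample rows are made up) of what `amount > 100 AND status = 'active'` keeps:

```python
# Rows are plain dicts standing in for dataset records.
rows = [
    {"amount": 250, "status": "active"},
    {"amount": 50, "status": "active"},
    {"amount": 300, "status": "closed"},
]

# Equivalent of: predicate: amount > 100 AND status = 'active'
kept = [r for r in rows if r["amount"] > 100 and r["status"] == "active"]
```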
Python API¶
```python
from quicketl.config.transforms import FilterTransform

transform = FilterTransform(predicate="amount > 100 AND status = 'active'")
```
derive_column¶
Create a new computed column from an expression.
Usage¶
Parameters¶
| Parameter | Required | Type | Description |
|---|---|---|---|
| `name` | Yes | `str` | Name for the new column |
| `expr` | Yes | `str` | SQL-like expression |
Examples¶
```yaml
# Arithmetic
- op: derive_column
  name: total_with_tax
  expr: amount * 1.1

- op: derive_column
  name: profit_margin
  expr: (revenue - cost) / revenue * 100

# String operations
- op: derive_column
  name: full_name
  expr: concat(first_name, ' ', last_name)

# Date extraction
- op: derive_column
  name: year
  expr: extract(year from created_at)

# Conditional logic
- op: derive_column
  name: size_category
  expr: |
    CASE
      WHEN amount < 100 THEN 'small'
      WHEN amount < 1000 THEN 'medium'
      ELSE 'large'
    END

# Null handling
- op: derive_column
  name: discount_safe
  expr: COALESCE(discount, 0)
```
Functions¶
| Category | Functions |
|---|---|
| String | upper(), lower(), trim(), concat(), substring(), length() |
| Math | abs(), round(), floor(), ceil() |
| Date | extract(), date_trunc() |
| Null | COALESCE(), NULLIF() |
Python API¶
```python
from quicketl.config.transforms import DeriveColumnTransform

transform = DeriveColumnTransform(name="total_with_tax", expr="amount * 1.1")
```
cast¶
Convert column data types.
Usage¶
Parameters¶
| Parameter | Required | Type | Description |
|---|---|---|---|
| `columns` | Yes | `dict[str, str]` | Column → target type mapping |
Supported Types¶
| Type | Aliases | Description |
|---|---|---|
| `string` | `str` | Text/string |
| `int64` | `int`, `integer` | 64-bit integer |
| `float64` | `float`, `double` | 64-bit float |
| `boolean` | `bool` | True/False |
| `date` | | Date (no time) |
| `datetime` | `timestamp` | Date with time |
Examples¶
```yaml
# Strings to numbers
- op: cast
  columns:
    quantity: int64
    price: float64

# Numbers to strings
- op: cast
  columns:
    zip_code: string
    product_id: string

# Strings to dates
- op: cast
  columns:
    order_date: date
    created_at: datetime
```
Python API¶
```python
from quicketl.config.transforms import CastTransform

transform = CastTransform(columns={"id": "string", "amount": "float64"})
```
fill_null¶
Replace null values with specified defaults.
Usage¶
Parameters¶
| Parameter | Required | Type | Description |
|---|---|---|---|
| `columns` | Yes | `dict[str, any]` | Column → replacement value |
Examples¶
```yaml
# Numeric defaults
- op: fill_null
  columns:
    amount: 0
    discount: 0.0
    quantity: 1

# String defaults
- op: fill_null
  columns:
    status: "unknown"
    category: "uncategorized"

# Mixed types
- op: fill_null
  columns:
    amount: 0
    status: "pending"
    is_active: true
```
Python API¶
```python
from quicketl.config.transforms import FillNullTransform

transform = FillNullTransform(columns={"amount": 0, "status": "unknown"})
```
dedup¶
Remove duplicate rows.
Usage¶
Parameters¶
| Parameter | Required | Default | Description |
|---|---|---|---|
| `columns` | No | All columns | Columns to consider for uniqueness |
Examples¶
```yaml
# Remove exact duplicates
- op: dedup

# Keep first per customer
- op: dedup
  columns: [customer_id]

# Unique combination
- op: dedup
  columns: [customer_id, product_id]
```
Common Pattern: Latest Record¶
```yaml
transforms:
  # Sort to get latest first
  - op: sort
    by: [updated_at]
    descending: true

  # Keep only the first (latest) row per customer
  - op: dedup
    columns: [customer_id]
```
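To build intuition for this pattern, here is a plain-Python sketch (not QuickETL code; the sample rows are made up) of what the sort-then-dedup pair does: sort descending by timestamp, then keep the first row seen per key.

```python
rows = [
    {"customer_id": 1, "status": "old", "updated_at": "2025-01-01"},
    {"customer_id": 1, "status": "new", "updated_at": "2025-03-01"},
    {"customer_id": 2, "status": "only", "updated_at": "2025-02-15"},
]

# sort: by [updated_at], descending: true
rows.sort(key=lambda r: r["updated_at"], reverse=True)

# dedup: columns [customer_id] -- keep the first row seen per key
seen, latest = set(), []
for r in rows:
    if r["customer_id"] not in seen:
        seen.add(r["customer_id"])
        latest.append(r)
```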
Python API¶
```python
from quicketl.config.transforms import DedupTransform

transform = DedupTransform(columns=["customer_id"])
```
sort¶
Order rows by one or more columns.
Usage¶
Parameters¶
| Parameter | Required | Default | Description |
|---|---|---|---|
| `by` | Yes | - | Columns to sort by |
| `descending` | No | `false` | Sort in descending order |
Examples¶
```yaml
# Ascending
- op: sort
  by: [name]

# Descending
- op: sort
  by: [amount]
  descending: true

# Multiple columns
- op: sort
  by: [category, amount]
  descending: true
```
Python API¶
```python
from quicketl.config.transforms import SortTransform

transform = SortTransform(by=["amount"], descending=True)
```
join¶
Join two datasets on one or more columns.
Usage¶
Parameters¶
| Parameter | Required | Default | Description |
|---|---|---|---|
| `right` | Yes | - | Reference to right dataset |
| `on` | Yes | - | Join key columns |
| `how` | No | `inner` | Join type: `inner`, `left`, `right`, `outer` |
Join Types¶
| Type | Description |
|---|---|
| `inner` | Only matching rows from both sides |
| `left` | All rows from left, matching from right |
| `right` | All rows from right, matching from left |
| `outer` | All rows from both sides |
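For intuition, a plain-Python sketch (not QuickETL code; the sample data is made up) of `how: left` semantics: every left row survives, and right-side columns become null where there is no match.

```python
orders = [
    {"order_id": 1, "customer_id": 10},
    {"order_id": 2, "customer_id": 99},  # no matching customer
]
customers = {10: {"name": "Acme"}}  # right side, keyed by the join column

joined = []
for o in orders:
    match = customers.get(o["customer_id"])
    # Left join: keep the left row even without a match (null on the right).
    joined.append({**o, "name": match["name"] if match else None})
```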
Examples¶
```yaml
# Left join with customers
- op: join
  right: customers
  on: [customer_id]
  how: left

# Multiple join keys
- op: join
  right: products
  on: [product_id, region]
  how: left
```
Python API¶
```python
from quicketl.config.transforms import JoinTransform

transform = JoinTransform(right="customers", on=["customer_id"], how="left")
```
aggregate¶
Group data and compute summary statistics.
Usage¶
Parameters¶
| Parameter | Required | Type | Description |
|---|---|---|---|
| `group_by` | Yes | `list[str]` | Columns to group by |
| `aggs` | Yes | `dict[str, str]` | Output column → aggregation expression |
Aggregation Functions¶
| Function | Description | Example |
|---|---|---|
| `sum(col)` | Sum of values | `sum(amount)` |
| `avg(col)` | Average (mean) | `avg(amount)` |
| `min(col)` | Minimum value | `min(amount)` |
| `max(col)` | Maximum value | `max(amount)` |
| `count(*)` | Count all rows | `count(*)` |
| `count(col)` | Count non-null values | `count(customer_id)` |
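For intuition, a plain-Python sketch (not QuickETL code; the sample rows are made up) of grouping by `region` with `sum(amount)` and `count(*)`:

```python
rows = [
    {"region": "north", "amount": 100},
    {"region": "north", "amount": 50},
    {"region": "south", "amount": 75},
]

# One output row per distinct group_by value, holding the aggregates.
groups = {}
for r in rows:
    g = groups.setdefault(r["region"], {"total_sales": 0, "order_count": 0})
    g["total_sales"] += r["amount"]  # sum(amount)
    g["order_count"] += 1            # count(*)
```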
Examples¶
```yaml
# Basic aggregation
- op: aggregate
  group_by: [category]
  aggs:
    total_sales: sum(amount)

# Multiple aggregations
- op: aggregate
  group_by: [region]
  aggs:
    total_sales: sum(amount)
    avg_order: avg(amount)
    min_order: min(amount)
    max_order: max(amount)
    order_count: count(*)

# Multiple group columns
- op: aggregate
  group_by: [region, category, year]
  aggs:
    total: sum(amount)
```
Python API¶
```python
from quicketl.config.transforms import AggregateTransform

transform = AggregateTransform(
    group_by=["region"],
    aggs={"total_sales": "sum(amount)", "order_count": "count(*)"},
)
```
union¶
Vertically combine multiple datasets.
Usage¶
Parameters¶
| Parameter | Required | Type | Description |
|---|---|---|---|
| `sources` | Yes | `list[str]` | References to datasets to combine |
Examples¶
```yaml
# Combine two datasets
- op: union
  sources: [north_sales, south_sales]

# Combine multiple
- op: union
  sources: [q1_data, q2_data, q3_data, q4_data]
```
**Schema Requirement:** All datasets must have the same columns (names and types).
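For intuition, a plain-Python sketch (not QuickETL code; the sample data is made up) of the vertical concatenation and the schema check it implies:

```python
north = [{"id": 1, "amount": 100}]
south = [{"id": 2, "amount": 75}]

# Schema requirement: every source row must carry the same column names.
column_sets = {frozenset(r) for part in (north, south) for r in part}
assert len(column_sets) == 1, "union requires identical columns in every source"

combined = north + south  # vertical concatenation
```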
Python API¶

```python
from quicketl.config.transforms import UnionTransform

transform = UnionTransform(sources=["north_sales", "south_sales"])
```
limit¶
Limit output to the first N rows.
Usage¶
Parameters¶
| Parameter | Required | Type | Description |
|---|---|---|---|
| `n` | Yes | `int` | Maximum number of rows (must be > 0) |
Examples¶
```yaml
# Basic limit
- op: limit
  n: 100
```

```yaml
# Top N pattern (with sort)
transforms:
  - op: sort
    by: [sales]
    descending: true
  - op: limit
    n: 10
```

```yaml
# Sample for development
- op: limit
  n: ${SAMPLE_SIZE:-10000}
```
Python API¶

```python
from quicketl.config.transforms import LimitTransform

transform = LimitTransform(n=1000)
```
Best Practices¶
Filter Early¶
Apply filters as early as possible to reduce data volume:
```yaml
transforms:
  - op: filter
    predicate: date >= '2025-01-01'  # First: reduce rows
  - op: derive_column
    name: metric
    expr: expensive_calculation     # Then: compute on the smaller dataset
```
Select Before Aggregate¶
Remove unnecessary columns before aggregation:
```yaml
transforms:
  - op: select
    columns: [category, amount]  # Remove unused columns
  - op: aggregate
    group_by: [category]
    aggs:
      total: sum(amount)
```
Derive Before Aggregate¶
Create columns needed for aggregation:
```yaml
transforms:
  - op: derive_column
    name: net_amount
    expr: amount - discount
  - op: aggregate
    group_by: [region]
    aggs:
      total_net: sum(net_amount)
```
Related¶
- Expression Language - Full expression syntax
- Data Types - Type reference
- Quality Checks - Validate transformed data