Transform Operations

This page documents all 12 transform operations available in QuickETL.

Quick Reference

| Transform | Purpose | Example |
|---|---|---|
| select | Choose and reorder columns | columns: [id, name, amount] |
| rename | Rename columns | mapping: {old: new} |
| filter | Filter rows | predicate: amount > 100 |
| derive_column | Add computed columns | expr: amount * 1.1 |
| cast | Convert types | columns: {id: string} |
| fill_null | Replace nulls | columns: {status: "unknown"} |
| dedup | Remove duplicates | columns: [customer_id] |
| sort | Order rows | by: [amount] |
| join | Join datasets | on: [customer_id] |
| aggregate | Group and summarize | aggs: {total: sum(amount)} |
| union | Combine datasets | sources: [data1, data2] |
| limit | Limit rows | n: 1000 |
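
Transforms are written as an ordered list under the transforms key and run top to bottom. A short illustrative chain combining several of the operations above (column names like amount, discount, and region are placeholders):

transforms:
  - op: filter
    predicate: status = 'active'
  - op: derive_column
    name: net_amount
    expr: amount - discount
  - op: aggregate
    group_by: [region]
    aggs:
      total_net: sum(net_amount)
  - op: sort
    by: [total_net]
    descending: true
  - op: limit
    n: 10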

select

Choose and reorder columns in your data.

Usage

- op: select
  columns: [id, name, amount]

Parameters

| Parameter | Required | Type | Description |
|---|---|---|---|
| columns | Yes | list[str] | Columns to keep, in order |

Examples

# Keep specific columns
- op: select
  columns: [id, name, amount]

# Reorder columns
- op: select
  columns: [amount, id, name]

# Select after transforms for final output
- op: select
  columns: [region, total_sales, order_count]

Python API

from quicketl.config.transforms import SelectTransform
transform = SelectTransform(columns=["id", "name", "amount"])

rename

Rename columns using a mapping.

Usage

- op: rename
  mapping:
    old_name: new_name

Parameters

| Parameter | Required | Type | Description |
|---|---|---|---|
| mapping | Yes | dict[str, str] | Old name → new name mapping |

Examples

# Single column
- op: rename
  mapping:
    cust_id: customer_id

# Multiple columns
- op: rename
  mapping:
    cust_id: customer_id
    order_amt: amount
    created: created_at

Python API

from quicketl.config.transforms import RenameTransform
transform = RenameTransform(mapping={"cust_id": "customer_id", "order_amt": "amount"})

filter

Keep only the rows that match a SQL-like predicate.

Usage

- op: filter
  predicate: amount > 100

Parameters

| Parameter | Required | Type | Description |
|---|---|---|---|
| predicate | Yes | str | SQL-like boolean expression |

Examples

# Simple comparison
- op: filter
  predicate: amount > 100

# Multiple conditions
- op: filter
  predicate: amount > 100 AND status = 'active'

# OR conditions
- op: filter
  predicate: region = 'north' OR region = 'south'

# Null handling
- op: filter
  predicate: email IS NOT NULL

# Date filtering with variables
- op: filter
  predicate: created_at >= '${START_DATE}' AND created_at < '${END_DATE}'

# IN operator
- op: filter
  predicate: category IN ('Electronics', 'Home', 'Office')

# Pattern matching
- op: filter
  predicate: name LIKE 'Widget%'

Operators

| Operator | Description | Example |
|---|---|---|
| =, !=, <> | Equality | status = 'active' |
| >, <, >=, <= | Comparison | amount > 100 |
| AND, OR, NOT | Logical | a > 1 AND b < 10 |
| IS NULL, IS NOT NULL | Null check | email IS NOT NULL |
| IN, NOT IN | List membership | region IN ('north', 'south') |
| BETWEEN | Range | amount BETWEEN 100 AND 500 |
| LIKE | Pattern match | name LIKE 'Widget%' |
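
Operators combine freely in one predicate. BETWEEN and NOT IN, which do not appear in the examples above, work the same way ('test' and 'internal' are placeholder values):

- op: filter
  predicate: amount BETWEEN 100 AND 500 AND region NOT IN ('test', 'internal')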

Python API

from quicketl.config.transforms import FilterTransform
transform = FilterTransform(predicate="amount > 100 AND status = 'active'")

derive_column

Create a new computed column from an expression.

Usage

- op: derive_column
  name: total_with_tax
  expr: amount * 1.1

Parameters

| Parameter | Required | Type | Description |
|---|---|---|---|
| name | Yes | str | Name for the new column |
| expr | Yes | str | SQL-like expression |

Examples

# Arithmetic
- op: derive_column
  name: total_with_tax
  expr: amount * 1.1

- op: derive_column
  name: profit_margin
  expr: (revenue - cost) / revenue * 100

# String operations
- op: derive_column
  name: full_name
  expr: concat(first_name, ' ', last_name)

# Date extraction
- op: derive_column
  name: year
  expr: extract(year from created_at)

# Conditional logic
- op: derive_column
  name: size_category
  expr: |
    CASE
      WHEN amount < 100 THEN 'small'
      WHEN amount < 1000 THEN 'medium'
      ELSE 'large'
    END

# Null handling
- op: derive_column
  name: discount_safe
  expr: COALESCE(discount, 0)

Functions

| Category | Functions |
|---|---|
| String | upper(), lower(), trim(), concat(), substring(), length() |
| Math | abs(), round(), floor(), ceil() |
| Date | extract(), date_trunc() |
| Null | COALESCE(), NULLIF() |
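
These functions can be nested, as in SQL (a brief sketch, assuming nesting behaves as in the SQL dialects these functions come from):

- op: derive_column
  name: normalized_name
  expr: upper(trim(name))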

Python API

from quicketl.config.transforms import DeriveColumnTransform
transform = DeriveColumnTransform(name="total_with_tax", expr="amount * 1.1")

cast

Convert column data types.

Usage

- op: cast
  columns:
    id: string
    amount: float64

Parameters

| Parameter | Required | Type | Description |
|---|---|---|---|
| columns | Yes | dict[str, str] | Column → target type mapping |

Supported Types

| Type | Aliases | Description |
|---|---|---|
| string | str | Text/string |
| int64 | int, integer | 64-bit integer |
| float64 | float, double | 64-bit float |
| boolean | bool | True/False |
| date | | Date (no time) |
| datetime | timestamp | Date with time |

Examples

# String to numbers
- op: cast
  columns:
    quantity: int64
    price: float64

# Numbers to strings
- op: cast
  columns:
    zip_code: string
    product_id: string

# String to date
- op: cast
  columns:
    order_date: date
    created_at: datetime
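
The aliases from the Supported Types table are interchangeable with the canonical names:

# Same as int64 / float64, using aliases
- op: cast
  columns:
    quantity: int
    price: float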

Python API

from quicketl.config.transforms import CastTransform
transform = CastTransform(columns={"id": "string", "amount": "float64"})

fill_null

Replace null values with specified defaults.

Usage

- op: fill_null
  columns:
    amount: 0
    status: "unknown"

Parameters

| Parameter | Required | Type | Description |
|---|---|---|---|
| columns | Yes | dict[str, any] | Column → replacement value |

Examples

# Numeric defaults
- op: fill_null
  columns:
    amount: 0
    discount: 0.0
    quantity: 1

# String defaults
- op: fill_null
  columns:
    status: "unknown"
    category: "uncategorized"

# Mixed types
- op: fill_null
  columns:
    amount: 0
    status: "pending"
    is_active: true

Python API

from quicketl.config.transforms import FillNullTransform
transform = FillNullTransform(columns={"amount": 0, "status": "unknown"})

dedup

Remove duplicate rows.

Usage

- op: dedup
  columns: [customer_id]

Parameters

| Parameter | Required | Default | Description |
|---|---|---|---|
| columns | No | All columns | Columns to consider for uniqueness |

Examples

# Remove exact duplicates
- op: dedup

# Keep first per customer
- op: dedup
  columns: [customer_id]

# Unique combination
- op: dedup
  columns: [customer_id, product_id]

Common Pattern: Latest Record

transforms:
  # Sort to get latest first
  - op: sort
    by: [updated_at]
    descending: true

  # Keep only first (latest) per customer
  - op: dedup
    columns: [customer_id]
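
The same pattern using the Python transform classes documented on this page (a sketch; how the list is attached to a pipeline depends on your setup):

from quicketl.config.transforms import DedupTransform, SortTransform

# Sort latest-first, then keep the first row per customer
transforms = [
    SortTransform(by=["updated_at"], descending=True),
    DedupTransform(columns=["customer_id"]),
]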

Python API

from quicketl.config.transforms import DedupTransform
transform = DedupTransform(columns=["customer_id"])

sort

Order rows by one or more columns.

Usage

- op: sort
  by: [amount]
  descending: true

Parameters

| Parameter | Required | Default | Description |
|---|---|---|---|
| by | Yes | - | Columns to sort by |
| descending | No | false | Sort in descending order |

Examples

# Ascending
- op: sort
  by: [name]

# Descending
- op: sort
  by: [amount]
  descending: true

# Multiple columns
- op: sort
  by: [category, amount]
  descending: true

Python API

from quicketl.config.transforms import SortTransform
transform = SortTransform(by=["amount"], descending=True)

join

Join two datasets on one or more columns.

Usage

- op: join
  right: customers
  on: [customer_id]
  how: left

Parameters

| Parameter | Required | Default | Description |
|---|---|---|---|
| right | Yes | - | Reference to right dataset |
| on | Yes | - | Join key columns |
| how | No | inner | Join type: inner, left, right, outer |

Join Types

| Type | Description |
|---|---|
| inner | Only matching rows from both sides |
| left | All rows from left, matching from right |
| right | All rows from right, matching from left |
| outer | All rows from both sides |

Examples

# Left join with customers
- op: join
  right: customers
  on: [customer_id]
  how: left

# Multiple join keys
- op: join
  right: products
  on: [product_id, region]
  how: left
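
A left join leaves nulls in right-side columns where no match was found, so a fill_null step often follows (customer_name is a placeholder column from the right dataset):

# Left join, then default the unmatched rows
- op: join
  right: customers
  on: [customer_id]
  how: left
- op: fill_null
  columns:
    customer_name: "unknown"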

Python API

from quicketl.config.transforms import JoinTransform
transform = JoinTransform(right="customers", on=["customer_id"], how="left")

aggregate

Group data and compute summary statistics.

Usage

- op: aggregate
  group_by: [region]
  aggs:
    total_sales: sum(amount)
    order_count: count(*)

Parameters

| Parameter | Required | Type | Description |
|---|---|---|---|
| group_by | Yes | list[str] | Columns to group by |
| aggs | Yes | dict[str, str] | Output column → aggregation expression |

Aggregation Functions

| Function | Description | Example |
|---|---|---|
| sum(col) | Sum of values | sum(amount) |
| avg(col) | Average (mean) | avg(amount) |
| min(col) | Minimum value | min(amount) |
| max(col) | Maximum value | max(amount) |
| count(*) | Count all rows | count(*) |
| count(col) | Count non-null | count(customer_id) |

Examples

# Basic aggregation
- op: aggregate
  group_by: [category]
  aggs:
    total_sales: sum(amount)

# Multiple aggregations
- op: aggregate
  group_by: [region]
  aggs:
    total_sales: sum(amount)
    avg_order: avg(amount)
    min_order: min(amount)
    max_order: max(amount)
    order_count: count(*)

# Multiple group columns
- op: aggregate
  group_by: [region, category, year]
  aggs:
    total: sum(amount)
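
Note the difference between the two count forms: count(*) counts every row in the group, while count(col) skips nulls, so the two can differ within one aggregation:

# Orders per region, with and without a known customer
- op: aggregate
  group_by: [region]
  aggs:
    all_orders: count(*)
    orders_with_customer: count(customer_id)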

Python API

from quicketl.config.transforms import AggregateTransform
transform = AggregateTransform(
    group_by=["region"],
    aggs={"total_sales": "sum(amount)", "order_count": "count(*)"}
)

union

Vertically combine multiple datasets.

Usage

- op: union
  sources: [data1, data2]

Parameters

| Parameter | Required | Type | Description |
|---|---|---|---|
| sources | Yes | list[str] | References to datasets to combine |

Examples

# Combine two datasets
- op: union
  sources: [north_sales, south_sales]

# Combine multiple
- op: union
  sources: [q1_data, q2_data, q3_data, q4_data]

Schema Requirement

All datasets must have the same columns (names and types).
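
If the inputs differ, align them before the union, for example with select (same columns) and cast (same types). A sketch of the alignment steps, assuming each source is prepared by its own transform chain before being referenced here:

# Applied to each source before the union
- op: select
  columns: [order_id, amount, region]
- op: cast
  columns:
    amount: float64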

Python API

# Use engine directly for union
combined = engine.union([north, south])

limit

Limit output to the first N rows.

Usage

- op: limit
  n: 1000

Parameters

| Parameter | Required | Type | Description |
|---|---|---|---|
| n | Yes | int | Maximum number of rows (must be > 0) |

Examples

# Basic limit
- op: limit
  n: 100

# Top N pattern (with sort)
transforms:
  - op: sort
    by: [sales]
    descending: true
  - op: limit
    n: 10

# Sample for development
- op: limit
  n: ${SAMPLE_SIZE:-10000}

Python API

from quicketl.config.transforms import LimitTransform
transform = LimitTransform(n=1000)

Best Practices

Filter Early

Apply filters as early as possible to reduce data volume:

transforms:
  - op: filter
    predicate: date >= '2025-01-01'  # First: reduce rows
  - op: derive_column
    name: metric
    expr: expensive_calculation      # Then: compute on smaller dataset

Select Before Aggregate

Remove unnecessary columns before aggregation:

transforms:
  - op: select
    columns: [category, amount]      # Remove unused columns
  - op: aggregate
    group_by: [category]
    aggs:
      total: sum(amount)

Derive Before Aggregate

Create columns needed for aggregation:

transforms:
  - op: derive_column
    name: net_amount
    expr: amount - discount
  - op: aggregate
    group_by: [region]
    aggs:
      total_net: sum(net_amount)
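
The three practices combined, using the Python transform classes documented above (a sketch; executing the list depends on your pipeline setup):

from quicketl.config.transforms import (
    AggregateTransform,
    DeriveColumnTransform,
    FilterTransform,
    SelectTransform,
)

# Filter early, drop unused columns, derive, then aggregate
transforms = [
    FilterTransform(predicate="date >= '2025-01-01'"),
    SelectTransform(columns=["region", "amount", "discount"]),
    DeriveColumnTransform(name="net_amount", expr="amount - discount"),
    AggregateTransform(group_by=["region"], aggs={"total_net": "sum(net_amount)"}),
]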