Skip to content

Multi-Source Join Example

This example demonstrates how to combine data from multiple sources using joins, creating an enriched dataset from orders, customers, and products.

Overview

Goal: Create an enriched order report combining:

  • Order data (transactions)
  • Customer data (demographics)
  • Product data (catalog information)

Output: A denormalized table with complete order information.

Sample Data

Orders (data/orders.csv)

order_id,customer_id,product_id,quantity,order_date,status
1001,C001,P001,2,2025-01-15,completed
1002,C002,P003,1,2025-01-15,completed
1003,C001,P002,3,2025-01-16,completed
1004,C003,P001,1,2025-01-16,pending
1005,C002,P004,2,2025-01-17,completed
1006,C004,P002,1,2025-01-17,completed
1007,C001,P003,2,2025-01-18,cancelled
1008,C005,P001,4,2025-01-18,completed

Customers (data/customers.csv)

customer_id,name,email,region,signup_date
C001,Alice Smith,alice@example.com,North,2024-06-15
C002,Bob Johnson,bob@example.com,South,2024-08-22
C003,Carol White,carol@example.com,East,2024-09-10
C004,David Brown,david@example.com,West,2024-11-05
C005,Eve Davis,eve@example.com,North,2025-01-02

Products (data/products.csv)

product_id,name,category,unit_price,cost
P001,Widget A,Electronics,29.99,15.00
P002,Gadget B,Electronics,49.99,25.00
P003,Service C,Services,99.99,40.00
P004,Widget D,Electronics,19.99,10.00
P005,Tool E,Hardware,39.99,20.00

Pipeline Configuration

Create pipelines/order_report.yml:

name: enriched_order_report
description: Join orders with customer and product data
engine: duckdb

# Start with orders
source:
  type: file
  path: data/orders.csv
  format: csv

transforms:
  # Filter to completed orders only
  - op: filter
    predicate: status = 'completed'

  # Join with customers
  - op: join
    right:
      type: file
      path: data/customers.csv
      format: csv
    on: [customer_id]
    how: left

  # Join with products
  - op: join
    right:
      type: file
      path: data/products.csv
      format: csv
    on: [product_id]
    how: left

  # Calculate derived fields
  - op: derive_column
    name: line_total
    expr: quantity * unit_price

  - op: derive_column
    name: profit
    expr: quantity * (unit_price - cost)

  - op: derive_column
    name: profit_margin
    expr: (unit_price - cost) / unit_price

  # Select and rename final columns
  - op: select
    columns:
      - order_id
      - order_date
      - customer_id
      - name        # customer name from join
      - email
      - region
      - product_id
      - name        # product name - will need rename
      - category
      - quantity
      - unit_price
      - line_total
      - profit
      - profit_margin

  # Rename duplicate 'name' columns
  - op: rename
    columns:
      name: product_name
      name_1: customer_name

  # Sort by order date
  - op: sort
    by:
      - column: order_date
        order: desc
      - column: order_id
        order: asc

# Quality checks
checks:
  - check: not_null
    columns: [order_id, customer_id, product_id, line_total]

  - check: expression
    expr: line_total > 0

  - check: expression
    expr: profit_margin >= 0 AND profit_margin <= 1

# Write enriched data
sink:
  type: file
  path: output/order_report.parquet
  format: parquet

Alternative: Explicit Column Selection

To avoid column name conflicts, use explicit selection:

transforms:
  # Join with customers
  - op: join
    right:
      type: file
      path: data/customers.csv
      format: csv
    on: [customer_id]
    how: left

  # Rename immediately after join
  - op: rename
    columns:
      name: customer_name

  # Join with products
  - op: join
    right:
      type: file
      path: data/products.csv
      format: csv
    on: [product_id]
    how: left

  # Rename product name
  - op: rename
    columns:
      name: product_name

Running the Pipeline

# Validate
quicketl validate pipelines/order_report.yml

# Execute
quicketl run pipelines/order_report.yml

Expected Output

order_id order_date customer_name region product_name category quantity line_total profit profit_margin
1008 2025-01-18 Eve Davis North Widget A Electronics 4 119.96 59.96 0.50
1006 2025-01-17 David Brown West Gadget B Electronics 1 49.99 24.99 0.50
1005 2025-01-17 Bob Johnson South Widget D Electronics 2 39.98 19.98 0.50
1003 2025-01-16 Alice Smith North Gadget B Electronics 3 149.97 74.97 0.50
1002 2025-01-15 Bob Johnson South Service C Services 1 99.99 59.99 0.60
1001 2025-01-15 Alice Smith North Widget A Electronics 2 59.98 29.98 0.50

Join Types Explained

Inner Join

Only matching rows from both tables:

- op: join
  right: ...
  on: [customer_id]
  how: inner  # Only orders with matching customers

Left Join

All rows from left table, matching from right:

- op: join
  right: ...
  on: [customer_id]
  how: left  # All orders, customer data where available

Right Join

All rows from right table, matching from left:

- op: join
  right: ...
  on: [customer_id]
  how: right  # All customers, their orders where available

Outer Join

All rows from both tables:

- op: join
  right: ...
  on: [customer_id]
  how: outer  # All orders and all customers

Multiple Join Keys

For composite keys:

- op: join
  right:
    type: file
    path: data/inventory.csv
    format: csv
  on: [product_id, warehouse_id]  # Multiple columns
  how: inner

Handling Missing Data

After a left join, the right side may have NULLs:

transforms:
  - op: join
    right: ...
    on: [customer_id]
    how: left

  # Fill NULL values from failed joins
  - op: fill_null
    columns:
      customer_name: Unknown Customer
      region: Unknown

Performance Tips

1. Filter Before Joining

Reduce data volume before expensive joins:

transforms:
  # Filter first
  - op: filter
    predicate: status = 'completed' AND order_date >= '2025-01-01'

  # Then join (fewer rows to process)
  - op: join
    right: ...

2. Join Smaller Tables

Put larger table on the left, smaller on right:

# Good: Large orders table on left
source:
  type: file
  path: data/orders.csv  # 1M rows

transforms:
  - op: join
    right:
      type: file
      path: data/products.csv  # 1K rows (smaller)

3. Select Only Needed Columns

Reduce memory by selecting early:

transforms:
  - op: join
    right:
      type: file
      path: data/customers.csv
      format: csv
    on: [customer_id]
    how: left

  # Select immediately after join
  - op: select
    columns: [order_id, customer_id, name, region]

Next Steps