# Cloud Data Warehouses
Cloud data warehouses provide serverless, scalable analytics. QuickETL pushes transformations directly to the warehouse for efficient in-warehouse processing.
## Quick Comparison
| Feature | Snowflake | BigQuery |
|---|---|---|
| Provider | Multi-cloud | Google Cloud |
| Pricing | Compute + Storage | Per TB scanned |
| Scaling | Manual warehouse sizing | Automatic |
| Best For | Enterprise, multi-cloud | GCP-native, pay-per-query |
## Snowflake

Snowflake is a cloud-native data warehouse that separates compute and storage.
### Installation
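The Snowflake connector is assumed here to ship as an optional dependency. If QuickETL exposes it as a pip extra named `snowflake` (an assumption; check the package's published extras), installation would look roughly like:

```bash
# Hypothetical extra name -- verify against QuickETL's published extras
pip install "quicketl[snowflake]"
```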
### Configuration
Set environment variables:
```bash
export SNOWFLAKE_ACCOUNT=xy12345.us-east-1
export SNOWFLAKE_USER=quicketl_user
export SNOWFLAKE_PASSWORD=your_password
export SNOWFLAKE_DATABASE=analytics
export SNOWFLAKE_SCHEMA=public
export SNOWFLAKE_WAREHOUSE=compute_wh
export SNOWFLAKE_ROLE=analyst_role
```
### Basic Pipeline
```yaml
name: snowflake_etl
engine: snowflake

source:
  type: database
  connection: snowflake
  table: raw_sales

transforms:
  - op: filter
    predicate: sale_date >= '2025-01-01'
  - op: aggregate
    group_by: [region, product_category]
    aggs:
      total_revenue: sum(amount)
      order_count: count(*)

sink:
  type: database
  connection: snowflake
  table: sales_summary
  mode: replace
```
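Save the config (for example as `snowflake_etl.yaml`) and execute it with QuickETL's runner. The exact invocation is not shown on this page; a `quicketl run`-style command is assumed below purely for illustration:

```bash
# Hypothetical invocation -- substitute QuickETL's actual CLI or Python entry point
quicketl run snowflake_etl.yaml
```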
### Write Modes
```yaml
# Replace (TRUNCATE + INSERT)
sink:
  mode: replace

# Append (INSERT only)
sink:
  mode: append

# Merge (UPSERT)
sink:
  mode: merge
  merge_keys: [id]
```
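Conceptually, `mode: merge` maps to a warehouse-side upsert keyed on `merge_keys`. As a rough illustration only (the target, staging, and column names below are made up, not QuickETL's actual generated SQL), a Snowflake `MERGE` for `merge_keys: [id]` looks like:

```sql
-- Illustrative sketch: staging table and column list are hypothetical
MERGE INTO target_table AS t
USING staged_batch AS s
  ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET amount = s.amount, updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, amount, updated_at) VALUES (s.id, s.amount, s.updated_at);
```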
### Key-Pair Authentication

For production, use key-pair authentication instead of a password:
```bash
export SNOWFLAKE_PRIVATE_KEY_PATH=/path/to/rsa_key.p8
export SNOWFLAKE_PRIVATE_KEY_PASSPHRASE=your_passphrase
```
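If you do not already have a key pair, the standard Snowflake procedure is to generate an encrypted PKCS#8 private key and register the matching public key on the user. A minimal sketch:

```bash
# Generate an encrypted PKCS#8 private key (you are prompted for a passphrase)
openssl genrsa 2048 | openssl pkcs8 -topk8 -v2 des3 -inform PEM -out rsa_key.p8

# Derive the public key, then register it in Snowflake, e.g.:
#   ALTER USER quicketl_user SET RSA_PUBLIC_KEY='<key body without header/footer>';
openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub
```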
### Cost Optimization
- Use appropriate warehouse size (XS: 1 credit/hr, L: 8 credits/hr)
- Filter early to reduce compute
- Use clustering keys for large tables
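For the last point, a clustering key on the columns the pipeline filters and groups by limits how many micro-partitions Snowflake scans. A sketch, assuming `raw_sales` is large and `sale_date`/`region` match your query patterns:

```sql
-- Cluster the large source table on the filter/grouping columns used by the pipeline
ALTER TABLE raw_sales CLUSTER BY (sale_date, region);
```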
### Troubleshooting
**Connection Failed**: verify the account identifier uses the `account.region` format (e.g. `xy12345.us-east-1`).

**Warehouse Suspended**: the warehouse auto-resumes on the next query, or run `ALTER WAREHOUSE wh RESUME`.

**Permission Denied**: grant the pipeline role the required privileges:

```sql
GRANT SELECT ON TABLE raw.sales TO ROLE quicketl_role;
GRANT INSERT ON TABLE analytics.summary TO ROLE quicketl_role;
```
## BigQuery
Google BigQuery is a serverless data warehouse with a pay-per-query model.
### Installation
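As with Snowflake, the BigQuery connector is assumed here to be an optional extra; the exact name may differ:

```bash
# Hypothetical extra name -- verify against QuickETL's published extras
pip install "quicketl[bigquery]"
```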
### Configuration
```bash
# Service Account (recommended)
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
export BIGQUERY_PROJECT=my-project-id

# Or Application Default Credentials
gcloud auth application-default login
```
### Basic Pipeline
```yaml
name: bigquery_etl
engine: bigquery

source:
  type: database
  connection: bigquery
  table: raw_data.sales

transforms:
  - op: filter
    predicate: transaction_date >= '2025-01-01'
  - op: aggregate
    group_by: [region, product_type]
    aggs:
      total_revenue: sum(amount)

sink:
  type: database
  connection: bigquery
  table: analytics.sales_summary
  mode: replace
```
### Write Modes
```yaml
# Replace (WRITE_TRUNCATE)
sink:
  mode: replace

# Append (WRITE_APPEND)
sink:
  mode: append

# Partitioned tables
sink:
  options:
    partition_field: date
    partition_type: DAY
```
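To confirm the sink table actually ended up partitioned, and to see how rows are distributed across partitions, you can query BigQuery's `INFORMATION_SCHEMA.PARTITIONS` view; the dataset and table names below follow the pipeline example:

```sql
-- Inspect partitions of the sink table written by the pipeline above
SELECT partition_id, total_rows, total_logical_bytes
FROM `analytics.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'sales_summary'
ORDER BY partition_id;
```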
### Cost Optimization
BigQuery charges per TB scanned, so:

- Use partitioned tables
- Select only needed columns (don't use `SELECT *`)
- Use clustering
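Because billing is based on bytes scanned, putting a filter on the partition column as the first transform lets BigQuery prune partitions before the rest of the pipeline runs. Reusing the filter op from the pipeline above (and assuming `raw_data.sales` is partitioned on `transaction_date`):

```yaml
transforms:
  # Filtering on the partition column first enables partition pruning,
  # so only the matching partitions are scanned and billed.
  - op: filter
    predicate: transaction_date >= '2025-01-01'
```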
### Required Permissions
The service account needs these IAM roles:
- `roles/bigquery.dataViewer` - Read tables
- `roles/bigquery.dataEditor` - Write tables
- `roles/bigquery.jobUser` - Run queries
```bash
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:quicketl@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataEditor"
```
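The command above binds only `roles/bigquery.dataEditor`; the read and job-execution roles from the list are granted the same way:

```bash
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:quicketl@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataViewer"

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:quicketl@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/bigquery.jobUser"
```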
### Troubleshooting
**Authentication Failed**: set `GOOGLE_APPLICATION_CREDENTIALS` or run `gcloud auth application-default login`.

**Dataset Not Found**: verify the dataset exists with `bq ls project:dataset`.

**Bytes Billed Too High**: always filter by the partition column and select only the columns you need (a dry-run estimate is shown below).
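To estimate cost before running a query, `bq` supports a dry run that reports the bytes that would be processed without executing anything; the query below is just an example against the pipeline's source table:

```bash
# Reports bytes that would be scanned; nothing is executed, so nothing is billed
bq query --dry_run --use_legacy_sql=false \
  'SELECT region, SUM(amount) FROM raw_data.sales WHERE transaction_date >= "2025-01-01" GROUP BY region'
```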
## Choosing Between Snowflake and BigQuery
| Scenario | Recommendation |
|---|---|
| Google Cloud data | BigQuery |
| Multi-cloud / AWS / Azure | Snowflake |
| Predictable compute costs | Snowflake |
| Pay-per-query model | BigQuery |
| Enterprise features | Snowflake |
| GCP ecosystem integration | BigQuery |
## Related
- Local Backends - For development
- Distributed Backends - For Spark
- Databases - PostgreSQL, MySQL, ClickHouse