Distributed Backend: Apache Spark¶
Apache Spark is a distributed computing framework for big data processing. Use Spark for datasets that exceed single-machine capacity.
Installation¶
Java Required
Spark requires Java 8, 11, or 17. Verify with java -version.
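A typical installation, assuming the project is distributed on PyPI with Spark support as an optional extra (the extra name below is an assumption, not confirmed on this page):

pip install 'quicketl[spark]'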
When to Use Spark¶
Ideal for:
- Datasets that don't fit on a single machine
- Distributed cluster environments (YARN, Kubernetes, EMR)
- Integration with Hadoop ecosystem
- Production data lake pipelines
Consider alternatives when:
- Data fits in memory (use DuckDB or Polars - much faster)
- Low latency is critical (Spark has 5-30s startup overhead)
- Simple transformations (Spark is overkill)
Configuration¶
Basic Usage¶
name: spark_pipeline
engine: spark

source:
  type: file
  path: s3://bucket/data/*.parquet
  format: parquet

transforms:
  - op: filter
    predicate: date >= '2025-01-01'
  - op: aggregate
    group_by: [region, category]
    aggs:
      revenue: sum(amount)

sink:
  type: file
  path: s3://bucket/output/
  format: parquet
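Run the pipeline with the Spark engine selected on the command line, just as in the deployment examples below:

quicketl run pipeline.yml --engine spark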
Spark Configuration¶
Configure via environment variables:
export SPARK_MASTER=spark://master:7077
export SPARK_EXECUTOR_MEMORY=4g
export SPARK_EXECUTOR_CORES=2
export SPARK_DRIVER_MEMORY=4g
Deployment Modes¶
Local Mode (Development)¶
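A minimal sketch for development, assuming SPARK_MASTER accepts standard Spark master URLs; local[*] runs Spark in a single JVM using all available cores:

export SPARK_MASTER=local[*]
quicketl run pipeline.yml --engine spark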
Standalone Cluster¶
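Point SPARK_MASTER at the standalone master; the host and resource values below mirror the configuration example above and should be adjusted for your cluster:

export SPARK_MASTER=spark://master:7077
export SPARK_EXECUTOR_MEMORY=4g
export SPARK_EXECUTOR_CORES=2
quicketl run pipeline.yml --engine spark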
YARN (Hadoop)¶
export SPARK_MASTER=yarn
export HADOOP_CONF_DIR=/etc/hadoop/conf
quicketl run pipeline.yml --engine spark
Kubernetes¶
export SPARK_MASTER=k8s://https://kubernetes:443
export SPARK_KUBERNETES_CONTAINER_IMAGE=spark:3.5.0
quicketl run pipeline.yml --engine spark
Cloud Integration¶
AWS EMR¶
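EMR clusters run Spark on YARN, so submitting from the EMR primary node follows the YARN example above; the Hadoop configuration path is the EMR default and may differ by release:

export SPARK_MASTER=yarn
export HADOOP_CONF_DIR=/etc/hadoop/conf
quicketl run pipeline.yml --engine spark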
Databricks¶
export SPARK_MASTER=databricks
export DATABRICKS_HOST=https://xxx.cloud.databricks.com
export DATABRICKS_TOKEN=dapi...
Google Dataproc¶
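Dataproc also runs Spark on YARN; a minimal sketch, assuming the pipeline is submitted from the cluster's master node:

export SPARK_MASTER=yarn
quicketl run pipeline.yml --engine spark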
Data Sources¶
Spark excels at reading from distributed storage:
# S3
source:
  type: file
  path: s3a://bucket/data/*.parquet

# HDFS
source:
  type: file
  path: hdfs://namenode/data/*.parquet

# Delta Lake
source:
  type: file
  path: s3://bucket/delta-table/
  format: delta
Performance Optimization¶
1. Partition Data¶
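Write output partitioned by the columns downstream queries filter on, so readers can skip entire directories. A sketch using the partition_by sink option, the same option used in the full example below:

sink:
  type: file
  path: s3://bucket/output/
  format: parquet
  options:
    partition_by: [date, region]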
2. Filter Early¶
Apply filter predicates as early as possible so less data is shuffled into joins and aggregations:
transforms:
  - op: filter
    predicate: date >= '2025-01-01'  # Filter first
  - op: aggregate
    group_by: [category]
    aggs:
      total: sum(amount)
3. Use Appropriate File Formats¶
- Parquet: Best for analytics (columnar, compressed)
- Delta: ACID transactions, time travel
- ORC: Hive compatibility
Example: Large-Scale ETL¶
name: daily_sales_etl
description: Process daily sales across all regions
engine: spark

source:
  type: file
  path: s3://data-lake/raw/sales/date=${DATE}/*.parquet
  format: parquet

transforms:
  - op: filter
    predicate: amount > 0 AND status = 'completed'
  - op: join
    right:
      type: file
      path: s3://data-lake/dim/products/
      format: parquet
    on: [product_id]
    how: left
  - op: aggregate
    group_by: [region, category, date]
    aggs:
      total_revenue: sum(amount)
      order_count: count(*)

sink:
  type: file
  path: s3://data-lake/processed/sales_summary/
  format: parquet
  options:
    partition_by: [date, region]
    mode: overwrite
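A daily run could look like the following, assuming ${DATE} in the source path is substituted from the environment (how the placeholder is resolved is an assumption, not confirmed on this page):

DATE=2025-01-15 quicketl run daily_sales_etl.yml --engine spark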
Limitations¶
- Startup Overhead: 5-30 seconds for session initialization
- Small Data: Inefficient for datasets under 1GB
- Complexity: Requires cluster management
- Cost: Cloud clusters can be expensive
Troubleshooting¶
Java Not Found¶
# macOS
brew install openjdk@17
export JAVA_HOME=/opt/homebrew/opt/openjdk@17
# Ubuntu
sudo apt install openjdk-17-jdk
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
Out of Memory¶
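Increase executor and driver memory through the same environment variables shown under Spark Configuration, then rerun the pipeline:

export SPARK_EXECUTOR_MEMORY=8g
export SPARK_DRIVER_MEMORY=8g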
S3 Access Denied¶
Ensure AWS credentials are configured:
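A minimal sketch using the standard AWS environment variables; a shared credentials file (aws configure) or an attached IAM role works as well:

export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=us-east-1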
Related¶
- Local Backends - For smaller datasets
- Cloud Warehouses - Serverless alternatives