Cloud Storage¶
Read files from and write files to AWS S3, Google Cloud Storage, and Azure ADLS.
Overview¶
QuickETL supports cloud storage through fsspec, which provides a unified interface across all three providers.
AWS S3¶
Installation¶
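S3 access goes through fsspec's s3fs backend. A minimal sketch; the `quicketl[s3]` extra is an assumption, so install s3fs directly if your version does not ship it:

pip install "quicketl[s3]"
# or install the fsspec S3 backend directly
pip install s3fs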
Configuration¶
Authentication¶
Environment Variables (Recommended)¶
export AWS_ACCESS_KEY_ID=AKIA...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=us-east-1
AWS Profile¶
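A sketch using the standard AWS_PROFILE variable, which boto3 and s3fs honor; the profile name is a placeholder:

# Use a named profile from ~/.aws/credentials
export AWS_PROFILE=my-profile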
IAM Role¶
When running on AWS (EC2, ECS, Lambda), IAM roles are automatically used.
Examples¶
Read from S3:
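A minimal source sketch, mirroring the file-source options shown in the Azure configuration below; bucket and path are placeholders:

source:
  type: file
  path: s3://my-bucket/raw/sales.parquet
  format: parquet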
Write to S3:
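A matching sink sketch (bucket and path are placeholders):

sink:
  type: file
  path: s3://my-bucket/processed/sales.parquet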
With variables:
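A sketch combining source and sink with ${}-style variables, as used in the patterns further down this page:

source:
  type: file
  path: s3://my-bucket/raw/${DATE}/sales.parquet
  format: parquet
sink:
  type: file
  path: s3://my-bucket/processed/${DATE}/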
Google Cloud Storage¶
Installation¶
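GCS support comes from fsspec's gcsfs backend. As above, the `quicketl[gcs]` extra is an assumption:

pip install "quicketl[gcs]"
# or install the fsspec GCS backend directly
pip install gcsfs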
Configuration¶
Authentication¶
Service Account Key¶
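Point the standard GOOGLE_APPLICATION_CREDENTIALS variable at a key file; the path is a placeholder:

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json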
Application Default Credentials¶
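For local development, Application Default Credentials can be set up with the gcloud CLI:

gcloud auth application-default login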
Examples¶
Read from GCS:
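A minimal source sketch; bucket and path are placeholders:

source:
  type: file
  path: gs://my-bucket/raw/sales.parquet
  format: parquet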
Write to GCS:
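A matching sink sketch:

sink:
  type: file
  path: gs://my-bucket/processed/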
Azure ADLS¶
Installation¶
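Azure support comes from fsspec's adlfs backend. The `quicketl[azure]` extra is an assumption:

pip install "quicketl[azure]"
# or install the fsspec Azure backend directly
pip install adlfs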
Configuration¶
source:
  type: file
  path: abfs://container@account.dfs.core.windows.net/data/sales.parquet
  format: parquet
Authentication¶
Connection String¶
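A sketch assuming adlfs's environment-variable support; the connection string value is a placeholder:

export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"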
Account Key¶
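A sketch using the account name/key environment variables recognized by adlfs; values are placeholders:

export AZURE_STORAGE_ACCOUNT_NAME=myaccount
export AZURE_STORAGE_ACCOUNT_KEY=...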
Service Principal¶
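A sketch assuming adlfs's service-principal environment variables; values are placeholders:

export AZURE_STORAGE_TENANT_ID=...
export AZURE_STORAGE_CLIENT_ID=...
export AZURE_STORAGE_CLIENT_SECRET=...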
Examples¶
Read from Azure:
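A sketch reading CSV over the az:// Blob scheme from the URI table below; container and path are placeholders:

source:
  type: file
  path: az://container/raw/sales.csv
  format: csv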
Write to Azure:
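A matching sink sketch using the ADLS Gen2 abfs:// scheme; account, container, and path are placeholders:

sink:
  type: file
  path: abfs://container@account.dfs.core.windows.net/processed/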
URI Formats¶
| Provider | Format |
|---|---|
| AWS S3 | s3://bucket/path/file.parquet |
| GCS | gs://bucket/path/file.parquet |
| Azure ADLS Gen2 | abfs://container@account.dfs.core.windows.net/path/file.parquet |
| Azure Blob | az://container/path/file.parquet |
Common Patterns¶
Date-Partitioned Data¶
source:
  type: file
  path: s3://bucket/data/year=${YEAR}/month=${MONTH}/day=${DAY}/
sink:
  type: file
  path: s3://bucket/output/${DATE}/
  partition_by: [region]
Cross-Cloud Transfer¶
Read from one provider, write to another:
source:
  type: file
  path: s3://source-bucket/data.parquet
sink:
  type: file
  path: gs://dest-bucket/data.parquet
Environment-Specific Buckets¶
source:
  type: file
  path: ${DATA_BUCKET}/raw/sales.parquet
sink:
  type: file
  path: ${OUTPUT_BUCKET}/processed/
# Development
export DATA_BUCKET=s3://dev-data
export OUTPUT_BUCKET=s3://dev-output
# Production
export DATA_BUCKET=s3://prod-data
export OUTPUT_BUCKET=s3://prod-output
Performance Tips¶
Use Parquet¶
Parquet files are faster to read from cloud storage due to:
- Columnar format (read only needed columns)
- Built-in compression
- Predicate pushdown support
Regional Proximity¶
Place compute near your data:
- Use same region for storage and compute
- Consider multi-region buckets for global access
Compression¶
Parquet files are already compressed internally; CSV output benefits from explicit compression.
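A sketch assuming the file sink exposes a `compression` option (the exact option name may differ in your QuickETL version):

sink:
  type: file
  path: s3://my-bucket/output/sales.csv.gz
  format: csv
  compression: gzip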
Troubleshooting¶
Access Denied¶
- Verify credentials are set correctly
- Check bucket/object permissions
- Ensure IAM role has required permissions
Bucket Not Found¶
- Check bucket name spelling
- Verify bucket exists in the expected region
- Check credentials have access to the bucket
Slow Performance¶
- Check network connectivity
- Verify data is in the same region as compute
- Consider using larger instance types
- Use Parquet instead of CSV
Missing Credentials¶
- Set environment variables
- Configure AWS profile/GCP service account/Azure credentials
- When running locally, ensure the credentials file exists
Security Best Practices¶
Use IAM Roles¶
Prefer IAM roles over access keys:
# Running on AWS EC2/ECS with IAM role
source:
  type: file
  path: s3://bucket/data.parquet
# No credentials needed - uses instance role
Don't Commit Credentials¶
Add to .gitignore:
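For example (adjust the patterns to the credential files your project actually uses):

# Never commit credential or key files
.env
*.pem
service-account*.json
credentials.json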
Use Secrets Managers¶
For production, use secrets managers:
- AWS Secrets Manager
- Google Secret Manager
- Azure Key Vault
Least Privilege¶
Grant minimal permissions:
{
  "Effect": "Allow",
  "Action": [
    "s3:GetObject",
    "s3:PutObject"
  ],
  "Resource": [
    "arn:aws:s3:::my-bucket/data/*"
  ]
}
Related¶
- File Sources - File format options
- File Sinks - Writing files
- Environment Variables - Credential configuration