Cloud Storage¶
Read files from and write files to AWS S3, Google Cloud Storage, and Azure ADLS.
Overview¶
QuickETL supports cloud storage through fsspec, which provides a unified interface across all three providers.
AWS S3¶
Installation¶
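S3 access goes through fsspec's s3fs backend. A minimal sketch; the `quicketl[s3]` extra is an assumption, so install s3fs directly if your version does not ship it:

pip install "quicketl[s3]"
# or install the fsspec S3 backend directly
pip install s3fs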
Configuration¶
Authentication¶
Environment Variables (Recommended)¶
export AWS_ACCESS_KEY_ID=AKIA...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=us-east-1
AWS Profile¶
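A sketch using the standard AWS_PROFILE variable, which boto3 and s3fs honor; the profile name is a placeholder:

# Use a named profile from ~/.aws/credentials
export AWS_PROFILE=my-profile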
IAM Role¶
When running on AWS (EC2, ECS, Lambda), IAM roles are automatically used.
Examples¶
Read from S3:
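A minimal source sketch, mirroring the file-source options shown in the Azure configuration below; bucket and path are placeholders:

source:
  type: file
  path: s3://my-bucket/raw/sales.parquet
  format: parquet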
Write to S3:
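A matching sink sketch (bucket and path are placeholders):

sink:
  type: file
  path: s3://my-bucket/processed/sales.parquet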
With variables:
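A sketch combining source and sink with ${}-style variables, as used in the patterns further down this page:

source:
  type: file
  path: s3://my-bucket/raw/${DATE}/sales.parquet
  format: parquet
sink:
  type: file
  path: s3://my-bucket/processed/${DATE}/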
Google Cloud Storage¶
Installation¶
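GCS support comes from fsspec's gcsfs backend. As above, the `quicketl[gcs]` extra is an assumption:

pip install "quicketl[gcs]"
# or install the fsspec GCS backend directly
pip install gcsfs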
Configuration¶
Authentication¶
Service Account Key¶
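Point the standard GOOGLE_APPLICATION_CREDENTIALS variable at a key file; the path is a placeholder:

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json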
Application Default Credentials¶
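For local development, Application Default Credentials can be set up with the gcloud CLI:

gcloud auth application-default login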
Examples¶
Read from GCS:
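A minimal source sketch; bucket and path are placeholders:

source:
  type: file
  path: gs://my-bucket/raw/sales.parquet
  format: parquet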
Write to GCS:
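A matching sink sketch:

sink:
  type: file
  path: gs://my-bucket/processed/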
Azure ADLS¶
Installation¶
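Azure support comes from fsspec's adlfs backend. The `quicketl[azure]` extra is an assumption:

pip install "quicketl[azure]"
# or install the fsspec Azure backend directly
pip install adlfs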
Configuration¶
source:
  type: file
  path: abfs://container@account.dfs.core.windows.net/data/sales.parquet
  format: parquet
Authentication¶
Connection String¶
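A sketch assuming adlfs's environment-variable support; the connection string value is a placeholder:

export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"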
Account Key¶
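A sketch using the account name/key environment variables recognized by adlfs; values are placeholders:

export AZURE_STORAGE_ACCOUNT_NAME=myaccount
export AZURE_STORAGE_ACCOUNT_KEY=...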
Service Principal¶
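A sketch assuming adlfs's service-principal environment variables; values are placeholders:

export AZURE_STORAGE_TENANT_ID=...
export AZURE_STORAGE_CLIENT_ID=...
export AZURE_STORAGE_CLIENT_SECRET=...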
Examples¶
Read from Azure:
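A sketch reading CSV over the az:// Blob scheme from the URI table below; container and path are placeholders:

source:
  type: file
  path: az://container/raw/sales.csv
  format: csv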
Write to Azure:
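A matching sink sketch using the ADLS Gen2 abfs:// scheme; account, container, and path are placeholders:

sink:
  type: file
  path: abfs://container@account.dfs.core.windows.net/processed/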
URI Formats¶
| Provider | Format |
|---|---|
| AWS S3 | s3://bucket/path/file.parquet |
| GCS | gs://bucket/path/file.parquet |
| Azure ADLS Gen2 | abfs://container@account.dfs.core.windows.net/path/file.parquet |
| Azure Blob | az://container/path/file.parquet |
Common Patterns¶
Date-Partitioned Data¶
source:
  type: file
  path: s3://bucket/data/year=${YEAR}/month=${MONTH}/day=${DAY}/
sink:
  type: file
  path: s3://bucket/output/${DATE}/
  partition_by: [region]
Cross-Cloud Transfer¶
Read from one provider, write to another:
source:
  type: file
  path: s3://source-bucket/data.parquet
sink:
  type: file
  path: gs://dest-bucket/data.parquet
Environment-Specific Buckets¶
source:
  type: file
  path: ${DATA_BUCKET}/raw/sales.parquet
sink:
  type: file
  path: ${OUTPUT_BUCKET}/processed/
# Development
export DATA_BUCKET=s3://dev-data
export OUTPUT_BUCKET=s3://dev-output
# Production
export DATA_BUCKET=s3://prod-data
export OUTPUT_BUCKET=s3://prod-output
Performance Tips¶
Use Parquet¶
Parquet files are faster to read from cloud storage due to:
- Columnar format (read only needed columns)
- Built-in compression
- Predicate pushdown support
Regional Proximity¶
Place compute near your data:
- Use same region for storage and compute
- Consider multi-region buckets for global access
Compression¶
Parquet files are already compressed internally; CSV output benefits from explicit compression.
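A sketch assuming the file sink exposes a `compression` option (the exact option name may differ in your QuickETL version):

sink:
  type: file
  path: s3://my-bucket/output/sales.csv.gz
  format: csv
  compression: gzip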
Troubleshooting¶
Access Denied¶
- Verify credentials are set correctly
- Check bucket/object permissions
- Ensure IAM role has required permissions
Bucket Not Found¶
- Check bucket name spelling
- Verify bucket exists in the expected region
- Check credentials have access to the bucket
Slow Performance¶
- Check network connectivity
- Verify data is in the same region as compute
- Consider using larger instance types
- Use Parquet instead of CSV
Missing Credentials¶
- Set environment variables
- Configure AWS profile/GCP service account/Azure credentials
- When running locally, ensure the credentials file exists
Security Best Practices¶
Use IAM Roles¶
Prefer IAM roles over access keys:
# Running on AWS EC2/ECS with IAM role
source:
  type: file
  path: s3://bucket/data.parquet
# No credentials needed - uses instance role
Don't Commit Credentials¶
Add to .gitignore:
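For example (adjust the patterns to the credential files your project actually uses):

# Never commit credential or key files
.env
*.pem
service-account*.json
credentials.json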
Use Secrets Managers¶
For production, use secrets managers:
- AWS Secrets Manager
- Google Secret Manager
- Azure Key Vault
Least Privilege¶
Grant minimal permissions:
{
  "Effect": "Allow",
  "Action": [
    "s3:GetObject",
    "s3:PutObject"
  ],
  "Resource": [
    "arn:aws:s3:::my-bucket/data/*"
  ]
}
Related¶
- File Sources - File format options
- File Sinks - Writing files
- Environment Variables - Credential configuration