Quality Check Types¶
This page documents all 5 quality check types available in QuickETL.
Quick Reference¶
| Check | Purpose | Example |
|---|---|---|
not_null |
Ensure no null values | columns: [id, name] |
unique |
Verify uniqueness | columns: [id] |
row_count |
Validate row count bounds | min: 1, max: 100000 |
accepted_values |
Check against whitelist | values: [a, b, c] |
expression |
Custom SQL validation | expr: amount >= 0 |
Usage Overview¶
Checks run after transforms, before writing to the sink:
graph LR
A[Source] --> B[Transforms]
B --> C[Quality Checks]
C --> D[Sink]
checks:
- type: not_null
columns: [id, name, amount]
- type: unique
columns: [id]
- type: row_count
min: 1
- type: expression
expr: amount >= 0
not_null¶
Verify that specified columns contain no null values.
Usage¶
Parameters¶
| Parameter | Required | Type | Description |
|---|---|---|---|
columns |
Yes | list[str] |
Columns that must not contain nulls |
Examples¶
# Single column
- type: not_null
columns: [id]
# Multiple columns
- type: not_null
columns: [id, customer_id, amount, created_at]
# Check join results
- type: not_null
columns: [customer_name] # Verify join found matches
Pass/Fail¶
- Pass: All values in specified columns are non-null
- Fail: Any null value found in specified columns
Python API¶
from quicketl.config.checks import NotNullCheck
check = NotNullCheck(columns=["id", "name", "amount"])
unique¶
Verify that column values are unique (no duplicates).
Usage¶
Parameters¶
| Parameter | Required | Type | Description |
|---|---|---|---|
columns |
Yes | list[str] |
Columns that must be unique |
Examples¶
# Primary key
- type: unique
columns: [id]
# Composite key
- type: unique
columns: [customer_id, order_date]
# Verify after dedup
- type: unique
columns: [id]
Pass/Fail¶
- Pass: All combinations of specified columns are unique
- Fail: Duplicate values found
Null Handling¶
Nulls are treated as equal for uniqueness - two rows with NULL in id are considered duplicates.
Python API¶
from quicketl.config.checks import UniqueCheck
check = UniqueCheck(columns=["id"])
check = UniqueCheck(columns=["customer_id", "order_date"])
row_count¶
Verify that the row count is within expected bounds.
Usage¶
Parameters¶
| Parameter | Required | Default | Description |
|---|---|---|---|
min |
No | None | Minimum row count |
max |
No | None | Maximum row count |
At least one of min or max must be specified.
Examples¶
# Non-empty output
- type: row_count
min: 1
# Maximum only
- type: row_count
max: 1000000
# Expected range
- type: row_count
min: 1000
max: 10000
# With variables
- type: row_count
min: ${MIN_ROWS:-1}
max: ${MAX_ROWS:-1000000}
Pass/Fail¶
- Pass: Row count is within specified bounds
- Fail: Row count is outside bounds
Python API¶
from quicketl.config.checks import RowCountCheck
check = RowCountCheck(min=1)
check = RowCountCheck(min=100, max=10000)
accepted_values¶
Verify that all values in a column are from an allowed list.
Usage¶
Parameters¶
| Parameter | Required | Type | Description |
|---|---|---|---|
column |
Yes | str |
Column to check |
values |
Yes | list[any] |
Allowed values |
Examples¶
# Status field
- type: accepted_values
column: status
values: [pending, active, completed, cancelled]
# Region codes
- type: accepted_values
column: region
values: [north, south, east, west]
# Numeric values
- type: accepted_values
column: priority
values: [1, 2, 3, 4, 5]
# Boolean-ish fields
- type: accepted_values
column: is_verified
values: [true, false, "Y", "N", 1, 0]
Pass/Fail¶
- Pass: All values in the column are in the allowed list
- Fail: Any value not in the allowed list
Null Handling¶
By default, nulls are considered invalid. To allow nulls, include null in values:
Case Sensitivity¶
Values are case-sensitive. Standardize case before checking if needed:
transforms:
- op: derive_column
name: status_lower
expr: lower(status)
checks:
- type: accepted_values
column: status_lower
values: [pending, active, completed]
Python API¶
from quicketl.config.checks import AcceptedValuesCheck
check = AcceptedValuesCheck(column="status", values=["pending", "active", "completed"])
expression¶
Validate data using a custom SQL expression.
Usage¶
Parameters¶
| Parameter | Required | Type | Description |
|---|---|---|---|
expr |
Yes | str |
SQL expression that must be true for all rows |
Examples¶
# Positive values
- type: expression
expr: amount >= 0
# Range check
- type: expression
expr: amount BETWEEN 0 AND 10000
# Non-empty string
- type: expression
expr: name IS NOT NULL AND length(name) > 0
# Date validation
- type: expression
expr: created_at <= current_date()
# Multiple conditions
- type: expression
expr: amount > 0 AND quantity > 0 AND price > 0
# Business rules
- type: expression
expr: discount <= amount
# End date after start date
- type: expression
expr: end_date >= start_date
# Valid email format (basic)
- type: expression
expr: email LIKE '%@%.%'
Multiline Expressions¶
- type: expression
expr: |
amount > 0
AND quantity > 0
AND status IN ('pending', 'active', 'completed')
Complex Logic¶
Pass/Fail¶
- Pass: Expression is true for all rows
- Fail: Expression is false for any row
Python API¶
Check Results¶
Pipeline results include check details:
result = pipeline.run()
# Check overall status
if result.check_results["all_passed"]:
print("All checks passed!")
else:
# Examine individual results
for check in result.check_results["results"]:
print(f"{check['check_type']}: {'PASS' if check['passed'] else 'FAIL'}")
print(f" {check['message']}")
Failure Behavior¶
Fail on Check Failure (Default)¶
If any check fails, the pipeline stops and no output is written.
Continue on Check Failure¶
Checks are logged but pipeline continues.
Python API¶
result = pipeline.run(fail_on_check_failure=False)
if not result.check_results["all_passed"]:
print("Some checks failed:", result.check_results)
Common Patterns¶
Data Quality Gate¶
checks:
# Mandatory columns
- type: not_null
columns: [id, customer_id, amount]
# Primary key
- type: unique
columns: [id]
# Business rules
- type: expression
expr: amount > 0
- type: accepted_values
column: status
values: [pending, processing, completed, failed]
Anomaly Detection¶
checks:
# Reasonable data volume
- type: row_count
min: 1000
max: 100000
# No extreme values
- type: expression
expr: amount BETWEEN 0 AND 10000
After Join Validation¶
transforms:
- op: join
right: customers
on: [customer_id]
how: left
checks:
# Verify join found matches
- type: not_null
columns: [customer_name]
# Check join success
- type: row_count
min: 1
Related¶
- Expression Language - Full expression syntax
- Transform Operations - Prepare data before checks