📊 Python Data Workflows – ✅ Data Validation 🐍
Posted on: June 19, 2026
Description:
Cleaning data is important, but validation is what ensures the data can actually be trusted.
Even after handling missing values and duplicates, datasets may still contain invalid records. Sales could be negative, quantities might be zero, or categories may not follow expected values.
This is where data validation becomes essential.
Why Data Validation Matters
Data validation acts as a quality gate before analysis.
Without validation:
- reports may contain incorrect numbers
- dashboards may show misleading insights
- downstream workflows can fail unexpectedly
By applying simple validation rules, we can identify problematic records early.
Validating Numeric Fields
A common rule is checking whether numeric values fall within expected ranges.
invalid_sales = df[df["sales"] < 0]
This selects the rows with negative sales so they can be reviewed or corrected before they distort calculations.
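The same pattern extends to a two-sided range check. The sketch below is a minimal example on made-up data: the helper name `out_of_range` and the bounds of 0 and 100,000 are assumptions for illustration, not part of the original workflow. It converts the column to numbers first, so text entries such as "n/a" are flagged instead of raising an error.

```python
import pandas as pd

def out_of_range(df, column, lower=None, upper=None):
    """Return rows whose value in `column` is non-numeric or outside [lower, upper].

    Missing values are not flagged here; they belong to the missing-values check.
    """
    values = pd.to_numeric(df[column], errors="coerce")
    # Entries that were present but could not be parsed as numbers
    bad = values.isna() & df[column].notna()
    if lower is not None:
        bad |= values < lower
    if upper is not None:
        bad |= values > upper
    return df[bad]

# Illustrative data only, not from a real dataset
sample = pd.DataFrame({"sales": [120.0, -5.0, "n/a", None, 1_000_000.0]})
invalid_sales = out_of_range(sample, "sales", lower=0, upper=100_000)
print(invalid_sales)
```

Keeping the bounds as parameters lets one helper serve several columns, for example `out_of_range(df, "quantity", lower=1)`.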
Validating Business Rules
Many datasets have predefined acceptable values.
valid_categories = [
    "Electronics",
    "Furniture",
    "Clothing",
]
Anything outside these values should be reviewed.
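As a rough sketch of how that review might start, the example below flags rows whose category is not in the list and counts each unexpected value. The sample rows are invented for illustration. Note that the check is an exact match, so missing values and near-misses with different case or stray whitespace are flagged as well.

```python
import pandas as pd

valid_categories = ["Electronics", "Furniture", "Clothing"]

# Illustrative data only
sample = pd.DataFrame(
    {"category": ["Electronics", "furniture ", "Toys", None, "Clothing"]}
)

# Exact-match check: anything not in the list is flagged, including
# missing values and near-misses such as "furniture " (case, whitespace)
is_invalid = ~sample["category"].isin(valid_categories)
invalid_categories = sample[is_invalid]

# Count each unexpected value so the review can start with the most common one
unexpected_counts = invalid_categories["category"].value_counts(dropna=False)
print(unexpected_counts)
```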
Building Validation Reports
Instead of inspecting issues manually, a validation report summarises dataset quality as a count of problem records per check.
validation_report = {
    "missing_values": ...,
    "duplicates": ...
}
This creates a quick overview of potential problems.
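One possible way to make the report reusable is to wrap the counts in a function. This is a sketch, not a fixed design: the function name `build_validation_report` is hypothetical, the sample rows are invented, and the column names `sales`, `quantity`, and `category` follow the rules described in this post.

```python
import pandas as pd

def build_validation_report(df, valid_categories):
    """Count problem records per check and return the counts as a dict."""
    return {
        "missing_values": int(df.isnull().sum().sum()),
        "duplicates": int(df.duplicated().sum()),
        "invalid_sales": int((df["sales"] < 0).sum()),
        "invalid_quantity": int((df["quantity"] <= 0).sum()),
        "invalid_categories": int((~df["category"].isin(valid_categories)).sum()),
    }

# Illustrative data only: the last two rows are exact duplicates
sample = pd.DataFrame({
    "sales": [100.0, -20.0, 50.0, 50.0],
    "quantity": [2, 1, 0, 0],
    "category": ["Electronics", "Toys", "Clothing", "Clothing"],
})
report = build_validation_report(sample, ["Electronics", "Furniture", "Clothing"])
for check, count in report.items():
    print(f"{check}: {count}")
```

Because the result is a plain dict of integers, it can be logged, written to JSON, or compared between runs without extra conversion.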
Real-World Applications
Data validation is used extensively in:
- ETL pipelines
- Reporting systems
- Data warehouses
- Analytics platforms
- Machine learning pipelines
In each case, it helps ensure that only records passing the defined rules move forward.
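In a pipeline setting, one way to act on this is to split the data into rows that pass every rule and rows that fail at least one, then send only the first group downstream. The sketch below assumes the same three rules used in this post; the function name `split_valid_invalid` and the sample rows are hypothetical.

```python
import pandas as pd

def split_valid_invalid(df, valid_categories):
    """Split a DataFrame into rows that pass every rule and rows that fail any."""
    # Comparisons against missing values evaluate to False,
    # so rows with a missing sales or quantity value are rejected too
    passes = (
        df["sales"].ge(0)
        & df["quantity"].gt(0)
        & df["category"].isin(valid_categories)
    )
    return df[passes], df[~passes]

# Illustrative data only
sample = pd.DataFrame({
    "sales": [100.0, -20.0, 50.0, None],
    "quantity": [2, 1, 0, 3],
    "category": ["Electronics", "Furniture", "Clothing", "Toys"],
})
clean, rejected = split_valid_invalid(
    sample, ["Electronics", "Furniture", "Clothing"]
)
print(f"Passing rows: {len(clean)}, rejected rows: {len(rejected)}")
```

Keeping the rejected rows, rather than dropping them silently, leaves a record that can be reviewed or corrected later.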
Key Takeaways
- Validation is different from cleaning
- Rules help enforce data quality standards
- Validation reports simplify monitoring
- Reliable insights depend on reliable data
Code Snippet:
import pandas as pd
# Step 1 — Load Dataset
df = pd.read_csv("sample_data.csv")
print("Dataset Loaded\n")
# Step 2 — Missing Values Check
missing_values = df.isnull().sum()
print("Missing Values:")
print(missing_values, "\n")
# Step 3 — Duplicate Records Check
duplicates = df.duplicated().sum()
print(f"Duplicate Records: {duplicates}\n")
# Step 4 — Validate Sales Values
invalid_sales = df[df["sales"] < 0]
print("Invalid Sales Records:")
print(invalid_sales, "\n")
# Step 5 — Validate Quantity Values
invalid_quantity = df[df["quantity"] <= 0]
print("Invalid Quantity Records:")
print(invalid_quantity, "\n")
# Step 6 — Validate Categories
valid_categories = [
    "Electronics",
    "Furniture",
    "Clothing",
]
# Rows with a missing category are flagged too, because NaN is not in the list
invalid_categories = df[
    ~df["category"].isin(valid_categories)
]
print("Invalid Categories:")
print(invalid_categories, "\n")
# Step 7 — Validation Summary
# Reuse the counts from Steps 2 and 3 instead of recomputing them
validation_report = {
    "missing_values": int(missing_values.sum()),
    "duplicates": int(duplicates),
    "invalid_sales": len(invalid_sales),
    "invalid_quantity": len(invalid_quantity),
    "invalid_categories": len(invalid_categories),
}
print("Validation Report:")
print(validation_report, "\n")
# Key Takeaways
print("Key Takeaways:")
print("Validation ensures data quality")
print("Business rules catch invalid records")
print("Validation reports simplify monitoring")
print("Trusted data leads to trusted insights\n")
# Final Note
print("Good analysis starts with trusted data!")