AW Dev Rethought

🕵️ Debugging is like being the detective in a crime movie where you are also the murderer - Filipe Fortes

📊 Python Data Workflows – ✅ Data Validation 🐍


Description:

Cleaning data is important, but validation is what ensures the data can actually be trusted.

Even after handling missing values and duplicates, datasets may still contain invalid records. Sales could be negative, quantities might be zero, or categories may not follow expected values.

This is where data validation becomes essential.


Why Data Validation Matters

Data validation acts as a quality gate before analysis.

Without validation:

  • reports may contain incorrect numbers
  • dashboards may show misleading insights
  • downstream workflows can fail unexpectedly

By applying simple validation rules, we can identify problematic records early.
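
As a rough illustration of what "simple validation rules" could look like, the sketch below expresses each rule as a named boolean mask and counts the failing rows. The sample rows and the `rules` dictionary are assumptions made for this example; only the column names follow the rest of the article.

```python
import pandas as pd

# Hypothetical sample data; column names follow the article's examples.
df = pd.DataFrame({
    "sales": [120.0, -5.0, 80.0],
    "quantity": [2, 1, 0],
})

# Each rule maps a name to a boolean mask that is True where a row FAILS.
rules = {
    "negative_sales": df["sales"] < 0,
    "non_positive_quantity": df["quantity"] <= 0,
}

# Count the failing rows for every rule.
failures = {name: int(mask.sum()) for name, mask in rules.items()}
print(failures)
```

Keeping the rules in one dictionary makes it easy to add a new check without touching the counting logic.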


Validating Numeric Fields

A common rule is checking whether numeric values fall within expected ranges.

invalid_sales = df[df["sales"] < 0]

This filter selects every row with a negative sales value, so those records can be reviewed or corrected before they distort totals and averages.
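
One caveat worth sketching: a comparison such as `< 0` is False for missing values, and it may fail on a column that contains text. The example below is a minimal sketch, assuming the `sales` column might hold a stray string such as "N/A"; the sample values are invented for illustration.

```python
import pandas as pd

# Hypothetical data: one negative value and one non-numeric entry.
df = pd.DataFrame({"sales": [250.0, -40.0, "N/A", 99.5]})

# Coerce to numeric so text becomes NaN instead of raising an error.
sales = pd.to_numeric(df["sales"], errors="coerce")

# Negative values and unparseable values are reported separately,
# because NaN < 0 evaluates to False and would otherwise go unnoticed.
negative_sales = df[sales < 0]
unparseable_sales = df[sales.isna()]

print(len(negative_sales), len(unparseable_sales))
```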


Validating Business Rules

Many columns accept only a predefined set of values, such as a fixed list of product categories.

valid_categories = [
    "Electronics",
    "Furniture",
    "Clothing"
]

Anything outside these values should be reviewed.
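
A minimal sketch of how such a list might be applied with `isin()`. It assumes that stray whitespace and letter case are formatting noise rather than real errors, which may not hold for every dataset; the sample rows are hypothetical.

```python
import pandas as pd

valid_categories = ["Electronics", "Furniture", "Clothing"]

# Hypothetical rows: one badly formatted value, one unknown, one missing.
df = pd.DataFrame({
    "category": ["Electronics", "furniture ", "Toys", None],
})

# Assumption: trimming and title-casing is an acceptable normalisation.
normalised = df["category"].str.strip().str.title()

# isin() returns False for missing values, so they are flagged as well.
invalid_categories = df[~normalised.isin(valid_categories)]
print(invalid_categories)
```

Without the normalisation step, "furniture " would be reported as invalid even though it is only a formatting problem.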


Building Validation Reports

Instead of inspecting each issue manually, we can build a validation report that summarises the quality of the dataset.

validation_report = {
    "missing_values": ...,
    "duplicates": ...
}

This creates a quick overview of potential problems.
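
The report can also drive a simple pass/fail decision. The sketch below is one possible approach, not a fixed recipe: `build_validation_report` is a hypothetical helper, and the sample rows are invented for illustration.

```python
import pandas as pd


def build_validation_report(df: pd.DataFrame) -> dict:
    """Count common data-quality issues (hypothetical helper)."""
    return {
        "missing_values": int(df.isnull().sum().sum()),
        "duplicates": int(df.duplicated().sum()),
        "invalid_sales": int((df["sales"] < 0).sum()),
    }


# Hypothetical rows: one duplicate, one negative sale, one missing value.
df = pd.DataFrame({
    "sales": [100.0, 100.0, -20.0, None],
    "category": ["Furniture", "Furniture", "Clothing", "Electronics"],
})

report = build_validation_report(df)

# Keep only the checks that found at least one problem.
failed_checks = {name: count for name, count in report.items() if count > 0}

print(report)
print(failed_checks)
```

A pipeline could stop, or raise an error, whenever `failed_checks` is not empty.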


Real-World Applications

Data validation is used extensively in:

  • ETL pipelines
  • Reporting systems
  • Data warehouses
  • Analytics platforms
  • Machine learning pipelines

It helps ensure that only data passing the defined checks moves forward.


Key Takeaways

  • Validation is different from cleaning
  • Rules help enforce data quality standards
  • Validation reports simplify monitoring
  • Reliable insights depend on reliable data

Code Snippet:

import pandas as pd


# Step 1 — Load Dataset
df = pd.read_csv("sample_data.csv")

print("Dataset Loaded\n")


# Step 2 — Missing Values Check
missing_values = df.isnull().sum()

print("Missing Values:")
print(missing_values, "\n")


# Step 3 — Duplicate Records Check
duplicates = df.duplicated().sum()

print(f"Duplicate Records: {duplicates}\n")


# Step 4 — Validate Sales Values
# Note: comparisons with NaN are False, so missing sales are not flagged here
invalid_sales = df[df["sales"] < 0]

print("Invalid Sales Records:")
print(invalid_sales, "\n")


# Step 5 — Validate Quantity Values
invalid_quantity = df[df["quantity"] <= 0]

print("Invalid Quantity Records:")
print(invalid_quantity, "\n")


# Step 6 — Validate Categories
valid_categories = [
    "Electronics",
    "Furniture",
    "Clothing"
]

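# Note: isin() returns False for missing values, so rows with a
# missing category are flagged here as well
invalid_categories = df[
    ~df["category"].isin(valid_categories)
]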

print("Invalid Categories:")
print(invalid_categories, "\n")


# Step 7 — Validation Summary
validation_report = {
    "missing_values": int(missing_values.sum()),  # reuse the Step 2 result
    "duplicates": int(duplicates),  # reuse the Step 3 result
    "invalid_sales": len(invalid_sales),
    "invalid_quantity": len(invalid_quantity),
    "invalid_categories": len(invalid_categories)
}

print("Validation Report:")
print(validation_report, "\n")


# Key Takeaways
print("Key Takeaways:")
print("Validation ensures data quality")
print("Business rules catch invalid records")
print("Validation reports simplify monitoring")
print("Trusted data leads to trusted insights\n")


# Final Note
print("Good analysis starts with trusted data!")
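
The script above expects a file named sample_data.csv, which is not included here. To try it out, a small file could be generated first. The rows below are purely hypothetical; only the column names (`sales`, `quantity`, `category`) come from the script.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the script above.
sample = pd.DataFrame({
    "sales": [250.0, -40.0, 130.0, 130.0, None],
    "quantity": [2, 1, 0, 0, 3],
    "category": ["Electronics", "Furniture", "Toys", "Toys", "Clothing"],
})

# index=False keeps the row index out of the file.
sample.to_csv("sample_data.csv", index=False)

# Read the file back to confirm the round trip.
df = pd.read_csv("sample_data.csv")
print(df.shape)
```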
