AW Dev Rethought

"Truth can only be found in one place: the code." - Robert C. Martin

📊 Python Data Workflows – 🧹 Data Cleaning Pipeline 🐍


Description:

Real-world data is rarely clean.

Before any analysis or modeling, datasets usually contain missing values, duplicate records, and inconsistent data types. If these issues are not handled properly, they can lead to incorrect insights and unreliable results.

That’s why building a data cleaning pipeline is essential.


Why a Cleaning Pipeline?

Instead of cleaning data ad hoc, a pipeline ensures:

  • consistency
  • repeatability
  • reliability

It transforms raw data into a structured format ready for analysis.
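
As a minimal sketch of the idea (assuming each step is a function that takes a DataFrame and returns a new one), pandas' .pipe() lets the whole pipeline read top to bottom:

import pandas as pd

def remove_duplicates(df):
    return df.drop_duplicates()

def fill_missing(df):
    # Illustrative rule: fill missing sales with the column median.
    return df.fillna({"sales": df["sales"].median()})

cleaned = (
    pd.read_csv("sample_data.csv")
      .pipe(remove_duplicates)
      .pipe(fill_missing)
)

Each step stays small and testable, and the chain itself documents the order of operations.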


Handling Missing Values

Missing values are one of the most common problems.

df[col] = df[col].fillna(df[col].median())

Missing values in numeric columns are filled with the median, while missing entries in text fields are replaced with "Unknown". This keeps the dataset usable without dropping rows.
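
A minimal sketch of that rule, branching on dtype per column (the full pipeline at the end of this post does the same; the sample values here are made up):

import pandas as pd

df = pd.DataFrame({"sales": [100.0, None, 250.0],
                   "region": ["North", None, "South"]})

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())  # numeric -> median (175.0 here)
    else:
        df[col] = df[col].fillna("Unknown")         # text -> "Unknown"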


Removing Duplicates

Duplicate rows can distort metrics like totals and averages.

df = df.drop_duplicates()

This ensures each record is unique and prevents double-counting.
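
By default, drop_duplicates() compares entire rows. Its subset and keep parameters let you define duplicates more narrowly; a short sketch with illustrative column names:

import pandas as pd

df = pd.DataFrame({"order_id": [1, 1, 2],
                   "sales": [100, 100, 250]})

# Full-row duplicates: every column must match.
df = df.drop_duplicates()

# Key-based duplicates: keep only the first row per order_id.
df = df.drop_duplicates(subset=["order_id"], keep="first")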


Fixing Data Types

Data types directly affect how you process data.

df["order_date"] = pd.to_datetime(df["order_date"])

Dates, numbers, and text must be correctly formatted to enable filtering, aggregation, and time-based analysis.
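
One detail worth knowing: pd.to_datetime and pd.to_numeric both accept errors="coerce", which turns unparseable entries into NaT/NaN instead of raising, so a single bad value cannot break the whole conversion. A quick sketch:

import pandas as pd

dates = pd.to_datetime(pd.Series(["2024-01-15", "not a date"]), errors="coerce")
# "not a date" becomes NaT instead of raising a ValueError

nums = pd.to_numeric(pd.Series(["10", "oops", "30"]), errors="coerce")
# "oops" becomes NaN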


Why This Matters

Without cleaning:

  • analysis becomes unreliable
  • dashboards show incorrect trends
  • models learn from noisy data

With a proper pipeline:

  • data becomes trustworthy
  • workflows become reusable
  • insights become meaningful

Key Takeaways

  • Cleaning is a mandatory step in every workflow
  • Missing values, duplicates, and types must be handled systematically
  • Pipelines make data processing scalable
  • Clean data leads to accurate insights

Complete Code Snippet:

import pandas as pd


def load_data(file_path):
    df = pd.read_csv(file_path)
    print("✅ Data Loaded\n")
    return df


def inspect_data(df):
    print("📌 Dataset Shape:", df.shape)
    print("\n📌 Column Names:", df.columns.tolist())
    print("\n📌 Data Types:\n", df.dtypes)
    print("\n📌 Missing Values:\n", df.isnull().sum(), "\n")


def standardize_columns(df):
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
    print("✅ Column names standardized\n")
    return df


def remove_duplicates(df):
    before = len(df)
    df = df.drop_duplicates()
    after = len(df)
    print(f"🧹 Removed {before - after} duplicate row(s)\n")
    return df


def handle_missing_values(df):
    for col in df.columns:
        # Use a dtype check that covers all numeric types (int32, float32, ...).
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        elif pd.api.types.is_datetime64_any_dtype(df[col]):
            continue  # leave missing dates as NaT rather than a fake string
        else:
            df[col] = df[col].fillna("Unknown")
    print("✅ Missing values handled\n")
    return df


def fix_data_types(df):
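    # Coerce bad values to NaT/NaN rather than raising.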
    if "order_date" in df.columns:
        df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

    for col in ["sales", "quantity"]:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")

    print("✅ Data types standardized\n")
    return df


def verify_data(df):
    print("📌 Data Types After Cleaning:\n", df.dtypes)
    print("\n📌 Missing Values After Cleaning:\n", df.isnull().sum())
    print("\n📊 Sample Data:\n", df.head(), "\n")


def save_data(df, output_path):
    df.to_csv(output_path, index=False)
    print(f"💾 Cleaned data saved to {output_path}\n")


def main():
    file_path = "sample_data.csv"
    output_path = "cleaned_sample_data.csv"

    df = load_data(file_path)
    inspect_data(df)
    df = standardize_columns(df)
    df = remove_duplicates(df)
    # Fix types first so values coerced to NaN/NaT are then handled by the fill step.
    df = fix_data_types(df)
    df = handle_missing_values(df)
    verify_data(df)
    save_data(df, output_path)


if __name__ == "__main__":
    main()
