AW Dev Rethought

"Truth can only be found in one place: the code." - Robert C. Martin

📊 Python Data Workflows – 🧹 Data Cleaning Pipeline 🐍


Description:

Real-world data is rarely clean.

Before any analysis or modeling, datasets usually contain missing values, duplicate records, and inconsistent data types. If these issues are not handled properly, they can lead to incorrect insights and unreliable results.

That’s why building a data cleaning pipeline is essential.


Why a Cleaning Pipeline?

Instead of cleaning data ad hoc, a pipeline ensures:

  • consistency
  • repeatability
  • reliability

It transforms raw data into a structured format ready for analysis.
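
As a minimal sketch of the idea (assuming each step is a function that takes a DataFrame and returns a new one), pandas' .pipe() lets the whole pipeline read top to bottom:

import pandas as pd

def remove_duplicates(df):
    return df.drop_duplicates()

def fill_missing(df):
    # Illustrative rule: fill missing sales with the column median.
    return df.fillna({"sales": df["sales"].median()})

cleaned = (
    pd.read_csv("sample_data.csv")
      .pipe(remove_duplicates)
      .pipe(fill_missing)
)

Each step stays small and testable, and the chain itself documents the order of operations.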


Handling Missing Values

Missing values are one of the most common problems.

df[col] = df[col].fillna(df[col].median())

Missing values in numeric columns are filled with the median, while missing entries in text fields are replaced with "Unknown". This keeps the dataset usable without dropping rows.
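
A minimal sketch of that rule, branching on dtype per column (the full pipeline at the end of this post does the same; the sample values here are made up):

import pandas as pd

df = pd.DataFrame({"sales": [100.0, None, 250.0],
                   "region": ["North", None, "South"]})

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())  # numeric -> median (175.0 here)
    else:
        df[col] = df[col].fillna("Unknown")         # text -> "Unknown"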


Removing Duplicates

Duplicate rows can distort metrics like totals and averages.

df = df.drop_duplicates()

This ensures each record is unique and prevents double-counting.
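
By default, drop_duplicates() compares entire rows. Its subset and keep parameters let you define duplicates more narrowly; a short sketch with illustrative column names:

import pandas as pd

df = pd.DataFrame({"order_id": [1, 1, 2],
                   "sales": [100, 100, 250]})

# Full-row duplicates: every column must match.
df = df.drop_duplicates()

# Key-based duplicates: keep only the first row per order_id.
df = df.drop_duplicates(subset=["order_id"], keep="first")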


Fixing Data Types

Data types directly affect how you process data.

df["order_date"] = pd.to_datetime(df["order_date"])

Dates, numbers, and text must be correctly formatted to enable filtering, aggregation, and time-based analysis.
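
One detail worth knowing: pd.to_datetime and pd.to_numeric both accept errors="coerce", which turns unparseable entries into NaT/NaN instead of raising, so a single bad value cannot break the whole conversion. A quick sketch:

import pandas as pd

dates = pd.to_datetime(pd.Series(["2024-01-15", "not a date"]), errors="coerce")
# "not a date" becomes NaT instead of raising a ValueError

nums = pd.to_numeric(pd.Series(["10", "oops", "30"]), errors="coerce")
# "oops" becomes NaN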


Why This Matters

Without cleaning:

  • analysis becomes unreliable
  • dashboards show incorrect trends
  • models learn from noisy data

With a proper pipeline:

  • data becomes trustworthy
  • workflows become reusable
  • insights become meaningful

Key Takeaways

  • Cleaning is a mandatory step in every workflow
  • Missing values, duplicates, and types must be handled systematically
  • Pipelines make data processing scalable
  • Clean data leads to accurate insights

Complete Code Snippet:

import pandas as pd


def load_data(file_path):
    df = pd.read_csv(file_path)
    print("✅ Data Loaded\n")
    return df


def inspect_data(df):
    print("📌 Dataset Shape:", df.shape)
    print("\n📌 Column Names:", df.columns.tolist())
    print("\n📌 Data Types:\n", df.dtypes)
    print("\n📌 Missing Values:\n", df.isnull().sum(), "\n")


def standardize_columns(df):
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
    print("✅ Column names standardized\n")
    return df


def remove_duplicates(df):
    before = len(df)
    df = df.drop_duplicates()
    after = len(df)
    print(f"🧹 Removed {before - after} duplicate row(s)\n")
    return df


def handle_missing_values(df):
    for col in df.columns:
        # Use a dtype check that covers all numeric types (int32, float32, ...).
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        elif pd.api.types.is_datetime64_any_dtype(df[col]):
            continue  # leave missing dates as NaT rather than a fake string
        else:
            df[col] = df[col].fillna("Unknown")
    print("✅ Missing values handled\n")
    return df


def fix_data_types(df):
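    # Coerce bad values to NaT/NaN rather than raising.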
    if "order_date" in df.columns:
        df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

    for col in ["sales", "quantity"]:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")

    print("✅ Data types standardized\n")
    return df


def verify_data(df):
    print("📌 Data Types After Cleaning:\n", df.dtypes)
    print("\n📌 Missing Values After Cleaning:\n", df.isnull().sum())
    print("\n📊 Sample Data:\n", df.head(), "\n")


def save_data(df, output_path):
    df.to_csv(output_path, index=False)
    print(f"💾 Cleaned data saved to {output_path}\n")


def main():
    file_path = "sample_data.csv"
    output_path = "cleaned_sample_data.csv"

    df = load_data(file_path)
    inspect_data(df)
    df = standardize_columns(df)
    df = remove_duplicates(df)
    # Fix types first so values coerced to NaN/NaT are then handled by the fill step.
    df = fix_data_types(df)
    df = handle_missing_values(df)
    verify_data(df)
    save_data(df, output_path)


if __name__ == "__main__":
    main()
