📊 Python Data Workflows – 🧹 Data Cleaning Pipeline 🐍
Posted on: April 24, 2026
Description:
Real-world data is rarely clean.
Before any analysis or modeling, datasets usually contain missing values, duplicate records, and inconsistent data types. If these issues are not handled properly, they can lead to incorrect insights and unreliable results.
That’s why building a data cleaning pipeline is essential.
Why a Cleaning Pipeline?
Instead of cleaning data ad hoc, a pipeline ensures:
- consistency
- repeatability
- reliability
It transforms raw data into a structured format ready for analysis.
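In code, a pipeline is just a fixed sequence of steps applied in order. As a minimal sketch using pandas' DataFrame.pipe, and assuming the step functions defined in the full snippet at the end of this post, it might look like:

import pandas as pd

df = pd.read_csv("sample_data.csv")
df = (
    df.pipe(standardize_columns)    # normalize column names
      .pipe(remove_duplicates)      # drop exact duplicate rows
      .pipe(handle_missing_values)  # fill NaNs by column type
      .pipe(fix_data_types)         # coerce dates and numerics
)

Because each step takes a DataFrame and returns one, the same chain can be rerun unchanged on every new data drop.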
Handling Missing Values
Missing values are one of the most common problems.
df[col] = df[col].fillna(df[col].median())
Numeric columns are filled with median values, while text fields are replaced with "Unknown". This keeps the dataset usable without losing rows.
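Sketching that rule out per column, assuming df is an already-loaded pandas DataFrame:

import pandas as pd

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        # the median is robust to outliers, unlike the mean
        df[col] = df[col].fillna(df[col].median())
    else:
        # non-numeric columns get an explicit placeholder
        df[col] = df[col].fillna("Unknown")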
Removing Duplicates
Duplicate rows can distort metrics like totals and averages.
df = df.drop_duplicates()
This ensures each record is unique and prevents double-counting.
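By default, drop_duplicates compares entire rows. If only certain columns define a record's identity, say an order_id column (an assumption about your schema), you can deduplicate on that subset and keep the first occurrence:

df = df.drop_duplicates(subset=["order_id"], keep="first")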
Fixing Data Types
Data types directly affect how you process data.
df["order_date"] = pd.to_datetime(df["order_date"])
Dates, numbers, and text must be correctly formatted to enable filtering, aggregation, and time-based analysis.
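When source files contain malformed values, a defensive sketch is to coerce rather than raise: errors="coerce" turns unparseable entries into NaT/NaN so they can be handled like any other missing value. The column names here match the snippet below:

import pandas as pd

df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["sales"] = pd.to_numeric(df["sales"], errors="coerce")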
Why This Matters
Without cleaning:
- analysis becomes unreliable
- dashboards show incorrect trends
- models learn from noisy data
With a proper pipeline:
- data becomes trustworthy
- workflows become reusable
- insights become meaningful
Key Takeaways
- Cleaning is a mandatory step in every workflow
- Missing values, duplicates, and types must be handled systematically
- Pipelines make data processing scalable
- Clean data leads to accurate insights
Code Snippet:
import pandas as pd


def load_data(file_path):
    """Load a CSV file into a DataFrame."""
    df = pd.read_csv(file_path)
    print("✅ Data Loaded\n")
    return df


def inspect_data(df):
    """Print shape, columns, dtypes, and missing-value counts."""
    print("📌 Dataset Shape:", df.shape)
    print("\n📌 Column Names:", df.columns.tolist())
    print("\n📌 Data Types:\n", df.dtypes)
    print("\n📌 Missing Values:\n", df.isnull().sum(), "\n")


def standardize_columns(df):
    """Normalize column names to lowercase snake_case."""
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
    print("✅ Column names standardized\n")
    return df


def remove_duplicates(df):
    """Drop exact duplicate rows and report how many were removed."""
    before = len(df)
    df = df.drop_duplicates()
    after = len(df)
    print(f"🧹 Removed {before - after} duplicate row(s)\n")
    return df


def handle_missing_values(df):
    """Fill numeric NaNs with the column median, text NaNs with 'Unknown'."""
    for col in df.columns:
        if df[col].dtype in ["int64", "float64"]:
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna("Unknown")
    print("✅ Missing values handled\n")
    return df


def fix_data_types(df):
    """Coerce known date and numeric columns; unparseable values become NaT/NaN."""
    if "order_date" in df.columns:
        df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    for col in ["sales", "quantity"]:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")
    print("✅ Data types standardized\n")
    return df


def verify_data(df):
    """Re-check dtypes and missing values after cleaning."""
    print("📌 Data Types After Cleaning:\n", df.dtypes)
    print("\n📌 Missing Values After Cleaning:\n", df.isnull().sum())
    print("\n📊 Sample Data:\n", df.head(), "\n")


def save_data(df, output_path):
    """Write the cleaned DataFrame to CSV."""
    df.to_csv(output_path, index=False)
    print(f"💾 Cleaned data saved to {output_path}\n")


def main():
    file_path = "sample_data.csv"
    output_path = "cleaned_sample_data.csv"

    df = load_data(file_path)
    inspect_data(df)
    df = standardize_columns(df)
    df = remove_duplicates(df)
    df = handle_missing_values(df)
    df = fix_data_types(df)
    verify_data(df)
    save_data(df, output_path)


if __name__ == "__main__":
    main()