AW Dev Rethought

Truth can only be found in one place: the code - Robert C. Martin

📊 Python Data Workflows – 🌟 Feature Engineering Basics 🐍


Description:

After cleaning a dataset, the next step is often feature engineering.

Feature engineering means creating new columns from existing data so the dataset becomes more useful for analysis. Sometimes the raw columns are not enough. You may need derived values, categories, date parts, or transformed fields to understand the data better.


Why Feature Engineering Matters

Raw data usually tells only part of the story.

For example, a dataset may contain sales, quantity, and order_date. These columns are useful, but we can make them more powerful by creating new features like:

  • revenue per item
  • order month
  • order weekday
  • sales category

These new columns make analysis easier and more meaningful.


Creating New Numeric Features

One simple transformation is creating a new metric from existing columns.

df["revenue_per_item"] = df["sales"] / df["quantity"]

Instead of only looking at total sales, this gives the average sales value per item in each order.
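
One caveat with this division is that a quantity of 0 produces inf rather than a usable number. The sketch below shows one possible way to guard against that. The sample rows are made up for illustration, and treating a zero quantity as missing is an assumption that may not suit every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the second order has a quantity of zero.
df = pd.DataFrame({"sales": [1200.0, 300.0, 450.0], "quantity": [4, 0, 3]})

# Treat a zero quantity as missing so the division yields NaN instead of inf.
safe_quantity = df["quantity"].replace(0, np.nan)
df["revenue_per_item"] = df["sales"] / safe_quantity

print(df)
```

Rows with a missing revenue_per_item can then be filtered out or inspected separately before any averages are calculated.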


Extracting Date Features

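Dates are very useful, but the column must hold real datetime values before the .dt accessor can extract parts from it. If order_date was loaded as text, convert it first with pd.to_datetime.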

df["order_month"] = df["order_date"].dt.month
df["order_weekday"] = df["order_date"].dt.day_name()

With these features, we can analyse monthly trends or weekday patterns.
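
As a rough illustration of the whole step, the sketch below converts a text column to datetimes, extracts the two features, and sums sales per month. The sample values are invented, and using errors="coerce" to turn unparseable dates into missing values is one option among several.

```python
import pandas as pd

# Hypothetical sample data; the last date is deliberately invalid.
df = pd.DataFrame({
    "order_date": ["2024-01-15", "2024-01-20", "2024-02-03", "not a date"],
    "sales": [500.0, 1500.0, 2500.0, 100.0],
})

# Unparseable strings become NaT instead of raising an error.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

df["order_month"] = df["order_date"].dt.month
df["order_weekday"] = df["order_date"].dt.day_name()

# Total sales per month; rows without a valid date are left out by groupby.
monthly_sales = df.groupby("order_month")["sales"].sum()
print(monthly_sales)
```

The row with the invalid date gets no month, so it does not contribute to either monthly total.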


Creating Business Labels

Feature engineering is not only about numbers. Sometimes it is useful to convert values into readable labels.

df["sales_category"] = df["sales"].apply(sales_category)

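Here sales_category is a small helper function, defined in the code snippet at the end of this post, that returns a label based on the sales value. This groups records into simple buckets like High, Medium, and Low.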
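
For larger datasets, the same bucketing can also be expressed without a Python function. The sketch below uses pd.cut as an alternative to apply, not as the method this post relies on. It reuses the 1000 and 2000 thresholds from the sales_category function in the full snippet; the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sales values, including the bucket edges and a missing value.
df = pd.DataFrame({"sales": [250.0, 1000.0, 1999.0, 2000.0, np.nan]})

# right=False makes each bin include its lower edge:
# below 1000 is Low, 1000 up to (not including) 2000 is Medium, 2000 and above is High.
df["sales_category"] = pd.cut(
    df["sales"],
    bins=[-np.inf, 1000, 2000, np.inf],
    labels=["Low", "Medium", "High"],
    right=False,
)

print(df)
```

One difference from the function-based version is that missing sales values stay missing here instead of receiving a label.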


Encoding Categories

Some workflows need text values converted into numbers.

df["category_code"] = df["category"].astype("category").cat.codes

This is useful when a tool or machine learning model needs numeric input instead of text. The codes are identifiers, not a ranking, so a larger code does not mean a larger value.
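
To make the encoding less of a black box, the sketch below shows what the codes look like on a few invented category names and how to recover the label behind each code. The category values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical category values, including one missing entry.
df = pd.DataFrame(
    {"category": ["Office", "Furniture", None, "Office", "Technology"]}
)

as_category = df["category"].astype("category")

# Codes follow the sorted order of the labels; missing values get -1.
df["category_code"] = as_category.cat.codes

# Lookup table from code back to the original label.
code_to_label = dict(enumerate(as_category.cat.categories))

print(df)
print(code_to_label)
```

Keeping the lookup table alongside the encoded column makes it possible to translate the codes back into readable labels later.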


Key Takeaways

  • Feature engineering adds more meaning to raw data
  • New columns can be created from numbers, dates, and text
  • Date features help with time-based analysis
  • Encoded categories are useful for analysis and ML workflows

Code Snippet:

import pandas as pd

df = pd.read_csv("sample_data.csv")

print("✅ Data Loaded")
print(df.head())


# Convert columns to proper types; values that cannot be parsed become NaT/NaN
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["sales"] = pd.to_numeric(df["sales"], errors="coerce")
df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")

print("✅ Data types fixed")


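# Treat zero quantities as missing so the division gives NaN instead of inf
df["revenue_per_item"] = df["sales"] / df["quantity"].replace(0, float("nan"))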

print(df[["sales", "quantity", "revenue_per_item"]].head())


df["order_year"] = df["order_date"].dt.year
df["order_month"] = df["order_date"].dt.month
df["order_day"] = df["order_date"].dt.day
df["order_weekday"] = df["order_date"].dt.day_name()

print(df[["order_date", "order_year", "order_month", "order_weekday"]].head())


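def sales_category(sales):
    # Missing sales (NaN after coercion) stay missing instead of being labelled "Low"
    if pd.isna(sales):
        return None
    if sales >= 2000:
        return "High"
    elif sales >= 1000:
        return "Medium"
    else:
        return "Low"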


df["sales_category"] = df["sales"].apply(sales_category)

print(df[["sales", "sales_category"]].head())


# Integer code per category label; missing categories are encoded as -1
df["category_code"] = df["category"].astype("category").cat.codes

print(df[["category", "category_code"]].head())


df.to_csv("feature_engineered_data.csv", index=False)

print("💾 Feature engineered dataset saved")
