AW Dev Rethought

"Code is read far more often than it is written." – Guido van Rossum

⚡️ Saturday ML Sparks – 💬 Sentiment Analysis (ML Approach)


Description:

Sentiment analysis is one of the most common real-world applications of Natural Language Processing (NLP).

From product reviews to customer feedback and social media posts, understanding sentiment helps businesses gauge user opinion at scale.

In this project, we build a sentiment analysis system using a classical machine learning approach — without deep learning or transformers — relying instead on TF-IDF vectorization and Logistic Regression.


Understanding the Problem

Text data is inherently unstructured, which makes it difficult for machine learning models to interpret directly.

To classify sentiment, we must first convert text into a numerical representation that captures meaning and importance.

The challenge lies in:

  • representing text numerically
  • identifying sentiment-driving words
  • building a model that generalizes to unseen text

This makes sentiment analysis a perfect example of applied machine learning for NLP.


1. Preparing a Text Dataset

We start with a small labeled dataset of text samples and their sentiments.

import pandas as pd

data = {
    "text": [
        "I absolutely loved this product",
        "Worst experience ever",
        "Very happy with the service",
        "The product quality is terrible",
        "Amazing experience and great support",
        "I will never buy this again"
    ],
    "sentiment": ["positive", "negative", "positive", "negative", "positive", "negative"]
}

df = pd.DataFrame(data)

Each text sample is associated with a sentiment label, making this a supervised learning task.


2. Train/Test Split

We split the dataset to evaluate how well the model performs on unseen text.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df["text"],
    df["sentiment"],
    test_size=0.3,
    stratify=df["sentiment"],
    random_state=42
)

Stratification keeps the positive/negative class ratio the same in the training and test sets. With only six samples, an unstratified random split could easily leave one class out of the test set entirely.
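We can verify the effect of stratify directly. This self-contained sketch uses stand-in one-word texts rather than the full sentences above, but the split behaves identically:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "text": ["good", "bad", "great", "awful", "nice", "poor"],
    "sentiment": ["positive", "negative", "positive", "negative", "positive", "negative"],
})

X_train, X_test, y_train, y_test = train_test_split(
    df["text"],
    df["sentiment"],
    test_size=0.3,
    stratify=df["sentiment"],
    random_state=42,
)

# test_size=0.3 of 6 samples rounds up to 2 test samples;
# stratification guarantees one from each class
print(y_test.value_counts().to_dict())
print(y_train.value_counts().to_dict())
```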


3. Converting Text to TF-IDF Features

Machine learning models require numeric input.

We use TF-IDF (Term Frequency–Inverse Document Frequency) to transform text into numerical vectors.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english")
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

TF-IDF assigns higher weight to words that are frequent within a document but rare across the corpus, so common filler words contribute little while distinctive, sentiment-bearing words stand out.


4. Training a Sentiment Classifier

We use Logistic Regression, a widely used baseline model for text classification.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, y_train)

Despite its simplicity, Logistic Regression performs remarkably well for many NLP classification tasks.


5. Evaluating Model Performance

We evaluate the classifier using standard metrics.

from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test_tfidf)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

These metrics show how well the model distinguishes positive from negative sentiment. Keep in mind that our toy split leaves only two test samples, so the numbers here are illustrative; meaningful evaluation requires a much larger corpus.
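Reading the confusion matrix correctly matters: scikit-learn orders its rows (true labels) and columns (predicted labels) by sorted label order, so for our string labels "negative" comes first. A minimal sketch with hand-made predictions:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = ["positive", "negative", "positive", "negative"]
y_pred = ["positive", "negative", "negative", "negative"]

# Rows are true labels, columns are predictions, in sorted label order:
# [[negative predicted negative, negative predicted positive],
#  [positive predicted negative, positive predicted positive]]
cm = confusion_matrix(y_true, y_pred)
print(cm)
print(accuracy_score(y_true, y_pred))
```

Here one positive sample is misclassified as negative, so the off-diagonal cell in the bottom row is 1 and accuracy is 0.75.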


6. Testing on New Text

Finally, we test the model on unseen sentences.

new_text = [
    "I really enjoyed using this",
    "This is a complete waste of money"
]

new_tfidf = vectorizer.transform(new_text)
predictions = model.predict(new_tfidf)

for text, sentiment in zip(new_text, predictions):
    print(f"{text} → {sentiment}")

This demonstrates how the trained model generalizes beyond the training data.
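Beyond hard labels, Logistic Regression can also report how confident it is via predict_proba, which is often more useful in production (for example, to route low-confidence texts to human review). A self-contained sketch with toy training data; the probability columns follow model.classes_, i.e. alphabetical order:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["loved it", "great product", "hated it", "terrible product"]
labels = ["positive", "positive", "negative", "negative"]

vectorizer = TfidfVectorizer()
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.fit_transform(texts), labels)

# Columns of predict_proba follow model.classes_: ["negative", "positive"]
proba = model.predict_proba(vectorizer.transform(["loved this great product"]))
print(dict(zip(model.classes_, proba[0])))
```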


Key Takeaways

  1. Sentiment analysis is a core real-world NLP application.
  2. Text must be converted into numerical features before modeling.
  3. TF-IDF effectively captures word importance in documents.
  4. Logistic Regression is a strong baseline for sentiment classification.
  5. Classical ML approaches remain relevant and widely used in production NLP systems.
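As a closing practical note, the vectorize-then-classify workflow above is commonly bundled into a single scikit-learn Pipeline, so both steps are fit together and raw text goes in at predict time. A minimal sketch with toy data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = ["loved it", "great service", "hated it", "terrible service"]
labels = ["positive", "positive", "negative", "negative"]

# One object handles vectorization and classification end to end
clf = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(texts, labels)

print(clf.predict(["great experience", "hated everything"]))
```

Besides being tidier, a Pipeline prevents a subtle bug: the vectorizer can only ever be fit on training data, so no test-set vocabulary leaks into training.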

Conclusion

Sentiment analysis does not always require complex deep learning models.

By combining TF-IDF vectorization with Logistic Regression, we can build an efficient and interpretable sentiment classifier suitable for many real-world use cases.

This Saturday ML Spark highlights how traditional machine learning techniques continue to play a vital role in NLP — forming a strong foundation before moving on to more advanced models and architectures.


Code Snippet:

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix


data = {
    "text": [
        "I absolutely loved this product",
        "Worst experience ever",
        "Very happy with the service",
        "The product quality is terrible",
        "Amazing experience and great support",
        "I will never buy this again"
    ],
    "sentiment": ["positive", "negative", "positive", "negative", "positive", "negative"]
}

df = pd.DataFrame(data)


X_train, X_test, y_train, y_test = train_test_split(
    df["text"],
    df["sentiment"],
    test_size=0.3,
    random_state=42,
    stratify=df["sentiment"]
)


vectorizer = TfidfVectorizer(stop_words="english")
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)


model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, y_train)


y_pred = model.predict(X_test_tfidf)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))


new_text = [
    "I really enjoyed using this",
    "This is a complete waste of money"
]

new_tfidf = vectorizer.transform(new_text)
predictions = model.predict(new_tfidf)

for text, sentiment in zip(new_text, predictions):
    print(f"{text} → {sentiment}")
