⚡️ Saturday ML Sparks – 🎬 Movie Recommendation with Cosine Similarity
Posted on: January 17, 2026
Description:
Movie recommendation systems are everywhere — from streaming platforms to e-commerce sites.
At their core, many recommendation engines start with a simple idea: recommend items that are similar to what a user already likes.
In this Saturday ML Spark, we build a content-based movie recommendation system using cosine similarity, one of the most widely used similarity measures in machine learning.
Understanding the Problem
Unlike collaborative filtering, which relies on user behavior, content-based recommendation focuses on item attributes such as:
- genres
- tags
- descriptions
- keywords
The challenge is to represent these attributes numerically and then measure how similar two items are.
Cosine similarity helps answer a simple question:
How similar are two movies based on their content?
1. Preparing a Movie Dataset
We start with a small dataset of movies and their descriptions.
import pandas as pd
movies = {
"title": [
"Inception",
"The Dark Knight",
"Interstellar",
"The Matrix",
"Avengers: Endgame"
],
"description": [
"dream manipulation sci-fi thriller",
"dark gritty superhero action crime",
"space exploration time sci-fi drama",
"virtual reality dystopian sci-fi action",
"superheroes action time travel battle"
]
}
df = pd.DataFrame(movies)
In real-world systems, these descriptions could come from genres, plot summaries, or user-generated tags.
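In practice you rarely start with a clean description column, so a common first step is to concatenate whatever metadata fields are available into one text field per movie. The sketch below assumes hypothetical "genres" and "tags" columns purely for illustration:
# Hypothetical example: building a single "description" field from separate metadata columns
raw = pd.DataFrame({
    "title": ["Inception", "Interstellar"],
    "genres": ["sci-fi thriller", "sci-fi drama"],
    "tags": ["dream manipulation", "space exploration time"]
})
# Concatenate the metadata columns into one free-text description per movie
raw["description"] = raw["genres"] + " " + raw["tags"]
print(raw[["title", "description"]])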
2. Converting Text to Numerical Features
Machine learning models cannot work directly with raw text.
We convert movie descriptions into numerical vectors using TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(df["description"])
TF-IDF assigns higher weights to words that are informative and less common across the dataset.
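To see what the vectorizer actually learned, you can inspect its vocabulary and the shape of the resulting matrix (standard scikit-learn attributes; note that the default tokenizer splits "sci-fi" into the separate tokens "sci" and "fi"):
# Inspect the vocabulary learned from the five descriptions
print(vectorizer.get_feature_names_out())
# Shape: (number of movies, number of distinct terms)
print(tfidf_matrix.shape)
In older scikit-learn releases the same method is called get_feature_names().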
3. Measuring Similarity with Cosine Similarity
Once movies are represented as vectors, we compute similarity between them.
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
Cosine similarity measures the cosine of the angle between two vectors, so the score depends only on the direction of the vectors, not their magnitude, as the quick check after this list confirms:
- 1.0 → the vectors point in the same direction (essentially identical content profiles)
- 0.0 → the vectors are orthogonal (no terms in common)
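As a sanity check, a single entry of this matrix can be recomputed by hand from the TF-IDF vectors with NumPy; the result should match the value returned by scikit-learn (a minimal sketch, using the first and third movies):
import numpy as np
# Recompute the similarity between "Inception" (row 0) and "Interstellar" (row 2)
a = tfidf_matrix[0].toarray().ravel()
b = tfidf_matrix[2].toarray().ravel()
manual = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(manual, cosine_sim[0, 2])  # the two values should agree up to floating-point rounding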
4. Building the Recommendation Logic
We now create a function that returns movies most similar to a given title.
def recommend(movie_title, top_n=3):
    # Locate the row index of the requested movie
    idx = df.index[df["title"] == movie_title][0]
    # Pair every movie index with its similarity to the requested movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort from most to least similar
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry (the movie itself) and return the next top_n titles
    return [
        df["title"][i]
        for i, score in sim_scores[1 : top_n + 1]
    ]
This function:
- finds the selected movie
- ranks all other movies by similarity
- returns the top recommendations
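A small refinement worth sketching (not part of the function above) is to return the similarity scores alongside the titles and to handle titles that are not in the dataset:
def recommend_with_scores(movie_title, top_n=3):
    # Guard against titles that are not present in the dataset
    matches = df.index[df["title"] == movie_title]
    if len(matches) == 0:
        return []
    idx = matches[0]
    # Rank all movies by similarity to the query movie
    sim_scores = sorted(enumerate(cosine_sim[idx]), key=lambda x: x[1], reverse=True)
    # Skip the movie itself and keep the top_n (title, score) pairs
    return [(df["title"][i], round(score, 3)) for i, score in sim_scores[1 : top_n + 1]]

print(recommend_with_scores("Interstellar", top_n=2))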
5. Generating Movie Recommendations
We can now test the recommender system.
print(recommend("Inception"))
With this tiny dataset, the closest matches are the other sci-fi titles (Interstellar and The Matrix), which share the most keywords with Inception; the remaining movies have little or no term overlap with it.
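To spot-check behavior across the whole catalogue, you can loop over every title (a simple usage sketch):
# Print the top-2 recommendations for every movie in the dataset
for title in df["title"]:
    print(f"{title} -> {recommend(title, top_n=2)}")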
Why Cosine Similarity Works Well
Cosine similarity is especially effective for recommendation systems because:
- it ignores document length
- it focuses on relative word importance
- it scales efficiently to large datasets
- it works well with sparse text vectors
This makes it a natural choice for content-based recommenders; the short check below illustrates the length-invariance point.
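A minimal illustration with made-up vectors (not taken from the movie data): scaling one vector changes its length but leaves the cosine similarity unchanged, because only the angle matters.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

u = np.array([[1.0, 2.0, 0.0]])
v = np.array([[2.0, 1.0, 1.0]])
# Scaling u by 10 changes its length but not its direction, so the score is identical
print(cosine_similarity(u, v))
print(cosine_similarity(10 * u, v))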
Key Takeaways
- Content-based recommendation relies on item similarity, not user behavior.
- TF-IDF converts text metadata into meaningful numeric vectors.
- Cosine similarity measures how close items are in feature space.
- Simple recommendation systems can be built with minimal data.
- This approach forms the foundation of many real-world recommender engines.
Conclusion
Movie recommendation systems don’t always need complex deep learning models to be effective.
By combining TF-IDF vectorization with cosine similarity, we can build a practical and interpretable recommender system that captures meaningful relationships between items.
This Saturday ML Spark demonstrates how a simple mathematical concept can power real-world applications — making it a great starting point for understanding recommendation systems before moving on to more advanced approaches.
Code Snippet:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
movies = {
"title": [
"Inception",
"The Dark Knight",
"Interstellar",
"The Matrix",
"Avengers: Endgame"
],
"description": [
"dream manipulation sci-fi thriller",
"dark gritty superhero action crime",
"space exploration time sci-fi drama",
"virtual reality dystopian sci-fi action",
"superheroes action time travel battle"
]
}
df = pd.DataFrame(movies)
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(df["description"])
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
def recommend(movie_title, top_n=3):
    idx = df.index[df["title"] == movie_title][0]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    recommendations = [
        df["title"][i]
        for i, score in sim_scores[1 : top_n + 1]
    ]
    return recommendations
print("Recommended movies for Inception:")
print(recommend("Inception"))