MACHINE LEARNING CASE STUDY

Decoding Netflix Content DNA

An immersive journey through 7,787 titles using Natural Language Processing, Clustering Algorithms, and Data Visualization to uncover hidden patterns.

01

The Challenge

Why content discovery on Netflix needs innovation

🎯

The Core Problem

With 7,787 titles on Netflix, users face choice paralysis. Traditional recommendation systems rely on viewing history, but what about new users, or viewers exploring unfamiliar genres?

5,000

TF-IDF features extracted from text descriptions, cast, director & genres

42.8%

variance captured with 500 PCA components (95% needs 3,111 — text is high-dimensional)

💡 Our Solution: Content-Based Clustering

1
Analyze

NLP on descriptions & metadata

2
Transform

TF-IDF vectorization

3
Cluster

K-Means grouping

4
Recommend

Similarity-based suggestions
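The four-step workflow above can be sketched end to end with scikit-learn. This is a minimal illustration on a toy catalog, not the project's actual code; the titles and descriptions are invented, and PCA is omitted because the toy corpus is tiny:

```python
# Minimal content-based clustering sketch (toy data, illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

titles = {
    "Stranger Things": "teens uncover supernatural mysteries in a small town",
    "Dark": "time travel secrets entangle families in a german town",
    "The Crown": "drama chronicling the reign of the british royal family",
    "Narcos": "crime drama about drug cartels and law enforcement",
}

# Steps 1-2 (Analyze + Transform): text -> TF-IDF vectors
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(titles.values())

# Step 3 (Cluster): group titles by content similarity (k=2 for toy data)
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Step 4 (Recommend): titles sharing a cluster are candidate recommendations
labels = dict(zip(titles, km.labels_))
```

On the real data the same pipeline runs over 7,787 titles with 5,000 TF-IDF features, a PCA step, and K=12.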

02

Data Exploration

Understanding the Netflix content landscape

Missing Values Analysis
Director: 30% missing

Data Quality Assessment

Identifying missing values in the dataset to ensure robust analysis. The director and cast fields have the highest missing rates.

Key Finding: 2,389 titles missing director information
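Missing-value rates like the 30% figure for director reduce to a one-liner in pandas. The frame below is a toy stand-in (the real dataset has 7,787 rows and more columns):

```python
import pandas as pd

# Toy stand-in for the Netflix titles dataframe.
df = pd.DataFrame({
    "title":    ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],   # ~30% missing in the real data
    "cast":     ["p", "q", None, "r"],
})

# Percentage of missing values per column, worst first
missing_pct = df.isna().mean().sort_values(ascending=False) * 100
```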
Content Type Distribution
69% Movies


Netflix's catalog is dominated by Movies (69%) compared to TV Shows (31%). This ratio influences content strategy and user preferences.

Key Finding: 5,377 Movies vs 2,410 TV Shows
Rating Distribution
TV-MA leads

Content Ratings Analysis

TV-MA (Mature Audience) is the most common rating, followed by TV-14. This indicates Netflix's strong focus on adult-oriented content.

Key Finding: 2,027 titles rated TV-MA
Release Year Trend
Peak: 2019-2020

Content Growth Over Time

Massive content expansion from 2015 to 2020, with the library growing exponentially as Netflix invested heavily in original content.

Key Finding: 10x growth in content from 2015-2020
Top Countries

Global Content Origins

While the United States leads in content production, India, the UK, and other countries contribute significantly to Netflix's global library.

Key Finding: 35% of content is international
Top Genres

Genre Distribution

International Movies and Dramas dominate the platform, reflecting Netflix's strategy of acquiring diverse global content.

Key Finding: Drama appears in 50%+ of all titles
Audience Distribution
~50% Adults

Target Audience Distribution

Around 50% of content is produced for adult audiences, followed by young adults, older kids, and kids. Interestingly, Netflix has the least content for teenagers compared to other age groups.

Key Finding: Teens are the most underserved demographic on Netflix
03

The Analysis Pipeline

From raw text to intelligent clusters

1

Text Processing

Combine description, cast, director & genres

2

NLP Pipeline

Tokenization, Lemmatization, Stop words removal

3

TF-IDF Vectorization

5,000 features extracted from text

4

PCA Reduction

Reduced to 500 principal components

5

K-Means Clustering

12 optimal content clusters identified

TF-IDF Feature Importance


🤔 Why TF-IDF?

Problem: Raw text (descriptions, genres) cannot be used directly for clustering - machines need numbers.

Solution: TF-IDF converts text to numerical vectors by measuring how important each word is to a document relative to the entire corpus.

Why not just word count? Common words like "the", "movie", "story" would dominate. TF-IDF down-weights frequent terms and highlights unique distinguishing words.

PCA Variance Explained


🤔 Why PCA (Dimensionality Reduction)?

Problem: TF-IDF creates 5,000+ features. High-dimensional data causes the "curse of dimensionality" - distances become meaningless, clustering fails.

Solution: PCA reduces dimensions while preserving key variance. We go from 5,000 → 500 features (42.8% variance — text data is inherently high-dimensional).

Key Insight: The cumulative variance plot shows a knee point at ~500 components; beyond this, returns diminish.
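The component counts quoted above come from the cumulative explained-variance curve, which can be computed like this. The sketch uses random data in place of the real 7,787 × 5,000 TF-IDF matrix (and note that on a sparse TF-IDF matrix one would typically use `TruncatedSVD` rather than plain PCA):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))        # stand-in for the dense feature matrix

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components reaching a variance target, analogous to
# finding that 95% variance needs 3,111 components on the real data.
n_for_95 = int(np.searchsorted(cumvar, 0.95) + 1)
```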

Optimal K Selection


🤔 Why K-Means? How did we choose K=12?

Why K-Means: It's efficient for large datasets, works well with numerical features, and creates compact spherical clusters - ideal for content similarity.

Elbow Method: We plot inertia (within-cluster variance) against K. The inertia decreased steadily without a clear elbow, so we also evaluated the Silhouette Score across K=2 to 14.

Silhouette Analysis: K=12 achieved the highest Silhouette Score (0.030), meaning items are most similar to their own cluster vs. neighboring clusters at this K.
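The selection procedure, a silhouette sweep over candidate K, can be sketched as follows. Toy blobs stand in for the real PCA features, and the sweep range is shortened for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data with known structure stands in for the 500 PCA components.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

scores = {}
for k in range(2, 8):                      # the project swept K = 2..14
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)       # K=12 won on the real data
```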

Hierarchical Clustering (Validation)


🤔 Why Hierarchical Clustering too?

Purpose: Validate our K-Means results using a completely different algorithm (Agglomerative Clustering).

Dendrogram Insight: The tree structure shows natural groupings. We used Agglomerative Clustering with Ward linkage as a validation method against K-Means.

Best Practice: Always validate clustering results with multiple methods to ensure robustness.
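Cross-algorithm validation of this kind can be done by clustering the same features with Ward-linkage agglomerative clustering and measuring label agreement; the Adjusted Rand Index used below is one common agreement measure (an assumption for this sketch, not necessarily the project's exact comparison):

```python
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

km_labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)
ag_labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X)

# ARI near 1.0 means the two algorithms found essentially the same
# partition, supporting the robustness of the K-Means clusters.
ari = adjusted_rand_score(km_labels, ag_labels)
```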

04

Cluster Discovery

12 distinct content clusters emerged from unsupervised learning

2D Cluster Visualization
12 K-Means clusters
Silhouette: 0.030
PCA-reduced features

Cluster Sizes


🤔 Why analyze cluster sizes?

Purpose: Check if clusters are balanced. Highly imbalanced clusters might indicate poor clustering or interesting insights.

What we found: Cluster sizes range from 85 to 2,169 titles, with the largest cluster capturing general dramas and the smallest capturing niche content.

Business Insight: Netflix has diverse content across all categories - good for catering to varied user preferences.

Cluster Composition


🤔 Why analyze composition?

Purpose: Understand WHAT makes each cluster unique - which genres, countries, and content types dominate each cluster.

Validation: If clusters have distinct compositions, it confirms our clustering captured meaningful patterns, not random groupings.

Interpretability: Helps us name and describe clusters meaningfully (e.g., "International Dramas" vs just "Cluster 1").
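Composition tables like these are a `crosstab` away in pandas. The toy labels below are invented; the real analysis crosses the 12 cluster IDs against genre, country, and content-type fields:

```python
import pandas as pd

df = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1],
    "type":    ["Movie", "Movie", "TV Show", "TV Show", "TV Show", "Movie"],
})

# Row-normalized composition: share of each content type within each cluster
comp = pd.crosstab(df["cluster"], df["type"], normalize="index")
```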

Cluster Word Clouds


🤔 Why Word Clouds?

Visual Interpretation: Word clouds show the most frequent/important terms in each cluster at a glance. Larger words = more common in that cluster.

Cluster Validation: If word clouds show distinct themes (e.g., "drama", "romance" in one vs. "crime", "thriller" in another), our clustering successfully grouped similar content.

Communication Tool: Easy to explain cluster themes to non-technical stakeholders - one image tells the story.

12

Total Clusters

Optimal K selected via Silhouette Score

0.030

Silhouette Score

Cluster separation quality (higher = better)

78.78

Calinski-Harabasz

Cluster density (K-Means beat Agglomerative)

4.46

Davies-Bouldin

Cluster overlap — lower is better

2,169

Largest Cluster

Sizes range from 85 to 2,169 titles

05

Live Recommendation Demo

Experience the power of content-based recommendations

🎬

Enter a title to discover similar content

Our AI analyzes genre, cast, director, and description to find the perfect matches

06

Key Insights & Impact

What this analysis means for Netflix

🎯

K-Means > Agglomerative

K-Means outperformed Hierarchical clustering on key metrics: Silhouette (0.030), Calinski-Harabasz (78.78), Davies-Bouldin (4.46)

📊

5,000 → 500 Dimensions

PCA reduced 5,000 TF-IDF features to 500 components (42.8% variance). Text data is inherently high-dimensional — 95% would need 3,111 components

🌍

Content-Based Cold Start

Cosine similarity on TF-IDF vectors enables recommendations without user history — solving the cold-start problem for new users and newly added content
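A cold-start recommender of this kind reduces to a cosine-similarity lookup over the TF-IDF matrix. A sketch with an invented four-title catalog:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

catalog = {
    "Mindhunter":     "fbi agents interview serial killers to solve crimes",
    "The Crown":      "drama about the british royal family and its reign",
    "Criminal Minds": "fbi profilers track serial killers across the country",
    "Bridgerton":     "romantic period drama about high society families",
}

names = list(catalog)
X = TfidfVectorizer(stop_words="english").fit_transform(catalog.values())
sim = cosine_similarity(X)                # pairwise title-to-title similarity

def recommend(title, n=2):
    """Rank other titles by cosine similarity to the query's description."""
    i = names.index(title)
    order = sim[i].argsort()[::-1]
    return [names[j] for j in order if j != i][:n]
```

Because the similarity comes entirely from content metadata, it works on day one for a brand-new user or a just-added title, with no viewing history required.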

Ready to explore the full analysis?