Netflix Content Analysis | Cinematic Case Study

The Challenge

Why content discovery on Netflix needs innovation

🎯

The Core Problem

With 7,787+ titles on Netflix, users face choice paralysis. Traditional recommendation systems rely on viewing history, but what about new users, or viewers exploring unfamiliar genres?

5,000

TF-IDF features extracted from text descriptions, cast, director & genres

42.8%

variance captured with 500 PCA components (95% needs 3,111 — text is high-dimensional)

💡 Our Solution: Content-Based Clustering

Analyze (NLP on descriptions & metadata) → Transform (TF-IDF vectorization) → Cluster (K-Means grouping) → Recommend (similarity-based suggestions)

Data Exploration

Understanding the Netflix content landscape

Director: 30% missing

Data Quality Assessment

Identifying missing values in the dataset to ensure robust analysis. The director and cast fields have the highest missing rates.

Key Finding: 2,389 titles missing director information
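A missing-value check like the one described above can be sketched in a few lines. This is a minimal standard-library illustration, not the notebook's own code: the function name `missing_rates` and the toy rows are hypothetical. The last line only re-derives the roughly 30% director rate from the counts reported on this page.

```python
from typing import Optional

def missing_rates(rows: list[dict[str, Optional[str]]]) -> dict[str, float]:
    """Return, per column, the fraction of rows whose value is None or blank."""
    if not rows:
        return {}
    total = len(rows)
    rates = {}
    for col in rows[0].keys():
        missing = sum(
            1 for row in rows
            if row.get(col) is None or not str(row.get(col)).strip()
        )
        rates[col] = missing / total
    return rates

# Toy rows, made up for illustration (not taken from the dataset).
sample = [
    {"title": "A", "director": "Jane Doe", "cast": "X, Y"},
    {"title": "B", "director": None, "cast": "Z"},
    {"title": "C", "director": "", "cast": None},
    {"title": "D", "director": "John Roe", "cast": "W"},
]
rates = missing_rates(sample)

# Sanity check on the reported figures: 2,389 of 7,787 titles lack a director.
director_rate = 2389 / 7787
```

Treating blank strings as missing is an assumption; some exports encode missing values differently.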

69% Movies

Content Type Distribution

Netflix's catalog is dominated by Movies (69%) compared to TV Shows (31%). This ratio influences content strategy and user preferences.

Key Finding: 5,377 Movies vs 2,410 TV Shows
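The 69% / 31% split follows directly from the two counts above. A short check, using only the numbers reported on this page:

```python
# Counts reported in the key finding above.
counts = {"Movie": 5377, "TV Show": 2410}

total = sum(counts.values())
# Share of each content type, as a percentage rounded to one decimal place.
shares = {kind: round(100 * n / total, 1) for kind, n in counts.items()}
```

The total also matches the catalog size quoted in the problem statement.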

TV-MA leads

Content Ratings Analysis

TV-MA (Mature Audience) is the most common rating, followed by TV-14. This indicates Netflix's strong focus on adult-oriented content.

Key Finding: 2,027 titles rated TV-MA

Peak: 2019-2020

Content Growth Over Time

Content expanded rapidly from 2015 to 2020, with the library growing roughly tenfold as Netflix invested heavily in original content.

Key Finding: 10x growth in content from 2015-2020

Global Content Origins

While the United States leads in content production, India, the UK, and other countries contribute significantly to Netflix's global library.

Key Finding: 35% of content is international

Genre Distribution

International Movies and Dramas dominate the platform, reflecting Netflix's strategy of acquiring diverse global content.

Key Finding: Drama appears in 50%+ of all titles

~50% Adults

Target Audience Distribution

Around 50% of content is produced for adult audiences, followed by young adults, older kids, and kids. Interestingly, Netflix has the least content for teenagers compared to other age groups.

Key Finding: Teens are the most underserved demographic on Netflix

The Analysis Pipeline

From raw text to intelligent clusters

Text Processing

Combine description, cast, director & genres

NLP Pipeline

Tokenization, Lemmatization, Stop words removal

TF-IDF Vectorization

5,000 features extracted from text

PCA Reduction

Reduced to 500 principal components

K-Means Clustering

12 optimal content clusters identified
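The first two steps above (combining the text fields, then cleaning them) can be sketched as follows. This is a simplified standard-library illustration under stated assumptions: the field names, the tiny stop-word list, and the example record are hypothetical, and lemmatization is left out because it needs a language resource that the standard library does not provide.

```python
import re
from typing import Optional

# A tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "with", "is"}

def combine_fields(record: dict[str, Optional[str]]) -> str:
    """Join the text fields used for clustering into one string (missing -> empty)."""
    fields = ("description", "cast", "director", "genres")
    return " ".join(record.get(f) or "" for f in fields)

def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize on runs of letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

# Made-up record for illustration.
example = {
    "description": "A detective hunts a killer in the city.",
    "cast": "Jane Doe",
    "director": None,
    "genres": "Crime, Thrillers",
}
tokens = preprocess(combine_fields(example))
```

Note that "hunts" stays as-is here; a lemmatizer would reduce it to "hunt".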

TF-IDF Feature Importance

🤔 Why TF-IDF?

Problem: Raw text (descriptions, genres) cannot be used directly for clustering - machines need numbers.

Solution: TF-IDF converts text to numerical vectors by measuring how important each word is to a document relative to the entire corpus.

Why not just word count? Common words like "the", "movie", "story" would dominate. TF-IDF down-weights frequent terms and highlights unique distinguishing words.
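To make the down-weighting concrete, here is a minimal sketch of the textbook formulation, tf × log(N / df), on three made-up documents. It is not the notebook's implementation, and library vectorizers typically use smoothed and normalized variants, so exact values will differ.

```python
import math
from collections import Counter

def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Textbook TF-IDF: (term count / doc length) * log(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears at least once.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Made-up tokenized documents for illustration.
docs = [
    ["story", "detective", "crime"],
    ["story", "romance", "wedding"],
    ["story", "crime", "heist"],
]
weights = tfidf(docs)
```

"story" appears in every document, so its weight is zero, while "detective" (one document) outweighs "crime" (two documents).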

PCA Variance Explained

🤔 Why PCA (Dimensionality Reduction)?

Problem: TF-IDF creates 5,000 features. In such high-dimensional spaces the "curse of dimensionality" sets in: pairwise distances become nearly indistinguishable, which degrades distance-based clustering.

Solution: PCA reduces dimensions while preserving key variance. We go from 5,000 → 500 features (42.8% variance — text data is inherently high-dimensional).

Key Insight: The cumulative variance plot shows a knee point at ~500 components; beyond this, each additional component adds little.
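The cumulative-variance reasoning can be illustrated without running a full PCA. Assuming the per-component variances (eigenvalues) are already available, the sketch below finds how many leading components are needed to reach a target share of total variance. The function name and the eigenvalues are made up for illustration.

```python
from itertools import accumulate

def components_for_variance(eigenvalues: list[float], target: float) -> int:
    """Smallest number of leading components whose cumulative
    explained-variance ratio reaches `target` (a value between 0 and 1)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    for k, cumulative in enumerate(accumulate(ordered), start=1):
        if cumulative / total >= target:
            return k
    return len(ordered)

# Made-up spectrum: two strong components followed by a flat tail.
eigenvalues = [4.0, 2.0, 1.0, 1.0, 1.0, 1.0]
k_half = components_for_variance(eigenvalues, 0.50)
k_95 = components_for_variance(eigenvalues, 0.95)
```

A flat tail like this is why text data needs many components to reach 95%: each extra component contributes only a small, similar amount.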

Optimal K Selection

🤔 Why K-Means? How did we choose K=12?

Why K-Means: It's efficient for large datasets, works well with numerical features, and creates compact spherical clusters - ideal for content similarity.

Elbow Method: We plot inertia (the within-cluster sum of squared distances) against K. Inertia decreases steadily with no clear elbow, so we also compare the Silhouette Score across K=2 to 14.
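For readers who want to see what inertia measures, here is a compact sketch of Lloyd's algorithm, the standard K-Means procedure, in pure Python. It is an illustration on made-up 2-D points, not the notebook's code; the random initialization and the names `kmeans` and `inertia_by_k` are assumptions of this sketch.

```python
import random

Point = tuple[float, ...]

def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points: list[Point], k: int, max_iter: int = 100,
           seed: int = 0) -> tuple[list[int], float]:
    """Lloyd's algorithm: returns (labels, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as initial centroids
    labels: list[int] = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: assignments no longer change
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    inertia = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, inertia

# Two well-separated toy blobs (made up).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, inertia = kmeans(points, k=2)
inertia_by_k = {k: kmeans(points, k)[1] for k in (1, 2, 3)}
```

Plotting `inertia_by_k` against K gives the elbow curve; on this toy data the drop from K=1 to K=2 is dramatic, which is exactly the elbow that the real data lacks.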

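Silhouette Analysis: K=12 achieved the highest Silhouette Score (0.030) of the values tested. The score ranges from -1 to 1, so a value this close to 0 indicates heavily overlapping clusters: K=12 is the best separation available in this feature space, not evidence of sharply separated groups.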
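The silhouette coefficient for one point is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster. A minimal pure-Python sketch on made-up points follows; scoring singleton clusters as 0 is a common convention that this sketch assumes.

```python
import math

def silhouette_score(points: list[tuple[float, ...]], labels: list[int]) -> float:
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters: dict[int, list[int]] = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette_score(points, [0, 0, 0, 1, 1, 1])   # labels match the blobs
mixed = silhouette_score(points, [0, 1, 0, 1, 0, 1])  # labels cut across the blobs
```

Clean blobs score close to 1, while labels that cut across the natural groups score far lower. Repeating the calculation for each candidate K and keeping the highest mean is the selection rule described above.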

Hierarchical Clustering (Validation)

🤔 Why Hierarchical Clustering too?

Purpose: Validate our K-Means results using a completely different algorithm (Agglomerative Clustering).

Dendrogram Insight: The tree structure shows natural groupings. We used Agglomerative Clustering with Ward linkage as a validation method against K-Means.

Best Practice: Always validate clustering results with multiple methods to ensure robustness.
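Ward linkage merges, at each step, the pair of clusters whose union increases the within-cluster sum of squares the least; for clusters A and B that increase is |A||B| / (|A| + |B|) times the squared distance between their centroids. The naive sketch below applies this rule to made-up points. It is cubic-time and meant only to show the idea, not to reproduce the notebook's implementation.

```python
def ward_clustering(points: list[tuple[float, ...]], n_clusters: int) -> list[int]:
    """Naive agglomerative clustering with Ward's merge criterion."""
    # Each cluster is tracked as (size, centroid, member indices).
    clusters = [(1, tuple(float(x) for x in p), [i]) for i, p in enumerate(points)]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (na, ca, _), (nb, cb, _) = clusters[i], clusters[j]
                # Increase in within-cluster sum of squares if i and j merge.
                cost = na * nb / (na + nb) * sum((x - y) ** 2 for x, y in zip(ca, cb))
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        (na, ca, ma), (nb, cb, mb) = clusters[i], clusters[j]
        centroid = tuple((na * x + nb * y) / (na + nb) for x, y in zip(ca, cb))
        merged = (na + nb, centroid, ma + mb)
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    labels = [0] * len(points)
    for label, (_, _, members) in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
ward_labels = ward_clustering(points, n_clusters=2)
```

If a second algorithm recovers the same grouping on the same features, that is some evidence the structure is not an artifact of one method.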

Cluster Discovery

12 distinct content clusters emerged from unsupervised learning

12 K-Means clusters

Silhouette: 0.030

PCA-reduced features

Cluster Sizes

🤔 Why analyze cluster sizes?

Purpose: Check if clusters are balanced. Highly imbalanced clusters might indicate poor clustering or interesting insights.

What we found: Cluster sizes range from 148 to 1,433 titles, with the largest cluster capturing general dramas and the smallest capturing niche content.

Business Insight: Netflix has diverse content across all categories - good for catering to varied user preferences.
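Checking balance only needs the cluster label of each title. A small sketch, with a made-up label list and a hypothetical helper name:

```python
from collections import Counter

def cluster_size_summary(labels: list[int]) -> dict:
    """Count titles per cluster and report how unbalanced the sizes are."""
    sizes = Counter(labels)
    smallest, largest = min(sizes.values()), max(sizes.values())
    return {
        "sizes": dict(sizes),
        "smallest": smallest,
        "largest": largest,
        "imbalance_ratio": largest / smallest,  # 1.0 means perfectly balanced
    }

# Made-up labels for seven titles across three clusters.
labels = [0, 0, 0, 0, 1, 1, 2]
summary = cluster_size_summary(labels)
```

A large ratio is not automatically a problem; it is a prompt to look at what the largest and smallest clusters contain.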

Cluster Composition

🤔 Why analyze composition?

Purpose: Understand WHAT makes each cluster unique - which genres, countries, and content types dominate each cluster.

Validation: If clusters have distinct compositions, it confirms our clustering captured meaningful patterns, not random groupings.

Interpretability: Helps us name and describe clusters meaningfully (e.g., "International Dramas" vs just "Cluster 1").
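One way to inspect composition is to count the most frequent genres inside each cluster. The sketch below assumes that each title's genres arrive as a single comma-separated string, which is an assumption about the data layout; the titles' labels and genres are made up.

```python
from collections import Counter, defaultdict

def top_genres_per_cluster(labels: list[int], genres: list[str],
                           top_n: int = 2) -> dict[int, list[tuple[str, int]]]:
    """For each cluster, return its `top_n` most frequent genres with counts."""
    counts: dict[int, Counter] = defaultdict(Counter)
    for label, genre_string in zip(labels, genres):
        for genre in genre_string.split(","):
            genre = genre.strip()
            if genre:
                counts[label][genre] += 1
    return {label: counter.most_common(top_n) for label, counter in counts.items()}

# Made-up cluster labels and genre strings for five titles.
labels = [0, 0, 1, 1, 1]
genres = [
    "Dramas, International Movies",
    "Dramas",
    "Kids' TV",
    "Kids' TV, Comedies",
    "Kids' TV",
]
top = top_genres_per_cluster(labels, genres, top_n=1)
```

The dominant genre per cluster is a natural starting point for naming it; the same counting works for countries or content types.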

Cluster Word Clouds

🤔 Why Word Clouds?

Visual Interpretation: Word clouds show the most frequent/important terms in each cluster at a glance. Larger words = more common in that cluster.

Cluster Validation: If word clouds show distinct themes (e.g., "drama", "romance" in one vs. "crime", "thriller" in another), our clustering successfully grouped similar content.

Communication Tool: Easy to explain cluster themes to non-technical stakeholders - one image tells the story.

12

Total Clusters

Optimal K selected via Silhouette Score

0.030

Silhouette Score

Cluster separation quality (higher = better)

78.78

Calinski-Harabasz

Between-cluster vs. within-cluster dispersion, higher is better (K-Means beat Agglomerative)

4.46

Davies-Bouldin

Cluster overlap — lower is better

2,169

Largest Cluster

Sizes range from 85 to 2,169 titles
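The Calinski-Harabasz and Davies-Bouldin indices shown in the cards above follow standard definitions, sketched here in pure Python on made-up points. Calinski-Harabasz is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom. Davies-Bouldin averages, over clusters, the worst ratio of combined scatter to centroid distance. The helper names are hypothetical, and degenerate inputs (a single cluster, or zero within-cluster dispersion) are not handled.

```python
import math

Point = tuple[float, ...]

def _centroid(pts: list[Point]) -> Point:
    return tuple(sum(dim) / len(pts) for dim in zip(*pts))

def _group(points: list[Point], labels: list[int]) -> dict[int, list[Point]]:
    groups: dict[int, list[Point]] = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    return groups

def calinski_harabasz(points: list[Point], labels: list[int]) -> float:
    """(between-cluster dispersion / (k - 1)) / (within-cluster dispersion / (n - k))."""
    groups = _group(points, labels)
    n, k = len(points), len(groups)
    overall = _centroid(points)
    between = sum(len(g) * math.dist(_centroid(g), overall) ** 2 for g in groups.values())
    within = sum(math.dist(p, _centroid(g)) ** 2 for g in groups.values() for p in g)
    return (between / (k - 1)) / (within / (n - k))

def davies_bouldin(points: list[Point], labels: list[int]) -> float:
    """Mean over clusters of the worst (scatter_i + scatter_j) / centroid distance."""
    groups = _group(points, labels)
    centroids = {lab: _centroid(g) for lab, g in groups.items()}
    scatter = {
        lab: sum(math.dist(p, centroids[lab]) for p in g) / len(g)
        for lab, g in groups.items()
    }
    worst = [
        max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in groups if j != i)
        for i in groups
    ]
    return sum(worst) / len(worst)

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good, mixed = [0, 0, 0, 1, 1, 1], [0, 1, 0, 1, 0, 1]
ch_good, ch_mixed = calinski_harabasz(points, good), calinski_harabasz(points, mixed)
db_good, db_mixed = davies_bouldin(points, good), davies_bouldin(points, mixed)
```

On the toy data, the labeling that matches the blobs scores higher on Calinski-Harabasz and lower on Davies-Bouldin, which is the direction used to compare K-Means with Agglomerative clustering.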

Live Recommendation Demo

Experience the power of content-based recommendations

Enter a title to discover similar content

Our AI analyzes genre, cast, director, and description to find the perfect matches

Key Insights & Impact

What this analysis means for Netflix

🎯

K-Means > Agglomerative

K-Means outperformed Hierarchical clustering on key metrics: Silhouette (0.030), Calinski-Harabasz (78.78), Davies-Bouldin (4.46)

📊

5,000 → 500 Dimensions

PCA reduced 5,000 TF-IDF features to 500 components (42.8% variance). Text data is inherently high-dimensional — 95% would need 3,111 components

🌍

Content-Based Cold Start

Cosine similarity on TF-IDF vectors enables recommendations without user history — solving the cold-start problem for new users and newly added content
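A content-based lookup of this kind can be sketched with sparse TF-IDF vectors stored as dictionaries. The titles, weights, and the `recommend` helper below are made up for illustration; only the cosine-similarity formula itself is standard.

```python
import math

def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)

def recommend(title: str, vectors: dict[str, dict[str, float]],
              top_n: int = 3) -> list[tuple[str, float]]:
    """Rank every other title by cosine similarity to `title`."""
    query = vectors[title]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items() if other != title
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical titles with made-up TF-IDF weights.
vectors = {
    "Heist Night": {"crime": 0.8, "heist": 0.6},
    "Cop Story": {"crime": 0.7, "detective": 0.7},
    "Wedding Bells": {"romance": 0.9, "wedding": 0.4},
}
results = recommend("Heist Night", vectors, top_n=2)
```

No viewing history is involved: a brand-new title can be ranked as soon as its description and metadata have been vectorized.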
