I am a data scientist with a background
in astrophysics and mathematics, and I am an explorer at heart.
I am currently a Data Science Fellow at Insight Data Science.
Watching movies is one of the most popular pastimes today.
There are so many movies being released for streaming every year, and it is impossible to keep up with all of them.
When someone is looking for a new movie to watch, they will often turn to movie-related sites,
like IMDb or Rotten Tomatoes, which aggregate ratings, reviews, and other information about movies.
One of the most important factors influencing one's decision to watch a movie is the overall rating.
However, does the rating reasonably represent the entire population's opinion of the movie?
To address this question, I will use the large MovieLens dataset (http://movielens.org),
which is collected by GroupLens Research (https://grouplens.org/datasets/movielens/).
This dataset consists of more than 26 million ratings of approximately 45,000 movies,
with around 270,000 users.
Take a look at the distribution of movie release years by genre.
The next figures show how often each genre appears together with another genre on the same movie. For example,
the first figure on the left shows the number of movies in each genre that are also tagged as Action.
Let's examine any correlations between the ratings of different genres.
The correlation matrix heatmap above reveals some fundamental characteristics of
movies and user preferences that are worth explaining, namely:
Strong correlations exist between:
Action & Adventure
Action & SciFi
Animation & Children
Animation & Fantasy
Adventure & Fantasy
Children & Fantasy
Strong anti-correlations exist between:
Action & Drama
Action & Romance
Adventure & Drama
Children & Crime
Musical & Thriller
Correlations and anti-correlations with av_Year show which genres tend to be more
common in recent years (Action, Adventure, Fantasy, SciFi, and Thriller) and which genres
had their heyday in the past (Film Noir, Musical, War, and Western), respectively.
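A correlation matrix like the one discussed above can be sketched in a few lines of pandas. This is only an illustration: the post does not show how the underlying table was built, so the `user_genre` table, its four genre columns, and all of its values are made-up assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table (not MovieLens data): one row per user, one column
# per genre, holding that user's mean rating for the genre.
rng = np.random.default_rng(0)
taste = rng.normal(size=200)  # made-up latent "action vs. drama" taste per user
user_genre = pd.DataFrame({
    "Action": 3.5 + 0.8 * taste + rng.normal(scale=0.3, size=200),
    "Adventure": 3.5 + 0.7 * taste + rng.normal(scale=0.3, size=200),
    "Drama": 3.5 - 0.8 * taste + rng.normal(scale=0.3, size=200),
    "Romance": 3.5 - 0.6 * taste + rng.normal(scale=0.3, size=200),
})

# Pearson correlation between every pair of genre columns.
corr = user_genre.corr()
print(corr.round(2))
```

In this toy table Action and Adventure come out positively correlated and Action and Drama negatively correlated, simply because the synthetic data was constructed that way. A heatmap of `corr` would give a figure of the same kind as the one above.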
Let's examine the genre preferences in a more quantitative way, using Principal Component Analysis (PCA).
I take the matrix of users (rows) and number of reviews by genre (columns) and perform a
decomposition of that matrix into the sum of eigenvectors times corresponding eigenvalues.
This has the effect of re-casting the matrix into a set of orthogonal axes, and when I sort
those eigenmodes from the largest (absolute value) eigenvalue to smallest, the first couple of
eigenmodes each correspond to the axes along which we find the greatest variance in our data.
The result, plotted below for the top two eigenmodes, shows how each of them naturally
corresponds to some of the most fundamental divides between user preferences
in genres (where we see weights of some genres having similar or opposing signs).
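The decomposition described above can be sketched with NumPy as an eigendecomposition of the genre-by-genre covariance matrix. This is a minimal illustration under my own assumptions, not the code behind the figures: the helper name `top_eigenmodes`, the four-column synthetic matrix `X`, and the choice to center the columns before decomposing are all hypothetical.

```python
import numpy as np

def top_eigenmodes(X, k=2):
    """Return the k leading eigenvalues, their eigenvectors, and the projected data."""
    Xc = X - X.mean(axis=0)                  # center each genre column
    cov = np.cov(Xc, rowvar=False)           # genre-by-genre covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh handles symmetric matrices
    order = np.argsort(np.abs(eigvals))[::-1][:k]  # largest |eigenvalue| first
    return eigvals[order], eigvecs[:, order], Xc @ eigvecs[:, order]

# Hypothetical stand-in for the users x genres matrix (synthetic numbers):
# columns play the role of Action, Adventure, Drama, Romance.
rng = np.random.default_rng(0)
taste = rng.normal(size=(300, 1))
X = np.hstack([taste * w for w in (3.0, 2.5, -3.0, -2.0)])
X = X + rng.normal(scale=0.5, size=(300, 4))

eigvals, modes, scores = top_eigenmodes(X, k=2)
print(eigvals)   # variance captured along each of the two eigenmodes
print(modes)     # genre weights of each eigenmode (one column per mode)
```

Each column of `modes` holds the genre weights of one eigenmode, so genres with the same sign sit on the same side of that axis. `scores` gives every user's coordinates along the two axes, which is the kind of two-dimensional representation the rest of the analysis works with.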
The upper-left of the figure above corresponds to strong weights for Action,
Adventure, Fantasy, and SciFi, while the lower-left features the strongest weights for
Drama and Romance.
This dichotomy represents the most significant single divide in the user movie preferences,
and it's surely one of the most obvious broad divides between types of films.
The upper-right is dominated by Children, Musical, Animation, and Romance while the lower-right
is Thriller, Crime, Action, and Mystery. This second dimension apparently separates light-hearted
films aimed at children from more gritty, often violent movies geared towards adults.
The figures above show the preferences of each group. The y-axis is the median fraction of ratings
in each genre for the users in that group. As you can see, Group 1 has higher fractions for Crime, Thriller, and Mystery,
while Group 2 has higher fractions for Action, Adventure, and SciFi.
On the other hand, Group 5 has higher fractions for Adventure, Animation, Children, and Fantasy.
At the same time, Group 3 prefers Comedy, Drama, Romance, and Musical, while Group 6 prefers Documentary, Drama, and Film Noir.
Group 4 has no clear preference. This means that a single genre label is not enough to pin down people's exact
preferences. For example, people in Group 2 like Adventure movies, but they prefer more adult-oriented adventure
than people in Group 5, who prefer Animation, Children, and Fantasy.
Isn't it interesting?
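For readers who want to reproduce this kind of per-group summary, here is a small pandas sketch of the "median fraction of ratings per genre" calculation. The `users` table, its `group` column, and the counts in it are hypothetical values chosen for illustration, not numbers from the MovieLens data.

```python
import pandas as pd

# Hypothetical table: one row per user, with a group label and the number
# of ratings that user gave in each genre (made-up counts).
users = pd.DataFrame({
    "group": [1, 1, 2, 2],
    "Crime": [8, 6, 1, 2],
    "Action": [1, 2, 7, 6],
    "Drama": [1, 2, 2, 2],
})
genre_cols = ["Crime", "Action", "Drama"]

# Fraction of each user's ratings that falls in each genre (rows sum to 1).
fractions = users[genre_cols].div(users[genre_cols].sum(axis=1), axis=0)

# Median fraction per genre within each group.
median_by_group = fractions.groupby(users["group"]).median()
print(median_by_group)
```

Working with fractions rather than raw counts keeps very active users from dominating the comparison, and the median keeps a few extreme users from skewing a group's profile.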
Finally, I checked the average rating that each group gave to several different movies, shown below:
The two dramas above demonstrate the really broad range of films that can be called drama.
Titanic is much more of a Romance & Adventure Drama, while Fargo is focused on a Criminal Investigation.
The Silence of the Lambs is a Thriller which has strong Horror and Criminal elements,
which is why it is preferred by G1 and G6 in particular.
Meanwhile, Die Hard is primarily an Action Thriller that is well-liked by G2.
In the Adventure genre, there is a very wide range of films that span from light-hearted
movies for Children, like Aladdin, to much darker and Action-packed films like Aliens.
In Musicals, there is likewise a split between the films intended for adults, like Chicago,
and those geared towards children, like The Lion King.
Conclusions
I have examined the ratings that people give to movies, and how those ratings might be biased by
people's preferences for certain kinds of movies. It is clear that distinct biases exist based on
one's movie preferences. To study these biases more quantitatively, I first used a simple approach
that treated each genre independently of the others, but I also found clear evidence of strong correlations
(and anti-correlations) between some genres. So I took a more rigorous approach to
studying user preferences, using PCA to account for the inherent correlations between genres.
PCA reduced the available features to just two dimensions, corresponding to
the two most significant eigenmodes that contribute to the variance in the data. From these
two dimensions, I selected six distinct clusters using the unsupervised learning algorithm called
k-means clustering. After checking the different ratings given by users in these six clusters,
a pattern emerged that reveals some of the most fundamental divides in movie preferences.
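The clustering step can be sketched with scikit-learn's `KMeans`. This is an illustration under stated assumptions rather than the code used for the analysis: the `scores` array is a synthetic stand-in for each user's two PCA coordinates, and `n_init` and `random_state` are arbitrary choices of mine. Only the number of clusters (six) comes from the analysis above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the two PCA coordinates per user:
# six well-separated synthetic blobs of 100 users each.
rng = np.random.default_rng(0)
centers = np.array([[-4, 3], [-4, -3], [0, 4], [0, -4], [4, 3], [4, -3]], dtype=float)
scores = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

# Six clusters, as in the analysis; n_init and random_state are illustrative.
km = KMeans(n_clusters=6, n_init=10, random_state=0)
labels = km.fit_predict(scores)

print(np.bincount(labels))            # number of users assigned to each cluster
print(km.cluster_centers_.round(1))   # cluster centers in the two-dimensional space
```

Real user data will rarely separate as cleanly as these blobs, so the number of clusters is a judgment call that is worth checking against how interpretable the resulting groups are.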