Recommendation systems (also called recommendation engines or recommender algorithms) filter information to suggest items to users. They predict a user's preferences based on their past behavior or the behavior of related users.
In the current project, I seek to predict movie preferences of users. So let's get started.
I will be using the movie dataset provided by Cognitive Class AI. The original source of the dataset can be found here. There are significantly larger datasets available, including one with 20 million movie ratings and a companion set of 20 million YouTube trailers. If you want to work with an even larger (but synthetic) dataset, extrapolated from the 20M real-world ratings, you can use this dataset.
# let's get the basic imports out of the way. I will be using pandas for storing and manipulating the data.
import pandas as pd
from math import sqrt
import numpy as np
import matplotlib as mpl
import seaborn as sns
import matplotlib.pyplot as plt
# Let's make sure matplotlib shows all the figures inline
%matplotlib inline
If you need to download the data, go here.
The dataset consists of 4 csv files -- links.csv, movies.csv, ratings.csv, tags.csv -- each containing relevant information. I will store the data from each of those files in a separate dataframe, and write them into a new dataframe as needed.
# Here is the movie information dataset
movies_df = pd.read_csv('../Downloads/moviedataset/ml-latest/movies.csv')
# Let's take a look at the first 5 lines of this dataset
movies_df.head()
There are a few things we need to take care of in this dataset. For instance, the year within the parentheses needs to be extracted and saved to a new column. Furthermore, we also need to split the genres into multiple columns so they are easier to work with.
But before we start working on that, we should look at the rest of the files, in case we can write a function to clean the data instead of cleaning them one at a time.
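As an aside, since we will read four files in exactly the same way, a small helper function like the sketch below could cut down on the repetition. This is just an illustration; the path prefix is an assumption matching the reads used throughout this notebook.
# A hedged sketch: load a csv from the dataset folder and print its shape as a quick sanity check
def load_movielens_csv(filename, base='../Downloads/moviedataset/ml-latest/'):
    df = pd.read_csv(base + filename)
    print(filename, df.shape)
    return df
# Example usage (equivalent to the explicit reads below):
# ratings_df = load_movielens_csv('ratings.csv')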
# Here is the ratings dataset
ratings_df = pd.read_csv('../Downloads/moviedataset/ml-latest/ratings.csv')
# Let's take a look at the first 5 lines of this dataset
ratings_df.head()
# Here is the links dataset
links_df = pd.read_csv('../Downloads/moviedataset/ml-latest/links.csv')
# Let's take a look at the first 5 lines of this dataset
links_df.head()
# Here is the tags dataset
tags_df = pd.read_csv('../Downloads/moviedataset/ml-latest/tags.csv')
# Let's take a look at the first 5 lines of this dataset
tags_df.head()
It appears that movieId is the common key across all the datasets. Thus, we can use that column to combine these dataframes as we need them. The first thing to do is to clean the movies_df dataframe to extract the years and the genres.
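For illustration, here is what a movieId join would look like if we wanted, say, ratings alongside titles (a minimal sketch; we will do the real merges only as we need them):
# Join ratings with movie titles on the shared movieId key (illustrative only)
ratings_with_titles = pd.merge(ratings_df, movies_df[['movieId', 'title']], on='movieId', how='left')
ratings_with_titles.head()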
# We can make use of regular expressions to find the year and the genres. First let's work on the years.
movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)  # extract the string '(dddd)' from the title column
movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)  # since we no longer need the parentheses, we remove them
movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)  # now that the years are extracted, replace them in the title with an empty string
movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())  # remove the trailing space
movies_df.head()
# Now that that is taken care of, let's split the genres.
# note the "|" that is used as a separator. We use this info to split the genres into multiple columns.
movies_df['genres'] = movies_df.genres.str.split('|')
movies_df.head()
While we were able to extract the genres into a list without the old '|' separator, we cannot use this information just yet, so let's move each genre into a separate column.
There are two ways to do this. We could simply split the genres into multiple columns, so that within each row, for each movie, we would have genre_1, genre_2, ..., genre_n.
However, the issue with this approach is that each movie has a different number of genres, leading to a lot of empty values in the columns beyond genre_1 and genre_2, given that most movies in our data have at least two genres.
We get around this issue by using the One Hot Encoding approach: essentially, we create a dummy variable (1 if yes, 0 if no) for every genre for each movie. That way, we can see which genres a movie belongs to and which it does not.
Let's do this now!
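As an aside, pandas can also produce this one-hot expansion in a single chained call. Here is a minimal sketch of that alternative (we still use the explicit loop below, since it writes the genre columns straight into movies_df):
# Alternative sketch: join the genre lists back with '|' and let get_dummies build the indicator columns
genre_dummies = movies_df['genres'].apply('|'.join).str.get_dummies(sep='|')
genre_dummies.head()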
# We can copy the movie dataframe into a new one if you want to leave the movies_df dataset untouched.
# moviesWithGenres_df = movies_df.copy()
# For every row in the dataframe, iterate through the list of genres and place a 1 into the corresponding column
for index, row in movies_df.iterrows():
    for genre in row['genres']:
        movies_df.at[index, genre] = 1
# Filling in the NaN values with 0 to show that a movie doesn't have that column's genre
movies_df = movies_df.fillna(0)
movies_df.head()
Great! Now we have all the genres as their own respective columns with 1 if yes and 0 if no for the respective genre. Additionally, we have the genres as its own column, in case we need to quickly reference the genre for each of our movies.
However, if we do not need this genres column, we can use the following line of code to drop it.
movies_df = movies_df.drop('genres', axis=1)
Now that we are done preprocessing the movies_df dataframe, let's shift our focus to the rest of the data, starting with ratings_df.
ratings_df.head()
The timestamp only tells us when a rating was posted, and that time has no predictive value for us here, so we can safely drop this variable.
# Let's drop the timestamp.
ratings_df = ratings_df.drop('timestamp', axis=1)
ratings_df.head()
Let's do the same for the links_df.
links_df.head()
We may not need the imdbId and tmdbId of each movie for our actual predictions; however, we might need them to cross-verify our data in case any movie ids are erroneous. Thus, for the time being, let's leave them alone.
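If we ever do want that cross-verification, a quick consistency check might look like the following sketch (purely illustrative):
# Check that every movieId appearing in ratings_df also appears in links_df
print(ratings_df['movieId'].isin(links_df['movieId']).all())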
How about the tags_df?
tags_df.head()
Again, the timestamp adds no value, so we can safely remove it. The tag associated with each movie might be a key feature for predicting what kind of movie each user might like, so let's make sure to use it later in our modeling.
# removing the timestamp from tags_df
tags_df = tags_df.drop('timestamp', axis=1)
tags_df.head()
That should be all for the cleaning. Fortunately, the data comes fairly clean otherwise and doesn't seem to have any issues that we need to worry about. So let's shift our focus to inspecting the data and then building our recommendation systems.
There are three common ways to build recommendation engines, each with its own goal: popularity-based (recommend what is trending), content-based (recommend items similar to what a user already likes), and collaborative filtering (recommend items liked by similar users).
There are a few other ways to build recommender systems, for instance multi-criteria recommender systems, risk-aware recommender systems, and hybrid recommender systems. To learn more about these, please check Wikipedia.
Unfortunately, in our dataset, we do not have the information related to what is popular/trending. Thus, we cannot build a recommendation engine that can recommend based on popularity! Bummer! :(
However, we can test the other two! So, let's start with Content-based and then move on to Collaborative filtering.
P.S. You can follow the wonderful tutorial here if you would like to try popularity-based filtering, along with its theoretical background primer, if you so like!
In the content-based recommender that we are building, we inspect what an individual user likes (think of a Netflix user, for instance). At the outset, we ask the user to select some movies/shows they like. Then, we inspect the qualities of the movies they picked and try to match other movies that share related qualities. In our case, the qualities of a movie could be its genres and tags.
# Let's assume that a user input's the following movies with ratings for each.
# The following list shows the movies and the rating for each.
userInput = [
    {'title': 'Breakfast Club, The', 'rating': 5},
    {'title': 'Toy Story', 'rating': 3.5},
    {'title': 'Jumanji', 'rating': 2},
    {'title': 'Pulp Fiction', 'rating': 5},
    {'title': 'Akira', 'rating': 4.5}
]
# Let's turn this into a dataframe called inputMovies since our ratings will depend on these movies that the user provided as an input.
inputMovies = pd.DataFrame(userInput)
inputMovies
# Now that we have the list of inputMovies from our user, let's find the genres for each of the movies
# For each movie in our list, let's find them in our movies_df dataframe and add the genres to our list.
inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]
# Once we identify them, we can merge the names of our genres into our inputMovies df
inputMovies = pd.merge(inputId, inputMovies)
#Once we merge them, we can drop the unwanted columns
inputMovies = inputMovies.drop('genres', axis=1).drop('year', axis=1)
#Now, let's look at our Final input dataframe
inputMovies
Awesome! Now we have our user's input movies, the user-specified ratings, and the genres. Let's try to match them to our original movies_df.
#Filtering out the movies from the input
userMovies = movies_df[movies_df['movieId'].isin(inputMovies['movieId'].tolist())]
userMovies
# After filtering, the dataframe keeps its original (non-sequential) index, so the row numbering looks off.
# Let's reset the index, dropping the old one.
userMovies = userMovies.reset_index(drop=True)
# If we are manually analyzing and recommending movies to a single user, it might not be important to delve into optimization of our modeling
# However, once we start working on multiple users' inputs and scale up our recommendation model, we have to worry about memory optimization and the speed of our engine.
# Let's drop unnecessary columns that add no value to our predictions and use a different dataframe called userGenreTable.
userGenreTable = userMovies.drop('movieId', axis=1).drop('title', axis=1).drop('genres', axis=1).drop('year', axis=1)
userGenreTable
Great! Now we have just the list of genres and ratings for the user in our userGenreTable. However, we cannot use the genres alone to build an effective recommender system.
It would have been simpler if all the movies were rated at the same level. In that case, we could just count the number of times each genre shows up and recommend movies that fit the most frequent genres. However, since the ratings for the movies differ, this simplest approach would not work.
For example, our user rated Toy Story, Jumanji, and Akira as 3.5, 2.0, and 4.5 respectively, yet all three movies fall into the genre Adventure. If we weighted Adventure by a simple count, our predictions might be off because the same genre was given different ratings. A plain count also ignores how the different genres of each movie interact, along with other factors such as plot, storyline, or the user's explicit interest in a movie.
Thus, to reduce these errors, we use weighted averages. In our case, this simply means that we multiply each movie's rating by its genre indicator and sum them. So, Adventure will have a total weight of 3.5x1 + 2.0x1 + 4.5x1, which adds up to 10.0. Similarly, Comedy will have a weight of 13.5 and Action a weight of 4.5.
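As a quick arithmetic check of that Adventure weight, using the example ratings above:
# Worked example: Toy Story (3.5), Jumanji (2.0), and Akira (4.5) are all tagged Adventure
adventure_weight = 3.5*1 + 2.0*1 + 4.5*1
print(adventure_weight)  # 10.0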
Let us create these weights from our rating and genre dataframes.
# Here are the user's ratings for each movie
inputMovies['rating']
# Now to generate our weights, we can simply take the dot product of the ratings and the userGenreTable.
# Please note that we are transposing the userGenreTable so that the dot product is done between the ratings and each genre column of the userGenreTable.
# Let's write them to the userProfile.
userProfile = userGenreTable.transpose().dot(inputMovies['rating'])
# The user profile
userProfile
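As a hedged sanity check, we can recompute a single genre's weight by hand and compare it with the dot-product result. This assumes the 'Adventure' one-hot column exists (it does for our example input) and that the rows of userGenreTable and inputMovies are in the same order, which they are here since both come from the same movieId filter.
# Manual weighted sum for one genre vs. the dot-product result
manual_adventure = (userGenreTable['Adventure'].values * inputMovies['rating'].values).sum()
print(manual_adventure, userProfile['Adventure'])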
# Now that we have the userProfile, we simply need to convert our movies_df dataframe to something easy to parse and identify similar movies
# We can get the genres of each movie from movies_df dataframe
genreTable = movies_df.set_index(movies_df['movieId'])
# As before, let's drop the unnecessary information/columns
genreTable = genreTable.drop('movieId', axis=1).drop('title', axis=1).drop('genres', axis=1).drop('year', axis=1)
genreTable.head()
genreTable.shape
At this point we are left with 34,208 rows and 20 columns, one per genre.
Let's generate the weighted averages of each movie within this dataframe.
# Let's multiply the genres by the weights for this specific user's profile and then take the weighted average to create the recommendation table.
recommendations_df = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())
recommendations_df.head()
# Awesome. We have the movieids and the associated weights!
# All we need to do now is to match the movies with the highest weights from our movies_df and we can find the best movies to recommend!
movies_df.loc[movies_df['movieId'].isin(recommendations_df.head(20).keys())]
# Here are 20 movies that we can recommend to our user.
# However, they are not in any particular order. Ideally, we want the best recommendations at the top of our list, so let us sort the scores in descending order!
recommendations_df = recommendations_df.sort_values(ascending=False)
# Let's take a look at the top 5 movies
recommendations_df.head()
# The final recommendation table
movies_df.loc[movies_df['movieId'].isin(recommendations_df.head(20).keys())]
Here we go! We generated a recommendation table for our user with the top 20 movies that we predict they might want to watch.
So far, we have built a content-based recommendation system.
Now let's work on a collaborative-filtering based recommendation system. To do that, we need to identify which other users within our ratings_df database have watched these movies, and store them in a separate dataframe.
#Filtering out users that have watched movies that the input has watched and storing it
userSubset = ratings_df[ratings_df['movieId'].isin(inputMovies['movieId'].tolist())]
userSubset.head()
# If we want to look at a single user's ratings, we can use groupby userId
userSubsetGroup = userSubset.groupby(['userId'])
# We can inspect individual user using the following code
userSubsetGroup.get_group(649)
# Let's sort these groups so that users who share the most rated movies with our input come first
userSubsetGroup = sorted(userSubsetGroup, key=lambda x: len(x[1]), reverse=True)
# Let's look at the first 5 users who share the most movies with our input
userSubsetGroup[0:5]
But what about the rest of the users' ratings? It would be very difficult (and a tedious endeavor) to go through our entire list of users one at a time like this.
To circumvent this problem, we derive a similarity index in the form of a Pearson correlation.
If you want a primer on Pearson (as opposed to Spearman rank) correlations, you can read this article.
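As a reference point, the loop below computes Pearson's r from its sum-of-squares form, r = Sxy / sqrt(Sxx * Syy). A quick cross-check on a pair of hypothetical rating lists can be done with numpy:
# Minimal sketch: np.corrcoef should agree with the hand-rolled formula used in the loop below
x_example = [5.0, 3.5, 2.0]
y_example = [4.0, 3.0, 2.5]
print(np.corrcoef(x_example, y_example)[0, 1])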
# Let's create an empty dictionary to store the Pearson Correlation with users as keys and the coefficient as values
pearsonCorrelationDict = {}
# For each candidate user, we need their ratings and our input ratings lined up on the same movies.
# We achieve this by sorting both the group and the input by movieId inside the loop below.
for name, group in userSubsetGroup:
    # Sort both the group and the input in the same movieId order so the ratings line up
    group = group.sort_values(by='movieId')
    # print(group)
    inputMovies = inputMovies.sort_values(by='movieId')
    # print(inputMovies)
    # We will need the number of ratings for calculating the Pearson Correlation
    nRatings = len(group)
    # Grab the input movies that this user has also rated
    temp_df = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]
    # Store the input ratings in a list to facilitate the calculations below
    tempRatingList = temp_df['rating'].tolist()
    # Let's also put the current user group's reviews in list format
    tempGroupList = group['rating'].tolist()
    # Now let's calculate the Pearson correlation between the two users, call them x and y
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList), 2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList), 2)/float(nRatings)
    Sxy = sum(i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)
    # If the denominator is non-zero, divide; else, the correlation is 0.
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0
# If you want to take a look at the items in our dictionary, use the below line of code.
# pearsonCorrelationDict.items()
# Let's restrict ourselves to the first 10 items
from itertools import islice

def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))

n_items = take(10, pearsonCorrelationDict.items())
print(n_items)
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['userId'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
pearsonDF.head()
# Since we have a lot of users in our database, let's restrict our view to the first 100.
x = pearsonDF.userId[:100]
y = pearsonDF.similarityIndex[:100]
plt.scatter(x, y)
plt.show()
# or plt.savefig("PearsonScatter.png")
topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]
topUsers.head()
# Let's replot the top 50 from our sorted list
x = topUsers.userId
y = topUsers.similarityIndex
plt.scatter(x, y)
plt.show()
If our dataset included better parameters that define the user's choices, then we can redefine our similarity index and refine our predictions.
# Let's look at the top 1000 users with the highest similarity to our test user.
top1000Users = pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:1000]
import chart_studio.plotly as py
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode
# Using plotly + cufflinks in offline mode
import cufflinks
cufflinks.go_offline(connected=True)
init_notebook_mode(connected=True)
top1000Users.iplot(x='userId', y='similarityIndex')
# If you prefer to use Bokeh instead, try this
# If you don't have bokeh installed, use the following command
#!pip install bokeh
import bokeh as bk
from bokeh.plotting import figure
from bokeh.io import output_notebook, show
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.transform import factor_cmap
from bokeh.palettes import Plasma
output_notebook()
%matplotlib inline
import matplotlib as mpl
from matplotlib import pyplot as plt
mpl.style.use('seaborn-bright')
# Keep the similarity index as a float (casting to int would flatten it) and use a
# ColumnDataSource so the hover tooltips can reference the column names.
source = ColumnDataSource(top1000Users)
p = figure(plot_width=800, plot_height=400, title='Similarity of users with test user',
           tools="hover", tooltips="@userId : @similarityIndex")
p.scatter(x='userId', y='similarityIndex', source=source, size=8, color='blue', alpha=0.5)
p.xaxis.axis_label = "User ID"
p.yaxis.axis_label = 'Similarity of users with test user'
# p.xaxis.major_label_orientation = 1.2
# p.y_range.start = 0
# p.x_range.range_padding = 0.1
p.xgrid.grid_line_color = None
p.axis.minor_tick_line_color = None
p.outline_line_color = None
# p.legend.location = "top_center"
# p.legend.orientation = "horizontal"
show(p)
# Interesting, it looks like all of our top 1000 users have a similarity index of exactly 1!
# We can also look at the entire user list just for the sake of it, although for predicting for our test user the full list is not very useful, since we will only use the top few users anyway.
source = ColumnDataSource(pearsonDF)
p = figure(plot_width=800, plot_height=400, title='Similarity of users with test user',
           tools="hover", tooltips="@userId : @similarityIndex")
p.scatter(x='userId', y='similarityIndex', source=source, size=8, color='blue', alpha=0.5)
p.xaxis.axis_label = "User ID"
p.yaxis.axis_label = 'Similarity of users with test user'
# p.xaxis.major_label_orientation = 1.2
# p.y_range.start = 0
# p.x_range.range_padding = 0.1
p.xgrid.grid_line_color = None
p.axis.minor_tick_line_color = None
p.outline_line_color = None
# p.legend.location = "top_center"
# p.legend.orientation = "horizontal"
show(p)
topUsersRating=topUsers.merge(ratings_df, left_on='userId', right_on='userId', how='inner')
topUsersRating.head()
# Let's then multiply the similarity by the user's rating to get the weighted ratings!
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']
topUsersRating.head()
# Now let us apply a sum to the topUsers after grouping it up by the movieId
tempTopUsersRating = topUsersRating.groupby('movieId').sum()[['similarityIndex','weightedRating']]
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()
x = tempTopUsersRating.sum_similarityIndex
y = tempTopUsersRating.sum_weightedRating
p = figure(plot_width=800, plot_height=400, title='Similarity Index vs Weighted Ratings of top users')
p.scatter(x=x, y=y, size=8, color='orange', alpha=0.5)
p.xaxis.axis_label = "Similarity Index"
p.yaxis.axis_label = 'Weighted Rating'
# p.xaxis.major_label_orientation = 1.2
# p.y_range.start = 0
# p.x_range.range_padding = 0.1
p.xgrid.grid_line_color = None
p.axis.minor_tick_line_color = None
p.outline_line_color = None
# p.legend.location = "top_center"
# p.legend.orientation = "horizontal"
show(p)
# Let's now create an empty dataframe and store our final recommendations
recommendation_df = pd.DataFrame()
# We can create a weighted average of recommendation scores (WARS) using the two variables we generated earlier
recommendation_df['WARS'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
# Let's also include the Movie ID from our index for the sake of completion
recommendation_df['movieId'] = tempTopUsersRating.index
recommendation_df.head()
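To make the score concrete, here is a tiny worked example with hypothetical numbers: if two users with similarity 0.9 and 0.5 rated a movie 4.0 and 3.0 respectively, its weighted average score would be (0.9*4.0 + 0.5*3.0) / (0.9 + 0.5).
# Hypothetical worked example of the WARS formula above
print((0.9*4.0 + 0.5*3.0) / (0.9 + 0.5))  # ~3.64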
x = recommendation_df.movieId
y = recommendation_df.WARS
p = figure(plot_width=800, plot_height=400, title='Weighted Average Recommendation Scores of each Movie')
p.scatter(x=x, y=y, size=8, color='magenta', alpha=0.9)
p.xaxis.axis_label = "Movie ID"
p.yaxis.axis_label = 'Weighted Average Recommendation Scores (WARS)'
# p.xaxis.major_label_orientation = 1.2
# p.y_range.start = 0
# p.x_range.range_padding = 0.1
p.xgrid.grid_line_color = None
p.axis.minor_tick_line_color = None
p.outline_line_color = None
# p.legend.location = "top_center"
# p.legend.orientation = "horizontal"
show(p)
Here we have to make an important decision. Given that there are a lot of movies tied at the highest WARS rating of 5.0, we need to decide how many of those movies to recommend. Again, if we had access to better markers of user preferences, we could include that information to improve our predictions. Since we do not have that information, we can use other data-driven means to make recommendations.
For instance, we can randomly pull a set number of movies out of all the movies with the top WARS rating (see the sketch below). Another way is to simply sort our table and recommend the top x movies from the sorted list.
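Here is a minimal sketch of the first option, randomly sampling from the movies tied at the maximum score (the sample size and random_state are arbitrary choices):
# Randomly sample up to 10 movies from those tied at the top WARS score
top_ties = recommendation_df[recommendation_df['WARS'] == recommendation_df['WARS'].max()]
top_ties.sample(n=min(10, len(top_ties)), random_state=42)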
# Let's look at the top 10 movies with our weighted average
recommendation_df = recommendation_df.sort_values(by='WARS', ascending=False)
recommendation_df.head(10)
movies_df.loc[movies_df['movieId'].isin(recommendation_df.head(10)['movieId'].tolist())]
full_recommendation_df = movies_df.loc[movies_df['movieId'].isin(recommendation_df.head(10)['movieId'].tolist())]
full_recommendation_df_truncated = full_recommendation_df[['title', 'year']]
full_recommendation_df_truncated.sort_values(by='year', ascending=False)
In this project, I showed two methods for recommending movies: content-based and collaborative filtering. Please email me at shafeem@uci.edu if you have questions.