Recommender System in Python

Amazon and Netflix reportedly make almost 50% of their revenue by recommending suitable products (books, movies) to their users. But how do they know what to recommend? They draw on the ratings that all their other users have given to all their other items.

By the way, if you would like to go deeper into big data mining and find out more about this algorithm and many others, check out the book Mining of Massive Datasets. It is the single best source for big data mining and machine learning for massive datasets.

Collaborative Filtering (CF) is probably the most popular method in recommender systems at the moment. To predict unknown values (ratings, stock prices, etc.), this technique uses the collaboration among multiple data sets, users or data views.

The application scope of CF is very broad, ranging from financial institutions and e-commerce to mineral exploration.

In this post I’ll explain how to implement a basic yet powerful recommender system based on item-to-item collaborative filtering. In practice, this item-item approach outperforms user-user CF in many use cases, since items are “simpler” than users with varied tastes. Moreover, item similarity is more meaningful than user similarity.

We have a data set containing user ratings of hotels (from 1 to 5). A value of 0 means that no rating is available.


Table 1: Initial Data Set
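
The table can be rebuilt from the ratings listed in the notebook at the end of this post. This is a minimal sketch: the user labels beyond User5 are an assumption (the notebook's index line is cut off), chosen to continue the same naming pattern.

```python
import pandas as pd

# Ratings per hotel, one value per user (0 = no rating), copied from the notebook.
ratings = {
    'Hotel1': [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
    'Hotel2': [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
    'Hotel3': [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
    'Hotel4': [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
    'Hotel5': [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
    'Hotel6': [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],
}

# Assumption: the twelve users are labelled User1..User12.
users = ['User{}'.format(i) for i in range(1, 13)]

# Rows are hotels and columns are users, the layout the notebook works with.
df = pd.DataFrame(ratings, index=users).transpose()
print(df)
```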

Let’s assume User5 has not visited Hotel1 and thus left no rating for it. We want to find out whether we should recommend Hotel1 to User5.

To make a prediction for a given user, there must be enough recommender support for that user, meaning that the same user must have rated enough other items.


User-item support.

In our case the recommender support is 2, which means that the prediction is made from the 2 most similar items that the user has already rated. The diagram above shows that User5 has enough support (enough ratings for other items).
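
The support check can be sketched in a few lines. The helper name `has_support` and its `min_support` parameter are made up for this illustration; the ratings are User5's values from the data set.

```python
import pandas as pd

# User5's ratings (0 = no rating).
user5 = pd.Series({'Hotel1': 0, 'Hotel2': 0, 'Hotel3': 2,
                   'Hotel4': 5, 'Hotel5': 4, 'Hotel6': 3})


def has_support(user_ratings, target_item, min_support=2):
    """Return True if the user rated at least `min_support` items other than the target."""
    others = user_ratings.drop(target_item)
    return int((others > 0).sum()) >= min_support


print(has_support(user5, 'Hotel1'))
```

User5 has rated four other hotels, so a support of 2 is satisfied.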

Basically, there are only two steps.

  1. For item (hotel) i, find other similar items.
  2. Estimate rating for item i based on ratings for similar items.

Before we get into those 2 steps, the hotel ratings data must be normalized. We’ll apply item normalization, meaning that each item’s ratings are normalized relative to the mean rating of that particular item. Here are the mean rating values for our hotels.

Hotel1    3.600000
Hotel2    3.166667
Hotel3    3.000000
Hotel4    3.400000
Hotel5    3.333333
Hotel6    2.600000
dtype: float64

Hotel-Item mean ratings.
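
The important detail is that the zeros must not count towards the mean. A small sketch with just two rows of the data set (Hotel1 and Hotel6), using `DataFrame.where` to mask the missing ratings:

```python
import pandas as pd

# Two rows of the ratings table (hotels x users), 0 = no rating.
df = pd.DataFrame(
    [[1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
     [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0]],
    index=['Hotel1', 'Hotel6'])

# Turn zeros into NaN so that mean() skips them instead of averaging them in.
hotel_means = df.where(df != 0).mean(axis=1)
print(hotel_means)
```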

By subtracting those mean values from ratings we get the normalized data set which looks like this:


Table 2: Normalized data set.
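
For a single hotel the normalization can be sketched as follows. Missing ratings are kept at 0 after centering, which is what the notebook below does as well; the variable names are chosen for this example only.

```python
import pandas as pd

# Hotel1's ratings from all twelve users (0 = no rating).
hotel1 = pd.Series([1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0])

rated = hotel1 != 0
mean = hotel1[rated].mean()  # mean over the real ratings only

# Subtract the hotel mean from the real ratings; missing ratings stay at 0.
centered = (hotel1 - mean).where(rated, 0.0).round(1)
print(centered.tolist())
```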

Now, back to the 2 steps.

Step 1: We need to find the hotels most similar to Hotel1. Here we use the Pearson correlation, although in this particular case we could also use cosine similarity. The Pearson correlation coefficient measures how strongly the ratings of two items increase or decrease together. In Python, you can calculate it like this:

# pearson correlation similarity
# option 1
hotel_similarity_df = dfn.transpose().corr().round(2)
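
Why cosine similarity would work just as well here: after item normalization (with missing ratings left at 0) each hotel vector sums to zero, and for zero-mean vectors the Pearson correlation and the cosine similarity are the same number. A small check with the centered vectors of Hotel1 and Hotel6, derived from the ratings in the notebook below:

```python
import numpy as np

# Mean-centered rating vectors (0 = no rating) for Hotel1 and Hotel6.
h1 = np.array([-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0])
h6 = np.array([-1.6, 0, 0.4, 0, 0.4, 0, 0, -0.6, 0, 0, 1.4, 0])

pearson = np.corrcoef(h1, h6)[0, 1]
cosine = h1.dot(h6) / (np.linalg.norm(h1) * np.linalg.norm(h6))

# Both measures agree because both vectors have (numerically) zero mean.
print(round(pearson, 2), round(cosine, 2))
```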

By inspecting the hotel correlations we can conclude that 2 hotels are pretty similar to Hotel1: Hotel6 and Hotel3.


Hotel similarity matrix.

Step 2: The predicted rating for Hotel1 from User5 is calculated by multiplying the similarity coefficients of Hotel3 (0.41) and Hotel6 (0.59) by User5’s ratings of those hotels (2 and 3, see Table 1), summing the products, and dividing by the sum of the two similarity coefficients (0.41 + 0.59).


CF Formula.
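
Written out, the weighted average described in Step 2 takes the following form. The notation is an assumption made for this note: $\hat{r}_{ui}$ is the predicted rating of user $u$ for item $i$, $s_{ij}$ is the similarity between items $i$ and $j$, and $N(i;u)$ is the set of items most similar to $i$ that user $u$ has rated.

```latex
\hat{r}_{ui} = \frac{\sum_{j \in N(i;u)} s_{ij}\, r_{uj}}{\sum_{j \in N(i;u)} s_{ij}}
```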

In Python, it looks like this:

# calculate rating for hotel 1 from user 5
# predict by taking weighted average
r_15 = sum(hotel_ratings * hotel_sim) / sum(hotel_sim)
print("User 5 would rate Hotel 1 with:", round(r_15, 1), "stars")

So basically it comes down to:
user5_hotel1_prediction = (0.41*2 + 0.59*3)/(0.41+0.59) = 2.6

User5 would rate Hotel1 with 2.6 stars. The average rating of Hotel1 is 3.6, so I would NOT recommend this hotel to User5, since our predicted rating is below the mean of that hotel.
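
Both steps and the final decision can be put together in one small Python 3 function. This is a sketch rather than the notebook's exact code: the function name `predict_rating` and the user labels beyond User5 are assumptions, intermediate values are not rounded, and it assumes that the selected neighbours have positive similarity.

```python
import pandas as pd

# Ratings from the notebook (0 = no rating). Assumption: users are labelled
# User1..User12; the notebook's index line is cut off after User5.
ratings = {
    'Hotel1': [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
    'Hotel2': [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
    'Hotel3': [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
    'Hotel4': [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
    'Hotel5': [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
    'Hotel6': [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],
}
users = ['User{}'.format(i) for i in range(1, 13)]
df = pd.DataFrame(ratings, index=users)  # rows: users, columns: hotels


def predict_rating(df, user, item, k=2):
    """Item-item CF: weighted average over the k most similar items the user rated."""
    rated = df != 0
    item_means = df.where(rated).mean()             # per-hotel mean of real ratings
    centered = (df - item_means).where(rated, 0.0)  # missing ratings stay at 0
    sim = centered.corr()[item].drop(item)          # Pearson similarity to `item`
    candidates = sim[rated.loc[user].drop(item)]    # only hotels this user rated
    neighbours = candidates.nlargest(k)             # the k most similar of those
    weights = neighbours.values
    scores = df.loc[user, neighbours.index].values
    return (weights * scores).sum() / weights.sum(), item_means[item]


prediction, hotel_mean = predict_rating(df, 'User5', 'Hotel1')
print(round(prediction, 1), round(hotel_mean, 1))
print('recommend' if prediction >= hotel_mean else 'do not recommend')
```

For User5 and Hotel1 this picks Hotel6 and Hotel3 as neighbours and reproduces the 2.6 stars from above, which is below the hotel mean of 3.6.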

But which hotels could we recommend to User5? Well, I leave that to you. You can use this notebook as a starting point for your research. More advanced recommendation systems are implemented with SVD (Singular Value Decomposition) and the CUR decomposition, but for most use cases the CF approach is more than enough to increase your sales by predicting items that a user might like.

Have fun and enjoy. Cheers!

# coding: utf-8
# ## import some stuff ##
# In[246]:
import numpy as np
import scipy as sc
from pandas import Series,DataFrame
import pandas as pd
from scipy import spatial
from sklearn import preprocessing
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
from collections import OrderedDict
from fractions import Fraction
get_ipython().magic(u'matplotlib inline')
mpl.rcParams['figure.figsize'] = (10.0, 5)
# # Part 1 #
# ## Collaborative filtering item-item ##
# This notebook is an implementation of the collaborative filtering algorithm in Python.
# The missing rating for Hotel1 by User5 is going to be predicted.
# Recommendations are made based on these calculations.
# Have fun…
# In[247]:
df = pd.DataFrame({'Hotel1' :[1,0,3,0,0,5,0,0,5,0,4,0],
'Hotel2' :[0,0,5,4,0,0,4,0,0,2,1,3],
'Hotel3' :[2,4,0,1,2,0,3,0,4,3,5,0],
'Hotel4' :[0,2,4,0,5,0,0,4,0,0,2,0],
'Hotel5' :[0,0,4,3,4,2,0,0,0,0,2,5],
'Hotel6' : [1,0,3,0,3,0,0,2,0,0,4,0],
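}, index=['User1','User2','User3','User4','User5',
# NOTE: the original line is cut off after 'User5'; the remaining labels are an
# assumption that continues the same pattern to match the 12 ratings per hotel
'User6','User7','User8','User9','User10','User11','User12'])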
df = df.transpose()
# In[248]:
# check if hotels have enough ratings (enough support) to be able to make predictions
# In[249]:
# find 0 values
no_rating_mask = (df == 0)
# In[250]:
#comes after
#df[no_rating_mask] = None
# In[251]:
# possibility 2 to find hotel rating mean values
hotel_rating_averages = df[np.invert(no_rating_mask)].mean(axis=1)
# In[252]:
# normalise dataset
dfn = df.sub(hotel_rating_averages, axis=0)
dfn = dfn.round(1)
# In[253]:
# put 0 where no rating was found
dfn[no_rating_mask] = 0
# and round values
dfn = dfn.round(1)
# In[254]:
# In[255]:
# inspect hotel similarities
# In[256]:
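# we could also plot hotel recommendation values vectors
soa = dfn.transpose().values
print(list(zip(*soa)))
X,Y,U,V,Z,E = zip(*soa)
ax = plt.gca()
# quiver takes at most X, Y, U, V (plus an optional colour array), so only the
# first four hotel components are drawn; Z and E are left out of the plot
ax.quiver(X,Y,U,V, angles='xy',scale_units='xy',scale=1)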
# In[257]:
# pearson correlation similarity
# option 1
hotel_similarity_df = dfn.transpose().corr().round(2)
# In[258]:
sns.heatmap(hotel_similarity_df, annot=True)
# In[259]:
# we could also calculate hotel similarities this way
from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse
# DataFrame.as_matrix() was removed from pandas; .values returns the same array
A_sparse = sparse.csr_matrix(dfn.values)
# also can output sparse matrices
similarities_sparse = cosine_similarity(A_sparse, dense_output=False)
print('hotel pairwise similarity:\n {}\n'.format(similarities_sparse))
# # now let's calculate how user 5 would rate hotel 1
# In[260]:
# Hotel1 is most similar to the Hotels 3 and 6
mask = hotel_similarity_df["Hotel1"] > 0.30
# In[261]:
# take ratings of most similar hotels (3 and 6)
hotel_ratings = df.User5[mask].values[1:]
# In[262]:
# take similarities of most similar hotels (3 and 6)
hotel_sim = hotel_similarity_df.Hotel1[mask].values[1:]
# In[263]:
# calculate rating for hotel 1 from user 5
# predict by taking weighted average
r_15 = sum(hotel_ratings * hotel_sim) / sum(hotel_sim)
print("User 5 would rate Hotel 1 with:", round(r_15, 1), "stars")
