Amazon and Netflix reportedly make a large share of their revenue, by some accounts almost 50%, by recommending suitable products (books, movies) to their users. But how do they know what to recommend? Well, they use the ratings that all other users have given to all other items.
By the way, if you would like to go deeper into big data mining and find out more about this algorithm and many others, check out the book Mining of Massive Datasets. In my opinion it is the single best source for big data mining and machine learning on massive datasets.
Collaborative Filtering (CF) is probably the most popular method in recommender systems at the moment. To predict unknown values (ratings, stock prices, etc.), the technique exploits collaboration among multiple data sets, users, or data views.
The application scope of CF is very broad and ranges from financial institutions and e-commerce to mineral exploration.
In this post I'll explain how to implement a basic, yet powerful, recommender system based on item-to-item collaborative filtering. In practice, the item-item approach outperforms user-user CF in many use cases, since items are "simpler" than users with varied tastes. Moreover, item similarity is more meaningful than user similarity.
We have a data set containing users' hotel ratings (from 1 to 5). A value of 0 means that no rating is available.

Table 1: Initial Data Set
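For readers who want to follow along in code, the values of Table 1 can be loaded as in the sketch below. The numbers are copied from the notebook at the end of this post; the names `users` and `ratings` are only illustrative (the notebook calls the frame `df`).

```python
import pandas as pd

users = ["User{}".format(i) for i in range(1, 13)]

# Ratings from 1 to 5; 0 means "no rating available".
ratings = pd.DataFrame(
    {
        "Hotel1": [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
        "Hotel2": [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
        "Hotel3": [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
        "Hotel4": [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
        "Hotel5": [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
        "Hotel6": [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],
    },
    index=users,
).transpose()  # rows are hotels, columns are users

print(ratings)
```

Each row is a hotel and each column is a user, so `ratings.loc["Hotel1", "User5"]` is the missing value we want to predict.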
Let's assume User5 has not visited Hotel1 and thus left no rating for it. We want to find out whether we could recommend Hotel1 to User5.
To be able to make any prediction for a given user, there must be enough recommender support for that user, meaning that the same user must have rated enough other items.

User-item support.
In our case the required recommender support is 2, which means that the prediction is going to be based on the 2 most similar items that the user has already rated. The diagram above shows that User5 has enough support (has rated enough other items).
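The support check can also be done numerically instead of reading it off the diagram. This is a minimal sketch, assuming the ratings from the notebook at the end of this post; `MIN_SUPPORT`, `support` and `has_support` are names I made up for illustration.

```python
import pandas as pd

users = ["User{}".format(i) for i in range(1, 13)]
ratings = pd.DataFrame(
    {
        "Hotel1": [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
        "Hotel2": [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
        "Hotel3": [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
        "Hotel4": [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
        "Hotel5": [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
        "Hotel6": [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],
    },
    index=users,
).transpose()  # rows are hotels, columns are users

MIN_SUPPORT = 2  # hypothetical name for the support threshold used in this post

# Number of hotels each user has actually rated (0 means "no rating").
support = (ratings > 0).sum(axis=0)
has_support = support >= MIN_SUPPORT

print(support["User5"])      # number of hotels rated by User5
print(has_support["User5"])  # True if a prediction can be attempted
```

User5 has rated four hotels, which is comfortably above the threshold of 2.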
Basically, there are only two steps.
- For item (hotel) i, find other similar items.
- Estimate the rating for item i based on the ratings for similar items.
Before we get into those 2 steps, the hotel ratings data must be normalized. We'll apply item normalization, meaning that each rating is normalized by the mean rating of that particular item. Here are the mean rating values for our hotels.
Out[233]:
Hotel1    3.600000
Hotel2    3.166667
Hotel3    3.000000
Hotel4    3.400000
Hotel5    3.333333
Hotel6    2.600000
dtype: float64
Hotel-Item mean ratings.
By subtracting those mean values from ratings we get the normalized data set which looks like this:

Table 2: Normalized data set.
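The normalization can be sketched as follows. It mirrors what the notebook at the end of this post does (subtract the hotel mean, round to one decimal, put 0 where no rating exists), but the variable names `rated`, `hotel_means` and `normalized` are my own choices here.

```python
import pandas as pd

users = ["User{}".format(i) for i in range(1, 13)]
ratings = pd.DataFrame(
    {
        "Hotel1": [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
        "Hotel2": [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
        "Hotel3": [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
        "Hotel4": [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
        "Hotel5": [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
        "Hotel6": [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],
    },
    index=users,
).transpose()  # rows are hotels, columns are users

# Treat 0 as "missing" so it does not distort the hotel means.
rated = ratings.where(ratings > 0)

# Mean rating per hotel, computed over the existing ratings only.
hotel_means = rated.mean(axis=1)

# Subtract each hotel's mean from its row, round, and set missing entries to 0.
normalized = rated.sub(hotel_means, axis=0).round(1).fillna(0)

print(hotel_means)
print(normalized)
```

After this step a 0 no longer means "bad rating" but "no information", and positive or negative values say how far a rating lies above or below the hotel's average.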
Now, back to the 2 steps.
Step 1: We need to find the hotels most similar to Hotel1. Here we shall use the Pearson correlation, although in this particular case we could also use cosine similarity. The Pearson correlation coefficient measures how strongly the ratings of two items increase or decrease "together". In Python, you can calculate it like this:
# pearson correlation similarity
# option 1
hotel_similarity_df = dfn.transpose().corr().round(2)
By inspecting the hotel correlations we can conclude that 2 hotels are fairly similar to Hotel1: Hotel6 and Hotel3.

Hotel similarity matrix.
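To make the similarity step less of a black box, here is a small sketch that computes the Pearson correlation by hand for the three hotels involved. The vectors are the normalized ratings of Hotel1, Hotel3 and Hotel6 across the 12 users (derived from the notebook data); the helper functions `pearson` and `cosine` are illustrative and not part of the notebook.

```python
import numpy as np

# Normalized ratings (rating minus hotel mean, 0 where no rating exists).
hotel1 = np.array([-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0])
hotel3 = np.array([-1.0, 1.0, 0, -2.0, -1.0, 0, 0, 0, 1.0, 0, 2.0, 0])
hotel6 = np.array([-1.6, 0, 0.4, 0, 0.4, 0, 0, -0.6, 0, 0, 1.4, 0])


def pearson(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float(a_c @ b_c / np.sqrt((a_c @ a_c) * (b_c @ b_c)))


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


print(round(pearson(hotel1, hotel3), 2))  # 0.41
print(round(pearson(hotel1, hotel6), 2))  # 0.59
print(round(cosine(hotel1, hotel3), 2))   # 0.41
```

For these three hotels the normalized ratings sum to zero, so subtracting the mean inside `pearson` changes nothing and the result coincides with the cosine similarity. That is why either measure could be used in this particular case.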
Step 2: The predicted rating of User5 for Hotel1 is calculated by multiplying the similarity coefficients of Hotel3 (0.41) and Hotel6 (0.59) by User5's ratings of those hotels (2 and 3, see Table 1), summing the products, and dividing the result by the sum of the two similarity coefficients (0.41 + 0.59).

CF Formula.
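Written out, the weighted average described in Step 2 looks roughly like this. The notation is my own and may differ from the figure: $\hat{r}_{u,i}$ is the predicted rating of user $u$ for item $i$, $N(i;u)$ is the set of items most similar to $i$ that $u$ has rated, $s_{ij}$ is the similarity between items $i$ and $j$, and $r_{u,j}$ is the rating of user $u$ for item $j$.

```latex
\hat{r}_{u,i} = \frac{\sum_{j \in N(i;u)} s_{ij}\, r_{u,j}}{\sum_{j \in N(i;u)} s_{ij}}
\qquad\Longrightarrow\qquad
\hat{r}_{5,1} = \frac{0.41 \cdot 2 + 0.59 \cdot 3}{0.41 + 0.59} = 2.59 \approx 2.6
```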
In Python, it looks like this:
# calculate rating for hotel 1 from user 5
# predict by taking weighted average
r_15 = sum(hotel_ratings * hotel_sim) / sum(hotel_sim)
print("User 5 would rate Hotel 1 with: {} stars".format(round(r_15, 1)))
So basically it comes down to:
user5_hotel1_prediction = (0.41*2 + 0.59*3)/(0.41+0.59) = 2.59 ≈ 2.6
User5 would rate Hotel1 with 2.6 stars. Since the average rating of Hotel1 is 3.6, I would NOT recommend this hotel to User5: our predicted rating is below the mean rating of that hotel.
But which hotels could we recommend to User5? Well, I leave that to you. You can use this notebook as a starting point for your research. More advanced recommendation systems are implemented with SVD (Singular Value Decomposition) and the CUR decomposition, but for many use cases the CF approach is enough to increase your sales by predicting items that a user might like.
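As a possible starting point for that exercise, the two steps can be wrapped into one function and applied to every hotel User5 has not rated yet. Treat this as a sketch under my own assumptions rather than the notebook's method: the function `predict`, its parameter `k`, and the choice to use only positively correlated hotels are mine.

```python
import pandas as pd

users = ["User{}".format(i) for i in range(1, 13)]
ratings = pd.DataFrame(
    {
        "Hotel1": [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
        "Hotel2": [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
        "Hotel3": [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
        "Hotel4": [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
        "Hotel5": [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
        "Hotel6": [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],
    },
    index=users,
).transpose()  # rows are hotels, columns are users

# Item normalization: subtract the hotel mean, keep 0 for missing ratings.
rated = ratings.where(ratings > 0)
normalized = rated.sub(rated.mean(axis=1), axis=0).round(1).fillna(0)

# Pearson correlation between hotels (columns after transposing).
similarity = normalized.transpose().corr().round(2)


def predict(user, hotel, k=2):
    """Weighted average over the k most similar hotels the user has rated."""
    user_ratings = ratings[user]
    candidates = user_ratings[user_ratings > 0].index.drop(hotel, errors="ignore")
    sims = similarity.loc[hotel, candidates]
    top = sims[sims > 0].nlargest(k)  # assumption: ignore negative similarities
    if top.empty:
        return float("nan")  # not enough support for a prediction
    return float((top * user_ratings[top.index]).sum() / top.sum())


unrated = ratings.index[ratings["User5"] == 0]
predictions = {hotel: round(predict("User5", hotel), 1) for hotel in unrated}
print(predictions)
```

Each prediction could then be compared with the hotel's mean rating, as done for Hotel1 above, to decide whether the hotel is worth recommending.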
Have fun and enjoy. Cheers!
# coding: utf-8
# ## import some stuff ##
# In[246]:
import numpy as np
import scipy as sc
from pandas import Series, DataFrame
import pandas as pd
from scipy import spatial
from sklearn import preprocessing
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
from collections import OrderedDict
from fractions import Fraction
# render plots inline when running inside a Jupyter/IPython notebook
get_ipython().run_line_magic('matplotlib', 'inline')
mpl.rcParams['figure.figsize'] = (10.0, 5)
# # Part 1 #
# ## Collaborative filtering item-item ##
#
#
# This notebook is an implementation of the item-item collaborative filtering algorithm in Python.
# The missing rating of User5 for Hotel1 is going to be predicted.
# Recommendations are made based on these calculations.
#
# Have fun…
# In[247]:
# ratings from 1 to 5; 0 means that no rating is available
df = pd.DataFrame({'Hotel1': [1,0,3,0,0,5,0,0,5,0,4,0],
                   'Hotel2': [0,0,5,4,0,0,4,0,0,2,1,3],
                   'Hotel3': [2,4,0,1,2,0,3,0,4,3,5,0],
                   'Hotel4': [0,2,4,0,5,0,0,4,0,0,2,0],
                   'Hotel5': [0,0,4,3,4,2,0,0,0,0,2,5],
                   'Hotel6': [1,0,3,0,3,0,0,2,0,0,4,0],
                  }, index=['User1','User2','User3','User4','User5',
                            'User6','User7','User8','User9','User10','User11','User12'])
# transpose so that rows are hotels and columns are users
df = df.transpose()
df
# In[248]:
# check if users have rated enough hotels (enough support) to be able to make predictions
df.transpose().plot.barh(stacked=True)
# In[249]:
# find 0 values (missing ratings)
no_rating_mask = (df == 0)
no_rating_mask
# In[250]:
#comes after
#df[no_rating_mask] = None
#df
# In[251]:
# possibility 2 to find hotel rating mean values (missing ratings are ignored)
hotel_rating_averages = df[~no_rating_mask].mean(axis=1)
hotel_rating_averages
# In[252]:
# normalise dataset: subtract each hotel's mean rating from its row
dfn = df.sub(hotel_rating_averages, axis=0)
dfn = dfn.round(1)
dfn
# In[253]:
# put 0 values where no rating was found
dfn[no_rating_mask] = 0
# and round values
dfn = dfn.round(1)
# In[254]:
dfn
# In[255]:
# inspect hotel similarities
sns.pairplot(dfn.transpose())
# In[256]:
# we could also plot the hotel rating vectors
soa = dfn.transpose().values
print(list(zip(*soa)))
X, Y, U, V, Z, E = zip(*soa)
plt.figure()
ax = plt.gca()
# quiver accepts at most five positional arrays (X, Y, U, V, C), so the
# Hotel5 column (Z) is used as the arrow color and Hotel6 (E) is not drawn
ax.quiver(X, Y, U, V, Z, angles='xy', scale_units='xy', scale=1)
ax.set_xlim([-1, 4])
ax.set_ylim([-1, 1])
plt.draw()
plt.show()
# In[257]:
# pearson correlation similarity
# option 1
hotel_similarity_df = dfn.transpose().corr().round(2)
hotel_similarity_df
# In[258]:
sns.heatmap(hotel_similarity_df, annot=True)
# In[259]:
# we could also calculate hotel similarities this way
from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse
# DataFrame.as_matrix() was removed from pandas; .values returns the same array
A_sparse = sparse.csr_matrix(dfn.values)
# cosine_similarity can also output sparse matrices
similarities_sparse = cosine_similarity(A_sparse, dense_output=False)
print('hotel pairwise similarity:\n {}\n'.format(similarities_sparse))
# # now let's calculate how User5 would rate Hotel1
# In[260]:
# Hotel1 is most similar to Hotels 3 and 6
mask = hotel_similarity_df["Hotel1"] > 0.30
mask
# In[261]:
# take User5's ratings of the most similar hotels (3 and 6);
# [1:] skips Hotel1 itself, which is the first row selected by the mask
hotel_ratings = df.User5[mask].values[1:]
hotel_ratings
# In[262]:
# take similarities of most similar hotels (3 and 6)
hotel_sim = hotel_similarity_df.Hotel1[mask].values[1:]
hotel_sim
# In[263]:
# calculate rating for hotel 1 from user 5
# predict by taking weighted average
r_15 = sum(hotel_ratings * hotel_sim) / sum(hotel_sim)
print("User 5 would rate Hotel 1 with: {} stars".format(round(r_15, 1)))