Recommender System in Python

Amazon and Netflix reportedly make almost 50% of their revenue by recommending appropriate products (books, movies) to their users. But how do they know what to recommend? They draw on the collective behavior of all their other users and items.

By the way, if you would like to go deeper into big data mining and find out more about this algorithm and many others, check out the book Mining of Massive Datasets. It is an excellent source for big data mining and machine learning on massive datasets.

Collaborative Filtering (CF) is probably the most popular method in recommender systems at the moment. To predict unknown values (ratings, stock prices etc.), this technique exploits collaboration among multiple data sets, users or data views.

CF has a very broad range of applications, from financial institutions and e-commerce to mineral exploration.

In this post I’ll explain how to implement a basic, yet powerful recommender system based on item-to-item collaborative filtering. In practice, this item-item approach outperforms user-user CF in many use cases, since items are “simpler” than users with varied tastes. As a result, item similarity tends to be more meaningful than user similarity.

We have a data set of hotel ratings given by users on a scale from 1 to 5. A value of 0 means that no rating is available.

cf-start-dataset

Table 1: Initial Data Set

Let’s assume User5 has not visited Hotel1, and thus left no rating for it. We want to find out whether we could recommend Hotel1 to User5.

In order to be able to make any predictions for a given user, there should be enough recommender support for that user, meaning that we should have enough ratings for other items available from that same user.

recommender-support

User-item support.

In our case the recommender support is 2, which means that the prediction is based on the 2 most similar items that the user has already rated. The diagram above shows that User5 has enough support (enough ratings for other items).
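A support check like the one described above could be sketched as follows. The ratings in `user_ratings` and the helper name `has_support` are hypothetical and only serve the illustration; they are not the values from Table 1.

```python
import pandas as pd

# Hypothetical ratings of one user (0 = no rating); NOT the values from Table 1.
user_ratings = pd.Series(
    {"Hotel1": 0, "Hotel2": 4, "Hotel3": 2, "Hotel4": 0, "Hotel5": 5, "Hotel6": 3}
)

def has_support(ratings, target_item, min_support=2):
    """Return True if the user rated at least `min_support` items other than the target."""
    others = ratings.drop(target_item)          # ignore the item we want to predict
    return int((others > 0).sum()) >= min_support

print(has_support(user_ratings, "Hotel1"))
```

With a support of 2, a user who has rated only one other hotel would be skipped, because a weighted average over a single neighbour just repeats that one rating.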

Basically, there are only two steps.

  1. For item (hotel) i, find other similar items.
  2. Estimate rating for item i based on ratings for similar items.

Before we get into those two steps, the hotel ratings must be normalized. We’ll apply item normalization, meaning that each rating is normalized by the mean rating of the hotel it belongs to. Here are the mean ratings of our hotels.


Hotel1    3.600000
Hotel2    3.166667
Hotel3    3.000000
Hotel4    3.400000
Hotel5    3.333333
Hotel6    2.600000
dtype: float64

Hotel-Item mean ratings.

Subtracting those mean values from the ratings gives the normalized data set, which looks like this:

normalized-df

Table 2: Normalized data set.
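The item normalization could look roughly like the sketch below. It uses a small hypothetical ratings table (not the one from Table 1) and assumes that hotels are rows, users are columns, and 0 entries are treated as missing so that they do not pull the means down.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings (rows = hotels, columns = users, 0 = no rating).
df = pd.DataFrame(
    {"User1": [5, 0, 3], "User2": [4, 2, 0], "User3": [0, 4, 3]},
    index=["Hotel1", "Hotel2", "Hotel3"],
)

ratings = df.replace(0, np.nan)        # assumption: 0 means "missing", not a rating
item_means = ratings.mean(axis=1)      # mean rating per hotel, missing values skipped
dfn = ratings.sub(item_means, axis=0)  # subtract each hotel's mean from its own row

print(item_means)
print(dfn)
```

After this step every hotel's known ratings are centered around 0, so a positive value means "better than this hotel's average" and a negative value means "worse".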

Now, back to the 2 steps.

Step 1: We need to find the hotels most similar to Hotel1. Here we use the Pearson correlation, although in this particular case we could also use cosine similarity. The Pearson correlation coefficient measures how strongly the ratings of two items increase or decrease “together”. In Python, you can calculate it like this:

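# Pearson correlation similarity
# option 1: dfn holds the normalized ratings with hotels as rows, so transpose
# it first; DataFrame.corr() correlates the columns pairwise
hotel_similarity_df = dfn.transpose().corr().round(2)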
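The link between the two measures can be checked with a small sketch: the Pearson correlation of two rating vectors equals the cosine similarity of the same vectors after mean-centering. The vectors `a` and `b` are hypothetical, and the equality holds in this simple form only when neither vector has missing ratings.

```python
import numpy as np

# Two hypothetical item rating vectors over the same four users (no missing values).
a = np.array([5.0, 3.0, 4.0, 2.0])
b = np.array([4.0, 2.0, 5.0, 1.0])

def cosine(u, v):
    """Cosine similarity of two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

pearson = float(np.corrcoef(a, b)[0, 1])
centered_cosine = cosine(a - a.mean(), b - b.mean())

print(round(pearson, 4), round(centered_cosine, 4))
```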

Inspecting the hotel correlations shows that two hotels are fairly similar to Hotel1: Hotel6 and Hotel3.

item-corr-inspection

Hotel similarity matrix.
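Instead of reading the neighbours off the matrix by eye, they can be selected in code. The following is a minimal sketch on hypothetical normalized ratings (not the values from Table 2); the helper name `most_similar` is an assumption made for this example.

```python
import pandas as pd

# Hypothetical normalized ratings (rows = hotels, columns = users).
dfn = pd.DataFrame(
    {
        "User1": [1.0, -1.0, 0.5, 1.0],
        "User2": [-1.0, 1.0, -0.5, 0.0],
        "User3": [0.5, -0.5, 1.0, -1.0],
        "User4": [-0.5, 0.5, -1.0, 0.0],
    },
    index=["Hotel1", "Hotel2", "Hotel3", "Hotel4"],
)

hotel_similarity_df = dfn.transpose().corr()

def most_similar(similarity, item, k=2):
    """Return the k items with the highest similarity to `item`, excluding itself."""
    return similarity[item].drop(item).nlargest(k)

neighbours = most_similar(hotel_similarity_df, "Hotel1")
print(neighbours.round(2))
```

For a real prediction the candidates should additionally be restricted to hotels that the target user has actually rated, since only those can contribute to the weighted average.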


Step 2: The predicted rating of Hotel1 by User5 is a weighted average. We multiply the similarity coefficients of Hotel3 (0.41) and Hotel6 (0.59) by User5’s ratings of those hotels (2 and 3, see Table 1) and divide the result by the sum of the two similarity coefficients (0.41 + 0.59).

prediction-formula

CF Formula.
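Written out, the weighted average takes the form below. The notation is assumed for this post: s_ij is the similarity between items i and j, r_xj is the rating of user x for item j, and N(i; x) is the set of items most similar to i that user x has rated.

$$
r_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij} \, r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}
$$

For User5 and Hotel1 this gives (0.41 · 2 + 0.59 · 3) / (0.41 + 0.59) = 2.59.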

In Python, it looks like this:

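import numpy as np

# calculate rating for Hotel1 from User5
hotel_sim = np.array([0.41, 0.59])  # similarity of Hotel3 and Hotel6 to Hotel1
hotel_ratings = np.array([2, 3])    # User5's ratings of Hotel3 and Hotel6 (Table 1)
# predict by taking the similarity-weighted average
r_15 = sum(hotel_ratings * hotel_sim) / sum(hotel_sim)
print("User 5 would rate Hotel 1 with:", round(r_15, 1), "stars")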

So basically it comes down to:
user5_hotel1_prediction = (0.41*2 + 0.59*3)/(0.41+0.59) = 2.59 ≈ 2.6

User5 would rate Hotel1 with 2.6 stars. Since this prediction is below Hotel1’s mean rating of 3.6, I would NOT recommend this hotel to User5.

But which hotels could we recommend to User5? Well, I leave that to you. You can use this notebook as a starting point for your research. More advanced recommender systems are implemented with SVD (Singular Value Decomposition) and CUR decomposition, but for many use cases the CF approach is enough to increase your sales by predicting items that a user might like.
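As a starting point for that exercise, here is one possible end-to-end sketch. It runs on a hypothetical ratings table rather than Table 1, and the function names `predict_rating` and `recommend` are made up for this example. It assumes that 0 means "not rated", that only positively correlated neighbours are used, and that a hotel is recommended when its predicted rating is above the hotel's mean rating.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings (rows = hotels, columns = users, 0 = no rating);
# these are NOT the values from Table 1.
df = pd.DataFrame(
    {
        "User1": [5, 4, 1, 5],
        "User2": [4, 5, 2, 3],
        "User3": [2, 1, 5, 2],
        "User4": [1, 2, 4, 2],
        "User5": [0, 4, 2, 5],
    },
    index=["Hotel1", "Hotel2", "Hotel3", "Hotel4"],
)

def predict_rating(df, user, item, k=2):
    """Item-item CF: similarity-weighted average of the user's own ratings."""
    ratings = df.replace(0, np.nan)                  # 0 means "not rated"
    dfn = ratings.sub(ratings.mean(axis=1), axis=0)  # item normalization
    sim = dfn.transpose().corr()[item].drop(item)    # Pearson similarity to `item`
    rated = ratings[user].dropna().index             # hotels this user has rated
    sim = sim[sim.index.isin(rated)].dropna()
    neighbours = sim[sim > 0].nlargest(k)            # assumption: positive neighbours only
    if neighbours.empty:
        return float("nan")                          # not enough support
    user_ratings = ratings.loc[neighbours.index, user]
    return float((neighbours * user_ratings).sum() / neighbours.sum())

def recommend(df, user, k=2):
    """Return unrated hotels whose predicted rating beats the hotel's mean rating."""
    item_means = df.replace(0, np.nan).mean(axis=1)
    unrated = df.index[df[user] == 0]
    predictions = {item: predict_rating(df, user, item, k) for item in unrated}
    return {item: p for item, p in predictions.items() if p > item_means[item]}

print(recommend(df, "User5"))
```

In this made-up table, Hotel1 is the only hotel User5 has not rated, and its predicted rating is well above the hotel's mean, so it would be recommended. Recomputing the similarity matrix inside `predict_rating` keeps the sketch short; for larger data sets it would be cheaper to compute it once and reuse it.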

Have fun and enjoy. Cheers!
