Market Basket Analysis – Mining Frequent Pairs in Python

Have you ever asked yourself how the store managers decide on product shelf placement in retail stores? There must be some strategy behind it, right? It can’t be just a random choice. Almost on daily basis, you receive product purchase recommendations from variety of sources where you have left your “digital fingerprint”. In many cases these recommendations make sense, what leaves you puzzled, how did they figured it out?

The Market Basket Analysis is perhaps the most famous method in Association Mining techniques arsenal. It’s all about finding frequent pairs, triples, quadruples of products from historical transactions or market baskets.

In this post I’ll show you small example how to implement Market Basket Analysis in Python. So let’s start…

btw… If you would like to go deeper into the topic of big data mining, find out more about this algorithm, and many others, check out this book! Mining of Massive Datasets. It is the single best source for big data mining and machine learning for massive datasets.

 

The picture below visualizes our historical transactions. For the sake of simplicity, there are only 8 transactions, from 0 to 7. Transaction number 2 implies the market basket containing Balsamico, Mozzarella and Wine. Basket number 5 contains only Pasta and Wine.

dzenan-hamzic-basket-analysis-products

Each basket is a vector of bit values. Bit stands for 1 if the column product is in the basket, and 0 otherwise.

transaction_df = pd.DataFrame({'Beer' : [1,0,1,0,1,0,0,1],
                               'Coke' : [0,1,1,0,1,0,0,1],
                               'Pepsi' : [1,0,0,1,0,0,1,0],
                               'Milk' : [0,1,0,1,1,1,0,1],
                               'Juice' : [0,0,1,0,0,1,1,1]})

To ensure quality of the basket mining, it is important that every product, or at least most of them, is evenly distributed across all baskets. This feature of the transactional dataset is called product support. Let’s plot basket’s product support histogram.

# calculate occurrence(support) for every product in all transactions
product_support_dict = {}
for column in transaction_df.columns:
    product_support_dict[column] = sum(transaction_df[column]>0)

# visualise support
pd.Series(product_support_dict).plot(kind="bar")

dzenan-hamzic-product-support

We can see that every product in our transaction matrix is evenly distributed. That’s a good, and now we can start with frequent pairs mining.

Although this analysis can be implemented directly with pandas dataframe object, I choose to work with numpy matrices.  I find that working with matrices is more natural than working with dataframes.
What the code below is doing, is basically taking every product’s vector and multiplying it with every other products vector. The final vector product of that multiplication indicates matches between products. Those matches indicate pairs. Summation of the matches gives us the pair frequency. The higher the number, the more frequent is the pair. The results are saved in new matrix (frequent_item_matrix).


# take data matrix from dataframe
transaction_matrix = transaction_df.as_matrix()
# get number of rows and columns
rows, columns = transaction_matrix.shape
# init new matrix
frequent_items_matrix = np.zeros((5,5))
# compare every product with every other
for this_column in range(0, columns-1):
    for next_column in range(this_column + 1, columns):
        # multiply product pair vectors
        product_vector = transaction_matrix[:,this_column] * transaction_matrix[:,next_column]
        # check the number of pair occurrences in baskets
        count_matches = sum((product_vector)>0)
        # save values to new matrix
        frequent_items_matrix[this_column,next_column] = count_matches

#print frequent_items_matrix
plot_matrix(frequent_items_matrix)

The resulting matrix looks like this:

dzenan-hamzic-frequent-pairs-matrix-raw

Now, we only need to combine that resulting matrix with product names and we can get some meaningful information.

# and finally combine product names with data
frequent_items_df = pd.DataFrame(frequent_items_matrix, columns = transaction_df.columns.values, index = transaction_df.columns.values)

import seaborn as sns
# and plot
sns.heatmap(frequent_items_df)

dzenan-hamzic-frequent-pairs-matrix

And voila! We have our most frequent pairs. Who would have said that the Balsamico&Mozzarella and Balsamico&Pasta are the most frequent pairs 😉

Now, although we have the results, we could visualize the results differently or extract frequent pairs in another datastructure.
The code below extracts frequent pairs into new dictionary with pair as key and frequency as value.

# extract product pairs with minimum frequency(treshold) basket occurrences
def extract_pairs(treshold):
    output = {}
    # select indexes with larger or equal n
    matrix_coord_list = np.where(frequent_items_matrix >= treshold)
    # take values
    row_coords = matrix_coord_list[0]
    column_coords = matrix_coord_list[1]
    # generate pairs
    for index, value in enumerate(row_coords):
        #print index
        row = row_coords[index]
        column = column_coords[index]
        # get product names
        first_product = product_names[row]
        second_product = product_names[column]
        # number of basket matches
        matches = frequent_items_matrix[row,column]
        # put key values into dict
        output[first_product+"-"+second_product] = matches

    # return sorted dict
    sorted_output = OrderedDict(sorted(output.items(), key=lambda x: x[1]))
    return sorted_output

# plot pairs with minimum frequency of 1 basket matches
pd.Series(extract_pairs(1)).plot(kind="barh")

dzenan-hamzic-frequent-pairs-barplot

So there we have our pairs 🙂

Please note that this algorithm has execution time near O(n^2), or N over 2 pair combinations, and needs almost as much space, thus not suitable for mining frequent associations with large number of products. Check the Apriori algorithm for implementation with large data sets.

Here’s the full source.


# coding: utf-8
# In[41]:
import numpy as np
import scipy as sc
from pandas import Series,DataFrame
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
from collections import OrderedDict
from fractions import Fraction
get_ipython().magic(u'matplotlib inline')
mpl.rcParams['figure.figsize'] = (10.0, 5)
# In[42]:
# nice to have
def plot_matrix(matrix):
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(matrix, interpolation='nearest')
fig.colorbar(cax)
# In[106]:
transaction_df = pd.DataFrame({'Mozzarella' : [1,0,1,0,1,0,0,1],
'Balsamico' : [0,1,1,0,1,0,0,1],
'Pepsi' : [1,0,0,1,0,0,1,0],
'Pasta' : [0,1,0,1,1,1,0,1],
'Wine' : [0,0,1,0,0,1,1,1]})
transaction_df
# In[107]:
# calculate support for every product in all transactions
product_support_dict = {}
for column in transaction_df.columns:
product_support_dict[column] = sum(transaction_df[column]>0)
# visualise support
pd.Series(product_support_dict).plot(kind="bar")
# In[108]:
transaction_matrix = transaction_df.as_matrix()
transaction_matrix
# In[109]:
bool_index = (transaction_matrix>0)
bool_index
# In[110]:
plot_matrix(transaction_matrix)
# In[111]:
# get number of rows and columns
rows, columns = transaction_matrix.shape
# init new matrix
frequent_items_matrix = np.zeros((5,5))
# compare every product with every other
for this_column in range(0, columns1):
print "this:", this_column,":",transaction_df.columns[this_column]
for next_column in range(this_column + 1, columns):
print "\tnext:", next_column,":",transaction_df.columns[next_column]
# multiply product pair vectors
product_vector = transaction_matrix[:,this_column] * transaction_matrix[:,next_column]
# check the number of pair occurrences in baskets
count_matches = sum((product_vector)>0)
print "\t", count_matches
# save values to new matrix
frequent_items_matrix[this_column,next_column] = count_matches
# In[112]:
print frequent_items_matrix
# In[113]:
plot_matrix(frequent_items_matrix)
# In[114]:
# combine matrix with names
frequent_items_df = pd.DataFrame(frequent_items_matrix, columns = transaction_df.columns.values, index = transaction_df.columns.values)
sns.heatmap(frequent_items_df)
# In[115]:
product_names = transaction_df.columns.values
# extract product pairs with minimum frequency(treshold) basket occurrences
def extract_pairs(treshold):
output = {}
# select indexes with larger or equal n
matrix_coord_list = np.where(frequent_items_matrix >= treshold)
# take values
row_coords = matrix_coord_list[0]
column_coords = matrix_coord_list[1]
# generate pairs
for index, value in enumerate(row_coords):
#print index
row = row_coords[index]
column = column_coords[index]
# get product names
first_product = product_names[row]
second_product = product_names[column]
# number of basket matches
matches = frequent_items_matrix[row,column]
# put key values into dict
output[first_product+"-"+second_product] = matches
# return sorted dict
sorted_output = OrderedDict(sorted(output.items(), key=lambda x: x[1]))
return sorted_output
# plot pairs with minimum frequency of 2 basket matches
min_frequency = 1
ax = pd.Series(extract_pairs(min_frequency)).plot(kind="barh", title="Frequent Pairs with Frequency >= " + str(min_frequency))

You can also find the source @https://github.com/dzenanh/mmds.

‘Nuff said. Enjoy the source. Cheers!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: