Have you ever asked yourself how google ranks the pages when you search something on google.com?
If yes, have a look at PageRank algorithm definition. I’ll not go into much details here, but to give you an idea, the World Wide Web can be seen as a large graph, consisting of pages as nodes and links as edges between those nodes. The graph looks something like this.
The webpage relevance is measured by the number of incoming links that lead to it. Furthermore, the relevance increases if the incoming link to the page comes from a well known page like e.g. Wall Street Journal.
btw… If you would like to go deeper into the topic of big data mining, find out more about this algorithm, and many others, check out this book! Mining of Massive Datasets. It is the single best source for big data mining and machine learning for massive datasets.
Graphs can be represented as matrices, thus the WWW(all webpages and links in between) can also be represented as a matrix. This WWW matrix is then multiplied by initial vector which consists of default probabilities of an user landing to every single webpage at any given moment. This process is repeated until the resulting vector, with rank for every single page, converges. It is usually between 50 and 100 times.
‘Nuff said. Here’s the code…
# import some stuff import numpy as np import scipy as sc import pandas as pd from fractions import Fraction # keep it clean and tidy def float_format(vector, decimal): return np.round((vector).astype(np.float), decimals=decimal) # we have 3 webpages and probability of landing to each one is 1/3 #(defaultProbability) dp = Fraction(1,3) # WWW matrix M = np.matrix([[0,0,1], [Fraction(1,2),0,0], [Fraction(1,2),1,0]]) E = np.zeros((3,3)) E[:] = dp # taxation beta = 0.7 # WWW matrix A = beta * M + ((1-beta) * E) # initial vector r = np.matrix([dp, dp, dp]) r = np.transpose(r) previous_r = r for it in range(1,100): r = A * r print float_format(r,3) #check if converged if (previous_r==r).all(): break previous_r = r print "Final:\n", float_format(r,3) print "sum", np.sum(r)
The output would be:
[[ 0.333] [ 0.217] [ 0.45 ]] [[ 0.415] [ 0.217] [ 0.368]] [[ 0.358] [ 0.245] [ 0.397]] . . . [[ 0.378] [ 0.225] [ 0.397]] [[ 0.378] [ 0.232] [ 0.39 ]] [[ 0.373] [ 0.232] [ 0.395]] [[ 0.376] [ 0.231] [ 0.393]] [[ 0.375] [ 0.232] [ 0.393]] Final: [[ 0.375] [ 0.232] [ 0.393]] sum 1.0
The output tells us that the third page would be ranked first, first page ranked second and the second page third as a result of your search.
Long story short, it seems all nice and simple, but try to multiply over billion pages vector with billion x billion matrix for 50 times 🙂 That’s one of the main reasons why google became Google at first place.
Enjoy the source and have fun. Cheers!