Google’s PageRank Algorithm in Python

Have you ever asked yourself how google ranks the pages when you search something on google.com?
If yes, have a look at PageRank algorithm definition. I’ll not go into much details here, but to give you an idea, the World Wide Web can be seen as a large graph, consisting of pages as nodes and links as edges between those nodes. The graph looks something like this.

The webpage relevance is measured by the number of incoming links that lead to it. Furthermore, the relevance increases if the incoming link to the page comes from a well known page like e.g. Wall Street Journal.

btw… If you would like to go deeper into the topic of big data mining, find out more about this algorithm, and many others, check out this book! Mining of Massive Datasets. It is the single best source for big data mining and machine learning for massive datasets.

Graphs can be represented as matrices, thus the WWW(all webpages and links in between) can also be represented as a matrix. This WWW matrix is then multiplied by initial vector which consists of default probabilities of an user landing to every single webpage at any given moment. This process is repeated until the resulting vector, with rank for every single page, converges. It is usually between 50 and 100 times.

‘Nuff said. Here’s the code…

# import some stuff
import numpy as np
import scipy as sc
import pandas as pd
from fractions import Fraction

# keep it clean and tidy
def float_format(vector, decimal):
    return np.round((vector).astype(np.float), decimals=decimal)

# we have 3 webpages and probability of landing to each one is 1/3
#(defaultProbability)
dp = Fraction(1,3)

# WWW matrix
M = np.matrix([[0,0,1],
        [Fraction(1,2),0,0],
        [Fraction(1,2),1,0]])

E = np.zeros((3,3))
E[:] = dp

# taxation
beta = 0.7

# WWW matrix
A = beta * M + ((1-beta) * E)

# initial vector
r = np.matrix([dp, dp, dp])
r = np.transpose(r)

previous_r = r
for it in range(1,100):
    r = A * r
    print float_format(r,3)
    #check if converged
    if (previous_r==r).all():
        break
    previous_r = r

print "Final:\n", float_format(r,3)
print "sum", np.sum(r)

The output would be:

[[ 0.333]
[ 0.217]
[ 0.45 ]]
[[ 0.415]
[ 0.217]
[ 0.368]]
[[ 0.358]
[ 0.245]
[ 0.397]]
.
.
.
[[ 0.378]
[ 0.225]
[ 0.397]]
[[ 0.378]
[ 0.232]
[ 0.39 ]]
[[ 0.373]
[ 0.232]
[ 0.395]]
[[ 0.376]
[ 0.231]
[ 0.393]]
[[ 0.375]
[ 0.232]
[ 0.393]]
Final:
[[ 0.375]
[ 0.232]
[ 0.393]]
sum 1.0

The output tells us that the third page would be ranked first, first page ranked second and the second page third as a result of your search.

Long story short, it seems all nice and simple, but try to multiply over billion pages vector with billion x billion matrix for 50 times 🙂 That’s one of the main reasons why google became Google at first place.

Enjoy the source and have fun. Cheers!

Category: Algorithms, Backend, Big Data, Data Science, Python, Source code
Tag: Algorithm, Google, Implementation, PageRank, Python

Published by dzhamzic

bits & notes View all posts by dzhamzic

Google’s PageRank Algorithm in Python – Following my Passion December 24, 2016 at 11:55 am

[…] Source: Google’s PageRank Algorithm in Python […]

LikeLiked by 1 person

Reply
Keith Safford January 18, 2017 at 3:38 pm

I think line 21 s/b E[:] = dp instead of E[:] = defaultProbability

LikeLiked by 1 person

Reply
1. dzhamzic January 19, 2017 at 10:57 am
  
  You are right Keith! Good eye. Thanks!
  
  LikeLiked by 1 person
  
  Reply

Share this:

Published by dzhamzic

Leave a comment Cancel reply