Bloom Filter Example in Python

Jan 5, 2017

Algorithms, Big Data, Data Mining, Data Science, Mining of Massive Datasets, Python, Source code

#datamining, #datascience, Algorithm, Bloom, Python

The title might also have been, “how to reduce 10 Gb of data to 1 single Megabyte”. BigData is only going to get bigger in the future. Our challenge, among others, is to find efficient methods and algorithms to (quickly) deal with wast amounts of data, extract meaningful information and to find ways how to efficiently persist all these data.

That is where the Bloom filter comes in…

The Bloom Filter is a powerful method of “reducing” data to single bits what makes it possible to store large amounts of information in only few megabytes. I’m talking here of a scale of 10.000 times reduction in size. Furthermore, since the bloom filter is basically one binary array, combined with an hash function, the search is in O(1), if hashing is implemented right. That makes it very useful for new IoT devices with small memory amounts, as well as for tiny PC’s like Rasperry PI and alternatives. The application scope doesn’t end there, the bloom filter can also be used for high-load backends, windows in streaming application, caching engines and many more…

In this post i’ll show you how to implement simple, yet powerful, spam filter in python using bloom filter.

btw… If you would like to go deeper into the topic of big data mining, find out more about this algorithm, and many others, check out this book! Mining of Massive Datasets. It is the single best source for big data mining and machine learning for massive datasets.

Let’s say we have a list of spammer emails from which we don’t want to accept any incoming data. Here is one small example…

# example list of spam emails
email_list = ["spammer1@gmail.com", "spammer2@viagra.com",
              "spammer2@gmail.com","imnotAspammer@gmail.com",
              "checkmywebsite@hotmail.com", "dark_net@yougotme.com"]

But the problem is we can’t store millions and millions of spammer emails, while storage on our device is very limited, and we would need gigabytes of memory to store such amounts. No Problem, we can store all those email to bit array, where each bit corresponds to one email.

blog_drawing
The array of bits below is a bloom filter for the spam emails above.

('11101100001')

Bloom filter is simpler as you might think. I’ll try to explain it on spamming filter example.

Step 1: For every email in your spam-email list:
– calculate the sum of unicode values of each character in that email.

# calculate unicode sum of characters for every email-address
email_unicode_sum_list = [sum([ord(char) for char in email]) for email in email_list]
email_unicode_sum_list
Out[244]:
[1757, 1870, 1758, 2324, 2674, 2158]

Step 2: Select two random numbers (a and b) up to the maximum value of unicode value from the email_unicode_sum_list. These numbers are latter used as coefficients in blooms hash function ( h_x = (a*x + b) % c ). The parameter c is the hashtable size @see next step.

# choose 2 random numbers up to maximum unicode value
# and set them to be hashing coefficients
a = random.randint(1, max(email_unicode_sum_list)-1)
b = random.randint(2, max(email_unicode_sum_list)-1)

Step 3. Initialize bloomfilter bitarray, which is basically one hashtable, with size of next prime number of the length of your spam email list. We are choosing the next prime number so that our hash-table(blooms bitarray) has at most 0.75 load factor, in order to prevent collisions.

# since we have fixed size of input, the hashtable could have 1.0 load factor
# yet, 0.75 load has much less colisions
# choose next prime number from the number of elements in email list
hashtable_size = find_next_prime(int(round(len(email_unicode_sum_list)*1.25)))
hashtable_size
Out[246]:
11

The size of our bloom filter is going to be 11 bits.

# initialise bloom filter
bloom_filter = bitarray(hashtable_size)
# set all bits to 0
bloom_filter[:] = False
print "Bloom Filter:", bloom_filter

Bloom Filter: bitarray(‘00000000000’)

As you can see, at the moment our filter is empty (all positions in the array are False)

Step 3: Go over each value in email_unicode_sum_list list, and do hashing of the unicode value and set True for given hash index.

# hash emails
hash_list = [my_hash(a,b,unicode_sum,hashtable_size) for unicode_sum in email_unicode_sum_list]

# update bloom filter
[update_bloom(index, bloom_filter) for index in hash_list]
# check updated bloom filter
bloom_filter

So our spam emails are stored in the following bit-array or bloom filter:
‘01110010110’.

For checking if the email is spam or not, you just hash an incoming mail and check if the corresponding index contains True or False.

Here’s an overview of unicode values and bloom filter indexes for every email in our spam list.

[’email’->’unicode sum’, ->hash index]
[(‘spammer1@gmail.com’, ‘->’, 1757, ‘->’, 6),
(‘spammer2@viagra.com’, ‘->’, 1870, ‘->’, 2),
(‘spammer2@gmail.com’, ‘->’, 1758, ‘->’, 1),
(‘imnotAspammer@gmail.com’, ‘->’, 2324, ‘->’, 9),
(‘checkmywebsite@hotmail.com’, ‘->’, 2674, ‘->’, 8),
(‘dark_net@yougotme.com’, ‘->’, 2158, ‘->’, 3)]

So there you have it. Now you can install your email spam-filter service on your wrist watch 😉

‘Nuff said. Here’s the code.

	# coding: utf-8

	# In[259]:
	import numpy as np
	from bitarray import bitarray
	import random


	# In[260]:

	## define some helper methods

	# @param ca,cb: hashing coefficients
	# @param val: value to be hashed
	# @param prime: size of hash table
	# @return: hash
	def my_hash(ca, cb, val, prime): return (a*val + b) % prime

	# set True to @position in @bitar:bitarray
	def update_bloom(position, bitarr): bitarr[position] = 1

	# find next prime number
	def find_next_prime(n): return find_prime_in_range(n, 2*n)

	def find_prime_in_range(a, b):
	for p in range(a, b):
	for i in range(2, p):
	if p % i == 0:
	break
	else:
	return p
	return None

	# test
	print find_next_prime(19+1)


	# In[261]:

	# example list of spam emails
	email_list = ["spammer1@gmail.com", "spammer2@viagra.com",
	"spammer2@gmail.com","imnotAspammer@gmail.com",
	"checkmywebsite@hotmail.com", "dark_net@yougotme.com"]


	# In[262]:

	# initialise bloom filter
	bloom_filter = bitarray(hashtable_size)
	# set all bits to 0
	bloom_filter[:] = False
	print "Bloom Filter:", bloom_filter


	# In[263]:

	# calculate unicode sum of characters for every email-address
	email_unicode_sum_list = [sum([ord(char) for char in email]) for email in email_list]
	email_unicode_sum_list


	# In[264]:

	#The coefficients a and b are randomly chosen integers less than the maximum value of x.
	#c is a prime number slightly bigger than the maximum value of x.

	# h_x = (a*x + b)%c

	# choose 2 random numbers up to maximum unicode value
	# and set them to be hashing coefficients
	a = random.randint(1, max(email_unicode_sum_list)-1)
	b = random.randint(2, max(email_unicode_sum_list)-1)

	print a,b


	# In[265]:

	# since we have fixed size of input, the hashtable could have 1.0 load factor
	# yet, 0.75 load has much less colisions
	# choose next prime number from the number of elements in email list
	hashtable_size = find_next_prime(int(round(len(email_unicode_sum_list)*1.25)))
	hashtable_size


	# In[266]:

	# hash emails
	hash_list = [my_hash(a,b,unicode_sum,hashtable_size) for unicode_sum in email_unicode_sum_list]


	# In[267]:

	# assign hash_values & unicode-sums to emails
	[(email,"->",unicode_sum, "->", hashp) for email, unicode_sum, hashp in zip(email_list, email_unicode_sum_list, hash_list)]


	# In[269]:

	# update bloom filter
	[update_bloom(index, bloom_filter) for index in hash_list]
	# check updated bloom filter
	bloom_filter


	# In[271]:

	## now do some tests
	# for every email in the list check if it has already been seen by bloom filter
	for email in email_list:
	if bloom_filter[my_hash(a,b,sum([ord(char) for char in email]),hashtable_size)] == True:
	print email, "has ALREADY been seen";
	else:
	print email, "has NOT been seen";


	# In[ ]:

view raw

BloomFilter.py

hosted with ❤ by GitHub

You can also find the source @https://github.com/dzenanh/mmds.

Enjoy the source. Cheers!

Bloom Filter Example in Python

Share this:

Leave a comment Cancel reply