The title might also have been, “how to reduce 10 Gb of data to 1 single Megabyte”. BigData is only going to get bigger in the future. Our challenge, among others, is to find efficient methods and algorithms to (quickly) deal with wast amounts of data, extract meaningful information and to find ways how to efficiently persist all these data.

That is where the Bloom filter comes in…

The Bloom Filter is a powerful method of “reducing” data to single bits what makes it possible to store large amounts of information in only few megabytes. I’m talking here of a scale of 10.000 times reduction in size. Furthermore, since the bloom filter is basically one binary array, combined with an hash function, the search is in O(1), if hashing is implemented right. That makes it very useful for new IoT devices with small memory amounts, as well as for tiny PC’s like Rasperry PI and alternatives. The application scope doesn’t end there, the bloom filter can also be used for high-load backends, windows in streaming application, caching engines and many more…

In this post i’ll show you how to implement simple, yet powerful, spam filter in python using bloom filter.

*btw… If you would like to go deeper into the topic of big data mining, find out more about this algorithm, and many others, check out this book! Mining of Massive Datasets. It is the single best source for big data mining and machine learning for massive datasets.*

Let’s say we have a list of spammer emails from which we don’t want to accept any incoming data. Here is one small example…

# example list of spam emails email_list = ["spammer1@gmail.com", "spammer2@viagra.com", "spammer2@gmail.com","imnotAspammer@gmail.com", "checkmywebsite@hotmail.com", "dark_net@yougotme.com"]

But the problem is we can’t store millions and millions of spammer emails, while storage on our device is very limited, and we would need gigabytes of memory to store such amounts. No Problem, we can store all those email to bit array, where each bit corresponds to one email.

The array of bits below is a bloom filter for the spam emails above.

('11101100001')

Bloom filter is simpler as you might think. I’ll try to explain it on spamming filter example.

**Step 1**: For every email in your spam-email list:

– calculate the sum of unicode values of each character in that email.

# calculate unicode sum of characters for every email-address email_unicode_sum_list = [sum([ord(char) for char in email]) for email in email_list] email_unicode_sum_list Out[244]: [1757, 1870, 1758, 2324, 2674, 2158]

**
Step 2**: Select two random numbers (a and b) up to the maximum value of unicode value from the email_unicode_sum_list. These numbers are latter used as coefficients in blooms hash function ( h_x = (a*x + b) % c ). The parameter c is the hashtable size @see next step.

# choose 2 random numbers up to maximum unicode value # and set them to be hashing coefficients a = random.randint(1, max(email_unicode_sum_list)-1) b = random.randint(2, max(email_unicode_sum_list)-1)

**
Step 3**. Initialize bloomfilter bitarray, which is basically one hashtable, with size of next prime number of the length of your spam email list. We are choosing the next prime number so that our hash-table(blooms bitarray) has at most 0.75 load factor, in order to prevent collisions.

# since we have fixed size of input, the hashtable could have 1.0 load factor # yet, 0.75 load has much less colisions # choose next prime number from the number of elements in email list hashtable_size = find_next_prime(int(round(len(email_unicode_sum_list)*1.25))) hashtable_size Out[246]: 11

The size of our bloom filter is going to be 11 bits.

# initialise bloom filter bloom_filter = bitarray(hashtable_size) # set all bits to 0 bloom_filter[:] = False print "Bloom Filter:", bloom_filter

Bloom Filter: bitarray(‘00000000000’)

As you can see, at the moment our filter is empty (all positions in the array are False)

Step 3: Go over each value in email_unicode_sum_list list, and do hashing of the unicode value and set True for given hash index.

# hash emails hash_list = [my_hash(a,b,unicode_sum,hashtable_size) for unicode_sum in email_unicode_sum_list] # update bloom filter [update_bloom(index, bloom_filter) for index in hash_list] # check updated bloom filter bloom_filter

So our spam emails are stored in the following bit-array or bloom filter:

‘01110010110’.

For checking if the email is spam or not, you just hash an incoming mail and check if the corresponding index contains True or False.

Here’s an overview of unicode values and bloom filter indexes for every email in our spam list.

[’email’->’unicode sum’, ->hash index]

[(‘spammer1@gmail.com’, ‘->’, 1757, ‘->’, 6),

(‘spammer2@viagra.com’, ‘->’, 1870, ‘->’, 2),

(‘spammer2@gmail.com’, ‘->’, 1758, ‘->’, 1),

(‘imnotAspammer@gmail.com’, ‘->’, 2324, ‘->’, 9),

(‘checkmywebsite@hotmail.com’, ‘->’, 2674, ‘->’, 8),

(‘dark_net@yougotme.com’, ‘->’, 2158, ‘->’, 3)]

So there you have it. Now you can install your email spam-filter service on your wrist watch 😉

‘Nuff said. Here’s the code.

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.

Learn more about bidirectional Unicode characters

# coding: utf-8 | |

# In[259]: | |

import numpy as np | |

from bitarray import bitarray | |

import random | |

# In[260]: | |

## define some helper methods | |

# @param ca,cb: hashing coefficients | |

# @param val: value to be hashed | |

# @param prime: size of hash table | |

# @return: hash | |

def my_hash(ca, cb, val, prime): return (a*val + b) % prime | |

# set True to @position in @bitar:bitarray | |

def update_bloom(position, bitarr): bitarr[position] = 1 | |

# find next prime number | |

def find_next_prime(n): return find_prime_in_range(n, 2*n) | |

def find_prime_in_range(a, b): | |

for p in range(a, b): | |

for i in range(2, p): | |

if p % i == 0: | |

break | |

else: | |

return p | |

return None | |

# test | |

print find_next_prime(19+1) | |

# In[261]: | |

# example list of spam emails | |

email_list = ["spammer1@gmail.com", "spammer2@viagra.com", | |

"spammer2@gmail.com","imnotAspammer@gmail.com", | |

"checkmywebsite@hotmail.com", "dark_net@yougotme.com"] | |

# In[262]: | |

# initialise bloom filter | |

bloom_filter = bitarray(hashtable_size) | |

# set all bits to 0 | |

bloom_filter[:] = False | |

print "Bloom Filter:", bloom_filter | |

# In[263]: | |

# calculate unicode sum of characters for every email-address | |

email_unicode_sum_list = [sum([ord(char) for char in email]) for email in email_list] | |

email_unicode_sum_list | |

# In[264]: | |

#The coefficients a and b are randomly chosen integers less than the maximum value of x. | |

#c is a prime number slightly bigger than the maximum value of x. | |

# h_x = (a*x + b)%c | |

# choose 2 random numbers up to maximum unicode value | |

# and set them to be hashing coefficients | |

a = random.randint(1, max(email_unicode_sum_list)–1) | |

b = random.randint(2, max(email_unicode_sum_list)–1) | |

print a,b | |

# In[265]: | |

# since we have fixed size of input, the hashtable could have 1.0 load factor | |

# yet, 0.75 load has much less colisions | |

# choose next prime number from the number of elements in email list | |

hashtable_size = find_next_prime(int(round(len(email_unicode_sum_list)*1.25))) | |

hashtable_size | |

# In[266]: | |

# hash emails | |

hash_list = [my_hash(a,b,unicode_sum,hashtable_size) for unicode_sum in email_unicode_sum_list] | |

# In[267]: | |

# assign hash_values & unicode-sums to emails | |

[(email,"->",unicode_sum, "->", hashp) for email, unicode_sum, hashp in zip(email_list, email_unicode_sum_list, hash_list)] | |

# In[269]: | |

# update bloom filter | |

[update_bloom(index, bloom_filter) for index in hash_list] | |

# check updated bloom filter | |

bloom_filter | |

# In[271]: | |

## now do some tests | |

# for every email in the list check if it has already been seen by bloom filter | |

for email in email_list: | |

if bloom_filter[my_hash(a,b,sum([ord(char) for char in email]),hashtable_size)] == True: | |

print email, "has ALREADY been seen"; | |

else: | |

print email, "has NOT been seen"; | |

# In[ ]: | |

You can also find the source @https://github.com/dzenanh/mmds.

Enjoy the source. Cheers!