Bloom filter is a probabilistic data structure based on hashing. It can help us answer whether a key is present in a given set. The Bloom filter is based on a set of bits and hash functions. As opposed to a hash table we don’t store a key-value pair, we only store several bits that equal the number of hash functions. The Bloom filter cannot accurately answer the question of representing a key as a bit set because bits can match other keys. This image below must help you figure out the base idea.
First, we insert "key one” into our bit set and mark two bits as 1. Then we insert another key "key two” into our bit set and also mark other bits as 1. When we want to check for another key in our set of bits, we see that they are already populated, even though the key is not present in our filter. It is called a false positive occurrence.
Our implementation of the Bloom filter will take two arguments. First, an expected number of elements that the Bloom filter will store. Second, a probability of the false positive occurrence.
We can compute the optimal size of the bit set by the function -((n * ln(p)) / (ln(2) * ln(2)))
where n is the expected number of elements and p is the probability of a false positive occurrence.
For the calculation number of hash functions, we can use the function m / n * ln(2)
where m is the size of the bit set, and n is the expected number of elements.
type Bloom struct {
m uint32 // m is size of bitset
k uint32 // k is number of hash functions
bitset []bool // bitset represent an array of bits
hashes []hash.Hash32 // hashes is preset hash functions
}
// New create a new instance of the bloom filter
// n is expected number of elements
// p is false alarm probability
func New(n int, p float64) *Bloom {
m := M(n, p)
k := K(m, uint32(n))
bitset := make([]bool, m)
hashes := make([]hash.Hash32, k)
var i uint32
for ; i < k; i++ {
hashes[i] = murmur.New32WithSeed(i)
}
return &Bloom{
m: m,
k: k,
bitset: bitset,
hashes: hashes,
}
}
// M calculates size of bitset
// n is an expected number of elements
// p is a false alarm probability
func M(n int, p float64) uint32 {
return uint32(math.Ceil(-((float64(n) * math.Log(p)) / (math.Pow(math.Log(2), 2)))))
}
// K calculates number of hash functions
// m is a size of bitset
// n is an expected number of elements
func K(m, n uint32) uint32 {
return uint32(math.Ceil(math.Log(2) * float64(m) / float64(n)))
}
I used murmur3 hash functions. If you are interested in how this hash function works, you can read this Wikipedia article.
Our implementation of adding the key into the Bloom filter is very easy. We find all the positions of our key in the same way as in the hash table. We have to calculate a hash and take the remainder of the division by the bit set size.
// Add the key to the bloom filter
// k is the key of the element
func (b *Bloom) Add(k []byte) {
for _, h := range b.hashes {
h.Reset()
_, _ = h.Write(k)
idx := h.Sum32() % b.m
b.bitset[idx] = true
}
}
The contain operation is similar to the add operation we must check all possible occurrences have positive values. If all the occurrences are true we will return true otherwise false.
func (b *Bloom) Contains(k []byte) bool {
for _, h := range b.hashes {
h.Reset()
_, _ = h.Write(k)
idx := h.Sum32() % b.m
if !b.bitset[idx] {
return false
}
}
return true
}
We can union different Bloom filters, but we have a few restrictions. Our Bloom filters must have the same size and number of hash functions. Also, keep in mind that if the sum of the elements is greater than the expected sum, we will lose accuracy.
// Union two different bloom filters with the same size and number of hash functions
func (b *Bloom) Union(a *Bloom) (err error) {
if b.m != a.m {
return errors.New("the bloom filters have the different sizes")
}
if b.k != a.k {
return errors.New("the bloom filters have the different number of hash functions")
}
n := int(b.m)
for i := 0; i < n; i++ {
if b.bitset[i] || a.bitset[i] {
b.bitset[i] = true
}
}
return nil
}
I hope my article helps you figure out how the Bloom filter works.
The implementation of the Bloom filter in Github
The article about the Bloom filter in Wikipedia.