Shocker: Big Data derives its name not just from the size.
The datasets of Big Data are larger and more complex, and they come from new data sources. These massive volumes of data can be used to address business problems more intelligently. However, traditional data processing algorithms fall flat when faced with this magnitude.
Deterministic data structures like HashSet do the trick with smaller amounts of data. But when we have to deal with something like streaming applications, those structures cannot process everything in one pass or support incremental updates within a bounded amount of memory.
That is why we need faster and more space-efficient algorithms, and probabilistic data structures are a great fit for modern Big Data applications.
With that said, let’s have a look at probabilistic data structures and algorithms as well as their common uses.
Deterministic data structures are familiar to every techie: we often bump into Array, List, HashTable, HashSet, and the like. Structures such as HashTable and HashSet support a wide variety of operations, including insert, find, and delete (provided you have specific key values). Such operations return deterministic, accurate results.
Probabilistic data structures, however, live up to their name. They cannot give you a definite answer; instead, they give you a reasonable approximation of the answer, along with a way to estimate how far off that approximation might be.
These data structures are a great fit for large data sets. Typical operations include identifying unique or frequent items. To do so, probabilistic data structures use hash functions to randomize and compactly represent the items.
Because they ignore collisions, their size stays constant, but this is also the reason they cannot give you exact values. Generally, the more hash functions you use (up to a point), the more accurate the estimate.
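To make the contrast concrete, here is a quick illustration (the dataset and the numbers are made up): exact distinct counting with an ordinary Python set gives a precise answer, but its memory footprint grows with every new unique element, which is exactly what probabilistic structures avoid.

```python
import sys

# Exact distinct counting: every unique value has to be kept in memory.
seen = set()
for i in range(1_000_000):
    seen.add(f"user-{i}")

print(len(seen))            # 1000000 -- exact answer
print(sys.getsizeof(seen))  # tens of megabytes for the hash table alone; the strings cost extra
```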
The main use cases of probabilistic data structures include:

- membership testing (has this element been seen before?);
- cardinality estimation (how many distinct elements are there?);
- frequency estimation (how many times has an element occurred?).

Examples of probabilistic data structures are as follows:

- Bloom filter;
- HyperLogLog;
- Count-min sketch.
Let’s have a look at the most widely used data structures within this realm.
The Bloom filter is an implementation of a probabilistic set, invented by Burton Bloom in 1970. This approximate membership query structure allows you to compactly store elements and check whether a given element belongs to the set.
With a Bloom filter, you can get a false positive (the element is not in the set, but the data structure says it is), but never a false negative. A Bloom filter can use any amount of memory predefined by the user, and the larger it is, the lower the probability of a false positive.
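You can estimate that trade-off up front with the standard approximation for the false-positive rate, p ≈ (1 − e^(−kn/m))^k, where m is the number of bits, n the number of inserted elements, and k the number of hash functions. The snippet below is only an illustration with example numbers.

```python
from math import exp

def false_positive_rate(m_bits: int, n_items: int, k_hashes: int) -> float:
    """Standard approximation: p ~= (1 - e^(-k*n/m)) ** k."""
    return (1.0 - exp(-k_hashes * n_items / m_bits)) ** k_hashes

# Doubling the bit array (and tuning k) sharply lowers the false-positive probability.
print(false_positive_rate(m_bits=8_000_000, n_items=1_000_000, k_hashes=6))    # ~0.02
print(false_positive_rate(m_bits=16_000_000, n_items=1_000_000, k_hashes=11))  # ~0.0005
```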
Adding new elements to the set is supported; however, you can’t delete the existing ones.
A Bloom filter allows you to perform three kinds of operations:

- add an element to the set;
- check an element and get Found/Present, meaning the element is probably in the set;
- check an element and get Not Found/Not Present, meaning the element is definitely not in the set.
When the structure flags an element as Found/Present, there is a small chance that it’s lying. But in the Not Found/Not Present case, the Bloom filter is 100% accurate, on top of its space-saving perks.
Its popularity comes from a powerful combination of simplicity and versatility. In layman’s terms, Bloom filters support operations similar to those of hash tables but use far less space.
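To show how little machinery is involved, here is a minimal Bloom filter sketch in Python. It is not how any particular library implements it; the bit-array size, the number of hashes, and the double-hashing scheme are illustrative choices.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: add and possible-membership check only (no delete)."""

    def __init__(self, size_bits: int = 8_000_000, num_hashes: int = 6):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k positions from two independent 64-bit digests (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely not present"; True means "probably present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice@example.com")
print(bf.might_contain("alice@example.com"))  # True
print(bf.might_contain("bob@example.com"))    # False (almost certainly)
```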
Apache Cassandra, for example, benefits from these structures to process massive amounts of information. This storage system taps into Bloom filters to find out whether an SSTable has data for a specific partition.
HyperLogLog is a beautiful, yet simple algorithm for handling cardinality estimation. It excels when dealing with sets of data with a huge number of values.
Let’s say you have a massive dataset with duplicate entries, drawn from a set of cardinality n, and you need to find n, the number of unique elements in the set.
This can be helpful when, for instance, counting the distinct Google searches performed by end users in a day. If you try to squeeze all the data into memory, you’ll need storage proportional to the number of unique searches per day.
HyperLogLog instead hashes each element and keeps only a compact summary of those hashes that reflects the data’s cardinality, allowing it to solve the problem with as little as about 1.5 kB of RAM.
HyperLogLog produces an approximate count of distinct elements, and it is the algorithm behind functions such as APPROX_DISTINCT in some SQL engines.
Operations:

- add, which inserts a new element into the structure;
- count, which estimates the number of distinct elements seen so far;
- merge, which combines two HyperLogLog structures into one.
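Here is a simplified HyperLogLog-style estimator that shows the mechanics: each element is hashed, the first p bits of the hash pick a register, and the register keeps the longest run of leading zeros seen in the remaining bits. This sketch omits the small- and large-range corrections of the full algorithm, so treat it as an illustration rather than a production implementation.

```python
import hashlib

class HyperLogLog:
    """Simplified HyperLogLog: add, count (estimate), and merge. No bias corrections."""

    def __init__(self, p: int = 14):
        self.p = p
        self.m = 1 << p                       # number of registers (2^p)
        self.registers = [0] * self.m
        # Standard alpha constant for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def _hash(self, item: str) -> int:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")    # 64-bit hash value

    def add(self, item: str) -> None:
        x = self._hash(item)
        idx = x >> (64 - self.p)                    # first p bits choose the register
        rest = x & ((1 << (64 - self.p)) - 1)
        # Rank = position of the leftmost 1-bit in the remaining 64-p bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        # Harmonic mean of 2^register values, scaled by alpha * m^2.
        z = sum(2.0 ** -r for r in self.registers)
        return self.alpha * self.m * self.m / z

    def merge(self, other: "HyperLogLog") -> None:
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

hll = HyperLogLog()
for i in range(1_000_000):
    hll.add(f"query-{i}")
print(round(hll.count()))   # close to 1,000,000, estimated from just 2**14 registers
```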
The count-min sketch is another efficient algorithm for counting streaming data. This data structure boils down to keeping track of the counts of items. By using it, you can find out approximately how many times an element has appeared in the stream, and you can easily test whether a given element has been observed before.
Just like Bloom filters, the count-min sketch saves a lot of space by using probabilistic techniques. To implement the counting mechanism, it relies on several hash functions that map elements into a fixed grid of counters.
Overall, the count-min structure works great whenever you’re looking for just approximate counts of the most frequent elements.
Operations:

- update, which increments the counters for an element (insertion);
- estimate, which returns the approximate count of an element.
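The sketch below expresses that grid of counters in a few lines of Python; the width, depth, and seeded-hash scheme are arbitrary, illustrative choices. Because collisions only ever inflate counters, its estimates can overcount but never undercount.

```python
import hashlib

class CountMinSketch:
    """Minimal count-min sketch: update (increment) and estimate (query) only."""

    def __init__(self, width: int = 2000, depth: int = 5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # One seeded hash per row; each row maps the item to a different column.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def update(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate counters, so the minimum over rows is the
        # tightest (and still possibly too high) estimate.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))

cms = CountMinSketch()
for word in ["error", "error", "warning", "error", "info"]:
    cms.update(word)
print(cms.estimate("error"))    # 3 (may be higher if collisions occur)
print(cms.estimate("missing"))  # 0 or a small overcount
```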
As for the prominent applications, AT&T leverages the structure in network switches to analyze traffic in memory-constrained environments. The structure is also implemented as part of Twitter's Algebird library.
Working with Big Data is a challenge in itself, let alone digging up answers from it. In many cases, the best you can count on is an approximate answer. Probabilistic data structures allow you to conquer the beast by giving you an estimated view of certain data characteristics. You trade some accuracy in the results for enormous savings in storage space.