When shoppers went crazy for the good deals during Alibaba’s Singles’ Day sale, its engineers were facing the challenge of hotspot data hashing…
Access to hotspot data is an issue often encountered on distributed cache platforms, for example with seller promotions on Alibaba’s e-commerce platforms.
Tair is Alibaba’s in-house elastic cache platform, used extensively throughout the company. It enables high availability and uniform access to data, providing high performance, scalability and reliability. Tair has been tailored to reduce performance bottlenecks through unique storage and access mechanisms that tackle hotspot data identification, reading, and writing.
Tair is an elastic cache platform, a type of distributed cache platform developed for cloud computing.
A distributed cache stores data in a cluster. Data sets are distributed (or “partitioned”) across a series of nodes, with each node responsible for a portion of the cache data, while the cluster as a whole provides a uniform access interface.
The increased focus on distributed cache technology that came with the rapid development of cloud platforms led to the development of the elastic cache platform. Elastic platforms emphasize dynamic scalability (support for transparent service expansion) and high availability (ability of the cluster to cope with node failure).
Distributed cache platforms use data hashing for data fragmentation and routing.
Many distributed cache platforms apply the consistent hashing algorithm. The algorithm was originally conceived by researchers at MIT in 1997; Tair uses an improved version of it proposed by Amazon in 2007.
The consistent hashing algorithm divides the hashing space into equal-sized data partitions called virtual nodes. These are then stored on cache nodes; the number of virtual nodes must be greater than or equal to the number of cache nodes. Each cache node is assigned a different quantity of data partitions depending on its capacity.
A hash function maps the data key requested by a client to a location in the hash space; this location is called a token. The token value is then hash mapped again to obtain a partition identifier.
After obtaining the partition identifier, the client looks it up in the partition–server mapping table to find the cache node that holds the partition, and then accesses the data on that node.
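To make the routing concrete, here is a minimal sketch of the key → token → partition → DataServer lookup under a simple modulo-style hash. The class and method names are assumptions made for illustration, not Tair’s actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of key -> token -> partition -> DataServer routing.
// Class and method names are illustrative, not Tair's actual implementation.
public class ConsistentRouter {
    private final int partitionCount;          // number of virtual nodes (data partitions)
    private final List<String> partitionTable; // index: partition id -> DataServer address

    public ConsistentRouter(int partitionCount, List<String> dataServers) {
        this.partitionCount = partitionCount;
        this.partitionTable = new ArrayList<>(partitionCount);
        // Assign partitions to cache nodes round-robin; a real platform would
        // weight this assignment by each node's capacity.
        for (int p = 0; p < partitionCount; p++) {
            partitionTable.add(dataServers.get(p % dataServers.size()));
        }
    }

    // Hash the data key to a token, then map the token to a partition identifier.
    public int partitionOf(String key) {
        int token = key.hashCode();
        return Math.floorMod(token, partitionCount);
    }

    // Look up the DataServer that owns the partition in the mapping table.
    public String dataServerFor(String key) {
        return partitionTable.get(partitionOf(key));
    }
}
```

Note that the same key always resolves to the same partition, and therefore to the same DataServer.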
A limitation of this approach is that each individual data key invariably maps to a fixed DataServer. This causes issues when multiple clients need to access the same data key, as shown in the figure below:
A single DataServer node, therefore, becomes a bottleneck for writing and reading a single data key. This cannot be resolved by expanding nodes horizontally.
Alibaba constantly deals with hotspot data in the form of seller promotions. Improving the performance and reliability of the whole cache platform in terms of how hotspot data is managed is, therefore, a necessity.
Tair’s hotspot data hashing solution targets hotspot recognition, hotspot reading, and hotspot writing to allow for horizontal node expansion.
The read/write solutions rely on the server first being able to correctly identify hotspots. Tair also provides a corresponding client API that lets applications mark a specific data key, or the keys of a Namespace, as hotspot keys in advance, as sketched below.
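The sketch below illustrates what such a marking API might look like; the interface and method names are hypothetical and are not Tair’s published client API.

```java
// Hypothetical illustration of pre-marking hotspot keys; the interface and
// method names are assumptions, not Tair's published client API.
public interface HotspotAwareClient {
    // Mark a single data key in a namespace as a hotspot ahead of time.
    void markHotKey(int namespace, String key);

    // Mark all keys of a namespace as hotspot data.
    void markHotNamespace(int namespace);
}
```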
When the DataServer receives requests from a client, each worker thread processing these requests generates statistics. The data structures used by worker threads for hotspot statistics are thread-local (ThreadLocal mode) and completely lock-free.
The hotspot recognition algorithm uses a carefully designed data structure that combines a multi-level weighted LRU chain with a HashMap. It produces statistics over the full request stream and accurately recognizes both queries-per-second (QPS) hotspots and traffic hotspots, the latter occurring where the QPS is low but the data itself is large. This approach relies on the server being able to process requests with sufficient efficiency.
At the end of each sampling period, the worker thread transfers statistical data structures to the statistical thread pool for background analysis and processing. The task of producing statistics runs asynchronously in the background, without occupying the processing resources for normal data requests.
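As a rough illustration of this statistics path, the sketch below keeps per-thread, lock-free counters and hands each sampling period’s snapshot to a background thread. The multi-level weighted LRU that Tair actually uses is simplified here to a plain count map, and all names are assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Simplified sketch of lock-free, per-worker-thread request statistics.
// All names are illustrative; Tair's real structure is a multi-level
// weighted LRU chain combined with a HashMap.
public class HotspotStats {
    // Each worker thread owns its own map, so the hot path needs no locks.
    private static final ThreadLocal<Map<String, Long>> LOCAL_COUNTS =
            ThreadLocal.withInitial(HashMap::new);

    // Background pool that aggregates snapshots off the request-processing path.
    private static final ExecutorService ANALYZER = Executors.newSingleThreadExecutor();

    // Called by a worker thread for every request it serves.
    public static void record(String key, int valueSizeBytes) {
        // Summing bytes per key lets the analyzer spot traffic hotspots;
        // counting requests per key (not shown) would spot QPS hotspots.
        LOCAL_COUNTS.get().merge(key, (long) valueSizeBytes, Long::sum);
    }

    // Called at the end of each sampling period: hand the snapshot to the
    // analyzer and start a fresh map for the next period.
    public static void flush() {
        Map<String, Long> snapshot = LOCAL_COUNTS.get();
        LOCAL_COUNTS.set(new HashMap<>());
        ANALYZER.submit(() -> analyze(snapshot));
    }

    private static void analyze(Map<String, Long> snapshot) {
        // A real implementation would feed these numbers into the weighted LRU
        // structure and flag keys whose QPS or byte volume crosses a threshold.
        snapshot.forEach((key, bytes) -> { /* identify hotspot keys here */ });
    }
}
```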
The method used in the solution establishes an isolated HotZone, which is a storage zone on the DataServer that guarantees access to hotspot data.
The multi-level cache architecture is shown below:
There is no weighting relationship between the HotZones on different DataServers, and each HotZone stores the same read-hotspot data. A client request for a hotspot data key can therefore be served by the HotZone of any DataServer, chosen at random.
In this way, a single-point hotspot request is hashed onto several nodes, or even the whole cluster.
Each client is assigned to a single HotZone. After initializing, and before making any requests, a client obtains the node information and complete data routing table for the whole Tair cluster, as well as the total number of HotZones configured across all DataServers (the HotZone “node range”). The client then randomly selects one HotZone as its fixed reading and writing zone. This selection does not change so long as the number of DataServers and hashing machines remains unchanged.
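A minimal sketch of how a client might pin itself to one HotZone at initialization is shown below; the class and field names are assumptions, not Tair’s actual client code.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch of a client fixing its HotZone at initialization time.
public class HotZoneSelector {
    private final String chosenHotZone;

    public HotZoneSelector(List<String> hotZoneNodes) {
        // Pick one HotZone at random; the choice stays fixed for the life of
        // the client as long as the cluster topology does not change.
        int index = ThreadLocalRandom.current().nextInt(hotZoneNodes.size());
        this.chosenHotZone = hotZoneNodes.get(index);
    }

    public String hotZone() {
        return chosenHotZone;
    }
}
```

Because each client picks independently, requests for the same hotspot key from a large number of clients spread roughly evenly across all HotZones.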
The client receives the hotspot key feedback from the server side, and this feedback is effective only for a limited period. During this period, when the client accesses the key, it first attempts to read the hotspot data from the HotZone node.
The HotZone node and the DataServer node that stores the source data form a two-level cache model known to the client. If the data is unavailable in the HotZone, the client then requests the source data node, and the retrieved data is asynchronously written back to the HotZone node. The application simply calls the Tair client interface and retrieves data in the standard way.
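The read path can be summarized with the following sketch, assuming a generic CacheNode interface; the names are illustrative rather than Tair’s actual API.

```java
import java.util.concurrent.CompletableFuture;

// Sketch of the two-level read path for a key currently flagged as a read
// hotspot. The CacheNode interface and all names here are assumptions.
public class HotspotReader {
    interface CacheNode {
        byte[] get(String key);
        void put(String key, byte[] value, int expireSeconds);
    }

    private final CacheNode hotZone;       // the client's fixed HotZone node
    private final CacheNode sourceServer;  // DataServer that owns the source data
    private final int hotZoneTtlSeconds;   // max tolerated staleness of HotZone data

    HotspotReader(CacheNode hotZone, CacheNode sourceServer, int hotZoneTtlSeconds) {
        this.hotZone = hotZone;
        this.sourceServer = sourceServer;
        this.hotZoneTtlSeconds = hotZoneTtlSeconds;
    }

    public byte[] read(String key) {
        // 1. Try the HotZone first.
        byte[] cached = hotZone.get(key);
        if (cached != null) {
            return cached;
        }
        // 2. Miss: fall back to the source DataServer for the authoritative copy.
        byte[] fresh = sourceServer.get(key);
        // 3. Backfill the HotZone asynchronously so later reads hit level one.
        if (fresh != null) {
            CompletableFuture.runAsync(() -> hotZone.put(key, fresh, hotZoneTtlSeconds));
        }
        return fresh;
    }
}
```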
The process of hotspot feedback and recognition is transparent, as is multi-level cache access. The consistency of cache data in the HotZone is guaranteed by the expiration time set when initializing the client. The expiration time is determined by the maximum length of time for which a service can tolerate inconsistent cache data.
If the hotspot feedback locally stored by the client has expired, the client reads the data key from the source DataServer node. If the key is still in hotspot status on the server side, the client will receive the hotspot feedback packet again.
Because the locally stored hotspot feedback expires at a different time on each client, there is never a moment at which all requests go back to the source simultaneously. Even if all requests do go back to the source, each only needs to read the source data once, so the maximum number of source reads is simply the number of application machines.
If, after going back to the source, the key is found to no longer be a hotspot, the client returns to its normal access mode.
Writing hotspots cannot be addressed using multi-level caches, because doing so can cause data inconsistencies. If the machine handling the write crashes, data may have been written to the local cache but not yet asynchronously updated to the source DataServer.
Instead, writing hotspots are processed on the server side by means of request combination.
On the IO thread, write requests for a hotspot key are dispatched to a dedicated hotspot-combining thread. This thread combines write requests for the same key over a certain period of time, and a timer thread then submits the combined requests to the engine layer according to the preset combination period.
Results are not returned to clients during the combination process; they are returned collectively once the combined requests have been written into the engine. This avoids several issues, including data inconsistencies, reads of stale data, and false writes.
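As a simplified illustration of request combining, the sketch below buffers writes to the same hotspot key for a short window, submits only one engine write per window, and answers all waiting requests together. All class and method names are assumptions, and a real implementation would also cap the number of requests combined per window.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Simplified sketch of server-side write combining for a hotspot key: writes
// to the same key are buffered for a short window, only the latest value is
// written to the engine, and all waiting requests receive the same result.
public class WriteCombiner {
    private static final class Batch {
        final List<CompletableFuture<Boolean>> waiters = new ArrayList<>();
        byte[] latestValue;
    }

    private final ConcurrentHashMap<String, Batch> pending = new ConcurrentHashMap<>();
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    private final long combinePeriodMicros;

    public WriteCombiner(long combinePeriodMicros) {
        this.combinePeriodMicros = combinePeriodMicros;
    }

    // Called on the IO thread for each write request to a hotspot key.
    public CompletableFuture<Boolean> submit(String key, byte[] value) {
        CompletableFuture<Boolean> result = new CompletableFuture<>();
        pending.compute(key, (k, batch) -> {
            if (batch == null) {
                batch = new Batch();
                // First write in this window: schedule a single engine write.
                timer.schedule(() -> flush(k), combinePeriodMicros, TimeUnit.MICROSECONDS);
            }
            batch.latestValue = value;   // keep only the newest value for the window
            batch.waiters.add(result);
            return batch;
        });
        return result;
    }

    // Runs once per combination period and per key.
    private void flush(String key) {
        Batch batch = pending.remove(key);
        if (batch == null) {
            return;
        }
        boolean ok = writeToEngine(key, batch.latestValue); // one engine write per window
        batch.waiters.forEach(f -> f.complete(ok));         // answer all combined requests together
    }

    private boolean writeToEngine(String key, byte[] value) {
        return true; // placeholder for the storage-engine write
    }
}
```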
The combination period can be configured on the server side, and its validity period can be modified dynamically.
The write-hotspot solution is completely transparent to the client and requires no changes to application code.
Stress tests on a cluster using the LDB storage engine found that write combining for a single key can reach a per-key QPS in the millions (with a combination period of 1 ms and no limit on the number of combined requests). For practical online clusters, however, the period is limited to 0.1 ms and the number of requests combined at a time is capped at 100.
Because at most one combined write per key reaches the engine every 0.1 ms, the maximum QPS landing on the engine layer for a single key is kept to 10,000 or below (depending on the access frequency of the application).
The packet processing of the Tair client is fully asynchronous, so the combining of hotspot write requests does not block the processing of other requests.
The only negative impact is an increased response time (RT) when the client issues a write request for a hotspot key. With the current configuration (a 0.1 ms combination period), however, this impact is negligible.
(Original article by Liu Huan 刘欢)