In this article, Thai Bui describes how Bazaarvoice leverages Alluxio as a caching tier on top of AWS S3 to maximize performance and minimize the operating costs of running big data analytics on AWS EC2. The original article can be found on Alluxio's engineering blog.
Bazaarvoice is a digital marketing company based in Austin, Texas. We are a software-as-a-service provider that lets retailers and brands curate, manage, and understand user-generated content (UGC) such as product reviews. We provide services such as collecting and hosting reviews, programmatic management of UGC, and deep analytics on consumer behavior.
Our globally distributed services analyze UGC for over 1,900 of the largest internet retailers and brands. Data engineering at Bazaarvoice means handling data at massive internet scale. Take the 2018 Thanksgiving weekend, for example: the combined three-day holiday traffic generated 1.5 billion product page views and recorded $5.1 billion USD in online sales for our clients. Within a month, our services host and record web analytics events for over 900 million internet shoppers worldwide.
To keep up with this web traffic, we host our services on Amazon Web Services. The big data platform relies entirely on the open source Hadoop ecosystem and the tools built around it.
The data generated by these services is converted into analytics-friendly formats such as Parquet or ORC and is eventually stored in S3.
Using S3 to store our data lets us increase storage capacity effortlessly, but at the cost of a performance bottleneck: access to sizable data in S3 is limited by the connection's bandwidth. And since S3 is a managed AWS service, we cannot write custom code to optimize it for our specific environment, unlike the other open source technologies we use. One approach would be to provision more hardware to process data from S3 in parallel, but rather than spending more money on hardware, we decided to find a better solution to this performance bottleneck.
We realized that not all data access needs to be fast, because a typical workload touches only a subset of the data. That subset, however, is constantly changing as newer data arrives. This led us to design a tiered storage system to accelerate our data access.
With this in mind, we chose Alluxio as the key component of the tiered storage system, because it is highly configurable and operationally cheap to reconfigure.
We integrated Alluxio with the Hive Metastore as the basis for the tiered S3 acceleration layer. Clusters can be provisioned elastically with support for our tiered storage layer.
In each job cluster (Hive on Tez & Spark) or interactive cluster (Hive 3 on LLAP), a table can be altered so that its more relevant data is served through Alluxio while the rest is served by S3 directly.
Because data is only cached in Alluxio when it is accessed by a Hive or Spark task, there is no up-front data movement. And because updating a table's configuration is extremely cheap, adapting it as query patterns change has allowed us to stay very nimble.
For example, let us look at our page view dataset. A page view is an event recorded when an internet user visits a page on one of our clients' websites. The raw data is collected, aggregated hourly, and converted into Parquet format. This data is stored in S3 in a year/month/day/hour hierarchy:
s3://some-bucket/
|- pageview/
   |- 2019/
      |- 02/
         |- 14/
            |- 01/
            |- 02/
            |- 03/
               |- pageview_file1.parquet
               |- pageview_file2.parquet
The data is registered with one of our clusters via the Hive Metastore and is available in Hive or Spark for analytics, ETL, and various other purposes. An internal system registers every new dataset it detects by updating the Hive Metastore directly.
-- add partition from S3. equivalent to always referencing "cold" data
ALTER TABLE pageview
ADD PARTITION (year=2019, month=2, day=14, hour=3)
LOCATION 's3://<bucket>/pageview/2019/02/14/03';
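That detection-and-registration step could be scripted roughly as follows. This is only a minimal sketch, not our actual implementation; the bucket, table, JDBC URL, and the use of beeline are assumptions for illustration:
#!/usr/bin/env bash
# Sketch only: register a newly detected hourly prefix as a Hive partition.
BUCKET="some-bucket"
TABLE="pageview"
JDBC_URL="jdbc:hive2://metastore-host:10000/default"   # hypothetical endpoint

# a single hour to register; a real system would track which hours are new
YEAR=2019 MONTH=02 DAY=14 HOUR=03
PREFIX="pageview/${YEAR}/${MONTH}/${DAY}/${HOUR}"

# only register the partition if Parquet files actually exist under the prefix
if aws s3 ls "s3://${BUCKET}/${PREFIX}/" | grep -q '\.parquet$'; then
  beeline -u "${JDBC_URL}" -e "
    ALTER TABLE ${TABLE}
    ADD IF NOT EXISTS PARTITION (year=${YEAR}, month=$((10#${MONTH})), day=$((10#${DAY})), hour=$((10#${HOUR})))
    LOCATION 's3://${BUCKET}/${PREFIX}';"
fi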
Our system is also aware of the tiered storage configuration provided for each specific cluster and table.
For instance, in an interactive cluster where analysts are exploring the last few weeks of trending data, the pageview table's partitions that are less than a month old are configured to be read directly from the Alluxio filesystem. An internal system reads this configuration, mounts the S3 bucket into the Alluxio cluster via the REST API (a CLI sketch of the mount follows the DDL below), and automatically promotes or demotes tables and partitions using Hive DDLs such as the following:

-- promoting. repeating a query will cache the data from cold -> warm -> hot
ALTER TABLE pageview
PARTITION (year=2019, month=2, day=14, hour=3)
SET LOCATION 'alluxio://<address>/mnt/s3/pageview/2019/02/14/03';

-- demoting. data more than a month old goes back to the "cold" tier.
-- this protects our cache when people occasionally read older data
ALTER TABLE pageview
PARTITION (year=2019, month=1, day=14, hour=3)
SET LOCATION 's3://<bucket>/pageview/2019/01/14/03';
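Our internal system performs the mount through Alluxio's REST API; the equivalent Alluxio shell command, shown here only as a sketch and assuming S3 credentials are already configured in alluxio-site.properties, looks like this:
# Sketch only: mount the bucket read-only under /mnt/s3 so that promoted partitions
# resolve to alluxio://<address>/mnt/s3/pageview/... as in the DDL above
./bin/alluxio fs mount --readonly /mnt/s3 s3a://<bucket>/

# listing the namespace loads metadata only; data is cached on first read by a task
./bin/alluxio fs ls /mnt/s3/pageview/2019/02/14/03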
As more recent data arrives, older data is demoted, requiring the tiered storage configuration to be updated. This process happens asynchronously and continuously.
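A demotion pass of that continuous process could be sketched as a scheduled script along these lines; the cutoff, table, bucket, and beeline connection are illustrative assumptions rather than our actual tooling:
#!/usr/bin/env bash
# Sketch only: demote partitions that have aged past one month back to S3 locations.
BUCKET="some-bucket"
TABLE="pageview"
JDBC_URL="jdbc:hive2://metastore-host:10000/default"   # hypothetical endpoint

# the day that just crossed the one-month cutoff (GNU date)
Y=$(date -d "1 month ago" +%Y)
M=$(date -d "1 month ago" +%m)
D=$(date -d "1 month ago" +%d)

# point every hourly partition of that day back at its S3 location
for H in $(seq -w 0 23); do
  beeline -u "${JDBC_URL}" -e "
    ALTER TABLE ${TABLE}
    PARTITION (year=${Y}, month=$((10#${M})), day=$((10#${D})), hour=$((10#${H})))
    SET LOCATION 's3://${BUCKET}/pageview/${Y}/${M}/${D}/${H}';"
done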
Our Alluxio + ZFS + NVMe SSD read micro-benchmark runs on an i3.4xlarge AWS instance with up to 10 Gbit networking, 122GB of RAM, and two 1.9TB NVMe SSDs. We mount an S3 bucket in Alluxio and perform two read tests. The first test uses the AWS CLI to copy 5GB of Parquet data into the instance's ramdisk, so that only read performance is measured.
# using AWS cli to copy recursively to RAM disk
time aws s3 cp --recursive s3://<bucket>/dir /mnt/ramdisk
The second test uses the Alluxio CLI to copy the same Parquet data to the ramdisk. This time, we perform the test three times to capture the cold, warm, and hot read numbers.
# using the Alluxio CLI to copy recursively to RAM disk
time ./bin/alluxio fs copyToLocal /mnt/s3/dir /mnt/ramdisk
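The stack under test layers Alluxio's worker storage on top of a ZFS pool built from the two NVMe drives. The article does not show how the two are wired together; purely as a sketch, with device names, dataset names, and sizes assumed, the setup could look like this:
# Sketch only: stripe a ZFS pool across the two NVMe drives (device names assumed)
sudo zpool create -f alluxio-pool /dev/nvme0n1 /dev/nvme1n1
sudo zfs set compression=lz4 alluxio-pool
sudo zfs create -o mountpoint=/mnt/alluxio-cache alluxio-pool/cache

# conf/alluxio-site.properties (worker): keep RAM as tier 0 and the ZFS dataset as tier 1
cat >> conf/alluxio-site.properties <<'EOF'
alluxio.worker.memory.size=64GB
alluxio.worker.tieredstore.levels=2
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
alluxio.worker.tieredstore.level1.alias=SSD
alluxio.worker.tieredstore.level1.dirs.path=/mnt/alluxio-cache
alluxio.worker.tieredstore.level1.dirs.quota=3TB
EOF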
Alluxio v1.7 with ZFS and NVMe takes about 78% longer to perform a cold read when compared to S3 (66s vs. 37s). The successive warm and hot reads are 2.5 and 4.6 times faster than reading directly from S3.
However, micro-benchmarks don't always tell the whole story, so let's look at a few real-world queries executed by actual analysts at Bazaarvoice.
We rerun the same queries on our production interactive cluster to compare the tiered storage system against S3 by itself. The production cluster consists of 20 i3.4xlarge nodes running Hadoop 3, Hive 3.0 on Alluxio 1.7, and ZFS v0.7.12-1.
Query 1 is a single-table deduplication query grouping by 6 columns. It processes 95M input records and 8GB of Parquet data from a 200GB dataset.
Query 2 is a more complex query, involving a join across 4 tables with aggregations on 5 columns. It processes 1.2B input records and 50GB of Parquet data from datasets several terabytes in size.
Query 1, simplified for readability, is shown below. Compared to running the queries directly on S3, the tiered storage system speeds up query 1 by 11x and query 2 by 6x.
-- query 1, simplified
SELECT dt, col1, col2, col3, col4, col5
FROM table_1 n
WHERE lower(col6) = lower('<some_value>')
AND month = 1
AND year = 2019
AND (col7 IS NOT NULL OR col8 IS NOT NULL)
GROUP BY dt, col1, col2, col3, col4, col5
ORDER BY dt DESC
LIMIT 100;
Alluxio, ZFS, and the tiered storage architecture have saved a significant amount of time for analysts at Bazaarvoice. AWS S3 is easy to scale in capacity, and by augmenting it with a tiered storage configuration that is nimble and cheap to adapt, we can focus on growing our business while scaling storage as needed.
Today at Bazaarvoice, the current production configuration can handle about 35TB of data in cache and half a petabyte of data on S3. In the future, we could simply add more nodes to grow the cache size or upgrade hardware for quicker localized access.