If you have to design a system that can serve 1B+ ad requests a day, or roughly 200 GB of data an hour, your first instinct is most likely to reach for some fancy big data technology. And of course, that makes sense too. Against the hulk of tools like Kafka, Spark, Flink, Hadoop and what not, these numbers look like a puny human. It's easy to design that way.
AdPushup recently served 1.2B+ ad feedback requests and 425M unique impressions in a single day. It was a proud moment for the engineers: the in-house system designed two years ago had now auto-scaled almost 10x. But the journey to develop such a system was a long string of experiments.
Back in 2016, when I joined AdPushup, we didn't have any concrete data pipeline. The priority was to build and stabilize the A/B testing tool suite. For reporting, we were relying on Google (GAM, AdSense) and other partners. However, frequent reporting requests and the need for informed decision-making pushed us towards an in-house reporting setup.
Being new to AdTech reporting, the team had only a vague idea of which data points would be useful for publishers. So a flexible approach was adopted and all the available data points were recorded. The client-side script started sending data in JSON format on a number of events like ad view, click, etc. Every hit/request carried around 20–30 data points.
A schema-agnostic system/tool was required, and thanks to Microsoft's venture accelerator plan, ample credits were at the team's disposal for experimentation.
The ELK stack was the first choice, as it was red hot at the time and provided exactly the freedom from schema that was needed. So a small Elasticsearch cluster was set up and logs were poured into it. It worked great initially, but it didn't scale as expected.
Single-dimension aggregations worked very smoothly. However, performance degraded as soon as we tried aggregating across multiple dimensions.
As the data size and number of incoming requests increased, the overall performance of the cluster went down. Adding more nodes to the cluster helped initially, but after a point performance stopped improving.
This (sort of) failure with ELK compelled us to consider other solutions. So instead of an in-house solution, we started looking for some kind of managed or white-label offering. The team's googling skills landed us on keen.io (luckily!) and it turned out to be a brilliant tool. It had all the benefits of the ELK stack plus so much more:
Completely managed tool. No infra maintenance needed.
Auto schema discovery.
Works great for all kinds of aggregation queries.
Easy to set up and use.
BUT, the benefits didn't warrant the bill. The estimated charge was somewhere around $13k/month. The long-term cost was way too high for an early-stage startup to bear.
Remembering the free Azure credits, the team planned to use the full-fledged big data stack: Kafka + Spark + Hive, the whole shebang. Fortunately, Azure offered a managed solution for this, Azure HDInsight. A short POC was conducted and the performance seemed convincing. Every single component of the setup was both vertically and horizontally scalable, plus there was practically zero management. No bottlenecks were foreseen. But again, the cost of that cluster didn't make sense given the small data size.
The estimated charge was around $7.5k/month, which was more than double the cost of our entire infrastructure on Azure.
While all of this was going on, a requirement arose for a highly available key-value data store. The idea was to serve some configuration across the globe with the minimum possible latency. A CDN was ruled out as the team anticipated complex computation at the edge. So a setup with web servers and local datastores at multiple geographical locations was designed.
For the datastore, Redis, MongoDB, Elasticsearch and Couchbase were benchmarked. After a few tests, Couchbase (CB) seemed to be the most suitable option. Installation was pretty straightforward, the cluster largely managed itself, and it had a mature, stable, out-of-the-box cross-datacenter replication mechanism: XDCR.
After a few iterations, the final design had an auto-scalable web server cluster (Azure VMSS) and a two-node Couchbase (CB) cluster at four prominent locations: US, UK, APAC and IN. For traffic distribution, we used AWS Route 53's geolocation routing, which distributes traffic based on the geographical location of the incoming DNS query. Two separate CB buckets were created in the central CB cluster to hold configurations. These buckets were then replicated to all four edge locations, with the replication strategy based on the geographical distance between the regions.
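For illustration, a geolocation record of this sort could be created with the AWS SDK for Java roughly like this. This is a minimal sketch: the domain name, set identifier, IP address and hosted zone ID are placeholders, not our actual configuration.

```java
import com.amazonaws.services.route53.AmazonRoute53;
import com.amazonaws.services.route53.AmazonRoute53ClientBuilder;
import com.amazonaws.services.route53.model.Change;
import com.amazonaws.services.route53.model.ChangeAction;
import com.amazonaws.services.route53.model.ChangeBatch;
import com.amazonaws.services.route53.model.ChangeResourceRecordSetsRequest;
import com.amazonaws.services.route53.model.GeoLocation;
import com.amazonaws.services.route53.model.RRType;
import com.amazonaws.services.route53.model.ResourceRecord;
import com.amazonaws.services.route53.model.ResourceRecordSet;

public class GeoRoutingSetup {
    public static void main(String[] args) {
        AmazonRoute53 route53 = AmazonRoute53ClientBuilder.defaultClient();

        // One geolocation record per edge region: Route 53 answers each DNS query
        // with the record whose location matches the resolver's geography.
        ResourceRecordSet apacRecord = new ResourceRecordSet()
                .withName("edge.example.com")                       // hypothetical domain
                .withType(RRType.A)
                .withSetIdentifier("apac-edge")                     // one identifier per region
                .withGeoLocation(new GeoLocation().withContinentCode("AS"))
                .withTTL(60L)
                .withResourceRecords(new ResourceRecord("203.0.113.10")); // placeholder edge IP

        route53.changeResourceRecordSets(new ChangeResourceRecordSetsRequest(
                "HOSTED_ZONE_ID",                                   // placeholder hosted zone
                new ChangeBatch().withChanges(new Change(ChangeAction.UPSERT, apacRecord))));
    }
}
```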
Getting documents by a specific key was as fast as expected. The total application response time was under 40 ms, which was acceptable. The total cost incurred was around $2.1k/month.
A monitoring system was also needed to keep an eye on the application logs. So we reused the same setup and created a third CB bucket to store the logs. For a unified dashboard view, all the logs had to be present in a single data center, so the logs were XDCR'ed from all regions to the central CB cluster (black dotted lines in the previous image).
We created a dashboard to view the latest logs, their details and their counts, using Couchbase Views for the aggregation. The logger turned out to be quite useful. AND… fortuitously, an in-house custom logging system was now up and running at minor additional cost (moving logs between the regions).
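As a rough idea of what a view-based aggregation like this can look like, here is a minimal sketch with the Couchbase Java SDK 2.x: a design document whose map function emits a [level, hour] key with the built-in _count reduce, and a grouped view query for the dashboard. The host, bucket name, design document and field names are assumptions for illustration, not our exact production code.

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.view.DefaultView;
import com.couchbase.client.java.view.DesignDocument;
import com.couchbase.client.java.view.Stale;
import com.couchbase.client.java.view.ViewQuery;
import com.couchbase.client.java.view.ViewResult;
import com.couchbase.client.java.view.ViewRow;

import java.util.Arrays;

public class LogDashboardView {
    public static void main(String[] args) {
        // Hypothetical host and bucket; the real cluster topology is described above.
        Bucket logs = CouchbaseCluster.create("central-cb.example.com").openBucket("logs");

        // Map function emits a [level, hour] key so counts can be grouped per level per hour.
        String map =
                "function (doc, meta) {"
              + "  if (doc.level && doc.ts) {"
              + "    emit([doc.level, doc.ts.substring(0, 13)], null);"
              + "  }"
              + "}";

        DesignDocument dd = DesignDocument.create("dashboard",
                Arrays.asList(DefaultView.create("by_level_hour", map, "_count")));
        logs.bucketManager().upsertDesignDocument(dd);

        // Grouped query: one row per [level, hour] key, with the _count reduce as the value.
        ViewResult result = logs.query(
                ViewQuery.from("dashboard", "by_level_hour").group(true).stale(Stale.FALSE));
        for (ViewRow row : result.allRows()) {
            System.out.println(row.key() + " -> " + row.value());
        }
    }
}
```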
That’s when it dawned upon us to try to re-use this for our reporting pipeline as well!!
We started by adding a new Java Servlet endpoint on the existing web servers to collect the raw data from clients. Request data was stored in a newly created CB bucket in each region. A Java application that streamed data through a CB View was created and deployed in each region. A list of fixed data points was defined, and the application aggregated them in memory: it would keep reading from the CB view, aggregating until the end of the hour, and then push the aggregations into a central SQL database.
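A minimal sketch of such a collection endpoint, again with the Couchbase Java SDK 2.x, might look like the following. The class name, bucket wiring and hour-prefixed document keys are illustrative assumptions, not the actual production code.

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.BufferedReader;
import java.io.IOException;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.UUID;
import java.util.stream.Collectors;

// Hypothetical collection endpoint: accepts the raw JSON hit sent by the
// client-side script and stores it in the region's raw-data bucket.
public class AdFeedbackServlet extends HttpServlet {

    // Opened inline to keep the sketch self-contained; in practice the bucket
    // would be shared application-wide (host and bucket name are placeholders).
    private final Bucket rawBucket =
            CouchbaseCluster.create("localhost").openBucket("raw_hits");

    private static final DateTimeFormatter HOUR = DateTimeFormatter.ofPattern("yyyyMMddHH");

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        // Read the JSON body (the 20-30 data points per hit mentioned above).
        String body;
        try (BufferedReader reader = req.getReader()) {
            body = reader.lines().collect(Collectors.joining());
        }

        // Prefix keys with the UTC hour so the hourly cruncher can pick up one hour at a time.
        String id = LocalDateTime.now(ZoneOffset.UTC).format(HOUR) + "::" + UUID.randomUUID();
        rawBucket.upsert(JsonDocument.create(id, JsonObject.fromJson(body)));

        resp.setStatus(HttpServletResponse.SC_NO_CONTENT);
    }
}
```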
It also compressed the raw data concurrently into .tar.gz archives and stored them in cold storage for later on-demand processing. After a successful cycle, checkpoints were updated, the previous hour's logs were cleared, and processing of the next hour's data was initiated.
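The archiving step could be sketched roughly like this, assuming Apache Commons Compress for the .tar.gz packing; the class and method names here are hypothetical, and the real implementation is not shown in this post.

```java
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream;

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class RawLogArchiver {

    // Packs one hour's raw JSON documents into a single .tar.gz archive that can
    // then be pushed to cold storage for later on-demand reprocessing.
    public static void archiveHour(String hour, List<String> rawJsonDocs, String outPath)
            throws IOException {
        try (TarArchiveOutputStream tar = new TarArchiveOutputStream(
                new GzipCompressorOutputStream(
                        new BufferedOutputStream(new FileOutputStream(outPath))))) {
            int i = 0;
            for (String doc : rawJsonDocs) {
                byte[] bytes = doc.getBytes(StandardCharsets.UTF_8);
                // Each raw hit becomes one small .json entry inside the hour's archive.
                TarArchiveEntry entry = new TarArchiveEntry(hour + "/" + (i++) + ".json");
                entry.setSize(bytes.length);
                tar.putArchiveEntry(entry);
                tar.write(bytes);
                tar.closeArchiveEntry();
            }
        }
    }
}
```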
IT WORKED!! To our surprise, the setup succeeded on both cost efficiency and reporting goals.
The additional monthly cost was the crunching servers (fixed $1.3k) + the extra bandwidth for incoming data (~$300) + SQL Server (fixed $1.1k), i.e. roughly $2.7k. Added to the ~$2.1k/month already being spent on the shared edge infrastructure, a feasible, easy-to-maintain data pipeline was possible for under $5k/month.
We got the complete reporting up and running in under a month. We supported 21 dimensions and 42 metrics, with the flexibility to add more data points at any time in the future. The biggest advantages of this shared infrastructure were its low cost and low maintenance. Thanks to the simplicity of the design, we were able to build and maintain the pipeline with a team of only two engineers!
After testing the setup for almost two quarters, rigorously matching numbers and fixing data discrepancies, we went live in August 2019.
The summary of the setup is below. Every line color has a meaning, so do read between the lines!