When building applications, it's common to coordinate frontends, databases, worker queues, and APIs. Traditionally, queues and APIs are kept separate: queues handle long-running or CPU-intensive tasks, and APIs serve quick responses to keep applications snappy.
Here at Paragon, we were faced with the question: how do we build a performant API for a queue that handles all our customers' tasks?
For those not familiar with Paragon, we provide a visual builder for creating APIs and workflows. Users can build cron jobs and API endpoints in minutes that connect to databases, third-party APIs and services, and logic or custom code for routing requests and transforming data. With that context, we had to build a worker queue that could support the following use cases:
Sounds like a Herculean feat, right? Spoiler alert: it was. These requirements blur the line between a worker queue and an API, and there were no common engineering paradigms to draw from. As you can imagine, the security and performance implications of these product requirements kept our engineering team busy for some time.
Due to the complexity, performance, and security requirements of our platform, we've had to innovate on the API + worker queue construct a bit. We created Hermes (our API messaging service) and Hercules (our workflow executor) to solve these problems.
Hermes accepts API requests, sends them to Hercules for execution, waits for the specified workflow and its steps to complete (or fail), then sends a response back to any awaiting clients. The two are entirely separate services, but they work together to receive, schedule, and execute work.
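The request flow described above can be sketched roughly as follows. This is a minimal illustration, not our production code: it uses an in-process `asyncio.Queue` as a stand-in for the real transport between the two services, and the names `Hermes.handle_request`, `on_result`, `hercules_worker`, and `execute_workflow` are hypothetical.

```python
import asyncio
import uuid


def execute_workflow(workflow: str, payload: dict) -> dict:
    """Hypothetical workflow body; the name 'broken' simulates a failing step."""
    if workflow == "broken":
        raise RuntimeError("step failed")
    return {"workflow": workflow, "echo": payload}


class Hermes:
    """API side: enqueues work, then waits for the matching result."""

    def __init__(self, queue: asyncio.Queue) -> None:
        self.queue = queue
        self.pending: dict[str, asyncio.Future] = {}

    async def handle_request(self, workflow: str, payload: dict, timeout: float = 5.0) -> dict:
        request_id = str(uuid.uuid4())
        future = asyncio.get_running_loop().create_future()
        self.pending[request_id] = future
        await self.queue.put((request_id, workflow, payload))
        try:
            # The client's HTTP request would stay open while we wait here.
            return await asyncio.wait_for(future, timeout)
        finally:
            self.pending.pop(request_id, None)

    def on_result(self, request_id: str, result: dict) -> None:
        future = self.pending.get(request_id)
        if future is not None and not future.done():
            future.set_result(result)


async def hercules_worker(queue: asyncio.Queue, hermes: Hermes) -> None:
    """Executor side: pulls jobs, runs them, reports success or failure."""
    while True:
        request_id, workflow, payload = await queue.get()
        try:
            result = {"status": "ok", "output": execute_workflow(workflow, payload)}
        except Exception as exc:
            result = {"status": "failed", "error": str(exc)}
        hermes.on_result(request_id, result)


async def demo(workflow: str) -> dict:
    queue: asyncio.Queue = asyncio.Queue()
    hermes = Hermes(queue)
    worker = asyncio.create_task(hercules_worker(queue, hermes))
    try:
        return await hermes.handle_request(workflow, {"user": 42})
    finally:
        worker.cancel()
```

Calling `asyncio.run(demo("send-welcome-email"))` returns a result with `"status": "ok"`, while `asyncio.run(demo("broken"))` returns a `"failed"` result instead of raising, so the waiting client gets a response either way.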
One might think that the added complexity and latency between submitting jobs to a worker queue and waiting for a response might have slowed down the API. We were quite pleased to find out the opposite: our APIs got much faster, particularly when processing large arrays of data.
Thanks to Hercules' ability to self-monitor and autoscale, we can distribute work across processes and run it in parallel. Additionally, if one branch of steps fails, the others can continue to run without terminating the request, which makes workflows more consistent and reliable.
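Here is a small sketch of the "one branch fails, the others keep going" behavior. It is an approximation under stated assumptions: asyncio tasks in a single process stand in for the separate processes Hercules uses, and `run_branch` and `run_workflow` are hypothetical names, with the branch called `flaky` hard-coded to fail.

```python
import asyncio


async def run_branch(name: str, items: list[int]) -> dict:
    """Hypothetical branch of steps; the 'flaky' branch always fails."""
    await asyncio.sleep(0)  # stand-in for real I/O or computation
    if name == "flaky":
        raise RuntimeError("step failed")
    return {"branch": name, "total": sum(items)}


async def run_workflow(branches: dict[str, list[int]]) -> dict:
    names = list(branches)
    # return_exceptions=True keeps one failing branch from cancelling the rest.
    results = await asyncio.gather(
        *(run_branch(name, branches[name]) for name in names),
        return_exceptions=True,
    )
    succeeded: dict[str, dict] = {}
    failed: dict[str, str] = {}
    for name, result in zip(names, results):
        if isinstance(result, Exception):
            failed[name] = str(result)
        else:
            succeeded[name] = result
    return {"succeeded": succeeded, "failed": failed}
```

Running `asyncio.run(run_workflow({"a": [1, 2, 3], "flaky": [1], "b": [4]}))` reports branches `a` and `b` under `succeeded` and `flaky` under `failed`, rather than failing the whole request.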
With that in mind, here are some of the things we thought about or learned while building a resilient, scalable system.
Things will break. That's a truth we all must accept when building systems. Code won't work as expected, memory will leak, APIs won't return the expected responses, upgraded dependencies will introduce new issues, and so on. We're dealing with all of these unknown unknowns, on top of the fact that we're running arbitrary code from our users that may fail.
To ensure jobs don't take down other jobs, it's important to run them in isolation. This means:
As most agile teams do, we have multiple environments for testing different versions of our code. This means in one dev environment there may be zero jobs running while in another there could be thousands.
To ensure we're able to meet traffic demands while still optimizing resources, here are a few key infrastructure elements we've employed.
Our users' workflows may run in milliseconds, minutes, or days depending on their configuration. We're often deploying new features multiple times a day, meaning servers terminate to pull the latest images, and jobs execute in separate processes across a variable number of servers.
This means workflow state can't be stored in memory; otherwise, other machines wouldn't have access to it.
Steps may need data from previously executed steps in a workflow that ran some unknown time ago. Additionally, jobs should only be executed once, meaning whether there's one processing server or thousands, there should be no race conditions amongst workers.
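To make the "executed only once" requirement concrete, here is one common way to claim a job atomically from a shared store. This is an illustrative sketch only: an in-memory SQLite database stands in for whatever shared store you use, and the `jobs` table and the functions `try_claim`, `complete`, and `get_output` are hypothetical.

```python
import sqlite3
from typing import Optional


def create_store() -> sqlite3.Connection:
    """Shared store that lives outside any single worker's memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs ("
        "id TEXT PRIMARY KEY, status TEXT NOT NULL, worker TEXT, output TEXT)"
    )
    return conn


def enqueue(conn: sqlite3.Connection, job_id: str) -> None:
    conn.execute("INSERT INTO jobs (id, status) VALUES (?, 'queued')", (job_id,))
    conn.commit()


def try_claim(conn: sqlite3.Connection, job_id: str, worker: str) -> bool:
    """Claim a job only if it is still queued; a single UPDATE makes this atomic."""
    cursor = conn.execute(
        "UPDATE jobs SET status = 'running', worker = ? "
        "WHERE id = ? AND status = 'queued'",
        (worker, job_id),
    )
    conn.commit()
    return cursor.rowcount == 1


def complete(conn: sqlite3.Connection, job_id: str, output: str) -> None:
    """Persist the step's output so later steps can read it from any machine."""
    conn.execute(
        "UPDATE jobs SET status = 'done', output = ? WHERE id = ?", (output, job_id)
    )
    conn.commit()


def get_output(conn: sqlite3.Connection, job_id: str) -> Optional[str]:
    row = conn.execute("SELECT output FROM jobs WHERE id = ?", (job_id,)).fetchone()
    return row[0] if row else None
```

If two workers call `try_claim` for the same job, only the first call returns `True`; the second sees that the row is no longer `queued` and moves on. Because the output is written to the store rather than held in process memory, a later step can read it even if the original server has since been replaced.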
These are all common scenarios for a worker queue. The solution for all of this is straightforward:
Our distributed architecture comprises multiple microservices with their own data stores, some of which read from and write to several data stores. A single API call to one microservice may trigger a dozen calls to other services, reads and writes, encryption and decryption, and so on, which can add hundreds of milliseconds, if not seconds, of overhead.
Given that a workflow can have any number of steps, every 100 milliseconds added to a step can quickly degrade the user experience. For example, a 50-step workflow with an extra 100 milliseconds per step would take 5 seconds longer.
Caches can deliver huge performance improvements across systems when implemented correctly. One thing to consider here is that in a containerized n-scaled microservice architecture, in-memory caches, while easy to implement and very performant, aren't always the best choice.
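If a microservice has 10 instances behind a load balancer, each with its own in-memory cache in front of a database, a request is only a cache hit when it happens to land on an instance that has already loaded that key. Every instance has to warm up separately, so the cache misses far more often than a single shared cache would. Thus, we use Redis as a shared cache between instances.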
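A quick simulation shows the difference. It is a toy model under stated assumptions: a plain dictionary stands in for Redis, the load balancer is strict round-robin, every request asks for the same key, and `Database`, `Instance`, and `count_queries` are hypothetical names.

```python
from itertools import cycle


class Database:
    """Counts how many times the backing store is actually queried."""

    def __init__(self) -> None:
        self.queries = 0

    def load(self, key: str) -> str:
        self.queries += 1
        return f"value-for-{key}"


class Instance:
    """One service instance with a cache in front of the database."""

    def __init__(self, db: Database, cache: dict) -> None:
        self.db = db
        self.cache = cache

    def get(self, key: str) -> str:
        if key not in self.cache:
            self.cache[key] = self.db.load(key)
        return self.cache[key]


def count_queries(shared: bool, instances: int = 10, requests: int = 100) -> int:
    """Return how many database queries the same requests cause."""
    db = Database()
    shared_cache: dict = {}  # stand-in for a shared cache such as Redis
    pool = [Instance(db, shared_cache if shared else {}) for _ in range(instances)]
    balancer = cycle(pool)  # round-robin load balancer
    for _ in range(requests):
        next(balancer).get("workflow-123")
    return db.queries
```

With the defaults, `count_queries(shared=False)` returns 10 (one miss per instance) and `count_queries(shared=True)` returns 1. With many distinct keys, the gap grows in proportion to the number of instances.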
Additionally, we considered request volumes and read-to-write ratios. Given that our workflows execute, on average, thousands of times more often than they're deployed, deployed workflows were the very first thing we cached. That has in turn saved us millions of database queries, API calls between microservices, and seconds of API response time.
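The read-heavy pattern above lends itself to a read-through cache. The sketch below is an assumption about how such a cache could look, not a description of our implementation: `WorkflowCache` and its `on_deploy` hook are hypothetical, entries are kept in a local dictionary for brevity, and evicting on deploy is one possible invalidation strategy.

```python
from typing import Callable


class WorkflowCache:
    """Read-through cache for workflow definitions, keyed by workflow ID."""

    def __init__(self, load_from_db: Callable[[str], dict]) -> None:
        self._load = load_from_db
        self._entries: dict[str, dict] = {}
        self.hits = 0
        self.misses = 0

    def get(self, workflow_id: str) -> dict:
        if workflow_id in self._entries:
            self.hits += 1
            return self._entries[workflow_id]
        # Miss: fall through to the database once, then remember the result.
        self.misses += 1
        definition = self._load(workflow_id)
        self._entries[workflow_id] = definition
        return definition

    def on_deploy(self, workflow_id: str) -> None:
        """Drop the cached definition so the next execution reloads it."""
        self._entries.pop(workflow_id, None)
```

If a workflow executes 1,000 times between deploys, this cache goes to the database once and serves the other 999 executions from memory; the rare write (a deploy) simply evicts the entry.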
No system is perfect, but we're proud of what we've built at Paragon. Hercules does all of the heavy lifting while Hermes does all the talking. They can scale up and down independently of each other to meet demand, and when a process fails, other parts of the system are both unaware and unaffected.
We'd love to hear your thoughts on our implementation! Happy coding.