Operating a multi-tenant serverless cloud platform comes with a unique set of challenges.
There are many articles out there covering standard optimization techniques like SSR, PWA, tree shaking and similar. This article is different: it focuses on challenges unique to serverless environments, in this case AWS Lambda specifically.
My name is Sven, and I'm one of the founders of Webiny. Don't know Webiny? It's a serverless CMS powered by React, GraphQL, and Node.
Prep work
To get our measurements and help us identify the problem(s), we will be using webpagetest.org to run our requests, gather timing data, and get a better understanding of what a user is seeing, and experiencing, on their end.
What we will be looking at is the "first view", meaning how the load times look for a user who has never visited the page before. This is important, as the browser cache can hide many of the bottlenecks.
Load timings: identify the problem(s)
From the chart at the top of the page, the most meaningful metric for us was the "Time to Start Render". If you look closely, you'll see that it took almost 2 seconds just to start the render of the page 😱. This is due to the nature of a single page app (SPA). Basically, you first need to download a massive JS bundle (1), which the main thread then needs to process (2) before the page can display some content.
(1) Download the JS bundle. (2) Wait until the main thread processes the JS bundle.
However, that's just part of the story. Once the main thread processes the JavaScript bundle, it actually fires off several API calls to the API Gateway. At this stage, the user sees the notorious spinner and doesn't see any actual content yet. Here is a filmstrip of the visual events:
What we see here is a poor user experience. The user sees an empty page for about 2 seconds, and then a spinner for another second. That additional second is caused by the API requests that fire once the JS bundle is loaded; they are needed to retrieve and actually render the content.
If this were a regular VPS, the time cost of these API calls would be mostly predictable. But when dealing with serverless, you have the infamous "cold start", and to make matters even worse, in the case of the Webiny Cloud Platform the Lambdas are part of a VPC, which means we need to initialize an ENI for each new Lambda instance. This increases the cold start time drastically.
Here are some timing metrics for booting a Lambda inside a VPC and outside a VPC:
Image taken from https://www.freecodecamp.org/news/lambda-vpc-cold-starts-a-latency-killer-5408323278dd/
Conclusion: there is a 10x increase in the cold start time when a Lambda is inside a VPC (ouch!).
There is also another cost bundled into the API timings: latency. Since it's my browser (the client) executing the API requests, each request needs to traverse from my computer, over the Internet, all the way to the origin, and back. That is repeated for each API request.
Challenges
Based on the above, we identified a few challenges to tackle:
- Improve the speed of API requests, or reduce the number of API requests that block rendering
- Reduce the size of the JS bundle, or make the bundle not a requirement for rendering the page
- Unblock the main thread
Potential approaches:
- Optimize code to run faster: high effort, high cost, questionable gain
- Increase Lambda RAM size: low effort, medium to high cost, low gain
- Something else?
We actually opted for the third option. What if we don't need the API requests at all, and what if we don't need the JS bundle at all? That would eliminate all our pain points.
Our first idea was to generate a static HTML snapshot of the rendered page and serve that to the user.
A failed attempt
Webiny Cloud, the serverless infrastructure based on AWS Lambda that hosts Webiny websites, already had the ability to detect bots. Instead of serving the JS version of a page to a bot, it would reroute the request to a Puppeteer instance, which would render the page using headless Chrome and then serve the resulting HTML back to the bot. The main reason for that was SEO, as many bots don't read JavaScript. So we had the idea of using that same functionality to serve the same output to regular users.
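For illustration, here is a minimal sketch of that kind of bot pre-rendering flow using Puppeteer. The user-agent list and the detection logic are simplified assumptions for the example, not Webiny's actual implementation:

```js
const puppeteer = require("puppeteer");

// Hypothetical list of bot user-agent fragments; real detection is more involved.
const BOT_AGENTS = ["googlebot", "bingbot", "slurp", "duckduckbot"];

const isBot = (userAgent = "") =>
  BOT_AGENTS.some(bot => userAgent.toLowerCase().includes(bot));

async function renderForBot(url) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  // Wait until the network is idle, so the SPA has finished its API calls.
  await page.goto(url, { waitUntil: "networkidle0" });
  const html = await page.content(); // the fully rendered HTML
  await browser.close();
  return html;
}
```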
This usually works well in a non-JS environment, but when you try serving pre-rendered content to a client with a real, JS-enabled browser, the page renders, but then, once the JS files load, the React components don't know where to mount. This causes a big pile of errors in the console, so this solution wasn't really helpful in our case.
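To see why, consider the two client-side entry points React offers. A pre-rendered snapshot only works if the client hydrates the existing markup, and that markup matches what React would render on its own, which was not the case with the Puppeteer output. A minimal sketch (React 16 API):

```js
import React from "react";
import ReactDOM from "react-dom";
import App from "./App";

const root = document.getElementById("root");

// ReactDOM.render() ignores any markup already inside the root node and
// renders from scratch, which is what produced the console errors with
// the pre-rendered snapshot. ReactDOM.hydrate() instead attaches to the
// existing server-rendered markup, but it expects that markup to match
// what the client would render initially.
ReactDOM.hydrate(<App />, root);
```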
Introducing SSR
The benefit of SSR is that all the API requests stay within the local network. Since they are handled by a machine or a function running inside the VPC, there is none of the client-to-origin latency we saw earlier, unless a cold start comes into play.
An additional benefit is that we get back an HTML snapshot to which the React components know how to mount once the JS files are loaded.
And finally, we don't need that big JS bundle and the API calls to display the page. The bundle can be loaded asynchronously, and it won't block the main thread.
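To make this concrete, here is a rough sketch of what such an SSR handler can look like. The `loadDataForRoute` helper and the HTML template are simplified assumptions, not Webiny's actual code:

```js
const express = require("express");
const React = require("react");
const { renderToString } = require("react-dom/server");
const App = require("./App"); // the same root component the client uses

const app = express();

// Hypothetical helper: fetches everything the page needs while still inside
// the VPC; stubbed out here for the sake of the sketch.
async function loadDataForRoute(url) {
  return { url, pages: [] };
}

app.get("*", async (req, res) => {
  const initialData = await loadDataForRoute(req.url);
  const html = renderToString(React.createElement(App, { initialData }));

  // Inject the data into the page so the client can hydrate without
  // repeating the API calls (real code would escape this properly).
  res.send(`<!DOCTYPE html>
<html>
  <body>
    <div id="root">${html}</div>
    <script>window.__INITIAL_DATA__ = ${JSON.stringify(initialData)};</script>
    <script async src="/static/bundle.js"></script>
  </body>
</html>`);
});

app.listen(3000);
```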
Basically, SSR solves most of our challenges… well, kind of.
At this point, here is where we got to:
No more API calls, and we actually get to see our page before that big JS bundle is downloaded. But if you look closely at the first request, it took almost 2 seconds for the server to return a response for the document. Let's take a deeper look into this.
Time to First Byte problem
What's happening here is that we start a Node server, do the SSR with all the API requests and JS processing, and then return the final output. The problem is that, on average, this takes around 1-2 seconds.
Our SSR server needed to do all that work before returning the first byte of the response to the user, which causes a very long wait for the first byte. It's almost the same amount of work as before, just that it's not happening on the client side, but on the server side inside the SSR workflow.
Wait. You said server? Isn't this supposed to be serverless? We did try doing SSR in a Lambda function, but it turned out to be a very intensive process (you need to drastically increase the allocated memory size to get more CPU resources), plus there are the cold starts we talked about before... So for now, the ideal setup is using a Node server to download the site's SSR bundle and render it.
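As a rough illustration of that idea, assuming each site's SSR bundle lives at a known URL and exports a `render(url)` function (both are assumptions for the example):

```js
const fetch = require("node-fetch");
const fs = require("fs");
const path = require("path");

async function loadSsrBundle(siteId) {
  // Hypothetical location of the per-site SSR bundle.
  const url = `https://ssr-bundles.example.com/${siteId}/bundle.js`;
  const code = await (await fetch(url)).text();

  // Write the bundle to disk so it can be require()-d like a normal module.
  const file = path.join("/tmp", `${siteId}-ssr-bundle.js`);
  fs.writeFileSync(file, code);

  // Assumption: the bundle exports a render(url) => Promise<string> function.
  return require(file);
}

// Usage:
// const { render } = await loadSsrBundle("my-site");
// const html = await render(req.url);
```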
Back to the SSR results: looking at the film strip, the timing is not much different from what we had with client-side rendering.
Blank screen for 2.5 seconds 😡
Although it might not look like we've achieved any progress, we actually have! We got an HTML snapshot with all the content, which is ready to be hydrated by React, and there is no need to make any API calls since all the required data is already injected into the HTML.
The only problem is that generating this HTML snapshot takes too long. At this point, we could either invest more time into optimizing the SSR or simply cache the output and serve the snapshot from something like a Redis cache, which is exactly what we did.
SSR Caching
Once a user visits a Webiny website, we first check a centralized Redis cache to see if there is an existing HTML snapshot, and if so, we serve it from the cache. On average, this brought the "time to first byte" down to a range of 200-400ms. This is where we actually started seeing massive improvements in speed.
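Conceptually, the cache lookup sits in front of the SSR step. A simplified sketch using the ioredis client; the key scheme and the `renderWithSsr` helper (the SSR step from the previous section) are assumptions:

```js
const Redis = require("ioredis");
const redis = new Redis(process.env.REDIS_URL);

async function handleRequest(req, res) {
  // Hypothetical key scheme: one snapshot per site and path.
  const cacheKey = `ssr:${req.hostname}:${req.url}`;

  // Serve the HTML snapshot straight from the centralized cache if we have one.
  const cached = await redis.get(cacheKey);
  if (cached) {
    return res.send(cached);
  }

  // Cache miss: do the (slow) SSR, then store the snapshot for the next visitor.
  const html = await renderWithSsr(req.url);
  await redis.set(cacheKey, html);
  res.send(html);
}
```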
Right from the first view, the user gets to see the page content in under a second.
Let's have a deeper look at the waterfall chart now:
The red line shows the 800ms mark, which is when the page content is fully loaded. You can also see that the JS bundles were loaded around the 1.3s mark, but that didn't affect the time at which the user got their content. The same applies to the API calls and the main thread processing: both are no longer required to display the page content.
Note that the timings of the JS bundle, API calls, and main thread still matter, as that is when the page becomes "interactive", but for crawlers and the user's perception of "speed", it doesn't matter.
If this were a "dynamic" page, say one displaying the signed-in user in the header, the SSR would load a generic page, meaning one where the user is not signed in, and only after the page, JS bundle, and API calls are processed (time to interactive) would the header change to display the signed-in user.
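A sketch of that pattern: the server always renders the generic, signed-out header, and the client fills in the signed-in state after hydration. The `/api/me` endpoint is a hypothetical example:

```js
import React, { useEffect, useState } from "react";

function Header() {
  // SSR always renders the generic, signed-out state.
  const [user, setUser] = useState(null);

  // After hydration, the client checks the session and swaps the header.
  useEffect(() => {
    fetch("/api/me") // hypothetical endpoint
      .then(res => (res.ok ? res.json() : null))
      .then(setUser)
      .catch(() => {});
  }, []);

  return <header>{user ? `Hi, ${user.name}!` : "Sign in"}</header>;
}

export default Header;
```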
A few weeks later, we discovered that our proxy wasn't closing the client connection in the right place when triggering the SSR as a background process. A single-line fix brought the TTFB down to the 50-90ms mark and visual complete to ~600ms.
However, there is another problem…
There are only two hard things in Computer Science: cache invalidation and naming things.
- Phil Karlton
The part about cache invalidation is definitely true. Either you have a very short TTL and refresh the cache often, introducing occasional long page load times, or you have a mechanism to invalidate the cache based on certain events.
In our case, to get around this problem, we introduced a short TTL of 30s, but we also added the option to serve a stale cache while refreshing the content in the background. This way, we offset any latency and cold-start issues that the Lambdas might introduce.
This works as follows: a user visits a Webiny website, we check the HTML cache, and if there is an existing snapshot, we serve it. The snapshot can even be several days old. We serve that old snapshot to the visitor within those few hundred milliseconds, and in parallel we trigger a job to generate a new snapshot and replace the old cache entry. That job usually takes just a few seconds, as we also introduced a mechanism to always keep a bucket of pre-warmed Lambdas, so we don't pay the big cold-start cost when generating new snapshots.
This way, we always serve from the cache, and the content is refreshed on subsequent visits in case the cache is older than 30 seconds.
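Putting it together, the serve-stale-and-refresh logic looks roughly like this, reusing the `redis` client and `renderWithSsr` helper from the earlier sketch; the TTL bookkeeping and the `triggerSnapshotRefresh` job are simplified assumptions:

```js
const TTL_SECONDS = 30;

async function serveWithStaleCache(req, res) {
  const cacheKey = `ssr:${req.hostname}:${req.url}`;
  const cached = await redis.get(cacheKey);

  if (cached) {
    const { html, renderedAt } = JSON.parse(cached);
    // Always serve the snapshot we have, even if it is stale...
    res.send(html);

    // ...and if it is older than the TTL, refresh it in the background.
    if (Date.now() - renderedAt > TTL_SECONDS * 1000) {
      triggerSnapshotRefresh(req.url); // hypothetical: invokes a pre-warmed Lambda
    }
    return;
  }

  // First visit ever: render synchronously and seed the cache.
  const html = await renderWithSsr(req.url);
  await redis.set(cacheKey, JSON.stringify({ html, renderedAt: Date.now() }));
  res.send(html);
}
```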
This is definitely an area where we will introduce additional improvements. For example, we are looking at adding the option to automatically refresh the cache every time a user publishes a page. However, this is not a silver bullet.
For example, say you have a homepage that displays the last 3 blog posts, and you create and publish a new post. Technically, the cache will be regenerated for that new post only, but the homepage will still be stale.
We are still investigating the best approach, but so far the focus has been on sorting out the performance challenges. At this point, we believe we did quite a good job.
Summary
Our starting point was client-side rendering, where, on average, the visual complete metric was 3.3s. Now the visual complete is ~600ms. And, just as important, there is no more spinner.
SSR is the key player here, but without proper caching, you are just moving timing metrics from the client to the server, so the end time to "visually complete" won't change much.
SSR has the additional benefit of offsetting the CPU bottleneck on older mobile devices, which is something we did not measure in this test, but our current implementation should get around that problem as well.
Overall, doing SSR is hard, and serverless on top makes it even harder. The solution requires code changes, additional infrastructure, and an intelligent caching mechanism; but the benefits are great, and most importantly, your users will appreciate them.
Let me know if you decide to give Webiny a spin. You can host your custom apps on the Webiny managed platform, and you'll get all these amazing performance improvements out of the box. In case you have any questions or feedback, drop me a message on Twitter @SvenAlHamad, or use the comment form below.