part 1: new challenges to observability
part 2: 1st party observability tools from AWS [this post]
part 3: 3rd party observability tools
part 4: the future of Serverless observability
In part 1 we talked about the challenges serverless brings to the table. In this post, let’s look at the 1st party tools from AWS.
Out of the box we get a bunch of tools provided by AWS itself:
- CloudWatch for monitoring, alerting and visualization
- CloudWatch Logs for logs
- X-Ray for distributed tracing
- Amazon Elasticsearch for log aggregation
CloudWatch Logs
Whenever you write to stdout, those outputs are captured by the Lambda service and sent to CloudWatch Logs as logs. This is one of the few kinds of background processing you get, as it’s provided by the platform.
All the log messages (technically they’re referred to as events) for a given function appear in CloudWatch Logs under a single Log Group.
Within a Log Group, you have many Log Streams. Each contains the logs from one concurrent execution (or container) of your function, so there’s a one-to-one mapping between containers and Log Streams.
So that’s all well and good, but it’s not easy to search for log messages in CloudWatch Logs. There’s currently no way to search the logs for multiple functions at once. Whilst AWS has been improving the service, it still pales in comparison to the alternatives on the market.
It might suffice as you start out, but you’ll probably find yourself in need of something more soon after.
Fortunately, it’s straightforward to get your logs out of CloudWatch Logs.
You can stream them to Amazon’s hosted Elasticsearch service. But don’t expect it to be a like-for-like experience with your self-hosted ELK stack. Liz Bennett wrote a detailed post on some of the problems they ran into when using Amazon Elasticsearch at scale. Please give that a read if you’re thinking about adopting Amazon Elasticsearch.
Alternatively, you can stream the logs to a Lambda function, and ship them to a log aggregation service of your choice. I won’t go into detail here as I have written about it at length previously; just go and read this post instead.
You can stream logs from CloudWatch Logs to just about any log aggregation service, via Lambda.
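To give a feel for what the log-shipping function has to do, here is a minimal sketch in Python. CloudWatch Logs delivers subscription data as a base64-encoded, gzip-compressed JSON document under `awslogs.data`. The flattened record shape and the function names are my own choices for illustration, and the call to your aggregation service is left as a placeholder.

```python
import base64
import gzip
import json


def decode_subscription_event(event):
    """Unpack the log events from a CloudWatch Logs subscription payload."""
    # The payload arrives base64-encoded and gzip-compressed.
    compressed = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(compressed))
    # Flatten into one record per log event (record shape is illustrative).
    return [
        {
            "log_group": payload["logGroup"],
            "log_stream": payload["logStream"],
            "timestamp": log_event["timestamp"],
            "message": log_event["message"],
        }
        for log_event in payload["logEvents"]
    ]


def handler(event, context):
    records = decode_subscription_event(event)
    # Placeholder: send `records` to the log aggregation service of your
    # choice here. That call is deliberately left out of this sketch.
    return len(records)
```

The decoding is kept separate from the handler so it can be tested without a Lambda environment, and so the shipping step can be swapped without touching the parsing.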
CloudWatch Metrics
With CloudWatch, you get some basic metrics out of the box: invocation count, error count, invocation duration, etc. All the basic telemetry about the health of a function.
But CloudWatch is missing some valuable data points, such as:
- estimated costs
- concurrent executions: CloudWatch only reports this for functions with reserved concurrency
- cold starts
- billed duration: Lambda reports this in CloudWatch Logs, at the end of every invocation. Because Lambda invocations are billed in 100ms blocks, a 102ms invocation would be billed for 200ms. It would be a useful metric to see alongside Invocation Duration to identify cost optimizations
- memory usage: Lambda reports this in CloudWatch Logs too, but it’s not recorded in CloudWatch
You get 6 basic metrics about the health of a function.
There are ways to record and track these metrics yourself; see this post on how to do that. Other providers like IOPipe (more on them in the next post) also report these data points out of the box.
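As a rough illustration of recovering billed duration and memory usage yourself, the sketch below parses the REPORT line that Lambda writes to CloudWatch Logs at the end of an invocation. The regular expression assumes the usual field order (Duration, Billed Duration, Memory Size, Max Memory Used), so check it against your own logs. The function names are hypothetical, and the 100ms block size is the one described above.

```python
import math
import re

# Assumed layout of the REPORT line; fields are separated by tabs or spaces.
REPORT_PATTERN = re.compile(
    r"REPORT RequestId: (?P<request_id>\S+)\s+"
    r"Duration: (?P<duration_ms>[\d.]+) ms\s+"
    r"Billed Duration: (?P<billed_ms>\d+) ms\s+"
    r"Memory Size: (?P<memory_size_mb>\d+) MB\s+"
    r"Max Memory Used: (?P<max_memory_used_mb>\d+) MB"
)


def parse_report_line(message):
    """Return the usage figures from a REPORT line, or None for other lines."""
    match = REPORT_PATTERN.search(message)
    if match is None:
        return None
    return {
        "request_id": match["request_id"],
        "duration_ms": float(match["duration_ms"]),
        "billed_duration_ms": int(match["billed_ms"]),
        "memory_size_mb": int(match["memory_size_mb"]),
        "max_memory_used_mb": int(match["max_memory_used_mb"]),
    }


def billed_duration_ms(duration_ms, block_ms=100):
    """Round a duration up to the next billing block."""
    return math.ceil(duration_ms / block_ms) * block_ms
```

With these figures you could publish billed duration and memory usage as custom metrics, and compare `billed_duration_ms(duration)` against the actual duration to spot functions that sit just over a block boundary.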
You can set up Alarms in CloudWatch against any of these metrics. Here are some good candidates:
- throttled invocations
- regional concurrent executions: set threshold based on % of your current regional limit
- tail (95th or 99th percentile) latency against some acceptable threshold
- 4xx and 5xx errors on API Gateway
And you can set up basic dashboards in CloudWatch too, at $3 per month per dashboard (first 3 are free).
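Two of the alarm candidates above could be sketched as follows. The functions only build parameter dictionaries in the shape that CloudWatch's `PutMetricAlarm` API expects, so the block runs without any AWS dependency. The alarm names, periods, evaluation periods and the 80% default are assumptions I picked for illustration, not recommendations.

```python
def concurrency_alarm(regional_limit, percent=80):
    """Alarm when regional concurrency exceeds a share of the account limit."""
    return {
        "AlarmName": f"lambda-regional-concurrency-above-{percent}-percent",
        "Namespace": "AWS/Lambda",
        "MetricName": "ConcurrentExecutions",
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 1,
        # Threshold is derived from the current regional limit.
        "Threshold": regional_limit * percent / 100,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def tail_latency_alarm(function_name, threshold_ms, percentile="p99"):
    """Alarm when a function's tail latency exceeds an acceptable threshold."""
    return {
        "AlarmName": f"{function_name}-{percentile}-latency",
        "Namespace": "AWS/Lambda",
        "MetricName": "Duration",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        # Percentiles use ExtendedStatistic rather than Statistic.
        "ExtendedStatistic": percentile,
        "Period": 60,
        "EvaluationPeriods": 5,
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
    }
```

If you use boto3, each dictionary can be passed along the lines of `cloudwatch.put_metric_alarm(**concurrency_alarm(1000))`. You would normally add alarm actions (such as an SNS topic to notify) as well.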
X-Ray
For distributed tracing, you have X-Ray. To make the most of tracing, you should instrument your code to gain even better visibility.
As with CloudWatch Logs, collecting traces does not add additional time to your function’s invocation. It’s background processing that the platform provides for you.
From the tracing data, X-Ray can also show you a service map like this one.
X-Ray gives you a lot of insight into the runtime performance of a function. However, its focus is narrowly on one function, and the distributed aspect is severely undercooked. As it stands, X-Ray currently doesn’t trace over API Gateway, or asynchronous invocations such as SNS or Kinesis.
It’s good for homing in on performance issues for a particular function. But it offers little to help you build intuition about how your system operates as a whole. For that, I need to step away from what happens inside one function, and be able to look at the entire call chain.
After all, when the engineers at Twitter were talking about the need for observability, it wasn’t so much to help them debug performance issues of any single endpoint, but to help them make sense of the behaviour and performance of their system. A system that is essentially one big, complex and highly connected graph of services.
With Lambda, this graph is going to become a lot more complex, more sparse and more connected because:
- instead of one service with 5 endpoints, you now have 5 functions
- functions are connected through a greater variety of mediums: SNS, Kinesis, API Gateway, IoT, you name it
- event-driven architecture has become the norm
Our tracing tools need to help us make sense of this graph. They need to help us visualize the connections between our functions. And they need to help us follow data as it enters our system as a user request, and reaches out to the far corners of this graph through both synchronous and asynchronous events.
And of course, X-Ray does not span over non-AWS services such as Auth0, or Google BigQuery, or Azure functions.
But those of us deep in the serverless mindset see the world through SaaS-tinted glasses. We want to use the services that best address our needs, and glue them together with Lambda.
At Yubl, we used a number of non-AWS services from Lambda. Auth0, Google BigQuery, GrapheneDB, MongoLab, and Twilio to name a few. And it was great, we didn’t have to be bound by what AWS offers.
My good friend Raj also did a good talk at NDC on how he uses services from both AWS and Azure in his wine startup. You can watch his talk here.
And finally, I think of our system like a brain. Like a brain, our system is made up of:
- neurons (functions)
- synapses (connections between functions)
- and electrical signals (data) that flow through them
Like a brain, our system is alive, it’s constantly changing and evolving and it’s constantly working! And yet, when I look at my dashboards and my X-Ray traces, that’s not what I see. Instead, I see a tabulated list that does not reflect the movement of data and areas of activity. It doesn’t help me build up any intuitive understanding of what’s going on in my system.
A brain surgeon wouldn’t accept this as the primary source of information. How can they possibly use it to build a mental picture of the brain they need to cut open and operate on?
I should add that this is not a criticism of X-Ray; it is built the same way most observability tools are built.
But maybe our tools need to evolve beyond human computer interfaces (HCI) that wouldn’t look out of place on a clipboard (the physical kind, if you’re old enough to have seen one!). And it actually reminds me of one of Bret Victor’s seminal talks, Stop Drawing Dead Fish.
Netflix made great strides towards this idea of a live dashboard with Vizceral, which they have also kindly open sourced.
Conclusions
AWS provides us with some decent tools out of the box. Whilst they each have their shortcomings, they’re good enough to get started with.
As 1st party tools, they also enjoy home field advantages over 3rd party tools. For example, Lambda collects logs and traces without adding to your function invocation time. Since we can’t access the server anymore, 3rd party tools cannot perform any background processing. Instead they have to resort to workarounds or are forced to collect data synchronously.
However, as our serverless applications become more complex, these tools need to either evolve with us or be replaced in our stack. CloudWatch Logs, for instance, cannot search across multiple functions. It’s often the first piece that needs to be replaced once you have more than a dozen functions.
In the next post, we will look at some 3rd party tools such as IOPipe, Dashbird and Thundra. We will discuss their value-add proposition as well as their shortcomings.