I have been laser-focused on observability recently and thinking deeply about this topic. Today, there are a ton (a literal ton!) of third-party observability tools, each offering its own take on solving the observability problem.
The more I think about it, the more we need to stick to the basics and work with the tools provided by our cloud providers. The reason for this is quite simple - serverless systems are designed to be hyper-scalable. In an event-driven system running at scale, data moves very quickly, and when something does go wrong, it happens so fast that by the time we notice, it has probably spiralled out of control, or the truth is buried in a mountain of log data.
Anytime something goes wrong, we will be looking back in time to work out what happened. We will be forensically dissecting the data we have available to determine the root cause and the eventual resolution. We need to understand the following:
- Why did it go wrong?
- How did it go wrong?
- Which events or transactions went wrong, and how can we identify them for recovery?
In the cloud, systems must be designed to recover from failure, which can be manual or automatic; ideally, recovery will be automated. To do this, the primary thing we need is enough data to observe what our system did leading up to the problem, during the problem and after it.
We need the ability to observe what happened in the past. That is what it means for a system to be Observable.
What makes Good Observability?
In the context of looking back when something has gone wrong, the number one thing you will need is great logging! Without logs, you will never learn what went wrong - so we need to think about how we can make our logs work better:
- Capture meaningful messages about every phase of transaction processing
- Be able to identify which micro-service each log entry came from
- Identify the source code emitting the log entries (function, module, line)
Our Logs must Tell a Story
I encourage teams I work with to think of their logs as “Telling a Story”. The log entries should be meaningful about what the code is doing without being too verbose. I also encourage emitting log entries that mark the lifecycle of a Lambda function’s processing - Started, Completed or Failed. This logging is often placed into middleware or framework code so that it happens consistently. The less you need to rely on developers sticking to standards, the better!
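As a rough illustration, here is a minimal sketch of what that kind of middleware could look like in Python using only the standard library. The decorator name and service name are hypothetical; utilities such as AWS Lambda Powertools can help standardise this kind of logging without writing it yourself.

```python
import functools
import json
import logging

logger = logging.getLogger("orders-service")  # hypothetical service name
logger.setLevel(logging.INFO)


def lifecycle_logging(handler):
    """Sketch of middleware that logs Started / Completed / Failed around a Lambda handler."""
    @functools.wraps(handler)
    def wrapper(event, context):
        logger.info(json.dumps({"lifecycle": "Started", "requestId": context.aws_request_id}))
        try:
            result = handler(event, context)
            logger.info(json.dumps({"lifecycle": "Completed", "requestId": context.aws_request_id}))
            return result
        except Exception:
            # logger.exception also captures the stack trace for the forensic work later
            logger.exception(json.dumps({"lifecycle": "Failed", "requestId": context.aws_request_id}))
            raise
    return wrapper


@lifecycle_logging
def handler(event, context):
    # Business logic goes here - the lifecycle entries are emitted regardless
    return {"statusCode": 200}
```

Because the wrapper owns the lifecycle entries, developers only write the business-level log lines, and the Started/Completed/Failed story is told for every invocation automatically.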
Logs must be structured
It is one thing to write logs that “Tell a Story”, but we need metadata to be emitted with every log message, or we won’t be able to do enough searching, grouping, slicing and dicing of our logs when it matters. To do this, our logs must be emitted in a structured JSON format, which is recognised by every log aggregation tool on the market today - even CloudWatch!
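To make that concrete, here is a minimal sketch using Python’s standard logging module to render every record as a single JSON line carrying the service name and the source location. The service name is a hypothetical example, and libraries such as AWS Lambda Powertools provide an equivalent structured logger out of the box.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Sketch: render every log record as one JSON line so fields can be indexed and searched."""

    def __init__(self, service_name):
        super().__init__()
        self.service_name = service_name

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": self.service_name,   # which micro-service emitted the entry
            "module": record.module,        # source module
            "function": record.funcName,    # source function
            "line": record.lineno,          # source line number
            "message": record.getMessage(),
        }
        return json.dumps(entry)


logger = logging.getLogger("orders-service")          # hypothetical service name
stream = logging.StreamHandler()                      # Lambda ships stdout/stderr to CloudWatch Logs
stream.setFormatter(JsonFormatter("orders-service"))
logger.addHandler(stream)
logger.setLevel(logging.INFO)

logger.info("Payment accepted")
```

Every entry now carries the same named fields, which is exactly what the log aggregation tooling needs to extract them.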
The image below shows how fields are extracted by CloudWatch Logs Insights, which turns your log properties into database table attributes that you can filter on, group by and sort. This makes understanding how something went wrong a breeze and enables you to use stats functions to create new metrics!
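As an illustration of the kind of query this unlocks, here is a sketch that runs a CloudWatch Logs Insights query from Python with boto3. The log group name is hypothetical, and the field names assume the structured log shape sketched earlier.

```python
import time

import boto3

logs = boto3.client("logs")

# Hypothetical log group for the orders service
query_id = logs.start_query(
    logGroupName="/aws/lambda/orders-service",
    startTime=int(time.time()) - 3600,  # last hour
    endTime=int(time.time()),
    queryString="""
        fields @timestamp, level, message
        | filter service = "orders-service" and level = "ERROR"
        | stats count(*) as errorCount by bin(5m)
    """,
)["queryId"]

# Poll until the query finishes, then inspect the grouped error counts
while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result["results"]:
    print({field["field"]: field["value"] for field in row})
```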
Don’t just emit technical metrics. Think about business KPIs and design metrics around those.
Logging is not enough. Metrics are powerful and can be monitored to raise alerts that notify operational staff or developers that something needs to be looked at. With metrics, we can also apply CloudWatch Anomaly Detection, which uses machine learning models to learn your system’s normal behaviour and then alert when that behaviour changes. This is useful because problems are often signalled by a metric climbing too high, but the absence of a metric is just as important!
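One low-friction way to emit a business KPI from a Lambda function is the CloudWatch Embedded Metric Format (EMF), where a printed JSON line is turned into a real metric with no API calls. The sketch below is illustrative; the namespace, metric names and service name are assumptions, not a prescribed convention.

```python
import json
import time


def emit_orders_placed_metric(order_value: float) -> None:
    """Sketch: publish a business KPI via the CloudWatch Embedded Metric Format (EMF)."""
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "OrdersService",      # hypothetical namespace
                "Dimensions": [["Service"]],
                "Metrics": [
                    {"Name": "OrdersPlaced", "Unit": "Count"},
                    {"Name": "OrderValue", "Unit": "None"},
                ],
            }],
        },
        "Service": "orders-service",
        "OrdersPlaced": 1,
        "OrderValue": order_value,
    }))
```

An anomaly detection alarm attached to a metric like this fires when the value drifts outside the learned band - including when orders stop arriving altogether, which covers the “absent metric” case described above.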
The third pillar of Observability is tracing - visualising a transaction’s path through your system is powerful, and tracing unlocks this capability. Alongside visualisation, tracing also provides performance metrics so you can monitor the speed of your functions and of the individual sub-sections of your transaction processing.
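On AWS, the native option is X-Ray. The sketch below assumes the aws_xray_sdk package is bundled with the function and that Active Tracing is enabled on it; the subsegment names are purely illustrative.

```python
from aws_xray_sdk.core import patch_all, xray_recorder

# Patch supported libraries (e.g. boto3) so downstream AWS calls appear in the trace
patch_all()


@xray_recorder.capture("validate-order")   # hypothetical subsegment name
def validate_order(order):
    # Validation logic; its duration shows up as its own timed subsegment
    return bool(order.get("items"))


@xray_recorder.capture("persist-order")
def persist_order(order):
    # Downstream calls made here are traced automatically once patched
    ...


def handler(event, context):
    # The Lambda service creates the parent segment when Active Tracing is enabled
    if validate_order(event):
        persist_order(event)
    return {"statusCode": 200}
```

Each captured function becomes a timed node in the trace, so slow sub-sections of the transaction stand out immediately in the service map.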
Combining all three pillars - Logs, Metrics and Traces - will give you Good Observability of your system when things go wrong.
I have deliberately focused on achieving observability without using third-party tools, mainly because there are so many! But also, each tool offers different capabilities, so you need to be able to observe your system from the very start and allow yourself time to choose a third-party observability tool if you feel one is required to meet your needs.
We need to remember that when running an event-driven system at scale, it will be the detail and the story your logs tell that help you understand what happened. So don’t get hung up on needing “immediate feedback”, because by the time you react to an alert, seconds will have passed and a few hundred or thousand transactions will already have gone through. Be prepared to break out your forensic kit and uncover the story of the transactions that went wrong with nicely detailed logs.
This should be easy to do now!