Recently, I spoke at two meetups on “Unlocking Serverless Observability”. It is an important topic, so I have decided to publish an article about the demo project I put together to illustrate it: When Serverless Goes Wrong.
A Working Sample
Supporting my recent talks and this article is a working (and non-working) example of an event-driven data integration architecture, which can be found on my GitHub page. The README is comprehensive and explains how to install and run the performance tests if you want to see it for yourself.
WARNING: Executing a performance test, even with AWS Lambda, may incur charges on your AWS account. Given the generous free tier, this is unlikely, but you have been warned.
About the Project
Alice works for Widgets Incorporated, a popular online retailer of technology widgets. Alice keeps the lights on for the e-commerce website running on a large EC2 Instance on the AWS cloud. Alice’s most important job is ensuring the weekly sales export runs for the sales and marketing team.
One day, Max from sales and marketing seeks out Alice and demands that sales data be made available more often than just weekly! In the same conversation, he also shares that they are changing delivery providers and will need to integrate with Ace Courier’s scheduling API to coordinate deliveries.
Alice is very worried about adding more processing to the existing e-commerce platform. It is at capacity, and Widgets Inc. doesn’t have a large budget to do much more about it! Alice digs through documentation wikis and Stack Overflow, trying to work out how this problem can be resolved. Just as she is ready to give up, someone replies to her question on Stack Overflow: Widgets Inc.’s e-commerce solution has a lesser-known notification webhook feature that can be turned on through configuration.
Excitedly, she turns to Serverless Land to learn more about Event Driven Architecture so she can build a serverless event hub to handle live order notifications. With this simple configuration change, she knows she can deliver live sales data for Max in Sales and Marketing and integrate live with Ace Courier’s scheduling API.
On Serverless Land, Alice finds a GitHub reference to an Event Driven pattern that solves her exact problem. The pattern she found is for Java, but that is okay, because the existing SAM template will accelerate the IaC aspect of building out the serverless event hub in Python.
How Alice Built It
Alice used AWS Lambda Powertools to simplify her code with logging and tracing. For the Delivery API Lambda, she used the tenacity library, which is great for adding retries to any Python function, to ensure the Lambda recovered from transient failures.
Alice used AWS Lambda Powertools for Python, which has an excellent structured logger. She made good use of this and set the service name to enable logs to be grouped by each service processing data. Alice leveraged the correlation_id_path log setting, which enables JSON-path extraction of attribute values from the AWS event being processed to automatically populate the correlation_id, so she could trace transactions across the entire distributed stack. Alice also listened to my Getting Started tips by making sure the logs tell a story and using the START, COMPLETED, and FAILED status values as a status JSON attribute on key logs to assist in tracing.
Alice did quite well and completed the build reasonably quickly. She had good logging and retries for API integration failures, which are all great considerations for building a resilient event-driven architecture.
So What Went Wrong?
One of the key things Alice didn’t consider in her build, which is common for many teams I work with, is understanding the transaction volumes the architecture will be processing. This is critical to understand, since serverless is hyper-scalable and will handle almost any traffic you throw at it! Alice also didn’t consider the potential limits of the external courier API she was interacting with; understanding the constraints of ALL your downstream systems is just as critical! If you know your volumes and downstream limits, you will quickly learn up-front, at design time, whether any scalability boundaries will be problematic.
When Serverless Goes Wrong
To demonstrate the problem when scalability boundaries clash, I built Alice’s demo architecture and introduced a deliberate scalability boundary through a slow-api service representing Ace Couriers.
The Demo Project Constraints
There are 3 components to this project:
- NotificationFunction - this takes the payload from the API, injects a correlation_id and forwards it to EventBridge as a notify-order event.
- DeliveryFunction - This is a simple function that takes the detail of the event, removes the metadata from the body, and sends it to the configured API endpoint. This Lambda uses the tenacity library for retries, with a wait time of between 3 and 8 seconds after any failure when calling the “slow API”.
- SlowAPI - This is an API set up to mimic an unstable real-world API, with the following constraints:
- 20% of the API Calls will fail immediately.
- The API will take between 0 and 2 seconds to return a response when it runs successfully.
The SlowApi Lambda is triggered by an API Gateway route with an API key and a usage plan that restricts it to 10 transactions per second (BurstLimit = 10, RateLimit = 10). The NotificationHandler is also triggered by an API Gateway, but with no API key and no rate limiting.
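The retry behaviour in the DeliveryFunction comes from tenacity’s retry decorator. A hand-rolled equivalent, sketched below with only the standard library so the mechanics are visible (the real code would use something like tenacity’s `@retry(wait=wait_random(min=3, max=8))`), looks roughly like this:

```python
import random
import time

def retry_with_random_wait(min_wait, max_wait, max_attempts=5):
    """Retry a function on any exception, sleeping a random interval
    between attempts - roughly what tenacity's wait_random provides."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # give up after the final attempt
                    time.sleep(random.uniform(min_wait, max_wait))
        return wrapper
    return decorator

# The demo waits 3-8 seconds; tiny values here just keep the example fast.
@retry_with_random_wait(min_wait=0.01, max_wait=0.02, max_attempts=4)
def call_slow_api(responses):
    """Stand-in for the Slow API: pops one canned outcome per call."""
    outcome = responses.pop(0)
    if outcome == "fail":
        raise RuntimeError("simulated failure")
    return outcome

print(call_slow_api(["fail", "fail", "ok"]))  # succeeds on the third attempt
```

Note the crucial property this introduces: every failure adds seconds of extra Lambda execution time, which is exactly what drives up concurrency when the downstream API starts rejecting calls.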
There is a configured Artillery test file that can be used to run a performance test with the following characteristics:
- For 1 minute, the test will hit the Notification API at approx. 2 transactions per second.
- For the next 2 minutes, transactions will ramp up to 5 transactions per second
- For the next 10 minutes, transactions ramp from 4 per second to 15 per second and then remain at that level, which exceeds the SlowAPI transaction limit.
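For reference, phases like these map onto an Artillery config roughly as follows. The target URL and request body are placeholders; the real test file in the repo is the source of truth:

```yaml
config:
  target: "https://example.execute-api.us-east-1.amazonaws.com"  # placeholder
  phases:
    - duration: 60        # 1 minute at a steady 2 TPS
      arrivalRate: 2
    - duration: 120       # 2 minutes ramping up to 5 TPS
      arrivalRate: 2
      rampTo: 5
    - duration: 600       # 10 minutes ramping 4 -> 15 TPS, past the SlowAPI limit
      arrivalRate: 4
      rampTo: 15
scenarios:
  - flow:
      - post:
          url: "/notify"  # placeholder route on the Notification API
          json:
            order_id: 1234
```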
There are 2 branches on the demo project:
- no-flow-control is what Alice built.
- flow-control is the same architecture, but with flow control introduced to ensure the event integration flow is controlled.
What Happened During the Performance Test?
The Slow API worked quite well - it had errors, but these were planned errors, approximately 20% of all requests. So the graph below is a fairly solid representation of a 20% error rate with the green line wavering around the 80% success mark as planned. Nice to see part of the system worked!
As you can see from the Delivery Lambda Success dashboard graph below, the Delivery Lambda was fine for the first seven (7) minutes before the success rate started to drop. What is impressive from a Lambda perspective is that at the 8-minute mark, we still have a success rate greater than 90%, whilst errors sit at nearly 50 in that minute. This speaks to the automatic resilience built into the AWS Lambda asynchronous invocation mechanism. Sadly, as time continues, the success rate bottoms out at approximately 20% at the 17-minute mark, with a peak error rate of 1,535 failed Lambda invocations. This graph shape is something you will see a LOT on the metrics dashboard for your Lambda functions when errors are occurring. The important thing to look at is the bottom of the range for the % Success (where it bottoms out), since it may bottom out at 97%, which from an overall system health perspective is still quite healthy.
When you have scalability boundaries clashing, you will see steady growth in your overall Lambda concurrency metric, as shown below. This graph peaks at 788, just shy of my account-level limit of 1,000 concurrent invocations. A common reaction on seeing this is to urgently reach out to your AWS Technical Account Manager and quickly increase the account concurrency so that you have more headroom. Unfortunately, this reaction only creates a larger pool of concurrency for the out-of-control Lambda function to draw from, which will make the graph peak much larger and effectively make your bill larger too.
When you have no flow control built into your system, there is not much you can do at this point but ride out the storm, or disable Lambda function triggers to stop your system from creating bigger problems for the downstream systems. This is the key architectural component missing from this event-driven architecture: Flow Control. The overall invocation count metric graph tells a very similar story, as expected.
Why Did This Happen?
The scale madness happens because we have a scalability boundary between the Delivery API Lambda and the Ace Courier system. Any time we have a hyper-scalable system talking to a less scalable system, or a system of unknown scale, we have a scalability boundary. This is where scale differences occur and cause problems. As serverless architects, we need to identify these boundaries and design our architectures to ensure our systems remain resilient and working at all times. This boundary can also be considered a domain boundary: we must know all interactions beyond our bounded context and manage failure carefully to avoid losing data to a Retry Storm, which is what causes the graph peaks and the high concurrency load.
How to Fix It
To resolve scalability boundary clashes, we need to add a flow-control buffer that can provide a retry mechanism on failure. The best example is an SQS queue as the buffer element, coupled with the maximum concurrency configuration on the Lambda trigger’s Event Source Mapping. This combination is the most powerful flow control mechanism available on AWS today.
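As a sketch, wiring this up in a SAM template looks roughly like the fragment below. The resource names and values are hypothetical, but ScalingConfig / MaximumConcurrency is the real event source mapping setting that caps how many concurrent Lambda invocations the queue can drive:

```yaml
  DeliveryQueue:              # hypothetical buffer queue between the services
    Type: AWS::SQS::Queue
    Properties:
      VisibilityTimeout: 60   # should exceed the function timeout

  DeliveryFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: app.handler
      Runtime: python3.12
      Events:
        DeliveryEvent:
          Type: SQS
          Properties:
            Queue: !GetAtt DeliveryQueue.Arn
            BatchSize: 1
            ScalingConfig:
              MaximumConcurrency: 10   # sized to the downstream 10 TPS limit
```

With this in place, messages queue up in SQS during bursts instead of fanning out into unbounded Lambda concurrency, and failed deliveries go back on the queue for redelivery rather than fuelling a retry storm.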
How does Flow Control help?
The Slow API in our system with flow control behaves as expected: the success rate wavers around the 80% mark by design. The error rates are much lower, thanks to the introduction of flow control, which limits the number of concurrent calls into the Slow API endpoint. As you can see, the Slow API (or Ace Couriers system) is much calmer and not spiraling out of control like before. A graph like this is not ideal, because there are errors, but it does not show a system that is completely out of control, so there is time to correct the problem. In this case, the error rate exists because we coded it in, but in a real system with transient errors like this, we would look at the logging to determine the issue and then deploy a fix.
The Delivery API Success Graph shows the switching of success to error and back again - but notice the lowest value on the Y-axis is around the 98% mark. So although there are errors, it is not disastrous (yet).
With flow control in place, the concurrent Lambda execution graph is very different: a steady flat line, which is what we expect to see in a healthy system. It hovers around a total concurrency of 20 across the entire system, which is very healthy!
Invocations are also okay, as you can see from the graph.
In these last 2 graphs, you can see 2 anomalous spikes. Something odd happened with the metrics for the Notification function: at 13:55, CloudWatch stopped recording metrics for this function and instead recorded all the counts in 2 batches, at 14:12 and again at 14:28. Outside of these 2 anomalies, you can see the steady, even pace at which the Lambda invocations happen through the flow-control mechanism provided by the SQS queue.
Why Is Structured Logging So Important?
One of the things Alice did well was to use the AWS Lambda Powertools Logging utility, which emits structured logs in JSON format. The logging utility captures the transaction correlation_id along with function details from the Lambda context and a defined service name, like the entry shown below.
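For illustration, a Powertools-style structured entry looks something like this (all the field values here are invented for the example):

```json
{
  "level": "INFO",
  "timestamp": "2023-02-10 13:55:01,123",
  "service": "delivery-handler",
  "correlation_id": "c0ffee00-1111-2222-3333-444455556666",
  "function_name": "DeliveryFunction",
  "function_request_id": "a1b2c3d4-5678-90ab-cdef-112233445566",
  "cold_start": false,
  "status": "START",
  "message": "processing delivery request"
}
```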
The additional fields enable Alice to search for everything that happened to a single transaction to Tell the Story of what happened. An example output of a correlation_id story is shown below:
A practice I get all my teams to adopt is emitting a START, COMPLETED, or FAILED log message as a status. This makes it easy to query logs over a time frame to determine whether everything was processed correctly. Often these messages are built into the Lambda handler framework itself through middleware components, so that developers don’t have to do anything to get this happening. In this way, we can compare the count of START messages in the notify-handler service to the count of COMPLETED status messages in the delivery-handler service.
As you can see, for the No Flow Control test, the actual success rate was terrible.
The Flow Control test, on the other hand, had a successful outcome.
As you can see from this experiment, adding Flow Control at the external communication points of each bounded context of your distributed system makes for a more successful event processing flow. The other points critical to building an Event-Driven Architecture for success are:
- Understand your expected Volumes up-front and model them. Understand what your real Transaction rate per second (TPS) is.
- Identify the rate limits and constraints of the systems you interface with. It’s important to understand where the scale clashes will occur.
- Add flow control to your Bounded Domains to assist with scalability boundary clashes.
- Use structured logging to enable forensic analysis when things go wrong.
- Emit a simple message to highlight the beginning of a Lambda execution and another to indicate whether it succeeded. Use these to create metrics for comparison and to measure the success of your system.
- Think about Business KPI metrics - I keep saying this, but do think about them - they are also important, and the absence of a business metric can indicate something is not working correctly.
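The START/COMPLETED comparison in the points above maps directly onto a CloudWatch Logs Insights query. Something like the following (the field names assume the structured log shape described earlier) counts completions per minute for one service:

```
fields @timestamp
| filter status = "COMPLETED" and service = "delivery-handler"
| stats count(*) as completed by bin(1m)
```

Running the same query with status = "START" against the notify-handler service gives the other side of the comparison.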
Observability is important, and it’s not that hard to get it right. With some planning and the useful techniques highlighted here, you can quickly measure your serverless system’s success!