Watch for Scalability Boundaries
Mar 19, 2023 · 5 min read
Scalability boundaries are the number one enemy of Serverless. Know how to watch out for them and what to do when things go wrong.


One of the biggest problems with deploying code and building services on a highly scalable platform comes when you need to interact with an external service that may not operate at quite the same scale. This is what I call a scalability boundary: a place where different scales meet and are likely to clash.

The Importance of Flow Control

Serverless systems are high-scale systems, and in my experience, high-scale event-driven architectures need flow control. Flow control in enterprise architectures typically refers to SQS queues or Kinesis streams on AWS: a place where transactions are persisted while they wait to be processed by one or more consumers. The key is to tune the maximum concurrency on the Event Source Mapping Lambda trigger so that your system throughput at the scalability boundary does not exceed the external integration endpoint's rate limit.

The Event Source Mapping maximum concurrency setting is essential in ensuring Lambda execution throttles do not cause events to be lost or sent straight to your Dead Letter Queues, which can happen when you only use Lambda concurrency as a throttle control. Where Event Source Mapping triggers are not available, you will need to rely on the Lambda reserved concurrency setting to control transaction throughput, and clients will need to handle any throttles that occur. On AWS, if you are using EventBridge, you can also leverage API Destinations, a purpose-built integration capability with built-in rate-limiting controls.
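
As a minimal sketch of that tuning (assuming an existing SQS-triggered function; the mapping UUID and the concurrency value below are placeholders), the Event Source Mapping's maximum concurrency can be set through the Lambda API:

```python
import boto3

lambda_client = boto3.client("lambda")

# Cap the SQS Event Source Mapping at 5 concurrent Lambda executions so
# throughput at the scalability boundary stays under the external
# endpoint's rate limit. The UUID is a placeholder for the existing
# mapping's identifier.
lambda_client.update_event_source_mapping(
    UUID="a1b2c3d4-5678-90ab-cdef-EXAMPLE11111",
    ScalingConfig={"MaximumConcurrency": 5},
)
```

With this cap in place, excess messages simply wait in the queue instead of driving Lambda into throttling and pushing events towards the Dead Letter Queue.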


Understand your Volumes

This is why I keep saying: "Know your volumes"! Calculate your volumes and understand your system throughput. Putting together a volumetric model in an Excel spreadsheet will help you plan for performance testing and give you a reference point for the "whys" of the system architecture and service configurations. Volumetric models are an important architectural tool that many architects do not think about, and those architects ultimately end up with scalability boundary clashes. Volumetric models also inform the future you about the "whys" involved in creating your solution today.
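
A spreadsheet works well, and the same arithmetic fits in a few lines of code. The sketch below uses purely illustrative numbers and Little's law (required concurrency ≈ arrival rate × average duration) to show the kind of questions a volumetric model answers:

```python
# Volumetric model sketch -- all numbers are illustrative assumptions.
PEAK_TPS = 200                   # assumed peak transactions per second
AVG_DURATION_SECONDS = 0.4       # assumed average Lambda duration
DOWNSTREAM_RATE_LIMIT_TPS = 50   # assumed external API rate limit

# Little's law: concurrency needed ~= arrival rate x average processing time.
required_concurrency = PEAK_TPS * AVG_DURATION_SECONDS
print(f"Lambda concurrency needed at peak: {required_concurrency:.0f}")

if PEAK_TPS > DOWNSTREAM_RATE_LIMIT_TPS:
    backlog_per_second = PEAK_TPS - DOWNSTREAM_RATE_LIMIT_TPS
    print(f"Scalability boundary clash: {backlog_per_second} transactions/sec "
          "must be buffered (SQS/Kinesis) and drained within the rate limit.")
```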

Scalability boundaries become problematic for Serverless systems in a few different ways. The most common problems are listed below.

  • Connection-oriented services with finite connection limits.
  • External API services that are rate limited.
  • Slow external API services, which essentially behave like rate-limited APIs.
  • Not understanding how the services you use work, along with the quotas and limits that apply.

Understanding how the services you use work is critical

But it Worked in Testing?

I have had this experience when integrating with an external partner API secured with OAuth as the authentication mechanism. In this instance the API, which was poorly documented, had authentication session limits: each API key was limited to ten tokens, and as soon as the token count exceeded ten, the oldest token was instantly invalidated. This all worked fine in development and testing, but as soon as the system operated at scale there were problems, and work was needed to change how we dealt with OAuth API tokens.
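
One mitigation, sketched below under assumptions (a standard OAuth client-credentials flow; the endpoint URL and credential names are hypothetical, not the partner's real API), is to cache the token per Lambda execution environment and reuse it until close to expiry, rather than minting a new token on every invocation:

```python
import time

import requests  # assumed to be packaged with the function

# Hypothetical partner endpoint and credentials -- not the real API.
TOKEN_URL = "https://partner.example.com/oauth/token"
CLIENT_ID = "my-client-id"
CLIENT_SECRET = "my-client-secret"

# Module-level cache: reused across invocations of the same warm
# execution environment, so we stop minting a new token (and silently
# invalidating an old one) on every call.
_cached_token = None
_expires_at = 0.0


def get_token() -> str:
    global _cached_token, _expires_at
    # Refresh slightly early so we never use a token right at expiry.
    if _cached_token is None or time.time() > _expires_at - 60:
        response = requests.post(
            TOKEN_URL,
            data={"grant_type": "client_credentials"},
            auth=(CLIENT_ID, CLIENT_SECRET),
            timeout=5,
        )
        response.raise_for_status()
        body = response.json()
        _cached_token = body["access_token"]
        _expires_at = time.time() + body.get("expires_in", 3600)
    return _cached_token
```

At high concurrency you still end up with one token per warm execution environment, so a hard cap of ten tokens may also require a shared cache (DynamoDB, Parameter Store or similar) to hold the token centrally.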

If you are building an API, ensure you have good documentation!

What happens when Scale Clashes?

When two systems clash at scale, the lower-volume service will be overrun and become extremely slow or totally non-responsive, depending on how bad the scale difference is. I have seen this often with a Redis cache that is not properly sized. When Redis gets overrun, what usually happens is that the connection-processing sub-system becomes super busy, and Redis spends most of its CPU servicing connection requests, resulting in low performance for the actual cache queries and lookups. With Redis running slow, the Lambda code trying to use it increases in execution duration due to wait times. Often the Lambda execution will time out, and you will lose the Lambda event unless you have designed your system for failure recovery.

When I witnessed this problem with Redis, it took some time to trace it back to Redis, since I saw timeouts and throttles elsewhere in my system rather than where Redis was involved. You end up chasing your tail for a while before you get to the fact that the Redis cluster simply wasn't right-sized!
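
Two small defences help at this kind of boundary. The sketch below assumes redis-py and a hypothetical cache host: the client is created once per execution environment so connections are reused rather than adding churn to the connection sub-system, and short socket timeouts make a struggling cache fail fast so the handler can fall back instead of running until Lambda times out.

```python
import os

import redis  # redis-py, assumed to be packaged with the function

# Created once per execution environment so warm invocations reuse the
# same connection instead of adding connection churn to the cache.
# Short timeouts make a slow or overrun cache fail fast.
cache = redis.Redis(
    host=os.environ.get("REDIS_HOST", "my-cache.example.internal"),  # hypothetical
    port=6379,
    socket_connect_timeout=0.5,
    socket_timeout=0.5,
)


def handler(event, context):
    try:
        value = cache.get("some-key")
    except redis.RedisError:
        # Fall back to the source of truth, or fail the message so it can
        # be retried, rather than losing the event to a Lambda timeout.
        value = None
    return {"cache_hit": value is not None}
```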

Lambda Duration is a key metric to monitor, and anomaly detection on this metric is awesome!

When problems occur at your scale boundaries, Lambda duration typically spikes up. I monitor Lambda duration in CloudWatch dashboards and activate anomaly detection on this metric. Duration is a key metric to monitor since it reflects when heightened latency affects your system, which generally means a scalability clash is occurring.

API latency, the time to execute an API call, is the biggest enemy of extreme-scale solutions like AWS Lambda. You need to understand how to control this aspect and monitor for problems.
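
As a sketch of the alarm described above (the function name is a placeholder, and the band width, statistic and evaluation periods are assumptions you would tune), a CloudWatch anomaly detection alarm on Lambda Duration looks roughly like this:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

FUNCTION_NAME = "my-function"  # placeholder function name

# Alarm when p95 Duration breaks out above the anomaly detection band
# (2 standard deviations wide) for 3 consecutive 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName=f"{FUNCTION_NAME}-duration-anomaly",
    ComparisonOperator="GreaterThanUpperThreshold",
    EvaluationPeriods=3,
    ThresholdMetricId="band",
    TreatMissingData="notBreaching",
    Metrics=[
        {
            "Id": "duration",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": "Duration",
                    "Dimensions": [
                        {"Name": "FunctionName", "Value": FUNCTION_NAME}
                    ],
                },
                "Period": 300,
                "Stat": "p95",
            },
        },
        {
            "Id": "band",
            "Expression": "ANOMALY_DETECTION_BAND(duration, 2)",
        },
    ],
)
```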

Resolving Scalability Clashes

In AWS, controlling scalability boundaries comes down to flow control. This means using a buffered event mechanism such as an SQS queue or a Kinesis stream. These two managed services create a persistence layer, storing events until they can be processed and smoothing out any spikes in your traffic.
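
For example, a buffer queue with a dead-letter queue attached can be created as below; the queue names, visibility timeout and receive count are illustrative assumptions.

```python
import json

import boto3

sqs = boto3.client("sqs")

# Hypothetical buffer queue plus DLQ so events that repeatedly fail are
# retained for inspection and redrive rather than lost.
dlq_url = sqs.create_queue(QueueName="orders-buffer-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

sqs.create_queue(
    QueueName="orders-buffer",
    Attributes={
        # AWS guidance: roughly 6x the consumer Lambda's timeout.
        "VisibilityTimeout": "180",
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        ),
    },
)
```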

You control the flow of events by setting the maximum concurrency on the Event Source Mapping of your Lambda trigger, which ensures Lambda concurrency does not climb too high. Where an Event Source Mapping trigger is not available, you can set a specific reserved concurrency value on your Lambda functions, which causes invocations to be throttled once the configured limit is reached. This means you need to look at how the trigger source behaves when Lambda throttles its invocations.
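
Where reserved concurrency is the only lever available, a cap like the one below (hypothetical function name and limit) keeps throughput in check, with the trade-off that the trigger source or the caller must cope with throttled invocations:

```python
import boto3

lambda_client = boto3.client("lambda")

# Cap the function at 10 concurrent executions so the downstream
# integration's limits are never exceeded. Invocations beyond this are
# throttled and must be retried by the trigger source or the client.
lambda_client.put_function_concurrency(
    FunctionName="my-consumer-function",
    ReservedConcurrentExecutions=10,
)
```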

It comes down to knowing how your services work at scale, and how you can take control of that and smooth out transaction spikes when they reach dangerous levels. Learn about the services you are working with, run them at scale and understand how they fail. It is the only way to build a fault-tolerant system, which is the best kind to build in the cloud with Serverless services.