Serverless is all about using managed services and not worrying about how and where your code runs, which simplifies a whole load of problems. Managed services come with many configuration options, and those options are the key to controlling how the services behave, so knowing how they work is super-important. Fortunately, you don't need to understand ALL the options, but you do need to explore the main configuration options or "dials" to get your solution off to the best start! Often we get so wrapped up in solving the problem that we run off and start building. When you do this, everything works perfectly in development, but as soon as you go live and volumes grow, problems start appearing. Often, this is due to misconfiguration or a lack of understanding of how the service behaves.
Designing your system for zero failure is impossible in the cloud, so you need to change your focus. You need to acknowledge that you have no control over the availability or reliability of the services you use to build your solution. Instead, you need to architect for recovery. Designing for failure and architecting for recovery are fundamental design principles you need to embrace when building serverless systems. For each service you want to use, test a range of use cases to learn how it works, how it fails, and how to make sure your solution still succeeds. Once you get comfortable with this concept, you are well on your way to success in the cloud.
Know your services, know how they will fail, and know how to recover!
The standard queue in Amazon Simple Queue Service (SQS) is a good example. It is a good event-driven architecture component for many use cases. To build with SQS, you need to know how to configure the Visibility Timeout, which determines how long a message stays hidden from other pollers after it has been delivered to a receiver (e.g. AWS Lambda). If the receiver has not deleted the message by the time the timeout expires, the message becomes available for processing a second or subsequent time. If you set this value too low, the same message will be processed multiple times, which may cause problems if your processing code or downstream consumers are not idempotent (able to process duplicate messages without causing additional side effects). The following diagram shows how this timer works in the lifecycle of an SQS message with SQS pollers that read from the queue.
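The effect of the timer can also be sketched in a few lines of Python. This is a toy, in-memory model and not the SQS API: `SimulatedQueue`, `count_deliveries`, and `IdempotentHandler` are hypothetical names invented for illustration, time advances in whole-second ticks, and each consumer is assumed to delete the message only once its processing has finished.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    message_id: str
    body: str
    invisible_until: int = 0  # tick before which pollers cannot see the message
    deleted: bool = False


class SimulatedQueue:
    """Toy in-memory model of the visibility timer; not the SQS API."""

    def __init__(self, visibility_timeout: int):
        self.visibility_timeout = visibility_timeout
        self.messages = []

    def send(self, message_id: str, body: str) -> None:
        self.messages.append(Message(message_id, body))

    def receive(self, now: int) -> Optional[Message]:
        """Hand out the first visible message and restart its visibility timer."""
        for msg in self.messages:
            if not msg.deleted and now >= msg.invisible_until:
                msg.invisible_until = now + self.visibility_timeout
                return msg
        return None

    def delete(self, msg: Message) -> None:
        msg.deleted = True


def count_deliveries(visibility_timeout: int, processing_time: int, horizon: int = 60) -> int:
    """Count how often one message is delivered when every consumer needs
    `processing_time` ticks and deletes the message only after finishing."""
    queue = SimulatedQueue(visibility_timeout)
    queue.send("m-1", "order created")
    in_flight = []  # (finish_tick, message) for each delivery still being processed
    deliveries = 0
    for now in range(horizon + 1):
        # Consumers that have finished delete the message from the queue.
        for finish_tick, msg in in_flight:
            if finish_tick <= now:
                queue.delete(msg)
        in_flight = [(t, m) for t, m in in_flight if t > now]
        # A poller receives the message again if its timer has expired.
        msg = queue.receive(now)
        if msg is not None:
            deliveries += 1
            in_flight.append((now + processing_time, msg))
    return deliveries


class IdempotentHandler:
    """Applies a side effect at most once per message ID (illustrative only)."""

    def __init__(self):
        self.seen_ids = set()
        self.applied = []

    def handle(self, msg: Message) -> bool:
        if msg.message_id in self.seen_ids:
            return False  # duplicate delivery: skip the side effect
        self.seen_ids.add(msg.message_id)
        self.applied.append(msg.body)
        return True


if __name__ == "__main__":
    print(count_deliveries(visibility_timeout=5, processing_time=12))   # 3
    print(count_deliveries(visibility_timeout=15, processing_time=12))  # 1
```

With a 5-second timeout and 12 seconds of processing, the message is handed out at ticks 0, 5, and 10 before the first consumer deletes it. With a 15-second timeout it is delivered once. The in-memory `seen_ids` set is only a stand-in; a real receiver would presumably need a durable store to remember which messages it has already handled.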
The correct value for the Visibility Timeout depends on the processing time you expect for the message receiver: set it too low and your messages will be processed more than once. If the receiver interacts with a downstream service, you also need to take the latency of that destination into consideration. Latency is the most important metric to understand when building serverless solutions, because it is the one that causes unexpected problems. Typically you set the value to at least the maximum processing time for your message.
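As a rough illustration of that sizing rule, the sketch below adds the receiver's own worst-case processing time to the worst-case latency of its downstream calls, then applies a safety margin. The function name `suggest_visibility_timeout`, its parameters, and the 1.5 safety factor are assumptions made for this example, not an AWS formula. The 30-second default and 12-hour maximum are the documented SQS limits.

```python
import math

SQS_DEFAULT_VISIBILITY_TIMEOUT = 30        # seconds, the SQS default
SQS_MAX_VISIBILITY_TIMEOUT = 12 * 60 * 60  # seconds, the SQS maximum (12 hours)


def suggest_visibility_timeout(
    max_processing_seconds: float,
    downstream_latency_seconds: float = 0.0,
    downstream_calls: int = 1,
    safety_factor: float = 1.5,  # assumed margin; tune it from your own measurements
) -> int:
    """Suggest a visibility timeout in whole seconds, capped at the SQS maximum."""
    if max_processing_seconds < 0 or downstream_latency_seconds < 0 or downstream_calls < 0:
        raise ValueError("durations and call counts must be non-negative")
    if safety_factor < 1:
        raise ValueError("safety_factor must be at least 1")
    # Worst case: the receiver's own work plus every downstream call at its slowest.
    worst_case = max_processing_seconds + downstream_calls * downstream_latency_seconds
    suggested = math.ceil(worst_case * safety_factor)
    return min(suggested, SQS_MAX_VISIBILITY_TIMEOUT)


if __name__ == "__main__":
    # 4 s of local work plus two downstream calls of up to 3 s each.
    print(suggest_visibility_timeout(4, 3, downstream_calls=2))  # 15
```

The result would then be applied through the queue's `VisibilityTimeout` attribute. When the receiver is a Lambda function, the AWS Lambda documentation recommends a visibility timeout of at least six times the function timeout, which may well be larger than the figure this sketch produces.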
With serverless you trade infrastructure and operational complexity for service configuration. From this example you can see how important it is to understand how to configure the services you use, and why you need to know your services!