In the AWS Serverless world, we are often constrained by managed service limits. A common limit causing friction is the API Gateway integration timeout, which is 29 seconds. Any process you trigger through an API Gateway route must complete before this time expires, or API Gateway will return an HTTP 504 Gateway Timeout error indicating that the backend integration timed out. The same constraint applies to routes triggered via the API Gateway V2 Web Socket service. What makes this timeout more confusing is that the back-end integration has already started and will continue processing until it finishes. If the client re-launches the task, thinking it simply terminated prematurely, you will end up with multiple long-running tasks being executed for no reason!
There are a variety of strategies you can use to resolve this problem. The solutions revolve around decoupling task submission from task completion, making the client/server interactions asynchronous. One option is to set up a database and a polling API: the client submits a task using one API and then polls a second API for task completion. This can be seen as a simple solution, but it is quite wasteful of compute resources and can be complicated to implement and get exactly right.
This is where web sockets can come in handy. We can use a web socket as a bi-directional communication channel between our client application and backend processes, and we end up with an interaction like the following high-level diagram.
With web sockets in the picture, we can set up an end-to-end event-driven loop from client to server and back to client. This solution can be very compute efficient and does not require the client to maintain a loop and continually poll for the resolution of the long-running task.
Did you know you don't need a DynamoDB table to use Web Sockets? Nearly every Web Socket article I have seen includes a DynamoDB table or other data store. In the use case we are covering here, there is no need for one. You only need a data store when you want to broadcast notifications from your backend service to collections of connected web sockets (we will cover this use case in another article).
How does it all work?
- The Client sends a Start Task message to the Web Socket
- The Web Socket API sends a Start Task message to the Long Running Task, and when this has been done, it immediately acknowledges the message from the client.
- The Client does not wait for the Long Running Task to complete - it waits long enough for the Web Socket infrastructure to acknowledge the message as successful. It can happily do something else, knowing it will be notified when the task is done.
- The running of the Long Running Task is a "Set and Forget" action for the client and the Web Socket API.
- The Long Running task will execute, and when it is finished, it will send a notification back to the Client who started it.
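From the client's point of view, the steps above can be sketched as a small tracker that sends the Start Task message and matches the later notification to the original request by task id. This is only a sketch: `PendingTasks`, the injected `send` function, and the `"task-submit"` action value are assumptions made for illustration, not part of the construct.

```typescript
// Hypothetical client-side helper; names and the "task-submit" action value are assumptions.
type TaskResult = { task_id: string; status: "Success" | "Fail"; response?: unknown };

class PendingTasks {
  private waiting = new Map<string, (result: TaskResult) => void>();
  private send: (message: string) => void;

  // `send` is whatever writes a string to the open web socket, e.g. (m) => socket.send(m).
  constructor(send: (message: string) => void) {
    this.send = send;
  }

  // Submit a task; the returned promise resolves when the notification arrives.
  start(id: string, type: string, data: unknown): Promise<TaskResult> {
    return new Promise((resolve) => {
      this.waiting.set(id, resolve);
      this.send(JSON.stringify({ action: "task-submit", task: { id, type, data } }));
    });
  }

  // Call from the socket's message handler; returns true if a pending task was matched.
  handleMessage(raw: string): boolean {
    const msg = JSON.parse(raw) as Partial<TaskResult>;
    const resolve = msg.task_id ? this.waiting.get(msg.task_id) : undefined;
    if (!resolve) return false;
    this.waiting.delete(msg.task_id as string);
    resolve(msg as TaskResult);
    return true;
  }

  get pendingCount(): number {
    return this.waiting.size;
  }
}
```

Because the client never blocks on the task itself, it can keep several tasks in flight and resolve them in whatever order the notifications arrive.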
The Detailed Solution Design
The following diagram shows the detailed flow of service interactions for executing a long-running task and being notified of its outcome via an API Gateway web socket interface. I have colour-coded the arrows for the services' Asynchronous (blue) and Synchronous (green) interactions. Notice there is no data store required for this use case (DynamoDB is not required).
The solution revolves around Amazon EventBridge as the event routing mechanism, which ensures the web socket API request executes the correct task. I like EventBridge for this purpose, but you could also use SNS or SQS to launch asynchronous tasks, or invoke a Lambda directly using the SDK's asynchronous invoke. The key driver in the architecture is the dotted BYO Long Running Task box, where you plug in your long-running Lambda task. The result of each long-running task is forwarded using Lambda Destinations to the task-notify queue, which enables the result to be sent back to the initiating WebSocket connection. The design above shows a single Lambda function, but since we use EventBridge as the driver, you can configure as many long-running tasks as you need.
Building the Solution
I have created a CDK construct called SocketTasks that creates the entire solution shown in the architecture diagram. The construct accepts a list of tasks, each with one or more task types. The type is defined as an array, allowing multiple lambda tasks to be created and managed. Each task "type" definition creates an EventBridge rule to trigger the task (or tasks).
Using the above properties, you can submit an array of tasks to be executed using the task properties:
- type: defines EventBridge rules matching one or more values in the detail-type event attribute. Adding multiple strings into the type array will configure a multi-match rule in EventBridge, giving your task more than a single trigger.
- func: refers to an instance of the AWS CDK Lambda function construct.
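To make the multi-match behaviour of the type array concrete, here is a minimal sketch of the kind of EventBridge event pattern such a rule would use. An array under `detail-type` in an EventBridge pattern matches when the event's value equals any listed entry. The helper names `buildTaskPattern` and `matchesTask` are hypothetical, and the construct's real rule may include further fields.

```typescript
// Build an EventBridge event pattern for one task definition.
// An array under "detail-type" matches when the event's detail-type equals ANY listed value.
function buildTaskPattern(types: string[]): { "detail-type": string[] } {
  if (types.length === 0) throw new Error("a task needs at least one type");
  return { "detail-type": types.slice() };
}

// Local check that mirrors that OR semantics (exact string match only).
function matchesTask(pattern: { "detail-type": string[] }, detailType: string): boolean {
  return pattern["detail-type"].some((t) => t === detailType);
}
```

A task declared with two types would therefore be triggered by a message carrying either one.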
The Lambda function you provide must accept an EventBridge event, which carries the task data in the following EventBridge message fields:
- detail-type: Defines the type of task triggered by the web socket message.
- source: This field contains both the submitted task-id from the original message and the connection-id of the web socket connection submitting the task. These two fields are used in sending the response to the originating caller. The task-id value is required to provide the context of the task response being sent via the web socket response so the caller can match the response to the original request.
- detail: The detail field contains the input payload for the submitted task.
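As an illustration of the fields above, a long-running task handler might look like the following minimal sketch. The `ReportInput` shape and the `runReport` function are hypothetical stand-ins for your own task, and the `statusCode`/`body` return shape is assumed from the response format described later in this article.

```typescript
// Minimal shape of the EventBridge event fields described above (other fields omitted).
interface TaskEvent<T> {
  "detail-type": string;
  source: string;
  detail: T;
}

// Hypothetical task input; your own task defines its own `detail` shape.
interface ReportInput {
  year: number;
}

// Stand-in for the real long-running work.
function runReport(input: ReportInput): { year: number; message: string } {
  return { year: input.year, message: "report generated" };
}

// The returned value is what Lambda Destinations forwards as the task result.
async function handler(
  event: TaskEvent<ReportInput>,
): Promise<{ statusCode: number; body: string }> {
  const result = runReport(event.detail);
  return { statusCode: 200, body: JSON.stringify(result) };
}
```

Note that the handler never touches the connection id; it only reads its input from `detail` and returns a result.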
Socket Task Construct Components
The SocketTasks construct creates all the services shown in the architecture diagram above. It adds the Lambda destination configuration to each task's provided Lambda function so that success or failure messages are sent to the SQS queue. You will notice there is no database required in this particular design. The connectionId of the WebSocket submitting the task is passed through the components, enabling the task-notify handler to send the response to the correct web socket.
The Web Socket API is the entry point where the route is defined, which passes the client task data to the task-submit lambda.
The task-submit Lambda packages the task data into a message sent to the Task Bus, triggering the actual Lambda function (or functions) to run your task(s).
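The packaging step can be sketched as a pure function that maps a submitted task onto an EventBridge PutEvents entry. The entry field names follow the PutEvents API, but everything else here is an assumption: in particular, the `taskId|connectionId` encoding of the source field is made up for illustration and is not the construct's actual format.

```typescript
// Shape of one EventBridge PutEvents entry (field names follow the PutEvents API).
interface PutEventsEntry {
  EventBusName: string;
  Source: string;
  DetailType: string;
  Detail: string; // must be a JSON object serialised as a string
}

interface SubmittedTask {
  id: string;
  type: string;
  data?: Record<string, unknown>;
}

// ASSUMPTION: "taskId|connectionId" is an illustrative encoding only.
function encodeSource(taskId: string, connectionId: string): string {
  return `${taskId}|${connectionId}`;
}

function decodeSource(source: string): { taskId: string; connectionId: string } {
  // Split on the LAST "|" because the client-assigned task id may itself contain one.
  const cut = source.lastIndexOf("|");
  if (cut < 0) throw new Error(`unexpected source: ${source}`);
  return { taskId: source.slice(0, cut), connectionId: source.slice(cut + 1) };
}

function toPutEventsEntry(
  busName: string,
  connectionId: string,
  task: SubmittedTask,
): PutEventsEntry {
  return {
    EventBusName: busName,
    Source: encodeSource(task.id, connectionId),
    DetailType: task.type, // matched by the EventBridge rule for the task
    Detail: JSON.stringify(task.data ?? {}),
  };
}
```

Carrying the task id and connection id inside the event itself is what lets the design avoid a data store: every downstream component receives the routing information with the message.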
The Task Bus is an EventBridge event bus with rules defined to trigger one or more long-running tasks. The construct enables passing in your own EventBus instance so you can incorporate the design into an existing solution.
BYO Long Running Task (or Tasks)
The long-running task is started by an event bus message being sent to it. The Lambda is invoked asynchronously by the EventBridge service, and the success or failure of the task is forwarded via Lambda Destinations to the task-notify queue. The only restrictions on the Lambda function you specify are that it can be invoked by an EventBridge message and that it does not already have a Lambda Destinations configuration.
The lambda function should return the response you expect to be returned to the initiator. The function provided to the L3 construct does not need to have Lambda Destinations configured - the construct does this for you (construct magic). You also do not need to worry about the connectionId detail to return the task response to the task initiator - this is all handled transparently by the construct components.
The benefit of using Lambda Destinations for the asynchronous Lambda invocation by EventBridge is that the actual context of the Lambda failure is included in the message payload submitted to the SQS queue. The following is an example of a failed task and the message submitted into the SQS queue for the notification.
The key section is the requestContext, which contains information about what happened. The requestPayload contains the input message, and the responsePayload contains the actual response from the Lambda function. This is all great information for troubleshooting (in case of errors).
In the following example, you will notice the task resulted in an unhandled error, and the requestContext shows its condition as RetriesExhausted, indicating that the task failed multiple times and further retries were not possible (this is something I love about Lambda Destinations).
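For reference, the record that Lambda Destinations delivers for an asynchronous invocation can be described with a type like the one below. The field names follow the documented invocation record format. The `describeOutcome` helper is a hypothetical addition that shows how the `condition` and `approximateInvokeCount` values can be read.

```typescript
// Fields of an asynchronous-invocation record delivered by Lambda Destinations.
interface DestinationRecord {
  version: string;
  timestamp: string;
  requestContext: {
    requestId: string;
    functionArn: string;
    condition: string; // e.g. "Success" or "RetriesExhausted"
    approximateInvokeCount: number;
  };
  requestPayload: unknown; // the event the function was invoked with
  responseContext?: {
    statusCode: number;
    executedVersion: string;
    functionError?: string;
  };
  responsePayload?: unknown; // the function's return value or its error details
}

// Hypothetical helper: turn a record into a one-line summary for logs.
function describeOutcome(record: DestinationRecord): string {
  const { condition, approximateInvokeCount } = record.requestContext;
  if (condition === "Success") return "task succeeded";
  const error = record.responseContext?.functionError ?? "unknown error";
  return `task failed (${condition}) after ${approximateInvokeCount} attempt(s): ${error}`;
}
```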
The task-notify queue receives the success or fail message from the long-running task and triggers the task-notify lambda. The queue is the destination for both successful and failed tasks. The notifier will send the result to the initiator via the web socket channel. The construct enables you to pass in your own SQS instance in case you have existing security policies or constraints that need to be incorporated.
The task-notify Lambda is responsible for taking the messages from the SQS trigger and processing the Lambda Destinations message to craft a response for the client that requested the task via the web socket gateway. The message is sent to the listening client over the web socket connection.
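A rough sketch of that notify step is shown below. It is not the construct's implementation. The `resolveTarget` and `post` callbacks are hypothetical and injected so the logic stays testable; in a deployed Lambda, `post` would typically wrap `PostToConnectionCommand` from `@aws-sdk/client-apigatewaymanagementapi`. The notification shape follows the response message described later in this article.

```typescript
// Only the fields this sketch reads from the Lambda Destinations record.
interface TaskOutcome {
  requestContext: { condition: string };
  requestPayload: { source: string };
  responsePayload?: unknown;
}

interface TaskNotice {
  task_id: string;
  status: "Success" | "Fail";
  response: unknown;
}

// ASSUMPTION: both callbacks are hypothetical and supplied by the caller.
type ResolveTarget = (source: string) => { taskId: string; connectionId: string };
type PostToSocket = (connectionId: string, data: string) => Promise<void>;

function buildNotice(taskId: string, outcome: TaskOutcome): TaskNotice {
  const status = outcome.requestContext.condition === "Success" ? "Success" : "Fail";
  return { task_id: taskId, status, response: outcome.responsePayload ?? null };
}

// Process SQS records one by one; returns the number of notifications sent.
async function notifyClients(
  records: { body: string }[],
  resolveTarget: ResolveTarget,
  post: PostToSocket,
): Promise<number> {
  let sent = 0;
  for (const record of records) {
    const outcome = JSON.parse(record.body) as TaskOutcome;
    const { taskId, connectionId } = resolveTarget(outcome.requestPayload.source);
    await post(connectionId, JSON.stringify(buildNotice(taskId, outcome)));
    sent += 1;
  }
  return sent;
}
```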
Key Design Decisions
Several key decisions are needed to make the process described in the previous section work. One of the most important is the message format between the client and backend services. In the Web Socket Primer article, I provided links to common published Web Socket Sub-Protocols. These Sub-Protocols describe message formats and what they mean. For your web socket implementation to be successful, create a message design so everyone knows how to talk and what they are doing!
Here is the request message the client sends to the web socket to start a long-running task.
- action: Routes the message to the task-submit lambda function for processing. This is essential for WebSocket API routing to the specific handler.
- task: Object defining the task.
  - id: Client-assigned identifier for the task being requested. This is so the client can keep track of requests and match up the responses when they return. Web socket communications are asynchronous and tasks will take different times to run, so the responses will not come back in any particular order.
  - type: This is the key to identifying the long-running task to start.
  - data: This is a custom data object defining input data for the long-running task to use.
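One way to pin this format down is to give it a type and validate incoming messages before they go any further. The sketch below assumes that `id` and `type` are non-empty strings and that `id`, `type`, and `data` are nested inside `task`; the names `TaskRequest` and `parseTaskRequest` are hypothetical.

```typescript
interface TaskRequest {
  action: string;
  task: { id: string; type: string; data?: unknown };
}

// Parse and validate a raw web socket message; throws on anything malformed.
function parseTaskRequest(raw: string): TaskRequest {
  const msg: unknown = JSON.parse(raw);
  if (typeof msg !== "object" || msg === null) throw new Error("message must be a JSON object");
  const { action, task } = msg as { action?: unknown; task?: unknown };
  if (typeof action !== "string") throw new Error("missing action");
  if (typeof task !== "object" || task === null) throw new Error("missing task");
  const { id, type, data } = task as { id?: unknown; type?: unknown; data?: unknown };
  if (typeof id !== "string" || id === "") throw new Error("task.id must be a non-empty string");
  if (typeof type !== "string" || type === "") throw new Error("task.type must be a non-empty string");
  return { action, task: { id, type, data } };
}
```

Rejecting malformed messages at the entry point keeps bad input from ever reaching the event bus.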
Here is the response message sent from the server to the client.
- task_id: The value the client sends in the request.
- status: Either Success or Fail, indicating if the task succeeded.
- response: Data sent as a part of the Lambda destination handling from the long-running task lambda.
  - statusCode: Contains an HTTP statusCode for tasks.
  - body: Contains a JSON object in an escaped string form. The string can be turned into a JSON object using JSON parsing functions in many popular programming languages. The body is the actual Lambda response received in the lambda destinations message.
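On the client, decoding this response might look like the following sketch. It assumes that `statusCode` and `body` are nested inside `response`, and it falls back to the raw value when `body` is not valid JSON; `TaskNotification` and `decodeNotification` are hypothetical names.

```typescript
interface TaskNotification {
  task_id: string;
  status: "Success" | "Fail";
  response?: { statusCode?: number; body?: string };
}

// Decode a notification and unwrap the escaped JSON string held in `body`.
function decodeNotification(raw: string): { taskId: string; ok: boolean; result: unknown } {
  const msg = JSON.parse(raw) as TaskNotification;
  const body = msg.response?.body;
  let result: unknown = msg.response;
  if (typeof body === "string") {
    try {
      result = JSON.parse(body); // second parse: body is JSON inside a JSON string
    } catch {
      result = body; // not valid JSON, hand back the raw string
    }
  }
  return { taskId: msg.task_id, ok: msg.status === "Success", result };
}
```

The two parse steps are the important detail: the outer message is JSON, and `body` is a JSON document serialised as a string inside it.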
In this article, I introduced a use case for WebSockets that does not use a DynamoDB table for sending notifications. I also introduced a method for executing long-running tasks asynchronously using AWS Lambda and provided a custom CDK construct to leverage this solution in your cloud. This solution requires a connected WebSocket environment for executing tasks, so your client needs to be able to keep the connection open for the time it takes to execute.
Another key design process you must consider with WebSockets is the actual communication protocol between client and server components. This needs to be defined, or you may end up with a big ball of mud.
Check out Serverless DNA Constructs on GitHub. I will add more constructs over time for different use cases to help accelerate your projects. I welcome feedback, so do create issues and discussions to help drive what gets developed next!