Streaming AI
Generative AI models generally expose APIs to interact with them. Using these APIs, we can build AI-enabled applications. Once you start building these applications, you will notice that API response times are quite long, around 10-30 seconds, and sometimes much higher for complex applications. Luckily, some AI models support streaming. By leveraging this feature, you can send data incrementally from the APIs rather than waiting for the entire response. When building serverless applications on AWS with generative AI capabilities, we can use AWS Lambda's response streaming functionality to build snappy, responsive applications.
Introduction to Lambda Response Streaming
AWS Lambda, one of the core serverless offerings from AWS, is a powerful compute engine with a variety of use cases. In April 2023, AWS announced a new functionality called response streaming. What is it? Instead of waiting for the Lambda to complete execution before receiving the response data, you can now send data to the client as soon as it becomes available, giving your users a much better experience. Check out the announcement page here. Here is an example of the difference response streaming can make for AI applications.
Generative AI and Lambda Response Streaming
From a programming language perspective, Python dominates this space, with libraries like LangChain making it easier to build AI-based applications. Naturally, building AI applications on AWS is a common path, and with Lambda being the core of serverless, we will definitely use it. But this is where it gets tricky: although Python is one of the standard runtimes supported by AWS Lambda, response streaming is not directly supported on the Python runtime. Lambda response streaming is supported on Node.js runtimes and custom runtimes. So, we need to build a custom runtime to enable response streaming in a Python environment.
Lambda Custom Runtime
Before creating an effective custom runtime environment, there are some concepts you should familiarise yourself with. When you trigger a Lambda, it sends an event to the Lambda Runtime API. There are two main endpoints: one for polling the next event and one for sending the response for the invoked event. A bootstrap file controls the calls to these endpoints; it is the interface between the Lambda service and the Runtime API. Here's a detailed explanation of what happens inside a Lambda invocation:
What happens inside an AWS Lambda
If you are new to building a custom runtime environment, check out this blog here.
Building a custom runtime on AWS Lambda
The documentation around using a custom runtime environment is sparse. Here's the official AWS documentation link for implementing response streaming. The key aspect is to change the way the Lambda response API is called: the API call needs certain headers to let Lambda know to stream the data. After that, you need a handler file that can send the response data in chunks.
The custom runtime allows you to control the bootstrap file, which is used to connect with the Lambda Runtime API and modify the required parameters.
Most languages have generator functions, which can be used to send data in chunks rather than waiting for the entire payload. In Python, the return statement sends the complete data, whereas yield is used in generator functions to send data in chunks.
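A tiny illustration of the difference, in plain Python:

```python
def whole_response():
    # return hands back the complete string only after it is fully built
    return "".join(["chunk-1 ", "chunk-2 ", "chunk-3"])

def streamed_response():
    # yield hands each chunk to the caller as soon as it is ready
    for part in ["chunk-1 ", "chunk-2 ", "chunk-3"]:
        yield part

print(whole_response())              # delivered in one shot
print("".join(streamed_response()))  # same text, delivered chunk by chunk
```

A caller iterating over `streamed_response()` can act on each chunk immediately, which is exactly the behaviour the streaming handler below relies on.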
Streaming Bedrock on Python runtime
In this blog, we will walk through the steps and configurations required to create a response streaming Lambda that streams data from Amazon Bedrock. We will use Bedrock models that support streaming responses; the sample handler covers both meta.llama2-13b-chat-v1 and cohere.command-text-v14. Before getting started, check out this blog on Amazon Bedrock.
Getting started with Amazon Bedrock
For impatient readers, here’s the repo link to clone and try it out https://github.com/antstackio/a2lrs
Now, let's get into the code. There are 4 files required to complete this functionality:
- bootstrap
- Dockerfile
- lambda_function.py
- requirements.txt
1. bootstrap
This file gives you control over the Runtime API. This is where we will change the parameters of the Lambda response API, thereby allowing streaming to happen.
#!/usr/bin/env python3.11
import os, sys, requests, json

# Get Lambda environment variables
AWS_LAMBDA_RUNTIME_API = os.getenv("AWS_LAMBDA_RUNTIME_API")
LAMBDA_TASK_ROOT = os.getenv("LAMBDA_TASK_ROOT")

# Set path to import your handler function
sys.path.append(LAMBDA_TASK_ROOT)
from lambda_function import handler

# API URL templates and headers
INVOCATION_URL = f"http://{AWS_LAMBDA_RUNTIME_API}/2018-06-01/runtime/invocation/next"
RESPONSE_URL_TEMPLATE = "http://{}/2018-06-01/runtime/invocation/{}/response"
HEADERS = {
    "Lambda-Runtime-Function-Response-Mode": "streaming",
    "Transfer-Encoding": "chunked",
}

# Run a loop to get the Lambda invocation events
while True:
    # When a Lambda invocation event is received, fetch the event
    response = requests.get(INVOCATION_URL, stream=True)
    # Extract event data
    event_data = json.loads(response.text)
    # Extract invocation id
    invocation_id = response.headers.get("Lambda-Runtime-Aws-Request-Id")
    # Create response URL
    response_url = RESPONSE_URL_TEMPLATE.format(AWS_LAMBDA_RUNTIME_API, invocation_id)
    # Post the response
    requests.post(response_url, headers=HEADERS, data=handler(event_data, None))
What's happening in this file? The bootstrap file continuously polls for the next AWS Lambda invocation and takes action when a new event is received. We need to extract the event data and the invocation id. ICYMI, the invocation id is the unique identifier for every Lambda execution. We then format the response URL to include the unique invocation id, which makes sure the response is sent to the right invocation. Finally, we post the response to the formatted response URL. In the sample shown, we call the handler function directly. The handler function is designed to send data in chunks, and since the headers of the response request mark it as streaming, the data will be streamed for the Lambda invocation.
2. Dockerfile
This file allows you to create the custom image required for the runtime. The code for the file can be found here.
The Dockerfile is straightforward. We install the ssl module and Python 3.11, and move the bootstrap file and the Lambda handler file to their respective locations. To know more about what exactly happens inside an AWS Lambda, check out the blog here.
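As a rough sketch of that shape (the base image, package names, and install steps below are assumptions for illustration; the actual Dockerfile is in the linked repo):

```dockerfile
# Sketch only: base image and package steps are illustrative,
# not the repo's actual Dockerfile.
FROM public.ecr.aws/lambda/provided:al2

# Install ssl libraries and Python 3.11 (the real file may build
# Python from source or pull it from a package repository)
RUN yum install -y openssl openssl-devel python3.11

# Install handler dependencies into the task root
COPY requirements.txt /var/task/
RUN python3.11 -m pip install -r /var/task/requirements.txt -t /var/task

# Handler goes to the task root, bootstrap to the runtime directory
COPY lambda_function.py /var/task/
COPY bootstrap /var/runtime/bootstrap
RUN chmod +x /var/runtime/bootstrap
```

The essential idea is simply that the bootstrap ends up at the location Lambda executes for custom runtimes, and the handler is importable from the task root.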
3. lambda_function.py and requirements.txt
The requirements file lists the libraries needed by the Lambda handler. To access the Bedrock APIs, the Lambda needs the latest boto3 library. The Lambda handler file is explained below.
First, we import the required packages and initialise the Bedrock client:
import json, boto3
client = boto3.client("bedrock-runtime")
Then, inside the handler function, we call the Bedrock API:
response_stream = client.invoke_model_with_response_stream(
    body=json.dumps({"prompt": event["body"], "max_gen_len": 512}),
    modelId="meta.llama2-13b-chat-v1",
    accept="application/json",
    contentType="application/json",
)
In our example, we are using the meta.llama2-13b-chat-v1 model as it supports streaming. For other supported models, you can log in to the AWS Console and use the playground in the Bedrock console to see the schemas, or refer to the detailed documentation here.
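Since each model family expects its own request schema, a small helper can make that explicit. This is a sketch: `build_body` is a hypothetical helper, and the field names follow the invoke call above and the handler's cohere branch; verify them against the Bedrock documentation for your model.

```python
import json

def build_body(model_id: str, prompt: str) -> str:
    # Hypothetical helper: return a JSON request body matching the
    # schema of the given Bedrock model family.
    if model_id.startswith("meta.llama2"):
        body = {"prompt": prompt, "max_gen_len": 512}
    elif model_id.startswith("cohere.command"):
        body = {"prompt": prompt, "max_tokens": 512}
    else:
        raise ValueError(f"No request schema known for {model_id}")
    return json.dumps(body)
```

The returned string can be passed directly as the `body` argument of `invoke_model_with_response_stream`.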
After the Bedrock API call is made, we need to capture the response stream from the API. That is done as follows:
status_code = response_stream["ResponseMetadata"]["HTTPStatusCode"]
if status_code != 200:
    raise ValueError(f"Error invoking Bedrock API: {status_code}")

# llmType holds the modelId passed to the invoke call above
llmType = "meta.llama2-13b-chat-v1"
for response in response_stream["body"]:
    json_response = json.loads(response["chunk"]["bytes"])
    if llmType == "meta.llama2-13b-chat-v1":
        yield json_response["generation"].encode()
    elif llmType == "cohere.command-text-v14":
        if "text" in json_response and json_response["text"] != "<EOS_TOKEN>":
            yield json_response["text"].encode()
Here, we check for a valid status code, then load the chunks. The parsing is specific to each model's response schema and will vary for other LLMs. As we receive the response stream, we encode each string chunk to bytes and yield the result. With this implementation, the Lambda handler function will continuously respond with the incoming Bedrock stream.
Deployment
We are going to use the SAM CLI for deployment. The SAM CLI takes care of packaging the image, creating the ECR repository required for the Docker image, and easing the entire deployment process.
CustomImageLambda:
  Type: AWS::Serverless::Function
  Metadata:
    Dockerfile: Dockerfile
    DockerContext: ./custom-runtime
  Properties:
    PackageType: Image
    Timeout: 900
    MemorySize: 1024
    Policies:
      - AdministratorAccess
    FunctionUrlConfig:
      AuthType: NONE
      InvokeMode: RESPONSE_STREAM
CustomImageUrlPublicAccess:
  Type: AWS::Lambda::Permission
  Properties:
    FunctionName:
      Ref: CustomImageLambda
    FunctionUrlAuthType: NONE
    Action: lambda:InvokeFunctionUrl
    Principal: "*"
Using SAM, we create a Lambda function, a Lambda function URL with response streaming enabled, and a public access permission for the Lambda. There are two main resources used for the custom runtime: the Lambda function (AWS::Serverless::Function) and the public access permission (AWS::Lambda::Permission). To use AWS::Serverless::Function, you need to add the Serverless transform at the beginning of the CloudFormation template. If for any reason you don't wish to use the Serverless transform, you need to split the AWS::Serverless::Function into AWS::Lambda::Function and AWS::Lambda::Url. The custom-runtime folder referenced by DockerContext contains the bootstrap, Dockerfile, lambda_function.py, and requirements.txt files.
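If you go the transform-free route, the function URL resource might look like the following sketch (the resource name is illustrative):

```yaml
# Sketch: a function URL with streaming enabled, without the transform
CustomImageLambdaUrl:
  Type: AWS::Lambda::Url
  Properties:
    TargetFunctionArn: !GetAtt CustomImageLambda.Arn
    AuthType: NONE
    InvokeMode: RESPONSE_STREAM
```

InvokeMode: RESPONSE_STREAM is the property that switches the URL from the default buffered mode to streaming.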
Once the setup is complete, run the sam build and sam deploy commands to deploy the resources in your AWS account. Once the deployment is complete, open the Lambda in the AWS Console to get the function URL. And we're done: use the function URL in your client implementation to stream data from Lambda.
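As a minimal sketch of such a client using only the Python standard library (the URL below is a placeholder, and `stream_lambda` / `decode_chunk` are hypothetical helper names, not part of the deployed stack):

```python
from urllib import request

def decode_chunk(chunk: bytes) -> str:
    # The handler yields UTF-8 encoded text, so decode each chunk back
    return chunk.decode("utf-8", errors="replace")

def stream_lambda(url: str, prompt: str):
    # POST the prompt and read the body in small pieces as they arrive
    req = request.Request(url, data=prompt.encode("utf-8"), method="POST")
    with request.urlopen(req) as resp:
        while True:
            chunk = resp.read(1024)
            if not chunk:
                break
            yield decode_chunk(chunk)

# Usage (placeholder URL):
# for text in stream_lambda("https://<url-id>.lambda-url.<region>.on.aws/", "Hello"):
#     print(text, end="", flush=True)
```

Printing each piece as it arrives, rather than accumulating the whole body first, is what makes the streaming visible to the end user.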
Note:
- Lambda response streaming can only be viewed through a client that is configured to receive streamed data. The AWS Lambda Console does not support this, so you will not be able to view the stream from the console.
- Lambda response streaming must be consumed via a Lambda function URL; it cannot be consumed via API Gateway.