Benchmarking AWS Lambda SnapStart
AWS Lambda is AWS's Function-as-a-Service (FaaS) product. It runs our function code in response to trigger events such as HTTP(S) requests or events from other AWS products, for example DynamoDB updates.
It allows us to perform "serverless computing", where we do not have to set up servers that continuously listen for requests, but instead execute request handler functions "on-demand" whenever requests arrive. This means that AWS is responsible for setting up the infrastructure and environment that execute our functions, and we are free to focus on our business logic rather than infrastructure maintenance.
Context
The details of serverless architecture can be quite complex, so I will leave them out of this post. What is important to know is the general workflow: whenever a request arrives and no execution environment is available for it, AWS first needs to construct an isolated environment for our function to execute in. Lambda uses Firecracker microVMs to achieve this. After setting up the infrastructure (virtual or otherwise), the runtime environment needs to be initialized (for example, the JVM is required to execute Java functions), after which our function executes.
As expected, these steps add quite a bit of overhead between receiving the request and actually executing the function. This overhead is called a "cold start".
Cold Starts
Cold starts have been a particular problem in the serverless computing industry, as it has been shown that in many cases the cold start time exceeds the actual function execution time (and typically, only the function execution time is billable).
For users, this translates to increased latency between function invocation and completion, which can be problematic for high-performance workloads. For service providers (such as AWS), this translates to lower utilization of their resources, because less of the time those resources are in use is billable.
Warm Starts
The converse is, of course, the warm start. This is the ideal scenario, which results in the shortest time from receiving the request to executing the function. Typically, when a function finishes executing, service providers don't immediately tear down the execution environment and infrastructure used by the invocation, and there can be a delay of up to 10-15 minutes before teardown.
This is so that if another invocation arrives before the teardown actually happens, we can skip all the overhead that causes cold starts and simply execute the function in the existing execution environment, drastically reducing the time taken to serve the request. This is known as a "warm start", and its latency is the ideal that we want to get as close to as possible when reducing cold starts.
Snapshots
One way to reduce cold starts is to load a "snapshot" of a function. The idea is to initialize the execution environment once, take a snapshot of it in that initialized state, and store the snapshot somewhere. On later invocations that would otherwise be cold starts, instead of setting up the environment from scratch, we simply restore the snapshot from storage and the function is ready to serve requests. This is what the SnapStart option does for AWS Lambda functions: Lambda takes the snapshot when a function version is published and restores it whenever a new execution environment is needed.
Experiment Methodology
I wrote a simple program that sends requests to two identical hello world functions behind API Gateway, one with SnapStart enabled and the other with SnapStart disabled. One sample is the round-trip time from sending a request to receiving its response. I repeated this measurement 100 times with an inter-arrival time (the interval between requests) of 10 minutes, which I determined was long enough for a cold start to happen.
The client program runs on an EC2 instance in the same region as the functions to minimize network latency, so that any performance difference we see is likely due to SnapStart.
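The measurement loop described above can be sketched roughly as follows. This is not the original client, only a minimal illustration using the Python standard library. The names `fetch` and `measure_round_trips`, the 30 second timeout, and the choice to record milliseconds are all assumptions made for the example.

```python
import time
import urllib.request
from typing import Callable, List


def fetch(url: str) -> None:
    """Send one GET request and read the full response body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        response.read()


def measure_round_trips(
    send: Callable[[], None],
    samples: int = 100,
    interval_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[float]:
    """Call send() `samples` times and return one round-trip time (ms) per call."""
    latencies_ms = []
    for i in range(samples):
        start = time.perf_counter()
        send()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
        if i < samples - 1:
            # Stay idle between requests so the next one is likely a cold start.
            sleep(interval_s)
    return latencies_ms
```

A run against one function would then look like `measure_round_trips(lambda: fetch(url))`, where `url` is the API Gateway endpoint of that function. Passing `send` and `sleep` as parameters keeps the loop testable without a network connection or a 10 minute wait.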
Experiment Results
Here are the results of the experiment:
| | SnapStart Disabled | SnapStart Enabled |
|---|---|---|
| Median | 921.00 | 836.50 |
| 95th Percentile | 1076.35 | 1054.25 |
We can see that SnapStart Enabled has a lower (better) median latency as well as a lower 95th percentile latency. Median latency improves by 9.17% going from SnapStart Disabled to SnapStart Enabled.
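For reference, here is a minimal sketch of how these summary statistics and the relative improvement can be computed from raw samples. The function names are made up for the example, and the linear-interpolation percentile is an assumption: the post does not say which percentile method was used.

```python
import math
import statistics
from typing import Dict, Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """p-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * p / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Median and 95th percentile of a list of latency samples."""
    return {"median": statistics.median(values), "p95": percentile(values, 95)}


def improvement_pct(baseline: float, candidate: float) -> float:
    """Relative latency reduction of candidate versus baseline, in percent."""
    return (baseline - candidate) / baseline * 100.0


# Medians from the first results table: (921.00 - 836.50) / 921.00
print(round(improvement_pct(921.00, 836.50), 2))  # 9.17
```

The improvement is expressed relative to the SnapStart Disabled value, which is how the 9.17% figure follows from the two medians.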
CDF Plot:
The CDF also shows us that requests complete faster with SnapStart Enabled for the most part. However, the results are still a far cry from AWS's claims of "up to 10x faster". One possibility is that our function is too simple and does not benefit much from SnapStart.
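As a reminder of what a CDF plot contains, the points of an empirical CDF can be computed as in the sketch below. The function name is hypothetical and the sample values in the usage line are made up, not data from this experiment.

```python
from typing import List, Sequence, Tuple


def empirical_cdf(samples: Sequence[float]) -> List[Tuple[float, float]]:
    """Pair the i-th smallest sample (1-based) with the cumulative fraction i/n."""
    ordered = sorted(samples)
    n = len(ordered)
    return [(value, (i + 1) / n) for i, value in enumerate(ordered)]


# Made-up latencies, only to show the shape of the output.
points = empirical_cdf([900.0, 850.0, 1000.0, 950.0])
```

Plotting these points for both functions on the same axes gives the comparison: the curve that sits further to the left reaches any given fraction of requests at a lower latency.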
To see whether snapshotting helps more with larger function image sizes, I ran another experiment with a 100 MB filler added to the function. It yielded similar results:
| | SnapStart Disabled | SnapStart Enabled |
|---|---|---|
| Median | 993.50 | 951.50 |
| 95th Percentile | 1216.05 | 1079.80 |
Median latency improved by only 4.23% going from SnapStart disabled to enabled, so there was no significant improvement.
CDF Plot:
Nevertheless, the latency performance is still consistently better with SnapStart Enabled than Disabled.
Closing Thoughts
While using SnapStart definitely improved our function latencies, it wasn't quite the exceptional increase in performance that one would expect. One possible reason could be that our function is too simple (hello world function), and there may be other optimizations happening for simple functions such that the latency difference is minimal.