Understanding AWS Lambda Coldstarts

Within the community of AWS Lambda users, we frequently hear complaints about coldstarts. Yet little data has been presented on what is happening and when. Users have told us they keep their code warmed, but that coldstarts still happen. When they do, function invocation latency may spike anywhere from hundreds of milliseconds to dozens of seconds. Such delays can hurt interactive serverless applications built on Lambda, and understanding them in the context of an application’s latency requirements can be vital for some users.

So we’ve dug into when and how coldstarts happen. Initially, I did some quick logging and (incorrectly) tweeted that it seemed to be about 3.5 hours between coldstarts. That wasn’t terribly wrong, but with more data comes greater accuracy.

Today, IOpipe is collecting metrics from millions of Lambda invocations across various memory tiers, execution durations, and two languages (Python and Node.js). Yet for coldstarts, our data shows no variance across any of these: with few exceptions, coldstarts happen 4 hours after the creation of a host VM.

[Figure: max-uptime-timeslice]

This graph shows anonymized data from several functions over a period of 11 days, plotting the maximum value of the Linux kernel’s reported uptime. It clearly indicates that hosts do not typically exceed 15,000 seconds of uptime (4 hours is 14,400 seconds). There is one exception: although I have not (yet) seen an acknowledgement from AWS, we saw Lambda host uptimes as high as 8 hours on 9/9/2016. Why?

We began tracking this metric for users back in early August. As it happens, there was an event on 8/12/2016 that Amazon did acknowledge. See the following chart:

[Figure: coldstart-comparison]

That day, Amazon reported increased errors and longer execution durations on Lambda in us-east-1. We had less data then, but the results are pretty clear in the chart above: there was a massive increase in the uptime of host VMs running Lambda. My guess then, and today, is that some failure occurred in the alarms intended to reap VMs at the 4-hour mark. Instead, we saw host uptimes reaching the 7.5-hour mark, presumably depleting resources available to the Lambda service. We presume that a similar event or maintenance window occurred on 9/9/2016.

Whatever the reasons, a good rule of thumb for now seems to be 4 hours.
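To put that rule of thumb to work, a function could estimate how long its host has left before it is likely to be recycled. The sketch below is only an illustration: the 4-hour limit is the observation from our data, not a documented AWS guarantee, and the name seconds_until_recycle is hypothetical. It takes a host uptime in seconds, such as the value the Linux kernel reports.

```python
# Observed rule of thumb from the data above, not an AWS-documented limit.
RECYCLE_AFTER_SECONDS = 4 * 60 * 60  # 4 hours = 14400 seconds


def seconds_until_recycle(host_uptime_seconds, limit=RECYCLE_AFTER_SECONDS):
    """Estimate the seconds left before the host reaches the limit.

    Returns 0.0 once the host is already past the limit, as happened
    during the 7.5-hour uptimes described earlier.
    """
    return max(0.0, limit - host_uptime_seconds)
```

For example, a host that has been up for one hour would have roughly three hours left under this assumption, while a host already past the 4-hour mark yields zero.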

