
Serverless Performance Shootout

Posted by Matt Billock on Aug 10, 2017

In this post, I compare the performance of three function-as-a-service providers: AWS Lambda, Google Cloud Functions, and Microsoft Azure Functions. I’ll walk through the serverless function providers we examined and the process we used for comparison, then draw some conclusions based on the tests that I ran.

First, a little background.

A couple of months ago, while preparing for a presentation at AWS Community Day SF, I did some quick performance analysis of Backand’s AWS Lambda logs. While I was able to get some great data on what the Node.js performance environment looks like, and on memory consumption as a function of runtime, I felt there was a lot more to be explored in this area. Specifically, the research raised the following two questions:

  1. If I’m composing a serverless web app out of multiple serverless functions, how does the hot-cold nature impact my application’s performance?
  2. How does AWS Lambda compare with some of the newer serverless function offerings?

And so, the Serverless Shootout was born.

Function-as-a-Service Contenders

When designing the tests, I opted to only analyze the three most popular serverless function providers – AWS Lambda, Google Cloud Functions, and Microsoft Azure Functions. While there are other offerings available, such as IBM OpenWhisk and Oracle Functions, focusing on AWS, Google, and Microsoft keeps things simple. Let’s review each of the contenders.

AWS Lambda is the first serverless function provider we took a look at. I was already familiar with AWS Lambda, having worked with Lambda functions fairly extensively (see a serverless function interface I developed for Amazon Alexa). AWS Lambda was launched at re:Invent in 2014, and in many ways kicked off serverless development as a concept. Building on the containerization approach popularized by tools like Docker, AWS Lambda gives you an environment in which you can run individual functions implemented as small, self-contained applications. AWS Lambda offers a number of different language runtimes for your functions, and has triggers into many AWS services, including HTTP triggers through AWS API Gateway.

Google Cloud Functions was launched in early-access mode in February of 2016, and has since been released to beta. It ties in closely with the rest of the Google Cloud platform, which has been moving toward feature parity with AWS for a while now as Google continues to improve its serverless offering. Similar to AWS Lambda, it provides the capability to run small applications in response to triggers in various Google Cloud services, and it offers an HTTP trigger natively. Currently, only Node.js is supported.

Microsoft Azure Functions was launched in March 2016, bringing serverless functions to the Microsoft Azure platform. Similar to both AWS Lambda and Google Cloud Functions, Azure Functions offers a number of triggers that integrate with different Azure services, and like Google, Microsoft has built HTTP triggers directly into the platform. Microsoft supports several languages, including C#, F#, and Node.js.

Comparison Methodology

Having decided on which providers to compare, the next task was to decide how to compare them. Each platform offers a disparate set of functionality, a different user interface, and a different runtime environment – even when working within the same language! As such, I quickly realized that a straight apples-to-apples comparison wasn’t going to be strictly possible. I therefore focused my testing on the aspects that I could reasonably compare on a 1:1 basis. This meant focusing on what I consider the most interesting metric: round-trip HTTP performance.

As all three contenders offered Node.js as an option for their serverless functions, I chose that as my platform and wrote a very simple serverless function that just performs string concatenation of a static message and a single body parameter. Though the three platforms offered different versions of Node.js (6.10.2 vs 6.11.1 vs 6.5.0), I figured that by keeping the code very simple I could be reasonably assured that I’d eliminated any performance perturbations due to differing Node versions.

I then wrote a quick Ruby script that would hit each of these functions 10,000 times, recording the round-trip HTTP request time for each provider. I added a pause of 1 to 1,200 seconds every 100 calls in order to test the “Hot vs. Cold” performance. As serverless functions can sit idle for long periods of time, most providers won’t architect your functions in a manner that makes them immediately available. This is actually a major benefit of serverless development, as you’re only paying for the resources you actually use instead of paying for constant availability. However, it means you face one of two situations when calling your serverless function:

  1. The “Hot” scenario. Your function has run recently, and the same machine instance on which it last ran is available to run it again. This results in the shortest round-trip time.
  2. The “Cold” scenario. Your function has been idle long enough for the function provider to tear down the resources used on the last call. The next call to your function will result in a new machine being instantiated to run your serverless function. Obviously, this will have a longer round-trip time due to the need to provision function resources prior to executing the code.

With the rest of the platforms’ features varying so widely, my primary goal was to test this hot-vs-cold behavior in running serverless functions. I made this my primary goal because it stands to reason that, with all else being equal (machine capabilities, function platforms, services offered, and so on), this is the primary discriminating factor between the three services – and also the factor with the highest impact on the performance of a serverless web application, simply due to compounding delays in request time.

Code Used

As I mentioned above, I wrote very simple serverless functions that would do some simple string manipulation before returning the result to the user. Here is an example of that code (pulled from Google Cloud Functions):
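What follows is a representative sketch rather than the exact code I deployed; the exported function name and the message string are illustrative.

```javascript
// Minimal sketch of the test function as deployed to Google Cloud Functions.
// The exported name and message text are illustrative, not the originals.
exports.concatMessage = function (req, res) {
  // Read the "message" parameter from the JSON request body
  var message = (req.body && req.body.message) || '';

  // Concatenate it with a static string and return success to the caller
  res.status(200).send('Hello from the serverless shootout: ' + message);
};
```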

I’m not solving the traveling salesman problem here! This code simply reads a “message” parameter, then concatenates it with a static string before returning success to the user. The goal is to minimize the differences between the platforms, and while there’s the potential for some unforeseen side effects in the code above (particularly due to the use of response objects on both Google’s and Microsoft’s part – see the caveats section below), I don’t feel they were significant enough to invalidate the test.

After I wrote the three serverless functions and had the HTTP endpoints set up, I moved on to the test script. I used Ruby, but the scripting language shouldn’t matter – as long as the same code is used to exercise all three providers, any peculiarities of the scripting platform affect each provider equally and can safely be ignored. I wrote the script in two parts. First, a function to call the HTTP endpoint for each provider. Here’s an example calling an AWS Lambda function:
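The snippet below is a representative sketch of that call rather than the verbatim script; the endpoint URL is a placeholder, not the real API Gateway URL.

```ruby
require 'net/http'
require 'json'
require 'uri'

# Placeholder endpoint - not the real API Gateway URL used in the test
AWS_URL = URI('https://example.execute-api.us-east-1.amazonaws.com/prod/concat')

def call_aws(message)
  request = Net::HTTP::Post.new(AWS_URL, 'Content-Type' => 'application/json')
  request.body = { message: message }.to_json

  # Send the POST over HTTPS and capture the response
  response = Net::HTTP.start(AWS_URL.host, AWS_URL.port, use_ssl: true) do |http|
    http.request(request)
  end

  # Print the response body as a visual indication that something happened
  puts response.body
end
```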

The above code uses Ruby’s built-in Net::HTTP library to initiate a POST request to the endpoint. It provides a JSON body containing the message parameter, then executes the request. Finally, it outputs the response body to the console to provide a visual indication that some activity had taken place. After creating a similar function for each of the providers, I then wrote a simple loop that ran these functions 10,000 times:
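Again, this is a sketch of the loop rather than the original script; the output file name is illustrative, and the error handling reflects the “log it and move on” approach described below.

```ruby
# Sketch of the timing loop for the AWS endpoint; the Google and Microsoft
# versions were the same apart from the function called and the output file.
File.open('aws_times.txt', 'w') do |file|
  10_000.times do |i|
    begin
      start = Time.now
      call_aws('serverless shootout')
      file.puts(Time.now - start)   # round-trip time, in seconds
      file.flush
    rescue StandardError => e
      # Minimal error handling: report the failure and skip this result
      puts "Call #{i} failed: #{e.message}"
    end

    # Every 100 calls, pause for 1 to 1,200 seconds to exercise cold starts
    sleep(rand(1..1200)) if (i + 1) % 100 == 0
  end
end
```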

As you can see, the code above uses text file output to track the timing results, and gauges the round-trip time with a simple time difference, calculated in seconds. Every hundred calls it pauses for up to 1,200 seconds, which should be sufficient to get at least 100 data points on Hot vs. Cold behavior. I then modified the loop to also exercise Microsoft’s and Google’s offerings (writing those results to separate files). The test was ready to go.

As a quick note, this script had minimal error handling; my focus was on keeping the script running at all costs. As such, all the script did when it encountered an error was output the error to the console and skip that result. I’ll touch on how we might improve this later.

For my purposes, though, I was only concerned with failures in a general sense. My goal was performance, and if a function failed I wasn’t too interested in why. The script – with a pause of up to 20 minutes every 100 calls – would take 33 hours to run in the worst case, and as I wouldn’t be sitting at my machine watching the console the entire time, I didn’t devote much thought to error analysis. This will come up later.

Now that the script was written, it was time to execute the test. I ran the test for more than 24 hours, minimizing the things I was doing on the machine at the time (see Caveats, below), and crossed my fingers, hoping I would get some interesting data. I divided my conclusions into two categories: subjective and objective.

User Experience Comparison

These observations are subjective, so take them with a large grain of salt: what I find confusing, another developer may find blindingly obvious, and vice versa.

In any case, here are my observations:

  • Google Cloud Functions had the best UI. I was up and running with a serverless function that could be hit via HTTP trigger in well under 10 minutes. This was not the case with AWS Lambda and Microsoft Azure Functions.
  • Microsoft Azure Functions had the largest number of configuration options when compared with Google Cloud Functions and AWS Lambda.
  • Microsoft Azure Functions had by far the most complex UI. The number of times I tried to do something and promptly hit a wall was markedly higher than on either the Google or AWS platforms. Some of that may have been due to platform familiarity, but when the first thing new Azure users see is a formidable blue screen with indecipherable icons listed under a hamburger menu, it’s not exactly user-friendly.
  • AWS Lambda had the most complex configuration setup, and this is entirely due to the need to configure AWS API Gateway for HTTP request triggers. As I mentioned above, I was able to get a function running on Google Cloud in under 10 minutes, complete with HTTP trigger. Microsoft was similar, but took additional time due to the user interface. With AWS Lambda, while I was able to get a function up and running very quickly (about 10 minutes as well), the API Gateway integration took over an hour and a half to resolve. This is mostly due to poorly documented return requirements – your response needs to be structured as a JSON object that includes a status code, and if you don’t include the appropriate elements the only thing you’ll ever see is “Internal server error” or a similarly unhelpful message (see the sketch just after this list).
  • The award for most obscure logs also goes to AWS with its CloudWatch integration, with Microsoft Azure a very close second. CloudWatch splits your logs up by time period and does not allow direct export, which means that I needed to first export the logs to S3 and then write a script to pull all the logs down, unzip the nested archive files, and finally parse the results. Microsoft at least offers the Azure Storage Explorer, which makes following a similar process on its platform a bit easier thanks to the built-in local query and export functionality.
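To illustrate that API Gateway return requirement from the AWS Lambda bullet above, here is a minimal sketch of a handler returning the response shape the proxy integration expects; the handler body and message text are illustrative.

```javascript
// AWS Lambda handler (Node.js) returning the structure that API Gateway's
// proxy integration expects: a status code plus a string body.
exports.handler = function (event, context, callback) {
  var body = JSON.parse(event.body || '{}');

  var response = {
    statusCode: 200,                                  // required
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({                            // must be a string
      result: 'Hello from the serverless shootout: ' + (body.message || '')
    })
  };

  callback(null, response);
};
```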

Performance Comparison

Let’s get to the meat of the test: performance analysis. I took the execution results from each of the providers, and generated histograms of function performance time. Here’s a graph comparing the run times across 10,000 calls for the three providers:

10k calls to functions-as-a-service providers - performance

In the above graph, the X axis defines a series of run-time buckets (in seconds). The Y axis is a simple count of the function calls that fell into each bucket for each provider. The yellow bars represent Microsoft Azure, the red Google Cloud, and the blue AWS Lambda. Already we can draw some observations – for example, it’s obvious that the Microsoft Azure runtime histogram shows better performance than both Google and AWS, and that while Google and AWS share similar performance characteristics, AWS has tighter grouping around its modes while the Google runtimes are a bit more evenly distributed. Let’s take a look at each provider in isolation.

AWS Lambda

aws lambda performance

The AWS Lambda histogram is strongly bimodal, almost quadrimodal. The primary modes are at about 0.16 seconds and 0.26 seconds, with smaller but significant peaks at 0.21 and 0.32 seconds. The distribution of the data is very tight around these peaks, which means there isn’t much variance in performance within either the “Hot” or the “Cold” mode (there is some, but not as much as you see with Google below). We’ll take a closer look at these numbers later on.

Google Cloud Functions

google cloud functions performance

Once again, when we take a look at Google Cloud Functions, the hot and cold peaks are patently obvious, coming in at 0.16 seconds (hot) and 0.22 seconds (cold). The thing I found interesting, though, was the much higher trough between the peaks, meaning that function run times are more likely to fall between the hot and cold peaks rather than being tightly clustered around them.

This can be a benefit or a detriment, depending on how you perceive the run times. With AWS, you can be reasonably assured that you’ll get a somewhat even split between hot and cold. With Google, you’re more likely to have to plan on the “cold” run-time being your average/worst case, simply due to the broader distribution of the results.

 

Microsoft Azure Functions

microsoft azure functions performance

Microsoft’s histogram shows the fastest runtimes of all three providers, with a mode right around 0.11 seconds. The interesting (read: frustrating) thing about this graph is that it is clearly a long-tail distribution biased towards short run times. This means one of two things: either Microsoft doesn’t have the hot vs. cold problem that both Google and AWS share (unlikely, in my opinion, though I don’t know much about Azure’s internal architecture), or my test’s cache-busting 20-minute wait wasn’t long enough to catch the spin-up times for Azure machine instances.

By the Numbers

Let’s do more concrete analysis. Below is a direct look at the success rates of calls to each provider:

Chart 1: Success vs Failure

 

Provider                  | Calls completed | Calls > 1 sec | Calls not completed
AWS Lambda                | 9,999           | 138 (1.38%)   | 1
Google Cloud Functions    | 9,999           | 115 (1.15%)   | 1
Microsoft Azure Functions | 9,612           | 168 (1.68%)   | 1 (see below; the original run had 388)

This table takes a holistic view of the calls as a collection of data points and asks a few simple questions. The first question is strictly about how many calls succeeded versus how many failed. As we’re working with HTTP communications, there is always the potential for a request to disappear into the internet backbone, either timing out or never returning. From this perspective, Microsoft is at a disadvantage, with 3.9% of calls failing. I found this odd, and after running the test again I wasn’t able to reproduce the same rate of failure, so the odds are that this was simply an aberration in my original run (likely due to my Azure Function having been in the wild for less than an hour), and this result is probably not as interesting as it seems at first. Note that I’ve left the original result in the table above; you can likely safely ignore it.

What is far more interesting, in my opinion, is the aggregate number of calls that took in excess of one second. With modes for all three providers well under 300 milliseconds, regardless of the hot or cold nature of the machine instance, I settled on one second as my threshold for an “excessive” response time. Once again, Microsoft trails both Google and AWS in this metric, but only by a few dozen calls. Given the faster average runtime of Microsoft Azure over both Google and AWS, we can probably call this result a wash.

Chart 2: Statistical Analysis

Provider                  | Mean round trip | Median round trip | Mode  | Standard deviation*
AWS Lambda                | 0.243           | 0.213             | 0.160 | 0.192
Google Cloud Functions    | 0.250           | 0.218             | 0.233 | 0.164
Microsoft Azure Functions | 0.162           | 0.111             | 0.126 | 0.386

(All values are in seconds. * See the note on standard deviation below.)

The next step was to apply my journeyman-level statistical skills to the problem and see what we could find. The table above shows some basic statistical comparisons between the three providers. These are interesting primarily as points of comparison – the most useful statistic here is very likely the median round trip, as it represents the point at which half of the calls in each data set were faster and half were slower.

Microsoft clearly comes out the winner here, with a median round-trip time that is just about half that of both AWS and Google. The mean, which is the arithmetic average, provides another useful comparison, but it is skewed by the large values at the right end of the histogram – the calls where HTTP requests either timed out or simply took far, far longer than average.

The mode is interesting to the extent that it tells us the most likely case for each provider, as it simply reports the most commonly occurring value in each data set – and based on this we can see that AWS Lambda hit the hot case most often, while Google Cloud Functions hit the cold case more often. The really interesting element, though, is the difference between the two modes, so I included the mode primarily as a conversation point.

The standard deviation gives you some information on the variance in the results, and based on it we can see that Microsoft had by far the most variance in its runtimes. However, it’s important to note that the standard deviation is primarily meaningful for normal distributions, and all of our histograms clearly represent non-normal distributions (AWS and Google are bimodal, while Microsoft is a long tail), so it is hard to draw anything truly meaningful from this data point other than using it as a very rough gauge of the variance in runtimes.
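If you’d like to reproduce these statistics from your own run, they can be computed directly from the timing files the test script writes. The snippet below is a rough Ruby sketch; the 10 ms bucket width used to approximate the mode is an assumption, not necessarily what I used.

```ruby
# Rough sketch (Ruby 2.4+): compute mean, median, approximate mode and
# standard deviation from a file of round-trip times, one value per line.
times = File.readlines('aws_times.txt').map(&:to_f).sort

mean   = times.sum / times.size
median = times[times.size / 2]

# Approximate the mode by grouping times into 10 ms buckets (assumed width)
mode = times.group_by { |t| (t / 0.01).round * 0.01 }
            .max_by { |_bucket, values| values.size }
            .first

std_dev = Math.sqrt(times.sum { |t| (t - mean)**2 } / times.size)

puts format('mean=%.3f median=%.3f mode=%.3f stddev=%.3f',
            mean, median, mode, std_dev)
```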

Chart 3: Hot vs Cold

Provider               | Mode 1 (hot) | Mode 2 (cold) | Difference
AWS Lambda             | 0.16         | 0.26          | 0.10
Google Cloud Functions | 0.16         | 0.22          | 0.06

(All values are in seconds.)

Finally, we’ll take a look at the data point that originally kicked off my performance comparison – hot versus cold performance. Using the histograms, I determined approximate hot and cold modes from the recorded runtimes, then calculated the difference between them. The result is that Google Cloud Functions’ cold-start penalty is about 40 milliseconds smaller than AWS Lambda’s. This is quite a small difference, but you need to consider the impact in aggregate – if I have written a serverless application that makes 25 serverless calls in serial during a page load, my worst-case performance difference between AWS and Google will be a full second (25 calls × 40 ms). Given that the numbers are so close, though, and that Google has so much more variance between its modes, we can only use these as observation points rather than scientific proof.

Conclusion

My initial goal was to compare hot-vs-cold performance for three serverless function providers, as well as to gather some aggregate statistics. The above data shows that while Microsoft Azure Functions had the best average runtime performance, it also had a higher error rate than both Google Cloud Functions and AWS Lambda (though, as noted above, that error rate appears to have been an aberration).

In addition, while Google Cloud Functions has a broader spread of results between its hot and cold modes, AWS Lambda is much more tightly grouped around each of its modes, resulting in more predictable performance.

In the worst case, AWS Lambda will add a second to the load time of a page with 25 serial serverless calls compared to Google (as each call adds, on average, an extra 40 milliseconds when the machines are cold). However, Google is more likely to hit that worst-case scenario than the much more consistent AWS.

Given the above, the final observation I have to make should be no surprise: you’re going to face trade-offs when choosing a serverless function provider.

Serverless Shootout Video

This post is based on a webinar I gave on Aug 2, 2017. You can watch the entire webinar video.

Notes and Caveats

After the above analysis, I’m sure I’ve offended some data scientists and mathematicians who happened across this post. Let me mollify those of you who are looking for holes in my tests by pointing out as many of those holes as I can myself:

  • To be clear, I am not a data scientist. I took several statistics courses throughout my education, and have a firm grasp of the difference between correlation and causation, but I make no claims about the accuracy of the methods I used. Mostly I am looking to drive a discussion, and I am hopeful that someone will take the above as a challenge to prove me wrong and generate a more “scientifically accurate” data set.
  • This test was run from the MacBook Pro that I use as my development machine, meaning there were other processes and programs running in the background as the test executed. I did my best to minimize those influences by closing open applications, turning off automatic updates, and so on, but this test could likely be improved by running it on a dedicated machine that does literally nothing else.
  • The requests made by this test were conducted over a consumer-level internet plan – specifically a 300 Mbit Comcast cable internet connection. This means that the test is subject to excessive traffic on my local node, when all my neighbors get home and turn on the latest episode of House of Cards on Netflix. As such, the results could be influenced by increased network traffic between my location and the node that feeds my connection into Comcast’s network.
  • Additionally, I conducted this test over a 5 GHz Wi-Fi connection. This adds some latency and will likely never be as fast as a wired connection, due both to interference from those same neighbors watching those same episodes of House of Cards and to the rest of my family making use of the network while the test was running. What this means, in my mind, is that the specific runtimes themselves are not the important element here, but rather the ratios between them.
  • I wasn’t able to capture the bimodality of Microsoft Azure Functions. I suspect this is due to my test design – my simple functions are great candidates for result caching, and it is entirely possible that my cache-busting wait time was insufficient to catch the machine instance activity window in Azure. If I were to conduct this test again, I’d add an element of randomness on the serverless function side to ensure there’s no caching taking place (see the sketch just after this list), and increase the wait time between each batch of calls to catch the spin-down time for machine instances on all platforms.
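As a rough illustration of that cache-busting idea (a sketch, not code from the actual test), simply echoing a random value back is enough to make every response unique:

```javascript
// Cache-busting variant of the test function: echoing a random value back
// makes every response unique, so no layer between the client and the
// function can serve a cached result.
exports.concatMessage = function (req, res) {
  var message = (req.body && req.body.message) || '';
  var nonce = Math.random().toString(36).slice(2);  // random token per call

  res.status(200).send('Hello: ' + message + ' [' + nonce + ']');
};
```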

In short, you can take all of the points above together to mean that I recognize that, for true scientific accuracy, this kind of test needs to be run on an isolated server in a rack with a direct connection to the backbone, using more complex serverless function code, in order to eliminate all confounding factors. I don’t really have the resources at my disposal to do this, however, so I’d be very interested in speaking with anyone looking to replicate this test in a more isolated, pristine environment.

Comments

  • Can you share your entire test script, minus the real URLs please. I’d like to test with OpenWhisk.

    • Matt Billock

      Sure, you can see it in CodeShare here: codeshare.io/aYWKjN

  • Chris Bilson

    “… it’s obvious that the Microsoft Azure runtime histogram shows better performance over both Google and Microsoft”. Was the last “Microsoft” supposed to be AWS?

    • Matt Billock

      Yep, absolutely – thanks for the catch!

  • Great post!

    Any plans to schedule a round 2? I feel like all the players in this space are making improvements, and it would be interesting to see the same tests run in say 6 months…

    • Matt Billock

      Absolutely – we’re working on a test that’s more representative of traditional usage (multiple threads with several concurrent requests) – we’ll publish that here once we’ve conducted the research!

  • I’m from the Azure Functions team.

    You can inflict a hard cold start on Azure by touching the file, but you’re right that we do a lot of games to prevent you from being in a cold start situation, so it’s probably not apples to apples or a production scenario.

    Error rate is weird. If you run it again and see funny business, message me and I can root cause it.

    Thanks for the insightful post.

    • Matt Billock

      Thanks for the tips, Chris – appreciate the info on force-triggering a cold start!

      Error rate was definitely weird. I re-ran the test a couple times afterwards, and I wasn’t able to reproduce the high number of incomplete calls. I mentioned that above, but it could get lost in the wall of text!

      • Martin Viau

        You might want to edit up the error rate aspect then, if it was a one-off and you were not able to reproduce it. Blips happen, and it seems unfair to Azure to keep that data up here, as most people will only “scan” your article, spot the table and the error rate data, but won’t read on to learn that it was effectively an anomaly, and will leave with the wrong take-away.

        • Matt Billock

          I added a note, but I did leave the original error count there. Regardless of it being a one-off, the rest of these results are still based on this data set, so I left the original value marked as a strikeout with a note to see the explanatory text.

  • Stephen Moore

    Nice work! An added test dimension to consider is calling and running lambda functions concurrently. Curious to know what the results of this might be on top of what you’ve established. Thanks.

    • Matt Billock

      Yep, this is scheduled for the next iteration in this investigation!

  • Martin Viau

    Thanks for this, much insightful / interesting.

    May I ask how you conclude that 25 calls to load a page on AWS will add 1 second to page load time? I would assume you are calling the functions asynchronously / concurrently (i.e. via ~Ajax/XHR over JS) and so those 40ms wouldn’t “add up linearly” as all those functions would pay the cold start cost “at the same point in time” so feeling more like a single ~40ms (plus of course concurrency overhead). I have a dashboard that calls ~12 different AWS Lambda endpoints and it feels ~realtime as they load concurently (it’s quite a beauty to watch the D3.JS components all come alive ~simultaneously).

    • Matt Billock

      You hit it – it was basically assuming 25 calls in serial at a hit of 40 ms apiece – as I wanted to assume a true worst case.

  • Arie H

    Interesting read. Wonder what the total cost was for each provider in terms of $. Although price changes might not be linear as you scale to more complex functions, basic price for the “simple” function is another measurement of effectiveness.

    • Matt Billock

      Total costs for Google and AWS – $0
      Total cost for Microsoft Azure – about 17 cents

  • I don’t know if I believe any of these benchmarks; they are far too quick. They seem to reflect the performance of the actual function call and not the entire HTTP trigger, function call, response lifecycle. Did your benchmark cache the HTTP response? It literally takes seconds to get a response from an AWS Lambda via API Gateway and it’s a known issue. Amazon has responded to the community by explaining the cold/warm start and then also explaining how Lambdas with more RAM will run faster (get more network i/o). They also explained a bit about API Gateway and CloudFront and how the more requests you make the faster it gets. So the idea is through a lot of usage things get faster.

    Funny enough, the way most people I know get better performance from AWS Lambda over HTTP is to spin up an EC2 instance and handle HTTP requests to then invoke Lambda functions. Somewhat defeats the point of “serverless.” At least a little bit.

    I love serverless functions and APIs and that’s how I basically want to write everything…But the current state of it is not great for performance (I don’t think it’s “horrible” but it’s also not “great” and often not “good”). Hence why I’m here off a Google search. I’m trying to see what others have to say. I’m quite surprised by this blog post.