Why serverless is still in its adolescence
FaaS Architecture (Work-in-Progress) ordered from business logic (BL) to operational logic (OL). Source.
Researchers Erwin van Eyk and Johannes Grohmann recently published a paper that identifies performance bottlenecks for vendors that offer Functions-as-a-Service (FaaS). According to Van Eyk and Grohmann, FaaS is the most advanced kind of serverless architecture available at the moment. As Jexia is a FaaS provider, I arranged a two-hour Q&A with Grohmann and Van Eyk to discuss the performance hurdles they identified and the need for benchmarking to analyze the very young FaaS market.
First a bit about your background. What is actually your interest in serverless architecture?
Johannes Grohmann: "Our research group concentrates on performance development for software systems. That's why we are an active member of SPEC RG, concentrating on different performance-relevant topics from an analytical point of view. With the growing attention for serverless and FaaS platforms, SPEC RG also made the decision to start to identify performance-relevant challenges for Functions-as-a-Service. The report we now chat about is one of the first results coming from this working group."
To ensure we are aligned about the description of FaaS, how would you explain its meaning?
Johannes Grohmann: There are still a lot of contrasting points of view on the exact definition of FaaS. Our research group discussed this issue in an earlier paper as well. In general, serverless computing describes a paradigm where all operational concerns, such as deployment and resource provisioning, are delegated to a cloud platform with a pay-per-use model. Function-as-a-Service (FaaS) is a form of serverless computing where you execute certain functions of your application in a serverless environment.
What led you to the decision to first focus on FaaS?
Erwin van Eyk: We began analyzing serverless computing two years ago. Back then the term “serverless” was used almost interchangeably with FaaS. That was also one of the first topics we wanted to pin down: what is serverless? And in which way is FaaS related to serverless?
After this first analysis, we chose to stay focused on FaaS, as it is the most mature of the serverless cloud models.
Can you explain what exactly performance isolation is?
Johannes Grohmann: By performance isolation, we mean ensuring that different VMs/containers/functions on the same physical resource do not (negatively) influence each other performance-wise. In FaaS platforms, multiple functions are usually executed on the same physical host in order to utilize the given resources efficiently. However, if one executes, e.g., two independent, CPU-intensive functions at the same time on the same core, both will “steal” each other’s CPU time and therefore have an increased latency.
Performance isolation refers to the degree to which two functions executed on the same host are independent of each other’s resource usage. Ideally, function A should not be influenced at all by how many other functions, B and C, are running on A’s host or by their resource usage.
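The effect of two CPU-intensive functions sharing one core can be illustrated with a small back-of-the-envelope model. This is only a sketch under a strong assumption: the core is shared perfectly fairly and nothing else (caches, memory bandwidth, I/O) interferes. The function names and the slowdown metric are made up for illustration and do not come from the paper.

```python
def shared_core_latency(cpu_seconds: float, co_located: int) -> float:
    """Latency of one CPU-bound function when `co_located` equally demanding
    functions (including itself) share a single core under ideal fair sharing."""
    if co_located < 1:
        raise ValueError("co_located must include the function itself (>= 1)")
    # Each function receives 1/co_located of the core, so the same amount
    # of work takes co_located times as long.
    return cpu_seconds * co_located


def isolation_slowdown(isolated: float, observed: float) -> float:
    """Relative slowdown caused by neighbours; 0.0 means perfect isolation."""
    return (observed - isolated) / isolated


alone = shared_core_latency(0.2, co_located=1)   # function A runs by itself
shared = shared_core_latency(0.2, co_located=2)  # A and B share the core
print(f"latency alone: {alone:.2f}s, shared: {shared:.2f}s")
print(f"slowdown: {isolation_slowdown(alone, shared):.0%}")
```

With perfect isolation the slowdown would stay at zero regardless of how many neighbours run on the host; in this idealized fair-sharing model, one equally busy neighbour already doubles the latency.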
Before diving deeper, one more thing about FaaS: in which situations/scenarios is FaaS, in your eyes, the best type of serverless computing to go for? You say that the serverless landscape has matured in the meantime, so perhaps you see alternatives?
Erwin van Eyk: Using FaaS solutions is great for situations where you have to serve a bursty, ideally CPU-bound, workload. This of course is pretty generic, which makes FaaS applicable in many use cases. I don’t really have a specific scenario in mind, though the various vendors have been publishing quite a number of use cases.
The landscape has indeed matured. You have for example offerings like databases (e.g. AWS Aurora) and containers (AWS Fargate), all being marketed as serverless. These are not so much alternatives or replacements for FaaS, but rather complements, enabling more complex serverless applications.
The older concept of BaaS (Backend-as-a-Service) is often considered to be another type of serverless computing. To what extent do you consider it part of the serverless family, and do you think it will continue to compete with FaaS in the future?
Erwin van Eyk: What is and what isn’t serverless is indeed a bit of a controversial topic. In our initial vision we argue for a broad definition for serverless. A serverless service should in principle exhibit the following aspects:
(1) Granular billing: The service only bills the user for actual resources used to execute business logic. For example, a traditional VM does not have this characteristic, as users are billed hourly, not for resources that are actually utilized.
(2) Minimal operational logic: Operational logic, such as resource management, provisioning, and autoscaling, should be delegated to the cloud provider.
(3) Event-driven: User applications (whether they are functions, queries, or containers) should only be active/deployed when they are needed, that is, when an event requests it.
So with that definition, BaaS is indeed serverless. I don’t think there is a clear competition happening between BaaS and FaaS; both have their uses. Likely we will see platforms experimenting with variations on these two models.
OK, onwards to the performance challenges you see regarding FaaS architectures. Can you describe them briefly?
Johannes Grohmann: We identify six major performance challenges:
(1) Overhead: The FaaS platform introduces some overheads (e.g., provisioning overhead when starting a new function instance) which might prevent adoption for some use cases.
(2) Performance isolation: We already talked about this one. Here, it is important to find a balance between efficiency and performance guarantees.
(3) Scheduling policies: Once a function event is triggered, a request has to be scheduled to a specific function instance. This provides great room for optimization. However, you have to consider that the scheduling is done in an online fashion and therefore introduces additional overhead.
(4) Performance prediction: Many techniques have been proposed to predict the performance of traditional software systems. It is unclear how they can be adapted to FaaS platforms.
(5) Engineering for Cost-Performance: The pay-per-use pricing model of serverless platforms seems calibrated for a moderate number of requests per second. For higher workload intensities, dedicated VMs can turn out to be cheaper. Here, more complex pricing models might be investigated in order to relate the performance to its cost.
(6) Evaluating and Comparing FaaS Platforms: As the trend of serverless platforms is somewhat new, there is a lack of standardization and benchmarks of FaaS platforms, which prevents making informed decisions when it comes to evaluating FaaS offerings. This is one of the main issues that we as the SPEC FaaS research group are currently working on.
FaaS function executions in theory (left) and in practice (right).
Regarding the first one: besides provisioning overhead, you point to request overhead and to function lifecycle management and scheduling overhead. As stated in your paper, provisioning overhead is the dominant one of the three, as FaaS platforms need to deploy cloud functions prior to their use and provision the underlying resources (such as containers or VMs) prior to deployment. How would you recommend vendors tackle this overhead?
Johannes Grohmann: I think the key here is to prepare the underlying infrastructure as much as possible, so that the actual provisioning task is as small, and therefore as fast, as possible. This means that, as you already pointed out, VMs or containers should be readily available when a new function instance is about to be deployed.
Additionally, it is not wise to shut an instance down immediately after function execution finishes. Instead, one should keep a “warm” instance running for a while in case additional requests arrive. This is, of course, a trade-off again: you have to weigh the cost of keeping “warm” instances running against the benefit of avoiding cold starts.
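The keep-warm trade-off can be made tangible with a tiny simulation. The sketch below is illustrative only: it assumes a single function instance, instantaneous executions, and a fixed keep-alive window, none of which is stated in the interview. The function name and the arrival times are made up.

```python
def simulate_keep_alive(arrivals, keep_alive):
    """Return (cold_starts, idle_warm_seconds) for one function instance.

    arrivals:   sorted request arrival times in seconds
    keep_alive: seconds an idle instance is kept warm after a request
    Assumes executions are instantaneous (a simplification).
    """
    cold_starts = 0
    idle_warm_seconds = 0.0
    previous = None
    for t in arrivals:
        if previous is None:
            cold_starts += 1          # the very first request is always cold
        else:
            gap = t - previous
            if gap <= keep_alive:
                idle_warm_seconds += gap         # instance waited, request is warm
            else:
                idle_warm_seconds += keep_alive  # window expired unused
                cold_starts += 1
        previous = t
    if previous is not None:
        idle_warm_seconds += keep_alive  # window after the last request
    return cold_starts, idle_warm_seconds


arrivals = [0, 5, 8, 60, 62, 200]
for window in (0, 10, 60):
    cold, idle = simulate_keep_alive(arrivals, window)
    print(f"keep-alive {window:>2}s: {cold} cold starts, {idle:.0f}s idle but warm")
```

A longer window removes cold starts but increases the time the provider pays for an instance that serves nothing, which is exactly the balance described above.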
The typical lifecycle of a cold start and warm execution.
On many platforms, function runtimes are preemptively deployed and only the function code needs to be deployed during a cold start. As an example, AWS Lambda’s layers take this concept a step further. Here a function is defined as a set of layers.
For example, a function might use a Python runtime, add a numpy layer and a scikit-learn layer on top and then the actual function code itself might be fairly small. This enables Lambda to preemptively provision not only the Python runtime but also the numpy and scikit-learn layer as many different functions might build upon those layers. Erwin also discussed this issue in an article.
The anatomy of the runtime of a FaaS platform. Source.
Thanks for the explanation and link. The other mentioned overhead types, how do vendors overcome these?
Johannes Grohmann: Unfortunately, the involved vendors are not particularly forthcoming with information about their underlying implementations. To the best of our knowledge, the major open-source platforms currently do not employ any specialized optimization for these overheads. This is, again, due to the relatively new nature of the whole field.
Alright, we discussed the definition of performance isolation earlier. The challenge in this for vendors is to find a balance between efficiency and performance guarantees. What is your recommended best-practice to organize infrastructure to overcome this?
Johannes Grohmann: We do not think there are established best practices for that issue yet either. In general, finding a balance between efficiency and performance also depends on the underlying infrastructure of the platform. Does it use containers, micro-VMs, VMs or a different concept altogether?
However, performance isolation has been a research topic for many years, as general virtualisation techniques such as VMs and containers have the same problems. Therefore, a lot of approaches for achieving performance isolation exist. Here, it would be interesting to investigate how the existing solutions can be adapted to FaaS platforms.
Can you briefly summarise which approaches to achieving performance isolation are most commonly applied?
Johannes Grohmann: The authors of this paper give a nice overview of the topic. The first thing to do is to apply quotas, i.e., to define a limit on how much of each resource each virtualisation unit can consume. By applying hard quotas and not permitting overbooking, one can try to ensure that the assigned quotas are always available.
However, there are still resources (e.g., disk) where it is not trivial to apply quotas. Additionally, hard quotas without overbooking of course come at the cost of lost efficiency whenever the guaranteed quotas are not used. Therefore, cloud providers usually do not apply such hard limits.
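A minimal sketch of quota-based admission can show the difference between hard quotas without overbooking and an overbooked host. Everything here is hypothetical: the `Host` class, the `overbooking_factor` parameter, and the numbers are invented for illustration and do not describe any real platform.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    cpu_capacity: float                 # physical cores on the host
    overbooking_factor: float = 1.0     # 1.0 = hard quotas, no overbooking
    quotas: dict = field(default_factory=dict)

    def admit(self, name: str, cpu_quota: float) -> bool:
        """Admit a function only if its quota fits the (possibly overbooked) capacity."""
        reserved = sum(self.quotas.values())
        if reserved + cpu_quota > self.cpu_capacity * self.overbooking_factor:
            return False
        self.quotas[name] = cpu_quota
        return True

    def quotas_guaranteed(self) -> bool:
        """True if every function could use its full quota at the same time."""
        return sum(self.quotas.values()) <= self.cpu_capacity


strict = Host(cpu_capacity=4.0)
print([strict.admit(n, 2.0) for n in ("a", "b", "c")], strict.quotas_guaranteed())

overbooked = Host(cpu_capacity=4.0, overbooking_factor=1.5)
print([overbooked.admit(n, 2.0) for n in ("a", "b", "c")], overbooked.quotas_guaranteed())
```

The strict host rejects the third function and can always honour its quotas; the overbooked host packs more functions onto the same hardware but can no longer guarantee them, which is the efficiency-versus-guarantee trade-off in miniature.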
You spoke about scheduling policies as another performance challenge for FaaS providers. Did vendors already find multiple ways to be able to cope with this?
Johannes Grohmann: We do not have perfect knowledge about how the vendors deal with the given challenges, as the implementations are usually closed-source. However, we think that the issue of scheduling policies is still an open issue with lots of interesting research problems.
The interesting part is that scheduling policies offer a lot of optimization potential by considering workflow deadlines, the location of input data and code, load balancing and/or co-located functions.
Therefore, using a (near-)optimal algorithm can lead to major cost savings. On the other hand, however, you have real-time constraints. More complex scheduling policies introduce a greater scheduling delay, which is to be avoided. Therefore, schedulers need to make fast decisions that still lead to satisfactory scheduling.
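As an illustration of a fast, deliberately simple policy, the sketch below greedily picks a warm instance, preferring one on the host that already holds the input data and then the shortest queue. It is a hypothetical example, not the policy of any vendor or of the paper; `Instance`, `pick_instance`, and `max_queue` are invented names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instance:
    host: str        # host the warm instance runs on
    function: str    # function deployed in the instance
    queued: int      # requests currently waiting on the instance


def pick_instance(function: str, data_host: str, instances: list,
                  max_queue: int = 2) -> Optional[Instance]:
    """Greedy choice: data locality first, then shortest queue.

    Returns None when no warm instance has room, which the caller
    would treat as a signal to start a new (cold) instance.
    """
    candidates = [i for i in instances
                  if i.function == function and i.queued < max_queue]
    if not candidates:
        return None
    # False sorts before True, so instances on the data's host win.
    return min(candidates, key=lambda i: (i.host != data_host, i.queued))


pool = [Instance("h1", "resize", 1),
        Instance("h2", "resize", 0),
        Instance("h1", "thumb", 0)]
print(pick_instance("resize", "h1", pool))   # local instance despite longer queue
print(pick_instance("resize", "h3", pool))   # no local instance: shortest queue
print(pick_instance("encode", "h1", pool))   # nothing warm: cold start needed
```

A single linear pass like this keeps the scheduling delay small; a policy that also weighs deadlines or co-location would decide better but take longer, which is the tension described above.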
One approach to optimized scheduling was presented by Cristina Abad, a co-author of this paper, which describes the optimization. It is an improvement for current open-source platforms at least. However, this is just the first step, and there is still room for more research in that area.
As you stated previously, many techniques have been proposed to predict the performance of traditional software systems. It is still unclear, however, how they can be adapted to FaaS platforms. Which techniques seem to have the most potential?
Johannes Grohmann: One way is to utilise architectural performance models. However, they need to model both the hardware and platform structure (the cloud-provider view) and the application (the cloud-customer view), and information on both can be hard to gather at the same time.
We are currently working on automated extraction techniques from the cloud-provider point of view that infer the information about the application from monitoring data. One can also use machine learning models to try to predict the performance of individual functions and/or requests. We are working on an algorithm that tries to achieve that as well.
Alright, thanks. Let us discuss the pay-for-what-you-use model further, which in your eyes is a typical trait of serverless architectures. You stated that it can turn out to be more expensive than VMs at higher workload intensities.
Johannes Grohmann: I can maybe try to give a concrete example: let’s assume one function execution costs 0.001€ on a given FaaS platform. We have a (more or less) constant workload of 1000 requests/s that are not very latency-sensitive and can, therefore, be handled sequentially. This would cost around 1€ per second to run.
However, I can also rent a VM, deploy my function and use it to serve the same requests. I need to rent a VM that is able to serve 1000 requests/s, which costs about 0.75€ per second. In this case, using a VM would be cheaper and save about 0.25€ per second while serving the same number of requests.
This can be due to the additional overhead spent by the FaaS platform in trying to estimate your resource needs without having your information about the workload intensity, plus the additional scheduling overhead etc.
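The arithmetic of this example can be written down in a few lines. The prices and the VM capacity are the illustrative figures from the answer above, not real vendor prices; the function names and the assumption that VMs are rented in whole units are added here for the sketch.

```python
import math

PRICE_PER_EXECUTION_EUR = 0.001   # FaaS price per function execution (example figure)
VM_PRICE_PER_SECOND_EUR = 0.75    # rental price of one VM per second (example figure)
VM_CAPACITY_RPS = 1000            # requests/s one VM can serve (example figure)


def faas_cost_per_second(requests_per_second: float) -> float:
    """Pay-per-use: cost grows linearly with the number of executions."""
    return requests_per_second * PRICE_PER_EXECUTION_EUR


def vm_cost_per_second(requests_per_second: float) -> float:
    """VMs are rented whole (an assumption), whether fully used or not."""
    vms_needed = max(1, math.ceil(requests_per_second / VM_CAPACITY_RPS))
    return vms_needed * VM_PRICE_PER_SECOND_EUR


for rps in (100, 500, 1000):
    faas, vm = faas_cost_per_second(rps), vm_cost_per_second(rps)
    cheaper = "FaaS" if faas < vm else "VM"
    print(f"{rps:>5} req/s: FaaS {faas:.2f} EUR/s, VM {vm:.2f} EUR/s -> {cheaper}")
```

Under these figures, FaaS is cheaper as long as the load stays below roughly 750 requests/s (0.75€ divided by 0.001€ per execution); at the constant 1000 requests/s of the example, the VM saves about 0.25€ per second.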
And then the solution you propose consists of more complex pricing models. Presumably, more complex pricing models are not very transparent or user-friendly for the consumer. What could that look like?
Johannes Grohmann: One way would be to introduce a concept like resource reservation. This concept is already known from the domain of VMs. If I, as a customer, know beforehand which and how many requests to expect in the following minute/hour/day, I can book resources in advance. These resources are then sold at a cheaper price, as compared to the spot instances (for VMs).
On the downside, I still have to pay for these resources, even if I do not use them. I would like to note that this way of pricing and resource reservation should still be based on the actual requests that I book in advance, as the operational concerns are usually abstracted away in the serverless context.
Do you think architectures can or should only be called serverless if the pay-for-what-you-use model is applied? That would no longer be the case if you book in advance, as you do not know for sure whether you will really consume what you booked. Do we need a broader definition?
Johannes Grohmann: I would not alter the definition. This proposal is first off only designed for a niche case of very big customers with constant load demands. They should still apply a rather conservative reserve, which they will definitely use and therefore is still pay-per-use. Additionally, the cost model proposed can be altered to just give a small discount on all used and early reserved/declared requests. All “normal” and non-reserved requests are charged at the normal rate. This way, it is pay-per-use in any case.
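The discount variant described in this answer can be sketched as a small billing function. The 20% discount and the per-request price are made-up values for illustration, and `bill_with_reservation` is a hypothetical name; the interview gives no concrete rates.

```python
def bill_with_reservation(requests_used: int, requests_reserved: int,
                          price_per_request: float,
                          reserved_discount: float = 0.2) -> float:
    """Bill only requests that were actually executed.

    Requests covered by the advance reservation get a discount; anything
    beyond the reservation is charged at the normal rate. Unused
    reservations cost nothing, so the model stays pay-per-use.
    """
    discounted = min(requests_used, requests_reserved)
    normal = requests_used - discounted
    return (discounted * price_per_request * (1 - reserved_discount)
            + normal * price_per_request)


# 1000 requests declared in advance, at 0.001 EUR per request.
print(f"{bill_with_reservation(1200, 1000, 0.001):.2f} EUR")  # over the reservation
print(f"{bill_with_reservation(500, 1000, 0.001):.2f} EUR")   # under the reservation
```

In this variant the provider gains predictability from the declared load rather than revenue from unused reservations, which is why a conservative reserve makes sense for the customer.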
Then the last point you describe as a challenge: a lack of standardization and benchmarks of FaaS platforms. This is one of the main issues that your SPEC FaaS research group is currently working on. Very recently, we had a related interview about another research paper that tries to tackle this problem. I do not know if you have read that paper, but can I ask you to check whether it is in line with the frameworks you are creating to compare serverless architectures?
Johannes Grohmann: I did not go through the paper thoroughly, but at first sight, it looks like the goals of this study are complementary to the benchmark we are working on. We focus on comparing FaaS platforms from a performance view and leave the functional offerings aside. Additionally, we want to provide a realistic benchmark suite, consisting of multiple realistic workloads and use cases, and want to ensure reproducibility. This way, we want to ensure that the benchmark is not a snapshot comparison, but can be used to track how the performance of the individual FaaS platforms develops over time.
It is great that initiatives like yours are being set in motion. With regard to the roadmap, how would you prioritize the various performance challenges?
Johannes Grohmann: We are currently prioritizing the last challenge, as our main focus is the creation of the benchmark. I want to use the opportunity to invite interested researchers and/or industrial partners to join our efforts and propose/discuss solutions for the aforementioned challenges.