Updating Azure Functions in Production

In this post, I’m going to address updating Azure Function Apps. When you’ve gone live in production and made a change to your functions’ code, how do you update the app without bothering the users? There could be various scenarios for updating a function app, depending on the Function App’s nature, responsibility and deployment strategy.

Problem definition

Generally speaking, some or all of the following features are required in production app upgrades:

  • Zero or minimum downtime
  • Zero or minimum action/data loss
  • CI/CD Pipeline integration

Depending on what the function app is meant to do and how it does it, we can decide what’s really needed in our upgrade process. As a case in point, if we have a serverless web app used by many users scattered globally, zero/minimum downtime would be the main concern: with users in different time zones, the app should stay up and running 24 hours a day, and we can’t take it down for an update.

If you need minimum downtime, the best way of updating the app is through deployment slots. The way slots work is pretty simple: think of a slot as another Azure Function app that runs side by side with your current app and can be swapped with it quickly, with nearly zero downtime. In fact, the main function app is itself a slot named “production”, and once you swap it with another named slot, only the routing, addresses and settings change, effectively instantaneously, so all the slots stay up and running just as before. Additionally, we can easily roll back (by swapping again) if something goes wrong.

At the time of writing this post, function slots are still in preview. You can see how to create and use slots in the following blog post:
https://blogs.msdn.microsoft.com/appserviceteam/2017/06/13/deployment-slots-preview-for-azure-functions/

Deployment slots work like a charm for most apps, but they’re not a silver bullet and have their own downsides.

In one of our sprints, I was supposed to do some research on slots. As Azure Function Apps are still pretty new, though, I couldn’t find many useful documents or benchmarks on slots and the swap feature, so I decided to benchmark it myself to see how reliable it is for our app.

Let me give you a little bit of background on how we use Azure Functions in our solution. The main app is a multi-tenant SaaS cloud app. Each tenant can define a few connectors that monitor a file repository and submit the file/folder changes to the main cloud app. These connectors are actually Azure Function apps that are deployed at run-time, once a tenant creates a new connector. So, as you can see, in our scenario function app deployments, and consequently upgrades, happen at run-time.

[Figure: connectors deployed as Function Apps]

Another important point in our scenario is that some of the functions in the function app have timer triggers, meaning that they run scheduled tasks. For instance, one of the functions polls for file/folder modification events on the file repository every 5 minutes and puts the detected event data in an Azure Storage Queue, so that another function picks it up through a queue trigger and processes it. If you’re not familiar with timer and queue bindings/triggers, the following link is a good starting point:
https://docs.microsoft.com/en-us/azure/azure-functions/functions-triggers-bindings
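To make that timer/queue pairing concrete, here’s a rough C# sketch of such a pair of functions (the names, schedule and queue name are hypothetical, not our actual connector code):

```csharp
using System.Collections.Generic;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Host;

public static class RepositoryPoller
{
    // Timer-triggered: polls the repository every 5 minutes and enqueues
    // each detected change as a message on an Azure Storage Queue.
    [FunctionName("PollRepository")]
    public static void Run(
        [TimerTrigger("0 */5 * * * *")] TimerInfo timer,
        [Queue("file-events", Connection = "StorageConnectionSetting")] ICollector<string> eventsQueue,
        TraceWriter log)
    {
        foreach (var change in DetectChanges())
        {
            eventsQueue.Add(change);
        }
        log.Info("Polling run finished");
    }

    private static IEnumerable<string> DetectChanges()
    {
        // The actual repository-diffing logic goes here.
        yield break;
    }
}

public static class FileEventProcessor
{
    // Queue-triggered: picks up each change message and processes it.
    [FunctionName("ProcessFileEvent")]
    public static void Run(
        [QueueTrigger("file-events", Connection = "StorageConnectionSetting")] string fileEvent,
        TraceWriter log)
    {
        log.Info($"Processing event: {fileEvent}");
    }
}
```

The queue between the two functions is what decouples detection from processing, and it’s also why a missed submission during an upgrade matters so much in our case.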

As you might have concluded from what I said about the app’s mechanism, zero data loss is the main concern in our app, not zero/minimum downtime. The app monitors the repository every 5 minutes, and we don’t care if those 5 minutes occasionally become 7. Additionally, our app is not used directly by end users, so zero downtime is not really relevant in this case.

Upgrading through Function Slots

Now that we know about the requirements, it’s time to see what happens once we update our app. Suppose we’re using Azure function slots for deployments:

  • The main Function App V1 is already running in the production slot
  • We deploy V2 into a new slot
  • Once V2 is deployed and up and running, we swap the slots

This process is quick, with almost zero downtime, but it’s likely to lose some data or operations during the swap.

There weren’t many articles or benchmarks on the swap feature, so I ran a test myself: I made a function with a timer trigger that runs every 10 seconds and, in each run, loops 10 times, submitting an item to an HTTP endpoint on each iteration with a 1-second delay. So the function would be submitting items all the time, every second. Each submitted item had an identifier so that I’d know which function run and loop iteration it came from. The identifier was something like A(3,2), meaning it was sent from the 3rd function run and 2nd loop iteration; in general the format was A(m,n).
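The test function looked roughly like this (a sketch, with the endpoint URL and names as placeholders):

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Host;

public static class SwapBenchmark
{
    private static int _run;
    private static readonly HttpClient Client = new HttpClient();

    // Fires every 10 seconds; each run submits 10 items, one per second,
    // tagged A(run, iteration) so any gap during the swap is visible.
    [FunctionName("SwapBenchmark")]
    public static async Task Run(
        [TimerTrigger("*/10 * * * * *")] TimerInfo timer,
        TraceWriter log)
    {
        var run = Interlocked.Increment(ref _run);
        for (var i = 1; i <= 10; i++)
        {
            // In the V2 deployment this "A" becomes "B" to tell the versions apart.
            var id = $"A({run},{i})";
            await Client.PostAsync("https://example.com/collect", new StringContent(id));
            log.Info($"Submitted {id}");
            await Task.Delay(TimeSpan.FromSeconds(1));
        }
    }
}
```

With a near-continuous stream of identified submissions like this, any items lost during the swap show up as gaps in the (m,n) sequence on the receiving endpoint.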

Then I modified my function app’s code, this time changing A(m,n) to B(m,n) to distinguish the submitted items, deployed it into a new slot and swapped it with the main function app. I was impressed with the swap, as it took less than 2 seconds, but there were 2 missing item submissions during the swap. That seems normal to me: during the swap we’re actually changing the routes and may lose our network sockets and other IO-related state. But it’s painful in production; if a function is in the middle of something when we swap, we’re going to lose that item, which is not desirable.
[Figure: slot swap test results, showing the missed submissions during the swap]

Graceful app stop by handling the CancellationToken

Potentially, there is another way to make upgrades safe and graceful, without losing any data/operations: handling the CancellationToken in the functions.

Take this sample function:

using System.Threading;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Host;

public static class ProcessItem
{
    [FunctionName("ProcessItem")]
    public static void Run(
        [QueueTrigger("myqueue-items", Connection = "StorageConnectionSetting")] string myQueueItem,
        TraceWriter log,
        CancellationToken cancellationToken)
    {
        cancellationToken.Register(() =>
        {
            log.Info("Function app is stopping");
            log.Info("Stashing the item in a blob storage");
            // The code for stashing the item goes here
            log.Info("Item stashed so we're good to upgrade");
            log.Flush();
        });

        if (!cancellationToken.IsCancellationRequested)
        {
            // The code to process the item goes here
        }
    }
}

As you see, we have a binding of the CancellationToken type in the Azure Function. Once we add this binding to the method, we’re able to know when the app is about to stop. As soon as the function app starts stopping, the cancellation token is cancelled, and we can do whatever is needed so that we don’t lose anything as a result of the stop (in our case, a stop caused by an upgrade).

Sounds exciting, but the thing is: by default, it only gives you 5 seconds to do whatever needs to be done before the app stops. And it’s good to know that in practice you won’t even get the full 5 seconds; it will be less!

The good news is that we can increase this wait time by adding the “stopping_wait_time” setting to the “settings.job” file. (You need to create the settings.job file in the “site” folder via the Kudu console, as shown in the following screenshot.)

Settings.job file that sets the stopping wait time to 2 minutes:

{
     "stopping_wait_time": 120
}
[Screenshot: adding/editing the settings.job file from the Kudu console]

So far so good, huh? But the bad news is that although this setting increases the waiting time, it doesn’t come close to our 2 minutes, and it’s not reliable: sometimes there’s enough time to stop gracefully and sometimes there isn’t. If you try the function I mentioned above, it sometimes hits the cancellation callback and sometimes doesn’t! It was not reliable at all. I didn’t have enough time to figure out the cause and resolve it, so I gave up on the CancellationToken approach and decided to take another route to solve our problem.

Our App Mode approach

In our scenario, we decided not to go with function slots, as they don’t add much value to our upgrade process, and we couldn’t rely on handling the CancellationToken either. Instead, we implemented our own logic, which goes like this:

  1. The function app works in 3 modes: Normal, PreparingForStop and SafeToStop. When the app starts, it’s in Normal mode and does its jobs as usual.
  2. Before starting the upgrade, we call a function on the function app to signal that we’re about to upgrade.
  3. Once the app receives that signal, it changes its mode to PreparingForStop, tries to finish up all the current jobs quickly and won’t start new ones. If we need to stash any data to be processed later, so that we can quit the current function quickly, we do that at this stage.
  4. There is another function in the app whereby we poll the app mode. We call this function until it returns “SafeToStop”.
  5. As soon as the app turns to “SafeToStop” mode, we upgrade it.
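The steps above can be sketched as follows (all names are hypothetical, and the mode is kept in a static field only for brevity; a real multi-instance app would persist it in shared storage such as a blob or table):

```csharp
using System.Net;
using System.Net.Http;
using System.Threading;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Azure.WebJobs.Host;

public enum AppMode { Normal, PreparingForStop, SafeToStop }

public static class UpgradeCoordinator
{
    private static int _mode = (int)AppMode.Normal;
    private static int _runningJobs;

    public static AppMode Mode => (AppMode)_mode;

    // Step 2: called by the deployment pipeline to signal an upcoming upgrade.
    [FunctionName("PrepareForStop")]
    public static HttpResponseMessage PrepareForStop(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestMessage req)
    {
        Interlocked.Exchange(ref _mode, (int)AppMode.PreparingForStop);
        TryMarkSafe();
        return req.CreateResponse(HttpStatusCode.OK);
    }

    // Step 4: polled by the pipeline until it returns "SafeToStop".
    [FunctionName("GetMode")]
    public static HttpResponseMessage GetMode(
        [HttpTrigger(AuthorizationLevel.Function, "get")] HttpRequestMessage req)
    {
        return req.CreateResponse(HttpStatusCode.OK, Mode.ToString());
    }

    // Step 3: job functions call TryStartJob/EndJob around their work,
    // so new jobs are refused once we're preparing to stop.
    public static bool TryStartJob()
    {
        if (Mode != AppMode.Normal) return false;
        Interlocked.Increment(ref _runningJobs);
        return true;
    }

    public static void EndJob()
    {
        Interlocked.Decrement(ref _runningJobs);
        TryMarkSafe();
    }

    private static void TryMarkSafe()
    {
        if (Mode == AppMode.PreparingForStop && Volatile.Read(ref _runningJobs) == 0)
            Interlocked.Exchange(ref _mode, (int)AppMode.SafeToStop);
    }
}
```

The point of the design is that the deployment pipeline never has to guess when it’s safe: the app itself tracks in-flight work and only reports SafeToStop once that work has drained.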

This might not be the best solution, but at least it keeps everything under control and has worked in practice.
That was how we handled our safe function app upgrades. Let me know if you have any comments or suggestions!
