Maintaining Scalable Cloud Systems in Times of Unanticipated Peaks

How Microsoft Azure’s scalability and elasticity allowed Fabrik to respond quickly to a rapid rise in community engagement during COVID-19.

When immedia first set out to build our Fabrik platform, a suite of ‘born-in-the-cloud’ audience engagement tools and workflows, our development team chose a cloud-first approach built on Microsoft’s Azure platform.

Some of the benefits we considered in selecting Azure include:

cost savings through economies of scale and a tenant-based system with shared resources;
the inherent scalability, redundancy and reliability of Microsoft Azure, which enable our applications to adapt automatically to increases in resource demand;
the ability to take advantage of Microsoft Azure’s monitoring and analytics so that we can diagnose issues quickly and report accurately; and
the ability to adopt an Agile approach when architecting and developing the platform, so that we can respond to our clients’ needs rather than over-engineer a system based on superficial, assumed requirements.

As the Fabrik codebase became more established and our service offering continued to diversify, our technical teams added more and more features to our mobile apps, APIs and web-based tools, all aimed at helping our clients serve their members with ease. As a result, the number of people using the Fabrik suite of applications has increased dramatically over time. So, in turn, have our infrastructure requirements, which include PaaS (Platform-as-a-Service) offerings for hosting our ASP.NET APIs and Azure SQL Databases.

With the onset of the COVID-19 pandemic in early 2020, our clients and their members began communicating with each other far more often, which increased their reliance on their Fabrik applications.

Our technical teams therefore anticipated a rapid increase in traffic to our platform, less predictable traffic patterns, and more strain on our APIs and databases.

The specific nature of this kind of demand is difficult to predict. We were confident that the Fabrik platform would perform, given Azure’s many elasticity and scalability services. Even so, in such unprecedented times, our team made the call to be on high alert for any bottlenecks in our systems that might require manual intervention.

The primary impact was on our backend database system, which was developed on Azure SQL Databases using the Elastic Pools deployment model. The elastic pools had served us well because a set amount of resources is allocated to the pool and shared among our databases. When demand increased drastically, however, that allocation had to be expanded to keep up.

In addition, because of the number of people engaging simultaneously, our API gateway, which is hosted on Azure App Services, needed to be scaled out to more instances as well.
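The arithmetic behind a scale-out decision can be sketched in a few lines. This is a minimal illustration only, not the rule our team or Azure applies: the function name, the 25% headroom and the instance limits are all assumptions chosen for the example.

```python
import math


def instances_needed(requests_per_second: float,
                     capacity_per_instance: float,
                     headroom: float = 0.25,
                     min_instances: int = 1,
                     max_instances: int = 10) -> int:
    """Estimate how many instances are needed to serve the given load.

    All parameters are hypothetical: capacity_per_instance is the number of
    requests per second one instance is assumed to handle comfortably, and
    headroom is the fraction of spare capacity to keep in reserve.
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    # Inflate the observed load so that each instance keeps some spare capacity.
    target_load = requests_per_second * (1 + headroom)
    needed = math.ceil(target_load / capacity_per_instance)
    # Never go below the minimum or above the configured maximum.
    return max(min_instances, min(max_instances, needed))


if __name__ == "__main__":
    # 100 requests/s with 25% headroom against 50 requests/s per instance.
    print(instances_needed(100, 50))
```

Keeping a floor and a ceiling on the instance count is a common design choice: the floor preserves availability during quiet periods, and the ceiling caps cost if traffic behaves unexpectedly.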

The following statistics indicate the response times of some of our API calls during peak days:

Fabrik API call response times during COVID-19 communication

Using Application Insights, we were able to identify specific API calls in our system that were taking more time than expected to process.
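To illustrate the kind of analysis involved, the sketch below flags endpoints whose 95th-percentile response time exceeds a threshold. It is a simplified stand-in for what a telemetry tool does, not Application Insights' own API: the function names, the endpoint paths and the sample durations are all hypothetical.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def slow_endpoints(requests, threshold_ms, pct=95):
    """Return {endpoint: percentile duration} for endpoints over the threshold.

    `requests` is an iterable of (endpoint, duration_ms) pairs, for example
    rows exported from a telemetry store.
    """
    durations = defaultdict(list)
    for endpoint, duration_ms in requests:
        durations[endpoint].append(duration_ms)

    flagged = {}
    for endpoint, values in durations.items():
        observed = percentile(values, pct)
        if observed > threshold_ms:
            flagged[endpoint] = observed
    return flagged


if __name__ == "__main__":
    # Hypothetical sample data, invented purely for illustration.
    sample = [
        ("/chat/messages", 120), ("/chat/messages", 2400), ("/chat/messages", 180),
        ("/members/profile", 90), ("/members/profile", 110),
    ]
    print(slow_endpoints(sample, threshold_ms=1000))
```

A high percentile is used rather than the average because a handful of very slow requests can hide behind a healthy-looking mean.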

During the same period, the usage on our database was as follows:

Fabrik database usage during COVID-19 communication

Each spike in database usage marks an occasion when our elastic pool reached its resource limit. Our infrastructure team mitigated this by increasing the capacity of the pool to meet the usage demand on the platform.
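The reasoning behind such a capacity increase can be sketched as follows. This is an assumption-laden illustration rather than our team's actual procedure: the capacity steps, the 95% saturation limit and the 70% target are hypothetical values, and the functions do not call any Azure API.

```python
def saturated_samples(utilisation_pct, limit=95.0):
    """Return the indexes of samples at or above the saturation limit."""
    return [i for i, value in enumerate(utilisation_pct) if value >= limit]


def next_capacity(current, peak_utilisation_pct, steps, target_pct=70.0):
    """Pick the smallest capacity step that brings the observed peak under target_pct.

    Note: when a pool is pinned at 100%, the real demand may be higher than
    what was measured, so the result is a lower bound on what is needed.
    """
    demand = current * peak_utilisation_pct / 100
    for step in sorted(steps):
        if step >= current and demand / step * 100 <= target_pct:
            return step
    # No step is large enough; fall back to the largest one available.
    return max(steps)


if __name__ == "__main__":
    # Hypothetical utilisation readings (percent) and capacity steps.
    readings = [40, 97, 100, 60]
    steps = [100, 200, 400, 800]
    print(saturated_samples(readings))
    print(next_capacity(100, max(readings), steps))
```

Targeting a utilisation well below 100% after the resize leaves room for the next spike, at the price of paying for capacity that sits idle in quieter periods.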

When our databases were strained, the response times of our APIs suffered, and people using the apps experienced slowness in some of the functionality.

After analysing this usage pattern, we were able to identify areas we could improve to better cater for this kind of unpredictable usage. We identified specific API endpoints that needed to be optimised and certain areas of the system that needed reworking to take advantage of the serverless capabilities of Microsoft Azure.

One of those changes is the use of Cosmos DB for serving ‘chat’-related data, which will reduce the overall load on our API. Cosmos DB allows us to separate reads and writes across multiple servers, which may be distributed across multiple Azure regions, as opposed to SQL Elastic Pools, where reads and writes happen on the same cluster. We are also actively investing time in decoupling our APIs into a more microservice-oriented model through the use of Azure Functions, which, together with Cosmos DB, will help distribute the load on our APIs at a global scale.

Scaling up the database tier took only minutes and resolved the issues experienced by people using the apps.

Throughout this entire scenario, we were able to take advantage of Microsoft Azure’s monitoring, alerting and log analytics capabilities, which gave us visibility into the health of the system and highlighted, at a glance, the areas that required attention or improvement.

Looking ahead, there will always be improvements we can make to our platform and applications to better serve our customers’ requirements and those of their members. The monitoring we have in place allows us to make more informed decisions that keep improving scalability, reliability and availability, and to better anticipate where we need to scale up and out in response to unpredictable circumstances.