Building an Automated Canary Analysis tool - part 1

April 1, 2019

I was always amazed by the concept of the miner's canary. It always felt a little cruel to use a live animal as an early warning system, but this method saved countless human lives from the early 1900s until 30 years ago. What really amazed me when I first heard about it was the idea of first exposing a "tiny version" of the real, complex system to a potentially dangerous environment, risking a low impact, before exposing the full-blown system and risking a huge impact. The need for such an approach is driven by the lack of an adequate model that can make accurate risk predictions based on telemetry: we do not know how to assess risk based on our perception of the environment, so we actually test whether a smaller version of our system survives the exposure, and only if the test is successful do we commit to a full deployment.

The term eventually leaked into the world of software systems as the "canary deployment pattern", also known as the canary release. In the same way miners would rely on a caged canary that would die before humans were seriously affected by toxic gases, the canary deployment pattern reduces the risk of introducing a new software version to the production environment by progressively rolling out the new release, starting with a very small number of users before making it available to everybody. If the "canary dies", the release is rolled back and the negative impact of the new, problematic release is minimal; otherwise, confidence in a smooth release is high and we can proceed with full (progressive) deployment.

This incremental rollout of releases was in use long before the technical literature adopted the "canary" term. A key ingredient for the success of this deployment pattern is the selection of the metrics to capture during the deployment, together with an appropriate way of assessing these metrics in order to understand whether the new release is affecting the production environment in a negative way, i.e. introducing an anomaly that can disrupt the expected operational characteristics (e.g. SLOs for availability and performance, critical business metrics, etc.). This Canary Analysis is a simpler task for monolithic systems, but it can be really complicated to model and execute in distributed software systems such as a modern microservices-based architecture. The analysis of a simple system can be performed manually, but manual execution is not realistic on even the simplest distributed architectures. Handling progressive deployment (either Blue-Green/Red-Black or Canary) is already automated by modern tools (e.g. Spinnaker), but automation of the Canary Analysis process itself has not yet reached the same level of maturity.

A couple of years ago, I started working on exactly this problem: how can we implement an automated process that, with minimal or no human intervention, effectively reasons about the impact of the canary release as its exposure progressively increases, so that we can release with confidence, or roll back, new versions of our microservices in a dynamic, continuous delivery/deployment environment? At the time, no open source implementation of an Automated Canary Analysis tool was available, and the publicly available literature mentioning successful implementations did not reveal the specific approaches used. Let me define the problem in more detail:

How can we select the metrics that matter, i.e. those that reflect the impact of the component on its environment? A modern microservice may generate hundreds of metrics. Surely not all of them are relevant or, at the very least, they are not equally important when trying to detect a potential anomaly caused by a new release.
How should we model the Canary Analysis process so that we can ask a machine to execute it and reason about the success of each progressive step during a Canary deployment? Automation is always beneficial, but in a highly distributed, large-scale microservices environment, automation is the only practical approach for realizing full-scale canary analysis for continuous delivery/deployment.
How can we compare the behavior of a canary release that receives a small percentage of production traffic with that of the steady-state previous version in production, which handles full traffic (many orders of magnitude higher)?
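
To make the last question more concrete, here is a minimal sketch of one common way to put two stacks with very different traffic volumes on the same scale: convert raw counters into per-request rates before comparing them. This is only an illustration and not how Gloster works; the names `WindowCounts`, `error_rate` and `relative_error_rate` and the sample numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WindowCounts:
    """Raw counters observed for one stack over the same time window (hypothetical type)."""
    requests: int
    errors: int


def error_rate(counts: WindowCounts) -> Optional[float]:
    """Errors per request, or None when the stack received no traffic."""
    if counts.requests == 0:
        return None
    return counts.errors / counts.requests


def relative_error_rate(canary: WindowCounts, baseline: WindowCounts) -> Optional[float]:
    """Ratio of canary error rate to baseline error rate.

    Returns None when the ratio is undefined (no traffic on either side,
    or a baseline with zero errors).
    """
    canary_rate = error_rate(canary)
    baseline_rate = error_rate(baseline)
    if canary_rate is None or baseline_rate is None or baseline_rate == 0:
        return None
    return canary_rate / baseline_rate


# Made-up numbers: the canary sees 1,000 requests, the baseline 1,000,000.
canary = WindowCounts(requests=1_000, errors=12)
baseline = WindowCounts(requests=1_000_000, errors=5_000)
print(relative_error_rate(canary, baseline))  # roughly 2.4x the baseline error rate
```

Normalizing removes the difference in volume, but not the difference in sample size: with only a thousand requests the canary's rate is much noisier than the baseline's, which is part of what makes this question hard.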

This is how "Gloster" was born at Upwork, built by the Platform Team after multiple experiments and long brainstorming sessions. Gloster is an Automated Canary Analysis tool that uses telemetry (e.g. Prometheus or Atlas metrics) to determine which metrics matter to each microservice at a given point in time (using recent historical data) and whether the current status of the canary stack (release candidate versus the baseline version) is exhibiting an anomaly or not. Gloster was built as a separate moving part that we integrated into our CI/CD pipeline. It monitors new canary stacks and informs the CI/CD orchestrator whether each progressive stage is looking good (and we should therefore proceed to the next stage or complete the Canary Analysis step) or whether the new release needs to be rolled back and abandoned.
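
The contract between such an analysis tool and the orchestrator can be pictured as a three-way verdict per stage. The sketch below is purely illustrative and assumes 1-based stage numbering; `Verdict` and `stage_verdict` are hypothetical names, not Gloster's actual interface.

```python
from enum import Enum


class Verdict(Enum):
    """Possible answers sent back to the CI/CD orchestrator (hypothetical)."""
    PROCEED = "proceed"    # stage looks healthy, move on to the next stage
    COMPLETE = "complete"  # last stage looks healthy, finish the analysis step
    ROLLBACK = "rollback"  # anomaly detected, abandon the release


def stage_verdict(stage: int, total_stages: int, anomaly_detected: bool) -> Verdict:
    """Map the outcome of one canary stage to an orchestrator action.

    `stage` is 1-based; `anomaly_detected` stands in for whatever the
    analysis concluded about the canary versus the baseline.
    """
    if not 1 <= stage <= total_stages:
        raise ValueError("stage must be between 1 and total_stages")
    if anomaly_detected:
        return Verdict.ROLLBACK
    if stage == total_stages:
        return Verdict.COMPLETE
    return Verdict.PROCEED


print(stage_verdict(stage=1, total_stages=3, anomaly_detected=False))
print(stage_verdict(stage=3, total_stages=3, anomaly_detected=False))
print(stage_verdict(stage=2, total_stages=3, anomaly_detected=True))
```

Keeping the verdict this small is what allows the analysis to live outside the pipeline: the orchestrator only needs to know what to do next, not how the conclusion was reached.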

I am starting a series of posts on this effort and its results, aiming to share an interesting journey into the world of automated canary analysis. No live birds were used, but there was enough drama involved, and the experience acquired was very interesting (especially the failures and lessons learned). My next post will be about the first Gloster version, a very cumbersome implementation that was thrown back at me by the frustrated developers that used it :)