In part 2 I outlined how Gloster v1 was designed, why it failed and what lessons were learnt. The experience led to refining the problem definition by focusing on two main sub-problems:
The first version failed because the model used in the analysis required a lot of manual effort (trial and error) and proved too brittle, since significant changes to the service codebase required manual recalibration of the weights and thresholds.
The first idea that led to the automated approach was to use historical data for the anomaly detection reasoning. Instead of expecting the service owner to provide the important metrics by using weights, and to define the acceptable deviation of each metric value by setting upper and/or lower thresholds, we decided to rely on historical trend data for sensing anomalies.
The concept is to monitor multiple, different service instances in production, over long periods of time (for taking into account seasonal behaviors) and capture acceptable deviations for all the metrics of a service. What we are trying to achieve is to establish a baseline on how our telemetry fluctuates for identical systems, i.e. identically sized instances running the same software release and handling approximately the same production traffic. We can then use this information when performing a canary analysis for (a) identifying the most relevant metrics and for (b) guiding the system comparison algorithm. More details on the approach are explained below.
The second improvement focused on the comparison algorithm. We removed the threshold checks on the absolute values of the metrics and decided to introduce a distance function for comparing time series. The concept here is that instead of comparing the metrics of the release candidate service instance to the metrics of the baseline instance by using maximum allowed percentage deviation values, we calculate instead a distance between the two time series and then check whether the result is aligned with the historical behavior recently observed in Production.
There are many distance function options for comparing time series, the simplest being a Euclidean approach. After some research we chose Dynamic Time Warping (check out this interesting article for more details on applying time warping to time series), which measures the similarity between two temporal sequences that may vary in speed.
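To make the idea concrete, here is a minimal sketch of a textbook Dynamic Time Warping distance in plain Python. It is not Gloster's actual implementation: the function name `dtw_distance` and the absolute-difference point cost are assumptions chosen for illustration.

```python
def dtw_distance(a, b):
    """Textbook O(len(a) * len(b)) Dynamic Time Warping distance.

    Assumption: the cost of matching two points is their absolute difference.
    """
    if not a or not b:
        raise ValueError("both series must be non-empty")
    inf = float("inf")
    # prev[j] is the cheapest cumulative cost of aligning the rows seen so far
    # with the first j points of b; prev[0] = 0 seeds the first alignment.
    prev = [0.0] + [inf] * len(b)
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            # Extend the cheapest of: repeat a point of b, repeat a point of a,
            # or advance both series by one step.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[-1]
```

Unlike a point-by-point Euclidean comparison, this distance stays at zero when one series is merely a stretched copy of the other, e.g. `dtw_distance([1, 2, 3], [1, 2, 2, 3])`.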
Based on these ideas, we introduced a background process that executes the ACA preparation phase, during which we capture historical information that will be used during the analysis. This process implements the following logic:
So, given a distance function D(m1, m2) that provides an estimate of the deviation (or similarity) between a time series m1 and a time series m2, the historical data capturing mechanism for a service with M metrics will generate M time series, each containing values of the following format:
[D(metric_i_ts(random_instance, t0), metric_i_ts(other_random_instance, t0)), t0], [D(metric_i_ts(random_instance, t1), metric_i_ts(other_random_instance, t1)), t1], ... [D(metric_i_ts(random_instance, tn), metric_i_ts(other_random_instance, tn)), tn]
where t0 is the oldest data gathering time and tn the most recent time, D() is the distance function and metric_i_ts(instance, tx) is a time series with the values of metric i, for the given 'instance' and for the time period (start: tx - N minutes, end: tx). As explained above, we pick random pairs where 'random_instance' != 'other_random_instance' and we try comparing all instances in the service cluster over time, since we do not want to be biased by always picking the same pair of instances.
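A rough sketch of how one such data point could be produced for a single metric is shown below. All names (`capture_distance_point`, `capture_history`, `window_by_instance`) are hypothetical, and the distance function is passed in as a parameter, so any implementation of D() can be used.

```python
import random


def capture_distance_point(window_by_instance, distance, t, rng=random):
    """Return [D(series_a, series_b), t] for one random pair of distinct instances.

    window_by_instance maps an instance id to the values of one metric over
    the window ending at time t (hypothetical input shape).
    """
    if len(window_by_instance) < 2:
        raise ValueError("need at least two instances to form a pair")
    # sample() never returns the same element twice, so the pair is distinct.
    first, second = rng.sample(sorted(window_by_instance), 2)
    return [distance(window_by_instance[first], window_by_instance[second]), t]


def capture_history(windows_over_time, distance, rng=random):
    """Build the historical distance series from (t, window_by_instance) pairs."""
    return [
        capture_distance_point(windows, distance, t, rng)
        for t, windows in windows_over_time
    ]
```

Drawing a fresh random pair at every capture time is what spreads the comparisons over the whole cluster instead of repeatedly measuring the same two instances.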
Our services typically generate 500 to 1,200 metrics. We need to minimize the dimensionality during the canary analysis for the following reasons:
As a first measure against using irrelevant metrics, we added static filters to Gloster for removing some truly irrelevant metrics, like static metrics related to configuration parameters or metrics that are highly correlated to other metrics that we want to keep in the mix. This mechanism is simple and efficiently weeds out some metrics that are common among all microservices and clearly will not contribute to the analysis. But this covers less than 1% of the filtering we need to do.
The lesson we learnt from the initial weight-based approach was that as each microservice or its environment evolves, weights need recalibration, and some metrics become irrelevant while others start having an impact when trying to detect anomalies. Based on this observation, it was clear that we needed an algorithm that dynamically adapts to the evolution of our services. And this is where the historical information comes into play. The hypothesis is that if we consider the historical distances of the metric time series between the identical service instances running in Production, the metrics that are relevant to normal behavior will be the ones that exhibit the smallest deviation / highest similarity (smallest distance values). A way to model that is to calculate the mean value of each metric's historical distance time series (e.g. for the last week or month) and select as relevant the metrics with the smallest mean value, i.e. the metrics that are, on average, pretty stable across all instances. We can further enhance the selection by also taking into account the distance variance (standard deviation), i.e. we care about the metrics whose deviation across instances does not spike much.
Clustering can help implement this selection approach, and after some experimentation we decided to use Hierarchical Clustering. The algorithm requires the number of clusters as an input parameter, and the selection of this parameter cannot be automated in a straightforward way. Very low values will generate very large clusters, while very high values will result in very small clusters. Further experimentation revealed that for our microservices a value between 5 and 10 is ideal and the difference in the result is not significant as long as we remain within these boundaries. The diagrams below show two examples for 5 and 10 clusters. Note that we pick the cluster closest to (0, 0) as the most relevant one for our analysis.
Hierarchical clustering result with input=5 (number of target clusters)

Hierarchical clustering result with input=10 (number of target clusters)

As the diagrams above show, the vast majority of the metrics do not exhibit high variance, and the ones that do already have high mean values, so the clustering can be performed on a single dimension (mean, not variance) to get practically the same results.
We validated that the results of this selection approach are actually the most relevant variables both by reasoning about the physical representation of the metrics and by running ACA experiments with the first cluster (nearest to (0, 0)) and comparing the score results to those of the full set of metrics or of more than one cluster (always the ones closest to (0, 0)). The increase in accuracy was a real game changer: picking the first cluster or the first 2-3 clusters removed all the noise of the irrelevant metrics, and adding more than one cluster in most cases decreased accuracy.
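The single-dimension variant can be sketched in a few lines of plain Python. This is an illustration rather than Gloster's code: the post does not say which linkage is used, so single linkage is assumed here, and the names `cluster_by_mean` and `most_relevant_metrics` are made up for the example.

```python
from statistics import mean


def cluster_by_mean(mean_distance, n_clusters):
    """Agglomerative clustering of metrics on one dimension (mean distance).

    Assumption: single linkage, i.e. the two clusters separated by the
    smallest gap are merged until only n_clusters remain.
    """
    ordered = sorted(mean_distance, key=mean_distance.get)
    clusters = [[name] for name in ordered]
    while len(clusters) > max(1, n_clusters):
        # On a line, the closest pair of clusters is always a neighbouring pair.
        gaps = [
            mean_distance[clusters[i + 1][0]] - mean_distance[clusters[i][-1]]
            for i in range(len(clusters) - 1)
        ]
        i = gaps.index(min(gaps))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


def most_relevant_metrics(history, n_clusters=5):
    """Return the cluster with the smallest means, i.e. the one nearest to 0.

    history maps a metric name to its historical distance values.
    """
    means = {name: mean(values) for name, values in history.items()}
    return cluster_by_mean(means, n_clusters)[0]
```

Because distances are non-negative, the cluster nearest to zero is simply the first one once the metrics are ordered by mean.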
Based on the historical data and the distance function we refactored Gloster to execute each ACA process according to the following workflow:
During analysis, the score for each metric i is calculated as:
score = 1 (success), if D(metric_i_ts(RC), metric_i_ts(BL)) < historical_Distance_mean + historical_Distance_variance
score = 0.9 (borderline success), ... 1, if historical_Distance_mean + historical_Distance_variance <= D(metric_i_ts(RC), metric_i_ts(BL)) <= historical_Distance_Max
score = 0 (fail), if D(metric_i_ts(RC), metric_i_ts(BL)) > historical_Distance_Max
So, the score for each metric will denote an absolute failure if the distance between RC/BL is larger than the maximum historical value. It will denote absolute success if the distance is lower than the (average + variance) of the historical values. And it will denote a partial success if the distance is higher than or equal to (average + variance) but lower than or equal to the max historical value.
We use the same distance function (time warp) for both the historical process and the analysis, and we also align the two periods so that we compare distance values calculated with the same parameters. For example, we gather historical data every 5 minutes and we execute the analysis every 5 minutes.
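The three rules translate almost directly into code. In the sketch below the function and parameter names are hypothetical, and the borderline band returns a fixed 0.9; the rule above leaves open whether values between 0.9 and 1 are interpolated, so treat that as an assumption.

```python
def metric_score(distance, hist_mean, hist_spread, hist_max, borderline=0.9):
    """Score one metric from the RC/BL distance and its historical statistics.

    hist_spread stands for the variance / standard deviation term of the
    historical distances. Assumption: the borderline band scores a flat 0.9.
    """
    if distance < hist_mean + hist_spread:
        return 1.0          # success: within the usual historical deviation
    if distance <= hist_max:
        return borderline   # borderline success: unusual, but seen before
    return 0.0              # fail: larger than anything observed historically
```

For example, with a historical mean of 0.2, spread of 0.05 and maximum of 0.6, a distance of 0.1 scores 1.0, a distance of 0.3 scores 0.9 and a distance of 0.7 scores 0.0.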
Gloster calculates the score for all the relevant metrics and then it will generate the aggregate score by calculating the average for all metrics. Since this happens on every analysis cycle, Gloster generates and stores in its database a new data point for the aggregate score time series every N minutes.
The distance function results depend on the values of the time series. Typically the historical distances are calculated on much higher values, since we compare production instances that receive higher traffic percentages than the canary stack does. Since we compare the historical distances (production nodes) to the distances between RC/BL metrics (typically 1% to 25% of the production traffic), we frequently ended up comparing numbers of different orders of magnitude. That is why we need to normalize our time series and bring them to a common, standardized range (e.g. 0..1) before we calculate the distances; the approach is explained in a concise way in this article: "How to Normalize and Standardize Time Series Data in Python". For example, consider an http.rate metric which is affected by the traffic received:
(a) the production instances during the historical data capturing phase may produce values like these (over a 5 minute period): http.rate(instance1) = [5, 10, 5, 10, 5], http.rate(instance2) = [10, 5, 10, 5, 10]; we normalize them to [0.5, 1, 0.5, 1, 0.5] and [1, 0.5, 1, 0.5, 1], and then we calculate the distance.
(b) because of the 1% traffic, for the canary stack instances, we get a time series like this: [0.05, 0.1, 0.05, 0.1, 0.05] and after normalizing we get values in the same range, e.g. [0.5, 1, 0.5, 1, 0.5]. As the traffic percentage increases, the rate time series will potentially get higher values, but as we normalize we are getting back to the standardized values so the calculated distances end up in the same range.
This normalization allows us to always be able to compare canary data (regardless of the percentage of traffic) to the historical data.
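A minimal sketch of this step follows. It scales each series by its peak value, which is what reproduces the numbers in the http.rate example (a min-max rescaling would map [5, 10, 5, 10, 5] to [0, 1, 0, 1, 0] instead); whether Gloster uses exactly this scaling is an assumption, and the function name is illustrative.

```python
def normalize(series):
    """Scale a series into the 0..1 range by dividing by its peak magnitude."""
    if not series:
        return []
    peak = max(abs(v) for v in series)
    if peak == 0:
        # An all-zero series carries no shape information; keep it at zero.
        return [0.0 for _ in series]
    return [v / peak for v in series]


production = normalize([5, 10, 5, 10, 5])        # full production traffic
canary = normalize([0.05, 0.1, 0.05, 0.1, 0.05])  # 1% of the traffic
```

Both `production` and `canary` end up with the same shape in the same range, so distances computed on them are directly comparable.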
Typically, the telemetry of a microservice (and potentially any system) contains metrics that are mostly without values (or zeroes). These are metrics that model rarely exhibited behaviors, for example the rate of a specific error state, and in the context of the canary analysis they should be taken into account. The interesting fact about the approach described above is that the clustering algorithm Gloster uses will place all these metrics in the first cluster, so they will be automatically included in the most relevant metrics set. The problem, however, is that these metrics can make up a significant percentage of the metrics in the set (we saw cases where they summed up to 60% of the set), so when included in the scoring algorithm they smooth out anomalies. Since we cannot ignore them, because they are a signal for anomalies, Gloster will track them and keep them in the relevant metrics set, but it will not include their individual scores in the calculation of the Aggregate Score. This optimization increases the overall accuracy and sensitivity.
The approach explained above automates both the relevant metrics selection and the abnormal deviation detection by using a historical baseline. It is based on statistical analysis and it is dynamic. We noticed, however, that we have a small number of "golden metrics" that are always relevant and we want them included in the relevant metrics set. In some cases, for a handful of these golden metrics (e.g. Availability and Latency SLOs) there is a very clear definition of what a failure is by just observing their absolute values and comparing them to static thresholds. We extended the ACA anomaly detection algorithm to include a parallel deterministic branch and allowed the injection of specific golden metrics in the relevant metrics set. The approach is outlined in the diagram below.
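The effect on the aggregate can be illustrated with a short sketch. The names and the 95% cut-off used to decide that a metric is "mostly empty" are assumptions made for the example, not values taken from Gloster.

```python
def is_mostly_empty(values, ratio=0.95):
    """True when at least `ratio` of the samples are missing (None) or zero.

    The 0.95 default is a hypothetical cut-off chosen for illustration.
    """
    if not values:
        return True
    blanks = sum(1 for v in values if v is None or v == 0)
    return blanks / len(values) >= ratio


def aggregate_score(scores, samples):
    """Average the per-metric scores, leaving out mostly-empty metrics.

    scores maps metric name -> individual score; samples maps metric name ->
    raw values seen during the analysis window. Excluded metrics keep their
    individual score in `scores`; they just do not dilute the aggregate.
    """
    counted = [
        score for name, score in scores.items()
        if not is_mostly_empty(samples.get(name, []))
    ]
    return sum(counted) / len(counted) if counted else None
```

With two active metrics scoring 1.0 and 0.0 and two silent error counters scoring 1.0, a plain average would report 0.75, while this aggregate reports 0.5 and keeps the failure visible.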

Gloster injects some SLO-related golden metrics by default, and it can automatically import the metrics and thresholds that our monitoring and alerting infrastructure has been configured to track for any given service. When threshold-based tests are injected in the ACA process, they act as "flow breakers": if one of them fails, it fails the analysis and sets the Aggregate Score to 0, bypassing the statistical analysis flow (although individual metric scores remain available for review by the service owner).
Although Gloster v2 automates almost everything, it still allows some optional customization. As explained in the previous paragraph, the service owner can force the inclusion and/or exclusion of metrics in the output of the automated metric selection, allowing for more detailed fine tuning of the automation. The user may also define the parameters of the clustering algorithm and may define thresholds for specific "golden" metrics (upper and/or lower bounds). A typical approach is to ask Gloster to import the alerting configuration and then remove/modify the alert thresholds, which are transformed into pass/fail tests in the context of the canary analysis.
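The flow-breaker behavior could look roughly like the sketch below. The function name, the `(lower, upper)` threshold shape and the choice to skip golden metrics that have no observed value are all assumptions for illustration.

```python
def final_aggregate_score(statistical_score, golden_values, thresholds):
    """Apply threshold-based "flow breakers" on top of the statistical score.

    thresholds maps a golden metric name to a (lower, upper) pair, where
    either bound may be None. Assumption: a golden metric with no observed
    value is skipped rather than treated as a failure.
    """
    for name, (lower, upper) in thresholds.items():
        value = golden_values.get(name)
        if value is None:
            continue
        below = lower is not None and value < lower
        above = upper is not None and value > upper
        if below or above:
            return 0.0  # a violated threshold overrides the statistical result
    return statistical_score
```

For instance, with `{"availability": (0.999, None), "latency_p99_ms": (None, 250)}` as thresholds, a p99 latency of 400 ms forces the result to 0 regardless of how good the statistical score was.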
There are other customization options that Gloster supports, including fine tuning of the automated algorithm (type of clustering, whether one or two dimensions are used in the clustering, advanced configuration of the scoring algorithm, etc.) and the type of telemetry server used (Atlas, Prometheus).
Adding a confidence value to the process is important for automation, since the CICD Orchestrator needs some indication that the analysis has seen enough data to be confident that it can use the score to decide about the final outcome of the ACA. We experimented with various approaches, but eventually we achieved very accurate results by observing the Aggregate Score time series and testing whether the plot has stabilized; this is a simple t-test on the two halves of the time series with a 0.05 threshold, and we have empirically verified that this is a good model of the steady state status of the ACA process. For longer period experiments (spanning more than an hour) we added a further optimization by using a sliding window, i.e. feeding only the last K data points of the score time series to the t-test, thus ignoring the older data points that may still exhibit some fluctuation.
By the definition of the scoring function, the score success threshold is 0.9 (or 90%). This was also an improvement towards automation of the process, since we did not want to require the user to set a success threshold per service. However, by observing the data from countless experiments, we noticed that the resulting score average, max and min values vary significantly depending on the service monitored. In an effort to improve accuracy and make Gloster more adaptive, we introduced the following optimization:
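A self-contained sketch of such a stability check is shown below, using only the standard library. Several details are assumptions: the post does not say which t-test variant is used (Welch's unequal-variance test is used here), the series is treated as stable when the p-value is above 0.05, and all function names are illustrative.

```python
import math
from statistics import mean, variance


def two_sided_p_value(t_stat, df, steps=2000):
    """Two-sided p-value of Student's t distribution (valid for df >= 1).

    The substitution x = sqrt(df) * tan(theta) turns the probability mass
    within |t_stat| into a bounded integral of cos(theta) ** (df - 1), which
    is evaluated with Simpson's rule (steps must be even).
    """
    theta_max = math.atan(abs(t_stat) / math.sqrt(df))
    if theta_max == 0.0:
        return 1.0
    h = theta_max / steps
    total = 1.0 + math.cos(theta_max) ** (df - 1)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * math.cos(k * h) ** (df - 1)
    log_c = (math.log(2) + math.lgamma((df + 1) / 2)
             - math.lgamma(df / 2) - 0.5 * math.log(math.pi))
    return max(0.0, 1.0 - math.exp(log_c) * total * h / 3)


def score_has_stabilized(scores, alpha=0.05, window=None):
    """Welch t-test on the two halves of the (optionally windowed) score series."""
    data = list(scores[-window:]) if window else list(scores)
    if len(data) < 4:
        return False  # each half needs at least two points for a variance
    half = len(data) // 2
    first, second = data[:half], data[half:]
    var_first = variance(first) / len(first)
    var_second = variance(second) / len(second)
    if var_first + var_second == 0:
        return mean(first) == mean(second)  # two perfectly flat halves
    t_stat = (mean(first) - mean(second)) / math.sqrt(var_first + var_second)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (var_first + var_second) ** 2 / (
        var_first ** 2 / (len(first) - 1) + var_second ** 2 / (len(second) - 1)
    )
    return two_sided_p_value(t_stat, df) > alpha
```

Passing `window=K` corresponds to the sliding-window optimization: only the last K data points are tested, so early fluctuation in a long experiment no longer counts against the result.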
Gloster consists of four microservices as outlined in the high level diagram below:

Gloster Maestro is responsible for the overall orchestration of all workflows. It listens to Deployment Workflow events (e.g. new canary stack created with ID, canary stack with ID deprovisioned, traffic increase to canary stack with ID, etc.) and decides on starting or deleting an ACA process. It also accepts requests from the end user (via the Gloster UI) for starting non-automated analysis experiments (comparison of ad hoc instances). Maestro fully controls and coordinates the scheduling of recurring jobs, i.e. historical data capturing for tracked services, and ACA or manual analysis processes. It will attempt to spread scheduling over the available time window in order to improve resource utilization and avoid bottlenecks. It also handles the analysis configuration phase, being responsible for executing the clustering step and the score success threshold recalibration; it then triggers the analysis by passing the fully processed and finalized configuration to the Gloster Analyzer. It offers an API to the Gloster UI and a query API to the CICD orchestrator for checking on the progress of the analysis processes.
Gloster Configuration is responsible for handling all the configuration parameters of a given service. It supports versioning and activation/deactivation of different profiles, handles the configuration datastore, performs validation and offers an API to the Gloster UI. Gloster Configuration should be used at least once for adding the specific service to the list of services tracked by Gloster.
Gloster UI is the front end interface to the user. It is mainly used for managing the configuration parameters (access to Gloster Configuration), browsing analysis results details or historical data insights (access to Gloster DB via the Gloster Analyzer API), and stopping ACA/manual analysis jobs or starting manual analysis jobs (access to Gloster Maestro). It sends to the browser a single page application that offers the GUI for all interactions with the user.
Gloster Analyzer is the tireless worker that periodically executes all historical capturing jobs and ACA/manual analysis processes. It interfaces with the telemetry infrastructure for retrieving the metrics required for each process, and stores all results in the Gloster database. It offers an API to the Gloster UI for accessing stored results. The Gloster Analyzer typically scales horizontally to a higher number of instances than any of the other services, for obvious reasons (almost linearly with the number of services tracked and analyzed).
The following screenshots show examples of the analysis insights UI.


The ACA process was designed as part of our automated CICD pipeline, even though Gloster can be used independently for running ad hoc anomaly detection analysis jobs by monitoring. The process is initiated by the CICD pipeline (we rely on Jenkins/Blue Ocean for orchestrating the CICD flow), which triggers the Canary Stack deployment. The pipeline will monitor the Canary Stack readiness and, if no failure or timeout occurs, it will increase the percentage of production traffic flowing to the Canary from 0% to the first step (usually 1%). Gloster listens to deployment workflow events and it will automatically start the analysis (ACA) when it sees a new Canary Stack receiving traffic. The pipeline will poll Gloster after a warm up period and it will wait for a minimum number of data points with a high Confidence value before deciding on the Aggregate Score returned by Gloster. These are the possible outcomes:
Gloster will consume an event about a Canary Stack not receiving traffic anymore and it will stop the analysis. Also, as mentioned previously, Gloster interfaces with our telemetry infrastructure for retrieving the required metrics.
The anomaly detection algorithm has proven quite powerful, so we added one more feature that extends the usage of Gloster beyond canary analysis. Gloster can easily be configured to consume blue-green-completion events and it will automatically start an analysis job comparing the new release (just deployed) to the time shifted telemetry of the previous version. Most of the existing ACA configuration is re-used and the user only has to enable the capability and potentially change the defaults about the time-shift value (by default -2 hours) and the duration (by default 2 hours). Gloster will send Slack notifications to the team if the aggregate score is below a threshold. The user can override the default threshold for fine tuning sensitivity for the time shifted analysis and the confidence calculated in ACA is ignored in this type of anomaly detection. The caveat in this case is that the two versions compared do not necessarily handle the same amount of traffic, so this score and the Slack alert are indications without high confidence that an anomaly occurs. However, we have noticed that the score will drop significantly if an actual anomaly is caused by the new release, but some false alarms may be generated unless a lower threshold is configured for this analysis.
A few months after completing Gloster v2 I came across Kayenta, a very promising open source ACA tool that, as an added benefit, operates as a Spinnaker plugin. Browsing through the code and the related blog posts I saw some interesting ideas and common approaches. The first instinct, and an opportunity when great open source code is released, was to compare the two systems and potentially scrap and replace Gloster, or replace the reasoning logic in Gloster with something better. Since we do not currently use Spinnaker and there was a huge effort involved in introducing Spinnaker just for testing Kayenta, I decided to try to plug the Kayenta classifier into Gloster as a separate reasoning engine (Gloster was designed with this in mind). Bear in mind that this is a gross simplification, since Kayenta and Gloster clean up and prepare data in different ways, so feeding the data prepared by Gloster for Gloster will not necessarily yield the optimal results with the Kayenta classifier. In any case, our experiments showed pretty similar results, with Kayenta being a little more sensitive in some cases, resulting in false alarms (lower scores). Again, I am making no claim about either of the two approaches being better, but what is interesting is that plugging a different engine into Gloster (with the risk of being a little "out of context") validated the results we are getting with Gloster's native approach.
The following screenshots show two representative examples.


Gloster introduced an extra, automated layer of release confidence into our CICD pipeline. Version 2 was really successful in making it straightforward for service owners to adopt canary analysis. We continue evolving Gloster for achieving better accuracy, extending support for other types of anomaly signals (e.g. better logging support) and improving performance and scalability (e.g. datastore sharding). Creating and open sourcing a Gloster Spinnaker plugin is also something we are looking into.