Chaos Unleashed: A Retrospective on Our First Year of Chaos Engineering

November 7, 2023

The Upwork Engineering organization has been working with distributed engineering teams for years to create products and services that connect millions of businesses with independent freelancing talent around the globe.

Over the past year, chaos engineering has been injected throughout Upwork’s software development lifecycle. The idea of intentionally introducing chaos into our systems was initially daunting, but the results have been eye-opening and invaluable. This retrospective is a chance to look back on the challenges, lessons learned, and the impact of chaos engineering on Upwork’s entire engineering organization.

The chaos engineering project has involved engineers across multiple teams, led by site reliability engineers (SREs) in collaboration with service owners. Notably, the majority of Upwork’s engineering team are freelancers working remotely around the world.

Why Upwork adopted chaos engineering

The Upwork ecosystem is complex. An unfortunate property of complex systems is that nobody knows how all the moving pieces behave and how they may fail under specific circumstances. Upwork needed a discipline that could holistically support its complex environment. Instead of reacting to incidents, the team could mitigate problems proactively by identifying, raising, and resolving issues before they become serious incidents.

Chaos engineering, the process of testing a distributed computing environment to ensure it can withstand unexpected disruptions, is not a new concept to most software organizations. Most of us have been using some form of chaos engineering, in some way or another, without even knowing it. A couple of examples come to mind:

Shutting down a container to make sure that traffic is appropriately redirected
Changing the firewall configuration to see how long it takes for a service to recover after a networking event

Formally introducing chaos engineering involves much more than that. It is a discipline that requires a cultural change: embracing the chaotic and unexpected behavior of complex systems. Upwork Engineering became familiar with the practice by embracing the Principles of Chaos Engineering, digesting lots of articles, watching seminars, and reading books on the topic. Many online resources provide case studies of how chaos engineering is implemented at different companies. All of this research helped Upwork determine how to use chaos engineering practices in its own ecosystem.

The importance of executive buy-in

Any large-scale endeavor (like chaos engineering) requires support from executive leadership. At Upwork, there was leadership buy-in from the beginning. It can be difficult to convince anyone that “breaking things” is the best way to achieve resiliency; it feels unnatural and counterintuitive. Since a key component of SRE is fixing and patching things together, “breaking things on purpose” could feel like going against your every instinct as an engineer.

A recommended approach is to start with a small, well-defined proof of concept on a service that tends to be problematic, begin in staging, and minimize any possible impact. By starting small, there should be enough evidence to showcase to executives that chaos engineering is feasible and achievable. Adopting this approach should translate into improvements in product resilience.

Simulating problems

The quality of results depends on confidence in the tooling and on the creativity used to simulate real-world scenarios and run experiments against services. Some incidents are unique, and no instrument can simulate all possible scenarios. There's no silver bullet, but there are some recommended approaches to address that concern:

Using Gremlin

Gremlin’s chaos engineering tools can safely and securely inject failure into systems to find weaknesses before they cause customer-facing issues. Gremlin uses a set of attacks that simulate problems related to resources, networks, and processes. This approach is useful for experimenting with specific failure patterns across infrastructure. An abort button is also available to halt attacks in progress.

Working with AWS

Events such as a Relational Database Service (RDS) failover or the loss of a cache node can be simulated using Amazon Web Services (AWS) tooling and capabilities. Changing a Security Group, forcing an RDS failover, or shutting down a Memcached node are actions that help experiment on the ecosystem and reveal gaps in system availability.
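As an illustration of forcing an RDS failover, here is a minimal Python sketch. It assumes a Multi-AZ RDS instance and a client that behaves like boto3's RDS client, whose reboot_db_instance call accepts DBInstanceIdentifier and ForceFailover. The helper name, the stub class, and the instance identifier are hypothetical, and this is not necessarily how Upwork runs the injection.

```python
from typing import Any, Dict, List


def force_rds_failover(rds_client: Any, db_instance_id: str) -> Dict[str, Any]:
    """Reboot a Multi-AZ RDS instance and fail over to its standby.

    rds_client is expected to behave like boto3.client("rds"). It is passed
    in so the call can be rehearsed with a stub before touching real
    infrastructure.
    """
    return rds_client.reboot_db_instance(
        DBInstanceIdentifier=db_instance_id,
        ForceFailover=True,  # only meaningful for Multi-AZ deployments
    )


class RecordingRdsStub:
    """Hypothetical stand-in that records calls instead of contacting AWS."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def reboot_db_instance(self, **kwargs: Any) -> Dict[str, Any]:
        self.calls.append(kwargs)
        return {
            "DBInstance": {
                "DBInstanceIdentifier": kwargs["DBInstanceIdentifier"],
                "DBInstanceStatus": "rebooting",
            }
        }


stub = RecordingRdsStub()
# "staging-orders-db" is a made-up identifier used only for this rehearsal.
response = force_rds_failover(stub, "staging-orders-db")
print(response["DBInstance"]["DBInstanceStatus"])
```

In a real staging run, the stub would be swapped for boto3.client("rds") once the service owner has signed off on the plan.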

Selecting the best QA tooling

Benchmarking tools, automated testing, and stress testing can help simulate problems, generate traffic (to see how users and services react to injected issues), and check the health of user flows while experiments are taking place.

Leveraging resources like these can produce amazing results in any chaos engineering exercise. Ultimately, however, it is the creativity of the people involved that will determine the level of success.

Define key indicators and goals

Defining clear and achievable goals was crucial at the outset of Upwork’s chaos engineering initiative. The engineering team focused on enhancing overall system resilience and minimizing downtime due to unforeseen failures. The team could prioritize improvements once the weakest points in the infrastructure were identified. These decisions became the roadmap to guide us through the chaos engineering process. They also provided guidance for measuring success along the way.

Since there was no initial baseline at Upwork, metrics were identified for the project, including the following:

Number of bugs/issues found during an injection: This metric counts the defects, issues, and improvement opportunities uncovered. Exposing a system to simulated failures helps identify vulnerabilities and weaknesses in its design and implementation.

Number of injections executed: This metric tracks the total count of fault injections or chaos experiments performed during a chaos engineering initiative. A higher number indicates a broader range of scenarios tested, leading to a better understanding of a system's resilience.

Percentage of impact to the environment: The percentage of impact identifies the severity of disruptions caused by fault injections relative to normal operation. The goal is for controlled injections to have minimal impact on system operation; ideally, this number should be 0.
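The three metrics above could be tallied with something as small as the following Python sketch. The record fields are hypothetical, the sample values are made up, and the impact formula (error rate added by the injection, averaged across injections) is only one assumed way to express "percentage of impact". The article does not say how Upwork computes it.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class InjectionResult:
    """Outcome of one injection; all field names here are hypothetical."""
    name: str
    issues_found: int
    baseline_error_rate: float   # fraction of failed requests before the injection
    injection_error_rate: float  # fraction of failed requests during the injection


def summarize(results: List[InjectionResult]) -> Dict[str, float]:
    """Compute the three tracking metrics for a batch of injections."""
    injections = len(results)
    issues = sum(r.issues_found for r in results)
    # Assumption: impact = error rate added by the injection, in percent,
    # floored at 0 and averaged across injections.
    impacts = [
        max(0.0, r.injection_error_rate - r.baseline_error_rate) * 100
        for r in results
    ]
    average_impact = sum(impacts) / injections if injections else 0.0
    return {
        "injections_executed": injections,
        "issues_found": issues,
        "average_impact_percent": average_impact,
    }


# Made-up sample data, not Upwork's results.
results = [
    InjectionResult("cache-node-loss", issues_found=2,
                    baseline_error_rate=0.0, injection_error_rate=0.0),
    InjectionResult("db-failover", issues_found=1,
                    baseline_error_rate=0.25, injection_error_rate=0.5),
]
summary = summarize(results)
print(summary)
```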

Identifying initial metrics is critical to knowing what to track and how to evaluate the performance of the experimentation. The captured benchmarking information should be used to present results and to set goals going forward.

Choosing what needs to be verified

There are two approaches to deciding which chaos engineering experiments to execute:

Reactive approach: Simulate past incidents to verify that current systems can withstand a recurrence. This involves checking incident trends and history, recurring problems, issues that take too long to be resolved, problems that have significant system impact, or other problematic metrics.

Proactive approach: With assistance from engineering, identify and verify pieces of the ecosystem that are new or haven’t had adverse incidents. This may require additional exploration to ensure the design is fault-tolerant and validate how real-world scenarios impact system resilience.

Coordinating with owners

A service owner is the most knowledgeable person on how a service works and operates. No chaos engineering experiment should be executed without the owner’s awareness and involvement. Rather than just providing information about the service, an owner should actively participate in the design of injections and commit to any follow-up actions. Make clear the responsibilities an owner has during the experiment:

Explain the targeted service’s purpose and its expected behavior (on the service itself and its dependencies) if an outage occurs
Find a way to introduce traffic into the service for verification during the experiment
Provide a dashboard and metrics to measure the impact of the outage of the targeted service
Suggest and propose additional engineering or quality assurance resources that could be critical to the experiment’s success

Drafting a plan

Once you have decided on a target for an injection, it is a good idea to use a planning template to help define and document the experiment’s design. The template provides the framework and checklist needed for a chaos engineering experiment and includes the following information:

Problem statement: Clearly state the experiment’s target and why the specific target is chosen.
Impact: Identify the dependencies and the impact of executing the experiment. (The impact must be minimized or mitigated.)
Cross-team communications: Specify how communication will be done with cross-functional team members.
How tests will be performed: Inspired by “Principles of Chaos Engineering,” use the “Simulating problems” section to help select the tools to use for the experiment.
Execution plan: List each step to be performed, including what will be executed and by whom.
Notes: Elaborate and set the overall context to include any discussion items, the mindset for the injection, and any relevant information that can help make the experiment successful.
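The template fields listed above map naturally onto a small data structure. The Python sketch below is one possible representation, with a helper that reports which sections are still empty before a plan is circulated. The class, field, and function names are assumptions made for illustration, not Upwork's actual template.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class ExperimentPlan:
    """Hypothetical in-code form of the planning template's sections."""
    problem_statement: str = ""
    impact: str = ""
    cross_team_communications: str = ""
    how_tests_will_be_performed: str = ""
    execution_plan: List[str] = field(default_factory=list)  # ordered steps
    notes: str = ""


def missing_sections(plan: ExperimentPlan) -> List[str]:
    """Return the names of sections that are still empty, in template order."""
    return [f.name for f in fields(plan) if not getattr(plan, f.name)]


# A partially filled draft with made-up content.
draft = ExperimentPlan(
    problem_statement="Verify that checkout degrades gracefully when a cache node is lost.",
    impact="Staging only; checkout latency may rise during the injection.",
    execution_plan=["Announce start", "Shut down one cache node", "Observe dashboards"],
)
print(missing_sections(draft))
```

A check like this could run before a plan is shared, so reviewers only see drafts with every section filled in.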

Upwork regularly conducts “GameHours” (inspired by Gremlin’s GameDays) as part of its chaos engineering approach. The team simulates real-world failure scenarios in a controlled environment. At Upwork, active participation of service owners is required. These injections range from simulating sudden spikes in user traffic to inducing failures in specific components.

For the purpose of sharing awareness and impact, identify the following in the plan:

Simulations and scenarios to be verified
Resources needed to assist during the exercise
Observers who would like to be present

Upwork designs scenarios to be executed within a 2-hour span. This includes initial setup, presentation of the plan with the involved parties, execution of the planned scenarios, and information collection. This approach keeps any issues contained to that time period while minimizing any possible work disruption.

GameHours help the team validate the effectiveness of system resilience and serve as a valuable learning experience for engineers. The practice encourages cross-team collaboration, improves communication, and fosters a deeper understanding of the system's intricate behaviors.

Environments where injections are performed

Running the first attempt at chaos engineering injections in staging instead of production is a good idea. Even with obvious differences in those environments, failure patterns could still be reliably reproduced and problems identified.

This doesn’t mean that staging should be used as a “crash test dummy.” Experiments should always be designed with uptime in mind, minimizing blast radius and having a “big red button” to stop any attack. Each experiment should have clear abort conditions to keep any disruption to a minimum.
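An abort condition can be as simple as a threshold on a health metric. The Python sketch below watches a stream of error-rate readings and calls an abort callback on the first reading above the threshold. The function name, the threshold, and the sample readings are all hypothetical; in practice the readings would come from monitoring and the callback would trigger the tool's halt mechanism.

```python
from typing import Callable, Iterable, List


def run_with_abort(
    error_rates: Iterable[float],
    abort_threshold: float,
    abort: Callable[[], None],
) -> bool:
    """Watch a stream of readings; abort on the first one above the threshold.

    Returns True if every reading stayed within bounds, False if aborted.
    """
    for rate in error_rates:
        if rate > abort_threshold:
            abort()  # the "big red button": halt the attack immediately
            return False
    return True


abort_log: List[str] = []
completed = run_with_abort(
    error_rates=[0.01, 0.02, 0.08, 0.03],  # made-up sample readings
    abort_threshold=0.05,
    abort=lambda: abort_log.append("attack halted"),
)
print(completed, abort_log)
```

Returning as soon as the threshold is crossed means no further readings are processed after the abort, which keeps the disruption window as short as the polling interval allows.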

Managing the risks

Issues are bound to happen. Even when multiple dependencies (e.g., metrics, logs, and services) can’t be controlled at execution, risks can be mitigated by managing the following:

Metrics health: Observability is critical for the execution of any experiments. You cannot measure what you can’t observe. Fortunately, systems are redundant at Upwork, with multiple monitoring sources that can handle fallback in case of issues. When creating the plan, perform audits and request additional metrics from service owners to track during the experiment.

Service health: Unfortunately, having a malfunctioning service or staging environment can force the suspension of an experiment. Continuous communication with service owners helps ensure that the service is healthy before running the experiment.

Log availability: Avoid log rotation that may lose results, proof, and troubleshooting information that the service owner will need to investigate any outstanding situation encountered during the experiment. Always raise awareness during the injection, especially when unrelated actions are discovered during the experiment.

The results attained at Upwork

After a year of exploring chaos engineering, Upwork Engineering executed 15 injections without impacting our staging environment. Those injections generated 26 follow-up actions distributed between monitoring bugs, service improvements, improvements to documentation and processes, and functional bugs. As a result, Engineering adopted a collection of new processes, playbooks, guidelines, tutorials, training sessions, and plans (used to document and track all the efforts).

Chaos engineering has revealed new product experiences, collaboration, and valuable insight into multiple parts of Upwork’s platform. Engineering now has better clarity and understanding of how systems interact with each other.

In summary

After a successful year, the next steps for Upwork are to:

Increase the number of injections
Identify new failure patterns
Experiment with stress, soak, and load testing
Adopt additional metrics into Upwork’s practices
Continually adjust the production environment according to what we learn

By deliberately injecting chaos into Upwork’s systems, Engineering has gained a deeper understanding of user behavior, increased system resilience, and fortified the ability to deliver consistent and reliable services to customers.

Upwork has made chaos engineering an integral part of its engineering culture. It has become a practice that helps prevent incidents and empowers distributed engineering teams to work cohesively and confidently.