
The Upwork Engineering organization has been working with distributed engineering teams for years to create products and services that connect millions of businesses with independent freelancing talent around the globe.
Over the past year, chaos engineering has been injected throughout Upwork’s software development lifecycle. The idea of intentionally introducing chaos into our systems was initially daunting, but the results have been eye-opening and invaluable. This retrospective is a chance to look back on the challenges, lessons learned, and the impact of chaos engineering on Upwork’s entire engineering organization.
The chaos engineering project has involved engineers across multiple teams, led by site reliability engineers (SREs) in collaboration with service owners. Notably, the majority of Upwork’s engineering team are freelancers working remotely around the world.
The Upwork ecosystem is complex. An unfortunate property of complex systems is that nobody knows how all the moving pieces behave or how they may fail under specific circumstances. Upwork needed a discipline that could holistically support its complex environment. Instead of reacting to incidents, the team could mitigate problems proactively by identifying, raising, and resolving issues before they become serious incidents.
Chaos engineering, the process of testing a distributed computing environment to ensure it can withstand unexpected disruptions, is not a new concept to most software organizations. Most of us have been using some form of chaos engineering, in some way or another, without even knowing it. A couple of examples come to mind:
Formally introducing chaos engineering involves much more than that. It is a discipline that requires a cultural change: embracing the chaotic and unexpected behavior of complex systems. Upwork Engineering became familiar with the practice by embracing the Principles of Chaos Engineering, digesting lots of articles, watching seminars, and reading books on the topic. Many online resources provide case studies of how chaos engineering is implemented among different companies. All of this research helped Upwork determine how to use chaos engineering practices in its own ecosystem.
Any large-scale endeavor (like chaos engineering) requires support from executive leadership. At Upwork, there was leadership buy-in from the beginning. It can be difficult to convince anyone that “breaking things” is the best way to achieve resiliency; it feels unnatural and counterintuitive. Since a key component of SRE is fixing and patching things together, “breaking things on purpose” could feel like going against your every instinct as an engineer.
A recommended approach is to start with a small, well-defined proof of concept on a service that tends to be problematic, begin in staging, and minimize any possible impact. By starting small, there should be enough evidence to showcase to executives that chaos engineering is feasible and achievable. Adopting this approach should translate into improvements in product resilience.
The quality of results depends on confidence in the tooling and on the creativity applied to simulating real-world scenarios and experiments against services. Some incidents are unique, and no instrument can simulate all possible scenarios. There's no silver bullet, but there are some recommended approaches to address that concern:
Gremlin’s chaos engineering tools can safely and securely inject failure into systems to find weaknesses before they cause customer-facing issues. Gremlin uses a set of attacks that simulate problems related to resources, networks, and processes. This approach is useful to experiment with specific failure patterns across infrastructure. An abort button is also available to halt attacks in progress.
A Relational Database Service (RDS) failover or the loss of a cache node can be simulated using Amazon Web Services (AWS) tooling and capabilities. Changing a Security Group, forcing RDS failover, or shutting down a Memcached node are actions that help experiment on the ecosystem and reveal gaps in system availability.
Benchmarking tools, automated testing, and stress testing can help simulate problems, generate traffic (to see how users and services react to injected issues), and check the health of user flows while experiments are taking place.
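The forced RDS failover mentioned above can be sketched in Python. This is a minimal illustration under stated assumptions, not Upwork's actual tooling: the function name `force_rds_failover`, the `dry_run` flag, and the instance identifier in the usage note are hypothetical. The sketch expects a client that behaves like `boto3.client("rds")`, whose `reboot_db_instance` call accepts `ForceFailover=True` to trigger a Multi-AZ failover.

```python
def force_rds_failover(rds_client, db_instance_id, dry_run=True):
    """Trigger a Multi-AZ failover by rebooting the instance with failover.

    rds_client is expected to behave like boto3.client("rds").
    With dry_run=True (the default) nothing is sent to AWS; the planned
    request is returned so it can be reviewed with the service owner first.
    """
    # Hypothetical helper: only the reboot_db_instance call is a real AWS API.
    request = {"DBInstanceIdentifier": db_instance_id, "ForceFailover": True}
    if dry_run:
        return {"executed": False, "request": request}
    response = rds_client.reboot_db_instance(**request)
    return {"executed": True, "request": request, "response": response}
```

A call such as `force_rds_failover(boto3.client("rds"), "staging-db-1", dry_run=False)` would run the injection against a (made-up) staging instance. Defaulting to a dry run is a design choice that keeps the destructive step explicit.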
Leveraging resources like these can produce amazing results in any chaos engineering exercise. Ultimately, however, it is the creativity of the people involved that will determine the level of success.
Defining clear and achievable goals was crucial at the outset of Upwork’s chaos engineering initiative. The engineering team focused on enhancing overall system resilience and minimizing downtime due to unforeseen failures. The team could prioritize improvements once the weakest points in the infrastructure were identified. These decisions became the roadmap to guide us through the chaos engineering process. They also provided guidance for measuring success along the way.
Since there was no initial baseline at Upwork, metrics were identified for the project, including the following:
Identifying initial metrics is critical to know what to track and how to evaluate the performance of the experimentation. The captured benchmarking information should be used to present results and to set goals going forward.
There are two approaches to deciding which chaos engineering experiments to execute:
Reactive approach: Simulate past incidents while verifying that current systems can withstand repetitive actions. This involves checking incident trends and history, recurring problems, issues that take too long to be resolved, problems that have significant system impact, or other problematic metrics.
Proactive approach: With assistance from engineering, identify and verify pieces of the ecosystem that are new or haven’t had adverse incidents. This may require additional exploration to ensure the design is fault-tolerant and validate how real-world scenarios impact system resilience.
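For the reactive approach, one simple way to turn incident history into a shortlist is to rank services by how often they fail and how long they take to recover. The sketch below is illustrative only: the `Incident` fields, the ranking rule, and the sample service names and durations are all invented for the example and do not reflect Upwork's data or process.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Incident:
    service: str             # hypothetical service name
    minutes_to_resolve: int  # time from detection to resolution


def rank_services(incidents):
    """Order services by incident count, then by total minutes to resolve."""
    counts = defaultdict(int)
    minutes = defaultdict(int)
    for incident in incidents:
        counts[incident.service] += 1
        minutes[incident.service] += incident.minutes_to_resolve
    # Most recurrent first; ties are broken by the longest total recovery time.
    return sorted(counts, key=lambda s: (counts[s], minutes[s]), reverse=True)


# Made-up history used only to exercise the function.
history = [
    Incident("search", 30),
    Incident("payments", 120),
    Incident("search", 45),
    Incident("messaging", 200),
]
print(rank_services(history))  # ['search', 'messaging', 'payments']
```

A real ranking would likely weigh user impact as well, but even a rough ordering like this gives the team a starting point for discussion with service owners.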
A service owner is the most knowledgeable person on how a service works and operates. No chaos engineering experiment should be executed without the owner’s awareness and involvement. Rather than just providing information about the service, an owner should actively participate in the design of injections and commit to any follow-up actions. Make clear the roles an owner is expected to fill during the experiment:
Once you have decided your target for an injection, it is a good idea to use a planning template to help define and document the experiment’s design. The template provides the framework and checklist needed for a chaos engineering experiment that includes the following information:
Upwork regularly conducts “GameHours” (inspired by Gremlin’s GameDays) as part of its chaos engineering approach. The team simulates real-world failure scenarios in a controlled environment. At Upwork, active participation of service owners is required. These injections range from simulating sudden spikes in user traffic to inducing failures in specific components.
For the purpose of sharing awareness and impact, identify the following in the plan:
Upwork designs scenarios to be executed within a 2-hour span. This includes initial setup, presentation of the plan with the involved parties, execution of the planned scenarios, and information collection. This approach keeps any issues contained to that time period while minimizing any possible work disruption.
GameHours help the team validate the effectiveness of system resilience and serve as a valuable learning experience for engineers. The practice encourages cross-team collaboration, improves communication, and fosters a deeper understanding of the system's intricate behaviors.
Running the first attempt at chaos engineering injections in staging instead of production is a good idea. Even with obvious differences in those environments, failure patterns could still be reliably reproduced and problems identified.
This doesn’t mean that staging should be used as a “crash test dummy.” Experiments should always be designed with uptime in mind, minimizing blast radius and having a “big red button” to stop any attack. Each experiment should have clear abort conditions to keep any disruption to a minimum.
Issues are bound to happen. Even when multiple dependencies (e.g., metrics, logs, and services) can’t be controlled at execution, risks can be mitigated by managing the following:
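The combination of abort conditions and a fixed time box can be sketched as a small control loop. This is a minimal sketch under assumptions, not a description of any particular tool: `run_with_abort` and its parameters are hypothetical names, the callables stand in for whatever starts and stops an injection, and the 7200-second default simply mirrors the 2-hour span described earlier.

```python
import time


def run_with_abort(start_attack, stop_attack, is_healthy,
                   max_seconds=7200, poll_seconds=5,
                   clock=time.monotonic, sleep=time.sleep):
    """Run an injection until the time box expires or an abort condition trips.

    start_attack / stop_attack: callables that begin and halt the injection.
    is_healthy: callable returning False once an abort condition is met.
    clock / sleep: injectable so the loop can be exercised without waiting.
    Returns "completed" or "aborted"; stop_attack runs in both cases.
    """
    deadline = clock() + max_seconds
    start_attack()
    try:
        while clock() < deadline:
            if not is_healthy():
                return "aborted"
            sleep(poll_seconds)
        return "completed"
    finally:
        # The "big red button": always halt the attack, even on errors.
        stop_attack()
```

Placing `stop_attack` in a `finally` block means the injection is halted whether the run finishes, aborts, or raises an unexpected exception.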
Metrics health: Observability is critical for the execution of any experiments. You cannot measure what you can’t observe. Fortunately, systems are redundant at Upwork, with multiple monitoring sources that can handle fallback in case of issues. When creating the plan, perform audits and request additional metrics from service owners to track during the experiment.
Service health: Unfortunately, having a malfunctioning service or staging environment can force the suspension of an experiment. Continuous communication with service owners helps ensure that the service is good before running the experiment.
Log availability: Avoid log rotation that may lose results, proof, and troubleshooting information that the service owner will need to investigate any outstanding situation encountered during the experiment. Always raise awareness during the injection, especially when unrelated actions are discovered during the experiment.
After a year of exploring chaos engineering, Upwork Engineering executed 15 injections without impacting our staging environment. Those injections generated 26 follow-up actions distributed among monitoring bugs, service improvements, improvements to documentation and processes, and functional bugs. As a result, Engineering adopted a collection of new processes, playbooks, guidelines, tutorials, training sessions, and plans (used to document and track all the efforts).
Chaos engineering has revealed new product experiences, collaboration, and valuable insight into multiple parts of Upwork’s platform. Engineering now has a clearer understanding of how systems interact with each other.
After having a successful year, the next step for Upwork is to:
By deliberately injecting chaos into Upwork’s systems, Engineering has gained a deeper understanding of user behavior, increased system resilience, and fortified the ability to deliver consistent and reliable services to customers.
Upwork has made chaos engineering an integral part of its engineering culture. It has become a practice that helps prevent incidents and empowers distributed engineering teams to work cohesively and confidently.