As Upwork moved toward distributed micro frontends and backends, we needed tools to facilitate communication between services. Among other things, we wanted access to remote systems to be as fault tolerant as practically possible and backed by a clearly defined contract to avoid human error.

I want to tell you about two instruments that we made available on GitHub: Phystrix and Donkey.

Preventing cascading failures

In a distributed system, the challenge of dealing with occasional and unexpected failures becomes more pronounced. One of the key elements in implementing a resilient architecture and ensuring quick recovery was the adoption of Netflix’s Hystrix open source library.

Hystrix protects the client by monitoring the health of a remote system and “breaking the circuit” when the dependency is misbehaving. The overall flow is as follows:

  1. The application makes a call to a remote dependency through Hystrix.
  2. Hystrix checks whether the dependency is healthy and rejects calls to failing dependencies.
  3. Hystrix isolates the call in a designated thread, measuring the latency (semaphore isolation is also available).
  4. If latency is too high, Hystrix will enforce the configured timeout and stop the thread’s execution.
  5. Hystrix stores the success-to-failure ratio. When it breaches the configured threshold, the dependency is marked as unhealthy.
  6. For unhealthy dependencies, Hystrix will allow infrequent probing requests to go through, to see if the dependency has recovered.
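The health-tracking part of this flow (steps 2, 5, and 6) can be sketched in a few lines of PHP. This is a minimal illustration, not Hystrix or Phystrix code: SketchCircuitBreaker and callThroughBreaker are hypothetical names, counters live in plain object properties, and thread isolation and timeouts (steps 3 and 4) are left out entirely.

```php
<?php
// Illustrative sketch only; all names here are hypothetical.
final class SketchCircuitBreaker
{
    private int $successes = 0;
    private int $failures = 0;
    private ?float $openedAt = null; // null means the circuit is closed
    private int $requestVolumeThreshold;
    private int $errorThresholdPercentage;
    private float $sleepWindowSeconds;

    public function __construct(int $requestVolumeThreshold, int $errorThresholdPercentage, float $sleepWindowSeconds)
    {
        $this->requestVolumeThreshold = $requestVolumeThreshold;
        $this->errorThresholdPercentage = $errorThresholdPercentage;
        $this->sleepWindowSeconds = $sleepWindowSeconds;
    }

    // Steps 2 and 6: allow calls while healthy, or as a probe once the sleep window has passed.
    public function allowsRequest(float $now): bool
    {
        return $this->openedAt === null
            || ($now - $this->openedAt) >= $this->sleepWindowSeconds;
    }

    public function isOpen(): bool
    {
        return $this->openedAt !== null;
    }

    public function recordSuccess(): void
    {
        if ($this->openedAt !== null) {
            // A probe succeeded: close the circuit and start counting afresh.
            $this->openedAt = null;
            $this->successes = 0;
            $this->failures = 0;
        }
        $this->successes++;
    }

    // Step 5: track failures and open the circuit once the threshold is breached.
    public function recordFailure(float $now): void
    {
        $this->failures++;
        if ($this->openedAt !== null) {
            $this->openedAt = $now; // a probe failed: restart the sleep window
            return;
        }
        $total = $this->successes + $this->failures;
        if ($total >= $this->requestVolumeThreshold
            && 100 * $this->failures / $total >= $this->errorThresholdPercentage) {
            $this->openedAt = $now;
        }
    }
}

// Step 1: the application calls the dependency through the breaker.
function callThroughBreaker(SketchCircuitBreaker $breaker, callable $remoteCall, callable $fallback, float $now)
{
    if (!$breaker->allowsRequest($now)) {
        return $fallback(); // short-circuited: the dependency is not called at all
    }
    try {
        $result = $remoteCall();
        $breaker->recordSuccess();
        return $result;
    } catch (\Throwable $e) {
        $breaker->recordFailure($now);
        return $fallback();
    }
}
```

In real use $now would come from microtime(true); it is passed in here only so the behaviour is easy to follow and to test. Note that in this single-process sketch every call after the sleep window counts as a probe, whereas a concurrent implementation would need to let only one through at a time.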

The problem was that for PHP, which powers our distributed frontend, there was no equivalent library.

Meet Phystrix

Phystrix is our implementation of the circuit breaker pattern for the PHP ecosystem.

One of the challenges presented by PHP was the complete isolation of the request scope. Unlike in JVM-based web frameworks, it is not possible to store the Circuit Breaker metrics data and provide concurrent access to it by means of PHP itself, e.g. in a variable. We chose the APC cache PHP extension to share data between requests:

  • It’s fast. There is no network call.
  • APC cache is very stable. It backed PHP’s bytecode caching for many years (the first stable version dates to 2003).
  • It provides atomic operations required for accurate tracking of the metrics.
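As a rough illustration of the kind of atomic counter updates this relies on, the sketch below uses the APCu functions apcu_add, apcu_inc, and apcu_fetch. It assumes the APCu extension (the successor to APC's user cache) rather than the original APC functions. CounterStore, ApcuCounterStore, ArrayCounterStore, errorPercentage, and the key names are hypothetical and are not Phystrix classes. ArrayCounterStore is an in-memory stand-in so the example can run where APCu is not enabled, for instance on the command line.

```php
<?php
// Illustrative sketch only; all names here are hypothetical.
interface CounterStore
{
    public function increment(string $key): int;
    public function get(string $key): int;
}

// Shared-memory counters, visible to every request served by the same PHP runtime.
final class ApcuCounterStore implements CounterStore
{
    public function increment(string $key): int
    {
        // apcu_add() only stores the value if the key is absent,
        // so a concurrent request cannot reset an existing counter.
        apcu_add($key, 0);
        // apcu_inc() updates the stored number in a single atomic step.
        $value = apcu_inc($key);
        return $value === false ? 0 : $value;
    }

    public function get(string $key): int
    {
        $value = apcu_fetch($key, $found);
        return $found ? (int) $value : 0;
    }
}

// Per-process stand-in with the same interface, for tests and CLI runs.
final class ArrayCounterStore implements CounterStore
{
    private array $counters = [];

    public function increment(string $key): int
    {
        $this->counters[$key] = ($this->counters[$key] ?? 0) + 1;
        return $this->counters[$key];
    }

    public function get(string $key): int
    {
        return $this->counters[$key] ?? 0;
    }
}

// Derive the failure percentage for one command from its two counters.
function errorPercentage(CounterStore $store, string $commandKey): int
{
    $failures = $store->get($commandKey . '.failure');
    $total = $failures + $store->get($commandKey . '.success');
    return $total === 0 ? 0 : (int) floor(100 * $failures / $total);
}

$store = new ArrayCounterStore();
$store->increment('avatar.failure');
$store->increment('avatar.success');
$store->increment('avatar.success');
$store->increment('avatar.success');
$avatarErrorPercentage = errorPercentage($store, 'avatar');
```

Swapping ArrayCounterStore for ApcuCounterStore leaves the calling code unchanged, but the counts then survive the end of each request, which a plain PHP variable cannot do.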

We designed the Phystrix configuration system to mirror Hystrix configuration, so that terminology is shared between the Java and PHP stacks:

        'circuitBreaker' => array(
            // Whether the circuit breaker is enabled; if not, Phystrix will always allow a request
            'enabled' => true,
            // Percentage of failed requests at which we open the circuit (disallow subsequent requests)
            'errorThresholdPercentage' => 50,
            // If true, the circuit breaker will always be open, regardless of the metrics
            'forceOpen' => false,
            // If true, the circuit breaker will always be closed, allowing all requests, regardless of the metrics
            'forceClosed' => false,
            // Minimum number of requests needed before we can start making decisions about service stability
            'requestVolumeThreshold' => 10,
            // How long to wait before attempting to access a failing service again
            'sleepWindowInMilliseconds' => 5000,
        ),
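To illustrate how these settings might interact, here is a minimal sketch of the decision they describe. The function sketchAllowsRequest is hypothetical and not part of Phystrix. The order in which the flags are checked (forceOpen before forceClosed) is an assumption modelled on Hystrix, and the sleep window is left out to keep the arithmetic visible.

```php
<?php
// Illustrative sketch only; sketchAllowsRequest is a hypothetical helper.
function sketchAllowsRequest(array $circuitBreaker, int $totalRequests, int $failedRequests): bool
{
    if (!$circuitBreaker['enabled']) {
        return true;  // breaker disabled: every request goes through
    }
    if ($circuitBreaker['forceOpen']) {
        return false; // circuit forced open: every request is rejected
    }
    if ($circuitBreaker['forceClosed']) {
        return true;  // circuit forced closed: metrics are ignored
    }
    if ($totalRequests === 0 || $totalRequests < $circuitBreaker['requestVolumeThreshold']) {
        return true;  // too few requests to judge the service yet
    }
    $errorPercentage = 100 * $failedRequests / $totalRequests;
    return $errorPercentage < $circuitBreaker['errorThresholdPercentage'];
}

$circuitBreaker = array(
    'enabled' => true,
    'errorThresholdPercentage' => 50,
    'forceOpen' => false,
    'forceClosed' => false,
    'requestVolumeThreshold' => 10,
    'sleepWindowInMilliseconds' => 5000,
);

// 9 failures out of 9 requests: still allowed, the volume threshold of 10 is not reached.
$belowVolume = sketchAllowsRequest($circuitBreaker, 9, 9);
// 5 failures out of 10 requests: 50% reaches the threshold, so the request is rejected.
$atThreshold = sketchAllowsRequest($circuitBreaker, 10, 5);
```

With the values above, a service has to see at least 10 requests, half of them failing, before calls are short-circuited.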

To make use of the Circuit Breaker, a call to an upstream dependency needs to be encapsulated using the Command pattern:

use Odesk\Phystrix\AbstractCommand;

class GetAvatarUrlCommand extends AbstractCommand
{
    protected $user;

    public function __construct($user)
    {
        $this->user = $user;
    }

    protected function run()
    {
        $remoteAvatarService = $this->serviceLocator->get('avatarService');
        return $remoteAvatarService->getUrlByUser($this->user);
    }

    /**
     * When run() fails for some reason, or when Phystrix doesn't allow the request in the first place,
     * the result of this method is returned instead.
     *
     * @return string
     */
    protected function getFallback()
    {
        // We failed to get the user's picture, so we show a generic no-photo placeholder instead.
        return 'http://example/avatars/no-photo.jpg';
    }
}
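To make the relationship between run() and getFallback() concrete, here is a self-contained sketch that does not depend on Phystrix. SketchCommand is a deliberately simplified stand-in for the library's base class: it has no circuit breaker, metrics, or service locator, and the avatar service is replaced by a hypothetical callable.

```php
<?php
// Illustrative stand-in only; not the Phystrix AbstractCommand.
abstract class SketchCommand
{
    // Template method: try the real call, fall back if it throws.
    public function execute()
    {
        try {
            return $this->run();
        } catch (\Exception $e) {
            return $this->getFallback();
        }
    }

    abstract protected function run();

    abstract protected function getFallback();
}

final class SketchGetAvatarUrlCommand extends SketchCommand
{
    private string $user;
    private $avatarService; // hypothetical callable standing in for the remote service

    public function __construct(string $user, callable $avatarService)
    {
        $this->user = $user;
        $this->avatarService = $avatarService;
    }

    protected function run()
    {
        return ($this->avatarService)($this->user);
    }

    protected function getFallback()
    {
        return 'http://example/avatars/no-photo.jpg';
    }
}

$healthyService = function (string $user) {
    return 'http://example/avatars/' . $user . '.jpg';
};
$failingService = function (string $user) {
    throw new \RuntimeException('avatar service unavailable');
};

$fromService = (new SketchGetAvatarUrlCommand('alice', $healthyService))->execute();
$fromFallback = (new SketchGetAvatarUrlCommand('alice', $failingService))->execute();
```

The caller only ever sees a URL: either the one returned by the service or the placeholder, so a failing avatar service degrades the page instead of breaking it.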

We open sourced the Phystrix library on GitHub. There is also a bundle available for integration with the popular Symfony PHP framework.

Cross-language interaction

From the very beginning of our journey toward microservices, we aspired to design an architecture that accommodates different programming languages and paradigms. We wanted to be able to use the technologies best suited for a particular task. At the time of writing, we have PHP, JavaScript, Java, Perl, and others working together. For that, we needed to define the APIs of the individual services clearly and strictly enough to avoid human error and false assumptions.

We liked Thrift data types, but we also appreciated RESTful design for its simplicity, statelessness, and uniformity. We were missing the piece that would combine the two.

We needed a tool that would allow us to describe an API in a clear and language-agnostic manner; we also wanted to be able to generate client code in any language of our choice.

The most promising tool available at the time was Swagger V1. However, its DSL lacked readability and its code generation tools were limited. Moreover, Swagger’s primary objective was to help generate a UI for accessing a web service, which was not what we were looking for.

We created a tool called Donkey, which puts the DSL first:

resource Users "/users" {

    GET "/{name}" User getUser(pathParam string name) throws UserNotFoundException;

    POST void addUser(requestBody User user);

    ## Search for users registered at a given address
    POST "/search-by-addresses" UserList searchUsersByAddress(requestBody Address address)
        throws AddressMalformedException;
}
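To show how such a definition could translate into HTTP calls, here is a hand-written sketch of the request each operation might produce. It is not the output of any Donkey generator. SketchUsersRequests is a hypothetical class, the array-based request shape is invented for illustration, and we assume that operation paths are resolved relative to the resource path "/users".

```php
<?php
// Hypothetical illustration; not generated by Donkey.
final class SketchUsersRequests
{
    private const RESOURCE_PATH = '/users';

    // GET "/{name}" User getUser(pathParam string name)
    public static function getUser(string $name): array
    {
        return array(
            'method' => 'GET',
            'path' => self::RESOURCE_PATH . '/' . rawurlencode($name),
            'body' => null,
        );
    }

    // POST void addUser(requestBody User user)
    public static function addUser(array $user): array
    {
        return array(
            'method' => 'POST',
            'path' => self::RESOURCE_PATH,
            'body' => $user,
        );
    }

    // POST "/search-by-addresses" UserList searchUsersByAddress(requestBody Address address)
    public static function searchUsersByAddress(array $address): array
    {
        return array(
            'method' => 'POST',
            'path' => self::RESOURCE_PATH . '/search-by-addresses',
            'body' => $address,
        );
    }
}

$getRequest = SketchUsersRequests::getUser('jane doe');
$searchRequest = SketchUsersRequests::searchUsersByAddress(array('city' => 'Springfield'));
```

Each DSL line carries the verb, the path, the parameter binding, and the return type, which is what a generator needs to produce this kind of code mechanically.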

With Donkey, the definition file for a service carries all the essential REST semantics while allowing engineers to focus on the business logic behind the service.

Left out of the definition file are headers with contextual data, entity serialization rules, and exception handling mechanisms. We noticed that these elements of the API are usually decided upon once and reused in all services. For example, we decided to use Apache Thrift for data type definition, transfer, and code generation. All the clients are aware of this convention and can work with any service that has a Donkey DSL file.

Donkey core provides the domain-specific language and the framework to build code generation tools. Unlike Apache Thrift, Donkey core does not include any generators out of the box. We felt that requirements sometimes differ too much for any one client library to fit all use cases.

For example, our generators produce a set of Hystrix or Phystrix commands with the circuit breaker pattern built in. Others may have different requirements for their HTTP client configuration: logging, tracking metrics, setting additional headers, etc. For these developers, there are client and server generators for the Symfony PHP framework and Spring Java available on GitHub.

Progressing towards a fault-tolerant and distributed future

In this article we took a look at two new tools we made available on GitHub that were developed as part of Upwork’s ongoing modernization efforts. Phystrix provides a “Hystrix-esque” implementation of the circuit breaker pattern for the PHP ecosystem. Donkey provides a DSL for describing APIs in a clear, language-agnostic manner. Check out our engineering blog posts for more about Upwork’s engineering team.