Concord was born out of need and experience – the need for a stable, predictable stream processor given rise from painful experiences keeping Storm clusters reliable and efficient at scale. There are a few core points to the philosophy underlying Concord:
Stream processing shouldn’t be restricted to distributed systems experts – application programmers and data scientists should be able to write streaming computations in the languages they’re comfortable with
Cluster management should be accessible for regular developers, not just specialists
Debugging a distributed system (& tracking down errors) should be simple
For multi-tenancy, processes must be isolated from and protected against one another
In this post, we’ll be exploring these ideas and discussing how they impacted our design process when building Concord.
Mesos - Cluster Management
Concord is a native stream processing framework (implemented in C++) built on Mesos. Close integration with Mesos enables us to do many things.
Mesos integrates with the Linux control groups APIs, making it possible to ensure a given process does not exceed its CPU and Memory allocations, denying other processes of the resources they need. This makes it possible to operate Mesos clusters at very high utilization, avoiding wasted (idle) computing resources while still keeping performance predictable. These guarantees are especially powerful when paired with Mesos’ fine grain scheduling API, through which framework authors (like us at Concord) have exacting control over which machines execute which tasks.
This scheduling flexibility enables one of Concord’s more interesting features: dynamic deployments. In many existing stream processing frameworks - Storm and Spark, to name two - topologies are static – once deployed, they can’t be changed. This means that a modification to any stream operator requires that the entire topology be torn down and restarted.
In Concord, we prefer to look at streaming computations - what we call ‘operators’ - as actors. Individual actors can be shut down, be redeployed or crash without affecting the rest of the topology. This dynamic quality better reflects the way software projects grow and mature – piece by piece, not as a whole. This means that a bug in some downstream operator can be patched without disrupting the rest of the well-functioning system. This also means that data scientists can be comfortable deploying experimental code without worrying about affecting mission-critical systems.
Before and after — adding a new operator, C, to an existing topology
Mesos also has a simple but effective web interface that helps to expose critical information to the user like cluster utilization and - most importantly - task logs. Debugging issues in the middle of the night is even more difficult when you have to ssh into one of 20 production machines to find a crashed process’ logs. With Mesos, not only is it easy to identify which process crashed, you can also easily navigate to the relevant logs through their UI.
Running operators visible in the Mesos UI
Concord’s internal and external APIs are all build on Thrift, a fast, multi-language RPC framework. User operators running in Concord communicate with the framework via Thrift, meaning users can write computations in any of the 15 languages supported by Thrift. To make the experience better suit each language’s idioms, we’ve made higher level APIs for Python, Ruby, Java, Go, and C++.
At Concord we realize that stream processors are hard to build - trust us, it hasn’t been easy! - but we firmly believe that that shouldn’t mean they’re difficult to use. We hope that, through the simplicity of our APIs and tooling, you’ll find the confidence to build and maintain streaming systems that you previously thought unattainable. You shouldn’t need a PhD to count clicks on a page and you shouldn’t need years of experience to keep your systems running in the presence of failures.
Concord: Stream processing without the headaches.