Tutorials

Tracing and Metrics

By default, each Concord operator logs its performance metrics to local files, making point-to-point latency and throughput metrics accessible through the Mesos UI. To view these logs, click the sandbox link for the operator you want to inspect, then open the sole folder. You should see something like this:

Latency Metrics

Concord records the latency of sampled traces in microseconds. Every time a sampled trace is processed, its timestamp is added to a histogram and fresh latency values are calculated and logged: the median (p50), 95th percentile (p95), 99th percentile (p99), and 99.9th percentile (p999). The values inside each file ending in *_latencies.txt are described below.

  • dispatcher_latencies.txt: reports the time it takes for a message to be fully processed when sending data downstream.
  • principal_latencies.txt: reports the time it takes for user code to process a message after it leaves the incoming queue.
[2015-10-23 11:08:08.318] [principal_latencies] [info] 127.0.0.1:31003: traceId: 0, parentId: 0, id: 0, p50: 5141us, p95: 7353us, p99: 9239us, p999: 17194us
[2015-10-23 11:08:08.330] [principal_latencies] [info] 127.0.0.1:31003: traceId: 0, parentId: 0, id: 0, p50: 5142us, p95: 7355us, p99: 9238us, p999: 17193us
[2015-10-23 11:08:08.344] [principal_latencies] [info] 127.0.0.1:31003: traceId: 0, parentId: 0, id: 0, p50: 5143us, p95: 7370us, p99: 9251us, p999: 17191us
[2015-10-23 11:08:08.358] [principal_latencies] [info] 127.0.0.1:31003: traceId: 0, parentId: 0, id: 0, p50: 5144us, p95: 7384us, p99: 9249us, p999: 17189us
[2015-10-23 11:08:08.373] [principal_latencies] [info] 127.0.0.1:31003: traceId: 0, parentId: 0, id: 0, p50: 5145us, p95: 7399us, p99: 9381us, p999: 17188us
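If you want to chart these percentiles rather than eyeball the raw logs, the lines can be parsed with a short script. A minimal sketch; the regex below is derived only from the sample lines above and may need adjusting for other Concord versions:

```python
import re

# Field layout observed in the sample principal_latencies lines above.
LINE_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] \[(?P<logger>[^\]]+)\] \[info\] "
    r"(?P<addr>\S+): traceId: (?P<trace>\d+), parentId: (?P<parent>\d+), "
    r"id: (?P<id>\d+), p50: (?P<p50>\d+)us, p95: (?P<p95>\d+)us, "
    r"p99: (?P<p99>\d+)us, p999: (?P<p999>\d+)us"
)

def parse_latency_line(line):
    """Return a dict with the percentile fields (in microseconds), or None."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    for key in ("p50", "p95", "p99", "p999"):
        rec[key] = int(rec[key])
    return rec

sample = ("[2015-10-23 11:08:08.318] [principal_latencies] [info] "
          "127.0.0.1:31003: traceId: 0, parentId: 0, id: 0, "
          "p50: 5141us, p95: 7353us, p99: 9239us, p999: 17194us")
rec = parse_latency_line(sample)  # rec["p95"] == 7353
```

Feeding every parsed record into a plotting tool gives a quick view of how tail latency evolves over the life of the operator.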

Throughput Metrics

Concord records the total number of messages entering and/or leaving each computation in one-second intervals. The values inside each file ending in *_throughput.txt are described below.

  • outgoing_throughput.txt: reports how many messages per second the operator pushes downstream.
  • incoming_throughput.txt: reports how many incoming messages per second the operator ingests.
[2015-10-23 10:44:07.396] [outgoing_throughput] [info] Initialized
[2015-10-23 10:44:19.964] [outgoing_throughput] [info] 100000 req in 1972786us. total: 100000 req
[2015-10-23 10:44:21.170] [outgoing_throughput] [info] 100000 req in 1205819us. total: 200000 req
[2015-10-23 10:44:22.606] [outgoing_throughput] [info] 100000 req in 1436067us. total: 300000 req
[2015-10-23 10:44:23.843] [outgoing_throughput] [info] 100000 req in 1237254us. total: 400000 req
[2015-10-23 10:44:25.121] [outgoing_throughput] [info] 100000 req in 1278382us. total: 500000 req
[2015-10-23 10:44:26.370] [outgoing_throughput] [info] 100000 req in 1248673us. total: 600000 req
[2015-10-23 10:44:27.813] [outgoing_throughput] [info] 100000 req in 1442747us. total: 700000 req
[2015-10-23 10:44:29.005] [outgoing_throughput] [info] 100000 req in 1192606us. total: 800000 req
[2015-10-23 10:44:30.484] [outgoing_throughput] [info] 100000 req in 1478702us. total: 900000 req
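Each batch line reports a message count and the elapsed time in microseconds, so dividing one by the other gives the instantaneous rate. A small sketch, assuming exactly the format shown above:

```python
import re

# Matches the "<n> req in <t>us. total: <m> req" tail of a throughput line.
BATCH_RE = re.compile(r"(?P<reqs>\d+) req in (?P<us>\d+)us\. total: (?P<total>\d+) req")

def rate_per_second(line):
    """Messages/sec for one throughput log line, or None (e.g. the
    'Initialized' line carries no batch data)."""
    m = BATCH_RE.search(line)
    if m is None:
        return None
    return int(m.group("reqs")) / (int(m.group("us")) / 1_000_000)

line = ("[2015-10-23 10:44:21.170] [outgoing_throughput] [info] "
        "100000 req in 1205819us. total: 200000 req")
rps = rate_per_second(line)  # roughly 82,900 messages/sec
```

For the sample above, the operator sustains on the order of 70,000 to 85,000 outgoing messages per second per batch.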

Zipkin

Concord also integrates with Zipkin, an open-source distributed tracing system. With Zipkin, developers can troubleshoot latency issues and pinpoint the computations that slow down their topologies.

Zipkin is composed of several components: a backing store, a collector service, a query service, and a front-end UI. If you are using the getting-started Vagrant box, Zipkin tracing is already running. Point your web browser at localhost:8080 and you should see the Zipkin web UI:

End-To-End Latency

Zipkin makes it easy to measure the total time your topology takes to process a record. Spans stored in Zipkin's collector service are searchable by computation name. Use the dropdown on the left to see the available spans by computation, and the second dropdown to view traces from the proxy's incoming end (principal) or outgoing end (dispatcher). Parameters such as span duration and end time can be supplied to narrow your results. Clicking Find Traces presents a list of traces that match your search criteria.

After selecting a particular span you will be taken to a view of the full trace: a tree of spans rooted at an operator that must have been a data source. Here it is easy to spot bottlenecks in your topology by looking for the span with the longest duration. The total end-to-end latency of the trace appears in the top right-hand corner; it is calculated as the time between the first and last spans in the trace.
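The end-to-end figure is simply the gap between the earliest span start and the latest span end. A sketch of that calculation over hypothetical (start, duration) pairs in microseconds; these tuples are illustrative, not Zipkin's actual wire format:

```python
def end_to_end_latency_us(spans):
    """spans: iterable of (start_us, duration_us) pairs, one per span.
    Returns the time between the start of the earliest span and the
    end of the latest span, i.e. the trace's end-to-end latency."""
    starts = [start for start, _ in spans]
    ends = [start + duration for start, duration in spans]
    return max(ends) - min(starts)

# Hypothetical trace: source -> operator A -> operator B.
trace = [(0, 1500), (1600, 3200), (5000, 900)]
total = end_to_end_latency_us(trace)  # 5900 us
```

Note that the result can be less than the sum of span durations when operators overlap, which is exactly why the trace view is more informative than per-operator numbers alone.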