Tracing and Metrics
By default each Concord operator logs its performance metrics to local files.
These files make point-to-point latency and throughput metrics accessible
through the Mesos UI. To view these logs, click the sandbox link of the
operator you want to inspect, then open the sole folder. There you should see
the metrics files described below.
Concord records the total average latency of sampled traces in microseconds.
Every time a sampled trace is processed, its timestamp is added to a histogram,
and new latency values are calculated and logged. These values include the current
average, 95th percentile, 99th percentile, and 99.9th percentile. The values
in each file ending in *_latencies.txt are described below.
- dispatcher_latencies.txt: The time it takes for a message to be fully processed when the operator attempts to send data downstream.
- principal_latencies.txt: The time it takes for the user code to process a message after it leaves Concord's incoming queue.
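As a rough illustration of the values reported in these files, the sketch below derives the average and the three percentiles from a list of latency samples in microseconds. It is not Concord's implementation: Concord uses a histogram, whereas this sketch sorts the raw samples and applies the nearest-rank method, and the names `nearest_rank` and `summarize_latencies` are hypothetical.

```python
def nearest_rank(sorted_samples, per_mille):
    """Nearest-rank percentile of a sorted, non-empty list.

    per_mille expresses the percentile in thousandths (950 = 95th,
    999 = 99.9th) so the rank can be computed with integer arithmetic.
    """
    n = len(sorted_samples)
    rank = (per_mille * n + 999) // 1000  # integer ceiling of per_mille * n / 1000
    return sorted_samples[max(rank, 1) - 1]


def summarize_latencies(samples_us):
    """Return average, p95, p99 and p99.9 for latency samples in microseconds.

    Assumption: raw samples are kept in memory; a real histogram would
    trade some precision for bounded memory.
    """
    ordered = sorted(samples_us)
    return {
        "average": sum(ordered) / len(ordered),
        "p95": nearest_rank(ordered, 950),
        "p99": nearest_rank(ordered, 990),
        "p99.9": nearest_rank(ordered, 999),
    }


if __name__ == "__main__":
    print(summarize_latencies([120, 95, 400, 110, 105, 2300, 98, 101]))
```

The high percentiles are the ones to watch: a handful of slow messages barely moves the average but shows up clearly in the 99th and 99.9th percentiles.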
Concord records the total number of messages entering and/or leaving each
computation in one-second intervals. The values in each file ending in
*_throughput.txt are described below.
- outgoing_throughput.txt: How many messages per second the operator can push downstream.
- incoming_throughput.txt: How many incoming messages per second the operator can ingest.
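The idea of counting messages in one-second intervals can be sketched as follows. This is only an illustration of the bucketing, not Concord's code; the function name `messages_per_second` and the assumption that each message carries a timestamp in microseconds are hypothetical.

```python
from collections import Counter


def messages_per_second(timestamps_us):
    """Count how many messages fall into each one-second interval.

    Assumption: timestamps are integers in microseconds. The returned
    dict maps the interval index (whole seconds) to the message count.
    """
    counts = Counter(ts // 1_000_000 for ts in timestamps_us)
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Three messages in the first second, one in each of the next two.
    print(messages_per_second([0, 500_000, 999_999, 1_000_000, 2_500_000]))
```

Applying the same counting to messages entering and to messages leaving a computation gives the incoming and outgoing figures respectively.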
Concord also integrates with Zipkin, an open source distributed tracing system. With Zipkin, developers can troubleshoot latency issues and pinpoint the computations that are slowing down their topologies.
Zipkin consists of multiple components: a backing store, a collector service, a query
service, and a front-end UI. If you are using the getting started Vagrant box, Zipkin
tracing will already be running. Direct your web browser to the Zipkin address and you
should see the Zipkin Web UI.
Zipkin makes it easy to measure the total time it takes for your topology to process a record. Spans stored in Zipkin's collector service are searchable by their computation name. Select the
dropdown on the left to see the available spans by computation. You can use the second dropdown
to view traces from the proxy's incoming end (principal) or outgoing end (dispatcher). Parameters
such as span duration and end time may be provided to limit your results. Running the search
will present you with a list of available traces that match your search criteria.
After selecting a particular span, you will be directed to a window that shows the full trace, which is a tree of spans starting from an operator that must have been a data source. In this trace it is easy to see where the bottlenecks in your topology are: look for the span with the longest duration. The total end-to-end latency of the trace is shown in the top right-hand corner. It is calculated by measuring the time between the first and last spans in the trace.
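The two readings described above, the end-to-end latency and the slowest span, can be sketched in a few lines. The `Span` class and both function names are hypothetical and do not reflect Zipkin's data model or API. The sketch also interprets "the time between the first and last spans" as the gap between the earliest span start and the latest span end, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical, simplified span: one computation's share of a trace.
    name: str
    start_us: int     # start time in microseconds
    duration_us: int  # time spent in this computation


def end_to_end_latency_us(spans):
    """Time from the earliest span start to the latest span end."""
    first_start = min(s.start_us for s in spans)
    last_end = max(s.start_us + s.duration_us for s in spans)
    return last_end - first_start


def slowest_span(spans):
    """The span with the longest duration, i.e. the likely bottleneck."""
    return max(spans, key=lambda s: s.duration_us)


if __name__ == "__main__":
    trace = [
        Span("source", 0, 100),
        Span("parse", 120, 900),
        Span("sink", 1050, 50),
    ]
    print(end_to_end_latency_us(trace))
    print(slowest_span(trace).name)
```

In this made-up trace the `parse` computation accounts for most of the end-to-end time, so it is the first place to look when tuning the topology.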