Guest post originally published on Jaeger Tracing’s blog by Dotan Horovits
Running systems in production involves requirements for high availability, resilience and recovery from failure. When running cloud native applications this becomes even more critical, as the base assumption in such environments is that compute nodes will suffer outages, Kubernetes nodes will go down and microservices instances are likely to fail, yet the service is expected to remain up and running.
In a recent post, I presented the different Jaeger components and best practices for deploying Jaeger in production. In that post, I mentioned that Jaeger uses external services for ingesting and persisting the span data, such as Elasticsearch, Cassandra and Kafka. This is due to the fact that the Jaeger Collector is a stateless service and you need to point it to some sort of storage to which it will forward the span data.
In this post, I’d like to discuss how to ingest and persist Jaeger trace data in production to ensure resilience and high availability, and the external services you need to set up for that. I’ll cover:
- Standard persistent storage for Jaeger with Elasticsearch and Cassandra
- Alternative persistent storage with gRPC plugin
- Handling high load tracing data streams with Kafka
- Jaeger persistence during development with jaegertracing all-in-one
Deploying Jaeger with Elasticsearch, Kafka or other External Services
Jaeger deployments may involve additional services such as Elasticsearch, Cassandra and Kafka. But do these services come as part of Jaeger’s installation and how are these services deployed?
The Jaeger Operator and Jaeger’s Helm chart (see Jaeger’s deployment tools in this post) offer the option of a self-provisioned Elasticsearch/Cassandra/Kafka cluster (in which Jaeger deployment also deploys these clusters), as well as the option of connecting to an existing cluster.
The self-provisioned option offers a good starting point, but you may prefer to deploy these services independently for better flexibility and control over the way these clusters are deployed, managed, monitored, upgraded and secured, in accordance with your team’s DevOps practices. In particular, if you are already running a Kafka or Elasticsearch cluster, it may make more sense to re-use these infrastructure components rather than maintain a separate cluster.
Elasticsearch vs. Cassandra as Jaeger Backend Storage
For production deployments, Jaeger currently provides built-in support for two storage solutions, both of which are very popular open source NoSQL databases: Elasticsearch and Cassandra. The Jaeger collector and query service need to be configured with the storage solution of choice so they can write to it and query it. You can pass the desired storage type and the database endpoint via environment variables. For example, a basic Elasticsearch setup will define the following environment variables:
So which storage backend should you use: Elasticsearch or Cassandra?
The Jaeger team provides a clear recommendation to use Elasticsearch as the storage backend over Cassandra. And they have very good reasons:
- Cassandra is a key-value database, so it is more efficient for retrieving traces by trace ID, but it does not provide the same powerful search capabilities as Elasticsearch. Effectively, the Jaeger backend implements the search functionality on the client side, on top of k-v storage, which is limited and may produce inconsistent results (see issue-166 for more details). Elasticsearch does not suffer from these issues, resulting in better usability. Elasticsearch can also be queried directly, e.g. from Kibana dashboards, and provide useful analytics and aggregations.
- Based on past performance experiments we observed single writes to be much faster in Cassandra than Elasticsearch, which might suggest that it may sustain higher write throughput. However, because the Jaeger backend needs to implement search capability on top of k-v storage, writing spans to Cassandra is actually subject to large write amplification: in addition to writing a record for the span itself, Jaeger performs extra writes for service name and operation name indexing, as well as extra index writes for every tag. In contrast, saving a span to Elasticsearch is a single write, and all indexing takes place inside the ES node. As a result, the overall throughput to Cassandra is comparable with Elasticsearch.
One benefit of Cassandra backend is simplified maintenance due to its native support for data TTL. In Elasticsearch the data expiration is managed through index rotation, which requires additional setup (see Elasticsearch Rollover).
Alternative Persistent Storage for Jaeger
In addition to Jaeger’s built-in support for Elasticsearch and Cassandra, Jaeger supports a gRPC plugin (
SPAN_STORAGE_TYPE=grpc-plugin) which enables developing custom plugins to other storage types. The Jaeger community currently offers integrations with several persistent storage types, four of which are defined as ‘available’ at present: ScyllaDB, InfluxDB, Couchbase and Logz.io (disclaimer: I work at Logz.io).
Other integrations, which are not yet available, include NoSQL data stores from the big cloud vendors such as Amazon DynamoDB, Azure CosmosDB and Google BigTable, as well as popular SQL databases MySQL and PostgreSQL. You can check out the list of additional storage backends and updated status on this Jaeger GitHub issue.
Using Kafka to Ingest High-Load Jaeger Span Data
If you monitor many microservices, if you have a high volume of span data, or if your system generates data bursts on occasions, then your external backend storage may not be able to handle the load and may become a bottleneck, impacting the overall performance. In such cases you should employ the streaming deployment strategy that I mentioned in the previous post which uses Kafka between the Collector and the storage to buffer the span data from the Jaeger Collector.
In this case, you configure Kafka as the target for Jaeger Collector (
SPAN_STORAGE_TYPE=kafka ) as well as the relevant Kafka brokers, topic and other parameters.
I’d like to stress that Kafka is not an alternative backend storage (although the setting
SPAN_STORAGE_TYPE=kafka may be confusing). Your Jaeger backend still needs a backend storage as described in the previous sections, with Kafka serving as a buffer to take off pressure.
To support the streaming deployment Jaeger project also offers the Jaeger Ingester service, which can asynchronously read from Kafka topic and write to the storage backend (Elasticsearch or Cassandra). Of course, you can choose to implement your own service to do the same, if you need a particular target storage or ingestion strategy.
Jaeger Persistence During Development with jaegertracing All-in-One
Up till now I discussed production deployment. However, if you are exploring Jaeger or are doing a small PoC or development, then you are probably using Jaeger’s All-in-One installation, and you may be wondering how this is applicable to you.
All-in-one is a single node installation, in which you don’t trouble yourself with non-functional requirements such as resilience or scalability. In an all-in-one deployment, Jaeger uses in-memory persistence by default. Alternatively, you can choose to use Badger, which provides a single-node storage based on a filesystem (similar to Prometheus model). You can find more details on using Badger here.
Bear in mind that both in-memory and Badger are meant for all-in-one deployments only, and are not suitable for production deployments.
When deploying Jaeger in production, you need to address data persistence, high availability and scalability concerns. In order to address these concerns you need to deploy additional services.
First of all, you should deploy and configure an external persistence storage for your span data. The recommended persistence storage for Jaeger in production is Elasticsearch.
Secondly, when dealing with high load of span data, you should deploy Kafka in front of the storage to handle the ingestion and provide backpressure.
Running in production entails many other considerations not covered in this post, such as upgrades to Jaeger components as well as Elasticsearch, Kafka or any additional service in the deployment; monitoring the different services, and securing access to these services.
Thanks to Juraci Paixão Kröhling and Yuri Shkuro.
Dotan Horovits (@horovits) is a CNCF speaker, a co-organizer of the Israeli chapter of CNCF, and a developer advocate at Logz.io – Twitter LinkedIn
Logz.io provides a cloud observability platform based on ELK, Prometheus and Jaeger.