
Custom Kafka metrics using Apache Spark PrometheusServlet
by Vitor Teixeira | Feb 2023



Photo by Christin Hume on Unsplash

In this blog post, I will describe how to create and enhance the current Spark Structured Streaming metrics with Kafka consumer metrics and expose them using the Spark 3 PrometheusServlet, which can be targeted directly by Prometheus. In previous Spark versions, one had to set up either a JmxSink/JmxExporter, a GraphiteSink/GraphiteExporter, or a custom sink that pushes metrics to a PushGateway server. In other words, we could not avoid increasing the complexity of our solutions, since external components had to be deployed and integrated with our applications before Prometheus could scrape them.

Motivation

More than ever, observability is a must when it comes to software. It allows us to get insights into what is happening inside the software without having to interact with the system directly. One way of building upon this observability pillar is by exposing application metrics. When fed into an observability stack, they allow us to detect problems, either through alerts or by simply looking at a dashboard, and to find their root cause by analyzing the metrics.

Apache Spark applications are no different. It is true that one can access the Spark Web UI and gather insights into how an application is running, but when the number of applications grows ten- or a hundredfold, it becomes hard to troubleshoot them. That is when an observability tool like Grafana comes in handy. Grafana can connect to Prometheus as a data source, and Prometheus integrates seamlessly with our applications by scraping the PrometheusServlet.

When configured, Apache Spark exposes several metrics natively, which are detailed here. In Structured Streaming, no metrics are exposed by default unless we set "spark.sql.streaming.metricsEnabled" -> "true". Below is an example of the metrics that are exposed on a Kafka streaming job:

Default Spark Structured Streaming metrics

As we can see, these metrics are very generic and do not provide any detailed information about our source.
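For reference, enabling these default streaming metrics is a one-line configuration; a minimal sketch on an existing SparkSession (assumed here to be called spark) looks like this:

```scala
// Enable the built-in Structured Streaming metrics (reported to Spark's MetricsSystem).
spark.conf.set("spark.sql.streaming.metricsEnabled", "true")
```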

The goal is to be able to expose Kafka Consumer metrics that help us monitor how our event consumption is going.

Metrics should be quantifiable values that provide real-time insights about the status or performance of the application. In the scope of this article, we’ll be covering the following metrics:

  • Start offsets: The offsets where the streaming query first started.
  • End offsets: The last offsets processed by the streaming query. Tracks the consumer progress of the query.
  • Lead offsets: The latest offsets of the topic the streaming query is consuming. Tracks the evolution of the consumed topic.
  • Lag offsets: The difference between the last processed offsets and the lead offsets of the topic. Tracks how far behind real time the streaming query is.
  • Consumed rate: The consumption rate of the streaming query. It is the sum of the rates of all topics the query is subscribed to.
  • Last record timestamp: The timestamp of the last message consumed from each TopicPartition. It tracks the latency between producer and consumer.

The next step, after defining the metrics, is to create the metric source that will be responsible for exposing the metrics to Spark’s MetricsSystem.

In order to expose the metrics, we need to create a class that extends Source. This is what connects the executors to the driver so that the metrics are passed along as part of the heartbeat process.

KafkaMetricsSource implementation

Make sure to define the gauges as SettableGauge so that they can be updated during execution.
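The implementation itself is embedded in the original post. As a minimal sketch of what such a source can look like (not the author’s exact code; it assumes Dropwizard Metrics 4.2+, where SettableGauge/DefaultSettableGauge live, and places the class under an org.apache.spark package because Spark’s Source trait is package-private):

```scala
package org.apache.spark.metrics.source

import com.codahale.metrics.{DefaultSettableGauge, MetricRegistry}

// Minimal sketch of a custom metrics source; metric and method names are illustrative.
class KafkaMetricsSource extends Source {
  override val sourceName: String = "KafkaMetrics"
  override val metricRegistry: MetricRegistry = new MetricRegistry

  private val gauges = scala.collection.mutable.Map.empty[String, DefaultSettableGauge[Long]]

  // Create and register a settable gauge on first use, reuse it afterwards.
  private def gauge(name: String): DefaultSettableGauge[Long] = synchronized {
    gauges.getOrElseUpdate(name, metricRegistry.register(name, new DefaultSettableGauge[Long](0L)))
  }

  def updateEndOffset(topic: String, partition: Int, offset: Long): Unit =
    gauge(s"endOffset.$topic.$partition").setValue(offset)

  def updateLag(topic: String, partition: Int, lag: Long): Unit =
    gauge(s"lagOffset.$topic.$partition").setValue(lag)

  def updateLastRecordTimestamp(topic: String, partition: Int, epochMillis: Long): Unit =
    gauge(s"lastRecordTimestamp.$topic.$partition").setValue(epochMillis)
}
```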

After the source is defined, all we need to do is instantiate it and register it in Spark’s MetricsSystem.

A simplified version of source registration
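Here is a minimal sketch of that registration, again using the hypothetical KafkaMetricsSource above; SparkEnv’s MetricsSystem is package-private too, hence the package declaration:

```scala
package org.apache.spark.metrics.source

import org.apache.spark.SparkEnv

object KafkaMetricsSource {
  // One instance per JVM (driver or executor), registered with that JVM's
  // MetricsSystem the first time it is accessed.
  lazy val instance: KafkaMetricsSource = {
    val source = new KafkaMetricsSource()
    SparkEnv.get.metricsSystem.registerSource(source)
    source
  }
}
```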

For the full code of the source implementation, you can check:

Now that we have our source in place, all we need to do is make use of it. For that, we need metric values to populate the gauges we just created.

If you have run a Structured Streaming job before, you might have noticed output similar to the following when a streaming query progresses:

By analyzing the output, we can see that we already have most of the metrics available just by parsing the JSON: the start offsets metric comes from startOffset, end offsets from endOffset, lead offsets from latestOffset, and the consumed rate for the source from inputRowsPerSecond. For the lag offsets we won’t use the reported metric values; instead, we can calculate the lag for each TopicPartition from the endOffset and latestOffset values, ending up with a more granular metric.

In order to make use of this information, we can leverage the StreamingQueryListener’s onQueryProgress callback.

The first step is creating a class that extends StreamingQueryListener so that we are able to receive and act upon the progress events. It receives the KafkaMetricsSource we previously created, which is responsible for emitting the metrics.

KafkaOffsetsQueryListener snippet
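Since the snippet is embedded in the original post, here is a simplified sketch of such a listener, assuming the KafkaMetricsSource sketched earlier and using json4s (which ships with Spark) to parse the offset JSON:

```scala
import org.apache.spark.metrics.source.KafkaMetricsSource
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}
import org.json4s._
import org.json4s.jackson.JsonMethods

// Simplified sketch: on every progress event, parse the Kafka offsets reported by
// each source and push end offsets and per-partition lag into the metrics source.
class KafkaOffsetsQueryListener(metrics: KafkaMetricsSource) extends StreamingQueryListener {

  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit =
    event.progress.sources.foreach { src =>
      val end  = parseOffsets(src.endOffset)
      val lead = parseOffsets(src.latestOffset)
      end.foreach { case ((topic, partition), offset) =>
        metrics.updateEndOffset(topic, partition, offset)
        lead.get((topic, partition)).foreach(l => metrics.updateLag(topic, partition, l - offset))
      }
    }

  // Kafka offsets are reported as JSON strings such as {"my-topic":{"0":42,"1":37}}.
  private def parseOffsets(json: String): Map[(String, Int), Long] =
    Option(json).map(JsonMethods.parse(_)).toList.flatMap {
      case JObject(topics) =>
        topics.flatMap {
          case (topic, JObject(partitions)) =>
            partitions.collect { case (p, JInt(offset)) => (topic, p.toInt) -> offset.toLong }
          case _ => Nil
        }
      case _ => Nil
    }.toMap
}
```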

The last step is simply registering the listener on the streams from which we want to receive updates.
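With the sketches above, the wiring on the driver could look like this (spark is the SparkSession; all names are the hypothetical ones from the sketches):

```scala
// Register the source with the driver's MetricsSystem and attach the listener
// to the session's StreamingQueryManager so it sees every query's progress events.
val metricsSource = org.apache.spark.metrics.source.KafkaMetricsSource.instance
spark.streams.addListener(new KafkaOffsetsQueryListener(metricsSource))
```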

If you wish to check the full listener code, you can do so here:

The last metric is the last processed record timestamp for each TopicPartition. This metric is trickier than the previous ones because the information only exists in the Spark executors at processing time.

The only way we can access that information is by creating a dummy Spark expression that works through a side effect. We’ll use the value column from Spark’s Kafka schema as a way to trick Spark into running this transformation.

KafkaTimestampMetrics expression

This expression receives the whole Kafka row, extracts the necessary values, and emits the metric. It returns the value column which, as mentioned above, is what tricks Spark into keeping the expression in the plan.
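The expression code itself is embedded in the original post. As an illustrative stand-in (a plain UDF rather than the author’s custom Catalyst expression), the same side-effect trick can be sketched like this, reusing the hypothetical KafkaMetricsSource.instance so the source also gets registered on each executor:

```scala
import java.sql.Timestamp

import org.apache.spark.metrics.source.KafkaMetricsSource
import org.apache.spark.sql.functions.udf

// Emits the record timestamp as a metric on the executor and passes the value
// column through unchanged, so Spark has to evaluate it.
val recordTimestampMetric =
  udf { (topic: String, partition: Int, timestamp: Timestamp, value: Array[Byte]) =>
    if (timestamp != null) {
      // Lazily registers the source with the executor's MetricsSystem on first use.
      KafkaMetricsSource.instance.updateLastRecordTimestamp(topic, partition, timestamp.getTime)
    }
    value
  }
```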

Expression usage
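With the stand-in above, usage would be along these lines (kafkaDf is an assumed DataFrame read from the Kafka source):

```scala
import org.apache.spark.sql.functions.col

// Replace value with the pass-through result of the metric-emitting function,
// so the side effect survives as long as the value column is selected downstream.
val dfWithMetrics = kafkaDf.withColumn(
  "value",
  recordTimestampMetric(col("topic"), col("partition"), col("timestamp"), col("value"))
)
```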

If we, for instance, wrapped the timestamp column instead and ended up selecting only the value column further down the query, Spark would prune this expression out of the final plan and our metric would never be emitted. If you don’t plan to use the value column (which is rather uncommon), make sure to wrap a column you actually use.

After everything is set up, all we need is to enable the Prometheus endpoint on the UI by setting spark.ui.prometheus.enabled=true (it creates a single endpoint containing both driver and executor metrics) and to configure spark.metrics.conf with the required settings.
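As a concrete example of those settings, passed here as spark.metrics.conf.-prefixed properties on the session builder (an equivalent metrics.properties file works just as well):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-metrics-example") // hypothetical application name
  // Single UI endpoint exposing the application's metrics in Prometheus format.
  .config("spark.ui.prometheus.enabled", "true")
  // Equivalent to metrics.properties entries: register the PrometheusServlet sink.
  .config("spark.metrics.conf.*.sink.prometheusServlet.class",
    "org.apache.spark.metrics.sink.PrometheusServlet")
  .config("spark.metrics.conf.*.sink.prometheusServlet.path", "/metrics/prometheus")
  .getOrCreate()
```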

Running the application and accessing http://localhost:4040/metrics/prometheus will show you all the metrics we’ve previously created alongside the native ones.

Metrics exposed on /metrics/prometheus/ endpoint

There are a few limitations to this new feature. One of them is that this endpoint only exposes metrics whose names start with metrics_ or spark_info. In addition, Spark does not follow the Prometheus naming conventions, and labels aren’t currently supported (not that I know of; if you know a way, hit me up!). This means that we’ll have a lot of different metric names in Prometheus, which might be a bit overwhelming. This can be worked around by relabeling the metrics, but that can be troublesome.

That’s it! You are now able to create and expose custom Kafka metrics using Spark’s native Prometheus integration. Now, all that’s left is to have Prometheus scrape the endpoints and use the metrics to build both pretty dashboards and alarms that give you more visibility into what’s happening inside your applications.

Despite the early limitations of the Prometheus metrics sink, which are understandable given that it is an experimental feature, I believe it will soon be enhanced with more and more customization options.

If you wish to see the full project used in this article, please check:

