ML Prediction on Streaming Data Using Kafka Streams | by Alon Agmon | Jul, 2022

Boost the performance of your Python-trained ML models by serving them over your Kafka streaming platform in a Scala application


Suppose you have a robust streaming platform based on Kafka, which cleans and enriches your customers’ event data before writing it to some warehouse. One day, during a casual planning meeting, your product manager raises the requirement to use a machine learning model (developed by the data science team) over incoming data and generate an alert for messages marked by the model. “No problem”, you reply. “We can select any data set we want from the data warehouse, and then run whatever model we want”. “Not exactly”, the PM replies. “We want this to run as real-time as possible. We want the results of the ML model to be available for consumption in a Kafka topic in less than a minute after we receive the event”.

This is a common requirement, and it will only become more common: real-time ML inference on streaming data matters to many customers who have to make time-sensitive decisions based on the model's results.

It may seem that big data engineering and data science play nicely together and that there should be a straightforward solution, but often that is not the case, and using ML for near-real-time inference over heavy data workloads involves quite a few challenges. One of them, for example, is the gap between Python, the dominant language of ML, and the JVM environment (Java/Scala), the dominant environment for big data engineering and data streaming. Another challenge relates to the data platform we are using for our workloads. If you are already working with Spark, then you have Spark MLlib at your service, but sometimes it will not be good enough, and sometimes (as in our case) Spark is not part of our stack or infrastructure.

It's true that the ecosystem is aware of these challenges and is slowly addressing them with new features, but our specific and common scenario currently leaves you with a few common options. One, for example, is to add Spark to your stack and write a PySpark job that adds the ML inference stage to your pipeline. This will offer better Python support for your data science team, but it also means that your data processing flow might take longer and that you need to add and maintain a Spark cluster in your stack. Another option would be to use a third-party model-serving platform that exposes an inference endpoint based on your model. This might help you retain performance, but it might also incur the cost of additional infrastructure while being overkill for some tasks.

The common solution — add a Spark cluster to the stack to run ML inference

In this post, I want to show another approach to this task using Kafka Streams. The advantage of using Kafka Streams for this task is that unlike Flink or Spark, it does not require a dedicated compute cluster. Rather, it can run on any application server or container environment you are already using, and if you are already using Kafka for stream processing, then it can be embedded in your flow quite seamlessly.

While both Spark and Flink have their machine learning libraries and tutorials, using Kafka Streams for this task seems like a less common use case, and my goal is to show how easy it is to implement. Specifically, I show how we can use an XGBoost model, a production-grade machine learning model trained in a Python environment, for real-time inference over a stream of events on a Kafka topic.

This is intended to be a very hands-on post. In Section 2, we train an XGBoost classifier on a fraud detection dataset. We do so in a Jupyter notebook in a Python environment. Section 3 shows how the model's binary can be imported and wrapped in a Scala class, and Section 4 shows how this can be embedded in a Kafka Streams application to generate real-time predictions on streaming data. At the end of the post you can find a link to a repo with the full code described here.

(Note that in many cases I use Scala in a very non-idiomatic way. I do so for the sake of clarity, as idiomatic Scala can sometimes be confusing.)

For this example, we start by training a simple classification model based on the Kaggle credit fraud data set.¹ You can find the full model training code here. The important bit is that after we (or our data scientists) are satisfied with the results of our model, we simply save it in its binary form. This binary is all we need to load the model in our Kafka Streams app.

In this section we start implementing our Kafka Streams application by first wrapping our machine learning model in a Scala object (a singleton), which we will use to run inference on incoming records. This object will implement a predict() method that our stream processing application will use over each of the streaming events. The method will receive a record ID and an array of fields or features and will return a tuple that consists of the record id and the score the model gave it.

XGBoost model loading and prediction in Scala is pretty straightforward (though it should be noted that support for more recent Scala versions might be limited). After the initial imports, we start by loading the trained model into a Booster variable.
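A minimal sketch of what this can look like, assuming the model binary was exported from the Python notebook to a local path (the path and object name here are illustrative):

import ml.dmlc.xgboost4j.scala.{Booster, XGBoost}

object Classifier {
  // Illustrative path to the binary saved by the Python training notebook
  private val modelPath = "/models/xgb_fraud_model.bin"

  // Load the trained model once, when the object is first referenced
  private val booster: Booster = XGBoost.loadModel(modelPath)
}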

Implementing the predict() method is also fairly simple. Each of our events contains an array of 10 features or fields that we will need to provide as input to our model.

The object type that XGBoost uses to wrap the input vector for prediction is a DMatrix, which can be constructed in a number of ways. I will use the dense matrix format, which is based on providing a flat array of floats that represents the model's features or fields; the length of each vector (nCols); and the number of vectors in the data set (nRows). For example, if our model runs inference on a vector with 10 features or fields, and we want to predict one vector at a time, then our DMatrix will be instantiated with an array of floats of length 10, nCols = 10, and nRows = 1 (because there is only one vector in the set).
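Continuing the Classifier sketch above, a predict() method along these lines should do the job (the method lives inside the Classifier object; names are illustrative, and the single 0-to-1 score assumes the binary classifier trained earlier):

import ml.dmlc.xgboost4j.scala.DMatrix

// Inside the Classifier object sketched above
def predict(recordID: String, features: Array[Float]): (String, Float) = {
  // One row (nRows = 1), with as many columns as there are features (nCols = 10 here)
  val input = new DMatrix(features, 1, features.length)
  // predict() returns an array of scores per row; we have a single row and a single score
  val score = booster.predict(input)(0)(0)
  (recordID, score)
}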

That takes care of our Classifier object, which wraps a trained XGBoost ML model. There will be one Classifier object, and its predict() method will be called for each record.

Before we get into the code and details of our streaming application and show how we can use our Classifier on streaming data, it's important to highlight the motivation for using Kafka Streams in such a system and the advantages it brings.

With Spark, for example, distribution of compute is handled by a cluster manager, which receives instructions from a driver application and distributes compute tasks to executor nodes in a dedicated cluster. Each Spark executor is responsible for processing a set of partitions of the data. The power of Kafka Streams (KS) is that although it similarly achieves scale through parallelism, i.e., by running multiple replicas of the stream processing app, it does not depend on a dedicated cluster for that, but only on Kafka. In other words, the lifecycle of the compute nodes can be managed by any container orchestration system (such as K8s) or any other application server, while coordination and management are left to Kafka (and the KS library). This may seem like a minor advantage, but it addresses exactly Spark's greatest pain.

Indeed, unlike Spark, KS is a library that can be imported into any JVM-based application and, most importantly, run on any application infrastructure. A KS application typically reads streaming messages from a Kafka topic, performs its transformations, and writes the results to an output topic. State and stateful transformations, such as aggregations or windowed computations, are persisted and managed by Kafka, and scale is achieved by simply running more instances of your application (limited by the number of partitions the topic has and the consumer policy).

The basis of a KS app is a Topology, which defines the stream processing logic of the application, that is, how input data is transformed into output data. In our case, the topology works as follows.

The topology here is fairly simple. It starts by reading streaming records from the input topic on Kafka, then it uses a map operation to run the model's predict method on each record, and finally it splits the stream, sending record IDs that received a high score from the model to a "suspicious events" output topic and the rest to another. Let's see how this looks in code.

Our starting point is the builder.stream method, which starts reading messages from the inputTopic topic on Kafka. I will explain this in more detail shortly, but note that we serialize each Kafka record's key as a String and its payload as an object of type PredictRequest. PredictRequest is a Scala case class that corresponds to the protobuf schema below. This ensures that integration with message producers is straightforward, but it also makes it easier to generate the de/serialization methods that we are required to provide when dealing with custom objects.

message PredictRequest {
  string recordID = 1;
  repeated float featuresVector = 4;
}

Next, we use map() to call our classifier's predict() method on the array that each message carries. Recall that this method returns a tuple of recordID and score, which is streamed back from the map operation. Finally, we use the split() method to create two branches of the stream: one for results higher than 0.5 and one for the others. We then send each branch of the stream to its own designated topic. Any consumer subscribed to the output topic will now receive an alert for a suspicious record ID in (hopefully) near real time.
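Putting the pieces together, here is a minimal sketch of such a topology. Topic names, the application ID, and the bootstrap servers are illustrative, the implicit RequestSerde shown below is assumed to be in scope, and for brevity I branch with two filter() calls instead of the split() API described above; the effect is the same:

import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
// Import path of the implicit serdes may differ slightly across Kafka Streams versions
import org.apache.kafka.streams.scala.serialization.Serdes._

object FraudDetectionApp extends App {

  // Illustrative topic names
  val inputTopic      = "events"
  val suspiciousTopic = "suspicious-events"
  val regularTopic    = "regular-events"

  val builder = new StreamsBuilder()

  // Read records keyed by String with a PredictRequest payload
  // (relies on the implicit RequestSerde defined below)
  val requests = builder.stream[String, PredictRequest](inputTopic)

  // Run the model on each record, producing (recordID, score) pairs
  val scored = requests.map { (_, request) =>
    Classifier.predict(request.recordID, request.featuresVector.toArray)
  }

  // Branch on the score and write each branch to its own topic
  scored.filter((_, score) => score > 0.5).to(suspiciousTopic)
  scored.filterNot((_, score) => score > 0.5).to(regularTopic)

  // Illustrative application configuration
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fraud-classifier")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  new KafkaStreams(builder.build(), props).start()
}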

One last comment on serialization:

Using custom classes or objects in a KS app written in Scala, either for the key or the value of the Kafka record, requires you to make available an implicit Serde[T] for the type (which includes its serializer and deserializer). Since I used a proto object as the message payload, much of the heavy lifting was done by scalapbc, which "compiles" a proto schema into a Scala class that already contains the important methods to de/serialize the class. Making this implicit val available to the stream method (either in scope or by import) enables this.

implicit val RequestSerde: Serde[PredictRequest] = Serdes.fromFn(
  // serializer
  (request: PredictRequest) => request.toByteArray,
  // deserializer
  (requestBytes: Array[Byte]) => Option(PredictRequest.parseFrom(requestBytes))
)

The requirement for real-time ML prediction is becoming more and more common, and it often imposes quite a few challenges on data streaming pipelines. The most common and solid approaches are usually to use either Spark or Flink, mostly because they have support for ML and for some Python use cases. One of the disadvantages of these approaches, however, is that they usually require maintaining a dedicated compute cluster, which can be too costly or simply overkill.

In this post I tried to sketch a different approach, based on Kafka Streams, which does not require any compute infrastructure beyond your application servers and the streaming platform you are already using. As an example of a production-grade ML model, I used an XGBoost classifier and showed how a model trained in a Python environment can easily be wrapped in a Scala object and used for inference on streaming data. When Kafka is already your streaming platform, a KS application will almost always be competitive in terms of development effort, maintenance, and performance.

Hope this will be helpful!


