
Do Real-Time Data Pipelines Even Exist? | by Jyoti Dhiman | Jul, 2022



Sharing a fresh perspective on real-time data pipelines

Photo by Djim Loic on Unsplash

How often have you heard these terms: real-time data pipelines, real-time data processing, real-time analytics, or just real-time data? They come up often in discussions of interesting and critical use cases such as fault detection, anomaly detection, and many more.

In this article, we will take a look at some commonly used real-time flows and technologies, and ask what it really means to be real-time.

No, it is not a philosophical question, but more of a technical one. I hope this article helps you develop a fresh perspective on real-time data processing (or maybe change mine? 🙂)

Let’s get started.

A typical real-time data flow can look something like this:

Image by Author

Source: the data producer, e.g. Kafka/Kinesis/Solace/an API, etc.
Consumer: something that consumes data from the source and then processes it (if needed); this may be a custom implementation or a technology such as Structured Streaming, Flink, Samza, etc.
Action: something like persisting the data to storage/a database, or powering alerting/visualizations to make quick decisions, etc.

For instance, one of the most common real-time setups I have worked on has a Structured Streaming consumer reading data from a topic and persisting it:

Image by author

Structured Streaming processes data in micro-batches (read this to learn how); with a low batch interval, it can provide low processing latency.
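To make this concrete, here is a minimal PySpark sketch of such a consumer. It assumes the Kafka connector package is available; the broker address, topic name, and output paths are placeholders, not part of the original setup.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-storage").getOrCreate()

# Source: subscribe to a Kafka topic (broker and topic names are placeholders)
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

# Consumer: minimal processing -- Kafka delivers binary key/value columns
parsed = events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")

# Action: persist each micro-batch to storage; the trigger interval controls
# how often a micro-batch is processed (the "batch interval" mentioned above)
query = (
    parsed.writeStream
    .format("parquet")
    .option("path", "/data/events")
    .option("checkpointLocation", "/data/checkpoints/events")
    .trigger(processingTime="10 seconds")
    .start()
)

query.awaitTermination()
```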

Some of the other commonly used technologies in the real-time ecosystem are:

Flink

Spark Structured Streaming could be replaced with something like Flink, which has native support for streaming, provides extremely low latency (link), and can be faster than Structured Streaming for some use cases. Some examples of where Flink can be used are listed here.
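As a rough illustration (not tied to any of the linked examples), here is a minimal PyFlink sketch with made-up sensor data; in a real pipeline the source would be a Kafka or Kinesis connector rather than an in-memory collection.

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Toy source: placeholder records standing in for a real streaming connector
events = env.from_collection(
    [("sensor-1", 21.5), ("sensor-2", 98.3), ("sensor-1", 22.0)]
)

# Each record flows through the operators as it arrives (no micro-batching),
# which is what gives Flink its very low per-event latency
alerts = (
    events
    .filter(lambda e: e[1] > 90.0)
    .map(lambda e: f"anomaly on {e[0]}: {e[1]}")
)

alerts.print()
env.execute("flink-anomaly-sketch")
```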

Kafka

We all know by now how fantastic Kafka is! (Let's take a moment to appreciate this beautiful technology.) It provides great persistence, high throughput, and low latency. Here is an example of a real-time pipeline built with Kafka at Walmart.
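For flavor, here is a tiny producer/consumer pair using the kafka-python client. The broker address, topic name, and consumer group are placeholders; this is only a sketch, not the pipeline from the linked example.

```python
from kafka import KafkaConsumer, KafkaProducer

# Producer side: send an event to a topic (broker and topic are placeholders)
producer = KafkaProducer(bootstrap_servers="broker:9092")
producer.send("clickstream", key=b"user-42", value=b'{"page": "/checkout"}')
producer.flush()

# Consumer side: messages arrive with low latency and, because Kafka persists
# them to disk, can also be replayed later from an earlier offset
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="broker:9092",
    auto_offset_reset="earliest",
    group_id="realtime-demo",
)
for message in consumer:
    print(message.key, message.value, message.timestamp)
```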

Pub/Sub

Pub/sub messaging systems are commonly used for data streaming or messaging use cases and provide extremely low latency. They don't give you persistence, but that might be fine depending on the use case.
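As one example of this style (assuming a Redis-like, fire-and-forget pub/sub rather than a persistent broker), here is a minimal sketch using redis-py. Subscribers only see messages published while they are listening, which is what "no persistence" means in practice.

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Subscriber: receives only messages published while it is subscribed;
# anything sent earlier (or while it is down) is simply lost
pubsub = r.pubsub()
pubsub.subscribe("alerts")

# Publisher: fire-and-forget, the message itself is not persisted
r.publish("alerts", "disk usage above 90%")

for message in pubsub.listen():
    if message["type"] == "message":
        print(message["data"])
        break
```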

Having taken a look at some examples and technologies, my question to you is: is it really real-time?

The keyword is low latency, and my question here is: should low latency be considered real-time?

low latency == real-time?

For instance, wherever Structured Streaming is involved, data is processed in micro-batches. The batch interval may be very small, but it is still some level of batching, so the data is not actually processed in real time but in near real time. Similarly, for other solutions that provide extremely low latency, when should we start calling them real-time rather than near-real-time?
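To see where the batching sits, consider the trigger setting on a Structured Streaming write, reusing the `parsed` stream from the earlier sketch (the one-second interval is just an assumption). However small you make the interval, records still wait for the next micro-batch boundary.

```python
# Micro-batch trigger: a new batch roughly every second -- very low latency,
# but still batching, so near-real-time rather than true per-event processing
query = (
    parsed.writeStream
    .format("console")
    .trigger(processingTime="1 second")
    .start()
)
```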

One approach could be this: if the latency of the pipeline you have designed is low enough that we mortals perceive things as happening in real time, it can be considered real-time (think of Instagram Live, YouTube Live, or live matches). If the latency that is configured, designed, or supported is not that low, i.e. the delay is actually noticeable, the solution becomes near-real-time rather than real-time.

So, it is not just about the technology you use in the design, but also how you use it, how it fits the design, and the load it handles; all of that determines the latency you actually get.

Things can be as fast as you want them to be, but in the end you also have to account for the latency of networks and hops. That is why we always treat the event-creation timestamp, rather than the arrival timestamp, as the source of truth, since there can be delay or latency along the way. (Here is an interesting article on time ordering.)
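This is the event-time versus arrival-time distinction. In Structured Streaming, for example, you would aggregate on a timestamp carried inside the event rather than on processing time, typically with a watermark to bound how late data may arrive. The sketch below assumes a streaming DataFrame `parsed_with_time` that has an `event_time` column set by the producer; both names are illustrative.

```python
from pyspark.sql.functions import window

# `event_time` is assumed to be set when the event was created, not when it
# reached the pipeline; the watermark tolerates events arriving up to
# 10 minutes late due to network hops and retries
counts = (
    parsed_with_time
    .withWatermark("event_time", "10 minutes")
    .groupBy(window("event_time", "5 minutes"), "key")
    .count()
)
```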

Now, I am not here to dismiss real-time infrastructures, only the terminology and the perception around them. For the reasons above, I am very careful about using the term real-time and prefer near-real-time when I don't know the specifics.

So, what are your thoughts?

Nothing is real-time or everything is?

I will be happy to be proven wrong or to discuss other viewpoints as well; feel free to leave a comment or connect with me on LinkedIn.

If you liked this article, please do give a clap, it makes me smile in NEAR REAL-TIME 🙂

Until next time,
JD

