
3 Reasons Why Spark’s Lazy Evaluation is Useful | by Amal Hasni | Sep, 2022



First, Lazy Evaluation is not a concept Spark invented: it has been around for a while and is just one of many evaluation strategies. In our context, two of them are useful to know:

  • Lazy Evaluation is an evaluation strategy that delays the evaluation of an expression until its value is needed.
  • Eager Evaluation is the evaluation strategy you’re most probably familiar with, as it is used in most programming languages: as opposed to Lazy Evaluation, an expression is evaluated as soon as it is encountered.

Let’s go back to Spark. In Spark, Lazy Evaluation means that you can apply as many transformations as you want, but Spark will not start executing the process until an action is called.

💡 So transformations are lazy but actions are eager.
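
A minimal sketch of this behavior, assuming a local PySpark session:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

# A DataFrame of ids 0..999999
df = spark.range(1_000_000)

# Transformation: returns a new DataFrame instantly, nothing is computed yet
even = df.filter(df.id % 2 == 0)

# Action: only now does Spark actually run a job
print(even.count())  # 500000
```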

Transformations

Transformations are the instructions you use to modify the DataFrame in the way you want, and they are lazily executed. There are two types of transformations, illustrated in the code sketch after the list below:

  • narrow transformations: all the data needed to compute a given partition already lives in that same partition.
    Examples: select, filter
    (Figure: narrow transformation, image by author)
  • wide transformations: the data needed to compute a given partition may live in multiple partitions.
    Examples: groupBy, repartition
    (Figure: wide transformation, image by author)
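
Continuing the sketch above, this is roughly what each type looks like in code:

```python
# Narrow: each output partition is computed from a single input partition,
# so no data moves between executors
narrow = df.filter(df.id > 10).select("id")

# Wide: each group may need rows from every partition, so Spark has to
# shuffle data across the cluster
wide = df.groupBy((df.id % 10).alias("bucket")).count()
```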

Actions

Actions are statements that ask for a value to be computed immediately; they are eager.
Examples: show, count, collect, save

💡 Typically, a transformation takes an RDD/DataFrame and returns another RDD/DataFrame. An action takes an RDD/DataFrame but returns something of a different nature, such as a number, a list, or a file written to storage:

| Transformations | Actions |
| --------------- | ------- |
| select          | show    |
| distinct        | count   |
| groupBy         | collect |
| sum             | save    |
| orderBy         |         |
| where           |         |
| limit           |         |

Before talking about the actual advantages, let’s quickly go over Spark’s Catalyst Optimizer.

When you apply transformations, Spark stores them in a Directed Acyclic Graph (or DAG). You can actually look at the DAG in the Spark UI. Here’s a simplified example of what it might loosely look like:

(Figure: a simplified DAG, image by author)

Once the DAG is constructed, Spark’s Catalyst optimizer performs a set of rule-based and cost-based optimizations to determine first a logical and then a physical plan of execution.

(Figure: logical and physical planning, image by author)

Clearly, this is a very simplified version, but it will be enough for our purpose. If you want more details, you can check out this post on Databricks’s blog.

Amongst the advantages of using lazy evaluation we find these three:

1. Improved Efficiency

Spark’s Catalyst optimizer will group operations together, reducing the number of passes on data and improving performance.

Another advantage of the Catalyst optimizer is that values that won’t be used in the final result will often simply not be computed.

Time for an example. Let’s first define some DataFrame:
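
Something along these lines (the data and column names are just made-up placeholders):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"],
)
```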

Now if we add a column “gender” and then overwrite it immediately afterward:
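
For instance (the literal values are only placeholders):

```python
df2 = (
    df.withColumn("gender", F.lit("unknown"))  # first definition...
      .withColumn("gender", F.lit("female"))   # ...overwritten right away
)
```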

Spark will automatically group these operations together and ignore the first definition, as it is not actually used in the final result. A quick look at the logical vs. physical plan makes this clear:
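
You can reproduce this yourself with explain, which prints the logical and physical plans (the exact output format varies by Spark version):

```python
# The analyzed logical plan still shows both "gender" projections; the
# optimized and physical plans keep only the last one, since the first
# value is never used
df2.explain(extended=True)
```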


All of this also reduces the communication between the driver and the cluster, and speeds up the program.

📔 Check out this blog post for more detailed examples.

2. Better Readability

Since you know that Spark will group operations together and optimize the code behind the scenes, you can organize your program as a series of smaller operations, which improves code readability and maintainability.
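
For example, nothing stops you from splitting a pipeline into small, named steps; Catalyst will still collapse them into a single optimized job. A sketch, reusing the DataFrame from above:

```python
adults = df.filter(F.col("age") >= 18)
renamed = adults.withColumnRenamed("name", "first_name")
result = renamed.select("first_name", "age")

# Only this action triggers execution, as one optimized job
result.show()
```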

3. Memory Management

If Spark’s transformations were eager, you would have to store all the intermediate DataFrames/RDDs somewhere, or, at the very least, memory management would become another of your concerns.

With Lazy Evaluation, Spark only stores intermediate results for as long as they are actually needed.

Obviously, you can circumvent this behavior manually when needed, by caching or exporting the results. But most of the time, intermediate results are exactly that, intermediate, and do not need to be stored.
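
When you do want to keep an intermediate result around, for instance because several parts of the job reuse it, the usual pattern is caching. A quick sketch:

```python
per_age = df.groupBy("age").count()

# Mark the result for caching; nothing is stored yet (cache is lazy too)
per_age.cache()

per_age.count()      # the first action materializes the cache
per_age.show()       # reuses the cached result instead of recomputing

per_age.unpersist()  # free the memory once you're done
```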

It is undeniable that Lazy Evaluation adds a lot of value when it comes to optimization and efficiency. However, it also comes with some drawbacks.

One of them is that lazy evaluation is tricky to combine with features from imperative programming that assume a fixed order of execution. Exception handling is a good example: if an error occurs at runtime, Spark will only surface it once an action runs. And since the order of operations is not guaranteed (due to potential optimizations), it can be hard to know which exact transformation caused the error.
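
A sketch of the pitfall, using a deliberately buggy UDF on the DataFrame from earlier; the try/except around the transformation never fires, because nothing runs at that point:

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(IntegerType())
def risky(age):
    return 100 // (age - 34)  # fails when age == 34

try:
    # No error is raised here: this only adds a step to the plan
    scored = df.withColumn("score", risky(F.col("age")))
except Exception:
    print("never reached at transformation time")

# The ZeroDivisionError only surfaces now, buried in the executor stack
# trace, once an action forces the UDF to run
scored.show()
```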

If you’re new to PySpark and transitioning from Pandas or would simply like to have a nice cheat sheet, you might want to check out this article:

Thank you for sticking this far. Stay safe and see you in the next story 😊!


