Getting started with Delta Lake & Spark in AWS— The Easy Way | by Irfan Elahi | Aug, 2022

A step-by-step tutorial to configure Apache Spark and Delta Lake on EC2 in AWS along with code examples in Python

Photo by Joshua Sortino on Unsplash

If you have worked on engineering data lake or lakehouse solutions, chances are that you have employed (or at least heard of) decoupled, distributed computation frameworks running against the scalable storage layer of your data lake platform. Though the list of such computation frameworks keeps growing, Apache Spark has continued to evolve and to consistently prove its robustness in the big data processing landscape. In line with that success, a number of vendors offer varying flavours (e.g. managed, serverless, containerized) of Apache Spark based solutions (e.g. Databricks, Amazon EMR). Additionally, there has been a surge in solutions addressing the ACID limitations of existing data lakes, and you may have heard of Delta Lake, Hudi and Iceberg in this context. All of this is fascinating, but it can be a bit overwhelming for beginners. You may want to start small to explore the potential of Apache Spark on an ACID-compliant data lake, e.g. by having it properly configured on a VM (EC2) in AWS. Or you may want to address specific use-cases that don’t require vendor-based offerings or a fleet of instances for massive distributed processing. Some examples of such use-cases are:

  • You find that your existing ETL processes don’t have SQL-like processing capabilities. For example, if you are using Python, you may be relying mostly on native data structures like lists and dictionaries and on modules like csv and pandas to achieve the desired transformations. You believe that being able to run SQL expressions interchangeably with more abstract and scalable data structures like dataframes could accelerate and simplify your development, e.g. instead of writing many lines of code to read data from a landing directory and split it into multiple target directories based on a column’s value, you can achieve this with a couple of lines of Spark SQL.
  • You find S3 Select to be pretty limiting and want a better way to run SQL on data in S3, in addition to Athena.
  • The velocity of your data is near-real-time, and at any point in time the amount of data being processed fits within the resources (CPU and RAM) of a single VM, so it doesn’t require distributed computation.
  • You want to prototype an Apache Spark and Delta Lake based solution to substantiate its business value by starting small, i.e. by running it on an EC2 instance of your choice, without standing up a full-blown vendor-based solution for your initial prototype. The same reasoning applies if you simply want to learn these technologies.

If any of the above resonates with you, then you will probably find this article helpful, as it explains in a simple way how to get started with Apache Spark and Delta Lake on an EC2 instance in AWS.

Pre-Requisites

To follow along, you will require the following:

  • An AWS Account
  • An EC2 instance (any size works, but I’d suggest at least 2 vCPUs) configured with the following:
    – Python (ideally > 3.8), optionally with a virtual environment configured
    – A JDK (ideally Oracle’s, but OpenJDK works fine as well; I’ve specifically found Oracle JDK 11 to be quite reliable for Apache Spark 3)
  • An S3 bucket (where you will hydrate/store data for your Delta Lake)
  • An IAM role attached to your EC2 instance that allows read/write access to the S3 bucket

Steps

The first step is to install PySpark in your (virtual) environment. At the time of writing, I’ve found PySpark 3.2.2 to be quite stable when used in conjunction with the Delta Lake dependencies, so that is the version I’ll be using in this article.

If you are using pip to install dependencies in your environment, run this:

pip install pyspark==3.2.2

If all goes well, the PySpark module will be installed in your Python environment! Time to use it now.
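
Below is a minimal sketch of the kind of snippet described in the breakdown that follows. The Delta Lake and hadoop-aws versions, the app name and the extra S3A/catalog settings are assumptions (chosen to match PySpark 3.2.2), so adjust them to your environment:

from pyspark.sql import SparkSession

# Maven coordinates of the jars Spark needs: Delta Lake core and the AWS/S3 connector.
# Versions are assumptions that happen to match PySpark 3.2.2; adjust to your Spark build.
packages = "io.delta:delta-core_2.12:2.0.0,org.apache.hadoop:hadoop-aws:3.3.1"

spark = (
    SparkSession.builder
    .appName("delta-on-ec2")  # hypothetical app name
    .config("spark.jars.packages", packages)
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    # Commonly recommended alongside the extension so Delta works with Spark SQL
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Map the s3:// scheme used in this article to the S3A filesystem implementation
    .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
)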

So here is a high-level breakdown of what’s going on in the code snippet:

  • First, we import the SparkSession class from the pyspark.sql module.
  • Next, we specify the dependencies that Spark needs, e.g. to interact with AWS (S3 in our case) and to use Delta Lake core.
  • Finally, we instantiate the SparkSession object, which serves as the entry point for using Spark in our script. A lot happens during the creation of this object, and you may want to go through the Spark documentation to understand the rationale of each setting, but a few key ones are:
    – spark.sql.extensions: extends the SparkSession to use Delta Lake capabilities
    – spark.jars.packages: comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths
    – spark.sql.sources.partitionOverwriteMode: enables dynamic writes to partitioned data, i.e. an overwrite only replaces the partitions present in the current dataframe instead of deleting all existing partitions first

Tip: If you are behind a corporate HTTP/HTTPS proxy, you can add the following config so that Spark can traverse the proxy and download the specified jars/packages from the Maven repo:

.config("spark.driver.extraJavaOptions", "-Duser.timezone=UTC -Dhttp.proxyHost=proxy_hostname -Dhttp.proxyPort=port -Dhttps.proxyHost=proxy_hostname -Dhttps.proxyPort=port")

Once that’s done, the SparkSession object will be ready for use.

Reading data from S3

Let’s work with a sample dataset. For this article, I’ve chosen employment data from the Stats NZ website (refer to the resources section at the end of the article for the licensing aspects of the dataset).

I uploaded the CSV file to a bucket in my account as well.

Here’s how you can read a CSV file with header in Spark:

df = spark.read.option("recursiveFileLookup", "true").option("header","true").csv("s3://your_bucket/test/machine-readable-business-employment-data-mar-2022-quarter.csv")

Just a note: as Spark follows a lazy execution model, at this stage it won’t have loaded any data into memory. Only when you perform an action (e.g. the .count() or .show() functions) will it load the required data into memory for processing.

Let’s have a sneak peek at what the data looks like:

df.show()

and the output will look something like this:

SQL Based Transformations on data

One of the most powerful features of Spark is the ability to run SQL queries on data. This significantly accelerates and simplifies ETL as you can express pretty powerful data transformation logic via SQL quickly instead of writing lengthy/complex code to achieve the same.

To run SQL queries, you’ll need to create a “temporary view” on the data. This view isn’t registered with the integrated catalog/hive-metastore (e.g. Glue) and will just be local to the current session.

To create a temporary view:

df.createOrReplaceTempView("employment_tbl")

Once done, you can run SQL queries against the data sitting on S3. You can do so by using the spark.sql() function:

spark.sql("select Series_title_2,count(*) as count from employment_tbl group by Series_title_2 order by 2 desc").show(truncate=False)

and the output looks like this:

Here we are running a group-by SQL query to count how many times each unique value in the Series_title_2 column appears.

You can also create a new dataframe from this transformation logic:

df_groupby = spark.sql("select Series_title_2,count(*) as count from employment_tbl group by Series_title_2 order by 2 desc")

Writing data in Delta format to S3

Let’s write this transformed dataframe to S3 in Delta format to materialize our “delta-lake”:

df_groupby.write.format("delta").save("s3://your_bucket/test/sample_data/")

Once done, you can verify from S3 (e.g. via the Management Console) that the data has been written in Delta format:

And if you want to read this data back, in Delta format, in Spark:

df = spark.read.format("delta").load("s3://your_bucket/test/sample_data/")

and if you want to see the data:

df.show()

and here’s how it will look:

Performing Updates on the Data

One of the unique propositions of technologies like Delta Lake is the ability to perform ACID-compliant updates/deletes on data in the data lake. Traditionally, in the Hadoop-influenced big data paradigm, this has been a critical limitation, and data engineers had to resort to various workarounds to achieve the desired update/delete effect in their datasets. But now this capability is supported natively in Delta Lake (and in Hudi and Iceberg).

Let’s assume that you want to update all records where Series_title_2 = ‘Other Services’ and change the value to ‘Others’. One of the ways you can do so is shown below.
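
Here is a minimal sketch of what such a snippet can look like with the Delta Lake Python API. It assumes the delta-spark Python bindings are available (e.g. via pip install delta-spark==2.0.0, matching the Delta jar version) and that the table was written to the same path that is read back in the verification step below:

from delta.tables import DeltaTable

# Reference the Delta table by the S3 path where its data is stored
delta_table = DeltaTable.forPath(spark, "s3://your_bucket/test/delta_format/")

# Change Series_title_2 from 'Other Services' to 'Others' for matching rows
delta_table.update(
    condition="Series_title_2 = 'Other Services'",
    set={"Series_title_2": "'Others'"},
)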

In the above snippet, we created a reference to the Delta table at the path where the table data is stored in Delta format, and then we specified the update logic, i.e. change the Series_title_2 column value from ‘Other Services’ to ‘Others’. When executed, this directly applies the updates to the underlying Delta files on S3.

To verify that the updates took place, let’s read the data back into a dataframe and run some SQL queries:

df = spark.read.format("delta").load("s3://your_bucket/test/delta_format/")
df.createOrReplaceTempView("employment_tbl")
spark.sql("select count(*) from employment_tbl where Series_title_2 = 'Other Services'").show()

This count should be zero:

Let’s run another SQL query that performs a group by to verify that ‘Other Services’ is no longer there and has been replaced with ‘Others’:
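
As a sketch, this can reuse the same temporary view and group-by expression as before:

spark.sql("select Series_title_2,count(*) as count from employment_tbl group by Series_title_2 order by 2 desc").show(truncate=False)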

As you can see, the fourth-last entry in the dataframe is ‘Others’; if you compare it with the output of the group-by query that we ran before, you will find that it was ‘Other Services’.

Using a similar approach, you can delete data as well (see the sketch below). I haven’t used SQL expressions to perform the updates and deletes here, but you can easily do that too.
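
As an illustration, a sketch of a delete, reusing the delta_table reference from the update snippet above (the condition is hypothetical), together with the equivalent SQL expression; the SQL form assumes the Delta SQL extension and catalog settings from the SparkSession sketch earlier:

# Delete rows via the Python API
delta_table.delete("Series_title_2 = 'Others'")

# Or express the same as SQL against the path-based table
spark.sql("DELETE FROM delta.`s3://your_bucket/test/delta_format/` WHERE Series_title_2 = 'Others'")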

Thus, if you have been following along with the article, you have completed the whole workflow: reading raw data (in CSV format), doing some transformations, writing data in Delta format, updating the data and then reading it back in Delta format. This underpins all the more advanced transformations that can be achieved via Spark.

It’s worth pointing out that the approach in this article does have a few limitations:

  • The setup isn’t integrated with a centralized Hive metastore or catalog (e.g. Glue). As a result, if you run CREATE TABLE statements, the tables will not be registered in a central catalog and thus won’t be accessible from other Spark instances or tools like Athena. You can configure the integration with the Glue catalog on your own, but in such cases EMR does offer some benefits, as it provides out-of-the-box integration with Glue.
  • The setup isn’t distributed, i.e. it doesn’t leverage multiple VMs to perform distributed computation.

ACID-compliant data lake technologies like Delta Lake and Hudi, and companies like Databricks that offer commercial solutions built on top of them, are gaining good momentum in the data industry, so it will be a good investment of your time to get across this technology. The best way to learn is by doing: if you find a use-case where this can be valuable, start small, do experiments, break things, build stuff and evolve!

Reference

