How to introduce the schema in a Row in Spark?
A schema is a structured definition of a dataset: it specifies the field names, the data type of each field, and whether each field may be null. In Spark, the schema of a DataFrame defines the structure of every row in it. Spark relies on the schema for tasks such as filtering, joining, and querying data.
Concepts related to the topic
- StructType: A class that specifies a DataFrame's schema. It holds a list of StructField objects, each of which corresponds to one field (column) in the DataFrame.
- StructField: A class that specifies the name, data type, and nullable flag of a single field in a DataFrame.
- DataFrame: A distributed collection of data organized into named columns. It is similar to a table in a relational database and can be manipulated with SQL-style operations.
Example 1:
Step 1: Load the necessary libraries and functions, and create a SparkSession object
Output:
SparkSession - in-memory
SparkContext
Spark UI
Version v3.3.1
Master local[*]
AppName Schema
Step 2: Define the schema
Step 3: Create a list of employee data with five rows
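The data can be sketched as a plain Python list of tuples. The values are taken from the output table in Step 4; the variable name `data` and the choice of tuples (rather than `Row` objects) are assumptions.

```python
# Each tuple is one employee: (id, name, age), in the same order as the schema.
data = [
    (101, "Sravan", 23),
    (102, "Akshat", 25),
    (103, "Pawan", 25),
    (104, "Gunjan", 24),
    (105, "Ritesh", 26),
]
```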
Step 4: Create a DataFrame from the data and the schema, and print the DataFrame
Output:
+---+------+---+
| id|  name|age|
+---+------+---+
|101|Sravan| 23|
|102|Akshat| 25|
|103| Pawan| 25|
|104|Gunjan| 24|
|105|Ritesh| 26|
+---+------+---+
Step 5: Print the schema
Output:
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
Step 6: Stop the SparkSession
Example 2:
Steps needed
- Create a list of StructField objects, one for each column in the DataFrame.
- Create a StructType object from that list to define the schema of the DataFrame.
- Create a Row object by passing the values of the columns in the same order as the schema.
- Create a DataFrame from the Row object and the schema using the createDataFrame() function.
Creating a DataFrame with multiple columns of different types using a schema.
Output:
+---+------+---+
| id|  name|age|
+---+------+---+
|100|Akshat| 19|
+---+------+---+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)