How to introduce the schema in a Row in Spark?

By Ann Roberts On Jun 9, 2023

A schema is a structured definition of a dataset: it specifies the field names and the data type of each field in a table. In Spark, the schema of a data frame defines the structure of each of its rows. Many common tasks, including filtering, joining, and querying data, depend on the schema.

Concepts related to the topic

StructType: A class that specifies a DataFrame’s schema. It holds a list of StructField objects, each of which corresponds to one field (column) in the DataFrame.
StructField: A class that specifies the name, data type, and nullable flag of a single field in a DataFrame.
DataFrame: A distributed collection of data organized into named columns. It is similar to a table in a relational database and can be manipulated with SQL-style operations.

Example 1:

Step 1: Import the necessary classes and create a SparkSession object

Python3

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import Row

spark = SparkSession.builder.appName("Schema").getOrCreate()
spark

Output:

SparkSession - in-memory
SparkContext
Spark UI
Version: v3.3.1
Master: local[*]
AppName: Schema

Step 2: Define the schema

Python3

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

Step 3: Create a list of employee data with five rows

Python3

data = [[101, "Sravan", 23],
        [102, "Akshat", 25],
        [103, "Pawan", 25],
        [104, "Gunjan", 24],
        [105, "Ritesh", 26]]

Step 4: Create a data frame from the data and the schema, and print the data frame

Python3

df = spark.createDataFrame(data, schema=schema)
df.show()

Output:

+---+------+---+
| id|  name|age|
+---+------+---+
|101|Sravan| 23|
|102|Akshat| 25|
|103| Pawan| 25|
|104|Gunjan| 24|
|105|Ritesh| 26|
+---+------+---+

Step 5: Print the schema by calling df.printSchema()

Output:

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

Step 6: Stop the SparkSession by calling spark.stop()

Example 2:

Steps needed

Create a list of StructField objects representing each column in the DataFrame.
Create a StructType object from that list to define the schema of the DataFrame.
Create a Row object by passing the values of the columns in the same order as the schema.
Create a DataFrame from the Row object and the schema using the createDataFrame() function.

The following code creates a data frame with multiple columns of different types using a schema.

Python3

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import Row

spark = SparkSession.builder.appName("example").getOrCreate()

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

row = Row(id=100, name="Akshat", age=19)

df = spark.createDataFrame([row], schema=schema)
df.show()
df.printSchema()

spark.stop()

Output:

+---+------+---+
| id|  name|age|
+---+------+---+
|100|Akshat| 19|
+---+------+---+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

Last Updated: 09 Jun, 2023
