Delta Lake — Automatic Schema Evolution | by Vitor Teixeira | Mar, 2023


What happens and what you can/can’t do when merging evolving DataFrames

Photo by McDobbie Hu on Unsplash

In the last post, we covered the transaction log and how to keep Delta Tables fast and clean. This time we will be covering automatic schema evolution in Delta tables.

Schema evolution is a critical aspect of managing data over time. It is very common for data sources to evolve and adapt to new business requirements, which might mean adding or removing fields from an existing data schema. As data consumers, we need to adapt quickly to the new characteristics of our data sources, and automatic schema evolution allows us to do so seamlessly.

In this post, we will cover automatic schema evolution in Delta while using the people10m public dataset that is available on Databricks Community Edition. We’ll test adding and removing fields in several scenarios.

Automatic schema evolution can be enabled in two ways, depending on our workload. If we are doing blind appends, all we need to do is enable the mergeSchema option on the write.
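A minimal PySpark sketch of such an append, assuming a DataFrame `df` that carries the extra columns and a table stored at `path` (the function name and both parameters are placeholders, not taken from the original post):

```python
def append_with_merge_schema(df, path):
    """Append df to the Delta table at path, letting new columns evolve the schema."""
    (
        df.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")  # per-write opt-in to schema evolution
        .save(path)
    )
```

The option only applies to the write that sets it, so other writers to the same table still get schema enforcement unless they opt in as well.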

If we use a merge strategy for inserting data, we need to set spark.databricks.delta.schema.autoMerge.enabled to true.
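In PySpark this is a one-line session setting. The helper and constant names below are only illustrative, and `spark` is assumed to be an active SparkSession:

```python
AUTO_MERGE_KEY = "spark.databricks.delta.schema.autoMerge.enabled"


def enable_auto_merge(spark):
    """Turn on automatic schema evolution for merges in this Spark session."""
    spark.conf.set(AUTO_MERGE_KEY, "true")
```

Unlike the mergeSchema write option, this setting applies to the whole session rather than to a single write.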

In this post, we’ll be using merge so we’ll go with the latter.

We are all set so we can load our Delta Table which should look like this:

Initial dataset

To simulate evolving schemas we will be creating custom DataFrames using a hand-made schema and merging them using Scala’s Delta API.

Disclaimer: All the updates that we will be doing to the schema are just examples and are not meant to make much sense.

Initial DataFrame schema
Simulating and merging new records
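A minimal sketch of such a merge, written with the PySpark Delta API rather than Scala (the Scala call chain is analogous). Matching on `id` and using update-all/insert-all clauses are assumptions based on the scenarios in this post, and `upsert_by_id` is a made-up helper name:

```python
def upsert_by_id(delta_table, updates_df):
    """Merge updates_df into delta_table, matching rows on the id column.

    delta_table is expected to be a delta.tables.DeltaTable, for example
    DeltaTable.forPath(spark, path); updates_df is a Spark DataFrame.
    """
    (
        delta_table.alias("target")
        .merge(updates_df.alias("source"), "target.id = source.id")
        .whenMatchedUpdateAll()     # update every column present in the source
        .whenNotMatchedInsertAll()  # insert ids that are not in the table yet
        .execute()
    )
```

The update-all and insert-all clauses matter here: they are the clauses through which a merge picks up source columns that the target table does not have yet.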

Adding a field

Let’s say our company wants to be nickname friendly and people can be called by their favorite nicknames (how awesome!).

We’ll add a new field to our current schema called nickName and update Pennie’s nickName (id number 1).

Schema with nickName
Adding a new field

As we can see, a new field was added and Pennie can now be called by her new favorite nickname! Notice how the value for all the other records was automatically filled with null.

Removing a field

With the addition of nicknames, everyone started thinking about how no one uses their middle name so they decided to remove it.

Schema without middleName

We’re going to update Quyen’s nickname as well but as a result of the source deleting the field, her middle name won’t be present. What should happen to the table?

Table after deleting middleName

If you guessed nothing, you were right. Every existing record in the target table remains the same; only new records will have middleName as null.
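The add and remove behaviour seen so far can be summarised with a toy model in plain Python. This is only an illustration of the outcome described above, not Delta's implementation; the column list is shortened and the inserted person is made up:

```python
def evolve_columns(target_cols, source_cols):
    """Evolved column list: target columns first, then columns only the source has."""
    return list(target_cols) + [c for c in source_cols if c not in target_cols]


def shape_new_row(table_cols, source_row):
    """Shape a source row for insertion: columns absent from the source become None."""
    return {c: source_row.get(c) for c in table_cols}


table = ["id", "firstName", "middleName", "lastName"]

# Adding a field: the source brings nickName, so the table gains it.
table = evolve_columns(table, ["id", "firstName", "middleName", "lastName", "nickName"])
print(table)
# ['id', 'firstName', 'middleName', 'lastName', 'nickName']

# Removing a field: the source no longer has middleName, the table keeps it.
table = evolve_columns(table, ["id", "firstName", "lastName", "nickName"])
print(table)
# ['id', 'firstName', 'middleName', 'lastName', 'nickName']

# A newly inserted record simply gets null for the column the source dropped.
new_row = shape_new_row(table, {"id": 0, "firstName": "Ana", "lastName": "Silva", "nickName": "Ann"})
print(new_row["middleName"])
# None
```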

To showcase this we’re going to insert a new id (0).

Table after inserting a new record

Renaming a column

Renaming a column is the same as removing a column and adding another with a new name. If you wish to rename a column in place please refer to Delta Column Name Mapping.

I won’t dig further into this topic because, even though it is a form of schema evolution, it is not automatic. Keep in mind that this feature is irreversible: once you turn it on, you can’t turn it off.
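For reference, a minimal sketch of what opting in looks like, based on the table properties listed in the Delta column mapping documentation. The helper name and its arguments are placeholders, `spark` is assumed to be an active SparkSession, and the protocol upgrade in the first statement is the part that cannot be undone:

```python
def rename_column_in_place(spark, table, old_name, new_name):
    """Enable column mapping by name on a Delta table, then rename one column."""
    # Upgrades the table protocol; older readers and writers lose access.
    spark.sql(
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        "'delta.minReaderVersion' = '2', "
        "'delta.minWriterVersion' = '5', "
        "'delta.columnMapping.mode' = 'name')"
    )
    # With column mapping enabled, the rename is a metadata-only change.
    spark.sql(f"ALTER TABLE {table} RENAME COLUMN {old_name} TO {new_name}")
```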

Changing a column type/order

Changing a column type or column order is also not part of automatic schema evolution.

Adding/Removing a field in a struct

Let’s imagine that we have added an employee history struct that includes the startDate and endDate to track when the employee started and left the job.

For a more complete history, we now wish to include the title in order to track the employee’s career in the company.

Updated struct with ‘title’

As we can see, adding a field to a struct is also not an issue. If we try to remove the newly added field, it will also work. Adding and removing fields inside a struct works the same way as it does at the root level.
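One way to picture "the same as at the root level" is a recursive merge over nested schemas. The sketch below is a toy model in plain Python, not Delta's code; schemas are written as dictionaries, and the struct column name `history` is an assumption made for the example:

```python
def merge_struct(target, source):
    """Merge two struct schemas given as {field_name: type}.

    A type is either a string such as "date" or a nested dict for a struct.
    Target fields keep their position, fields that only the source has are
    appended, and fields that the source lacks are kept.
    """
    merged = dict(target)
    for name, source_type in source.items():
        target_type = merged.get(name)
        if isinstance(target_type, dict) and isinstance(source_type, dict):
            merged[name] = merge_struct(target_type, source_type)
        elif name not in merged:
            merged[name] = source_type
    return merged


table = {"id": "int", "history": {"startDate": "date", "endDate": "date"}}
with_title = {
    "id": "int",
    "history": {"startDate": "date", "endDate": "date", "title": "string"},
}

evolved = merge_struct(table, with_title)
print(list(evolved["history"]))
# ['startDate', 'endDate', 'title']

# A later source that drops title again leaves the table schema untouched.
print(merge_struct(evolved, table) == evolved)
# True
```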

Adding/Removing a field in an array of structs

Now we are getting more complex. In this case, we’ll be adding a new field to a struct that is inside an array. Imagine we now have an array of the equipment that currently belongs to an employee.

To showcase the addition of a new field inside the array we’ll be adding a serial_num to the struct so that we can better track the equipment.

Updated struct inside array with ‘serial_num’

As we can see, this also works as expected. The table schema is updated, new records have their respective serial_num, and for older records serial_num is filled with null.
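The null filling for older records can be modelled the same way for arrays of structs. Again this is a plain-Python illustration of the observed result, not Delta's implementation; the `name` field and the equipment values are made up:

```python
def fill_missing(value, schema):
    """Return value shaped like schema, with absent struct fields set to None.

    schema is a string for a primitive type, a dict for a struct, or a
    one-element list for an array of that element type.
    """
    if value is None:
        return None
    if isinstance(schema, dict):
        return {name: fill_missing(value.get(name), t) for name, t in schema.items()}
    if isinstance(schema, list):
        return [fill_missing(item, schema[0]) for item in value]
    return value


# Evolved element type of the equipment array: serial_num was added.
equipment_schema = [{"name": "string", "serial_num": "string"}]

# An older record, written before serial_num existed.
old_record = [{"name": "laptop"}, {"name": "monitor"}]

print(fill_missing(old_record, equipment_schema))
# [{'name': 'laptop', 'serial_num': None}, {'name': 'monitor', 'serial_num': None}]
```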

If we remove the newly added field again, that also works as expected.

Adding/Removing a field in a map of structs

Now it’s time to test the same but inside a map. We have added a new column called connections that will be responsible for holding the hierarchy for each employee.

To simulate an update we’ll be adding a new field called title to the struct inside the connections column.

Updated struct inside map with ‘title’

This time, removing the field throws an AnalysisException, which means that MapType conversions are not well supported.

After a brief investigation, I found that this is due to the castIfNeeded function not supporting MapType yet. I have opened a bug and will try to work on a fix for this issue.

https://github.com/delta-io/delta/issues/1641

In this article, we went through the addition and removal of fields in several different scenarios. We concluded that automatic schema evolution in Delta is very complete and supports most of the complex scenarios. By allowing these scenarios we can avoid having to manually intervene to update our schemas when data sources evolve. This is especially useful when consuming hundreds of data sources.

As a bonus, we also found a case that is not yet supported for MapTypes, which is a great opportunity to give back to the community for such an awesome open-source project.

I hope you liked the read! Make sure to tune in for more!


