Techno Blender
Digitally Yours.

How to Perform Bulk Inserts With SQLAlchemy Efficiently in Python | by Lynn Kwong | Sep, 2022



Learn different ways to insert large numbers of records into the database efficiently in Python

Image by PublicDomainPictures in Pixabay

It’s very convenient to use SQLAlchemy to interact with relational databases, whether with plain SQL queries or with object-relational mappers (ORM). However, when it comes to bulk inserts, namely inserting a large number of records into a table, we often run into performance issues. In this post, we will introduce different ways to perform bulk inserts and compare their performance through a hands-on tutorial. You will get a better understanding of these methods and can choose the one that best suits your practical case.

Install SQLAlchemy

In order to make it a hands-on tutorial that can be followed along, we need to install the libraries required on our computers. It’s recommended to install the packages in a virtual environment so it won’t mess up your system libraries. We will use conda to create a virtual environment because we can install a specific version of Python in the virtual environment:
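The commands below are a sketch of this setup; the environment name and Python version are example choices, not requirements:

```shell
# Create and activate a virtual environment with a specific Python version,
# then install the packages used in this tutorial.
conda create -n sqlalchemy-bulk python=3.10 -y
conda activate sqlalchemy-bulk

pip install SQLAlchemy mysqlclient cryptography ipython
```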

Packages installed for this tutorial:

  • SQLAlchemy — The main package that will be used to interact with a database.
  • mysqlclient — A high-performance driver for the MySQL database. If you encounter some issues installing mysqlclient or using it, you can install PyMySQL instead which has no system dependency issues. Check this post if needed for using PyMySQL as the driver.
  • cryptography — Used by SQLAlchemy for authentication.
  • ipython — Used to execute interactive Python code more conveniently.

Set up a local MySQL server

In this tutorial, we will use a local MySQL server rather than SQLite to make it more similar to practical use cases. The server can be conveniently set up with Docker:
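A sketch of the command; the container name, volume name and root password are example values, but they should stay consistent with the DB URL used later:

```shell
# Start a MySQL 8 container: the named volume persists the data, the root
# password is set via an environment variable, and the high host port 13306
# maps to the default MySQL port 3306 inside the container.
docker run -d \
  --name mysql-bulk-inserts \
  -e MYSQL_ROOT_PASSWORD=root \
  -p 13306:3306 \
  -v mysql-bulk-inserts-data:/var/lib/mysql \
  mysql:8.0
```

You can then create the data schema, which will be named in the DB URL below, with `docker exec -it mysql-bulk-inserts mysql -uroot -proot -e "CREATE DATABASE data"`.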

Note that a volume is attached to the MySQL Docker container so the data can persist even when the container is restarted. Besides, the root password is specified as an environment variable so it can be used for authentication later. Finally, a high port (13306) is assigned for the container so it won’t have potential conflicts with other existing MySQL servers.

Set up database connection

Now let’s set up the database connection metadata which will be used in the tests to be introduced soon. Two context managers are created that yield a Session and a Connection object, respectively. The Session object will be used to perform operations with ORM models and the Connection object for working with SQLAlchemy Core APIs or executing plain SQL queries directly. Some cleanup work is done in the context managers as well so we can run multiple tests consecutively. For a more detailed introduction regarding the SQLAlchemy engine, connection and session, please have a look at this post.

The code snippet for setting up database connection metadata is as follows:
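A minimal sketch of such a setup; the helper names (get_engine, get_session, get_connection) and the explicit engine argument are illustrative choices:

```python
from contextlib import contextmanager

from sqlalchemy import create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

# "data" is the schema name; the root password and the high port 13306
# match the Docker container started in the previous section.
DB_URL = "mysql+mysqldb://root:root@127.0.0.1:13306/data"

Base = declarative_base()  # the ORM model in the next section registers on this


def get_engine(url=DB_URL):
    """Create the engine; requires the mysqlclient driver for the default URL."""
    return create_engine(url)


@contextmanager
def get_session(engine, cleanup=True):
    """Yield a Session; create the tables first and drop them afterwards if cleanup."""
    Base.metadata.create_all(engine)
    session = sessionmaker(bind=engine)()
    try:
        yield session
    finally:
        session.close()
        if cleanup:
            Base.metadata.drop_all(engine)


@contextmanager
def get_connection(engine, cleanup=True):
    """Yield a Connection for the Core-API and plain-SQL tests."""
    Base.metadata.create_all(engine)
    conn = engine.connect()
    try:
        yield conn
    finally:
        conn.close()
        if cleanup:
            Base.metadata.drop_all(engine)
```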

Alert:

  • Use 127.0.0.1 rather than localhost as the hostname in the DB URL above; otherwise there can be connection issues.

Create an ORM class for testing

We will create a simple customers table that has two fields, namely id and name, with id as the primary key, which is auto-incremented by default. The table will be located in the data schema, as specified in the DB URL above. The ORM class for this table is shown below. The table will be created when a session or connection is created by the context managers and will be dropped when the cleanup parameter is True.
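A sketch of the ORM class; the declarative Base is re-declared here so the snippet is self-contained:

```python
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Customer(Base):
    """A minimal model; the table lives in the "data" schema named in the DB URL."""

    __tablename__ = "customers"

    id = Column(Integer, primary_key=True, autoincrement=True)
    name = Column(String(100))
```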

Add the ORM objects one by one

Now let’s add a large number of records to the table with different methods and compare their performances. The first one is Session.add() which is very commonly used when you use ORM to interact with a database.

We will first add 20,000 records to the database without specifying the primary keys:

This test function takes about 5 seconds. The time taken can vary depending on the performance of your computer and can be slightly different each time it’s run. If it’s too fast or too slow, you can fine-tune the num parameter. And if you want to check the inserted data in the database, set cleanup to False.

With ORM, there is a short-cut method Session.add_all() that takes a list of ORM instances as the parameter:

The performance of Session.add() and Session.add_all() should be pretty similar because the data is not saved to the database until you run Session.commit(), which is the real time-limiting step.

Actually, taking 5 seconds to insert 20,000 records can be a major performance issue for an application, and it can be even worse if the database is located on a remote server. There are two main reasons for the low performance:

  • An ORM instance needs to be created for each record.
  • The primary keys and other default values need to be returned to the ORM instances due to the unit-of-work design of ORM.

The second factor is more impactful, which can be demonstrated by providing primary keys for the ORM instances we create:

Alert:

  • If the primary key is auto-incremented and is explicitly specified as here, it must not be zero; otherwise the data may not be inserted successfully. You can change id=idx+1 to id=idx and see if it happens for you as well.

It turns out that performance improves dramatically if primary keys are provided. This is great! However, this is still not the most efficient way to perform bulk inserts with SQLAlchemy, and sometimes it may not be feasible to specify primary keys, so hang in there a bit.

Use Session.bulk_save_objects

SQLAlchemy has some methods specifically designed for bulk operations. For bulk inserts, there are Session.bulk_save_objects() and Session.bulk_insert_mappings(). Session.bulk_save_objects() takes a list of ORM instances as the parameter, similar to Session.add_all(), while Session.bulk_insert_mappings() takes a list of mappings/dictionaries as the parameter. We will use Session.bulk_save_objects() here, and use Session.bulk_insert_mappings() later.

Before starting to use it, we should be aware of two major caveats of Session.bulk_save_objects():

  • Most ORM benefits, like foreign key relationships and automatic updates of the attributes, are not available for the ORM instances passed to this method. If we want these benefits, we should not use this method but use Session.add_all() instead.
  • We should not ask for the primary keys of the inserted ORM instances to be returned; otherwise performance will be dramatically degraded. If we need the primary keys, we should again use Session.add_all() instead.

In the following code snippet, we will perform three tests and compare their performances:

  • Use Session.bulk_save_objects() with primary keys returned.
  • Use Session.bulk_save_objects() without returning primary keys.
  • Use Session.bulk_save_objects() and explicitly specify the primary keys.

When the three tests are run, it shows that when the primary keys are returned, the performance is indeed dramatically degraded. However, different from the case of Session.add_all(), it doesn’t matter much if primary keys are specified for the ORM instances to be saved or not.

Use Session.bulk_insert_mappings

Another SQLAlchemy method for bulk inserts is Session.bulk_insert_mappings(). As the name indicates, a list of mappings (dictionaries in Python) is passed as the parameter for this method. The benefit of using mappings directly is to avoid the overhead of creating ORM instances, which is normally not an issue but can become significant when a large number of ORM instances need to be created and saved.

In the following code snippet, we will perform two tests and compare their performances:

  • Use Session.bulk_insert_mappings() with no primary keys specified.
  • Use Session.bulk_insert_mappings() with primary keys specified.

The above tests are extremely fast. If no primary keys are specified, it’s about two times faster than Session.bulk_save_objects(), which is in turn about five times faster than Session.add_all(). Besides, similar to Session.bulk_save_objects(), it doesn’t matter much whether primary keys are specified for the mappings to be saved.

Use SQLAlchemy Core API

SQLAlchemy ORM models are built on top of the Core APIs. If performance is the only goal, we should insert through the Core APIs directly and avoid the overhead of ORM models altogether.

We can use SQL Expression Language to access the Core APIs of SQLAlchemy. The benefit of using the SQL Expression Language is the ability to access Core APIs directly and thus achieve high performance and at the same time provide a backend/database-neutral language applicable to all types of relational databases. We will introduce the direct usage of plain MySQL queries in the next section.

We can use the __table__ property of the ORM class to access the underlying Table object, which provides the Insert construct. Similar to Session.bulk_insert_mappings(), a list of mappings/dictionaries can be passed to the Insert construct. However, an SQLAlchemy Connection object is used to execute the insert expression, rather than a Session object.

In the following code snippet, we will perform two tests and compare their performances:

  • Use Core API to insert a list of dictionaries with no primary keys.
  • Use Core API to insert a list of dictionaries with primary keys specified.

The above tests are even faster than Session.bulk_insert_mappings(), though not by much, since the overhead of ORM models is avoided completely here. Again, it doesn’t matter much whether primary keys are specified for the mappings to be saved.

Use plain SQL query

If you are old school and only want to work with plain SQL queries, without dealing with the Core API or ORM at all, you can use Connection.exec_driver_sql() to perform bulk inserts. It utilizes the underlying DBAPI directly and has the same performance as using the Core APIs shown above:

Check this post if you want to learn more about executing plain SQL queries in SQLAlchemy.

The code for all the examples can be found here. You can run it directly once you install the libraries and set up the MySQL server.

In this post, different SQLAlchemy methods for bulk inserts are introduced. The code for each is presented in an easy-to-follow manner and their performances are compared systematically.

In conclusion, if you work with plain SQL queries, you don’t need to worry about the performance of SQLAlchemy because it’s calling the underlying DBAPI directly. It’s the query itself that should be optimized.

If you sit in the middle and use neither plain SQL queries nor ORM models, but the so-called Expression Language, you can use the Insert construct, which accesses the Core API directly, to perform bulk inserts very efficiently.

Finally, if you use ORM models and want to access the updated status of the ORM instances after inserts, you should use Session.add_all(). Provide the primary keys if available, because performance can be dramatically improved. On the other hand, if you use ORM models and don’t need to access the updated data, you can use Session.bulk_insert_mappings(), which has efficiency comparable to the Core APIs.


