
How to Index Elasticsearch Documents with the Bulk API in Python | by Lynn Kwong | Jun, 2022



Learn different ways to index documents in bulk efficiently

Image by PublicDomainPictures in Pixabay

When we need to create an Elasticsearch index, the source data is usually not normalized and cannot be imported directly. The original data can be stored in a database, in raw CSV/XML files, or even obtained from a third-party API. In this case, we need to pre-process the data to make it work with the Bulk API. In this tutorial, we will demonstrate how to index Elasticsearch documents from a CSV file with simple Python code. Both the native Elasticsearch bulk API and the one from the helpers module will be used, and you will learn how to pick the proper tool for indexing Elasticsearch documents in different situations.

Preparations

As usual, I would like to provide all the technical details for setting up a demo system and environment in which you can run the code snippets directly. Running the code yourself is the best way to understand the logic.

Please use this docker-compose.yaml to set up a local Elasticsearch server with Docker. To learn more about how to run Elasticsearch and Kibana on Docker, please check this post.
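Since the linked docker-compose.yaml is not reproduced here, a minimal single-node sketch along the following lines should work for a local demo. The image tag and the disabled security are assumptions suitable for local testing only, not the author's exact file:

```yaml
version: "3.8"

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.2.2
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false   # demo only; keep security enabled in production
    ports:
      - "9200:9200"

  kibana:
    image: docker.elastic.co/kibana/kibana:8.2.2
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch
```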

Then we need to create a virtual environment and install the Elasticsearch Python client library, choosing a version compatible with that of the Elasticsearch Docker image. We will install the latest version 8 client.
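For example, something along these lines (the exact version pin is an assumption; pick a client version that matches your Elasticsearch image):

```bash
python -m venv venv
source venv/bin/activate
pip install "elasticsearch>=8,<9"
```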

It’s better to use the latest version when you are getting started with Elasticsearch. On the other hand, if you need to upgrade your Elasticsearch library from version 7 to 8, have a look at this post, which will very likely save you a lot of effort on code updates.

Create the index in Python

We will create the same laptops-demo index as demonstrated in this post. However, the syntax will be different since we are using Elasticsearch 8 in this example. First of all, we will use the Elasticsearch client to create an index directly. In addition, the settings and mappings will be passed as top-level parameters, rather than through the body parameter, as explained in this post.

The configurations can be found here, and the command to create the index is:
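Since the linked configuration is not reproduced here, the sketch below only shows the general shape of the call with the version 8 client, where settings and mappings are top-level parameters. The field names and types are assumptions based on the laptop data, not the author's exact mapping:

```python
from elasticsearch import Elasticsearch

# Connect to the local Elasticsearch server started with Docker.
es = Elasticsearch("http://localhost:9200")

# The real settings/mappings are in the linked configuration file;
# the fields below are assumptions based on laptops_demo.csv.
settings = {"number_of_shards": 1, "number_of_replicas": 0}

mappings = {
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "text"},
        "brand": {"type": "keyword"},
        "price": {"type": "float"},
    }
}

# In client version 8, settings and mappings are passed as top-level
# parameters instead of through the deprecated `body` parameter.
es.indices.create(index="laptops-demo", settings=settings, mappings=mappings)
```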

Now that the index is created, we can start adding documents to it.

Use the native Elasticsearch bulk API

When you have a small dataset to load, using the native Elasticsearch bulk API is convenient because the syntax is the same as that of native Elasticsearch queries, which can be run directly in the Dev Tools console. You don’t need to learn anything new.

The data file (dummy data created by the author) that will be loaded can be downloaded from this link. Save it as laptops_demo.csv, which will be used in the Python code below:
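A sketch of the idea is shown below; the id column name and the overall CSV structure are assumptions about the file:

```python
import csv

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The column names (e.g. "id") are assumptions about laptops_demo.csv.
with open("laptops_demo.csv", newline="") as f:
    reader = csv.DictReader(f)

    # The native bulk format alternates an action line with a document line.
    operations = []
    for row in reader:
        operations.append({"index": {"_index": "laptops-demo", "_id": row["id"]}})
        # Values stay as strings here; Elasticsearch coerces numeric fields by default.
        operations.append(row)

# In client version 8, the request body is passed via `operations`
# rather than the deprecated `body` parameter.
response = es.bulk(operations=operations)
print(f"Errors during bulk indexing: {response['errors']}")
```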

Note that we use the csv library to read data from the CSV file conveniently. As can be seen, the syntax of the native bulk API is very straightforward and can be used across different languages (including the Dev Tools console), as shown in the official documentation.

Use Bulk helpers

A problem with the native bulk API, as demonstrated above, is that all the data needs to be loaded into memory before it can be indexed. This can be problematic and very inefficient when we have a large dataset. To solve this problem, we can use the bulk helper, which can index Elasticsearch documents from iterators or generators. Since it doesn’t need to load all the data into memory first, it is very memory-efficient. However, the syntax is a bit different, as we will see soon.

Before we index documents with the bulk helper, we should remove the documents in the index to confirm that the bulk helper indeed works successfully:
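One way to do this, assuming the es client from the earlier sketches, is delete_by_query with a match_all query:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Delete every document in the index; refresh so the count is updated immediately.
es.delete_by_query(index="laptops-demo", query={"match_all": {}}, refresh=True)

print(es.count(index="laptops-demo")["count"])  # should print 0
```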

Then we can run the following code to load the data to Elasticsearch with the bulk helper:
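A sketch of the approach, again assuming the laptops-demo index and the id column from before:

```python
import csv

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")


def generate_docs(file_path: str):
    """Yield one indexing action per CSV row instead of loading the whole file."""
    with open(file_path, newline="") as f:
        for row in csv.DictReader(f):
            yield {
                "_index": "laptops-demo",
                "_id": row["id"],  # column name is an assumption
                "_source": row,
            }


# Remember to call the generator function to get a generator object.
success_count, errors = helpers.bulk(es, generate_docs("laptops_demo.csv"))
print(f"Indexed {success_count} documents, errors: {errors}")
```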

In fact, the code is simpler than with the native bulk API. We only need to specify the document to be indexed, not the action to be performed. Technically, you can specify other actions with the _op_type parameter, such as delete or update, though these are less commonly used, as illustrated below.
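For illustration, hypothetical delete and update actions could look like this (es is the client created earlier):

```python
from elasticsearch import helpers

# Hypothetical actions showing other _op_type values (not used in this tutorial).
delete_action = {"_op_type": "delete", "_index": "laptops-demo", "_id": "1"}
update_action = {
    "_op_type": "update",
    "_index": "laptops-demo",
    "_id": "2",
    "doc": {"price": 999.0},  # partial document for the update
}

helpers.bulk(es, [delete_action, update_action])
```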

Notably, for the bulk helper, a generator function is defined that yields the documents to be indexed. Note that we need to call that function to actually create a generator object. With the generator, we don’t load all the documents into memory first; instead, they are generated on the fly and thus don’t consume as much memory. If you need to speed up indexing further, you can use the parallel_bulk helper, which uses multiple threads to accelerate the indexing process.
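A minimal sketch, reusing the generate_docs generator function from the previous snippet; the thread_count and chunk_size values are arbitrary assumptions:

```python
from elasticsearch import helpers

# parallel_bulk is lazy: iterate over it to actually send the requests.
# `es` and `generate_docs` come from the bulk-helper sketch above.
for ok, result in helpers.parallel_bulk(
    es,
    generate_docs("laptops_demo.csv"),
    thread_count=4,   # number of worker threads (assumed value)
    chunk_size=500,   # documents per bulk request (assumed value)
):
    if not ok:
        print(result)
```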

In this post, two different ways to index documents in bulk in Python have been introduced: the native bulk API and the bulk helpers. The former is suitable for small datasets that don’t consume much memory, while the latter is better for large datasets that are too heavy to load at once. With these two tools, you can conveniently index documents in Python for all kinds of datasets.

