
Real-Time Typeahead Search with Elasticsearch (AWS OpenSearch)

by Zhou (Joe) Xu | Jun 2022



An end-to-end example of building a scalable and intelligent search engine on the cloud with the MovieLens dataset

Typeahead Example of Searching in Google. Image by Author

· 1. Introduction
· 2. Dataset Preparation
· 3. Setting up OpenSearch
· 4. Index data
· 5. Basic Query with Match
· 6. Basic Front-end Implementation with Jupyter Notebook and ipywidgets
· 7. Some Advanced Queries
7.1 Match Phrase Prefix
7.2 Match + Prefix with Boolean
7.3 Multi-field Search
· 8. Conclusion
· About Me
· References

Have you ever thought about how Google makes its search engine so intelligent that it can predict what we think and autocomplete the whole search term even without us typing the whole thing? It is called typeahead search. It is a very useful language prediction tool that many search interfaces use to provide suggestions for users as they type in a query. [1]

As a data scientist, or anyone who works on the backend of the data, you may sometimes want such an interactive search interface so that users can query structured or unstructured data with minimal effort. This can take the user experience to the next level.

Luckily, we don’t have to build it from scratch. There are many open-source tools ready to be used, and one of them is Elasticsearch.

Elasticsearch is a distributed, free and open search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured. Known for its simple REST APIs, distributed nature, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, a set of free and open tools for data ingestion, enrichment, storage, analysis, and visualization. [2]

On the other hand, AWS OpenSearch, created by Amazon, is a fork of Elasticsearch that fits into the AWS ecosystem. It has a very similar interface and underlying structure to Elasticsearch. In this post, to spare you the process of downloading, installing, and setting up Elasticsearch on your local machine, I will instead walk you through an end-to-end example of indexing and querying data using AWS OpenSearch.

In the real world, another great reason to use such cloud services is scalability. We can easily adjust the resources we need to accommodate any data complexity.

Please bear in mind that even though we use AWS OpenSearch here, you can still follow the steps in Elasticsearch if you already have it set up. These tools are very similar in nature.

In this example, we are going to use the MovieLens 20M Dataset [3], which is a popular open movie dataset used by many data professionals in various projects. It is called 20M because the dataset includes 20 million ratings. It also includes 465,000 tag applications, 27,000 movies, and 138,000 users.

This dataset contains several files and can be used for very complex examples, but since we only want to build a movie search engine that can query movie titles, years, and genres, we only need one file: movies.csv.

This is a very clean dataset. The structure is shown below:

movies.csv (MovieLens 20M). Image by Author

There are only 3 fields: movieId, title (with the year in parentheses), and genres (separated by |). We are going to index the dataset using title and genres, but there are movies without genres specified (e.g., movieId = 131260), so we may want to replace these genres with NA to prevent them from being matched by unwanted genre keywords. A few lines of processing suffice:

import pandas as pd
import numpy as np

df = pd.read_csv('../data/movies.csv')

# Replace the placeholder genre with NaN so it cannot be matched
# as a searchable keyword later
df['genres'] = df['genres'].replace('(no genres listed)', np.nan)

df.to_csv('../data/movies_clean.csv', index=False)

With this short chunk of code, we have cleaned up the dataset and saved it as a new file called movies_clean.csv. Now we can go ahead and spin up an AWS OpenSearch domain.

Here is the official documentation for AWS OpenSearch [4]. You can follow it for a more detailed introduction, or read through the simplified version below.

If you don’t have an AWS account, you can follow this link to sign up for AWS. You also need to add a payment method for AWS services. Don’t panic, though: in this tutorial we will use minimal resources, and the cost should be no more than $1.

Sign up for AWS. Image by Author

After your account is created, simply log into your AWS management console, and search for the OpenSearch service, or click here to go into the OpenSearch dashboard.

OpenSearch Dashboard. Image by Author

In the dashboard, follow the steps below:

  1. Choose Create domain.
  2. Give a Domain name.
  3. In Deployment type, select Development and testing.
AWS OpenSearch Setup. Image by Author

4. Change Instance type to t3.small.search, and keep all others as default.

AWS OpenSearch Setup. Image by Author

5. For the simplicity of this project, in Network, choose Public access.

6. In Fine-grained access control, create the master user by setting a username and password.

AWS OpenSearch Setup. Image by Author

7. In Access policy, choose Only use fine-grained access control.

AWS OpenSearch Setup. Image by Author

8. Leave all the other settings as default and click Create. The domain can take 15–30 minutes to spin up, but it is usually faster in my experience.

AWS OpenSearch or Elasticsearch is intelligent enough to automatically index any data we upload, after which we can write queries with any logical rules against it. However, some preprocessing is helpful to simplify our queries.

As we recall, our data consists of 3 columns:

movies.csv (MovieLens 20M). Image by Author

Both titles and genres matter to us, since we may want to enter keywords from either or both of them to find the movie we want. Multi-field search is supported in OpenSearch, but for simplicity we can also preprocess the data by putting all the keywords of interest into one dedicated column, which increases efficiency and lowers query complexity.

Preprocess to create a new search_index column. Code by Author
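A minimal pandas sketch of this step (the exact implementation is an assumption; the file path carries over from the cleaning step):

import pandas as pd

# Load the cleaned file produced earlier
df = pd.read_csv('../data/movies_clean.csv')

# Replace the '|' separators with spaces, treat missing genres as empty
# strings, then concatenate title and genres into one searchable column
genres_text = df['genres'].fillna('').str.replace('|', ' ', regex=False)
df['search_index'] = (df['title'] + ' ' + genres_text).str.strip()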

Using the preprocessing code above, we add a new column called search_index to the dataframe, containing the title and all the genres:

Dataframe with search_index added. Image by Author

The next step is to convert the data into JSON format so it can be bulk uploaded to our domain. The format required for bulk data upload can be found in the developer guide under Option 2 [4]. It looks like this:

{"index": {"_index": "movies", "_id": "2"}}
{"title": "Jumanji (1995)", "genres": "Adventure|Children|Fantasy", "search_index": "Jumanji (1995) Adventure Children Fantasy"}

where the first line specifies the index name under which the record is saved in the domain, as well as the record id (here I used the movieId column as the unique identifier). The second line includes all the other fields of the record.

The following code is used for the conversion:

Convert from dataframe to JSON. Code by Author
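A sketch of the conversion, following the two-line bulk format shown above (the handling of missing genres is an assumption):

import json
import pandas as pd

# Write one action line plus one document line per movie, matching the
# bulk format shown above; movieId becomes the document _id
with open('../data/movies.json', 'w') as f:
    for _, row in df.iterrows():
        f.write(json.dumps({'index': {'_index': 'movies', '_id': str(row['movieId'])}}) + '\n')
        f.write(json.dumps({
            'title': row['title'],
            # NaN is not valid JSON, so emit None for missing genres
            'genres': None if pd.isna(row['genres']) else row['genres'],
            'search_index': row['search_index'],
        }) + '\n')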

After the conversion, the data is stored in the data folder as movies.json. Now we need to upload it to the domain as below:

Bulk Upload JSON data into the domain. Code by Author
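A sketch of the upload call using the requests library; the endpoint and credentials below are placeholders to be replaced with your own:

import requests

# Placeholders: copy the real endpoint from your domain dashboard
endpoint = 'https://search-my-domain-xxxxxxxx.us-east-1.es.amazonaws.com'
auth = ('master-username', 'master-password')

with open('../data/movies.json') as f:
    payload = f.read()

# The bulk API expects newline-delimited JSON and a trailing newline
response = requests.post(
    endpoint + '/_bulk',
    auth=auth,
    data=payload,
    headers={'Content-Type': 'application/x-ndjson'},
)
print(response)  # expect <Response [200]>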

Note that the endpoint can be found on your OpenSearch domain page. The username and password are the master username and password we set when creating the domain.

Domain dashboard. Image by Author

If it returns a <Response [200]>, then we are good to go. The dataset is successfully uploaded into the AWS OpenSearch domain.

Now, with the data uploaded, we have done all the work on the server side. OpenSearch automatically indexes the data so it is ready for queries. We can now start working on the client side to query the data from the domain.

To read more about the query languages, here are 2 options:

  1. Get started with the AWS OpenSearch Service Developer Guide
  2. There is some very detailed documentation for querying Elasticsearch and OpenSearch in the official Elasticsearch Guide Query DSL [5].

However, we do not need very advanced functionality in this example. We will mostly use the standard match query with some small variations.

Here is a basic example:

Basic Match Query. Code by Author
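A sketch of such a query via the REST API, reusing the endpoint and auth placeholders from the upload step:

import json
import requests

def search(keyword):
    # Standard match query against the title field, top 10 hits
    query = {
        'size': 10,
        'query': {
            'match': {
                'title': keyword
            }
        }
    }
    response = requests.get(
        endpoint + '/movies/_search',
        auth=auth,
        data=json.dumps(query),
        headers={'Content-Type': 'application/json'},
    )
    return response.json()

search('jumanji')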

Here, we write a query to look for any records matching the title “jumanji”, serialize the JSON query as a string, and send it to the domain with the endpoint and credentials. Let’s see the returned result:

{'took': 4,
 'timed_out': False,
 '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 1, 'relation': 'eq'},
          'max_score': 10.658253,
          'hits': [{'_index': 'movies',
                    '_type': '_doc',
                    '_id': '2',
                    '_score': 10.658253,
                    '_source': {'title': 'Jumanji (1995)',
                                'genres': 'Adventure|Children|Fantasy',
                                'search_index': 'Jumanji (1995) Adventure Children Fantasy'}}]}}

As we can see, it returns the record whose title matches jumanji. There is only one matching result in our dataset, with the exact title Jumanji (1995), together with other info such as the id, genres, and search_index.

OpenSearch automatically handles upper/lower case, symbols, and white space, so it finds our record without trouble. In addition, the score indicates how well a returned result matches our query: the higher, the better. In this case, it is 10.658253. If we include the year in the search query, like “jumanji 1995”, the score increases to 16.227726. The score is an important metric for ranking the results when a query returns multiple matches.

For a data scientist, Jupyter Notebook is a good friend, and with the popular ipywidgets library, we can make notebooks very interactive. Here is some code to build a basic GUI that includes a text box (for entering keywords) and a text output (for displaying query results).

There are 5 sections in the code (a minimal sketch follows the list below):

  1. search function: A wrapper around the basic query introduced in the section above. Given an input, it searches and returns the results that contain the input keywords.
  2. bold: A function that uses Markdown to bold the keywords in the results for better visualization.
  3. printmd: A wrapper function to display Markdown in IPython.
  4. text_change: A function that handles widget events. Whenever the value in the text box changes, it executes the search function and returns the top 10 results ranked by score. Timing is also included to show how long each search takes.
  5. The last section defines the widget elements to be displayed: a text box and an output. The widget event is triggered whenever the value in the box changes, and the results are displayed in the output.
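Below is a condensed sketch of those five pieces, assuming the search function from the previous section; the bolding rule is deliberately simplified:

import time
import ipywidgets as widgets
from IPython.display import Markdown, display

def bold(text, keywords):
    # Naive bolding: wrap the first case-insensitive occurrence of each keyword
    for kw in keywords.split():
        idx = text.lower().find(kw.lower())
        if idx >= 0:
            text = text[:idx] + '**' + text[idx:idx + len(kw)] + '**' + text[idx + len(kw):]
    return text

def printmd(text):
    # Render a Markdown string in the notebook output
    display(Markdown(text))

def text_change(change):
    with output:
        output.clear_output()
        start = time.time()
        results = search(change['new'])  # search() from the section above
        printmd(f'*{time.time() - start:.2f} seconds*')
        for hit in results['hits']['hits'][:10]:  # top 10 ranked by score
            printmd(bold(hit['_source']['title'], change['new']))

text_box = widgets.Text(placeholder='Search for a movie...')
output = widgets.Output()
text_box.observe(text_change, names='value')  # fires on every value change
display(text_box, output)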

Here is what it looks like when we execute the code:

Basic Query with GUI. Image by Author

As we type the keyword “jumanji”, the engine keeps searching for the updated text. During the process, three sets of results are returned, for the keywords “j”, “ju”, and “jumanji” respectively. However, the output is empty for “jum”, “juma”, “juman”, and “jumanj”, because the match query treats the entered keywords as exact terms, and no movie in our dataset contains these words.

This does not look like our expected typeahead functionality. To improve the performance, we will look into some advanced query options below, including prefix, boolean, and multi-field searches.

7.1 Match Phrase Prefix

One simple fix to the problem above is to use the match phrase prefix query instead of the match query. The match phrase prefix query returns documents that contain the words of a provided text, in the same order as provided; the last term of the text is treated as a prefix, matching any word that begins with it. [5]

This seems to be what we want: as we enter the query, we don’t have to finish the last term before it can show results.

So let’s modify our query this way!

I will not show the Python code again, because everything stays the same except for a slight modification of the query in the search function: match becomes match_phrase_prefix:

query = {
    'query': {
        'match_phrase_prefix': {
            'title': prefix
        }
    }
}

With the change, here is what it looks like now:

Results of Match Phrase Prefix Query. Image by Author

We tried 3 examples here:

  1. jumanji: Now it’s fixed! As we type, the search engine uses the entered text as a prefix and renders all desired output flawlessly.
  2. harry potter: It does the same thing as the previous example. When we reach the second term “potter”, it starts to show all the Harry Potter movies.
  3. potter harry: However, if we reverse “harry potter” into “potter harry”, it won’t show any results because order matters in the match phrase prefix query.

In practice, we are not always 100% sure that we remember all the keywords in the correct order, so we may want to make the search engine intelligent enough to handle that.

7.2 Match + Prefix with Boolean

To break down the requirements for such a query:

  1. The last term needs to be treated as a prefix
  2. The other terms need to be exact matches
  3. The terms should not need to be in any particular order

Unfortunately, there is no such query type in Elasticsearch/OpenSearch that we can use directly, but the tools are flexible enough that we can implement the logic with boolean queries.

search_prefix function that treats the last term as a prefix while ignoring the order. Code by Author
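A minimal sketch of this logic, reusing the endpoint and credentials from earlier; the exact clause layout is an assumption:

import json
import requests

def search_prefix(text):
    terms = text.split()
    if not terms:
        # Nothing typed yet: return an empty result set
        return {'hits': {'hits': []}}
    if len(terms) > 1:
        # Every term except the last must match exactly (in any order);
        # the last term is treated as a prefix
        clauses = [{'match': {'title': t}} for t in terms[:-1]]
        clauses.append({'prefix': {'title': terms[-1]}})
        query = {'size': 10, 'query': {'bool': {'must': clauses}}}
    else:
        # A single word is simply treated as a prefix
        query = {'size': 10, 'query': {'prefix': {'title': terms[0]}}}
    response = requests.get(
        endpoint + '/movies/_search',
        auth=auth,
        data=json.dumps(query),
        headers={'Content-Type': 'application/json'},
    )
    return response.json()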

The code shown above is the updated search function. As we can see, the query becomes much longer after incorporating the boolean operator must:

  • If we enter a search text that contains more than one word, then we will treat the last term as a prefix, AND the previous terms are exact matches.
  • If the search text is only one word (or less), then we will just treat it as a prefix to search.

After implementing the changes, let’s see the results:

Results of Compound Query using Prefix and Boolean Statements. Image by Author

Now, even when we enter the terms in a completely mixed order, “potter harry prince blood”, the search engine knows what we are looking for and returns the correct result: Harry Potter and the Half-Blood Prince (2009).

7.3 Multi-field Search

Until now, we have only searched the movie title field, but we can make the search engine even smarter by leveraging hidden fields, such as genres in our case.

To search across multiple fields, there is the multi-match query in the Elasticsearch user guide. However, on its own it does not let us search prefixes while ignoring word order, as we just did. We still need booleans to build a compound query.

There is a shortcut: remember that we previously processed the dataset and prepared a combined search_index column including both title and genre information. With this field, we can easily modify our previous query to include the hidden genre information without explicitly using multi-match queries.

search_prefix function that does multi-field search without explicitly using multi-match. Code by Author
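In terms of the sketch above, the only change is the field name inside the clauses:

# Same bool/must structure, but against the combined search_index field
clauses = [{'match': {'search_index': t}} for t in terms[:-1]]
clauses.append({'prefix': {'search_index': terms[-1]}})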

Here is the modified search_prefix function. If you look closely, there is no change from the last one, except that title is replaced with search_index so that we search both titles and genres.

Here is the result after the change:

Results of Multi-field search using both title and genre info. Image by Author

We tried 2 examples:

  1. Let’s say we remember that there is an adventure movie with the keyword “impossible”. We enter “impossible adventure”, and it returns all the Mission: Impossible movies because they are classified as “adventure” movies.
  2. The same goes for Harry Potter. If we enter “potter adventure”, it returns all the Harry Potter movies because they are all “adventure” movies as well.

In this post, we walked through an end-to-end example of building an intelligent typeahead search engine with the following steps:

  • Prepare a movie dataset
  • Set up an AWS OpenSearch domain
  • Bulk upload the dataset to the OpenSearch service for indexing
  • Build a simple typeahead search engine GUI
  • Query data from the domain in several ways, from basic to advanced

The result is impressive: we are able to mimic typeahead search much as Google does. While we type a keyword, the search engine analyzes it and searches the database in near real time (about 0.2 seconds) to render the closest matches and guide us toward further narrowing down the list of results.

We also modified the queries with boolean operations so that keyword order no longer matters, and we can enter values from hidden fields, such as genre names, to narrow down and locate the desired result. Both features make the search engine even more intelligent.

There are still many places in this example that we can potentially improve in the future:

  1. Jupyter Notebook and its ipywidgets are very powerful for building GUIs. Here we only used a text box and an output, but there are many more widgets that could be used, such as buttons to trigger searches, checkboxes or dropdowns to filter results, or select boxes to click on suggested results, as Google offers.
  2. However, ipywidgets still has limited capabilities compared to the toolkits front-end developers use, such as JavaScript frameworks. The workflow shown here is intended for quick testing and development by data scientists who are very familiar with Jupyter Notebooks but have limited front-end experience. For a robust search engine app, this is not the best approach.
  3. The Elasticsearch/OpenSearch query DSL (Domain Specific Language) is a very versatile tool for writing queries that suit almost any need. In this example, we only scratched the surface. If you are interested in learning more, see the complete documentation provided by Elasticsearch [5].
  4. AWS OpenSearch is scalable. In this example, we picked t3.small.search for this 27,000-movie dataset, but we can usually expect better performance if we scale up the resources. In the real world, the amount of data could be of a completely different magnitude, so resources should be provisioned for the actual data size.
  5. Security is a serious topic for any cloud service. In this example, we almost completely ignored the security settings, but in a production environment, we would need to put much more effort into enhancing security.

At the very end of the post, one last reminder: Please remember to delete the AWS OpenSearch domain if you no longer need it, or it will incur unnecessary costs!

Thank you for reading! If you like this article, please follow my channel (really appreciate it 🙏). I will keep writing to share my ideas and projects about data science. Feel free to contact me if you have any questions.

I am a data scientist at Sanofi. I embrace technology and learn new skills every day. You are welcome to reach me from Medium Blog, LinkedIn, or GitHub. My opinions are my own and not the views of my employer.


[1] Dynamic Web — Typeahead Search: https://doc.dynamicweb.com/documentation-9/how-tos/general/implementing-typeahead-search

[2] Elastic — What is Elasticsearch: https://www.elastic.co/what-is/elasticsearch

[3] MovieLens 20M Dataset: https://grouplens.org/datasets/movielens/20m/. F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI: http://dx.doi.org/10.1145/2827872

[4] Amazon OpenSearch Service Developer Guide: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/gsg.html

[5] Elasticsearch Guide — Query DSL: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html

