Techno Blender
Digitally Yours.

Building a Multi-Modal Image Search Application


Machine learning models used to be limited to one type of data at a time. A long-standing goal of the field, however, is to match the human mind’s ability to understand several data modalities at once. Recent models such as GPT-4V have shown that a single model can handle multiple modalities concurrently. This lets developers build AI applications that work across diverse types of data, known as multi-modal applications.

One popular use case is multi-modal image search, which lets users find similar images by analyzing their features or visual content. Rapid advances in computer vision and deep learning have made image search considerably more capable.

In this article, we’re going to build a multi-modal image search application using a model from the Hugging Face library. Before diving into the practical implementation, let’s go over some basics to set the stage for our exploration.

What Is a Multi-Modal System?

A multi-modal system refers to any system that can use more than one mode of interaction or communication. It means a system that can process and understand different kinds of inputs at the same time, such as text, images, voice, and sometimes even touch or gestures, and can also return results in various ways.

For example, GPT-4V, developed by OpenAI, is a multi-modal model that can handle text and image inputs at the same time. When provided with an image accompanied by a descriptive query, the model can analyze the visual content based on the provided text.

What Are Multi-Modal Embeddings?

Multi-modal embedding, an advanced machine-learning technique, is the process of generating a numerical representation of multiple modalities, such as images, text, and audio, in a vector format. Unlike basic embedding techniques, which represent only one single data type in a vector space, multi-modal embedding can represent various data types within a unified vector space. This allows, for example, the correlation of a text description with a corresponding image. With the help of multi-modal embeddings, a system could analyze an image and relate it to relevant textual descriptions or vice versa.

Now, let’s discuss how to develop this project and the technologies we will use.

We will use CLIP, MyScale, and the Unsplash-25k dataset in this project. Let’s look at them in detail.

  • CLIP: You’ll use a pre-trained multi-modal CLIP model, developed by OpenAI and available from Hugging Face. This model will be used to embed both text and images.
  • MyScale: MyScale is a SQL vector database that is used to store and process both structured and unstructured data in an optimized way. You will use MyScale to store the vector embeddings and query the relevant images.
  • Unsplash-25k dataset: The dataset provided by Unsplash contains about 25 thousand images. It includes some complicated scenes and objects.

How To Set up Hugging Face and MyScale

To start using Hugging Face and MyScale in the local environment, you need to install some Python packages. Open your terminal and enter the following pip command:

Download and Load the Dataset

The first step is to download the dataset and extract it locally. You can do that by entering the following commands in your terminal.

photo_id photo_url photo_image_url
xapxF7PcOzU https://unsplash.com/photos/wud-eV6Vpwo https://images.unsplash.com/photo-143924685475…
psIMdj26lgw https://unsplash.com/photos/psIMdj26lgw https://images.unsplash.com/photo-144077331099…

The difference between photo_url and photo_image_url is that photo_url points to the description page of an image, which shows the author and other metadata of the photo. The photo_image_url points to the image file itself, and we will use it to download the image.
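As a minimal sketch of the loading step, the snippet below reads the photo metadata into a pandas data frame and keeps the three columns shown above. The function name load_photos and the file name in the usage comment are assumptions for illustration; adjust the path to wherever you extracted the dataset.

```python
import pandas as pd

# Columns shown in the sample output above.
COLUMNS = ["photo_id", "photo_url", "photo_image_url"]


def load_photos(source):
    """Read tab-separated photo metadata and keep only the columns we need.

    `source` can be a file path or an open file-like object.
    """
    df = pd.read_csv(source, sep="\t", header=0)
    return df[COLUMNS]


# Example (assumed file name; use the path of your extracted metadata file):
# photos = load_photos("photos.tsv000")
# print(photos.head())
```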

Load the Model and Get the Embeddings

After loading the dataset, let’s first load the clip-vit-base-patch32 model and write a Python function that transforms images and text into vector embeddings. This function will use the CLIP model to compute the embeddings.

  • If you provide both an image and text, the code returns a single vector, combining the embeddings of both.
  • If you provide either text or image (but not both), the code simply returns the embeddings of the provided text or image.

Note: We are using a basic way to merge two embeddings just to focus on the multi-modal concept. But there are some better ways to merge embeddings, like concatenation and attention mechanisms.
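The behavior described above can be sketched as follows. This is an illustration under stated assumptions, not necessarily the exact function used in this project: it assumes the "basic way" of merging is a plain average of the two normalized vectors, and it assumes that get_image_features and get_text_features return a tensor of shape (1, 512), as they do in the transformers 4.x releases. The helper names l2_normalize, merge_embeddings, and _load_clip are hypothetical.

```python
import math
from functools import lru_cache

MODEL_NAME = "openai/clip-vit-base-patch32"


def l2_normalize(vec):
    """Scale a vector to unit length so cosine comparisons behave consistently."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def merge_embeddings(image_vec=None, text_vec=None):
    """Return one vector for an image, a text, or both.

    ASSUMPTION: when both are given, they are merged by a plain average.
    """
    if image_vec is None and text_vec is None:
        raise ValueError("Provide an image embedding, a text embedding, or both.")
    if image_vec is None:
        return l2_normalize(text_vec)
    if text_vec is None:
        return l2_normalize(image_vec)
    image_vec, text_vec = l2_normalize(image_vec), l2_normalize(text_vec)
    averaged = [(i + t) / 2 for i, t in zip(image_vec, text_vec)]
    return l2_normalize(averaged)


@lru_cache(maxsize=1)
def _load_clip():
    """Load the model and processor once (downloads weights on first use)."""
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(MODEL_NAME)
    processor = CLIPProcessor.from_pretrained(MODEL_NAME)
    return model, processor


def create_embeddings(image_url=None, query_text=None):
    """Embed an image URL and/or a text query with CLIP."""
    # Heavy imports stay local so the pure helpers above work without them.
    from io import BytesIO

    import requests
    import torch
    from PIL import Image

    model, processor = _load_clip()
    image_vec = text_vec = None
    with torch.no_grad():
        if image_url is not None:
            response = requests.get(image_url, timeout=30)
            response.raise_for_status()
            image = Image.open(BytesIO(response.content)).convert("RGB")
            inputs = processor(images=image, return_tensors="pt")
            # Assumes a (1, 512) tensor is returned (transformers 4.x behavior).
            image_vec = model.get_image_features(**inputs)[0].tolist()
        if query_text is not None:
            inputs = processor(
                text=[query_text], return_tensors="pt", padding=True, truncation=True
            )
            text_vec = model.get_text_features(**inputs)[0].tolist()
    return merge_embeddings(image_vec, text_vec)
```

Normalizing each vector before averaging keeps one modality from dominating the merged vector simply because its raw embedding has a larger magnitude.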

We’ll download the first 1,000 images from the dataset and pass them to the create_embeddings function above. The returned embeddings will then be stored in a new column, photo_embed.
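One way to sketch this step is shown below. The helper name add_embeddings is hypothetical, and the embedding function is passed in as an argument so the snippet does not depend on any particular implementation of create_embeddings. Skipping images that fail to download is also an assumption about how you may want to handle errors.

```python
import pandas as pd


def add_embeddings(df, embed_fn, limit=1000, url_column="photo_image_url"):
    """Embed the first `limit` images and store the vectors in `photo_embed`.

    Rows whose image cannot be embedded (e.g. a failed download) are dropped.
    The input data frame is left unchanged.
    """
    subset = df.head(limit).copy()
    embeddings = []
    for url in subset[url_column]:
        try:
            embeddings.append(embed_fn(url))
        except Exception:  # broad on purpose: skip any image that fails
            embeddings.append(None)
    # dtype=object keeps each embedding as a single list-valued cell.
    subset["photo_embed"] = pd.Series(embeddings, index=subset.index, dtype=object)
    return subset[subset["photo_embed"].notna()].reset_index(drop=True)


# Example, assuming `photos` is the loaded data frame:
# photos = add_embeddings(photos, lambda url: create_embeddings(url, None))
```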

After this process, our dataset is complete. The next step is to create a new table and store the data in MyScale.

Connect With MyScale

To connect the application with MyScale, you’ll need to complete a few steps for setup and configuration.

Once you have the connection details, you can replace the values in the code below:
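A minimal connection sketch, assuming the clickhouse-connect package is installed and that the connection details are kept in environment variables. The variable names (MYSCALE_HOST and so on) and the default port of 443 are assumptions for illustration; use the values shown for your own cluster.

```python
import os


def connection_settings(env=None):
    """Collect MyScale connection details from environment variables.

    The variable names are hypothetical; use whatever your deployment provides.
    """
    env = os.environ if env is None else env
    return {
        "host": env["MYSCALE_HOST"],
        "port": int(env.get("MYSCALE_PORT", "443")),  # assumed default port
        "username": env["MYSCALE_USERNAME"],
        "password": env["MYSCALE_PASSWORD"],
    }


def connect(settings):
    """Open a client connection using the clickhouse-connect package."""
    import clickhouse_connect  # imported lazily; install clickhouse-connect first

    return clickhouse_connect.get_client(**settings)


# Example:
# client = connect(connection_settings())
```

Reading the credentials from the environment keeps them out of the source file, which matters if you share the notebook or commit it to a repository.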

Create a Table

Once the connection is established, the next step is to create a table. Now, let’s first take a look at our data frame with this command:

photo_id photo_image_url photo_embed
wud-eV6Vpwo https://images.unsplash.com/uploads/1411949294… [0.0028754104860126972, 0.02760922536253929, 0…
psIMdj26lgw https://images.unsplash.com/photo-141633941111… [0.019032524898648262, -0.04198262840509415, 0…
2EDjes2hlZo https://images.unsplash.com/photo-142014251503… [-0.015412664040923119, 0.01923416182398796, 0…

Let’s create a table based on the data frame.
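A table matching the three data frame columns could look like the sketch below. The table name default.myscale_photos and the helper names are assumptions; the length constraint of 512 reflects the embedding size produced by clip-vit-base-patch32. The client is expected to be a clickhouse-connect client, whose command method runs a statement that returns no rows.

```python
EMBEDDING_DIM = 512  # output size of clip-vit-base-patch32


def create_table_sql(table="default.myscale_photos", dim=EMBEDDING_DIM):
    """Build a CREATE TABLE statement matching the data frame columns."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        "    photo_id String,\n"
        "    photo_image_url String,\n"
        "    photo_embed Array(Float32),\n"
        f"    CONSTRAINT vector_len CHECK length(photo_embed) = {dim}\n"
        ") ENGINE = MergeTree ORDER BY photo_id"
    )


def create_table(client, table="default.myscale_photos"):
    """Run the DDL through a clickhouse-connect style client."""
    client.command(create_table_sql(table))


# Example:
# create_table(client)
```

The constraint rejects any row whose vector is not exactly 512 values long, so a malformed embedding fails at insert time rather than at query time.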

Insert the Data

Let’s insert the data into the newly created table:

Note: The MSTG algorithm was created by MyScale, which reports it to be faster than other indexing algorithms such as IVF and HNSW.
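The insertion and indexing steps can be sketched as below. The helper names, the batch size, and the index name are assumptions. The ALTER TABLE statement follows the vector index syntax from MyScale’s documentation as best I understand it, with a cosine metric chosen here for illustration, so check it against the current documentation before relying on it.

```python
def insert_in_batches(client, table, rows, column_names, batch_size=100):
    """Insert rows in fixed-size batches and return the number of rows sent."""
    sent = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        client.insert(table, batch, column_names=column_names)
        sent += len(batch)
    return sent


def add_vector_index_sql(table, column="photo_embed",
                         index_name="photo_embed_index", metric="Cosine"):
    """Build the statement that adds an MSTG vector index (MyScale-specific)."""
    return (
        f"ALTER TABLE {table} ADD VECTOR INDEX {index_name} {column} "
        f"TYPE MSTG('metric_type={metric}')"
    )


# Example, assuming `photos` has the three columns shown earlier:
# columns = ["photo_id", "photo_image_url", "photo_embed"]
# rows = photos[columns].values.tolist()
# insert_in_batches(client, "default.myscale_photos", rows, columns)
# client.command(add_vector_index_sql("default.myscale_photos"))
```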

How to Query MyScale

Once the data has been inserted, we are ready to query MyScale and use the multi-modal model to retrieve images. Let’s first try to get a random image from the table.
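Fetching a random row could be sketched like this, assuming a clickhouse-connect client and the hypothetical table name default.myscale_photos. The rand() function is standard ClickHouse SQL, and result_rows is the attribute clickhouse-connect exposes on a query result.

```python
def random_photo_sql(table="default.myscale_photos"):
    """Build a query that returns one randomly chosen photo."""
    return f"SELECT photo_id, photo_image_url FROM {table} ORDER BY rand() LIMIT 1"


def fetch_random_photo(client, table="default.myscale_photos"):
    """Return the id and image URL of one random photo as a dict."""
    result = client.query(random_photo_sql(table))
    photo_id, photo_image_url = result.result_rows[0]
    return {"photo_id": photo_id, "photo_image_url": photo_image_url}


# Example:
# print(fetch_random_photo(client))
```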

How To Get Relevant Images Using Text and Image

As you have learned, a multi-modal model can process multiple data modalities at the same time. Similarly, our model can simultaneously process both images and text, providing relevant images. We will provide the following image along with the text: ‘A man standing on the beach.’

Let’s pass the image URL with the text to the create_embeddings function.
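As a rough sketch, the search itself comes down to embedding the query and ordering the table rows by their vector distance to it. The helper below builds such a query using MyScale’s distance function; the table name, the helper name, and the top_k default are assumptions, and the query vector is passed in so the snippet works with any embedding function.

```python
def similar_photos_sql(query_vector, table="default.myscale_photos", top_k=10):
    """Build a nearest-neighbor query for the given embedding."""
    # Render the vector as a SQL array literal, e.g. [0.5, -1.0].
    vector_literal = "[" + ", ".join(repr(float(x)) for x in query_vector) + "]"
    return (
        "SELECT photo_id, photo_image_url, "
        f"distance(photo_embed, {vector_literal}) AS dist "
        f"FROM {table} ORDER BY dist LIMIT {int(top_k)}"
    )


# Example, assuming `client` is connected and create_embeddings is defined:
# image_url = "..."  # URL of the query image
# query_text = "A man standing on the beach."
# query_vector = create_embeddings(image_url, query_text)
# rows = client.query(similar_photos_sql(query_vector)).result_rows
# for photo_id, photo_image_url, dist in rows:
#     print(photo_id, photo_image_url, dist)
```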

The above code will generate results similar to this:

[Image: result images returned by the above code]

Note: You can further improve the results using better techniques to merge the embeddings.

You may have noticed that the resulting images look like a combination of both the text and the image. You can also get results by providing just an image or just text to this model, and it will work fine. To do that, simply comment out either the image_url or the query_text line of code.

Conclusion

Traditional embedding models produce vector representations of a single data type, whereas multi-modal models are trained on much more data and can represent different types of data in one unified vector space. We used one such model, CLIP, to develop an application that takes both text and images as input and returns relevant images.

The capabilities of multi-modal embeddings are not limited to image search applications. You can also use this technique to develop recommendation systems, visual question answering applications where users can ask questions related to images, and much more. While developing these applications, consider using MyScale, an integrated SQL vector database that lets you store vector embeddings and tabular data from your dataset together and retrieve them quickly.

