
CRUD with Pinecone. A simple guide for getting started with… | by Manfye Goh | May, 2023



Vector Database

Photo by Brett Sayles from Pexels

The rapid growth of machine learning applications and advancements in artificial intelligence have spurred the demand for specialized data storage solutions.

Vector databases have emerged as a popular choice for handling large-scale, high-dimensional data because they perform efficient similarity searches and support complex data structures. Pinecone has recently become popular among developers and data scientists as a scalable and efficient vector database solution.

I found that lots of people teach how to use Pinecone, but few describe it from a traditional database perspective, such as how it compares to a conventional SQL database.

In this article, we will provide a clear understanding of CRUD (Create, Read, Update, and Delete) operations in Pinecone from a traditional database perspective. We will delve into the differences between vector and traditional databases, exploring how vector databases can be harnessed to optimize data management in modern applications.

The code for this article is available here

What is a Vector Database?

A vector database is a specialized database designed to store, manage, and query high-dimensional data represented as vectors. These databases are handy in applications that require efficient similarity searches, such as recommendation systems, image and text search engines, and natural language processing tasks.

In a vector database, data points are represented as vectors in a high-dimensional space, and the relationships between them are measured using distance metrics, such as Euclidean distance, cosine similarity, or Manhattan distance. By indexing these vectors and optimizing search algorithms, vector databases can perform similarity searches rapidly, even with enormous datasets.
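To make these metrics concrete, here is a small self-contained sketch (assuming NumPy) of what cosine similarity and Euclidean distance actually compute for two vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 = same direction, 0.0 = orthogonal, -1.0 = opposite
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, larger magnitude
print(cosine_similarity(a, b))   # 1.0 (direction-only metric)
print(euclidean_distance(a, b))  # nonzero, because magnitude matters here
```

Pinecone lets you choose the metric when you create an index; this sketch only illustrates what those metrics measure, not Pinecone's internal implementation.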

Unlike traditional databases that focus on relational or document-based storage, vector databases emphasize the importance of spatial relationships between data points. This unique focus enables vector databases to deliver high-performance, accurate search results in applications that demand quick identification of similar items within a dataset.

To get started, here is an illustration of how we interact with a vector database:

Mode of operation between traditional and vector databases. Images by Author

Getting Started with Pinecone

After you have gained access to Pinecone, create a new index with the following settings:

Creating new indexes. Images by Author

State your index’s name and the dimensions needed. In my case, I will name the index “manfye-test” and use a dimension of 300. Click “Create Index” and the index will be created as below:

Created Index. Images by Author

An index is like a table in SQL: you run your CRUD operations against an index much as you would against a table.

Before we begin with our CRUD operation, let’s gather all the required ingredients:

Install the required packages: pinecone-client, which lets you interact with Pinecone, and sentence_transformers, which helps you vectorize your data:

!pip install pinecone-client
!pip install sentence_transformers

You can get your Pinecone API key and Environment name via the “API KEY” tab in the Pinecone dashboard.

Getting API key and Environment name. Images by Author

Working with Pinecone Indexes

There are a few housekeeping functions you need to know before proceeding, much like setting up a SQL connection:

a) Connecting to Pinecone Server and Indexes

import itertools
import pinecone

#Connecting to Pinecone Server
api_key = "YOUR_API_KEY"

pinecone.init(api_key=api_key, environment='YOUR_ENV_IN_DASHBOARD')

#Connect to your indexes
index_name = "manfye-test"

index = pinecone.Index(index_name=index_name)

b) Exploring your indexes

# Getting Index Details
pinecone.describe_index(index_name)

# Return:
# IndexDescription(name='manfye-test', metric='cosine', replicas=1, dimension=300.0, shards=1, pods=1, pod_type='s1.x1', status={'ready': True, 'state': 'Ready'}, metadata_config=None, source_collection='')

index.describe_index_stats()

# Return:
# {'dimension': 300,
# 'index_fullness': 0.0,
# 'namespaces': {},
# 'total_vector_count': 0}

The describe_index_stats function is especially useful when you want to know how much data is inside your index.

Dataset Preparation

First, we will generate a dataset of complaints as below; this will be our main data to play with:

import pandas as pd

data = {
    'ticketno': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010],
    'complains': [
        'Broken navigation button on the website',
        'Incorrect pricing displayed for a product',
        'Unable to reset password',
        'App crashes on the latest iOS update',
        'Payment processing error during checkout',
        'Wrong product delivered',
        'Delayed response from customer support',
        'Excessive delivery time for an order',
        'Difficulty in finding a specific product',
        'Error in applying a discount coupon'
    ]
}

df = pd.DataFrame(data)

C — Create Data in Indexes

To create data in the vector database, we first need to convert our data into vectors via a technique called vector embedding. There are multiple ways to do this; one popular option is the OpenAI embeddings API.

However, to keep this article simple, we will use the SentenceTransformer package for the embedding. The package will automatically download the required model, “average_word_embeddings_glove.6B.300d”.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("average_word_embeddings_glove.6B.300d")

df["question_vector"] = df.complains.apply(lambda x: model.encode(str(x)).tolist())

The code creates a column “question_vector” containing the embedded vectors. Note that all the words in complains are converted into numbers.
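The apply pattern itself can be checked without downloading any model. In this sketch, stub_encode is a hypothetical stand-in for model.encode(...).tolist(); the point is that every row must end up holding a list whose length matches the index dimension (300):

```python
import pandas as pd

def stub_encode(text, dim=300):
    # Hypothetical stand-in for model.encode(...).tolist(): returns a
    # fixed-length list of floats so this sketch runs offline
    return [0.0] * dim

df = pd.DataFrame({"complains": ["Broken navigation button on the website",
                                 "Unable to reset password"]})
df["question_vector"] = df.complains.apply(lambda x: stub_encode(str(x)))

# Every row's vector length must match the index dimension (300 here),
# otherwise Pinecone will reject the upsert
print(df["question_vector"].map(len).tolist())  # [300, 300]
```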

Resulting vectors. Image by Author

Lastly, upload the data (upsert) into the indexes via chunks:

def chunks(iterable, batch_size=100):
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))

for batch in chunks([(str(t), v) for t, v in zip(df.ticketno, df.question_vector)]):
    index.upsert(vectors=batch)
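The chunks helper is plain Python, so its batching behavior can be sanity-checked without a Pinecone connection; for example, ten items in batches of four:

```python
import itertools

def chunks(iterable, batch_size=100):
    # Yield successive tuples of at most batch_size items from iterable
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))

batches = list(chunks(range(10), batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```

Batching like this matters because Pinecone accepts upserts in limited batch sizes; 100 vectors per request is the conservative default used above.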

Now check your indexes with the index.describe_index_stats():

index.describe_index_stats()

# Return:
# {'dimension': 300,
# 'index_fullness': 0.0,
# 'namespaces': {'': {'vector_count': 10}},
# 'total_vector_count': 10}

Note that the vector count has increased to 10. Congratulations on uploading your dataset into the vector database.

R — Retrieving Vectors

“Read” in a vector context refers to two functions. The first is the fetch function, where you pass the IDs of your data and Pinecone returns the stored vectors:

index.fetch(["1010","1009"])

Data retrieval is easy: call index.fetch([<IDs list>]) with the list of IDs you want to retrieve, and Pinecone returns the corresponding vectors.

The second returns the stored data most similar to your queries:

query_questions = [
    "navigation button",
]

query_vectors = [model.encode(str(question)).tolist() for question in query_questions]

query_results = index.query(queries=query_vectors, top_k=5, include_values=False)

In the code above, I asked Pinecone to find results similar to “navigation button” and return the top 5 most similar results (top_k=5), as below:

Queries result by Pinecone. Images by Author

Note that by default Pinecone does not return the vector values unless you set include_values=True in index.query(). The result above shows the similarity score and ID of the top 5 similar results.

The next step is to convert the result into a table and merge it back into our main DataFrame. The code is as below:

# Extract matches and scores from the results
matches = []
scores = []
for match in query_results['results'][0]['matches']:
    matches.append(match['id'])
    scores.append(match['score'])

# Create DataFrame with only matches and scores
matches_df = pd.DataFrame({'id': matches, 'score': scores})

# Match the result dataframe to the main dataframe
df["ticketno"] = df["ticketno"].astype(str)
matches_df.merge(df, left_on="id", right_on="ticketno")

The resulting matched table is as below. Clearly, complaint 1001 is about the navigation button, with a similarity score of 0.81; the rest have lower similarity scores, which might be due to the small size of our dataset:
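The merge step itself can be verified with toy stand-in data (the IDs and scores below are illustrative); note that both join columns must be strings, which is why the code above casts ticketno with astype(str):

```python
import pandas as pd

# Toy stand-ins for the query matches and the main complaints table
matches_df = pd.DataFrame({"id": ["1001", "1009"], "score": [0.81, 0.42]})
df = pd.DataFrame({
    "ticketno": ["1001", "1009"],
    "complains": ["Broken navigation button on the website",
                  "Difficulty in finding a specific product"],
})

# Inner join on string IDs attaches the original complaint text to each score
merged = matches_df.merge(df, left_on="id", right_on="ticketno")
print(merged[["id", "score", "complains"]])
```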

Retrieved similar queries data frame. Image by Author

U — Updating Vectors

To update an existing vector, just repeat the Create step with the updated vectors. Data with the same ID is overwritten by the upsert function:

    index.upsert(vectors=batch)
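As a minimal self-contained sketch, an update is just an upsert payload that reuses an existing ID. The placeholder vector below stands in for model.encode on the corrected text; the ticket ID and the corrected wording are illustrative:

```python
# Corrected complaint text for an existing ticket (illustrative)
ticket_id = "1003"
new_text = "Unable to reset password via the email link"

# Placeholder 300-dim vector; in the real code use
# model.encode(new_text).tolist() so the dimension matches the index
new_vector = [0.0] * 300

payload = [(ticket_id, new_vector)]
# index.upsert(vectors=payload)  # same ID -> the stored vector is overwritten
print(payload[0][0], len(payload[0][1]))  # 1003 300
```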

D — Deleting Vectors

To delete by IDs:

index.delete(ids=["id-1", "id-2"], namespace='')

To delete everything and start fresh:

index.delete(deleteAll='true', namespace="")

Limitations and Alternatives to Pinecone

While Pinecone offers an easy-to-use vector database that is suitable for beginners, it is important to be aware of its limitations. The free tier, which uses a p1 pod, allows only about 1,000,000 768-dimensional vectors. For larger-scale applications or more demanding use cases, this might not be sufficient.

Moreover, Pinecone’s paid tiers can be quite expensive, which may not be feasible for all users. As a result, you might want to explore alternatives such as a locally hosted Chroma instance or Weaviate before committing to a paid plan or expanding your application.

Words from Author

In conclusion, this article has provided a comprehensive guide to understanding and performing CRUD operations with Pinecone from a traditional database perspective.

As the author, my aim was to demystify the process of working with vector databases and highlight how vector databases’ unique features make them a powerful and efficient solution for managing high-dimensional data in machine learning and AI applications.

By walking you through the process of creating, reading, updating, and deleting data in a Pinecone index, I hope to have offered valuable insights on how to effectively manage and query data in a vector database. With this knowledge in hand, I hope you are now equipped to harness the power of vector databases in your own projects and applications.

Lastly, thank you for reading my articles. If you would like to subscribe to Medium membership, please consider using my link below. It would be great support for my writing.

If you like my article, here are more articles from me:

References

  1. Pinecone Documentation

