Spam Classification using OpenAI – GeeksforGeeks

By Ann Roberts On Jun 2, 2023

The majority of people in today’s society own a mobile phone, and they all frequently get communications (SMS/email) on their phones. But the key point is that some of the messages you get may be spam, with very few being genuine or important interactions. You may be tricked into providing your personal information, such as your password, account number, or Social Security number, by scammers that send out phony text messages. They may be able to access your bank, email, and other accounts if they obtain this information. To filter out these messages, a spam filtering system is used that marks a message spam on the basis of its contents or sender.

In this article, we will be seeing how to develop a spam classification system and also evaluate our model using various metrics. In this article, we will be majorly focusing on OpenAI API. There are 2 ways to

We will be using the Email Spam Classification Dataset dataset which has mainly 2 columns and 5572 rows with spam and non-spam messages. You can download the dataset from here.

Steps to implement Spam Classification using OpenAI

Now there are two approaches that we will be covering in this article:

1. Using Embeddings API developed by OpenAI

Step 1: Install all the necessary salaries

!pip install -q openai

Step 2: Import all the required libraries

Python3

import openai

import pandas as pd

import numpy as np

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split

from sklearn.metrics import classification_report, accuracy_score

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split

from sklearn.metrics import classification_report, accuracy_score

from sklearn.metrics import confusion_matrix

Step 3: Assign your API key to the OpenAI environment

Python3

openai.api_key = "YOUR API KEY"

Step 4: Read the CSV file and clean the dataset

Our dataset has 3 unnamed columns with NULL values,

Note: Open AI’s public API does not process more than 60 requests per minute. so we will drop them and we are taking only 60 records here only.

Python3

df = pd.read_csv('spam.csv', encoding_errors='ignore', on_bad_lines='skip')

print(df.shape)

df = df.dropna(axis=1)

df = df.iloc[:60]

df.rename(columns = {'v1':'OUTPUT', 'v2': 'TEXT'}, inplace = True)

print(df.shape)

df.head()

Output:

Email Spam Classification Dataset

Step 5: Define a function to use Open AI’s Embedding API

We use the Open AI’s Embedding function to generate embedding vectors and use them for classification. Our API uses the “text-embedding-ada-002” model which belongs to the second generation of embedding models developed by OpenAI. The embeddings generated by this model are of length 1536.

Python3

def get_embedding(text, model="text-embedding-ada-002"):

return openai.Embedding.create(input = , model=model)['data'][0]['embedding']

df["embedding"] = df.TEXT.apply(get_embedding).apply(np.array)

df.head()

Output:

Email Spam Classification Dataset

Step 6: Custom Label the classes of the output variable to 1 and 0, where 1 means “spam” and 0 means “not spam”.

Python3

class_dict = {'spam': 1, 'ham': 0}

df['class_embeddings'] = df.OUTPUT.map(class_dict)

df.head()

Output:

Spam Classification dataFrame after feature engineerin

Step 7: Develop a Classification model.

We will be splitting the dataset into a training set and validation dataset using train_test_split and training a Random Forest Classification model.

Python3

X = np.array(df.embedding)

y = np.array(df.class_embeddings)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100)

clf.fit(X_train.tolist(), y_train)

preds = clf.predict(X_test.tolist())

report = classification_report(y_test, preds)

print(report)

Output:

             precision    recall  f1-score   support
           0       0.82      1.00      0.90         9
           1       1.00      0.33      0.50         3
    accuracy                           0.83        12
   macro avg       0.91      0.67      0.70        12
weighted avg       0.86      0.83      0.80        12

Step 8: Calculate the accuracy of the model

Python3

print("accuracy: ", np.round(accuracy_score(y_test, preds)*100,2), "%")

Output:

accuracy:  83.33 %

Step 9: Print the confusion matrix for our classification model

Python3

confusion_matrix(y_test, preds)

Output:

array([[9, 0],
       [2, 1]])

2. Using text completion API developed by OpenAI

Step 1: Install the Openai library in the Python environment

!pip install -q openai

Step 2: Import the following libraries

Step 3: Assign your API key to the Openaithe environment

Python3

openai.api_key = "YOUR API KEY"

Step 4: Define a function using the text completion API of Openai

Python3

def spam_classification(message):

response = openai.Completion.create(

model="text-davinci-003",

prompt=f"Classify the following message as spam or not spam:\n\n{message}\n\nAnswer:",

temperature=0,

max_tokens=64,

top_p=1.0,

frequency_penalty=0.0,

presence_penalty=0.0

)

return response['choices'][0]['text'].strip()

Step 5: Try out the function with some examples

Example 1:

Python3

out = spam_classification(

)

print(out)

Output:

Spam

Example 2:

Python3

out = spam_classification("Hey Alex, just wanted to let you know tomorrow is an off. Thank you")

print(out)

Output:

Not spam

Frequently Asked Questions (FAQs)

1. Which algorithm is best for spam detection?

There isn’t a single algorithm that has consistently produced reliable outcomes. The type of the spam, the data that is accessible, and the particular requirements of the problem are some of the variables that affect an algorithm’s efficiency. Although Naive Bayes, Neural Networks (RNNs), Logistic Regression, Random Forest, and Support Vector Machines are some of the most frequently used classification techniques.

2. What is embedding or word embedding?

The embedding or Word embedding is a natural language processing (NLP) technique where words are mapped into vectors of real numbers. It is a way of representing words and documents through a dense vector representation. This representation is learned from data and is shown to capture the semantic and syntactic properties of words. The words closest in vector space have the most similar meanings.

3. Is spam classification supervised or unsupervised?

Spam classification is supervised as one requires both independent variable(message contents) and target variables(outcome,i.e., whether the email is spam or not) to develop a model.

4. What is spam vs ham classification?

Email that is not spam is referred to be “Ham”. Alternatively, “good mail” or “non-spam” It ought to be viewed as a quicker, snappier alternative to “non-spam”. The phrase “non-spam” is probably preferable in most contexts because it is more extensively used by anti-spam software makers than it is elsewhere.

Conclusion

In this article, we discussed the development of a spam classifier using OpenAI modules. Open AI has many such modules that can help you ease your daily work and also help you get started with projects in the field of Artificial Intelligence. You can check out other tutorials using Open AI API’s below:

Last Updated :
02 Jun, 2023

Like Article

Save Article

We will be using the Email Spam Classification Dataset dataset which has mainly 2 columns and 5572 rows with spam and non-spam messages. You can download the dataset from here.

Steps to implement Spam Classification using OpenAI

Now there are two approaches that we will be covering in this article:

1. Using Embeddings API developed by OpenAI

Step 1: Install all the necessary salaries

!pip install -q openai

Step 2: Import all the required libraries

Python3

import openai

import pandas as pd

import numpy as np

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split

from sklearn.metrics import classification_report, accuracy_score

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split

from sklearn.metrics import classification_report, accuracy_score

from sklearn.metrics import confusion_matrix

Step 3: Assign your API key to the OpenAI environment

Python3

openai.api_key = "YOUR API KEY"

Step 4: Read the CSV file and clean the dataset

Our dataset has 3 unnamed columns with NULL values,

Note: Open AI’s public API does not process more than 60 requests per minute. so we will drop them and we are taking only 60 records here only.

Python3

df = pd.read_csv('spam.csv', encoding_errors='ignore', on_bad_lines='skip')

print(df.shape)

df = df.dropna(axis=1)

df = df.iloc[:60]

df.rename(columns = {'v1':'OUTPUT', 'v2': 'TEXT'}, inplace = True)

print(df.shape)

df.head()

Output:

Email Spam Classification Dataset

Step 5: Define a function to use Open AI’s Embedding API

Python3

def get_embedding(text, model="text-embedding-ada-002"):

return openai.Embedding.create(input = , model=model)['data'][0]['embedding']

df["embedding"] = df.TEXT.apply(get_embedding).apply(np.array)

df.head()

Output:

Email Spam Classification Dataset

Step 6: Custom Label the classes of the output variable to 1 and 0, where 1 means “spam” and 0 means “not spam”.

Python3

class_dict = {'spam': 1, 'ham': 0}

df['class_embeddings'] = df.OUTPUT.map(class_dict)

df.head()

Output:

Spam Classification dataFrame after feature engineerin

Step 7: Develop a Classification model.

We will be splitting the dataset into a training set and validation dataset using train_test_split and training a Random Forest Classification model.

Python3

X = np.array(df.embedding)

y = np.array(df.class_embeddings)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100)

clf.fit(X_train.tolist(), y_train)

preds = clf.predict(X_test.tolist())

report = classification_report(y_test, preds)

print(report)

Output:

             precision    recall  f1-score   support
           0       0.82      1.00      0.90         9
           1       1.00      0.33      0.50         3
    accuracy                           0.83        12
   macro avg       0.91      0.67      0.70        12
weighted avg       0.86      0.83      0.80        12

Step 8: Calculate the accuracy of the model

Python3

print("accuracy: ", np.round(accuracy_score(y_test, preds)*100,2), "%")

Output:

accuracy:  83.33 %

Step 9: Print the confusion matrix for our classification model

Python3

confusion_matrix(y_test, preds)

Output:

array([[9, 0],
       [2, 1]])

2. Using text completion API developed by OpenAI

Step 1: Install the Openai library in the Python environment

!pip install -q openai

Step 2: Import the following libraries

Step 3: Assign your API key to the Openaithe environment

Python3

openai.api_key = "YOUR API KEY"

Step 4: Define a function using the text completion API of Openai

Python3

def spam_classification(message):

response = openai.Completion.create(

model="text-davinci-003",

prompt=f"Classify the following message as spam or not spam:\n\n{message}\n\nAnswer:",

temperature=0,

max_tokens=64,

top_p=1.0,

frequency_penalty=0.0,

presence_penalty=0.0

)

return response['choices'][0]['text'].strip()

Step 5: Try out the function with some examples

Example 1:

Python3

out = spam_classification(

)

print(out)

Output:

Spam

Example 2:

Python3

out = spam_classification("Hey Alex, just wanted to let you know tomorrow is an off. Thank you")

print(out)

Output:

Not spam

Frequently Asked Questions (FAQs)

1. Which algorithm is best for spam detection?

There isn’t a single algorithm that has consistently produced reliable outcomes. The type of the spam, the data that is accessible, and the particular requirements of the problem are some of the variables that affect an algorithm’s efficiency. Although Naive Bayes, Neural Networks (RNNs), Logistic Regression, Random Forest, and Support Vector Machines are some of the most frequently used classification techniques.

2. What is embedding or word embedding?

The embedding or Word embedding is a natural language processing (NLP) technique where words are mapped into vectors of real numbers. It is a way of representing words and documents through a dense vector representation. This representation is learned from data and is shown to capture the semantic and syntactic properties of words. The words closest in vector space have the most similar meanings.

3. Is spam classification supervised or unsupervised?

Spam classification is supervised as one requires both independent variable(message contents) and target variables(outcome,i.e., whether the email is spam or not) to develop a model.

4. What is spam vs ham classification?

Email that is not spam is referred to be “Ham”. Alternatively, “good mail” or “non-spam” It ought to be viewed as a quicker, snappier alternative to “non-spam”. The phrase “non-spam” is probably preferable in most contexts because it is more extensively used by anti-spam software makers than it is elsewhere.

Conclusion

Last Updated :
02 Jun, 2023

Like Article

Save Article

Read original article here

Denial of responsibility! Techno Blender is an automatic aggregator of the all world’s media. In each content, the hyperlink to the primary source is specified. All trademarks belong to their rightful owners, all materials to their authors. If you are the owner of the content and do not want us to publish your materials, please contact us by email – [email protected]. The content will be deleted within 24 hours.