Hands-on Sentiment Analysis on Hotels Reviews Using Artificial Intelligence and Open AI’s ChatGPT, with Python | by Piero Paialunga | Dec, 2022

By Jessie Hobb On Dec 15, 2022

Here’s how to automatically classify good and bad reviews using Machine Learning, in a few lines of code

I am a researcher, and I work with AI every day. I can argue that everyone in my position is excited like a dog that stares at an ice cream cone.🤩

This is the reason:

OpenAI’s ChatGPT is awesome.

For those who don’t know what I’m talking about, ChatGPT is an artificial intelligence chatbot that can do, well, pretty much everything.
It can code, it can write articles, it can help you decorate your home, it can make up a recipe (I don’t recommend that if you are Italian), and the list goes on.

Yes, we can argue that it will cause ethical (and not only ethical) problems in the future. My mother is a high school teacher, and she is terrified by the idea that her students will use ChatGPT to cheat on their tests. This is just one of the many examples of how things can go wrong with this incredibly powerful technology.

But again, the problem is the use, not the product. If we strictly talk about the technological aspect (that is, frankly, the one I am most interested in, as I am a certified nerd), it is freaking amazing.

Now, a lot of developers have used and tested this chatbot to try and develop their code and their AI ideas, and of course, how you use it strictly depends on your background. For example, if you are a web developer, you would ask ChatGPT to build a website using HTML. If you are a tester, you could ask ChatGPT to help you find that bug in that specific system.

In my specific case, I am a researcher. In particular, what I do for a living is build surrogate models using AI. Let’s say that you want to conduct research on “A,” but to do “A,” you need a lot of money, a lot of power, and a lot of computational time. The idea behind a surrogate model is to replace that expensive process with a data-driven approach using artificial intelligence.

Now let’s completely change the subject for a moment.

Let’s say I am an entrepreneur, and I have tons of hotels all over the USA. Given a certain review of a given hotel, I want to know if that review is a good one or a bad one for that hotel. How do I do that? I have three options:

1. I hire a person who reads millions of reviews and classifies them every day, and I probably get arrested because it is clearly an abuse of human rights.
2. I hire a person who, among other things, reads hundreds of reviews and classifies them every day. After months, I am able to build a dataset out of this. I train a Machine Learning model on that dataset.
3. I automatically generate good and bad reviews. I instantly build a dataset out of this. I train a Machine Learning model on that dataset.

As I value the time of my reader, let me skip the first option.

The second option is what you would do before ChatGPT. You can’t know in advance if a review is bad or good, so if you want to build a dataset out of this, you need to hire people and wait until the dataset is ready.

Now that we have ChatGPT, though, we can simply ask it to generate good and bad reviews! This would take minutes (rather than months), and it will allow us to build our machine learning algorithm to automatically classify our customer reviews!

Congratulations, this is your first Surrogate Model. 😊

Keep in mind that we will not train ChatGPT or do any fine-tuning. The model is exceptional at a task like this, so no fine-tuning is required here. The training of the ChatGPT model is, of course, not open source (just like the model itself). All we know is the short description on OpenAI’s blog: the model was first fine-tuned in a supervised way on conversations written by human AI trainers, and then refined with reinforcement learning from human feedback.*

*The fact alone that OpenAI’s ChatGPT is not open source raises some very tricky and interesting ethical questions. Should such a powerful model be open source, so that everyone (bad people too) can use it, or should it be not open source so that no one can really trust it?

Let me recap:

The little brain-shell thing that you see is the surrogate model; as we will see, it will be a random forest.* But I said it was a hands-on article, so let’s dive in! (so excited!!!)

*I am sorry, I love spoilers.

The first step is to use OpenAI’s Python API to generate our simulations.
A few things to consider:
1. Open AI is made by geniuses for non-genius users. For this reason, if you want to install it, you just have to do:

pip install --upgrade openai

(that is LOVABLE)

2. Of course, if you want to send a lot of requests, you will have to pay for a premium service. Assuming we don’t want to do that, we just have to wait around 30 minutes to get our dataset of fake reviews. Again, this is nothing compared to the months (and the cost) of doing it manually. You’ll also have to log in to OpenAI and get your API key.

3. We will set the label of each review by always starting with the same sentence: “This hotel was terrible.” for a bad review and “This hotel was great.” for a good review. ChatGPT will complete the review for us. Apart from these first four words, which we won’t include in our reviews anyway, every review will be different.

Let me give you an example of a bad review:

And of a good one:

Now this is the code you will need to generate your whole dataset.
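Here is a minimal sketch of that generation loop, not the exact script. It assumes version 1.x or later of the `openai` package and an `OPENAI_API_KEY` environment variable. The model name `gpt-3.5-turbo-instruct`, the sampling settings, and the helper names `openai_complete` and `generate_reviews` are my assumptions for illustration.

```python
import time

# Prompt openers from the article; the opener we pick decides the label.
PROMPTS = {0: "This hotel was terrible.", 1: "This hotel was great."}


def openai_complete(prompt, model="gpt-3.5-turbo-instruct"):
    """Ask the OpenAI completions endpoint to continue `prompt`.

    Assumes openai>=1.0 and an OPENAI_API_KEY environment variable.
    The model name is an assumption, not taken from the article.
    """
    from openai import OpenAI  # imported lazily so the rest runs without it

    client = OpenAI()
    response = client.completions.create(
        model=model, prompt=prompt, max_tokens=200, temperature=0.9
    )
    return response.choices[0].text


def generate_reviews(n_per_label, complete=openai_complete, pause_seconds=0.0):
    """Return a list of (review_text, label) pairs, n_per_label per label."""
    rows = []
    for label, opener in PROMPTS.items():
        for _ in range(n_per_label):
            rows.append((complete(opener).strip(), label))
            time.sleep(pause_seconds)  # optional pause to respect rate limits
    return rows
```

Calling `generate_reviews(500, pause_seconds=2.0)` would request 500 reviews per label. Passing your own `complete` function lets you try the loop without touching the API.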

We will then store everything in a dataframe using Pandas.

1. Import pandas and build df:

2. Fill df:

3. Export df:
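The three steps can be sketched as follows, assuming the generated reviews are available as `(text, label)` pairs. The column names `review` and `label`, the file name `generated_reviews.csv`, and the two stand-in rows are placeholders I made up.

```python
import pandas as pd

# Stand-in rows; in practice these come from the generation step.
rows = [
    ("The room was dirty and the staff was rude.", 0),
    ("The room was spotless and the staff was lovely.", 1),
]

# 1. Build an empty dataframe with the two columns we need.
df = pd.DataFrame(columns=["review", "label"])

# 2. Fill it, one row per generated review.
for text, label in rows:
    df.loc[len(df)] = [text, label]

# 3. Export it so the training step can reload it later.
df.to_csv("generated_reviews.csv", index=False)
```

For a large number of rows, `pd.DataFrame(rows, columns=["review", "label"])` builds the same table in one call and is faster than appending row by row.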

So now we need to build and train a machine learning algorithm.
As we are dealing with texts, the first thing that we need to do is use a vectorizer. A vectorizer is something that transforms a text into a vector.

Like:

As you can see, similar texts have similar vectors (I know, similarity is a tricky concept, but you know what I mean), and different texts have non-similar vectors.
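To make this concrete, here is a toy illustration with the simplest possible vectorizer: plain word counts, compared with cosine similarity. This is not the vectorizer used later, and the three sentences and the helper names are made up for the example.

```python
import math
from collections import Counter


def vectorize(text, vocabulary):
    """Turn a text into a list of word counts, one slot per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


texts = [
    "the room was clean and quiet",
    "the room was clean and bright",
    "breakfast ended far too early",
]
vocabulary = sorted({word for text in texts for word in text.split()})
vectors = [vectorize(text, vocabulary) for text in texts]

print(cosine_similarity(vectors[0], vectors[1]))  # five of six words shared
print(cosine_similarity(vectors[0], vectors[2]))  # no words shared
```

The first pair scores close to 1 and the second pair scores 0, which is the "similar texts, similar vectors" idea in numbers.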

There are tons of ways to do the vectorization steps. Some ways are more complex than others; some ways are more efficient than others; some ways require machine learning, and some ways don’t.

For the purpose of this project (and because I am not an NLP Machine Learning engineer), we will use a fairly simple one, called the TF-IDF vectorizer, which is ready to use in scikit-learn.

Let’s start by importing the libraries:

And by importing the dataset we just generated using ChatGPT:

A little bit of preprocessing here and there…
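As a sketch of what such preprocessing could look like (these exact steps are an assumption): drop the prompt opener if the stored text still contains it, lowercase everything, and collapse line breaks and repeated spaces. The helper name `clean_review` is hypothetical.

```python
import re

# Openers used to steer the generation; they carry the label, so they must go.
OPENERS = ("This hotel was terrible.", "This hotel was great.")


def clean_review(text):
    """Drop the prompt opener, lowercase, and collapse whitespace."""
    text = text.strip()
    for opener in OPENERS:
        if text.startswith(opener):
            text = text[len(opener):]
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()
```

With a pandas dataframe, something like `df["review"] = df["review"].apply(clean_review)` would apply it to every row, assuming the text lives in a column named `review`.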

And here we go:

Fantastic. Now let’s do this vectorization thing:
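A minimal sketch of the vectorization step with scikit-learn’s `TfidfVectorizer`. The four stand-in reviews and the `stop_words="english"` setting are assumptions; in practice the input is the cleaned review column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in reviews; in practice this is the cleaned review column.
reviews = [
    "the room was dirty and the staff was rude",
    "the room was spotless and the staff was lovely",
    "dirty bathroom and rude reception",
    "lovely view and spotless bathroom",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(reviews)  # sparse matrix: one row per review

print(X.shape)
print(sorted(vectorizer.vocabulary_))  # the words that became columns
```

Each row of `X` is the vector of one review, and each column is the TF-IDF weight of one word.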

As I was telling you earlier, the machine learning model we will be using is known as Random Forest. What is a random forest? It is a collection of decision trees. And what is a decision tree?

A decision tree is a Machine Learning algorithm that repeatedly splits your dataset on the values of its features. At every step it picks the split that scores best on an information theory criterion (for example information gain or Gini impurity), and it keeps going until the resulting groups distinguish what is 1 from what is 0.*

* I am sorry if it is super confusing, but explaining this in 4 lines is a hard task. This article takes its time to do so and it does that splendidly. Highly recommended.
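To give a feel for what such a criterion measures, here is a small sketch of entropy and information gain on a made-up four-review example. The feature names are invented, and scikit-learn’s trees use Gini impurity by default rather than entropy, so take this as an illustration of the idea only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(labels, mask):
    """Entropy reduction obtained by splitting `labels` with a boolean mask."""
    left = [y for y, m in zip(labels, mask) if m]
    right = [y for y, m in zip(labels, mask) if not m]
    weighted = sum(
        len(part) / len(labels) * entropy(part)
        for part in (left, right)
        if part
    )
    return entropy(labels) - weighted


labels = [0, 0, 1, 1]
contains_dirty = [True, True, False, False]  # separates the classes perfectly
contains_room = [True, False, True, False]   # tells us nothing about the label

print(information_gain(labels, contains_dirty))  # 1.0
print(information_gain(labels, contains_room))   # 0.0
```

A tree would pick the "contains dirty" split first, because it removes all the uncertainty about the label.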

Now let’s:

1. Define our random forest:

2. Split our dataset into training and testing:

3. Train our model:
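These three steps can be sketched with scikit-learn as follows. The stand-in reviews, the 80/20 split, and the fixed `random_state` are assumptions made so that the snippet runs on its own.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data; in practice use the generated reviews and their labels.
bad = ["the room was dirty and the staff was rude"] * 10
good = ["the room was spotless and the staff was lovely"] * 10
reviews, labels = bad + good, [0] * 10 + [1] * 10

X = TfidfVectorizer().fit_transform(reviews)

# 1. Define the random forest (default hyperparameters, fixed seed).
model = RandomForestClassifier(n_estimators=100, random_state=42)

# 2. Split into training and testing sets, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=42
)

# 3. Train, then score on the held-out reviews.
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(accuracy)
```

The toy data here is trivially separable, so its score says nothing about real reviews; `sklearn.metrics.classification_report` gives a fuller picture on a real test set.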

The results are pretty impressive, especially considering the lack of hyperparameter tuning.

Now that we have our trained model, we can use it on a new, unlabeled dataset. I used a set of New York City hotel reviews I found online, but you can use your own, or you could make up a review and see how it works.

This dataset is open source (CC0: Public Domain), very small (2MB) and can be downloaded on Kaggle.

Let’s preprocess the review column (or your text):

And let’s print our predictions:

As we can see, all these 5 random reviews that are classified as 1 are actually good!
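If you want to try this on your own text, here is a minimal sketch. The training reviews and the two new reviews are made up. The important detail is that new text goes through `transform` on the already fitted vectorizer, never `fit_transform`, so that the columns match the ones the model was trained on.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in training data; in practice reuse the fitted vectorizer and model.
train_reviews = (
    ["dirty room and rude staff"] * 5 + ["spotless room and lovely staff"] * 5
)
train_labels = [0] * 5 + [1] * 5

vectorizer = TfidfVectorizer()
model = RandomForestClassifier(random_state=42)
model.fit(vectorizer.fit_transform(train_reviews), train_labels)

# New, unlabeled reviews (made up here).
new_reviews = [
    "the staff was lovely and the room spotless",
    "rude staff and a dirty room",
]

# transform (not fit_transform) keeps the training vocabulary.
predictions = model.predict(vectorizer.transform(new_reviews))

for review, label in zip(new_reviews, predictions):
    print(label, review)
```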

Let’s show a count plot:
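One way to produce such a plot, sketched with the standard library for the counting and plain matplotlib for the drawing. The helper names, the output file name, and the stand-in predictions are assumptions.

```python
from collections import Counter


def label_counts(predictions):
    """Count how many reviews fall into each predicted class."""
    counts = Counter(int(p) for p in predictions)
    return {label: counts.get(label, 0) for label in (0, 1)}


def plot_label_counts(counts, path="prediction_counts.png"):
    """Draw the counts as a bar chart (needs matplotlib; imported lazily)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(["bad (0)", "good (1)"], [counts[0], counts[1]])
    ax.set_ylabel("number of reviews")
    fig.savefig(path)


counts = label_counts([1, 1, 0, 1, 1])  # stand-in predictions
print(counts)
```

Calling `plot_label_counts(counts)` then saves the bar chart to disk.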

So what did we do here?

We acknowledged that ChatGPT is awesome.
We used ChatGPT to build a dataset for our surrogate model. More specifically, we used ChatGPT to make up good and bad reviews for hotels.
We used the labeled dataset that we built to train our Machine Learning model. The model that we used is a Random Forest Classifier.
We tested our trained model on a new dataset, getting promising results.

Is there any room for improvement? A ton.

We can get the OpenAI premium service and generate WAY more than 1000 reviews.
We can improve our querying skills by giving different inputs, maybe also in languages other than English.
We can improve our Machine Learning model by doing some hyperparameter tuning.
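For the last point, a grid search is one common way to do it. This is a sketch on stand-in data. The grid values are purely illustrative, not recommendations, and wrapping the vectorizer and the forest in a `Pipeline` is my own choice.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Stand-in data; in practice use the generated reviews and their labels.
reviews = ["dirty room and rude staff"] * 6 + ["spotless room and lovely staff"] * 6
labels = [0] * 6 + [1] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("forest", RandomForestClassifier(random_state=42)),
])

# A deliberately tiny grid; the values are illustrative only.
param_grid = {
    "forest__n_estimators": [10, 50],
    "forest__max_depth": [None, 5],
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(reviews, labels)
print(search.best_params_)
```

Keeping the vectorizer inside the pipeline means it is refitted on each training fold, so no information from the validation fold leaks into the vocabulary.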

Now, let me conclude this with a thought.

There are, and there will be, a lot of concerns about who is going to use OpenAI’s ChatGPT, and how. While I am not a lawyer (let alone an expert in ethical AI), I can imagine how this tool can be dangerous in many ways and on many different levels.

I strongly disagree with the people who are not impressed by the performance of ChatGPT, because I find it pretty amazing and I am so excited to see how this technology is going to evolve. I hope this toy example sparks something in my readers too. ❤️

If you liked the article and you want to know more about machine learning, or you just want to ask me something, you can:

A. Follow me on LinkedIn, where I publish all my stories.
B. Subscribe to my newsletter. It will keep you updated about new stories and give you the chance to text me with any corrections or doubts you may have.
C. Become a referred member, so you won’t have any “maximum number of stories for the month” and you can read whatever I (and thousands of other Machine Learning and Data Science top writers) write about the newest technology available.