My First Exploratory Data Analysis with ChatGPT | by Jye Sawtell-Rickson | May, 2023

Unleashing the power of ChatGPT: A deep dive into an exploratory data analysis and future opportunities

“An AI exploring a vast world of data. Digital art. Vivid colours.” (Author generated via DALL-E 2)

ChatGPT is an extraordinary tool for working more efficiently, and that doesn’t stop with data analytics. In this article we’ll run through an example of exploratory data analysis (EDA) run by ChatGPT. We’ll cover the various stages of an EDA, see some impressive outputs (Wordclouds!) and note where ChatGPT does well (and not so well). Finally, we’ll touch on the future of LLMs in analytics and how excited we are for it.

The dataset used for the analysis is a sample from Common Crawl, which anyone is free to access and analyse. The Common Crawl dataset is a vast collection of web crawl data, comprising billions of web pages from the internet. The dataset includes various web content types and is regularly updated. It serves as a significant resource for training large language models (LLMs) and made up 60% of the training data for ChatGPT. You can find the dataset sample curated by the author hosted on Kaggle here.

Throughout the post, content will be truncated, so feel free to follow along directly on the Google Colab used to run this analysis.

We’ve broken down the analysis into five sections:

  • 🤔 Overall Structure
  • 🧹 Data Cleaning
  • 💯 Basic Analysis
  • 🇬🇧 NLP Approaches
  • 🔑 Key Findings

With all that out of the way, let’s jump into the analysis!

Before starting analytics work, it’s good to formulate a plan. EDAs typically follow a standard structure to start with, before going off on tangents as interesting insights are discovered in the analysis. One such template is this, which details the core components: preview data, clean data, plot data. We could use this as a starting point, but let’s assume we know nothing and see what ChatGPT comes up with.

Prompt:

Let's run an EDA on the Common Crawl text dataset together. The dataset is a scrape of many web pages on the internet. The columns are 'text', 'WARC-Type', 'WARC-Target-URI', 'WARC-Date', 'WARC-Record-ID', 'WARC-Refers-To', 'WARC-Block-Digest', 'WARC-Identified-Content-Language', 'Content-Type', and 'Content-Length'. The dataset is saved in a csv file called common_crawl_text.

You should give me an overall plan, then we can go through step-by-step. In each step you should write some code and I will run it in the terminal then report back to you the results.

Response:

Sure, here's a high-level plan for the EDA on the Common Crawl text dataset:

  1. Load the dataset into a pandas DataFrame.
  2. Check for missing values in each column and handle them appropriately.
  3. Explore the distribution of the target variable 'WARC-Type'.
  4. Check the distribution of 'WARC-Identified-Content-Language'.
  5. Check the distribution of 'Content-Type'.
  6. Explore the 'text' column, including the distribution of text lengths and the most common words.
  7. Check for any correlations between the different columns.
  8. Summarize the findings and identify any interesting patterns or insights.

Let's start with step 1 and load the dataset into a pandas DataFrame.

ChatGPT has done a pretty good job here covering the standard steps. Notice that it doesn’t want us to deeply analyse all columns, but has a subset that it’s specifically interested in. This is sensible, but I think it’s missing some valuable data points. For example, “WARC-Target-URI” is worth exploring, and the “WARC-Date” column will give insights into when the data was collected.

I prompted ChatGPT with this, and it happily included these in the plan. At the same time, it added the creation of a Wordcloud, something I had in the back of my mind.

I love this sort of interaction, where it feels like a partnership, rather than just reading a book.
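For reference, the kind of extra exploration I had in mind for those two columns is easy to sketch. The snippet below is not from the ChatGPT transcript; it’s a minimal example of my own, and the derived domain and crawl_date columns are just illustrative names.

import pandas as pd
from urllib.parse import urlparse

# Sketch only: explore the two extra columns mentioned above
df = pd.read_csv("common_crawl_text.csv")

# Which domains appear most often in the crawl sample?
df['domain'] = df['WARC-Target-URI'].astype(str).apply(lambda uri: urlparse(uri).netloc)
print(df['domain'].value_counts().head(20))

# When was the data collected?
df['crawl_date'] = pd.to_datetime(df['WARC-Date'], errors='coerce')
print(df['crawl_date'].min(), df['crawl_date'].max())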

With each of the steps outlined in the structure, we can ask ChatGPT to expand on that step and give us the exact code to run. To start off, it suggests:

We can start by loading the Common Crawl text dataset and taking a look at its structure. Here's the code to load the dataset into a Pandas DataFrame:

import pandas as pd

df = pd.read_csv("common_crawl_text.csv")

Once you've loaded the dataset, you can check the structure of the DataFrame using the following code:

df.head()

This will show you the first five rows of the DataFrame. You can also use the df.info() method to get an overview of the columns and their data types, as well as the number of non-null values in each column.

After giving it the results, it’s able to share some basic information back:

Based on the provided data, it seems like we have a dataset of web archive records with different attributes such as the type of the record, the URI (Uniform Resource Identifier) of the record, the date it was recorded, the length of the content, and the content itself. The content is in different languages and formats, such as plain text, and the length of the content varies. The dataset contains 37832 records, and some of them have null values in the content length column.

We can go a little further and ask it to give a detailed description of each column:

Prompt:

Can you give a detailed list of the columns and the expected meaning?

Response:

Sure! Here's a detailed list of the columns in the DataFrame and their expected meaning:

  1. WARC-Type: The type of WARC record. A WARC (Web ARChive) record is a format for storing internet resources such as web pages and their associated metadata.
  2. WARC-Target-URI: The target URI (Uniform Resource Identifier) of the archived web page.
  3. WARC-Date: The date and time when the web page was archived.
  4. WARC-Record-ID: The unique identifier of the WARC record.
  5. WARC-Refers-To: The identifier of a related WARC record that this record refers to.
  6. WARC-Block-Digest: A cryptographic hash value computed over the contents of the WARC record.
  7. WARC-Identified-Content-Language: The language(s) identified in the content of the web page.
  8. Content-Type: The media type of the content, such as text/plain or text/html.
  9. Content-Length: The length of the content in bytes.
  10. text: The extracted text content of the web page.

Note that some of these columns are specific to the WARC format and may not be present in other types of data.

It’s done a fairly good job at guessing what these columns mean. Some of these may not be correct (e.g. Content-Length), but it’s a helpful starting point. It may actually know these specific names thanks to its training data, but I’ve experimented around with arbitrary column names and it can usually guess what the column will contain from a combination of the name and data returned.
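Incidentally, the missing-value counts it reacted to aren’t shown in the transcript, but they are trivial to reproduce. A minimal sketch, assuming df is the DataFrame loaded earlier:

# Quick check of where the missing values are
print(df.isnull().sum())

# As a percentage of rows, which is easier to judge at a glance
print((df.isnull().mean() * 100).round(2))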

Now, it noticed there were missing values, so let’s see how to deal with that:

Prompt:

Should we do something about the missing values?

Response:

Yes, it would be a good idea to handle the missing values in the Content-Length column. One option is to remove the rows that have missing values if they are not a significant portion of the data. Another option is to impute the missing values using some statistical method such as mean, median, or mode. However, the best approach would depend on the nature of the data and the analysis being performed.

These are the ways I would probably deal with it. In this case, I’m happy to just delete the entries since there aren’t too many.
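For completeness, the imputation route it mentioned is just as short. A sketch of that alternative, assuming we wanted to keep the rows and fill Content-Length with its median (not what we actually did):

# The alternative ChatGPT described: impute instead of dropping.
# Sketch only; in the analysis we simply dropped the rows.
df['Content-Length'] = pd.to_numeric(df['Content-Length'], errors='coerce')
df['Content-Length'] = df['Content-Length'].fillna(df['Content-Length'].median())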

Prompt:

Let's just remove them as there are very few. How can I do that?

Response:

You can drop the rows containing missing values using the dropna() function in pandas. Here's an example:

import pandas as pd

# create a dataframe

df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, None, 7, 8]})

# drop rows with missing values

df.dropna(inplace=True)
print(df)

Output:

     A    B
0  1.0  5.0
3  4.0  8.0

In this example, the dropna() function removes the second row which contains missing values. The inplace=True argument modifies the original DataFrame, so the resulting DataFrame only contains rows without any missing values.

Notice how it gives working code, and comments each line of the code. It also gives an explanation of what is happening at the end. This is great because we can learn what’s happening in the code, and if there’s anything we want to modify, it’s a lot easier.

We can now get started with the first parts of the analysis.

Steps 3–5 are roughly the same, plot some distributions of data and see if there’s anything interesting there.

When prompted to start step 3:

Sure, let's explore the distribution of the target variable 'WARC-Type'. We can do this by creating a histogram or a bar chart of the variable to see the count or percentage of each value in the dataset. If you are using Python, you can use libraries like Matplotlib or Seaborn to create visualizations. Here is an example code snippet to create a bar chart using Matplotlib:

import matplotlib.pyplot as plt

# Count the number of occurrences of each value in the 'WARC-Type' column
value_counts = df['WARC-Type'].value_counts()

# Create a bar chart of the value counts
plt.bar(value_counts.index, value_counts.values)

# Set the title and axis labels
plt.title('Distribution of WARC-Type')
plt.xlabel('WARC-Type')
plt.ylabel('Count')

# Show the plot
plt.show()

This code will create a bar chart showing the count of each value in the 'WARC-Type' column. You can modify the code to show the percentage of each value instead of the count.

And that outputs a graph for us. In this case, it’s not very interesting as there’s actually only one “WARC-Type”, but it was a nice idea. The issue with images is that it’s hard to feed the results back to ChatGPT. I found it better to ask ChatGPT to return tables of data to solve this, alongside additional plots where they make sense.
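In practice that just means printing something copy-pasteable. A minimal sketch of my own (not from the transcript) of what reporting step 3 back as a table might look like:

# Print a small table that can be pasted straight back into the chat,
# instead of (or alongside) the bar chart
value_counts = df['WARC-Type'].value_counts()
print(value_counts.to_string())

# Percentages are often more useful to report back than raw counts
print((df['WARC-Type'].value_counts(normalize=True) * 100).round(2).to_string())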

After learning the column only has one value, ChatGPT is happy to move on to the next stage. When exploring the “WARC-Identified-Content-Language” column we start to see something a little more complex. The results returned are:

eng 15903

rus 1551

...

hrv,nno,bel 1

zul,eng 1

Of which it learns:

It seems like there are 1187 unique values for the 'WARC-Identified-Content-Language' column, and the most common language is English ('eng') followed by Russian ('rus'), German ('deu'), and Chinese ('zho'). Since there are so many unique values, it might be difficult to analyze the distribution of the variable. One approach to simplifying the analysis could be to group languages into broader categories, such as "English", "Romance languages", "Slavic languages", etc. depending on the research question.

It’s a great point about grouping, but it’s missed something here — each row can have multiple languages. When prompted about this (a few times), it comes up with a good approach.

Here's an updated code that splits the rows with multiple languages and weighs them accordingly:

import pandas as pd

# Split the rows with multiple languages
df['WARC-Identified-Content-Language'] = df['WARC-Identified-Content-Language'].str.split(',')

# Create a new DataFrame to store the split rows
df_split = pd.DataFrame({
    'WARC-Identified-Content-Language': [lang for langs in df['WARC-Identified-Content-Language'].tolist() for lang in langs],
    'count': [1/len(langs) for langs in df['WARC-Identified-Content-Language'].tolist() for lang in langs]
})

# Aggregate by language
df_agg = df_split.groupby('WARC-Identified-Content-Language')['count'].sum().reset_index()

# Plot the distribution of languages
plt.figure(figsize=(10,6))
sns.barplot(data=df_agg, x='WARC-Identified-Content-Language', y='count', order=df_agg.sort_values('count', ascending=False)['WARC-Identified-Content-Language'])
plt.title('Distribution of WARC-Identified-Content-Language')
plt.xticks(rotation=90)
plt.show()

This code splits the rows with multiple languages into separate rows, and weights them by the number of languages in each row. Then it aggregates the rows by language and plots the distribution of languages.

After combining this with some prompts about the language grouping, we’re able to get out a nice plot:

One thing I particularly like about this is that it gives the chart proper labels. It can be a little annoying to write out labels every time, but it does make a difference, so it’s nice to see that it’s smart enough to give sensible ones here.
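The language grouping itself isn’t shown in the transcript above. Roughly, it comes down to mapping ISO language codes to broader families before aggregating. Here’s a minimal sketch, where the (deliberately incomplete) lang_to_family mapping is my own illustration rather than ChatGPT’s output:

# Map ISO 639-3 codes to broad groups before aggregating.
# The mapping below is illustrative and incomplete; the real one
# came out of a few more prompts.
lang_to_family = {
    'eng': 'English', 'rus': 'Slavic', 'ukr': 'Slavic', 'pol': 'Slavic',
    'deu': 'Germanic', 'nld': 'Germanic', 'fra': 'Romance', 'spa': 'Romance',
    'ita': 'Romance', 'por': 'Romance', 'zho': 'Chinese', 'hun': 'Uralic',
}

df_agg['family'] = df_agg['WARC-Identified-Content-Language'].map(lang_to_family).fillna('Other')
family_counts = df_agg.groupby('family')['count'].sum().sort_values(ascending=False)
print(family_counts.to_string())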

Seems English is still the dominant language — by how much?

Prompt:

It looks like it's mostly English language. Can you calculate what percentage of the agg data is english?

Response:

Sure, here's some code to calculate the percentage of the aggregated data that is English:

english_percentage = (agg_data.loc['eng', 'weight'] / agg_data['weight'].sum()) * 100

print(f"The percentage of the aggregated data that is English is: {english_percentage:.2f}%")

This should output the percentage of the aggregated data that is English, rounded to 2 decimal places.

(Author created image)

Now we can move on to step 6, which involves some exploration of the text column. This is going to be more analytics-heavy and will require Natural Language Processing (NLP) techniques. Let’s see how ChatGPT does.

Prompt:

Great. Let's move on to step 6:

"Explore the 'text' column, including the distribution of text lengths and the most common words."

Response:

Sure! To explore the 'text' column, we can start by calculating the length of each text and creating a histogram to visualize the distribution of text lengths. Here's the code to do that:

# Calculate the length of each text
df['text_length'] = df['text'].str.len()

# Create a histogram of text lengths
plt.hist(df['text_length'], bins=50)

This will create a histogram that shows the distribution of text lengths in the 'text' column. We can then look at the most common words in the 'text' column by first removing any HTML tags, punctuation, and stop words. Here's the code to do that:

import re
import nltk
from nltk.corpus import stopwords

# Remove HTML tags from text
df['text'] = df['text'].apply(lambda x: re.sub('<.*?>', '', x))
# Remove punctuation from text
df['text'] = df['text'].str.replace('[^\w\s]', '')
# Convert text to lowercase
df['text'] = df['text'].str.lower()
# Remove stop words from text
stop_words = set(stopwords.words('english'))
df['text'] = df['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
# Tokenize text into individual words
df['text'] = df['text'].apply(nltk.word_tokenize)
# Flatten list of words into a single list
word_list = [word for sublist in df['text'] for word in sublist]

# Calculate frequency distribution of words
freq_dist = nltk.FreqDist(word_list)

# Print the 20 most common words
print(freq_dist.most_common(20))

This code will remove HTML tags and punctuation from the 'text' column, convert all text to lowercase, remove stop words, tokenize the text into individual words, and then calculate the frequency distribution of words. Finally, it will print the 20 most common words in the 'text' column.

This yields:

[('nan', 36847), ('0', 3467), ('1', 2809), ('keresőoptimalizálás', 2498), ('idősgondozás', 1852), ('2', 1740), ('3', 1559), ('szerződés', 1529), ('eltartási', 1527), ('brands', 1442), ('5', 1362), ('4', 1350), ('usd', 1245), ('us', 1205), ('2023', 1204), ('new', 1178), ('sex', 1165), ('2022', 1031), ('home', 919), ('cookies', 904)]

I wasn’t very happy with this, as we can see that a lot of the ‘top words’ are just numbers, or stopwords from another language (it only used English stopword lists). I prodded it on this, and it helped clean things up by filtering out non-alphabetic tokens and including stopword lists for other languages.
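The clean-up looked roughly like the sketch below. It is reconstructed from the conversation rather than copied from it, so the exact language list and variable names are assumptions:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# Combine stopword lists for several of the dominant languages
# (the exact language list here is an assumption)
stop_words = set()
for lang in ['english', 'russian', 'german', 'spanish', 'hungarian']:
    stop_words.update(stopwords.words(lang))

# Keep only purely alphabetic tokens that aren't stopwords or the literal 'nan'
filtered_words = [
    w for w in word_list
    if w.isalpha() and w not in stop_words and w != 'nan'
]

freq_dist = nltk.FreqDist(filtered_words)
top_words = freq_dist.most_common(50)
print(top_words[:20])

With the cleaned-up top_words in hand, let’s visualise them in a Wordcloud.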

Prompt:

Can you make it a wordcloud?

Response:

Sure! To create a wordcloud from the top 50 most frequent words, you can use the WordCloud library in Python as follows:

from wordcloud import WordCloud

import matplotlib.pyplot as plt

# Convert the top 50 most frequent words to a dictionary
freq_dict = dict(top_words)

# Generate the wordcloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(freq_dict)

# Display the wordcloud
plt.figure(figsize=(12, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

This code will generate a wordcloud with the top 50 most frequent words in the text column. You can adjust the parameters of the WordCloud function to customize the appearance of the wordcloud.

WARNING: contains explicit language. For those who are wondering, “keresőoptimalizálás” is Hungarian for “Search engine optimization” (author created image).

I’m not sure what to take away from this plot. It seems there are some words which could have useful content associated with them (e.g. “information”, “blog”, “brands”), but there is also a lot of undesirable content. I shared the top words with ChatGPT and asked what it thought:

Prompt:

What can we say about this?

Response:

Based on the top words found in the 'text' column, it appears that the data may contain a lot of spam or irrelevant information, as many of the top words are related to online dating, reviews, loans, and escorts. Additionally, the presence of words like 'nan' suggests that there may be missing or incomplete data in the 'text' column. It would be important to further investigate the nature and quality of the data in the 'text' column to determine how useful it is for any downstream analysis.

That’s a fair conclusion to make, and indeed, companies that work with the Common Crawl dataset have stated that they invest considerable resources cleaning up the dataset to make it usable for training AI models.

This concludes the plan outlined by ChatGPT, so we’ll wrap up the analysis here and see what findings it has.

Prompt:

Great. Let's move on to step 8. Can you summarise the findings from all our analysis so far?

Response:

Sure, based on the analysis we have conducted so far, here are some key findings:

Language distribution: The most common language in the dataset is English, followed by Russian, German, Chinese, and Spanish.

Content-Type distribution: The majority of the content in the dataset is of the type "text/plain".

Text length: The text length in the dataset ranges from 0 to 8,542,524, with an average length of 20,566.

Common words: The most frequent words in the dataset are not particularly informative, as they include many stop words and other non-informative terms.

Correlations: There is a strong positive correlation between the length of the text and the content length. There is also a weak positive correlation between the number of languages detected in a text and the content length.
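Step 7, the correlation check, isn’t reproduced in the transcript; it boils down to something like the sketch below, with text_length computed in step 6 and the other column names assumed from earlier steps.

# Quick correlation check between the numeric columns (sketch)
df['Content-Length'] = pd.to_numeric(df['Content-Length'], errors='coerce')
df['n_languages'] = df['WARC-Identified-Content-Language'].str.len()  # lists after the earlier split
print(df[['text_length', 'Content-Length', 'n_languages']].corr())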

If a data analyst came to me with that summary, I’d be pretty disappointed. As mentioned at the start, there are some columns it could have analysed in more detail, and it’s generally better to make some suggestions for next steps. However, we never gave it a goal for the analysis, nor told it to give us suggestions, so I’m not going to penalise it too much.

As you can see, the prompting used throughout was relatively simple. Given that, it’s impressive how well ChatGPT was able to understand what I wanted done and give useful answers. The prompts could definitely be improved by providing more context in each one and being stricter about what we want back. For example, each prompt could contain references to the specific task it is focusing on, as well as additional text to have it do exactly what we want:

Don’t respond with superfluous text. Assume pandas, numpy and matplotlib have been imported in the standard way.

These could be maintained in your own set of prompt templates to speed up this sort of work, or done with a tool such as LangChain.
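As a rough illustration, a template like the one below does the job in plain Python; LangChain wraps the same idea with extras such as chaining and memory. The wording and variable names here are just examples:

# A home-grown prompt template; plain string formatting is enough to get started
EDA_STEP_PROMPT = """You are helping me run an EDA on the dataset '{dataset}'.
Current step: {step}.
Reply with Python code only. Don't respond with superfluous text.
Assume pandas, numpy and matplotlib have been imported in the standard way."""

prompt = EDA_STEP_PROMPT.format(
    dataset="common_crawl_text",
    step="Explore the distribution of 'WARC-Identified-Content-Language'",
)
print(prompt)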

We could also define our own overall template. I let ChatGPT come up with a plan, but it wasn’t perfect. We could define an overall structure for it to follow, and a standard way to e.g. analyse each variable. With templates, ChatGPT is less likely to miss insights in such an analysis.

While it was fun going back and forth with ChatGPT, feeding data outputs to it quickly became tiring. ChatGPT is much more powerful when it can run the code directly itself. Instead of the chat interface, ChatGPT can be connected to a Python runtime by working with its API from Python. In that case, the generated code could be run automatically, but to cut the human out of the loop we’ll need one more tool.
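The API side of that loop is simple enough to sketch. This is a minimal example using the openai package as it looked around the time of writing (pre-1.0); the model name, prompts and placeholder key are illustrative, not something run in this analysis:

import openai

# Sketch of driving the analysis through the API instead of the chat UI
openai.api_key = "YOUR_API_KEY"

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a data analyst. Reply with Python code only."},
        {"role": "user", "content": "Plot the distribution of 'Content-Type' in the DataFrame df."},
    ],
)
generated_code = response["choices"][0]["message"]["content"]
print(generated_code)

# Executing generated_code automatically (and feeding errors back) is the part
# that still needs a human, or another tool, in the loop.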

AutoGPT has been very popular in the last month as a power-up to ChatGPT: it effectively gives ChatGPT agents a guiding loop that lets them keep executing towards some goal. AutoGPT could replace me in this situation, asking ChatGPT agents to design code, executing it, feeding the results back to ChatGPT, and proceeding until it has a detailed analysis. It would also interface with a memory database, which would allow it to execute much larger analyses.

With a tool like AutoGPT we can set a clear goal, along with requirements such as the depth of analysis and the expected style of conclusions. In that case, we can check in on the outcomes less regularly and eventually do very little work to get a decent analysis out.

Finally, we should call out that ChatGPT is far from ‘perfect’, and even in this mock analysis I had to massage the prompts to get an answer that was close to what I wanted. It was a lot less difficult than I expected, but still worth noting. It created some code that had errors, though it managed to fix them every time it was told about them. At times it created code that I wouldn’t have wanted to run, and I needed to suggest it follow a different path, but again, upon prompting it could come up with a decent solution.

In this article, we’ve seen how ChatGPT can be used to support the running of an Exploratory Data Analysis (EDA). We’ve seen that we’re able to get surprisingly good results working with the system, with little outside help. We also noted that there are already tools, such as AutoGPT, which extend this idea and could make it an even more powerful assistant.

As a data analyst, I’m already using ChatGPT to help with my analytics in some of the ways described above, though I rarely use it for an end-to-end analysis as detailed in this article. As more integrations are built out with tools like AutoGPT, and the friction of using them is reduced, I expect to use it more and more, and I’m very much excited for it (as long as I’m not made obsolete 😉).

