
Text Analytics 101 — Word Cloud and Sentiment Analysis
by Shwetha Acharya | Nov 2022



A post describing the basics of text processing and how to draw insights from data retrieved via the Twitter API

The tweet data that we get from the API is unstructured and comes in different languages, which is not convenient for machine learning or statistical analysis. We will apply text mining and Natural Language Processing (NLP) to evaluate the text for sentiment. We will use our friendly Jupyter notebook and Python in this journey :)

Text Mining — Image by Author

We can break this down into four steps:

  1. Data extraction from Twitter using tweepy
  2. Data cleaning & processing
  3. Visualization using wordcloud
  4. Sentiment Analysis

The first step is to sign up for a developer account on Twitter and get access to the Twitter API. The API uses OAuth for authentication, which means you have to generate a bearer token or use a client ID/secret. I have used ‘tweepy’ to access the API and fetch data from Twitter.
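As a side note, it is safer not to hardcode the token in the notebook. A minimal sketch, assuming the bearer token is stored in an environment variable; the name TWITTER_BEARER_TOKEN below is my choice, not from the original post:

import os
import tweepy

#Read the bearer token from the environment instead of hardcoding it
bearer = os.environ['TWITTER_BEARER_TOKEN']  #hypothetical variable name
tw_clnt = tweepy.Client(bearer_token=bearer)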

So what are we going to analyze? Hmm… let’s find out what Twitter thinks about “recession”! That will be our search query. We will build a word cloud and check the sentiment around it.

#Data Extraction
import tweepy
import pandas as pd

query = '#recession -is:retweet'
tw_clnt = tweepy.Client(bearer_token='AAAABCJJGJG')
tweets = tweepy.Paginator(tw_clnt.search_recent_tweets, query, max_results=100).flatten(limit=5000)
df = pd.DataFrame(tweets)
df.head(2)
Tweets captured in a dataframe — Image by Author

Ah! We love dataframes 🤩 this is much easier to work with. Time to find out if there are any null values in the df.

#Check for nulls/blank fields
df.id.count(), df.isnull().sum()

5000 records and thankfully no nulls😀

Now we will clean and format the tweet text — remove mentions (example: @abc423) and media links, convert to lowercase, and remove newline characters. I suggest we do not delete hashtags, as important sentiment/information often hides in them #dontignorehashtags 😜

#Remove special characters/links
import re
import numpy as np

def tweet_cleaner(x):
    text = re.sub("[@&][A-Za-z0-9_]+", "", x)  #Remove mentions
    text = re.sub(r"http\S+", "", text)  #Remove media links
    return pd.Series([text])

df[['plain_text']] = df.text.apply(tweet_cleaner)

#Convert all text to lowercase
df.plain_text = df.plain_text.str.lower()

#Remove newline characters
df.plain_text = df.plain_text.str.replace('\n', '')

#Replace any empty strings with null
df = df.replace(r'^\s*$', np.nan, regex=True)
if df.isnull().sum().plain_text == 0:
    print('no empty strings')
else:
    df.dropna(inplace=True)

We will store the well-formatted data in a new column — ‘plain_text’

Well formatted text — Image by Author

Our next step is to find out whether there are tweets in languages other than English, using langdetect’s ‘detect’ function. This library raises an exception if the text contains only numbers or punctuation. Such tweets are of no use for our analysis, so these ‘exception’ records can be deleted.

#Detect language of tweets
from langdetect import detect

def detect_textlang(text):
    try:
        #detect() returns an ISO 639-1 code such as 'en' or 'es'
        return detect(text)
    except:
        #detect() raises an exception on number/punctuation-only text
        return "NA"

df['text_lang'] = df.plain_text.apply(detect_textlang)
Tweet in Spanish — Image by Author
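Those ‘exception’ records can then be dropped with a simple filter (my addition, not part of the original code):

#Drop tweets whose language could not be detected (numbers/punctuation-only text)
df = df[df.text_lang != 'NA']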

Oh! There are tweets in Spanish, and probably many other languages… We will group all the tweets by language and look at the top 10.

# Group tweets by language and list the top 10
import matplotlib.pyplot as plt
plt.figure(figsize=(4,3))
df.groupby(df.text_lang).plain_text.count().sort_values(ascending=False).head(10).plot.bar()
plt.show()
Group tweets by languages — Image by Author

There are records in Dutch, Turkish, etc., so we will translate these tweets to English using googletrans’ Translator. Translation takes time if there are more than 100–200 records, and it can time out if a response is not received in time (error: “The read operation timed out”). Hence, it is wise to apply the language filter before calling the ‘translate’ function.

#Translate to English
from googletrans import Translator

def translate_text(lang, text):
    translator = Translator()
    #dest defaults to English; pass the detected language as the source
    trans_text = translator.translate(text, src=lang).text
    return trans_text

df['translated_text'] = df.apply(lambda x: x.plain_text if x.text_lang == 'en' else translate_text(x.text_lang, x.plain_text), axis=1)
df.translated_text = df.translated_text.str.lower()
Dataframe with translated text — Image by Author
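Because the translate call can time out, a simple retry with backoff is one way to make it more robust. A minimal sketch (my addition, wrapping the same googletrans call with standard-library retries):

import time
from googletrans import Translator

def translate_with_retry(lang, text, retries=3):
    translator = Translator()
    for attempt in range(retries):
        try:
            return translator.translate(text, src=lang).text
        except Exception:
            #Give up after the last attempt, otherwise back off and retry
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)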

Cool :) Now our source data is almost ready; we just need to get rid of ‘#recession’, as that is our query term. The idea is to build a word cloud that gives information about recession rather than simply repeating that word! Also, we do not want generic words such as ‘will’, ‘go’, ‘has’, ‘would’, etc. to appear in our word cloud. NLTK’s ‘stopwords’ provides a list of such words, and we can exclude all of them from our ‘translated_text’.

#Remove unimportant words from text
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
query_words = {'recession', '#'}
stop_words.update(query_words)
for word in query_words:
    df.translated_text = df.translated_text.str.replace(word, '')

#Create word cloud
from wordcloud import WordCloud
wc = WordCloud(stopwords=stop_words, collocations=False, max_font_size=55, max_words=25, background_color="black")
wc.generate(' '.join(df.translated_text))
plt.figure(figsize=(10,12))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
‘Recession’ Word Cloud — Image by Author

Here you go👍

Per the Twitter word cloud, people talking about recession are discussing inflation, layoffs, and jobs — which rings true! There is particular focus on the stock market, housing market, and crypto market. The situation in the UK and the Fed’s decisions are also brought up frequently. I have kept a maximum count of 25 words; you can increase it to get more insights.

Let’s explore the general sentiment of these tweets using the VADER sentiment analyzer. This library returns a compound polarity score between -1 and 1, where -1 is the most negative sentiment and 1 the most positive. We can then put these scores into buckets of ‘negative’, ‘positive’, and ‘neutral’.

#Sentiment Check
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
df['polarity'] = [analyzer.polarity_scores(text)['compound'] for text in df.translated_text]

def get_sentiment(polarity):
    if polarity < 0.0:
        return 'Negative'
    elif polarity > 0.2:
        return 'Positive'
    else:
        return 'Neutral'

df['sentiment'] = df.polarity.apply(get_sentiment)
plt.figure(figsize=(3,3))
df.sentiment.value_counts().plot.bar()
Count of Positive vs Negative vs Neutral Sentiments — Image by Author

Oh! As expected, negative sentiments have the highest count. Let’s look at a sample negative tweet and some positive tweets.

Negative Sentiment — Image by Author
Positive Sentiment — Image by Author

On filtering for positive sentiments, we see some advice about how we can equip ourselves for recession times! These are indeed positive points.

Great work! This technique has important applications in marketing analytics, where customer reviews are assessed for a brand/product. But sometimes user text can be slightly misleading (as you see in one of the positive tweets, ‘wow’ does not actually mean something positive in that context), and such nuances cannot be picked up easily. Hence, for more accurate sentiment results, it is advisable to run the analysis with 2–3 packages such as TextBlob or Transformers in addition to VADER and compute a weighted polarity score.
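A minimal sketch of such a weighted score, combining VADER’s compound score with TextBlob’s polarity (the 50/50 weights are my arbitrary choice, not from the original post):

from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def weighted_polarity(text, w_vader=0.5, w_blob=0.5):
    vader_score = analyzer.polarity_scores(text)['compound']  #in [-1, 1]
    blob_score = TextBlob(text).sentiment.polarity  #also in [-1, 1]
    return w_vader * vader_score + w_blob * blob_score

df['polarity_weighted'] = df.translated_text.apply(weighted_polarity)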

There are other ways of doing sentiment analysis, for example through vectorization, but the right choice also depends on your data and its attributes.
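For illustration, a vectorization-based approach turns text into numeric features and trains a classifier on labeled examples. A minimal sketch with scikit-learn, assuming you already have labeled sentiment data (which this tutorial’s tweets do not include; the tiny training set below is made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

#Hypothetical labeled data: texts and their sentiment labels
texts = ['markets are crashing badly', 'great time to invest and grow']
labels = ['Negative', 'Positive']

#TF-IDF vectorization followed by a logistic regression classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(['layoffs are rising']))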

So we have reached the end of class 101 ;) I hope this article was informative and has cemented your understanding of text analysis. You can access my notebook here. Try this exercise with a search word of your choice (a brand/personality/topic) and share your results with me!



