
Using Google Trends as Machine Learning Features in BigQuery | by Alan Chen | Dec, 2022



Written with Co-Author: Daniel Shin, MSCS at Stanford

Photo by Firmbee.com on Unsplash

Sometimes, as engineers and scientists, we think of data only as bytes in RAM, matrices on GPUs, and numeric features that go into our predictive black box. We forget that they represent changes in real-world patterns.

For example, when real-world events and trends arise, we tend to turn to Google first to find related information (e.g. where to go for a hike, or what term X means), which makes Google Search Trends a very good source of data for interpreting and understanding what is going on around us in real time.

This is why we decided to study the interplay between Google Search trends and other temporal data: can search trends serve as features for a temporal machine learning model, and what insights can we draw from them?

In this project, we looked at how Google Trends data could be used as features for time series or regression models. We chose crime rate as our response variable because it is one of the larger temporal datasets on Google Cloud Platform, with millions of rows, which fits our purpose of using BigQuery.

However, we note that whenever we work with a dataset dealing with sensitive topics, it's important to be impartial and not make assumptions with pre-emptive bias regarding protected traits (e.g. age, sex, and race). We included a section at the end that we encourage you not to skip: a discussion on Fairness in AI ⚖️️.

Additionally, for this project we focused only on narcotics-related crimes (e.g. drug trafficking) for our ML model. Even so, we think it's best not to deploy similar models in the real world without guardrails and further research on their social ramifications. This study is purely for exploratory and educational purposes, and should be treated as such.

BigQuery is a serverless data warehouse with many public datasets available. It can be a great place to try out machine learning on massive datasets, many of which are regularly updated under bigquery-public-data. In this section, we'll go through how to set up BigQuery.

We first authenticate ourselves in Google Colab with the same account that has BigQuery credits. Everyone gets about 1 TB of free query processing per month, which is enough for our purposes (each optimized query here processes roughly 200 MB). Students may also get educational credits.

# Authenticate yourself into Google Colab
from google.colab import auth
auth.authenticate_user()
project_id = "YOUR_PROJECT_ID_IN_BIGQUERY"
# create a BigQuery client with the same account as Colab
from google.cloud import bigquery
client = bigquery.Client(project=project_id)

Then, we can convert an IPython notebook cell into a SQL cell like the following, which will store the output into a pd.DataFrame in variable_name.

%%bigquery variable_name --project $project_id

SELECT *
FROM `bigquery-public-data.chicago_crime.crime`
LIMIT 5

Our initial thought was to use the google_trends.top_rising_terms dataset and do a LEFT JOIN on the date-shifted columns. However, we ran into the problem that the dataset is dominated by current events and names unrelated to our response variable.

SQL statement for selecting Chicago search trends, updated on Dec 3rd, 2022, by Author

Then we tried fetching terms using the unofficial PyTrends API for Google Trends. However, we quickly hit rate limits, so we had to be more selective with our data. Our only option was to download the data by querying each word on the Google Trends site and combining the word datasets manually with a script. So we focused on obtaining the right word features.

One of our intuitions was that terms describing substances would have regional variations and different names. For this problem, we used WordNet to generate a set of words that have high co-occurrence with some terms for common street drugs.

Our use of WordNet was inspired by co-occurrence-based recommendation models (e.g. for YouTube videos, Amazon catalogs, etc.), which often batch-compute co-occurrence to rank related items. This is similar to what masked language models (e.g. BERT) seek to represent.
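As a toy illustration of the co-occurrence idea (this is our own sketch, not the recommendation systems' actual pipelines), one can count how often terms share a context window in a small corpus:

```python
from collections import Counter
from itertools import combinations

# Toy corpus: each inner list is one "context" (e.g. a sentence or query).
corpus = [
    ["where", "to", "buy", "pot"],
    ["pot", "dispensary", "near", "me"],
    ["buy", "jet", "online"],
]

def cooccurrence_counts(contexts):
    """Count unordered word pairs that appear in the same context."""
    counts = Counter()
    for ctx in contexts:
        for a, b in combinations(sorted(set(ctx)), 2):
            counts[(a, b)] += 1
    return counts

counts = cooccurrence_counts(corpus)
# ("buy", "pot") co-occur only in the first context.
```

Real systems batch-compute these counts over enormous logs and normalize them (e.g. into pointwise mutual information) before ranking related items.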

import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.corpus import wordnet

all_words = set()

def find_word_features(word):
    """Collect lemma names from every WordNet synset of `word`."""
    words = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            words.add(lemma.name())
    print(words)
    all_words.update(words)

We then downloaded and combined the trend data with some simple scripting. After that, we fitted a linear regression model per word, extracted each model's R-squared value, and ranked the words by their correlation with the crime rate one month later (see below for why one month).
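A minimal sketch of that ranking step, assuming each word's monthly trend scores are already aligned with the lagged crime counts as NumPy arrays (the variable names and numbers here are ours, purely for illustration):

```python
import numpy as np

def r_squared(x, y):
    """R² of a simple least-squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    pred = slope * x + intercept
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

# Hypothetical trend scores per word, aligned with 1-month-lagged crime counts.
crime = np.array([120., 110, 100, 95, 90, 80])
trends = {
    "go":  np.array([60., 55, 52, 48, 45, 40]),   # tracks crime closely
    "jet": np.array([30., 33, 25, 28, 22, 20]),
    "pot": np.array([10., 15, 20, 22, 30, 35]),   # negatively correlated
}

# Rank words by how much crime variance their trend explains.
ranking = sorted(trends, key=lambda w: r_squared(trends[w], crime), reverse=True)
```

Note that R² is sign-agnostic: a strongly negatively correlated word like "pot" can rank almost as high as a positively correlated one, which matters for the discussion later.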

Top words with high correlation with crime rate after 1 month, by Author

We were very surprised to find that some of these words have a really high R² value, especially the word "go". The word was generated from co-occurrence with "ecstasy," but in the end we assumed it simply correlates with phrases like "where to GO to get…" in Google searches.

Joining the datasets requires some data manipulation, whose technical details we won't get into. But we note that we applied both a 1-month and a 2-month shift to the aggregated crime count, since we believe that for a temporal correlation to exist there has to be a time lag between searching and our response variable.

Time lag can be a hyperparameter that we tune later.
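One way to treat the lag as a hyperparameter is to score each candidate shift and keep the strongest; a sketch with NumPy on hypothetical monthly series (the data here is synthetic, constructed so crime echoes search two months later):

```python
import numpy as np

def best_lag(search, crime, max_lag=3):
    """Return the lag (in months) with the strongest shifted correlation."""
    scores = {}
    for lag in range(1, max_lag + 1):
        # Compare search volume at month t with crime at month t + lag.
        x, y = search[:-lag], crime[lag:]
        scores[lag] = abs(np.corrcoef(x, y)[0, 1])
    return max(scores, key=scores.get)

# Hypothetical series: crime echoes the search signal two months later.
search = np.array([5., 9, 4, 8, 3, 7, 2, 6])
crime = np.roll(search, 2) * 10  # crime[t] = 10 * search[t - 2]
```

In practice one would score lags on a held-out period rather than the full series, to avoid picking a lag that only fits noise.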

%%bigquery narcotics_year_month_two_months_lag --project $project_id

WITH shifted_date_tbl AS (
  SELECT
    CAST(DATE_SUB(CAST(date AS datetime), INTERVAL 2 MONTH) AS timestamp) AS shifted_date,
    *
  FROM `bigquery-public-data.chicago_crime.crime`
),
date_count AS (
  SELECT
    CONCAT(CAST(EXTRACT(YEAR FROM shifted_date) AS STRING), '-',
           CAST(EXTRACT(MONTH FROM shifted_date) AS STRING)) AS year_month,
    COUNT(*) AS crime_count
  FROM shifted_date_tbl
  WHERE primary_type = 'NARCOTICS'
  GROUP BY year_month
  ORDER BY year_month DESC
)

SELECT *
FROM `your-project-id.crime_search_trends.chicago_narcotics` AS chicago_narcotics,
  date_count
WHERE year_month = Month

We don't want to do any feature engineering at this point, because we want to visualize the correlations before deciding how to transform each word's distribution.

For example, perhaps we want to take a log or a shift of the data, or cast the feature distribution into quantiles. But we first have to visualize the distribution.
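For instance, a log transform or quantile binning of a right-skewed trend score might look like the following NumPy sketch (the scores are made up for illustration):

```python
import numpy as np

scores = np.array([1., 2, 2, 3, 5, 8, 40, 95])  # heavily right-skewed

# Log transform compresses the long right tail.
log_scores = np.log1p(scores)

# Quantile binning: map each score to its quartile (0-3).
edges = np.quantile(scores, [0.25, 0.5, 0.75])
quartiles = np.digitize(scores, edges)
```

Which transform is appropriate depends on the shape of the distribution, which is exactly why we visualize first.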

We analyzed somewhere between 30 and 50 words, generated both by WordNet and by our own intuition. Many of them follow a positive trend, such as "jet", but we were surprised to find some with a negative correlation, such as "pot."

“Jet” trend score Vs. Narcotics crime count, by Author
“Pot” trend score Vs. Narcotics crime count, by Author

We were very curious why this was the case. Our original thought was that it had to do with the decriminalization of marijuana in our region of interest (Chicago, where it happened in 2020), but it turns out this was not the case, even when we correlated only data from before 2015.

Is it because of decriminalization & legalization? Seems like a no.
“pot” trend Vs. Narcotics crime count before 2015, by Author

Only after plotting the trends for positively correlated words side by side with negatively correlated words such as "pot" did we see the pattern. See the graph below.

Downward trend for narcotics crime and other terms with uptick of “pot” search trend, by Author

A plausible explanation is simply that, over time, narcotics-related crimes and arrests have been steadily decreasing, while searches and popular-culture attitudes towards marijuana have been rising, yielding a negative correlation.

This means that although "pot" would indeed be a good feature for predicting crime thanks to its negative correlation, it is not for the reasons we initially thought. This goes to show that our initial interpretations of and schemas for features may be skewed, and perhaps even wrong.

We also did some geographical analysis, using heat maps of search-term locations and the locations where crimes were committed.

%%bigquery crime_time_data --project $project_id

SELECT longitude, latitude
FROM `bigquery-public-data.chicago_crime.crime`
WHERE primary_type = "NARCOTICS"
  AND longitude IS NOT NULL
  AND latitude IS NOT NULL
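On the crime side, the coordinates above can be binned into a density grid that a plotting library then renders as a heat map; a minimal NumPy sketch (the points and grid size here are illustrative, not our real data):

```python
import numpy as np

# Hypothetical (longitude, latitude) pairs, as returned by the query above.
lon = np.array([-87.65, -87.66, -87.64, -87.70, -87.71])
lat = np.array([41.85, 41.86, 41.85, 41.90, 41.91])

# Bin the points into a 10x10 grid; each cell holds a crime count,
# which can be drawn as a heat map (e.g. with matplotlib's imshow).
density, xedges, yedges = np.histogram2d(lon, lat, bins=10)
```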

Unfortunately, there was no way to export this location data from Google Trends, so we could not map and aggregate it. Otherwise, we could have done a left join on city names after mapping the location of each crime to its county or city.

Chicago narcotics crime density map Vs. Search term density map, by Author
San Francisco narcotics crime density map, by Author

We won't go too deeply into the details of how to train a model in Colab; there are many guides out there on Medium. We will post another blog later on how to train a machine learning model directly in Colab with BigQuery commands.

Temporal/time series models fit what we have here: they use historical data to predict future trends. Examples include Facebook's Prophet model and the ARIMA/SARIMA family. We think these are generally a better fit for problems dealing with temporal data because they can account for seasonality and recurring patterns.

You can train them directly in Colab after doing some SQL transforms, then make a single call on Google Cloud Platform using CREATE MODEL statements.
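Whichever model you pick, the split must be chronological (never a random shuffle), so the model is always evaluated on data from after its training period; a minimal sketch:

```python
def time_series_split(series, train_frac=0.8):
    """Split a chronologically ordered series into train/test by position."""
    cut = int(len(series) * train_frac)
    return series[:cut], series[cut:]

months = list(range(24))  # 24 ordered monthly observations
train, test = time_series_split(months)
# train covers the earliest months, test the most recent ones.
```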

Training & testing set split for time series models, by Author

Regression models are similar to what we did with time-shifting the dataset: we train the model to predict what is going to happen T time steps in the future, given the current state. This type of model tends to overfit and ignore long-term trends, but if we have a massive amount of data over short periods, we can always retrain the model on data from just the last few weeks, days, or even hours.
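A sketch of that setup: build lag features, fit only on a recent window, and predict the next step (pure NumPy; the 12-month window and two lags are our assumptions for illustration):

```python
import numpy as np

def fit_recent(series, n_lags=2, window=12):
    """Fit a linear model predicting the next value from the last n_lags,
    using only the most recent `window` observations."""
    s = series[-(window + n_lags):]
    # Rows of X are [value at t-n_lags, ..., value at t-1]; y is the value at t.
    X = np.column_stack([s[i:len(s) - n_lags + i] for i in range(n_lags)])
    y = s[n_lags:]
    coef, *_ = np.linalg.lstsq(np.column_stack([X, np.ones(len(y))]), y, rcond=None)
    return coef

def predict_next(series, coef, n_lags=2):
    """One-step-ahead forecast from the fitted coefficients."""
    return float(np.dot(series[-n_lags:], coef[:-1]) + coef[-1])

series = np.arange(30, dtype=float)  # hypothetical counts with a steady trend
coef = fit_recent(series)
forecast = predict_next(series, coef)
```

Retraining then amounts to calling fit_recent again as new observations arrive, so the model keeps tracking the latest regime instead of a stale long-run trend.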

Readers on Towards Data Science will be very familiar with this type of model anyway, so we won't elaborate further 🙂

Let’s imagine a more cynical example of what we did in this study.

Suppose we are building a time-delayed crime prediction algorithm, and suppose we are an (unintentionally) under-informed data science team working to a short deadline. We build the model fast, it has great metrics, so we throw it over the wall to the DevOps team.

Since we don't know which words are useful, suppose we built an unsupervised machine learning model that finds "highly predictive" words in Google Trends and uses them as features. Over time, the model picks up on words that are disproportionately used by some marginalized groups, and now the model is biased in the worst way possible.

In the case of crime prediction, this model would be highly problematic and have detrimental effects on real people if used for the wrong purposes.

We have also shown in this study that our opinions on what makes a "good feature," and even our intuitions about why a feature may be good, can be flawed or simply confirmation bias, as with the negative correlation of the "pot" and "weed" keywords.

This is why it is important to have clear communication about ethical AI within and between data science teams. Others may not be well informed about how AI works behind the scenes, and it is therefore our responsibility as engineers and data scientists to ensure that our models are not biased and harmful.

