Privacy Concerns After the Facebook – Cambridge Analytica Data Scandal

By Jessie Hobb On Aug 3, 2022

A cross-regional cross-language Twitter study about a major privacy scandal

In 2018, the firm Cambridge Analytica was accused of collecting and using the personal information of over 87 million Facebook users without their authorization. Opinions, facts, and stories related to it were shared on social media, including Twitter, where the hashtag #DeleteFacebook became a trending topic for several days.

While global attention to data privacy is increasing, most privacy research is conducted in only a few countries in North America and Europe. In this article, we describe an approach for studying data privacy over a larger geographical scope by analyzing social media content related to this major data privacy scandal. We also report our findings, the limitations of our methodology, and directions for future work.

You can find more details about our methodology and findings in our paper:

Felipe González-Pizarro, Andrea Figueroa, Claudia López, and Cecilia Aragon. Regional Differences in Information Privacy Concerns After the Facebook-Cambridge Analytica Data Scandal. Published in Computer Supported Cooperative Work (CSCW) 31, 33–77 (2022)

Our paper presents an analysis of more than a million public tweets related to the Facebook-Cambridge Analytica scandal. The dataset was divided by language (Spanish and English) and region (Latin America, Europe, North America, and Asia). Using word embeddings and manual content analysis, we studied and compared the semantic context in which privacy-related terms were used. Then, we contrasted our results with one of the most used information privacy concerns frameworks (IUIPC). In our results, we observed language and regional differences in privacy concerns that hint at a need for extensions of current information privacy frameworks.

We implemented a four-step methodology to identify differences in information privacy concerns by language and world region (see Figure 1). (1) Data Collection: Retrieving tweets associated with data privacy during a specific period. (2) Data Preprocessing: Filtering the data, removing retweets, and excluding tweets likely generated by bots. (3) Text Mining: Creating word embeddings (a multi-dimensional representation of a corpus) for the remaining tweets according to their language and world region. (4) Coding & Analysis: Analyzing similarities and differences in the semantic contexts of privacy keywords in the word embeddings.

Figure 1: Four-step methodology | Image by author

We used Tweepy to collect Spanish and English tweets related to the Facebook-Cambridge Analytica scandal between April and July 2018. Tweepy is a Python library for accessing the standard real-time streaming Twitter API, which retrieves tweets that match a given query (e.g., “#DeleteFacebook”, “#Cambridge Analytica”). The complete list of terms and queries we used to build our dataset is available online. Our collection of tweets was allowed under the terms and conditions of the Twitter API.

As the goal was to analyze people’s opinions about information privacy, we pre-processed our data in three steps. First, we removed retweets to avoid analyzing exact duplicates. Second, we sought to identify and filter out tweets generated by bots. Finally, we linked tweets with their corresponding world region. The last two steps are explained further below.
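The retweet-removal step can be sketched as follows. This is a minimal illustration, not our actual pipeline: it assumes tweets are stored as dictionaries in the Twitter API v1.1 format, where a retweet carries a `retweeted_status` object. The function name and the sample tweets are hypothetical.

```python
def remove_retweets(tweets):
    """Keep only original tweets from a list of v1.1-style tweet dicts."""
    # In the v1.1 payload, a retweet embeds the original tweet
    # under the "retweeted_status" key; original tweets lack it.
    return [t for t in tweets if "retweeted_status" not in t]


# Hypothetical sample data, for illustration only.
sample = [
    {"id": 1, "text": "Time to #DeleteFacebook"},
    {"id": 2, "text": "RT @user: Time to #DeleteFacebook",
     "retweeted_status": {"id": 1}},
]

originals = remove_retweets(sample)
print([t["id"] for t in originals])
```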

Bot Detection

We used Botometer [1] to detect and remove tweets created by bots. Botometer uses machine learning to analyze more than one thousand features, including tweets’ content and sentiment, accounts’ and friends’ metadata, retweet/mention network structure, and posting behavior, to generate a score that ranges from 0 to 1. A higher score suggests a higher likelihood that the inspected account is a bot. This tool has reached high accuracy (94%) in predicting both simple and sophisticated bots.
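Once each account has a bot score, filtering reduces to a threshold comparison. The sketch below assumes the scores were already retrieved and stored in a dictionary keyed by user id; the cutoff of 0.5, the `user_id` field, and the function name are hypothetical, since this article does not state the exact values we used. Accounts without a score are dropped, in line with the limitation on suspended and protected accounts discussed later in this article.

```python
BOT_THRESHOLD = 0.5  # hypothetical cutoff, not the value from the paper


def remove_bot_tweets(tweets, bot_scores, threshold=BOT_THRESHOLD):
    """Keep tweets whose author has a bot score below the threshold."""
    kept = []
    for tweet in tweets:
        score = bot_scores.get(tweet["user_id"])
        # Accounts that could not be scored (e.g., suspended or protected)
        # are removed, because we cannot claim a human wrote their tweets.
        if score is not None and score < threshold:
            kept.append(tweet)
    return kept


# Hypothetical scores in the 0-1 range, for illustration only.
scores = {"alice": 0.08, "spam_bot": 0.93}
tweets = [
    {"user_id": "alice", "text": "I just read the new privacy policy"},
    {"user_id": "spam_bot", "text": "Click here!!!"},
    {"user_id": "unknown", "text": "No score available for this account"},
]

human_tweets = remove_bot_tweets(tweets, scores)
print([t["user_id"] for t in human_tweets])
```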

Identifying the country of residence of Twitter users

We used the GeoNames API to identify the country of residence of Twitter users in our datasets. On Twitter, users can self-report their city or country of residence. Nevertheless, textual references to geographic locations can be ambiguous. For example, over 60 places worldwide are named “Paris”[2]. GeoNames helps with this challenge: it is a collaborative gazetteer project that contains more than 11M entries and alternate names for locations worldwide in various languages. This tool has yielded results with an accuracy above 80%[2].

We found that 81% of users in our Spanish dataset and 79% in our English dataset had filled in the city or country fields in their profiles. However, the GeoNames API could not detect the users’ location in several cases, for example, when inaccurate information was provided (e.g., “Planet Earth.. where everyone else is from”, “Mars”). Nonetheless, the tool was able to identify the location of users who created 59% of the Spanish tweets and 60% of the English ones.
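As a rough illustration of the lookup, the sketch below builds a request for the GeoNames `searchJSON` endpoint and extracts the country code of the top match from its JSON response. It is not our exact implementation: the choice of `maxRows=1`, the helper names, and the trimmed sample responses are assumptions made for this example, and a real call requires a registered GeoNames username.

```python
import json
from urllib.parse import urlencode

GEONAMES_SEARCH_URL = "http://api.geonames.org/searchJSON"


def build_search_url(location_text, username):
    """Build a GeoNames search URL for a self-reported profile location."""
    params = {"q": location_text, "maxRows": 1, "username": username}
    return GEONAMES_SEARCH_URL + "?" + urlencode(params)


def country_from_response(response_text):
    """Return the ISO country code of the top match, or None if no match."""
    payload = json.loads(response_text)
    matches = payload.get("geonames", [])
    if not matches:
        # Unresolvable profile locations yield an empty result list.
        return None
    return matches[0].get("countryCode")


# Trimmed, hand-written responses in the shape GeoNames returns.
resolved = '{"totalResultsCount": 1, "geonames": [{"name": "Paris", "countryCode": "FR"}]}'
unresolved = '{"totalResultsCount": 0, "geonames": []}'

url = build_search_url("Paris", "my_geonames_user")  # hypothetical username
print(url)
print(country_from_response(resolved), country_from_response(unresolved))
```

Taking only the top-ranked match is the simplest way to resolve ambiguous names such as “Paris”; it is also one reason why automatic geolocation is not perfectly accurate.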

Five language-regional datasets were created to compare information privacy concerns by geographical regions. The Spanish Twitter dataset was divided into two sets: tweets written by users from (1) Latin America and (2) Europe. Similarly, the English dataset was divided into three sets: tweets written by users from (1) North America, (2) Europe, and (3) Asia.
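The split into language-regional datasets can be sketched as a simple grouping. The country-to-region table below is a small hypothetical excerpt for illustration, and the `lang` and `country` fields are assumed names; only the five language-region pairs come from our study.

```python
from collections import defaultdict

# Hypothetical excerpt of a country-to-region table (ISO country codes).
COUNTRY_TO_REGION = {
    "US": "North America", "CA": "North America",
    "CL": "Latin America", "AR": "Latin America",
    "ES": "Europe", "GB": "Europe",
    "IN": "Asia",
}

# The five language-region pairs analyzed in the study.
STUDIED_PAIRS = {
    ("es", "Latin America"), ("es", "Europe"),
    ("en", "North America"), ("en", "Europe"), ("en", "Asia"),
}


def split_by_language_and_region(tweets):
    """Group geolocated tweets into the studied language-region datasets."""
    datasets = defaultdict(list)
    for tweet in tweets:
        region = COUNTRY_TO_REGION.get(tweet.get("country"))
        key = (tweet["lang"], region)
        if key in STUDIED_PAIRS:
            datasets[key].append(tweet)
    return dict(datasets)


tweets = [
    {"lang": "es", "country": "CL", "text": "Mi privacidad importa"},
    {"lang": "en", "country": "GB", "text": "Deleting my account"},
    {"lang": "en", "country": None, "text": "Location unknown"},
]
datasets = split_by_language_and_region(tweets)
print(sorted(datasets))
```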

Word embeddings are a type of word representation that encodes the meaning of terms in vectors such that related words are expected to be closer in the vector space. Analyzing the closest terms to a given term can reveal the semantic context in which it is used [3,4].

To enable cross-language and cross-regional comparisons, we created a set of word embeddings. First, we built word embeddings for the full Spanish and English datasets (containing geolocated and non-geolocated tweets). Then, we generated word embeddings for each of our five language-regional datasets.

When creating word embeddings, we considered different architecture combinations that involve Word2vec/FastText, CBOW/Skipgram, and different numbers of dimensions and epochs. As there is still no consensus about which word embedding evaluation method is most adequate, each architecture was evaluated with 18 intrinsic evaluation methods using a word embedding benchmark library.
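The search over architecture combinations amounts to a grid. The sketch below only enumerates candidate configurations; the dimension and epoch values are hypothetical placeholders, as this article does not list the ones we tried, and the training and evaluation of each candidate are left out.

```python
from itertools import product

# Model families and training algorithms named in the article.
MODELS = ["word2vec", "fasttext"]
ALGORITHMS = ["cbow", "skipgram"]
# Hypothetical values, for illustration only.
DIMENSIONS = [100, 300]
EPOCHS = [5, 10]

# Every combination becomes one candidate configuration.
configs = [
    {"model": m, "algorithm": a, "dimensions": d, "epochs": e}
    for m, a, d, e in product(MODELS, ALGORITHMS, DIMENSIONS, EPOCHS)
]

print(len(configs))
print(configs[0])
```

Each candidate would then be trained on the corpus and scored with the intrinsic evaluation methods, keeping the best-performing configuration.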

We systematically examined the semantic contexts in which information privacy terms appear according to the word embeddings. We focused our investigation on four keywords in English (information, privacy, users, and company) and their corresponding translations in Spanish (información, privacidad, usuarios, and empresa). For each embedding, we retrieved the closest terms to the four keywords. The closeness between each term and a keyword was measured using cosine similarity. For instance, the closest terms for the keyword information in the English word embedding were info, data, details, and personal, in that order (see Figure 2).
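To make the retrieval step concrete, here is a minimal sketch of ranking terms by cosine similarity to a keyword. The two-dimensional vectors are invented toy values, not taken from our embeddings, and the function names are hypothetical; real embeddings have many more dimensions and a much larger vocabulary.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest_terms(embedding, keyword, k):
    """Return the k terms most similar to the keyword, best first."""
    target = embedding[keyword]
    scored = [
        (term, cosine_similarity(target, vector))
        for term, vector in embedding.items()
        if term != keyword
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors, for illustration only.
toy_embedding = {
    "information": [1.0, 0.0],
    "info": [0.9, 0.1],
    "data": [0.7, 0.3],
    "cat": [0.0, 1.0],
}

top = closest_terms(toy_embedding, "information", 2)
print([term for term, _ in top])
```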

Figure 2: Top 20 closest terms to “information” and “privacy” in the Spanish and English word embeddings. Terms in Spanish were translated into English by the authors. Full results are available online | Image by author

Collecting and analyzing the semantic contexts of these privacy-related keywords allowed us to observe the presence of terms related to information privacy concerns in the collected tweets. We systematically conducted open coding of these terms. After several iterations, we developed a set of categories to characterize them. Finally, to assess whether information privacy concerns were present, we contrasted these categories with IUIPC, a widely accepted framework for describing internet users’ information privacy concerns. We found relationships among some of our categories, the three IUIPC dimensions, and our initial keywords (see Figure 3).

Figure 3: We identify several categories that can be easily mapped to the three dimensions of the Internet User Information Privacy Concerns (IUIPC): collection, awareness, and control. In this way, we find evidence that social media content can reveal information about privacy concerns | Image by author

Then, we evaluated differences in information privacy concerns across languages and world regions. To do so, we used a Chi-squared test to assess whether the proportions of terms in the semantic contexts were significantly different across word embeddings. We accounted for multiple comparisons in all of these tests by applying the Šidák alpha adjustment. This method allowed us to control the probability of making false discoveries when performing multiple hypothesis tests.
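The statistical step can be illustrated with a small, self-contained sketch. It computes the Chi-squared statistic for a contingency table of term counts, the p-value for the special case of a 2x2 table (one degree of freedom, no continuity correction), and the Šidák-adjusted significance level 1 - (1 - alpha)^(1/m) for m comparisons. The counts and the number of comparisons below are made up for illustration; in practice a statistics library would be used for tables of any size.

```python
import math


def chi_squared_statistic(table):
    """Pearson Chi-squared statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_one_df(stat):
    """Upper-tail p-value of a Chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


def sidak_alpha(alpha, n_comparisons):
    """Per-test significance level under the Sidak adjustment."""
    return 1 - (1 - alpha) ** (1 / n_comparisons)


# Made-up counts: terms inside/outside a category for two embeddings.
table = [[30, 70],
         [50, 50]]

stat = chi_squared_statistic(table)
p_value = p_value_one_df(stat)
adjusted_alpha = sidak_alpha(0.05, 10)  # 10 comparisons is a hypothetical number

print(round(stat, 3), p_value < adjusted_alpha)
```

A difference is reported as significant only when its p-value falls below the adjusted level, which is stricter than the nominal 0.05.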

IUIPC is a theory-based model widely used to study information privacy concerns on the internet. It includes three constructs: Collection, which refers to data gathering; Control, which involves concerns about data governance; and Awareness, which refers to the acknowledgment of organizational information privacy practices.

Our results suggest a more granular categorization of the Awareness IUIPC concept. For example, it could include more specific sub-topics that users can be aware of, such as privacy and security terms (e.g., cybersecurity, confidentiality), security mechanisms (e.g., credentials, encrypted), and privacy and security risks (e.g., scams, grooming). The presence of terms that fit these categories reveals that they are already part of public online conversations around privacy. A distinction among broad privacy and security terms, mechanisms to protect data, and potential data risks might be helpful to further describe the kinds of knowledge people have. Additionally, awareness about some of these subtopics might be more influential than others. For example, knowing about risks and mechanisms might be a sign of deeper privacy concerns, while knowing broad privacy and security terms might not. The distinction between sub-topics could also guide the efforts of users, educators, and practitioners to enhance privacy literacy.

The presence of the regulation category highlights its importance in relation to information privacy concerns. Regulation refers to laws or rules that aim to control the use of personal data. The emergence of this category from our open coding confirms its relevance through its frequent appearance in public posts about a data breach scandal. These regulations are not only a topic of data and law experts but also seem to be part of the public discourse around online data privacy.

English speakers emphasize data collection more than Spanish speakers.

Our analysis reveals that English speakers place significantly more emphasis on data collection than Spanish speakers when they discuss the privacy keywords online. This difference invites researchers and practitioners to explore whether data privacy campaigns tailored to specific populations are more effective. For example, populations concerned about collection might need more information about the benefits of sharing their information.

North American privacy concerns are not generalizable to other regions.

We also observe significant regional differences in Awareness. In particular, data from North America shows the smallest emphasis on Awareness, while data from Latin America shows the highest. This finding is particularly important because most studies on information privacy concerns are centered on the USA. It warns us against the (sometimes implicit) assumption that North American privacy concerns are generalizable to other regions. Our results provide observational evidence that more diverse populations must be included to better understand the phenomena around data privacy. This finding also invites practitioners to address other regions, such as Latin America, with different service and privacy policy approaches. For example, populations more concerned about Awareness might be more receptive to companies that communicate more transparently about their use of personal data.

As with any study, our research has limitations. We collected data through the free standard streaming Twitter API using specific hashtags and keywords. Thus, we only had access to a limited sample of all the tweets about the scandal. We used Botometer to detect and remove tweets likely to be created by bots. This tool can only analyze public Twitter accounts; therefore, it could not be used on accounts that were suspended or had protected tweets at the time of our analysis. We decided to remove the tweets from such accounts from our datasets because we cannot confidently claim that humans generated them. Indeed, previous research suggests that it is likely that social bots were present in this cohort. Moreover, we focused our investigation on four keywords in English (information, privacy, users, and company) and their corresponding translations in Spanish. While using synonyms would have brought similar semantic contexts, adding more concepts can strengthen the results. Future work can explore other keywords such as intimacy and consumers.

Our paper uses an alternative approach to study information privacy concerns over a large geographical scope. This approach aims to discover knowledge from a large-scale social media dataset on a topic for which a ground truth does not exist. Unfortunately, such ground truth is unlikely to exist because large-scale, multi-country, and multi-language surveys are too expensive to conduct [5].

We carefully analyzed more than a thousand terms of the semantic contexts, conducted open coding to formulate a data-grounded categorization, and contrasted our categorization with IUIPC [6], one of the well-accepted theoretical conceptualizations of information privacy concerns.

Our paper discusses how our findings can extend current conceptualizations of information privacy concerns. Finally, we examine how they might relate to regulations about personal data usage in the regions we analyzed.

Future work can dig deeper into the observed differences and study the potential causes. Future studies might build upon our work to examine privacy concerns considering more languages, geographical locations, or different information privacy frameworks. Using our methodology to compare datasets across more extended periods could help determine whether the semantic contexts of the privacy keywords change over time.

If you are interested, you can find me on Twitter or visit my website :).

Thanks to Claudia López, Ignacio Tampe, and Adam Geller for suggesting improvements to this article.

[1] Davis, C. A., Varol, O., Ferrara, E., Flammini, A., & Menczer, F. (2016, April). BotOrNot: A system to evaluate social bots. In Proceedings of the 25th International Conference Companion on World Wide Web (pp. 273–274).

[2] Jackoway, A., Samet, H., & Sankaranarayanan, J. (2011, November). Identification of live news events using Twitter. In Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Location-Based Social Networks (pp. 25–32).

[3] González, F., Figueroa, A., López, C., & Aragon, C. (2019, November). Information Privacy Opinions on Twitter: A Cross-Language Study. In Conference Companion Publication of the 2019 on Computer Supported Cooperative Work and Social Computing (pp. 190–194).

[4] Rho, E. H. R., Mark, G., & Mazmanian, M. (2018). Fostering civil discourse online: Linguistic behavior in comments of #MeToo articles across political perspectives. Proceedings of the ACM on Human-Computer Interaction, 2(CSCW), 1–28.

[5] Li, Y., Rho, E. H. R., & Kobsa, A. (2020). Cultural differences in the effects of contextual factors and privacy concerns on users’ privacy decision on social networking sites. Behaviour & Information Technology, 1–23.

[6] Malhotra, N. K., Kim, S. S., & Agarwal, J. (2004). Internet users’ information privacy concerns (IUIPC): The construct, the scale, and a causal model. Information Systems Research, 15(4), 336–355.