Text Network Analysis: A Concise Review of Network Construction Methods | by Petr Korab | Jun, 2022

A concise, methodical guide, from research question definition to network structure estimation.

Image 1. Text network plot via Textnets. Image by author

This article explores strategies for constructing network structures from text data. It is the second part of the series on text network analysis in Python. As a prerequisite, please read my opening article, which describes the main concepts of text network analysis (the article is here). We will follow the steps defined by Borsboom et al. (2021) and briefly introduced in the previous article.

Image 2. Schematic representation of the workflow used in network approaches. Adapted from Borsboom et al. (2021). Image by draw.io

The steps that follow the definition of the research question depend on the structure of our data. Therefore, the key question to ask right at the beginning is: What’s the input to the network model?

We might work with:

  • raw, unprocessed data
  • cleaned data with nodes-edges structure

We can also turn the first into the second: transform the raw data, clean it, and create the nodes-edges structure.

First, let’s start with a question to answer:

Research question: what terminology is shared between research fields in journal article titles?

The Research Articles Dataset from Kaggle, which contains abstracts for journal articles on six topics (Computer Science, Mathematics, Physics, Statistics, Quantitative Biology, and Quantitative Finance), is a good option for illustrating the coding in Python. The data license is here.

Here is what it looks like:

Image 3. First rows in Research Articles Dataset

Textnets was developed following Bail’s (2016) PNAS paper. It is available in both Python and R. By default, it uses the Leiden algorithm for community detection in text data. Algorithms of this kind help discover the structure of large and complex networks by identifying groups of nodes that are densely connected among themselves but sparsely connected to the rest of the network (see Traag et al., 2019; Yang et al., 2016). Learn more about other detection algorithms here.

Implementation

Let’s see how it works. First, we import Textnets and Pandas and read the data. It is important to set index_col='research_field' so that the graph is drawn correctly (see the complete code on my GitHub). Next, we build the corpus from the column of article titles. To keep the illustrative network simple, we use a subset of 10 article titles from each research field.

Textnets then removes stop words, applies stemming, removes punctuation marks, numbers, URLs, and the like, and creates a text network. The min_docs parameter specifies the minimum number of documents a term must appear in to be included in the network.
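To make the thresholding step concrete, here is a minimal pure-Python sketch of the idea. It is not Textnets' implementation: it assumes simple whitespace tokenization, a tiny hand-picked stop-word list, and no stemming. The function names, the min_docs argument of the helper, and the sample titles are all made up for illustration.

```python
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption; real pipelines use far larger lists).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "with"}


def tokenize(text):
    """Lowercase, replace punctuation and digits with spaces, drop stop words."""
    cleaned = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


def doc_term_edges(docs, min_docs=1):
    """Return {(doc_label, term): count} for terms found in >= min_docs documents.

    `docs` is a list of (label, text) pairs; texts that share a label are
    pooled into a single document node.
    """
    counts = defaultdict(Counter)
    for label, text in docs:
        counts[label].update(tokenize(text))

    # Document frequency: in how many document nodes does each term occur?
    doc_freq = Counter(term for terms in counts.values() for term in terms)

    return {
        (label, term): n
        for label, terms in counts.items()
        for term, n in terms.items()
        if doc_freq[term] >= min_docs
    }


# Hypothetical titles, labelled by research field.
titles = [
    ("Statistics", "Deep learning for time series"),
    ("Computer Science", "Deep networks with memory"),
    ("Quantitative Finance", "Risk models for time series"),
]

edges = doc_term_edges(titles, min_docs=2)
for (field, term), n in sorted(edges.items()):
    print(field, term, n)
```

With a threshold of 2, only terms shared by at least two fields survive, which is exactly the kind of edge the research question asks about.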

Now, let’s plot the network. The show_clusters option marks the partitions found by the Leiden detection algorithm. It identifies document–term groups that appear to form part of the same theme in the texts.
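For readers who want to see community detection on a document-term graph without Textnets, the sketch below may help. Note the swap: it uses NetworkX's greedy modularity maximization, not the Leiden algorithm, so it only illustrates the general idea of partitioning. The edge list is hypothetical.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical document-term edges (field, term).
edges = [
    ("Statistics", "deep"), ("Statistics", "empirical"),
    ("Mathematics", "deep"), ("Mathematics", "empirical"),
    ("Quantitative Finance", "time"), ("Quantitative Finance", "risk"),
    ("Computer Science", "time"), ("Computer Science", "risk"),
    # A single bridge between the two groups.
    ("Computer Science", "deep"),
]
G = nx.Graph(edges)

# Greedy modularity maximization; a stand-in for Leiden, NOT the same algorithm.
communities = greedy_modularity_communities(G)

# Map every node to the index of the community it was assigned to.
membership = {node: i for i, nodes in enumerate(communities) for node in nodes}
for i, nodes in enumerate(communities):
    print(i, sorted(nodes))
```

Each community mixes document nodes and term nodes, which is what makes the clusters readable as themes.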

Here is the net we get:

Image 4. Text network via Textnets. Image by author

Findings

We can clearly distinguish keywords that are shared by more than one research field. These are, e.g., “time” and “risk” (Quantitative Finance and Computer Science), “deep” and “empirical” (Mathematics, Statistics, and Computer Science), or “subject” and “memory” (Quantitative Biology and Computer Science).

These findings strongly depend on the sample size. The richer the dataset, the more precise the results. We can draw the network structure in many other ways, depending on the research question we set at the beginning. Take a look at the Textnets tutorial here.

Photo by Lute on Unsplash

We can also work with data that has a clear nodes-edges structure, although getting there often involves cleaning and pre-processing. To explore possible scenarios, let’s use the IMDb 50K Movie Reviews dataset, which contains movie reviews and their evaluated sentiment (positive/negative). The data license is here.

NetworkX is a Python library for the creation and study of complex networks. It is a mature package with extensive documentation, and it is used to draw networks in many tutorials and e-books. Hagberg et al. (2008), who co-authored the package, describe the internal structure of NetworkX. It can display various network structures; text data usually require some transformation to serve as the input.

Text networks are often used to display keyword co-occurrences in a text (Shim et al., 2015; Krenn and Zeilinger, 2020, and many others). We will use the same approach and, as an example use case, look at the associations that movie reviewers make with the famous Matrix film.

The data consists of two sets of nodes: the monitored movie title (Matrix) and a group of selected movie titles that reviewers may associate with Matrix. Edges are represented by co-occurrences of the nodes in the same review. An edge exists only if a reviewer mentions both the monitored title and an associated title in the same review.
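The co-occurrence rule described above can be sketched in a few lines of standard-library Python. This is an illustration under simplifying assumptions, not the code behind the article's dataset: it matches titles by case-insensitive substring search (so "Thor" would also match "thorough"), and the reviews, the candidate titles, and the function name are invented.

```python
from collections import Counter


def cooccurrence_edges(reviews, monitored, candidates):
    """Count the reviews that mention both `monitored` and each candidate title."""
    edges = Counter()
    for review in reviews:
        text = review.lower()
        if monitored.lower() not in text:
            continue  # no edge without the monitored title
        for title in candidates:
            # Naive substring match; a real pipeline would match whole words.
            if title.lower() in text:
                edges[(monitored, title)] += 1
    return edges


# Hypothetical reviews.
reviews = [
    "Matrix changed sci-fi, much like Tron did years earlier.",
    "Thor is fun, but it is no Matrix.",
    "Tron and Matrix both imagine life inside a computer.",
    "A quiet drama with no special effects at all.",
]

edges = cooccurrence_edges(reviews, "Matrix", ["Thor", "Tron", "Alien"])
for (node1, node2), count in edges.most_common():
    print(node1, node2, count)
```

Titles that never co-occur with the monitored film simply produce no edge, so they do not appear in the network at all.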

The research question: which popular sci-fi movies are primarily associated with Matrix?

The data has the following structure:

Image 5. Data, “nodes-edges” example. Image by author

Implementation

After reading the data, let’s do some simple transformation and exploratory steps that help us understand the graph and plot it properly.

1. Calculate edge size

To quantify edges, we create a separate column edge_width in our data with the size for every edge in the node2 column.
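One way this step might look in Pandas is sketched below. The column names node2 and edge_width follow the text; the node1 column and the sample rows are assumptions made for the example, and the article's own code may differ.

```python
import pandas as pd

# Hypothetical nodes-edges data: one row per review mentioning both titles.
df = pd.DataFrame({
    "node1": ["Matrix"] * 5,
    "node2": ["Thor", "Tron", "Tron", "Thor", "Alien"],
})

# edge_width = number of rows in which each node2 value occurs.
df["edge_width"] = df["node2"].map(df["node2"].value_counts())

# Keep one row per edge for plotting.
edge_table = df.drop_duplicates().reset_index(drop=True)
print(edge_table)
```

Counting rows per node2 value works here because every row shares the same node1; with several monitored titles, the count would need to be taken per (node1, node2) pair.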

2. Create the graph and print the nodes and edges to prevent possible misinterpretations
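A minimal sketch of this step with NetworkX follows. The weighted edge list is hypothetical, and storing the width under an edge attribute called edge_width is an assumption made to mirror the column name above.

```python
import networkx as nx

# Hypothetical weighted edge list: (node1, node2, edge_width).
weighted_edges = [
    ("Matrix", "Thor", 2),
    ("Matrix", "Tron", 2),
    ("Matrix", "Alien", 1),
]

G = nx.Graph()
# Store the third tuple element under the "edge_width" edge attribute.
G.add_weighted_edges_from(weighted_edges, weight="edge_width")

# Inspect the graph before plotting to catch duplicated or misspelled nodes.
print(sorted(G.nodes()))
print(sorted(G.edges(data="edge_width")))
```

Printing both lists is a cheap sanity check: a title spelled two different ways would show up as two separate nodes.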

3. Plot a network chart

After briefly checking that no unexpected errors occurred, we move on: we shape the original graph G into a star graph, keep the properties of the plot in options, and plot it with matplotlib.

The code plots this star-shaped network graphic:
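The plotting step could be sketched as follows. This is one possible approach, not the article's code: the star shape comes from a hand-written layout helper that puts the monitored title in the centre, and the helper names, the styling values in options, and the sample graph are all assumptions.

```python
import math

import networkx as nx


def star_layout(hub, leaves, radius=1.0):
    """Place `hub` at the origin and spread `leaves` evenly on a circle."""
    pos = {hub: (0.0, 0.0)}
    for i, leaf in enumerate(leaves):
        angle = 2 * math.pi * i / len(leaves)
        pos[leaf] = (radius * math.cos(angle), radius * math.sin(angle))
    return pos


def draw_star(G, hub, path):
    """Draw G as a star around `hub` and save the figure to `path`."""
    # Imported here so the layout helper above needs no plotting backend.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display required
    import matplotlib.pyplot as plt

    pos = star_layout(hub, [n for n in G if n != hub])
    # Line width of each edge follows its edge_width attribute.
    widths = [G[u][v]["edge_width"] for u, v in G.edges()]
    options = {"node_color": "lightblue", "node_size": 2000, "font_size": 10}
    nx.draw(G, pos, with_labels=True, width=widths, **options)
    plt.savefig(path)
    plt.close()


# Hypothetical graph with edge widths.
G = nx.Graph()
G.add_weighted_edges_from(
    [("Matrix", "Thor", 2), ("Matrix", "Tron", 2), ("Matrix", "Alien", 1)],
    weight="edge_width",
)
pos = star_layout("Matrix", [n for n in G if n != "Matrix"])
```

Calling draw_star(G, "Matrix", "matrix_star.png") would write the figure to disk; the file name is arbitrary.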

Image 6. Text network plot via NetworkX. Image by author

Findings: the data is not very rich in movie titles, but the network analysis suggests that reviewers mostly associate Matrix with Thor and Tron. With such a small dataset, this seems obvious after a brief data inspection. Imagine, however, that you have a larger dataset with a considerable number of nodes. Here, network analysis greatly helps to describe the dataset.

This article could not provide an exhaustive review of text network construction methods. I have omitted a detailed overview of various network structures since there are multiple sources on the internet. Instead, it outlined a couple of methodological points in this area.

To summarize, here are several tips to follow:

  • First, clearly define the research question in your particular project. Text network analysis is an empirical approach to provide the answers.
  • Next, take a look at the dataset structure. If the research question requires data transformation, do it.
  • The created network might not be the final output of the analysis but rather an input for more complex investigations: graphics, machine learning models, forecasting, etc.

The complete code is on my GitHub.

The following article in this series will shed more light on simple and more complex graphs for text data analysis. The final piece will explore the recent state-of-the-art application of semantic networks for forecasting. Stay updated!


[1] Bail, C. A. 2016. Combining natural language processing and network analysis to examine how advocacy organizations stimulate conversation on social media. Proceedings of the National Academy of Sciences, vol. 113, no. 42.

[2] Borsboom, D., et al. 2021. Network analysis of multivariate data in psychological science. Nature Reviews Methods Primers, vol. 1, no. 58.

[3] Hagberg, A. A., Schult, D. A., Swart, P. J. 2008. Exploring network structure, dynamics, and function using NetworkX. In Proceedings of the 7th Python in Science Conference (SciPy2008), Gaël Varoquaux, Travis Vaught, and Jarrod Millman (Eds.), Pasadena, CA, USA, pp. 11–15.

[4] Krenn, M., Zeilinger, A. 2020. Predicting research trends with semantic and neural networks with an application in quantum physics. Proceedings of the National Academy of Sciences, vol. 117, no. 4.

[5] Shim, J., Park, C., Wilding, M. 2015. Identifying policy frames through semantic network analysis: an examination of nuclear energy policy across six countries. Policy Sciences, vol. 48.

[6] Traag, V. A., Waltman, L., Van Eck, N. J. 2019. From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports, vol. 9, no. 5233.

[7] Yang, Z., Algesheimer, R., Tessone, C. J. 2016. A Comparative Analysis of Community Detection Algorithms on Artificial Networks. Scientific Reports, vol. 6, no. 30750.


