
Construct a Knowledge Graph on the Neo4j Cloud | by Sixing Huang | May, 2022



How to store your CAZy knowledge in AuraDB

Our world is full of data and information, but it takes effort and time to turn them into knowledge and ultimately wisdom. One of the key processes is the formatting of data: suitable formats ease understanding and speed up discovery. A knowledge graph is one such format.

A knowledge graph is a network that represents knowledge in a specific domain. It is also called a semantic network because it connects nodes of diverse types, e.g. objects, people, or locations, via semantic relations into one web. Even though a knowledge graph can contain many different kinds of things, it is intuitive because its organization mirrors the way we think, so a user can quickly grasp what it represents. Furthermore, it is visual and searchable: a user can gain a quick overview by browsing the network interactively or learn the specifics via database queries.

We are witnessing the explosive growth of knowledge graphs across industries. A knowledge graph fills the infoboxes on Google's result pages. Amazon uses knowledge graphs on Amazon.com, Amazon Music, Prime Video, and Alexa, and so does Walmart. These companies use knowledge graphs to discover new insights, make recommendations, and build semantic search.

In my previous articles, I wrote about how to transfer three public medical knowledge graphs into a chatbot called Doctor.ai. Later, I built an NLP pipeline that uses GPT-3 to extract relationships from raw texts (here and here). In this article, I am going to use that pipeline to build a CAZy knowledge graph on the Neo4j cloud, AuraDB.

CAZy stands for Carbohydrate-Active enZYmes. It is a web portal that provides information about enzymes (CAZymes) that synthesize, modify, or degrade carbohydrates (from the enzyme's perspective, these carbohydrates are substrates). A year ago, I wrote an article about analyzing its content in Neo4j. Here I am going to extend that project: I will extract substrates and enzymatic interactions from public research articles and add them to the CAZy data to form a new knowledge graph (Figure 1). Finally, I will create a voice chatbot frontend so that the user can query the knowledge graph in plain English. The chatbot is based on my previous project Doctor.ai.

Figure 1. Construction workflow of the CAZy knowledge graph. Image by author.

The Python code for this project is hosted on my GitHub repository here.

The NLP pipeline for text extraction is based on my previous project. In this project, I have updated the repository with a training file for CAZy relations.

And the chatbot frontend is hosted as a branch called “cazy_kg” under my doctorai_eli5 repository.

Finally, the data model with data for Aura’s Data Importer is hosted here.

https://datathon-medium-file.s3.amazonaws.com/data-importer-2022-05-23.zip

Figure 2. The Neo4j schema in this project. Image by author.

Even though the knowledge graph is small, it consists of many small pieces (Figure 2). There are four types of nodes and nine types of relations. I downloaded the newest CAZy data via the DOWNLOAD CAZY function on its website. From this data, we get the Genome and Cazy nodes and the Genome -[:HAS_CAZY]-> Cazy relations. Furthermore, we can deduce several relations from it: the synteny and homology of some CAZy modules (Cazy -[:COEXISTS]-> Cazy and Cazy -[:RELATES]-> Cazy), the subfamily and family pairs (Cazy -[:IS_A]-> Cazy), and the CAZy-substrate binding pairs (Cazy -[:BINDS]-> Substrate). Finally, I collected four example research articles and extracted the Cazy -[:DEGRADES]-> Substrate relations with my NLP pipeline (described in Section 2). I included the Digital Object Identifier (DOI) of the information source for each relation, and I also indicated whether the source is primary (reports of original findings and ideas) or secondary (general works based on primary sources).
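To make the first step concrete, here is a minimal sketch of turning the CAZy download into Genome -[:HAS_CAZY]-> Cazy rows for Aura's Data Importer. The column names and the sample rows are my assumptions for illustration; the real download has a different layout.

```python
import csv
import io

# Illustrative stand-in for a slice of the CAZy download (tab-separated).
SAMPLE = """genome\tcazy_family
Formosa agariphila KMM 3901\tGH16
Formosa agariphila KMM 3901\tCBM6
Zobellia galactanivorans DsijT\tGH16
"""

def has_cazy_rows(tsv_text):
    """Yield one (genome, cazy_family) pair per HAS_CAZY relationship."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        yield row["genome"], row["cazy_family"]

rows = list(has_cazy_rows(SAMPLE))
```

Each pair becomes one relationship row in the importer; the distinct values of each column become the Genome and Cazy nodes.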

The download contains 205,462 genomes and many more relationships. Because the free-tier Aura only allows 50,000 nodes and 175,000 relationships, we need to downsize our dataset a bit. In this project, I only keep the 650 genomes from the phylum Bacteroidetes. Bacteroidetes are a group of bacteria and important polysaccharide degraders in the biosphere: a large contingent of their genomes is dedicated to the breakdown of polysaccharides. Members such as Prevotella are regular inhabitants of the rumen of cattle and sheep, and also of the human oral cavity and large intestine. Other members such as Formosa and Zobellia live in the sea, where they degrade algal polysaccharides.
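The downsizing step itself is a simple filter. The genome records below are invented for illustration, but the free-tier limits are the ones quoted above:

```python
# Aura free-tier limits quoted in the article.
FREE_TIER_NODES = 50_000
FREE_TIER_RELS = 175_000

# Illustrative genome records; the real download has 205,462 entries.
genomes = [
    {"name": "Prevotella ruminicola 23", "phylum": "Bacteroidetes"},
    {"name": "Escherichia coli K-12", "phylum": "Proteobacteria"},
    {"name": "Zobellia galactanivorans DsijT", "phylum": "Bacteroidetes"},
]

# Keep only the Bacteroidetes, as in the article.
bacteroidetes = [g for g in genomes if g["phylum"] == "Bacteroidetes"]

def fits_free_tier(n_nodes, n_rels):
    """Check a dataset against Aura's free-tier limits."""
    return n_nodes <= FREE_TIER_NODES and n_rels <= FREE_TIER_RELS
```

The 650 Bacteroidetes genomes fit the node budget comfortably; the relationship count is what usually needs watching.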

Aside from CAZy, we need other data sources to augment the knowledge graph. For example, I downloaded the ontology of polysaccharides from NCBI MeSH for the Substrate -[:IS_A]-> Substrate relationships. The data contains the grouping of different polysaccharides. It is worth noting that there are different grouping methods, and each polysaccharide can belong to multiple groups. For example, alginic acid is a type of alginate, but it is also a member of hexuronic acids, glucuronic acid, and so on. Here I only consider the grouping under "polysaccharides" in MeSH.

I adjusted my NLP pipeline for this project. First, I provided new training prompts for GPT-3. As for the engine, I swapped the text-davinci-002 engine for the cheaper text-curie-001 because, to my surprise, the latter generated less noise and hence better results. I used it twice in this project: first to extract the three CBM-related relationships (BINDS, COEXISTS, and RELATES), and then to extract the DEGRADES relations from the example research articles. The results were good but not perfect, so a small amount of manual curation was necessary in the end.
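The extraction step boils down to a few-shot completion prompt. Below is a hedged sketch: the example pairs and the output format are my inventions (the real training prompts live in the project repository), and the 2022-era OpenAI Completions call is shown only as a comment because it needs an API key.

```python
# Invented few-shot examples; the real prompts are in the repository.
FEW_SHOT = """Text: CBM6 binds xylan and cellulose.
Relations: (CBM6, BINDS, xylan), (CBM6, BINDS, cellulose)

Text: GH5 enzymes degrade cellulose.
Relations: (GH5, DEGRADES, cellulose)
"""

def build_prompt(passage):
    """Prepend the few-shot examples to the passage to be analyzed."""
    return f"{FEW_SHOT}\nText: {passage}\nRelations:"

# Hypothetical call (requires the openai package and an API key):
# import openai
# completion = openai.Completion.create(
#     engine="text-curie-001",
#     prompt=build_prompt(text),
#     max_tokens=256,
#     temperature=0,
# )

prompt = build_prompt("GH16 enzymes degrade laminarin.")
```

With temperature 0, the completion is deterministic, which makes the manual curation step easier to diff.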

Inspired by Tomaz Bratanic's article, I also added an entity-linking function to my NLP pipeline. Its task is to disambiguate nouns: for example, both vitamin C and ascorbic acid will be converted to ascorbic acid. Under the hood, it uses NCBI MeSH to do the conversion over the internet and caches the results. The function consults the cache before it goes to NCBI. This brings two benefits: it saves bandwidth, and the user can examine the entity linkages and make corrections. For example, according to MeSH, the most relevant hit for xyloglucan is xyloglucan endotransglycosylase instead of xyloglucan itself. I can simply correct this error in the cache.tsv file, and the pipeline will return the right entity from then on.
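The cache-first behavior can be sketched in a few lines. Here the remote MeSH lookup is replaced by a stub so the logic runs offline; the real pipeline queries NCBI and persists the cache to cache.tsv.

```python
def make_linker(resolve, cache=None):
    """Return an entity linker that consults `cache` before calling
    `resolve` (the remote MeSH lookup), storing new results as it goes."""
    cache = {} if cache is None else cache

    def link(term):
        key = term.lower()
        if key not in cache:
            cache[key] = resolve(key)
        return cache[key]

    return link, cache

# Stub standing in for the NCBI MeSH query; records each remote call.
calls = []
def fake_mesh_lookup(term):
    calls.append(term)
    return {"vitamin c": "ascorbic acid"}.get(term, term)

link, cache = make_linker(fake_mesh_lookup)
first = link("Vitamin C")    # goes to the (stub) remote lookup
second = link("vitamin C")   # served from the cache, no remote call

# Manual curation: overwrite a wrong linkage directly in the cache,
# just as the article fixes the xyloglucan entry in cache.tsv.
cache["xyloglucan"] = "xyloglucan"
```

Because the cache is just a key-value table, curating it is a plain text edit, no code change required.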

Once all the files are ready, we can import them into Aura. First, create an empty Aura instance (Figure 3). Secure your password in a password manager for later use.

Figure 3. Create an empty Aura instance. Image by author.

Afterwards, select the Data Importer and drag all the CSV and TSV files into the Files panel. Then create the four node types (Taxon, Genome, Cazy, and Substrate) and the nine relation types (Figure 4). You can find the data model with data at the link above.

Figure 4. The schema or data model in the CAZy knowledge graph. Image by author.

When everything is set, click the Run Import button, and Aura should import all the data within a minute.

After the import, we can open Bloom by clicking the Explore button. In Bloom, we can explore the CAZy knowledge graph freely. For example, I can visualize all the BINDS relations with the following query.
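The original query is not reproduced in this copy of the article, so here is my reconstruction of what a BINDS visualization query would look like, embedded in a Python string. The property and label names follow the data model above; the driver call is shown only as a comment because it needs a running Aura instance.

```python
# Reconstructed Cypher for the Bloom visualization: match every
# CBM-substrate binding pair and return the whole pattern.
BINDS_QUERY = """
MATCH (c:Cazy)-[b:BINDS]->(s:Substrate)
RETURN c, b, s
"""

# Hypothetical execution with the official Neo4j Python driver
# (requires an Aura URI and credentials):
# from neo4j import GraphDatabase
# driver = GraphDatabase.driver(uri, auth=("neo4j", password))
# with driver.session() as session:
#     records = session.run(BINDS_QUERY).data()
```

In Bloom, the same pattern can be drawn directly in the search bar instead of typed as Cypher.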

Figure 5. All the BINDS relations in the CAZy knowledge graph. Image by author.

Among the six CAZy categories, only the CBM modules have BINDS relations. This query makes it immediately clear that the structural and storage polysaccharides of plants form two separate clusters. The larger one consists of structural polysaccharides such as cellulose, xylans, and glucomannan. They are found in close proximity inside the plant cell wall, and cellulose, xylans, and mannan are all chained together via tough beta linkages. So it makes sense that we find so many CBMs that bind more than one of these polysaccharides. Interestingly, CBM1, 2, 3, 37, and 54 even connect chitin, found in the exoskeletons of arthropods and in fungal cell walls, to this cluster. Outside it, starch forms its own cluster with nine CBMs. Starch is the storage compound of plants and, in contrast to the structural polysaccharides above, is chained via alpha linkages.

We can also get the list of Bacteroidetes that can potentially degrade cellulose because they possess the cellulose-degrading CAZy families, which were in turn extracted from the four research papers. Since there are 340 of them, I only show the first 10.

The second column shows whether the organism is known to be a cellulose degrader: "0" stands for negative, while "2" stands for unknown. So the last two in the list, Cellulophaga algicola DSM 14237 and Chitinophaga pinensis DSM 2588, cannot degrade cellulose according to their strain descriptions. Based on this result, is CAZy a good input data source for predicting polysaccharide metabolism? In my opinion, because CAZy is a sequence-similarity-based system, it is not a direct proxy for metabolic function. The presence of a single CAZy family is therefore only weak evidence of cellulose degradation. But if a genome possesses multiple families that can do the same thing, we can be more confident in the prediction. And since sequence similarity is easy to compute, it is a good first step toward a candidate list of cellulose degraders. Afterwards, you can run lab tests or search the literature to confirm the findings.
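The "multiple families as stronger evidence" idea can be sketched as a simple ranking. The family set and genome contents below are illustrative (GH5, GH6, GH9, GH44, GH45, and GH48 are commonly cited cellulase-containing families, but the list is not from the article):

```python
# Illustrative set of cellulose-degrading CAZy families.
CELLULOSE_FAMILIES = {"GH5", "GH6", "GH9", "GH44", "GH45", "GH48"}

# Invented genome annotations for the sketch.
genome_families = {
    "Prevotella ruminicola 23": {"GH5", "GH9", "GH26"},
    "Cellulophaga algicola DSM 14237": {"GH5"},
    "Formosa agariphila KMM 3901": {"GH16", "GH50"},
}

def cellulose_evidence(families):
    """Count how many known cellulose-degrading families a genome carries."""
    return len(families & CELLULOSE_FAMILIES)

# Rank candidates: more independent families, more confidence.
ranked = sorted(
    genome_families,
    key=lambda g: cellulose_evidence(genome_families[g]),
    reverse=True,
)
```

A genome carrying two independent cellulase families outranks one carrying a single family, which matches the argument above that single-family hits are weak evidence.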

Finally, we can count how many Bacteroidetes encode GH16. Many CAZymes in this family degrade marine polysaccharides such as laminarin and agar.
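The counting query is not shown in this copy of the article; a reconstruction might look like the following, where the `name` property on the Cazy node is an assumption about the data model:

```python
# Reconstructed Cypher: count the distinct genomes that carry GH16.
GH16_COUNT_QUERY = """
MATCH (g:Genome)-[:HAS_CAZY]->(c:Cazy {name: 'GH16'})
RETURN count(DISTINCT g) AS n_genomes
"""
```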

The answer is 71. This is a surprisingly small number, given that a large proportion of the GH16 CAZymes during algal blooms in the North Sea could be traced back to Bacteroidetes in this previous study.

Navigating the knowledge graph with Cypher is fun, but that is a privilege reserved for the few who can program. However, we can set up a GPT-3-based chatbot (Figure 6) so that everybody can query the knowledge graph in English. Essentially, the chatbot translates natural language into Cypher and fetches the answers from the database.

Figure 6. The architecture of a chatbot for the CAZy knowledge graph. Image by author.

It is quite easy to set up the chatbot. We can borrow the code from Doctor.ai and just modify the GPT-3 prompt. It is amazing that, with the following prompt, GPT-3 can generate correct Cypher queries even without any knowledge of our data model.
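The prompt itself does not survive in this copy of the article, so the sketch below is entirely my invention: a couple of question-to-Cypher example pairs (the Taxon-to-Genome pattern is a guess at the data model) followed by the user's question, in the style of few-shot translation prompts.

```python
# Invented example pairs; the real prompt is in the doctorai_eli5 repository.
EXAMPLES = """# Which genomes belong to the genus Flavobacterium?
MATCH (t:Taxon {name: 'Flavobacterium'})<-[:IS_A*]-(g:Genome) RETURN g.name

# Which CAZy families degrade starch?
MATCH (c:Cazy)-[:DEGRADES]->(s:Substrate {name: 'starch'}) RETURN c.name
"""

def question_to_prompt(question):
    """Append the user's question for GPT-3 to complete with Cypher."""
    return f"{EXAMPLES}\n# {question}\n"
```

The completion that GPT-3 returns for the final comment line is then run against AuraDB, and the result is read back to the user.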

The frontend is hosted on AWS Amplify. You can read the instructions in my previous articles here and here. You need to fill in six environment variables. The REACT_APP_NEO4JURI variable can be a bit tricky: click the instance on your Instances page, and you will find the value next to Connection URI. It should look like this:

neo4j+s://[random_string].databases.neo4j.io

Once the frontend is up and running, you can converse with the chatbot and get some answers from the knowledge graph through it.

Figure 7. Borrowing Doctor.ai’s frontend. Image by author.

As you can see in Figure 7, I queried the member genomes of the genus Flavobacterium. I also asked for the CAZy families that can degrade starch, and the Doctor.ai chatbot gave me the answers that I had extracted from the research articles.

