MolKGNN: Extending Convolution To Molecules | by Yunchao “Lance” Liu (刘运超) | Aug, 2022


Understand MolKGNN, an interpretable GNN tailored for drug discovery

Figure 1. An analogy of (A) image convolution and (B) proposed molecular convolution. Image from the original paper.

This blog introduces our latest model, the Molecular Kernel Graph Neural Network (MolKGNN), from the paper
Interpretable Chirality-Aware Graph Neural Network for Quantitative Structure Activity Relationship Modeling in Drug Discovery

Elucidating the relationship between molecular structure and pharmacological activity has been a long-standing problem in drug discovery. In 1859, the German chemist Carl Stahlschmidt demonstrated that adding methyl iodide to strychnine and brucine apparently destroyed their physiological action [1]. His work prompted two Scottish scientists, Alexander Crum Brown (1838–1922) and Thomas R. Fraser (1841–1920), to conduct experiments on a series of chemical compounds, which confirmed the existence of a structure-activity relationship (SAR) [1]. At the time, there was considerable optimism that a general law relating molecular structure to pharmacological activity would be discovered. Many mathematical, statistical, and machine learning methods have since been developed to predict this relationship, a process known as quantitative structure-activity relationship (QSAR) modeling. Examples of QSAR attempts include multiple linear regression, partial least squares, discriminant analysis, decision trees, and genetic algorithms [2]. High hopes are held out for the therapeutic applications of QSAR studies, yet predicting the biological activity of small molecules remains a challenging task to this day.

Task:

QSAR modeling, i.e., predicting a binary label, 0 (inactive) or 1 (active), from the molecular structure. A molecule is represented as a graph, where nodes are atoms and edges are bonds.
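As an illustrative sketch (not the paper's actual data pipeline), a molecule can be encoded as a graph with a plain adjacency list; the atom symbols and bond list below describe ethanol with hydrogens omitted:

```python
# Ethanol (CH3-CH2-OH), hydrogens omitted for brevity.
atoms = ["C", "C", "O"]          # node list, indexed 0..2
bonds = [(0, 1), (1, 2)]         # undirected edges (single bonds)

# Build an adjacency list for 1-hop neighborhood lookups.
adj = {i: [] for i in range(len(atoms))}
for i, j in bonds:
    adj[i].append(j)
    adj[j].append(i)

print(adj[1])  # the middle carbon is bonded to atoms 0 and 2: [0, 2]
```

Each node (and edge) additionally carries an attribute vector, such as a one-hot encoding of the atom type, which is what the convolution described below operates on.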

Highlights of the paper:

  • This paper introduces a novel SE(3)-invariant and conformation-invariant model named MolKGNN, tailored for the QSAR task in drug discovery.
  • MolKGNN features a novel molecular convolution, a lightweight chirality calculation, and interpretability.
  • A realistic drug discovery experiment demonstrates the practical value of the proposed MolKGNN.
Figure 2. (A) In 2D image convolution, a higher convolution value indicates a higher visual similarity pattern. (B) The image kernels offer the benefit of interpretability. Image by the author.

MolKGNN draws its inspiration from 2D image convolution (Figure 2). In 2D images, the convolution operation can be regarded as calculating the similarity between an image patch and an image kernel. Larger output values indicate stronger visual patterns such as edges, stripes, and curves [3]. However, 2D image convolution cannot be readily extended to 3D molecular graphs because of their irregular structure. Therefore, a new molecular convolution is designed that convolves a molecular neighborhood (a node and its 1-hop neighbors) with a molecular kernel of the same size, just as an image patch is convolved with an image kernel. The molecular convolution has the following properties:

  • Like image convolution, the more similar a molecular neighborhood is to the molecular kernel, the higher the molecular convolution value should be.
  • Unlike image convolution, the molecular convolution should be rotation-invariant.
  • The molecular kernel should provide interpretability.

The next question is, how do we design the molecular convolution to have the above-mentioned properties?

The similarity between a molecular neighborhood S and a kernel S′ is quantified by a similarity score ϕ(S, S′). This score is a combination of three subscores, ϕ_cs, ϕ_ns, and ϕ_es, which capture the central similarity, the neighborhood node similarity, and the neighborhood edge similarity, respectively. Their calculation is illustrated below (notations are fully described at the end):

Figure 3. Illustration of the similarity score calculation. Image from the original paper.

A similarity score between the neighborhood subgraph and the kernel is calculated as a combination of three subscores, each quantifying the similarity of a different aspect of the two. The central similarity subscore ϕ_cs captures the similarity between the center node attributes (v and v′). The neighborhood node similarity subscore ϕ_ns captures the similarities between the attributes of neighboring nodes (u1 and u1′, u2 and u2′, u3 and u3′). The neighborhood edge similarity subscore ϕ_es captures the similarities between the attributes of neighboring edges.

Since node/edge attributes are vectors, a similarity function sim(⋅) is used to measure vector similarity; we use cosine similarity in our implementation. ϕ_cs is calculated between the attributes of the center nodes v and v′, as shown in Figure 3(a).
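As a minimal sketch of this building block (a plain re-implementation, not the paper's code), cosine similarity between two attribute vectors is:

```python
import math

def cosine_sim(x, y):
    """Cosine similarity between two attribute vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

print(cosine_sim([1.0, 0.0, 2.0], [1.0, 0.0, 2.0]))  # approximately 1.0
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))            # 0.0 for orthogonal vectors
```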

Note that to calculate ϕ_ns, there are multiple ways of matching the neighbors, and each matching gives a score. We enumerate all matchings and define the one that gives the highest score as the optimal neighbor matching χ* (e.g., u1 with u2′, u2 with u3′, and u3 with u1′ on the right of Figure 3(b)). This enumeration is feasible because atoms in drug-like molecules have at most four neighbors.
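The enumeration can be sketched as follows (an illustrative, self-contained re-implementation rather than the paper's code; the attribute vectors are made up):

```python
from itertools import permutations

def cosine_sim(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = lambda v: sum(a * a for a in v) ** 0.5
    return dot / (norm(x) * norm(y))

def best_matching(neighbors, kernel_neighbors):
    """Enumerate all matchings of neighborhood nodes to kernel nodes and
    keep the one with the highest total similarity (the optimal matching)."""
    best_score, best_perm = float("-inf"), None
    for perm in permutations(range(len(kernel_neighbors))):
        score = sum(cosine_sim(neighbors[i], kernel_neighbors[p])
                    for i, p in enumerate(perm))
        if score > best_score:
            best_score, best_perm = score, perm
    return best_perm, best_score

# Two neighbors whose attributes are swapped relative to the kernel:
# the optimal matching pairs them crosswise.
perm, score = best_matching([[1.0, 0.0], [0.0, 1.0]],
                            [[0.0, 1.0], [1.0, 0.0]])
print(perm, score)  # (1, 0) 2.0
```

Since there are at most four neighbors, at most 4! = 24 permutations are ever considered, so the loop is cheap despite being brute-force.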

For ϕ_es, because the neighboring nodes have a one-to-one correspondence to neighboring edges, we can find the optimal edge matching χ^{e,*} based on the optimal neighbor matching χ* (Figure 3(c)).

Next, we integrate the chirality calculation. The idea is to use the kernel as a reference anchor: the molecular neighborhood is compared with the kernel to determine whether the two have the same neighbor ordering. We leverage the tetrahedron volume calculation in vector form to capture this ordering [4]. See the illustration below.

Figure 4. Illustration of chirality calculation. Image from the original paper.

In neighborhood 1, three vectors a1, b1, and c1 are formed from arbitrarily chosen neighbors, without loss of generality. The tetrahedron volume can be calculated as (1/6)(a1 × b1) ⋅ c1. Note that this volume can be positive or negative, with the sign indicating its orientation. The same calculation is carried out in the kernel for the corresponding neighbor nodes under the optimal matching. If the sign of the tetrahedron volume of neighborhood 1 matches the sign in the kernel, the two have the same neighbor ordering. For neighborhood 2 above, the volume has the opposite sign, so neighborhood 2 has a different neighbor ordering. Also note that the constant 1/6 does not affect the sign and can be omitted in the implementation.
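The sign test above can be sketched in a few lines (an illustrative stand-alone implementation of the scalar triple product, with the 1/6 factor dropped as noted):

```python
def volume_sign(a, b, c):
    """Sign of the tetrahedron volume (a x b) . c.
    The 1/6 factor is omitted since only the sign matters for chirality."""
    cross = (a[1] * b[2] - a[2] * b[1],
             a[2] * b[0] - a[0] * b[2],
             a[0] * b[1] - a[1] * b[0])
    triple = sum(x * y for x, y in zip(cross, c))
    return (triple > 0) - (triple < 0)

# Swapping two neighbor vectors flips the sign: the two enantiomers.
a, b, c = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
print(volume_sign(a, b, c))  # 1
print(volume_sign(b, a, c))  # -1
```

Comparing this sign between the neighborhood and the kernel (under the optimal matching) tells the model whether the two share the same neighbor ordering.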

Finally, we leverage the message passing neural network (MPNN) framework [5] to obtain a larger receptive field. The idea is to replace the traditional aggregation of neighboring node attributes with the aggregation of similarities between a molecular neighborhood and a set of kernels. See the illustration below.

Figure 5. Overview of the MolKGNN model. The key idea is to replace the traditional aggregation of attributes of neighboring nodes, with the similarities between a molecular neighborhood and a set of kernels. Image from the original paper.

The final atom embeddings are learned by repeating the molecular convolution and message passing several times. The molecular embedding is then obtained with a pooling technique, and the downstream prediction is made by attaching a classifier such as a Multi-Layer Perceptron (MLP) on top of the molecular embedding.
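As a minimal, hypothetical sketch of this readout step (the pooling choice, layer sizes, and weights are illustrative, not the paper's configuration):

```python
import math

def sum_pool(atom_embeddings):
    """Pool per-atom embeddings into one molecular embedding."""
    dim = len(atom_embeddings[0])
    return [sum(e[d] for e in atom_embeddings) for d in range(dim)]

def mlp_classify(mol_embedding, weights, bias):
    """A single linear layer + sigmoid, standing in for the MLP classifier."""
    z = sum(w * x for w, x in zip(weights, mol_embedding)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # probability of the 'active' label

atom_embs = [[0.2, -0.1], [0.4, 0.3], [-0.1, 0.5]]  # 3 atoms, embedding dim 2
mol_emb = sum_pool(atom_embs)                       # roughly [0.5, 0.7]
print(mlp_classify(mol_emb, weights=[1.0, -1.0], bias=0.0))
```

Any permutation-invariant pooling (sum, mean, max) works here, which keeps the molecule-level prediction independent of atom ordering.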

Realistic datasets from drug discovery are used to benchmark MolKGNN [6, 7]. These datasets are carefully curated to remove the false positive signals frequently seen in drug discovery campaigns. Dataset statistics are shown below, and the datasets are available on FigShare.

Table 1. Dataset statistics. They feature large sizes, highly imbalanced label distributions, and diverse protein targets. Image from the original paper.

The table below shows the results. The logAUC_[0.001, 0.1] metric is used to emphasize compounds with high predicted scores. This corresponds to the real-world drug discovery scenario: only compounds predicted to be highly active will be purchased or synthesized, so performance on those compounds matters more than overall model performance. See the original paper for more results and experimental details.
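As a hedged, illustrative re-implementation (not the paper's evaluation code), logAUC over the FPR range [0.001, 0.1] can be computed by integrating the ROC curve against log10(FPR) and normalizing by the log-width of the range:

```python
import math

def log_auc(labels, scores, fpr_lo=0.001, fpr_hi=0.1):
    """Area under the ROC curve with FPR on a log10 axis,
    restricted to [fpr_lo, fpr_hi] and normalized to [0, 1]."""
    order = sorted(range(len(labels)), key=lambda i: -scores[i])
    pos = sum(labels)
    neg = len(labels) - pos
    fprs, tprs = [0.0], [0.0]
    tp = fp = 0
    for i in order:  # sweep the decision threshold from high to low
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        fprs.append(fp / neg)
        tprs.append(tp / pos)

    def tpr_at(f):  # step interpolation of TPR at a given FPR
        t = 0.0
        for fpr, tpr in zip(fprs, tprs):
            if fpr <= f:
                t = tpr
        return t

    # Trapezoid rule on a log-spaced FPR grid inside [fpr_lo, fpr_hi].
    n = 200
    grid = [fpr_lo * (fpr_hi / fpr_lo) ** (k / n) for k in range(n + 1)]
    area = sum((tpr_at(grid[k]) + tpr_at(grid[k + 1])) / 2
               * (math.log10(grid[k + 1]) - math.log10(grid[k]))
               for k in range(n))
    return area / (math.log10(fpr_hi) - math.log10(fpr_lo))
```

A perfect ranking (all actives scored above all inactives) yields a logAUC of 1.0, while a ranking that places every inactive first yields 0.0; the log axis heavily weights the low-FPR region that matters for compound selection.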

Figure 6. Result table. Image from the original paper.

Moreover, MolKGNN captures patterns that align with domain knowledge. Below is an example of a pattern translated from a learned kernel: the trifluoromethyl group, an important substructure in medicinal chemistry.

Figure 7. A learned kernel shows an important substructure pattern, known as the trifluoromethyl group. Image from the original paper.

In addition, an experiment on the expressiveness of MolKGNN confirms that it is able to distinguish chiral molecules. We use the CHIRAL1 dataset [8], which contains 102,389 enantiomer pairs of a single 1,3-dicyclohexylpropane skeletal scaffold with one chiral center. Each molecule is labeled by its R or S stereocenter, and we use accuracy to evaluate performance. For comparison, we use GCN [9] and a modified version of our model, MolKGNN-NoChi, which removes the chirality calculation module. In our experiments, GCN and MolKGNN-NoChi achieve 50% accuracy while MolKGNN achieves nearly 100%, which empirically demonstrates our proposed method's ability to distinguish chiral molecules.

Details of this study can be found in the original paper.

Components of ϕ(S, S′)

Figure 8. The ablation study results for components of ϕ(S, S′). Image from the original paper.

The removal of any of the components has a negative impact on logAUC[0.001,0.1]. In fact, the impact on logAUC[0.001,0.1] is larger than on AUC in terms of the percentage of performance change. Note that in some cases, such as the removal of ϕ_es, AUC improves slightly, yet logAUC[0.001,0.1] is significantly hindered.

Kernel Number

Figure 9. Performance for different kernel numbers. Image from the original paper.

When the number of kernels is too small (< 5), performance suffers greatly. However, beyond a certain point, adding more kernels has little impact on performance.

It may seem formidable to enumerate all possible matchings as described above. However, most nodes have only one neighbor (e.g., hydrogen, fluorine, chlorine, bromine, and iodine). Taking AID 1798 as an example, 49.03%, 6.12%, 31.08%, and 13.77% of nodes have one, two, three, and four neighbors, respectively. For nodes with four neighbors, only 12 of the 24 matchings need to be enumerated because of chirality [8]. Since the adjacency matrices of molecular graphs are sparse, most GNNs incur a time complexity of O(|E|). As analyzed above, the number of matchings is bounded (at most 12 for four neighbors), so finding the optimal matching takes O(1) time. The molecular convolution is linear in the number of kernels K and hence takes O(K) time. Overall, our method has a computation time of O(|E|K).

In this work, we introduce a new GNN model named MolKGNN to address the QSAR modeling problem. MolKGNN utilizes a newly designed molecular convolution, in which a molecular neighborhood is compared with a molecular kernel to output a similarity score, which serves as the new atom embedding for the next layer. Comprehensive benchmarking is conducted to evaluate MolKGNN on well-curated datasets consisting of experimental HTS data from diverse protein target classes. The datasets are highly imbalanced, which highlights the scarcity of positive signals in this real-world problem. For evaluation, we use not only the traditional AUC but also logAUC[0.001,0.1], which measures performance under a high-cutoff condition. This high-cutoff condition is typical of real-world applications and demonstrates the applicability of MolKGNN to drug discovery. Moreover, this paper provides a theoretical justification and an experimental demonstration that MolKGNN can distinguish chiral molecules while providing interpretability for its results.

References:

[1] Parascandola, John. “Structure-Activity Relationships — The Early Mirage.” Pharmacy in History 13.1 (1971): 3–10.

[2] Wermuth, Camille Georges, ed. The practice of medicinal chemistry. Academic Press, 2011.

[3] Lin, Zhi-Hao, Sheng-Yu Huang, and Yu-Chiang Frank Wang. “Learning of 3D graph convolution networks for point cloud analysis.” IEEE Transactions on Pattern Analysis and Machine Intelligence 44.8 (2021): 4212–4224.

[4] Sliwoski, Gregory, et al. “BCL::EMAS — enantioselective molecular asymmetry descriptor for 3D-QSAR.” Molecules 17.8 (2012): 9971–9989.

[5] Gilmer, Justin, et al. “Neural message passing for quantum chemistry.” International conference on machine learning. PMLR, 2017.

[6] Butkiewicz, Mariusz, et al. “Benchmarking ligand-based virtual High-Throughput Screening with the PubChem database.” Molecules 18.1 (2013): 735–756.

[7] Butkiewicz, Mariusz, et al. “High-throughput screening assay datasets from the PubChem database.” Chemical Informatics (Wilmington, Del.) 3.1 (2017).

[8] Pattanaik, Lagnajit, et al. “Message passing networks for molecules with tetrahedral chirality.” arXiv preprint arXiv:2012.00094 (2020).

[9] Kipf, Thomas N., and Max Welling. “Semi-supervised classification with graph convolutional networks.” arXiv preprint arXiv:1609.02907 (2016).

This work is co-authored by Yu Wang, Oanh Vu, Rocco Moretti, Bobby Bodenheimer, Jens Meiler, and Tyler Derr.

Table 2. Notations. Made by the author.


