Graph Machine Learning @ ICML 2022 | by Michael Galkin | Jul, 2022

By Jessie Hobb On Jul 25, 2022

What’s New in GraphML?

Recent advancements and hot trends, July 2022 edition

International Conference on Machine Learning (ICML) is one of the premier venues where researchers publish their best work. ICML 2022 was packed with hundreds of papers and numerous workshops dedicated to graphs. We share the overview of the hottest research areas 🔥 in Graph ML.

Denoising diffusion probabilistic models (DDPMs) are taking over the field of Deep Learning in 2022 in pretty much all domains with stunning generation quality and better theoretical properties than GANs and VAEs, e.g., in image generation (GLIDE, DALL-E 2, Imagen), video generation, text generation (Diffusion-LM), and even diffusion for reinforcement learning. Conceptually, diffusion models gradually add noise to an input object (until it is a Gaussian noise) and learn to predict the added level of noise such that we can subtract it from the object (denoise).

Diffusion might be the biggest trend in GraphML in 2022 — particularly when applied to drug discovery, molecules and conformers generation, and quantum chemistry in general. Often, they are paired with the latest advancements in equivariant GNNs. ICML features several cool implementations of denoising diffusion for graph generation.

➡️ In “Equivariant Diffusion for Molecule Generation in 3D” by Hoogeboom, Satorras, Vignac, and Welling, the authors define an equivariant diffusion model (EDM) for molecule generation that has to maintain E(3) equivariance over atom coordinates x (as to rotation, translation, reflection) and invariance to group transformations over node features h. Importantly, molecules have different feature modalities: atom charge is an ordinal integer, atom types are one-hot categorical features, and atom coordinates are continuous features, so, for instance, you can’t just add Gaussian noise to one-hot features and expect the model to work. Instead, the authors design feature-specific noising processes and loss functions, and scale input features for training stability.

EDM employs a state-of-the-art E(n) GNN as a neural network that predicts noise based on input features and time step. At inference time, we first sample the desired number of atoms M, then we can condition EDM on a desired property c, and ask EDM to generate molecules (defined by features x and h) as x, h ~ p(x,h | c, M).

Experimentally, EDM outperforms normalizing flow- and VAE-based approaches by a large margin in terms of achieved negative log-likelihood, molecule stability, and uniqueness. Ablations demonstrate that an equivariant GNN encoder is crucial as replacing it with a standard MPNN leads to significant performance drops. Code is already available on GitHub, try it!

Forward and backward diffusion. Source: Hoogeboom, Satorras, Vignac, and Welling.

Diffusion-based generation visualization. Source: Twitter

➡️ For 2D graphs, Jo, Lee, and Hwang propose Graph Diffusion via the System of Stochastic Differential Equations (GDSS). While the previous EDM is an instance of denoising diffusion probabilistic model (DDPM), GDSS belongs to a sister branch of DDPMs, namely, score-based models. In fact, it was recently shown (ICLR’21) that DDPMs and score-based models can be unified into the same framework if we describe the forward diffusion process with stochastic differential equations (SDEs).

SDEs allow to model diffusion in continuous time as Wiener process (for simplicity, let’s say it is a fancy term for the process of adding noise) while DDPMs usually discretize it in 1000 steps (with learnable time embedding) although SDEs would require using specific solvers. Compared to previous score-based graph generators, GDSS takes as input (and predicts) both adjacency A and node features X. The forward and backward diffusion expressed as SDEs require computing scores — here gradients of joint log-densities (X, A). For obtaining those densities, we need a score-based model, and here the authors use a GNN with attention pooling (graph multi-head attention).

At training time, we solve a forward SDE and train a score model, while at inference we use the trained score model and solve the reverse-time SDE. Usually, you’d employ something like Langevin dynamics here, e.g., Langevin MCMC, but higher-order Runge-Kutta solvers should, in principle, work here, too. Experimentally, GDSS outperforms autoregressive generative models and one-shot VAEs by a large margin in 2D graph generation tasks, although sampling speed might still be a bit of a bottleneck due to integrating reverse-time SDEs. GDSS code is already available!

GDSS intuition. Source: Jo, Lee, and Hwang

👀 Looking at arxiv those days, we’d expect many more diffusion models to be released this year — DDPMs in graphs deserve their own big blog post, stay tuned!

➡️ Finally, an example of a non-diffusion generation is the work by Martinkus et al who design SPECTRE, a GAN for one-shot graph generation. Apart from othen GANs who would often generate an adjacency matrix right away, the idea of SPECTRE is to condition graph generation on top-k (lowest) eigenvalues and eigenvectors of a Laplacian that already give some notion of clusters and connectivity. 1️⃣ SPECTRE generates k eigenvalues. 2️⃣ the authors use a clever trick of sampling eigenvectors from the Stiefel manifold induced by top-k eigenvectors. The Stiefel manifold presents a bank of orthonormal matrices from which we can sample one n x k matrix. 3️⃣Finally, obtaining a Laplacian, the authors use a Provably Powerful Graph Net to generate the final adjacency.

Experimentally, SPECTRE is orders of magnitude better than other GANs and up to 30x faster than autoregressive graph generators 🚀.

SPECTRE a 3-step process to generate eigenvalues -> eigenvectors -> adjacency. Source: Martinkus et al

We have two papers on improving Graph Transformers at this year’s ICML.

➡️ First, Chen, O’Bray, and Borgwardt propose a Structure-Aware Transformer (SAT). They notice that self-attention can be rewritten as kernel smoothing where the query-key product is an exponential kernel. It then boils down to finding a more generalized kernel — the authors propose to use functions of a node and graph to add structure awareness, namely, k-subtree and k-subgraph features. K-subtrees are essentially k-hop neighborhoods and can be mined relatively fast, but are eventually limited to the expressiveness of 1-WL. On the other hand, k-subgraphs are more expensive to compute (and hardly scale) but provide a provably better distinguishing power.

Whatever featurization you select, those subtrees or subgraphs (extracted for each node) are then encoded through any GNN encoder (eg, PNA), pooled (sum/mean/virtual node), and used as queries and keys in the self-attention computation (see the illustration 👇).

Experimentally, k of 3 or 4 is enough, and k-subgraph features work expectedly better on graphs where we can afford their computation. Interestingly, positional features like Laplacian eigenvectors and Random Walk features are only helpful for the k-subtree SAT being rather useless for k-subgraph SAT.

➡️ Second, Choromanski, Lin, Chen, et al (the team overlaps a lot with the authors of the famous Performer with linear attention) study principled mechanisms to enable sub-quadratic attention. In particular, they consider relative positional encodings (RPEs) and their variations for different data modalities like images, sounds, video, and graphs. Considering graphs, we know from Graphormer that infusing shortest path distances into attention works well, but requires materialization of the full attention matrix (hence, not scalable). Can we approximate the softmax attention without full materialization but still incorporate useful graph inductive biases? 🤔

Yes! And the authors propose 2 such mechanisms. (1) Turns out, we can use Graph Diffusion Kernels (GDK) — a.k.a. heat kernels — that model a diffusion process of heat propagation and serve as a soft version of shortest paths. Diffusion, however, requires calling solvers for computing matrix exponentials, so the authors design another way. (2) Random Walks Graph-Nodes Kernel (RWGNK) which value (for two nodes) encodes the dot product of frequency vectors (of those two nodes) obtained from random walks starting at those two nodes.

Random walks are great, we love random walks 😍 Check out the illustration below for a visual description of diffusion and RW kernel results. The final transformer with the RWGNK kernel is called Graph Kernel Attention Transformer (GKAT) and probed against several tasks from synthetic identification of topological structures in ER graphs to small compbio and social network datasets. GKAT shows much better results on synthetic tasks and performs pretty much on par with GNNs on other graphs. Would be great to see a real scalability study pushing the Transformer to the limits of input set size!

The GNN community continues to study the ways of breaking through the ceiling of 1-WL expressiveness and retaining at least polynomial time complexity.

➡️ Papp and Wattenhofer start with an accurate description of current theoretical studies:

Whenever a new GNN variant is introduced, the corresponding theoretical analysis usually shows it to be more powerful than 1-WL, and sometimes also compares it to the classical k-WL hierarchy… Can we find a more meaningful way to measure the expressiveness of GNN extensions?

The authors categorize the literature of expressive GNNs into 4 families: 1️⃣ k-WL and approximations; 2️⃣ substructure counting (S); 3️⃣ subgraph- and neighborhood-aware GNNs (N) (covered extensively in the recent post by Michael Bronstein); 4️⃣ GNNs with marking — those are node/edge perturbation approaches and node/edge labeling approaches (M). Then, the authors come up with the theoretical framework of how all those k-WL, S, N, and M families are related and which one is more powerful to what extent. The hierarchy is more fine-grained than k-WL to help designing GNNs expressive enough to cover particular downstream tasks and save compute.

The proposed hierarchy of different expressive GNN families. N=subgraph GNNs, S=substructure counting, M=GNNs with markings. Source: Papp and Wattenhofer

➡️ Perhaps the tastiest ICML’22 work is cooked by chefs Morris et al with 🥓SpeqNets 🥓(Speck is bacon in German). Known higher-order k-WL GNNs either operate on k-order tensors or consider all k-node subgraphs, implying an exponential dependence on k in memory requirements and do not adapt to the sparsity of the graph. SpeqNets introduce a new class of heuristics for the graph isomorphism problem, the (k,s)-WL, which offers a more fine-grained control between expressivity and scalability.

Essentially, the algorithm is a variant of the local k-WL but only considers specific tuples to avoid the exponential memory complexity of the k-WL. Concretely, the algorithm only considers k-tuples or subgraphs on k nodes with at most s connected components, effectively exploiting the potential sparsity of the underlying graph — varying k and s leads to a tradeoff between scalability and expressivity on the theoretical side.

The authors derive a new hierarchy of permutation-equivariant graph neural networks, denoted SpeqNets, based on the above combinatorial insights, reaching universality in the limit. These architectures vastly reduce computation times compared to standard higher-order graph networks in the supervised node- and graph-level classification and regression regime, significantly improving standard graph neural network and graph kernel architectures in predictive performance.

The hierarchy of 🥓 SpeqNets 🥓. Source: Morris et al

➡️ Next, Huang et al take an unorthodox look at permutation-invariant GNNs and suggest that carefully designed permutation-sensitive GNNs are actually more expressive. The theory of Janossy pooling says a model becomes invariant to a group of transformations if we show all possible examples of such a transformation, and for permutation of n elements we have an intractable n! permutations. Instead, the authors show that considering only pairwise 2-ary permutations of a node’s neighborhood is enough and is provably more powerful than 2-WL and not less powerful than 3-WL.

Practically, the proposed PG-GNN extends the idea of GraphSAGE and encodes every random permutation of node’s neighborhood through a 2-layer LSTM instead of traditional sum/mean/min/max. Additionally, the authors design a linear permutation sampling approach based on Hamiltonian cycles.

PG-GNN permutation-sensitive aggregation idea. Source: Huang et al

Some other interesting works you might want to check:

Cai and Wang study convergence properties of Invariant Graph Networks, different from vanilla MPNNs in that they operate on node and edge features as equivariant operations over monolithic tensors. Based on the graphon theory, the authors find a class of IGNs that provably converge. More technical details are in the awesome Twitter thread!
Gao and Ribeiro study ⏳ temporal GNNs ⏳devising two families: (1) time-and-graph — where we first embed graph snapshots via some GNN and then apply some RNN; (2) time-then-graph where we first encode all node and edge features (over a unified graph of all snapshots) through an RNN, and only then apply a single GNN pass, e.g., TGN and TGAT can be considered instances of this family. Theoretically, the authors find that time-then-graph are more expressive then time-and-graph when using standard 1-WL GNN encoders like GCN or GIN, and propose a simple model with a GRU time encoder and a GCN graph encoder. The model shows very competitive performance on temporal node classification and regression tasks being 3–10x faster and GPU memory-efficient. Interestingly, the authors find that neither time-and-graph nor time-then-graph is expressive enough for temporal link prediction 🤔.
Finally, “Weisfeiler-Lehman Meets Gromov-Wasserstein” by Chen, Lim, Mémoli, Wan, Wang (a joint 5-first authors paper 👀) derives a polynomial-time WL distance from the WL kernel such that we can measure a dissimilarity of two graphs — the WL distance is 0 if and only if they cannot be distinguished by the WL test, and positive iff they can be distinguished. The authors further realize that the proposed WL distance has deep connections to the Gromov-Wasserstein distance!

How Weisfeiler-Leman meets Gromov-Wasserstein in practice. Should have been in the paper by Chen, Lim, Mémoli, Wan, Wang. Source: Tenor

➡️ Spectral GNNs tend to be overlooked in the mainstream of spatial GNNs , but now there is a reason for you to take a look at spectral GNNs 🧐. In “How Powerful are Spectral Graph Neural Networks” by Wang and Zhang, the authors show that a linear spectral GNN is a universal approximator for any function on a graph under some mild assumptions. What’s even exciting is that the assumptions turn out to be empirically true for real-world graphs, suggesting that a linear spectral GNN is powerful enough for the node classification task.

But how do we explain the difference in the empirical results of spectral GNNs? The authors prove that different parameterizations (specifically, polynomial filters) of the spectral GNNs influence the convergence speed. We know that the condition number of Hessian matrix (how round is the iso-loss line) is highly related to the convergence speed. Based on this intuition, the authors come up with some orthogonal polynomials to benefit optimization. The polynomials, named as Jacobi bases, are a generalization of the Chebyshev bases used in ChebyNet. Jacobi bases are defined by two hyperparameters, a and b. By tuning these hyperparameters, one may find a group of bases in favor of the input graph signal.

Experimentally, JacobiConv performs well on both homophilic and heterophilic datasets, even as a linear model. Probably it’s the time to desert those gaudy GNNs, at least for the node classification task 😏.

➡️ There are two more papers on spectral GNNs. One is Graph Gaussian Convolutional Networks (G2CN) based on spectral concentration analysis that shows good results on heterophilic datasets. The other one from Yang et al analyzes the correlation issue in graph convolutions based on spectral smoothness that shows an exceptionally good result of 0.0698 MAE on ZINC.

As most GNN models are black boxes, it is important to explain the predictions of GNNs for applications in crucial areas. This year we have two awesome papers in this direction, an efficient and powerful post-hoc model from Xiong et al, and an inherently interpretable model from Miao et al.

➡️ Xiong et al extend their previous GNN explanation method, GNN-LRP, to be way more scalable. Unlike other methods (GNNExplainer, PGExplainer, PGM-Explainer), GNN-LRP is a higher-order subgraph attribution method that considers the joint contribution of nodes in a subgraph. Such a property is necessary for tasks where a subgraph is not simply a set of nodes. For example, in molecules, a subgraph of six carbons (hydrogens are ignored) can be either a benzene (a ring) or a hexane (a chain). As shown in the below figure, a higher-order method can figure out such subgraphs (right) while a lower-order method (left) may not.

However, the drawback of GNN-LRP is that it needs to compute the gradient w.r.t. each random walk in a subgraph, which takes O(|S|L) for a subgraph S and L-hop random walks. Here dynamic programming comes to the rescue 😎. Notice that the gradient w.r.t. a random walk is multiplicative (chain rule), and different random walks are aggregated by summation. This can be efficiently computed by the sum-product algorithm. The idea is to use the distributive property of summation over multiplication (more generally, semiring), and aggregate partial random walks at each step. This constitutes the model, subgraph GNN-LRP (sGNN-LRP).

sGNN-LRP also improves over GNN-LRP with a generalized subgraph attribution, which considers both random walks in the subgraph S and its complement G\S. Though complicated as it looks, the generalized subgraph attribution can be computed by two sum-product algorithm passes. Experimentally, sGNN-LRP not only finds attributions better than all existing explanation methods, but also runs as fast as a regular message passing GNN. Might be a useful tool for interpretation and visualization! 🔨

💡 By the way, it is not new to see that models based on random walks are more expressive than simple node or edge models. The NeurIPS’21 paper NBFNet solves knowledge graph reasoning with random walks and dynamic programming, and achieves amazing results in both transductive and inductive settings.

➡️ Miao et al take another perspective and study inherently interpretable GNN models. They show that post-hoc explanation methods, such as GNNExplainer, are subpar for interpretation since they merely use a fixed pretrained GNN model. By contrast, an inherently interpretable GNN that jointly optimizes the predictor and the interpretation modules, is a better solution. Following this idea, the authors derive graph stochastic attention (GSAT) from the graph information bottleneck (GIB) principle. GSAT encodes the input graph, and randomly samples a subgraph (interpretation) from the posterior distribution. It makes the prediction based on the sampled subgraph. As an advantage, GSAT doesn’t need to constrain the size of a sampled subgraph.

Experimentally, GSAT is much better than post-hoc methods in terms of both interpretation and prediction performance. It can also be coupled with a pretrained GNN model. GSAT should be a good candidate if you are building interpretable GNNs for your applications.

This year brought a few works on improving self-supervised capabilities of GNNs that go beyond random edge index perturbations like node/edge dropout.

➡️ Han et al bring the idea of mixups used in image augmentation since 2017 to graphs with G-Mixup (Outstanding Paper Award at ICML 2022 🏅). The idea of mixups is to take two images, mix their features together and mix their labels together (according to a pre-defined weighting factor), and ask the model to predict this label. Such a mixup improves robustness and generalization qualities of classifiers.

But how do we mix two graphs that in general might have different numbers of nodes and edges?

The authors find the elegant answer — let’s mix not graphs, but their graphons — that are, in simple words, graph generators. Graphs coming from the same generator have the same underlying graphon. So the algorithm becomes rather straightforward (see the illustration below) — for a pair of graphs, we 1️⃣ estimate their graphons; 2️⃣ mix up two graphons into a new one through a weighed sum; 3️⃣ sample a graph from the new graphon and its new label; and 4️⃣ send this to a classifier. In the illustrative example, we have two graphs with 2 and 8 connected components, respectively, and after mixing their graphons we get a new graph of 2 major communities with 4 minor in each. Estimating graphons can be done with a step function and several methods different in computational complexity (the authors mostly resort to “largest gap”).

Experimentally, G-Mixup stabilizes model training, performs better or on-par with traditional node/edge perturbation methods, but outperforms them by a large margin in the robustness scenarios with label noise or many added/removed edges. Cool adaptation of a well-known augmentation method to graphs 👏! If you are interested, ICML’22 offers a few more general works on mixups: a study how mixups improve calibration and how to use them in generative models.

➡️ Liu et al take another look at augmentation, particularly in setups where nodes have a small neighborhood. The idea of Local Augmentation GNNs (LA-GNN) is in training a generative model to yield an additional feature vector for each node. The generative model is a conditional VAE trained (on the whole graph) to predict features of connected neighbors conditioned on a center node. That is, once CVAE is trained, we just pass a feature vector of each node and get another feature vector that is supposed to capture more information than plain neighbors.

We then concat two feature vectors per node and send it to any downstream GNN and task. Note that CVAE is pre-trained before and doesn’t need to be trained with a GNN. Interestingly, CVAE can generate features for unseen graphs, i.e., local augmentation can be used in inductive tasks as well! The initial hypothesis is confirmed experimentally — the augmentation approach works particularly well for nodes of small degrees.

The Local Augmentation idea. Source: Liu et al

➡️ Next, Yu, Wang, Wang, et al, tackle the GNN scalability task where using standard neighbor samplers a-la GraphSAGE might lead to exponential neighborhood size expansion and stale historical embeddings. The authors propose GraphFM, a feature momentum approach, where historical node embeddings get updates from their 1-hop neighbors through a momentum step. Generally, momentum updates are often seen in SSL approaches like BYOL and BGRL for updating model parameters of a target network. Here, GraphFM employs momentum to alleviate the variance of historical reprepsentations in different mini-batch sizes and provide an unbiased estimation of feature updates for differently-sized neighborhoods.

Generally, GraphFM has two options GraphFM-InBatch and GraphFM-OutOfBatch. (1) GraphFM-InBatch works for the GraphSAGE-style neighbor sampling by dramatically reducing the number of necessary neighbors — whereas GraphSAGE required 10–20 depending on the level, GraphFM needs only 1 random neighbor per node per layer. Only one 👌! And (2) GraphFM Out-of-batch builds on top of GNNAutoScale where we first apply graph partitioning to cut graphs into k minibatches.

Experimentally, feature momentum looks especially useful for SAGE-style sampling (the in-batch version) — seems like a good default choice for all neighbor sampling-based approaches!

Compared to GNNAutoScale (GAS), historical node states are also updated from new embeddings and feature momentum (moving average). Source: Yu, Wang, Wang, et al

➡️ Finally, Zhao et al propose a clever augmentation trick for link prediction based on counterfactual links. In essence, the authors ask:

“would the link still exist if the graph structure became different from observation?”

It means that we would like to find links that are structurally similar to the given link according to some 💊 treatment (here, those are classical metrics like SBM clustering, k-core decomposition, Louvain, and more) but give the opposite result. With CFLP, the authors hypothesize that training a GNN to correctly predict both true and counterfactual links helps the model to get rid of spurious correlations and capture only meaningful features for inferring a link between two nodes.

After obtaining a set of counterfactual links (pre-processing step based on the chosen treatment function) CFLP is first trained on both factual and counterfactual links, then the link prediction decoder is fine-tuned with some balancing and regularization terms. In some sense, the approach resembles mining hard negatives to augment the set of true positive links 🤔Experimentally, CFLP paired with a GNN encoder largely outperforms results of that single GNN encoder on Cora/Citeseer/Pubmed, and is still in top-3 of OGB-DDI link prediction task!

Counterfactual links (right). Source: Zhao et al

🎆 A huge milestone for the algorithmic reasoning community — the appearance of the CLRS benchmark (named after a classical textbook Introduction to Algorithms by Cormen, Leiserson, Rivest, and Stein) by Veličković et al! Now, there is no need to invent toy evaluation tasks — CLRS contains 30 classical algorithms (sort, search, MST, shortest paths, graphs, dynamic programming, and many more) converting an ICPC data generator into an ML dataset 😎.

In CLRS, each dataset element is a trajectory, i.e., a collection of inputs, outputs, and intermediate steps. The underlying representation format is a set of nodes (not often a graph as edges might not be necessary), for example, sorting a list of 5 elements is framed as operations over a set of 5 nodes. Trajectories consist of probes — tuples of format (stage, location, type, values) that encode a current execution step of an algorithm with its states. Output decoder depends on the expected type — in the example illustration 👇 sorting is modeled with pointers.

Split-wise, training and validation trajectories have 16 nodes (e.g., sort lists of length 16), but the test set probes out-of-distribution (OOD) capabilities of models on tasks with 64 nodes. Interestingly, vanilla GNNs and MPNNs fit training data very well but underperform in the OOD setup where Pointer Graph Network shows better numbers. It is a one more data point to the collection of observations that GNNs can’t generalize to larger inference graphs — it’s still an open question how to fix this 🤔 . The code is already available and could be extended with more custom algorithmic tasks.

Representation of hints in CLRS. Source: Veličković et al

➡️ On a more theoretical side, Sanmartín et al generalize the notion of graph metrics through the Algebraic Path Problem (APP). APP is a more high-level framework (with some roots in the category theory) unifying many existing graph metrics like shortest path, commute cost distance, and minimax distance through the notion of semirings — algebraic structures over sets with specific operators and properties. For instance, shortest paths can be described as a semiring with “min” and “+” operators with neutral elements “+inf” and “0”.

Here, the authors create a single APP framework of log-norm distances that allows to interpolate between shortest paths, commute costs, and minimax using only two parameters. In essence, you could vary and mix the influence of edge weights and surrounding graph structure (other paths) on the final distance. Although there are no experiments, this is a solid theoretical contribution — if you are learning category theory as “eating your veggies” 🥦, this paper is a blast to read — and will surely find applications in GNNs. 👏

Log-norm distances. Source: Sanmartín et al

➡️ Finally, we’d add to this category a work “Learning to Infer the Structures of Network Games” by Rossi et al who combine graph theory with game theory. Game theory is used a lot in economics and other multidisciplinary studies, you’ve probably heard about the Nash Equilibrium that defines a solution for non-cooperative games. In this work, the authors consider 3 game types: linear quadratic, linear influence, and Barik-Honorio graphical games. Games are usually defined through their utility functions, but in this work, we assume we don’t know anything about game’s utility function.

Games are defined as N players (nodes in a graph) that take specific actions (for simplicity, let’s say we can describe it with a certain numerical feature, check the illustration below 🖼️). Actions can influence neighboring players — and the task is framed as inferring the graph of players given their actions. In essence, this is a graph generation task — given node features X, predict a (normalized) adjacency matrix A. Usually, a game is played K times, and those are independent games, so the encoder model should be invariant to permutations of games (and equivariant to permutation of nodes in each game). The authors propose the NuGgeT 🍗 encoder-decoder model where a transformer encoder processes K games by N player, yields latent representations, and decoder is an MLP over a sum of a Hadamard product of latent pairwise player features such that the decoder is permutation-invariant to the order of K games.

Experimentally, the model works well on both synthetic and real datasets. The paper is definitely a “broaden your horizon” 🔭 work that you might not expect to see at ICML, but later find a fascinating reading and learning a lot of new concepts 👏.

Knowledge graph reasoning has long been a playground for GraphML methods. In this year’s ICML, there are quite a few interesting papers on this topic. As a trend of this year, we see a significant drift from embedding methods (TransE, ComplEx, RotatE, HAKE) to GNNs and logic rules (in fact, GNNs are also related to logic rules). There are four papers based on GNNs or logic rules, and two papers extending the conventional embedding methods.

➡️ Let’s begin with the cycle basis GNN (CBGNN) proposed by Yan et al. The authors draw an interesting connection between logic rules and cycles. For any chain-like logic rule, the head and the body of the logic rule always form a cycle in the knowledge graph. For example, the right plot of the following figure shows the cycle for (X, part of Y) ∧ (X, lives in, Z) → (Y, located in Z). In other words, the inference of a logic rule can be viewed as predicting the plausibility of a cycle, which boils down to learning the representations of cycles.

Blue and Red triangles are cycles within the bigger Green cycle. Source: Yan et al

An interesting observation is that cycles form a linear space under modulo-2 addition and multiplication. In the above example, the summation of the red ❤️ and blue 💙 cycles, which cancels out their common edge, results in the green 💚 cycle. Therefore, we don’t need to learn the representation of all cycles, but instead, only a few cycle bases of the linear space. The authors generate the cycle bases by picking cycles that have large overlapping with the shortest path tree. To learn the representation of cycles, they create a cycle graph, where each node is a cycle in the original graph, and each edge indicates overlapping between two cycles. A GNN is applied to learn the node (which are cycles of the original graph) representations in the cycle graph.

To apply CBGNN to inductive relation prediction, the authors construct an inductive input representation for each cycle by encoding the relations in the cycle with an LSTM. Experimentally, CBGNN achieves SotA results on the inductive version of FB15k-237/WN18RR/NELL-995.

➡️ Next, Das and Godbole et al propose CBR-SUBG, a case-based reasoning (CBR) method for KBQA. The core idea is to retrieve similar query-answer pairs from the training set when solving a query. We know the idea of retrieval is very popular in OpenQA task (EMDR, RAG, KELM, Mention Memory LMs), but this is the first time to see such an idea adopted on graphs.

Given a natural language query, CBR first retrieves similar k-nearest neighbor (kNN) queries based on the query representation encoded by a pretrained language model. All the retrieved queries are from the training set, and therefore their answers are accessible. Then we generate a local subgraph for each query-answer pair, which is believed to be the reasoning pattern (though not necessarily exact) for the answer. The local subgraph of the current query (for which we can’t access the answer) is generated by following the relation paths in the subgraphs of its kNN queries. CBR-SUBG then applies a GNN to every subgraph, and predicts the answer by comparing the node representations with answers in the KNN queries.

Case-based reasoning intuition. Source: Das and Godbole et al

➡️ There are two neural-symbolic methods for reasoning this year. The first one is hierarchical rule induction (HRI) from Glanois et al. HRI extends a previous work, logic rule induction (LRI) on inductive logic programming. The idea of rule induction is to learn a bunch of rules and apply them to deduce facts like forward chaining.

In both LRI and HRI, each fact P(s,o) is represented by a predicate embedding 𝜃p and a valuation vp (i.e. the probability of the fact being true). Each rule P(X,Y) ← P1(X,Z) ∧ P2(Z,Y) is represented by the embeddings of its predicates. The goal is to iteratively apply rules to deduce new facts. During each iteration, the rules and facts are matched through soft unification, which measures whether two facts satisfy certain rules in the embedding space. Once a rule is selected, a new fact is generated and added to the set of facts. All the embeddings and the soft unification operation are trained end-to-end to maximize the likelihood of observed facts.

The HRI model improves over the LRI model in three aspects: 1) use a hierarchical prior that separates the rules used in each iteration step. 2) use gumbel-softmax to induce a sparse and interpretable solution for soft unification. 3) prove the set of logic rules that HRI can express.

Hierarchical Rule Induction. Source: Glanois et al

➡️ The second one is the GNN-QE paper from Zhu et al (disclaimer: a paper from the authors of this blog post). GNN-QE solves complex logical query on knowledge graphs with GNNs and fuzzy sets. It enjoys the advantages of both neural (e.g. strong performance) and symbolic (e.g. interpretability) methods. As there is a lot of interesting stuff in GNN-QE, we will have a separate blog post for it soon. Stay tuned! 🤗

➡️ Finally, Kamigaito and Hayashi study the theoretical and empirical effects of negative sampling in knowledge graph embeddings. Starting from RotatE, knowledge graph embedding methods use a normalized negative sampling loss, plus a margin binary cross entropy loss. This is different from the negative sampling used in the original word2vec. In this paper, the authors prove that the normalized negative sampling loss is necessary for distance-based models (TransE, RotatE) to reach the optimal solution. The margin also plays an important role in distance-based models. The optimal solution can only be reached if 𝛾 ≥ log|V|, which is consistent with the empirical results. Based on this conclusion, now we can determine the optimal margin without hyperparameter tuning! 😄

Generally, comp bio is represented at ICML pretty well. Here, we’ll have a look at new approaches for molecular linking, protein binding, conformer generation, and molecular property prediction.

Molecular linking is a crucial part in designing Proteolysis targeting chimera (PROTAC) drugs. For us, mere GNN researchers 🤓 without biological background, it means that given two molecules, we want to generate a valid linker molecule that would attach two fragment molecules in a single molecule while retaining all properties of the original fragment molecules (check the illustration below for a good example)

➡️ For generating molecular links, Huang et al created 3DLinker, an E(3)-equivariant generative model (VAE) that sequentially generates atoms (and connecting bonds) with absolute coordinates. Often, equivariant models generate relative coordinates or relative distance matrices, but here, the authors aim at generating absolute (x, y, z) coordinates. To allow a model to generate exact coordinates from equivariant (to coordinates) and invariant (to nodes features) transformations, the authors apply a clever idea of Vector Neurons which is essentially a ReLU-like nonlinearity for preserving feature equivariance with clever orthogonal projection tricks.

The E(3)-equivariant encoder enriched with Vector Neurons encodes features and coordinates while the decoder sequentially generates the link in 3 steps (illustrated belos as well): 1️⃣ predict an anchor node to which the link will be attached; 2️⃣ predict node type for a linker node; 3️⃣ predict edge and its absolute coordinates; 4️⃣ repeat until we hit the stop node in the second fragment. 3DLinker is (so far) the first equivariant model that generates the linker molecule with exact 3D coordinates and predicts the anchor points in fragment molecules — previous models required known anchors before generation — and shows the best experimental results.

➡️ Protein-ligand binding is the other crucial drug discovery task — predicting where a small molecule could potentially attach to a certain region of a bigger protein. First, Stärk, Ganea, et al create EquiBind (ICML Spotlight 💡) that takes as input a protein and a random RDKit conformer of a ligand graph, and outputs precise 3D location of the binding interaction. EquiBind has already garnered a very warm reception and publicity as in MIT News and reading groups so we encourage you to have a detailed look at technical details! EquiBind is orders of magnitude faster than commercial software while maintaining high prediction accuracy.

➡️ If the binding molecule is unknown and we want to generate such a molecule, Liu et al create GraphBP, an autoregressive molecule generation approach that takes as input a target protein site (denoted as initial context). Encoding the context with any 3D GNN (SchNet here), GraphBP generates atom type and spherical coordinates until there are no more contacting atoms available or the desired number of atoms is reached. Once the atoms are generated, the authors resort to OpenBabel to create bonds.

Generating a binding molecule with GraphBP. Source: Liu et al

➡️ In molecular property prediction, Yu and Gao propose a simple and surprisingly powerful idea to enrich molecular representations with a bag of motifs. That is, they first mine a vocabulary of motifs in the training dataset and rank them according to TF-IDF scores (hello from NLP 😉). Then, each molecule can be represented as a bag of motifs (multi-hot encoding) and the whole dataset of molecules is converted to one heterogeneous graph with relations “motif-molecule” if any molecule contains this motif, and “motif-motif” if any two motifs share an edge in any molecule. Edge features are those TF-IDF scores mined before.

The final embedding of a molecule is obtained through a concatenation of any vanilla GNN over the molecule and another heterogeneous GNN over a sampled subgraph from the motif graph. Such a Heterogeneous Motif GNN (HM-GNN) consistently outperforms Graph Substructure Networks (GSN), one of the first GNN architectures that proposed to count triangles in social networks and k-cycles in molecules, and even Cell Isomorphism Networks (CIN), a top-notch higher-order message passing model. HM-GNNs can serve as a simple powerful baseline for subsequent research in the area of higher-order GNNs 💪.

Building a motif vocabulary in HM-GNN. Source: Yu and Gao

➡️ Finally, a work by Stärk et al demonstrates the benefits of pre-training GNNs both on 2D molecular graphs and their 3D conformers with the 3D Infomax approach. The idea of 3D Infomax is in maximizing mutual information between 2D and 3D representations such that at inference time over 2D graphs, when no 3D structure is given, the model could still benefit from implicit knowledge of the 3D structure.

For that, 2D molecules are encoded with the Principal Neighborhood Aggregation (PNA) net, 3D conformers are encoded with the Spherical Message Passing (SMP) net, we take the cosine similarity of their representations and pass through the contrastive loss maximizing the similarity of a molecule with its true 3D conformers and treating other samples as negatives. Having pre-trained 2D and 3D nets, we can fine-tune the weights of the 2D net on a downstream task — QM9 property prediction in this case — and the results definitely show that pretraining works. By the way, if you are further interested in pre-training, you can check out GraphMVP published at ICLR 2022 as another 2D/3D pre-training approach.

In 3D Informax, we first pre-train 2D and 3D nets, and use a trained 2D net at inference time. Source: Stärk et al

Physical simulation along with molecular dynamics received a huge boost with GNNs. A standard setup of physical simulation is a system of particles where node features are several recent velocities and edge features are relative displacements, and the task is to predict where the particles move at the next time step.

⚛️ This year, Rubanova, Sanchez-Gonzalez et al further improve physical simulations by incorporating explicit scalar constraints in the C-GNS (Constraint-based Graph Network Simulator). Conceptually, the output of an MPNN encoder is further refined through a solver that minimizes some learned (or specified at inference time) constraint. The solver itself is a differentiable function (5-iteration gradient descent in this case) so we can backprop through the solver as well. C-GNS is inherently connected to deep implicit layers that are getting more and more visibility including the GNN applications.

Physical simulation works are often a source of fancy simulation visualizations — check out the website with video demos!

Constraint-based Graph Network Simulator. Source: Rubanova, Sanchez-Gonzalez et al

A few other cool applications you might want to have a look at:

Traffic Prediction: Lan, Ma, et al created DSTA-GNN (Dynamic Spatial-Temporal Aware Graph Neural Network) for traffic prediction 🚥evaluated on real-world datasets of busy California roads — predicting traffic with graphs received a boost last year after the massive work by Google and DeepMind’s on improving Google Maps ETA which we covered in 2021 results.
Neural Network Pruning: Yu et al design GNN-RL to iteratively prune weights of deep neural nets given a desired ratio of FLOPs reduction. For that, the authors treat a neural net’s computational graph as a hierarchical graph of blocks and send it to a hierarchical GNN (with intermediate learnable pooling to coarse-grain the NN architecture). Encoded representations are sent to the RL agent that decides which block to prune.
Ranking: He et al tackle an interesting task — given a matrix of pairwise interactions, e.g., between teams in a football league where Aij > 0 means team i got a better score than team j, find the final ranking of nodes (teams) who scored best. In other words, we want to predict who is the winner of a league after seeing pair-wise results of all games. The authors propose GNNRank that represents pairwise results as a directed graph and applies a directional GNN to get latent node states and compute the Fiedler vector of the graph Laplacian. Then, they frame the task as a constrained optimization problem with proximal gradient steps as we can’t easily backprop through the computation of the Fiedler vector.

That’s finally it for ICML 2022! 😅

Looking forward to seeing NeurIPS 2022 papers as well as submissions to the brand-new Learning on Graphs (LoG) conference!