Classical Strategies for Scaling up Graph Neural Networks | by Tiago Toledo Jr. | Jul, 2022

By Jessie Hobb On Jul 19, 2022

A summary of ways one can make their networks run faster and with less memory

In my previous post, I talked about how Graph Neural Networks (GNNs) have become a hot research topic because of the advancements that were achieved on the tasks related to deep learning on complex networks (graphs).

Since this is such an important topic, many pieces of research are being developed focusing on improving the GNNs. Two main objectives are in place: make them more powerful (in terms of predictive performance) and make them run faster and with less memory consumption.

In this post, we will see that today, traditional GNNs that use the Message-Passing framework are very slow to train and require lots of memory. I pretend to guide you on some of the tactics that are already in place to try and scale GNNs to very large graphs.

The message-passing framework consists of aggregating the information from the neighbors of a node, together with the information of the node itself, to generate the node representation.

We already saw a little bit of how this works on this post and its relation to the WL-Test on this post. Now, let’s evaluate how difficult it may be to have a deep network with a large graph.

Let’s remember that each layer added to a GNN represents one more hop on the network.

Now, if we look at the GraphSAGE [1] architecture which selects a fixed number of neighbors for each node on each layer, we can see how the memory consumption starts getting out of hand.

Let F be the number of features you have on the X dataset of the network and let L be the number of layers. To start, you will have a memory consumption of O(LF²) to hold up the weights matrices of the network. This itself is not a big problem.

Now, let’s look to the memory consumption that comes from holding in memory the embeddings of the nodes which you need in order to compute the embedding of each target node during training.

Let b be your batch size and r be the number of nodes you sample using GraphSAGE. Let’s assume that you sample the same amount of nodes per layer. In that case, the amount of memory you will need will be O(bFr^L). More information about this calculation can be found in [2]

The problem lies in the exponential relationship between the number of layers of the network. This shows why several of the architectures that exist today are not able to go much further than 3 or 4 layers.

(It is important to notice here that some research points out the fact that maybe going deeper on GNNs will not yield performance gains, this discussion, however, is out of the scope of this article).

Let’s now review some of the tactics researchers have been employing to try and improve the scalability problem for the GNNs.

Sampling methodologies try to reduce the number of nodes we need to deal with at any given point in time by sampling the computational graph. Depending on how and where this sampling occurs, we may have three different classifications.

Node Sampling

This is the methodology implemented by the GraphSAGE we just saw. The idea is that we may not need to consider the entire node neighborhood during training and we can select a few nodes instead.

As we saw, this still imposes problems with the complexity. Also, random sampling may not be a good strategy since there are no criteria for choosing those nodes, which may negatively impact performance.

Subgraph Sampling

The idea behind subgraph sampling is that there may be a way of splitting a really big complex network into smaller components and then batch these components during training. This methodology has two main examples: the ClusterGCN [2] and the GraphSAINT [3].

ClusterGCN applies a clustering algorithm (or community detection algorithm) to the graph previously to training. Thus, by generating several clusters (subgraphs) one has a ready-to-use batch of subgraphs.

GraphSAINT on the other hand has some sampling methodologies that define which nodes and which edges should be added to the sampled subgraph taking into account the importance and the bias of the sampling.

Both methodologies are successful at improving the time and memory complexity of the training of the models. However, both include a potentially slow pre-processing step to generate the subgraphs. Also, they have some assumptions that may now generalize well to every network. For example, what happens if there is no community structure on the graph? ClusterGCN may not work well in those cases.

Layer Sampling

The layer sampling strategy consists of sampling the nodes that will be considered on each layer. Notice that this is different from node sampling. On node sampling, every node passes through every layer, just the computation of the neighbors’ changes. On layer sampling, one layer may not have one node at all.

The main example of this kind of strategy is the FastGCN [4]. The architecture assumes independence between nodes and then applies some Monte Carlo approximations to generate the sampling.

Notice that independence between nodes is a very strong assumption here. We are dealing with a graph where, by definition, nodes have some relationship between them (an edge).

The other category of strategies to scale GNNs is the simplification of the architectures. The idea behind this kind of strategy is to break the message passing framework as defined in the classical GCN paper [5] in favor of another type of architecture that scales better via precomputation. We will explore two examples of this kind of architecture.

Simplifying Graph Convolutional Neural Networks (SGC)

This architecture [6] starts with the hypothesis that what generates the good performance of the GNNs is the aggregation of the neighbors and not the non-linearities (ReLU and others) between the layers of the GNN.

If you remove the non-linearities, you will see that the entire GNN starts to look like a sequence of matrix multiplications. Take for example a three-layer network:

Where S is the normalized adjacency matrix, X is the feature vector and P are the weight matrices of the network layers.

One can clearly see that the entire left side of these multiplications does not depend on the weights of the network. This means that we can multiply it before training the network.

The authors of the architecture show that this method is competitive with the traditional GNNs showing that one can remove the non-linearities of the network and precompute a lot of the information that previously would happen during training.

Scalable Inception Graph Neural Networks (SIGN)

This architecture [7] is based on the Inception Network from the Computer Vision area. The basic idea behind this network is that one can precompute multiple aggregations without trainable weights and then concatenate them on a model to generate the embeddings.

This is very similar to the idea from the SGC. On SIGN, the authors of the architecture precompute each type of hop of the network (by exponentiation of the adjacency matrix), then multiply it by the feature matrix, and only then do they use a downstream model.

The ideas from these two architectures show that pre-computation is a very prominent area of research that can yield really powerful results for GNNs

This is not, in any way, an extensive review of the scaling methodologies for Graph Neural Networks. However, I hope this was able to show you some of the most famous methodologies and how they fit together in the bigger picture of scaling GNNs.

[1] Hamilton, William & Ying, Rex & Leskovec, Jure. (2017). Inductive Representation Learning on Large Graphs.

[2] Chiang, Wei-Lin & Liu, Xuanqing & Si, Si & Li, Yang & Bengio, Samy & Hsieh, Cho-Jui. (2019). Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks.

[3] Zeng, Hanqing & Zhou, Hongkuan & Srivastava, Ajitesh & Kannan, Rajgopal & Prasanna, Viktor. (2020). GraphSAINT: Graph sampling based inductive learning method.

[4] Chen, Jie & Ma, Tengfei & Xiao, Cao. (2018). FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling. ICLR.

[5] Kipf, Thomas & Welling, Max. (2016). Semi-Supervised Classification with Graph Convolutional Networks.

[6] Wu, Felix & Zhang, Tianyi & Souza, Jr & Fifty, Christopher & Yu, Tao & Weinberger, Kilian. (2019). Simplifying Graph Convolutional Networks.

[7] Rossi, Emanuele & Frasca, Fabrizio & Chamberlain, Ben & Eynard, Davide & Bronstein, Michael & Monti, Federico. (2020). SIGN: Scalable Inception Graph Neural Networks.

A summary of ways one can make their networks run faster and with less memory