Scaling GNNs via Neighbor Sampling

One of the challenges of Graph Neural Networks is to scale them to large graphs, e.g., in industrial and social applications. Traditional deep neural networks are known to scale well to large amounts of data by decomposing the training loss into individual samples (called a mini-batch) and approximating exact gradients stochastically. In contrast, applying stochastic mini-batch training in GNNs is challenging since the embedding of a given node depends recursively on all its neighbor’s embeddings, leading to high inter-dependency between nodes that grows exponentially with respect to the number of layers. This phenomenon is often referred to as neighbor explosion. As a simple workaround, GNNs are typically executed in a full-batch fashion (see here for an example), where the GNN has access to all hidden node representations in all its layers. However, this is not feasible in large-scale graphs due to memory limitations and slow convergence.

Scalability techniques are indispensable for applying GNNs to large-scale graphs in order to alleviate the neighbor explosion problem induced by mini-batch training, i.e. node-wise, layer-wise or subgraph-wise sampling techniques, or to decouple propagations from predictions. In this tutorial, we take a closer look at the most common node-wise sampling approach, originally introduced in the “Inductive Representation Learning on Large Graphs” paper.

Neighbor Sampling

PyG implements neighbor sampling via its torch_geometric.loader.NeighborLoader class. Neighbor sampling works by recursively sampling a fixed number of at most \(k\) neighbors for a node \(v \in \mathcal{V}\), i.e. \(\tilde{\mathcal{N}}(v) \subset \mathcal{N}(v)\) with \(|\tilde{\mathcal{N}}| \le k\), leading to an overall bounded \(L\)-hop neighborhood size of \(\mathcal{O}(k^L)\). That is, starting from a set of seed nodes \(\mathcal{B} \subset \mathcal{V}\), we sample at most \(k\) neighbors for every node in \(v \in \mathcal{B}\), and then proceed to sample neighbors for every sampled node in the previous hop, and so on. The resulting graph structure holds a directed \(L\)-hop subgraph around every node in \(v \in \mathcal{B}\), for which it is guaranteed that every node has at least one path of at most length \(L\) to at least one of the seed nodes in \(\mathcal{B}\). As such, a message passing GNN with \(L\) layers will incorporate the full set of sampled nodes in its computation graph.

It is important to note that neighbor sampling can only mitigate the neighbor explosion problem to some extend since the overall neighborhood size still increases exponentially with the number of layers. As a result, sampling for more than two or three iterations is generally not feasible.

Often times, the number of sampled hops and the number of message passing layers is kept in sync. Specifically, it is very wasteful to sample for more hops than there exist message passing layers since the GNN will never be able to incorporate the features of the nodes sampled in later hops into the final node representation of its seed nodes. However, it is nonetheless possible to utilize deeper GNNs, but one needs to be careful to convert the sampled subgraph into a bidirectional variant to ensure correct message passing flow. PyG provides support for this via an additional argument in NeighborLoader, while other mini-batch techniques are designed for this use-case out-of-the-box, e.g., ClusterLoader, GraphSAINTSampler and ShaDowKHopSampler.

Basic Usage

Note

In this section of the tutorial, we will learn how to utilize the Node2Vec class of PyG to train GNNs on single graphs in a mini-batch fashion. A fully working example on large-scale real-world data is available in examples/reddit.py.

The NeighborLoader is initialized from a PyG Data or HeteroData object and defines how sampling should be performed:

input_nodes defines the set of seed nodes from which we want to start sampling from.
num_neighbors defines the number of neighbors to sample for each node in each hop.
batch_size defines the size of seed nodes we want to consider at once.
replace defines whether to sample with or without replacement.
shuffle defines whether seed nodes should be shuffled at every epoch.

import torch
from torch_geometric.data import Data
from torch_geometric.loader import NeighborLoader

x = torch.randn(8, 32)  # Node features of shape [num_nodes, num_features]
y = torch.randint(0, 4, (8, ))  # Node labels of shape [num_nodes]
edge_index = torch.tensor([
    [2, 3, 3, 4, 5, 6, 7],
    [0, 0, 1, 1, 2, 3, 4]],
)

#   0  1
#  / \/ \
# 2  3  4
# |  |  |
# 5  6  7

data = Data(x=x, y=y, edge_index=edge_index)

loader = NeighborLoader(
    data,
    input_nodes=torch.tensor([0, 1]),
    num_neighbors=[2, 1],
    batch_size=1,
    replace=False,
    shuffle=False,
)

Here, we initialize the NeigborLoader to sample subgraphs for the first two nodes, where we want to sample 2 neighbors in the first hop, and 1 neighbor in the second hop. Our batch_size is set to 1, such that input_nodes will be split into chunks of size 1.

In the execution of NeighborLoader, we expect that the seed node 0 samples nodes 2 and 3 in the first hop. In the second hop, node 2 samples node 5, and node 3 samples node 6. Let’s confirm by looking at the output of the loader:

batch = next(iter(loader))

batch.edge_index
>>> tensor([[1, 2, 3, 4],
            [0, 0, 1, 2]])

 batch.n_id
 >>> tensor([0, 2, 3, 5, 6])

 batch.batch_size
 >>> 1

The NeighborLoader will return a Data object, which contains the following attributes:

batch.edge_index contain the edge indices of the subgraph.
batch.n_id contains the original node indices of all the sampled nodes.
batch.batch_size contains the number of seed nodes/the batch size.

In addition, node and edge features will be filtered to only contain the features of sampled nodes/edges, respectively.

Importantly, batch.edge_index contains the sampled subgraph with relabeled node indices, such that its indices range from 0 to batch.num_nodes - 1. If you want to reconstruct the original node indices of batch.edge_index, do:

batch.n_id[batch.edge_index]
>>> tensor([[2, 3, 5, 6],
            [0, 0, 2, 3]])

Furthermore, while NeighborLoader starts sampling from seed nodes, the resulting subgraph will hold edges that point to the seed nodes. This aligns well with the default PyG message passing flow from source to destination nodes.

Lastly, nodes in the output of NeighborLoader are guaranteed to be sorted. In particular, the first batch_size sampled nodes will exactly match with the seed nodes that were used for sampling:

batch.n_id[:batch.batch_size]
>>> tensor([0])

Afterwards, we can use NeighborLoader as a data loading routine to train GNNs on large-scale graphs in mini-batch fashion. For this, let’s create a simple two-layer GraphSAGE model:

from torch_geometric.nn import GraphSAGE

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = GraphSAGE(
    in_channels=32,
    hidden_channels=64,
    out_channels=4,
    num_layers=2
).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

We can now combine the loader and model to define our training routine:

import torch.nn.functional as F

for batch in loader:
    optimizer.zero_grad()
    batch = batch.to(device)
    out = model(batch.x, batch.edge_index)

    # NOTE Only consider predictions and labels of seed nodes:
    y = batch.y[:batch.batch_size]
    out = out[:batch.batch_size]

    loss = F.cross_entropy(out, y)
    loss.backward()
    optimizer.step()

The training loop follows a similar design to any other PyTorch training loop. The only important difference is that by default the model will output a matrix of shape [batch.num_nodes, *], while we are only interested in the predictions of the seed nodes. As such, we can use efficient slicing both on the node predictions and the ground-truth information batch.y to only obtain predictions and ground-truth information of actual seed nodes. This ensures that we are only making use of the first batch_size many nodes for loss and metric computation.

Hierarchical Extension

A drawback of Neighborloader is that it computes a representations for all sampled nodes at all depths of the network. However, nodes sampled in later hops no longer contribute to the node representations of seed nodes in later GNN layers, thus performing useless computation. NeighborLoader will be marginally slower since we are computing node embeddings for nodes we no longer need. This is a trade-off we make to obtain a clean, modular and experimental-friendly GNN design, which does not tie the definition of the model to its utilized data loader routine. The Hierarchical Neighborhood Sampling tutorial shows how to eliminate this overhead and speed up training and inference in mini-batch GNNs further.

Advanced Options

NeighborLoader provides many more features for advanced usage. In particular,

NeighborLoader supports both sampling on homogeneous and heterogeneous graphs out-of-the-box. For sampling on heterogeneous graphs, simply initialize it with a HeteroData object. Sampling on heterogeneous graphs via NeighborLoader allows for fine-granular control of sampling parameters, e.g., it allows to specify the number of neighbors to sample for each edge type individually. Take a look at the Heterogeneous Graph Learning tutorial for additional information.
By default, NeighborLoader fuses sampled nodes across different seed nodes into a single subgraph. This way, shared neighbors of seed nodes will not be duplicated in the resulting subgraph and hence save memory. You can disable this behavior by passing the disjoint=True option to the NeighborLoader.
By default, the subgraphs returned from NeighborLoader will be directed, which restricts its use to GNNs with equal depth to the number of sampling hops. If you want to utilize deeper GNNs, specify the subgraph_type option. If set to "bidirectional", sampled edges are converted to bidirectional edges. If set to "induced", the returned subgraph will contain the induced subgraph of all sampled nodes.
NeighborLoader is designed to perform sampling from individual seed nodes. As such, it is not directly applicable in a link prediction scenario. For this use-cases, we developed the LinkNeighborLoader, which expects a set of input edges, and will return subgraphs that were created via neighbor sampling from both source and destination nodes.