torch_geometric.datasets
Homogeneous Datasets
Zachary's karate club network from the "An Information Flow Model for Conflict and Fission in Small Groups" paper, containing 34 nodes, connected by 156 (undirected and unweighted) edges. 
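
The count of 156 is consistent with every undirected connection being stored once per direction (the network is usually reported with 78 undirected edges). The following plain-Python sketch illustrates that storage convention on a toy graph; `to_directed_pairs` and the toy data are hypothetical names used for illustration, not part of the library.

```python
def to_directed_pairs(undirected_edges):
    """Store every undirected edge {u, v} as the two directed pairs (u, v) and (v, u)."""
    pairs = set()
    for u, v in undirected_edges:
        pairs.add((u, v))
        pairs.add((v, u))
    return sorted(pairs)


# Toy triangle with 3 undirected edges (illustrative data, not the karate club graph).
toy_edges = [(0, 1), (1, 2), (0, 2)]
directed = to_directed_pairs(toy_edges)

# Each undirected edge contributes two directed entries.
num_undirected = len(directed) // 2
```

Under this convention, a graph with 78 undirected edges would be stored with 156 directed entries.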

A variety of graph kernel benchmark datasets, e.g.,

A variety of artificially and semiartificially generated graph datasets from the "Benchmarking Graph Neural Networks" paper. 

The citation network datasets 

The NELL dataset, a knowledge graph from the "Toward an Architecture for NeverEnding Language Learning" paper. 

The full citation network datasets from the "Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking" paper. 

Alias for 

The Coauthor CS and Coauthor Physics networks from the "Pitfalls of Graph Neural Network Evaluation" paper. 

The Amazon Computers and Amazon Photo networks from the "Pitfalls of Graph Neural Network Evaluation" paper. 

The protein-protein interaction networks from the "Predicting Multicellular Function through Multi-layer Tissue Networks" paper, containing positional gene sets, motif gene sets and immunological signatures as features (50 in total) and gene ontology sets as labels (121 in total).

The Reddit dataset from the "Inductive Representation Learning on Large Graphs" paper, containing Reddit posts belonging to different communities. 

The Reddit dataset from the "GraphSAINT: Graph Sampling Based Inductive Learning Method" paper, containing Reddit posts belonging to different communities. 

The Flickr dataset from the "GraphSAINT: Graph Sampling Based Inductive Learning Method" paper, containing descriptions and common properties of images. 

The Yelp dataset from the "GraphSAINT: Graph Sampling Based Inductive Learning Method" paper, containing customer reviewers and their friendship. 

The Amazon dataset from the "GraphSAINT: Graph Sampling Based Inductive Learning Method" paper, containing products and their categories.

The QM7b dataset from the "MoleculeNet: A Benchmark for Molecular Machine Learning" paper, consisting of 7,211 molecules with 14 regression targets. 

The QM9 dataset from the "MoleculeNet: A Benchmark for Molecular Machine Learning" paper, consisting of about 130,000 molecules with 19 regression targets. 

A variety of ab-initio molecular dynamics trajectories from the authors of sGDML.

The ZINC dataset from the ZINC database and the "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules" paper, containing about 250,000 molecular graphs with up to 38 heavy atoms.

The AQSOL dataset from the "Benchmarking Graph Neural Networks" paper based on AqSolDB, a standardized database of 9,982 molecular graphs with their aqueous solubility values, collected from 9 different data sources.

The MoleculeNet benchmark collection from the "MoleculeNet: A Benchmark for Molecular Machine Learning" paper, containing datasets from physical chemistry, biophysics and physiology. 

The PCQM4Mv2 dataset from the "OGB-LSC: A Large-Scale Challenge for Machine Learning on Graphs" paper.

The relational entities networks 

The relational link prediction datasets from the "Modeling Relational Data with Graph Convolutional Networks" paper. 

The GED datasets from the "Graph Edit Distance Computation via Graph Neural Networks" paper. 

A variety of attributed graph datasets from the "Scaling Attributed Network Embedding to Massive Graphs" paper. 

MNIST superpixels dataset from the "Geometric Deep Learning on Graphs and Manifolds Using Mixture Model CNNs" paper, containing 70,000 graphs with 75 nodes each. 

The FAUST humans dataset from the "FAUST: Dataset and Evaluation for 3D Mesh Registration" paper, containing 100 watertight meshes representing 10 different poses for 10 different subjects. 

The dynamic FAUST humans dataset from the "Dynamic FAUST: Registering Human Bodies in Motion" paper. 

The ShapeNet part level segmentation dataset from the "A Scalable Active Framework for Region Annotation in 3D Shape Collections" paper, containing about 17,000 3D shape point clouds from 16 shape categories. 

The ModelNet10/40 datasets from the "3D ShapeNets: A Deep Representation for Volumetric Shapes" paper, containing CAD models of 10 and 40 categories, respectively. 

The CoMA 3D faces dataset from the "Generating 3D faces using Convolutional Mesh Autoencoders" paper, containing 20,466 meshes of extreme expressions captured over 12 different subjects. 

The SHREC 2016 partial matching dataset from the "SHREC'16: Partial Matching of Deformable Shapes" paper. 

The TOSCA dataset from the "Numerical Geometry of Non-Rigid Shapes" book, containing 80 meshes.

The PCPNet dataset from the "PCPNet: Learning Local Shape Properties from Raw Point Clouds" paper, consisting of 30 shapes, each given as a point cloud, densely sampled with 100k points. 

The (pre-processed) Stanford Large-Scale 3D Indoor Spaces dataset from the "3D Semantic Parsing of Large-Scale Indoor Spaces" paper, containing point clouds of six large-scale indoor parts in three buildings with 12 semantic elements (and one clutter class).

Synthetic dataset of various geometric shapes like cubes, spheres or pyramids. 

The Bitcoin-OTC dataset from the "EvolveGCN: Evolving Graph Convolutional Networks for Dynamic Graphs" paper, consisting of 138 who-trusts-whom networks of sequential time steps.

The (reduced) version of the Global Database of Events, Language, and Tone (GDELT) dataset used in the "Do We Really Need Complicated Model Architectures for Temporal Networks?" paper, consisting of events collected from 2016 to 2020. 

The Integrated Crisis Early Warning System (ICEWS) dataset used in, e.g., the "Recurrent Event Network for Reasoning over Temporal Knowledge Graphs" paper, consisting of events collected from 1/1/2018 to 10/31/2018 (24-hour time granularity).

The Global Database of Events, Language, and Tone (GDELT) dataset used in, e.g., the "Recurrent Event Network for Reasoning over Temporal Knowledge Graphs" paper, consisting of events collected from 1/1/2018 to 1/31/2018 (15-minute time granularity).

The WILLOW-ObjectClass dataset from the "Learning Graphs to Match" paper, containing 10 equal keypoints of at least 40 images in each category.

The Pascal VOC 2011 dataset with Berkeley annotations of keypoints from the "Poselets: Body Part Detectors Trained Using 3D Human Pose Annotations" paper, containing 0 to 23 keypoints per example over 20 categories.

The Pascal-PF dataset from the "Proposal Flow" paper, containing 4 to 16 keypoints per example over 20 categories.

A variety of graph datasets collected from SNAP at Stanford University. 

A suite of sparse matrix benchmarks known as the Suite Sparse Matrix Collection collected from a wide range of applications. 

The WordNet18 dataset from the "Translating Embeddings for Modeling Multi-Relational Data" paper, containing 40,943 entities, 18 relations and 151,442 fact triplets, e.g., furniture includes bed.

The WordNet18RR dataset from the "Convolutional 2D Knowledge Graph Embeddings" paper, containing 40,943 entities, 11 relations and 93,003 fact triplets. 

The FB15K-237 dataset from the "Translating Embeddings for Modeling Multi-Relational Data" paper, containing 14,541 entities, 237 relations and 310,116 fact triples.

The semi-supervised Wikipedia-based dataset from the "Wiki-CS: A Wikipedia-Based Benchmark for Graph Neural Networks" paper, containing 11,701 nodes, 216,123 edges, 10 classes and 20 different training splits.

The WebKB datasets used in the "Geom-GCN: Geometric Graph Convolutional Networks" paper.

The Wikipedia networks introduced in the "Multi-scale Attributed Node Embedding" paper.

The heterophilous graphs 

The actor-only induced subgraph of the film-director-actor-writer network used in the "Geom-GCN: Geometric Graph Convolutional Networks" paper.

The tree-structured fake news propagation graph classification dataset from the "User Preference-aware Fake News Detection" paper.

The GitHub Web and ML Developers dataset introduced in the "Multi-scale Attributed Node Embedding" paper.

The Facebook Page-Page network dataset introduced in the "Multi-scale Attributed Node Embedding" paper.

The LastFM Asia Network dataset introduced in the "Characteristic Functions on Graphs: Birds of a Feather, from Statistical Descriptors to Parametric Models" paper. 

The Deezer Europe dataset introduced in the "Characteristic Functions on Graphs: Birds of a Feather, from Statistical Descriptors to Parametric Models" paper. 

The Deezer User Network datasets introduced in the "GEMSEC: Graph Embedding with Self Clustering" paper. 

The Twitch Gamer networks introduced in the "Multi-scale Attributed Node Embedding" paper.

The Airports dataset from the "struc2vec: Learning Node Representations from Structural Identity" paper, where nodes denote airports and labels correspond to activity levels. 

The "Long Range Graph Benchmark (LRGB)" datasets, a collection of 5 graph learning datasets with tasks based on long-range dependencies in graphs.

The MalNet Tiny dataset from the "A Large-Scale Database for Graph Representation Learning" paper.

The Organic Materials Database (OMDB) of bulk organic crystals. 

The Political Blogs dataset from the "The Political Blogosphere and the 2004 US Election: Divided they Blog" paper. 

An e-mail communication network of a large European research institution, taken from the "Local Higher-order Graph Clustering" paper.

A variety of non-homophilous graph datasets from the "Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods" paper.

The Elliptic Bitcoin dataset of Bitcoin transactions from the "Anti-Money Laundering in Bitcoin: Experimenting with Graph Convolutional Networks for Financial Forensics" paper.

The time-step aware Elliptic Bitcoin dataset of Bitcoin transactions from the "Anti-Money Laundering in Bitcoin: Experimenting with Graph Convolutional Networks for Financial Forensics" paper.

The DGraphFin networks from the "DGraph: A Large-Scale Financial Dataset for Graph Anomaly Detection" paper.

The HydroNet dataset from the "HydroNet: Benchmark Tasks for Preserving Intermolecular Interactions and Structural Motifs in Predictive and Generative Models for Molecular Data" paper, consisting of 5 million water clusters held together by hydrogen bonding networks.

The AirfRANS dataset from the "AirfRANS: High Fidelity Computational Fluid Dynamics Dataset for Approximating Reynolds-Averaged Navier-Stokes Solutions" paper, consisting of 1,000 simulations of steady-state aerodynamics over 2D airfoils in a subsonic flight regime.

The temporal graph datasets from the "JODIE: Predicting Dynamic Embedding Trajectory in Temporal Interaction Networks" paper. 

The Wikidata-5M dataset from the "KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation" paper, containing 4,594,485 entities, 822 relations, 20,614,279 train triples, 5,163 validation triples, and 5,133 test triples.

The Myket Android Application Install dataset from the "Effect of Choosing Loss Function when Using T-Batching for Representation Learning on Dynamic Networks" paper.

The breast cancer (BRCA TCGA PanCancer Atlas) dataset consisting of patients with survival information and gene expression data from cBioPortal and a network of biological interactions between those nodes from Pathway Commons. 

The NeuroGraph benchmark datasets from the "NeuroGraph: Benchmarks for Graph Machine Learning in Brain Connectomics" paper. 
Heterogeneous Datasets
The DBP15K dataset from the "Cross-lingual Entity Alignment via Joint Attribute-Preserving Embedding" paper, where Chinese, Japanese and French versions of DBpedia were linked to its English version.

The heterogeneous AMiner dataset from the "metapath2vec: Scalable Representation Learning for Heterogeneous Networks" paper, consisting of nodes from type 

The 

A subset of the DBLP computer science bibliography website, as collected in the "MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding" paper. 

A heterogeneous rating dataset, assembled by GroupLens Research from the MovieLens web site, consisting of nodes of type 

The MovieLens 100K heterogeneous rating dataset, assembled by GroupLens Research from the MovieLens web site, consisting of movies (1,682 nodes) and users (943 nodes) with 100K ratings between them. 

The MovieLens 1M heterogeneous rating dataset, assembled by GroupLens Research from the MovieLens web site, consisting of movies (3,883 nodes) and users (6,040 nodes) with approximately 1 million ratings between them. 

A subset of the Internet Movie Database (IMDB), as collected in the "MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding" paper. 

A subset of the last.fm music website keeping track of users' listening information from various sources, as collected in the "MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding" paper.

A variety of heterogeneous graph benchmark datasets from the "Are We Really Making Much Progress? Revisiting, Benchmarking, and Refining Heterogeneous Graph Neural Networks" paper. 

Taobao is a dataset of user behaviors from Taobao offered by Alibaba, provided by the Tianchi Alicloud platform. 

The user-item heterogeneous rating datasets

A subset of the Amazon-Book rating dataset from the "LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation" paper.

The heterogeneous H&M dataset from the Kaggle H&M Personalized Fashion Recommendations challenge. 

A dataset describing the Product ecology of the Open Source Ecology's iconoclastic Global Village Construction Set. 

The risk commodity detection dataset (RCDD) from the "Datasets and Interfaces for Benchmarking Heterogeneous Graph Neural Networks" paper. 
Synthetic Datasets
A fake dataset that returns randomly generated 

A fake dataset that returns randomly generated 

A synthetic graph dataset generated by the stochastic block model. 
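
As a rough illustration of the stochastic block model, the plain-Python sketch below assigns nodes to blocks and samples each undirected edge independently with a probability that depends only on the blocks of its endpoints. The function name `stochastic_block_model` and its parameters `block_sizes`, `edge_probs` and `seed` are assumptions made for this example; it is not the library's implementation.

```python
import random


def stochastic_block_model(block_sizes, edge_probs, seed=0):
    """Sample an undirected graph where P(edge u-v) = edge_probs[block(u)][block(v)]."""
    rng = random.Random(seed)
    # Node i receives the index of the block it belongs to.
    labels = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    num_nodes = len(labels)
    edges = []
    for u in range(num_nodes):
        for v in range(u + 1, num_nodes):  # each unordered pair is visited once
            if rng.random() < edge_probs[labels[u]][labels[v]]:
                edges.append((u, v))
    return labels, edges


# Two blocks with dense intra-block and sparse inter-block connectivity.
sbm_labels, sbm_edges = stochastic_block_model(
    block_sizes=[4, 6],
    edge_probs=[[0.9, 0.1], [0.1, 0.8]],
    seed=42,
)
```

Setting the off-diagonal probabilities to zero yields disconnected communities, which is a convenient sanity check for the sampler.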

The random partition graph dataset from the "How to Find Your Friendly Neighborhood: Graph Attention Design with Self-Supervision" paper.

The MixHop synthetic dataset from the "MixHop: Higher-Order Graph Convolutional Architectures via Sparsified Neighborhood Mixing" paper, containing 10 graphs, each with a varying degree of homophily (ranging from 0.0 to 0.9).
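
To make the notion of homophily concrete, the sketch below computes one common measure, the edge homophily ratio, defined as the fraction of edges whose endpoints share a label. This is an assumption about the measure for illustration only; the paper's exact generation procedure is not reproduced here, and `edge_homophily` is a hypothetical helper name.

```python
def edge_homophily(edges, labels):
    """Return the fraction of edges that connect two nodes with the same label."""
    if not edges:
        return 0.0
    same_label = sum(1 for u, v in edges if labels[u] == labels[v])
    return same_label / len(edges)


# Toy 4-cycle: nodes 0 and 1 share label 0, nodes 2 and 3 share label 1.
toy_labels = [0, 0, 1, 1]
toy_cycle = [(0, 1), (1, 2), (2, 3), (0, 3)]

# Edges (0, 1) and (2, 3) join equal labels; (1, 2) and (0, 3) do not.
ratio = edge_homophily(toy_cycle, toy_labels)
```

A ratio near 0.0 means most edges connect differently labeled nodes, while a ratio near 1.0 means most edges stay within a class.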

Generates a synthetic dataset for evaluating explainability algorithms, as described in the "GNNExplainer: Generating Explanations for Graph Neural Networks" paper.

Generates a synthetic infection dataset for evaluating explainability algorithms, as described in the "Explainability Techniques for Graph Convolutional Networks" paper.

The synthetic BA-2motifs graph classification dataset for evaluating explainability algorithms, as described in the "Parameterized Explainer for Graph Neural Network" paper.

The synthetic BA-Multi-Shapes graph classification dataset for evaluating explainability algorithms, as described in the "Global Explainability of GNNs via Logic Combination of Learned Concepts" paper.

The BA-Shapes dataset from the "GNNExplainer: Generating Explanations for Graph Neural Networks" paper, containing a Barabasi-Albert (BA) graph with 300 nodes and a set of 80 "house"-structured graphs connected to it.
Graph Generators
An abstract base class for generating synthetic graphs. 

Generates random Barabasi-Albert (BA) graphs.
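
For readers unfamiliar with the model, the following plain-Python sketch grows a Barabasi-Albert graph by preferential attachment: each new node connects to a fixed number of distinct existing nodes, chosen with probability proportional to their current degree. The function `barabasi_albert` and its parameters `num_nodes`, `num_edges` and `seed` are names assumed for this example, and the library's generator may differ in its details.

```python
import random


def barabasi_albert(num_nodes, num_edges, seed=0):
    """Grow a graph where each new node attaches to `num_edges` existing nodes,
    sampled proportionally to degree (preferential attachment)."""
    if not 1 <= num_edges < num_nodes:
        raise ValueError("require 1 <= num_edges < num_nodes")
    rng = random.Random(seed)
    edges = []
    # Every node appears in this list once per incident edge, so a uniform
    # draw from it is a degree-proportional draw over nodes.
    degree_pool = []
    # The first new node connects to all of the initial nodes.
    targets = list(range(num_edges))
    for new_node in range(num_edges, num_nodes):
        edges.extend((t, new_node) for t in targets)
        degree_pool.extend(targets)
        degree_pool.extend([new_node] * num_edges)
        # Pick distinct targets for the next node.
        chosen = set()
        while len(chosen) < num_edges:
            chosen.add(rng.choice(degree_pool))
        targets = sorted(chosen)
    return edges


ba_edges = barabasi_albert(num_nodes=20, num_edges=2, seed=1)
```

In this sketch the graph always has `(num_nodes - num_edges) * num_edges` undirected edges, since every node added after the initial ones contributes exactly `num_edges` of them.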

Generates random Erdos-Renyi (ER) graphs.
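
A minimal sketch of the Erdos-Renyi model, assuming the G(n, p) variant in which every unordered node pair becomes an edge independently with probability p. The names `erdos_renyi`, `num_nodes`, `edge_prob` and `seed` are chosen for this example and are not taken from the library.

```python
import random


def erdos_renyi(num_nodes, edge_prob, seed=0):
    """Include each of the n*(n-1)/2 possible undirected edges with probability edge_prob."""
    rng = random.Random(seed)
    return [
        (u, v)
        for u in range(num_nodes)
        for v in range(u + 1, num_nodes)
        if rng.random() < edge_prob
    ]


er_edges = erdos_renyi(num_nodes=6, edge_prob=0.5, seed=0)
```

With `edge_prob=1.0` the result is the complete graph, and with `edge_prob=0.0` it is empty, which gives two easy boundary checks.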

Generates two-dimensional grid graphs.
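
The sketch below builds a two-dimensional grid graph in plain Python under the assumption of a 4-neighborhood, where each cell is linked to its right and lower neighbors only. The library's generator may use a different neighborhood (for example, including diagonals), so treat `grid_graph` and its row-major node numbering as illustrative choices.

```python
def grid_graph(height, width):
    """Return the undirected edges of a height x width grid with 4-neighborhood."""

    def node_id(row, col):
        # Row-major numbering: node 0 is the top-left cell.
        return row * width + col

    edges = []
    for row in range(height):
        for col in range(width):
            if col + 1 < width:  # horizontal neighbor
                edges.append((node_id(row, col), node_id(row, col + 1)))
            if row + 1 < height:  # vertical neighbor
                edges.append((node_id(row, col), node_id(row + 1, col)))
    return edges


grid_edges = grid_graph(height=3, width=4)
```

Under this assumption a grid has `height * (width - 1) + width * (height - 1)` undirected edges.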
Motif Generators
An abstract base class for generating a motif. 

Generates a motif based on a custom structure coming from a 

Generates the house-structured motif from the "GNNExplainer: Generating Explanations for Graph Neural Networks" paper, containing 5 nodes and 6 undirected edges.
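
One way to picture the house motif is as a square with a roof node attached to two adjacent corners, which gives exactly 5 nodes and 6 undirected edges. The node numbering in the sketch below is an assumption made for illustration and need not match the numbering used by the library.

```python
from collections import Counter

# Square 0-1-2-3 forms the walls; node 4 is the roof, joined to corners 0 and 1.
HOUSE_EDGES = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 4), (1, 4)]

house_nodes = sorted({node for edge in HOUSE_EDGES for node in edge})

# Storing each undirected edge in both directions doubles the entry count.
house_directed = [(u, v) for u, v in HOUSE_EDGES] + [(v, u) for u, v in HOUSE_EDGES]

# Degree of each node: the two roof-bearing corners have degree 3, the rest degree 2.
house_degree = Counter(node for edge in HOUSE_EDGES for node in edge)
```

The degree pattern (two nodes of degree 3, three of degree 2) is what distinguishes the roles of nodes within the motif.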

Generates the cycle motif from the "GNNExplainer: Generating Explanations for Graph Neural Networks" paper. 
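
A cycle motif can be sketched as a ring in which node i is linked to node i + 1 and the last node closes the loop back to node 0. The helper `cycle_motif` and its `num_nodes` parameter are hypothetical names used for this example only.

```python
def cycle_motif(num_nodes):
    """Return the undirected edges of a simple cycle over `num_nodes` nodes."""
    if num_nodes < 3:
        raise ValueError("a simple cycle needs at least 3 nodes")
    return [(i, (i + 1) % num_nodes) for i in range(num_nodes)]


cycle_edges = cycle_motif(5)
```

Every node in such a cycle has degree 2, and the number of edges equals the number of nodes.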