torch_geometric.datasets

Homogeneous Datasets

KarateClub

Zachary's karate club network from the "An Information Flow Model for Conflict and Fission in Small Groups" paper, containing 34 nodes, connected by 156 (undirected and unweighted) edges.

TUDataset

A variety of graph kernel benchmark datasets, e.g., "IMDB-BINARY", "REDDIT-BINARY" or "PROTEINS", collected from the TU Dortmund University.

GNNBenchmarkDataset

A variety of artificially and semi-artificially generated graph datasets from the "Benchmarking Graph Neural Networks" paper.

Planetoid

The citation network datasets "Cora", "CiteSeer" and "PubMed" from the "Revisiting Semi-Supervised Learning with Graph Embeddings" paper.

NELL

The NELL dataset, a knowledge graph from the "Toward an Architecture for Never-Ending Language Learning" paper.

CitationFull

The full citation network datasets from the "Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking" paper.

CoraFull

Alias for CitationFull with name="Cora".

Coauthor

The Coauthor CS and Coauthor Physics networks from the "Pitfalls of Graph Neural Network Evaluation" paper.

Amazon

The Amazon Computers and Amazon Photo networks from the "Pitfalls of Graph Neural Network Evaluation" paper.

PPI

The protein-protein interaction networks from the "Predicting Multicellular Function through Multi-layer Tissue Networks" paper, containing positional gene sets, motif gene sets and immunological signatures as features (50 in total) and gene ontology sets as labels (121 in total).

Reddit

The Reddit dataset from the "Inductive Representation Learning on Large Graphs" paper, containing Reddit posts belonging to different communities.

Reddit2

The Reddit dataset from the "GraphSAINT: Graph Sampling Based Inductive Learning Method" paper, containing Reddit posts belonging to different communities.

Flickr

The Flickr dataset from the "GraphSAINT: Graph Sampling Based Inductive Learning Method" paper, containing descriptions and common properties of images.

Yelp

The Yelp dataset from the "GraphSAINT: Graph Sampling Based Inductive Learning Method" paper, containing customer reviewers and their friendships.

AmazonProducts

The Amazon dataset from the "GraphSAINT: Graph Sampling Based Inductive Learning Method" paper, containing products and their categories.

QM7b

The QM7b dataset from the "MoleculeNet: A Benchmark for Molecular Machine Learning" paper, consisting of 7,211 molecules with 14 regression targets.

QM9

The QM9 dataset from the "MoleculeNet: A Benchmark for Molecular Machine Learning" paper, consisting of about 130,000 molecules with 19 regression targets.

MD17

A variety of ab-initio molecular dynamics trajectories from the authors of sGDML.

ZINC

The ZINC dataset from the ZINC database and the "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules" paper, containing about 250,000 molecular graphs with up to 38 heavy atoms.

AQSOL

The AQSOL dataset from the "Benchmarking Graph Neural Networks" paper based on AqSolDB, a standardized database of 9,982 molecular graphs with their aqueous solubility values, collected from 9 different data sources.

MoleculeNet

The MoleculeNet benchmark collection from the "MoleculeNet: A Benchmark for Molecular Machine Learning" paper, containing datasets from physical chemistry, biophysics and physiology.

PCQM4Mv2

The PCQM4Mv2 dataset from the "OGB-LSC: A Large-Scale Challenge for Machine Learning on Graphs" paper.

Entities

The relational entities networks "AIFB", "MUTAG", "BGS" and "AM" from the "Modeling Relational Data with Graph Convolutional Networks" paper.

RelLinkPredDataset

The relational link prediction datasets from the "Modeling Relational Data with Graph Convolutional Networks" paper.

GEDDataset

The GED datasets from the "Graph Edit Distance Computation via Graph Neural Networks" paper.

AttributedGraphDataset

A variety of attributed graph datasets from the "Scaling Attributed Network Embedding to Massive Graphs" paper.

MNISTSuperpixels

MNIST superpixels dataset from the "Geometric Deep Learning on Graphs and Manifolds Using Mixture Model CNNs" paper, containing 70,000 graphs with 75 nodes each.

FAUST

The FAUST humans dataset from the "FAUST: Dataset and Evaluation for 3D Mesh Registration" paper, containing 100 watertight meshes representing 10 different poses for 10 different subjects.

DynamicFAUST

The dynamic FAUST humans dataset from the "Dynamic FAUST: Registering Human Bodies in Motion" paper.

ShapeNet

The ShapeNet part level segmentation dataset from the "A Scalable Active Framework for Region Annotation in 3D Shape Collections" paper, containing about 17,000 3D shape point clouds from 16 shape categories.

ModelNet

The ModelNet10/40 datasets from the "3D ShapeNets: A Deep Representation for Volumetric Shapes" paper, containing CAD models of 10 and 40 categories, respectively.

CoMA

The CoMA 3D faces dataset from the "Generating 3D faces using Convolutional Mesh Autoencoders" paper, containing 20,466 meshes of extreme expressions captured over 12 different subjects.

SHREC2016

The SHREC 2016 partial matching dataset from the "SHREC'16: Partial Matching of Deformable Shapes" paper.

TOSCA

The TOSCA dataset from the "Numerical Geometry of Non-Rigid Shapes" book, containing 80 meshes.

PCPNetDataset

The PCPNet dataset from the "PCPNet: Learning Local Shape Properties from Raw Point Clouds" paper, consisting of 30 shapes, each given as a point cloud, densely sampled with 100k points.

S3DIS

The (pre-processed) Stanford Large-Scale 3D Indoor Spaces dataset from the "3D Semantic Parsing of Large-Scale Indoor Spaces" paper, containing point clouds of six large-scale indoor parts in three buildings with 12 semantic elements (and one clutter class).

GeometricShapes

Synthetic dataset of various geometric shapes like cubes, spheres or pyramids.

BitcoinOTC

The Bitcoin-OTC dataset from the "EvolveGCN: Evolving Graph Convolutional Networks for Dynamic Graphs" paper, consisting of 138 who-trusts-whom networks of sequential time steps.

GDELTLite

The (reduced) version of the Global Database of Events, Language, and Tone (GDELT) dataset used in the "Do We Really Need Complicated Model Architectures for Temporal Networks?" paper, consisting of events collected from 2016 to 2020.

ICEWS18

The Integrated Crisis Early Warning System (ICEWS) dataset used in, e.g., the "Recurrent Event Network for Reasoning over Temporal Knowledge Graphs" paper, consisting of events collected from 1/1/2018 to 10/31/2018 (24-hour time granularity).

GDELT

The Global Database of Events, Language, and Tone (GDELT) dataset used in, e.g., the "Recurrent Event Network for Reasoning over Temporal Knowledge Graphs" paper, consisting of events collected from 1/1/2018 to 1/31/2018 (15-minute time granularity).

WILLOWObjectClass

The WILLOW-ObjectClass dataset from the "Learning Graphs to Match" paper, containing 10 equal keypoints of at least 40 images in each category.

PascalVOCKeypoints

The Pascal VOC 2011 dataset with Berkeley annotations of keypoints from the "Poselets: Body Part Detectors Trained Using 3D Human Pose Annotations" paper, containing 0 to 23 keypoints per example over 20 categories.

PascalPF

The Pascal-PF dataset from the "Proposal Flow" paper, containing 4 to 16 keypoints per example over 20 categories.

SNAPDataset

A variety of graph datasets collected from SNAP at Stanford University.

SuiteSparseMatrixCollection

A suite of sparse matrix benchmarks known as the Suite Sparse Matrix Collection collected from a wide range of applications.

WordNet18

The WordNet18 dataset from the "Translating Embeddings for Modeling Multi-Relational Data" paper, containing 40,943 entities, 18 relations and 151,442 fact triplets, e.g., furniture includes bed.

WordNet18RR

The WordNet18RR dataset from the "Convolutional 2D Knowledge Graph Embeddings" paper, containing 40,943 entities, 11 relations and 93,003 fact triplets.

FB15k_237

The FB15K237 dataset from the "Translating Embeddings for Modeling Multi-Relational Data" paper, containing 14,541 entities, 237 relations and 310,116 fact triples.

WikiCS

The semi-supervised Wikipedia-based dataset from the "Wiki-CS: A Wikipedia-Based Benchmark for Graph Neural Networks" paper, containing 11,701 nodes, 216,123 edges, 10 classes and 20 different training splits.

WebKB

The WebKB datasets used in the "Geom-GCN: Geometric Graph Convolutional Networks" paper.

WikipediaNetwork

The Wikipedia networks introduced in the "Multi-scale Attributed Node Embedding" paper.

HeterophilousGraphDataset

The heterophilous graphs "Roman-empire", "Amazon-ratings", "Minesweeper", "Tolokers" and "Questions" from the "A Critical Look at the Evaluation of GNNs under Heterophily: Are We Really Making Progress?" paper.

Actor

The actor-only induced subgraph of the film-director-actor-writer network used in the "Geom-GCN: Geometric Graph Convolutional Networks" paper.

UPFD

The tree-structured fake news propagation graph classification dataset from the "User Preference-aware Fake News Detection" paper.

GitHub

The GitHub Web and ML Developers dataset introduced in the "Multi-scale Attributed Node Embedding" paper.

FacebookPagePage

The Facebook Page-Page network dataset introduced in the "Multi-scale Attributed Node Embedding" paper.

LastFMAsia

The LastFM Asia Network dataset introduced in the "Characteristic Functions on Graphs: Birds of a Feather, from Statistical Descriptors to Parametric Models" paper.

DeezerEurope

The Deezer Europe dataset introduced in the "Characteristic Functions on Graphs: Birds of a Feather, from Statistical Descriptors to Parametric Models" paper.

GemsecDeezer

The Deezer User Network datasets introduced in the "GEMSEC: Graph Embedding with Self Clustering" paper.

Twitch

The Twitch Gamer networks introduced in the "Multi-scale Attributed Node Embedding" paper.

Airports

The Airports dataset from the "struc2vec: Learning Node Representations from Structural Identity" paper, where nodes denote airports and labels correspond to activity levels.

LRGBDataset

The "Long Range Graph Benchmark (LRGB)" datasets, a collection of 5 graph learning datasets with tasks that are based on long-range dependencies in graphs.

MalNetTiny

The MalNet Tiny dataset from the "A Large-Scale Database for Graph Representation Learning" paper.

OMDB

The Organic Materials Database (OMDB) of bulk organic crystals.

PolBlogs

The Political Blogs dataset from the "The Political Blogosphere and the 2004 US Election: Divided they Blog" paper.

EmailEUCore

An e-mail communication network of a large European research institution, taken from the "Local Higher-order Graph Clustering" paper.

LINKXDataset

A variety of non-homophilous graph datasets from the "Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods" paper.

EllipticBitcoinDataset

The Elliptic Bitcoin dataset of Bitcoin transactions from the "Anti-Money Laundering in Bitcoin: Experimenting with Graph Convolutional Networks for Financial Forensics" paper.

EllipticBitcoinTemporalDataset

The time-step aware Elliptic Bitcoin dataset of Bitcoin transactions from the "Anti-Money Laundering in Bitcoin: Experimenting with Graph Convolutional Networks for Financial Forensics" paper.

DGraphFin

The DGraphFin networks from the "DGraph: A Large-Scale Financial Dataset for Graph Anomaly Detection" paper.

HydroNet

The HydroNet dataset from the "HydroNet: Benchmark Tasks for Preserving Intermolecular Interactions and Structural Motifs in Predictive and Generative Models for Molecular Data" paper, consisting of 5 million water clusters held together by hydrogen bonding networks.

AirfRANS

The AirfRANS dataset from the "AirfRANS: High Fidelity Computational Fluid Dynamics Dataset for Approximating Reynolds-Averaged Navier-Stokes Solutions" paper, consisting of 1,000 simulations of steady-state aerodynamics over 2D airfoils in a subsonic flight regime.

JODIEDataset

The temporal graph datasets from the "JODIE: Predicting Dynamic Embedding Trajectory in Temporal Interaction Networks" paper.

Wikidata5M

The Wikidata-5M dataset from the "KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation" paper, containing 4,594,485 entities, 822 relations, 20,614,279 train triples, 5,163 validation triples, and 5,133 test triples.

MyketDataset

The Myket Android Application Install dataset from the "Effect of Choosing Loss Function when Using T-Batching for Representation Learning on Dynamic Networks" paper.

BrcaTcga

The breast cancer (BRCA TCGA Pan-Cancer Atlas) dataset consisting of patients with survival information and gene expression data from cBioPortal and a network of biological interactions between those nodes from Pathway Commons.

NeuroGraphDataset

The NeuroGraph benchmark datasets from the "NeuroGraph: Benchmarks for Graph Machine Learning in Brain Connectomics" paper.

Heterogeneous Datasets

DBP15K

The DBP15K dataset from the "Cross-lingual Entity Alignment via Joint Attribute-Preserving Embedding" paper, where Chinese, Japanese and French versions of DBpedia were linked to its English version.

AMiner

The heterogeneous AMiner dataset from the "metapath2vec: Scalable Representation Learning for Heterogeneous Networks" paper, consisting of nodes of type "paper", "author" and "venue".

OGB_MAG

The ogbn-mag dataset from the "Open Graph Benchmark: Datasets for Machine Learning on Graphs" paper.

DBLP

A subset of the DBLP computer science bibliography website, as collected in the "MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding" paper.

MovieLens

A heterogeneous rating dataset, assembled by GroupLens Research from the MovieLens web site, consisting of nodes of type "movie" and "user".

MovieLens100K

The MovieLens 100K heterogeneous rating dataset, assembled by GroupLens Research from the MovieLens web site, consisting of movies (1,682 nodes) and users (943 nodes) with 100K ratings between them.

MovieLens1M

The MovieLens 1M heterogeneous rating dataset, assembled by GroupLens Research from the MovieLens web site, consisting of movies (3,883 nodes) and users (6,040 nodes) with approximately 1 million ratings between them.

IMDB

A subset of the Internet Movie Database (IMDB), as collected in the "MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding" paper.

LastFM

A subset of the last.fm music website keeping track of users' listening information from various sources, as collected in the "MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding" paper.

HGBDataset

A variety of heterogeneous graph benchmark datasets from the "Are We Really Making Much Progress? Revisiting, Benchmarking, and Refining Heterogeneous Graph Neural Networks" paper.

Taobao

Taobao is a dataset of user behaviors from Taobao offered by Alibaba, provided by the Tianchi Alicloud platform.

IGMCDataset

The user-item heterogeneous rating datasets "Douban", "Flixster" and "Yahoo-Music" from the "Inductive Matrix Completion Based on Graph Neural Networks" paper.

AmazonBook

A subset of the AmazonBook rating dataset from the "LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation" paper.

HM

The heterogeneous H&M dataset from the Kaggle H&M Personalized Fashion Recommendations challenge.

OSE_GVCS

A dataset describing the product ecology of the Open Source Ecology's iconoclastic Global Village Construction Set.

RCDD

The risk commodity detection dataset (RCDD) from the "Datasets and Interfaces for Benchmarking Heterogeneous Graph Neural Networks" paper.

Synthetic Datasets

FakeDataset

A fake dataset that returns randomly generated Data objects.

FakeHeteroDataset

A fake dataset that returns randomly generated HeteroData objects.

StochasticBlockModelDataset

A synthetic graph dataset generated by the stochastic block model.

RandomPartitionGraphDataset

The random partition graph dataset from the "How to Find Your Friendly Neighborhood: Graph Attention Design with Self-Supervision" paper.

MixHopSyntheticDataset

The MixHop synthetic dataset from the "MixHop: Higher-Order Graph Convolutional Architectures via Sparsified Neighborhood Mixing" paper, containing 10 graphs, each with varying degree of homophily (ranging from 0.0 to 0.9).

ExplainerDataset

Generates a synthetic dataset for evaluating explainability algorithms, as described in the "GNNExplainer: Generating Explanations for Graph Neural Networks" paper.

InfectionDataset

Generates a synthetic infection dataset for evaluating explainability algorithms, as described in the "Explainability Techniques for Graph Convolutional Networks" paper.

BA2MotifDataset

The synthetic BA-2motifs graph classification dataset for evaluating explainability algorithms, as described in the "Parameterized Explainer for Graph Neural Network" paper.

BAMultiShapesDataset

The synthetic BA-Multi-Shapes graph classification dataset for evaluating explainability algorithms, as described in the "Global Explainability of GNNs via Logic Combination of Learned Concepts" paper.

BAShapes

The BA-Shapes dataset from the "GNNExplainer: Generating Explanations for Graph Neural Networks" paper, containing a Barabasi-Albert (BA) graph with 300 nodes and a set of 80 "house"-structured graphs connected to it.

Graph Generators

GraphGenerator

An abstract base class for generating synthetic graphs.

BAGraph

Generates random Barabasi-Albert (BA) graphs.

ERGraph

Generates random Erdos-Renyi (ER) graphs.

GridGraph

Generates two-dimensional grid graphs.

TreeGraph

Generates tree graphs.

Motif Generators

MotifGenerator

An abstract base class for generating a motif.

CustomMotif

Generates a motif based on a custom structure coming from a torch_geometric.data.Data or networkx.Graph object.

HouseMotif

Generates the house-structured motif from the "GNNExplainer: Generating Explanations for Graph Neural Networks" paper, containing 5 nodes and 6 undirected edges.

CycleMotif

Generates the cycle motif from the "GNNExplainer: Generating Explanations for Graph Neural Networks" paper.

GridMotif

Generates the grid-structured motif from the "GNNExplainer: Generating Explanations for Graph Neural Networks" paper.