torch_geometric.datasets
Homogeneous Datasets
Zachary's karate club network from the "An Information Flow Model for Conflict and Fission in Small Groups" paper, containing 34 nodes, connected by 156 (undirected and unweighted) edges. |
|
A variety of graph kernel benchmark datasets, e.g., |
|
A variety of artificially and semi-artificially generated graph datasets from the "Benchmarking Graph Neural Networks" paper. |
|
The citation network datasets |
|
The NELL dataset, a knowledge graph from the "Toward an Architecture for Never-Ending Language Learning" paper. |
|
The full citation network datasets from the "Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking" paper. |
|
Alias for |
|
The Coauthor CS and Coauthor Physics networks from the "Pitfalls of Graph Neural Network Evaluation" paper. |
|
The Amazon Computers and Amazon Photo networks from the "Pitfalls of Graph Neural Network Evaluation" paper. |
|
The protein-protein interaction networks from the "Predicting Multicellular Function through Multi-layer Tissue Networks" paper, containing positional gene sets, motif gene sets and immunological signatures as features (50 in total) and gene ontology sets as labels (121 in total). |
|
The Reddit dataset from the "Inductive Representation Learning on Large Graphs" paper, containing Reddit posts belonging to different communities. |
|
The Reddit dataset from the "GraphSAINT: Graph Sampling Based Inductive Learning Method" paper, containing Reddit posts belonging to different communities. |
|
The Flickr dataset from the "GraphSAINT: Graph Sampling Based Inductive Learning Method" paper, containing descriptions and common properties of images. |
|
The Yelp dataset from the "GraphSAINT: Graph Sampling Based Inductive Learning Method" paper, containing customer reviewers and their friendship. |
|
The Amazon dataset from the "GraphSAINT: Graph Sampling Based Inductive Learning Method" paper, containing products and their categories. |
|
The QM7b dataset from the "MoleculeNet: A Benchmark for Molecular Machine Learning" paper, consisting of 7,211 molecules with 14 regression targets. |
|
The QM9 dataset from the "MoleculeNet: A Benchmark for Molecular Machine Learning" paper, consisting of about 130,000 molecules with 19 regression targets. |
|
A variety of ab-initio molecular dynamics trajectories from the authors of sGDML. |
|
The ZINC dataset from the ZINC database and the "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules" paper, containing about 250,000 molecular graphs with up to 38 heavy atoms. |
|
The AQSOL dataset from the "Benchmarking Graph Neural Networks" paper based on AqSolDB, a standardized database of 9,982 molecular graphs with their aqueous solubility values, collected from 9 different data sources. |
|
The MoleculeNet benchmark collection from the "MoleculeNet: A Benchmark for Molecular Machine Learning" paper, containing datasets from physical chemistry, biophysics and physiology. |
|
The PCQM4Mv2 dataset from the "OGB-LSC: A Large-Scale Challenge for Machine Learning on Graphs" paper. |
|
The relational entities networks |
|
The relational link prediction datasets from the "Modeling Relational Data with Graph Convolutional Networks" paper. |
|
The GED datasets from the "Graph Edit Distance Computation via Graph Neural Networks" paper. |
|
A variety of attributed graph datasets from the "Scaling Attributed Network Embedding to Massive Graphs" paper. |
|
The MNIST superpixels dataset from the "Geometric Deep Learning on Graphs and Manifolds Using Mixture Model CNNs" paper, containing 70,000 graphs with 75 nodes each. |
|
The FAUST humans dataset from the "FAUST: Dataset and Evaluation for 3D Mesh Registration" paper, containing 100 watertight meshes representing 10 different poses for 10 different subjects. |
|
The dynamic FAUST humans dataset from the "Dynamic FAUST: Registering Human Bodies in Motion" paper. |
|
The ShapeNet part level segmentation dataset from the "A Scalable Active Framework for Region Annotation in 3D Shape Collections" paper, containing about 17,000 3D shape point clouds from 16 shape categories. |
|
The ModelNet10/40 datasets from the "3D ShapeNets: A Deep Representation for Volumetric Shapes" paper, containing CAD models of 10 and 40 categories, respectively. |
|
The CoMA 3D faces dataset from the "Generating 3D faces using Convolutional Mesh Autoencoders" paper, containing 20,466 meshes of extreme expressions captured over 12 different subjects. |
|
The SHREC 2016 partial matching dataset from the "SHREC'16: Partial Matching of Deformable Shapes" paper. |
|
The TOSCA dataset from the "Numerical Geometry of Non-Rigid Shapes" book, containing 80 meshes. |
|
The PCPNet dataset from the "PCPNet: Learning Local Shape Properties from Raw Point Clouds" paper, consisting of 30 shapes, each given as a point cloud, densely sampled with 100k points. |
|
The (pre-processed) Stanford Large-Scale 3D Indoor Spaces dataset from the "3D Semantic Parsing of Large-Scale Indoor Spaces" paper, containing point clouds of six large-scale indoor parts in three buildings with 12 semantic elements (and one clutter class). |
|
Synthetic dataset of various geometric shapes like cubes, spheres or pyramids. |
|
The Bitcoin-OTC dataset from the "EvolveGCN: Evolving Graph Convolutional Networks for Dynamic Graphs" paper, consisting of 138 who-trusts-whom networks of sequential time steps. |
|
The (reduced) version of the Global Database of Events, Language, and Tone (GDELT) dataset used in the "Do We Really Need Complicated Model Architectures for Temporal Networks?" paper, consisting of events collected from 2016 to 2020. |
|
The Integrated Crisis Early Warning System (ICEWS) dataset used in, e.g., the "Recurrent Event Network for Reasoning over Temporal Knowledge Graphs" paper, consisting of events collected from 1/1/2018 to 10/31/2018 (24 hours time granularity). |
|
The Global Database of Events, Language, and Tone (GDELT) dataset used in, e.g., the "Recurrent Event Network for Reasoning over Temporal Knowledge Graphs" paper, consisting of events collected from 1/1/2018 to 1/31/2018 (15 minutes time granularity). |
|
The WILLOW-ObjectClass dataset from the "Learning Graphs to Match" paper, containing 10 equal keypoints of at least 40 images in each category. |
|
The Pascal VOC 2011 dataset with Berkeley annotations of keypoints from the "Poselets: Body Part Detectors Trained Using 3D Human Pose Annotations" paper, containing 0 to 23 keypoints per example over 20 categories. |
|
The Pascal-PF dataset from the "Proposal Flow" paper, containing 4 to 16 keypoints per example over 20 categories. |
|
A variety of graph datasets collected from SNAP at Stanford University. |
|
A suite of sparse matrix benchmarks known as the Suite Sparse Matrix Collection collected from a wide range of applications. |
|
The WordNet18 dataset from the "Translating Embeddings for Modeling Multi-Relational Data" paper, containing 40,943 entities, 18 relations and 151,442 fact triplets, e.g., furniture includes bed. |
|
The WordNet18RR dataset from the "Convolutional 2D Knowledge Graph Embeddings" paper, containing 40,943 entities, 11 relations and 93,003 fact triplets. |
|
The FB15K237 dataset from the "Translating Embeddings for Modeling Multi-Relational Data" paper, containing 14,541 entities, 237 relations and 310,116 fact triples. |
|
The semi-supervised Wikipedia-based dataset from the "Wiki-CS: A Wikipedia-Based Benchmark for Graph Neural Networks" paper, containing 11,701 nodes, 216,123 edges, 10 classes and 20 different training splits. |
|
The WebKB datasets used in the "Geom-GCN: Geometric Graph Convolutional Networks" paper. |
|
The Wikipedia networks introduced in the "Multi-scale Attributed Node Embedding" paper. |
|
The heterophilous graphs |
|
The actor-only induced subgraph of the film-director-actor-writer network used in the "Geom-GCN: Geometric Graph Convolutional Networks" paper. |
|
The tree-structured fake news propagation graph classification dataset from the "User Preference-aware Fake News Detection" paper. |
|
The GitHub Web and ML Developers dataset introduced in the "Multi-scale Attributed Node Embedding" paper. |
|
The Facebook Page-Page network dataset introduced in the "Multi-scale Attributed Node Embedding" paper. |
|
The LastFM Asia Network dataset introduced in the "Characteristic Functions on Graphs: Birds of a Feather, from Statistical Descriptors to Parametric Models" paper. |
|
The Deezer Europe dataset introduced in the "Characteristic Functions on Graphs: Birds of a Feather, from Statistical Descriptors to Parametric Models" paper. |
|
The Deezer User Network datasets introduced in the "GEMSEC: Graph Embedding with Self Clustering" paper. |
|
The Twitch Gamer networks introduced in the "Multi-scale Attributed Node Embedding" paper. |
|
The Airports dataset from the "struc2vec: Learning Node Representations from Structural Identity" paper, where nodes denote airports and labels correspond to activity levels. |
|
The "Long Range Graph Benchmark (LRGB)" datasets, a collection of 5 graph learning datasets with tasks that are based on long-range dependencies in graphs. |
|
The MalNet Tiny dataset from the "A Large-Scale Database for Graph Representation Learning" paper. |
|
The Organic Materials Database (OMDB) of bulk organic crystals. |
|
The Political Blogs dataset from the "The Political Blogosphere and the 2004 US Election: Divided they Blog" paper. |
|
An e-mail communication network of a large European research institution, taken from the "Local Higher-order Graph Clustering" paper. |
|
A variety of non-homophilous graph datasets from the "Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods" paper. |
|
The Elliptic Bitcoin dataset of Bitcoin transactions from the "Anti-Money Laundering in Bitcoin: Experimenting with Graph Convolutional Networks for Financial Forensics" paper. |
|
The time-step aware Elliptic Bitcoin dataset of Bitcoin transactions from the "Anti-Money Laundering in Bitcoin: Experimenting with Graph Convolutional Networks for Financial Forensics" paper. |
|
The DGraphFin networks from the "DGraph: A Large-Scale Financial Dataset for Graph Anomaly Detection" paper. |
|
The HydroNet dataset from the "HydroNet: Benchmark Tasks for Preserving Intermolecular Interactions and Structural Motifs in Predictive and Generative Models for Molecular Data" paper, consisting of 5 million water clusters held together by hydrogen bonding networks. |
|
The AirfRANS dataset from the "AirfRANS: High Fidelity Computational Fluid Dynamics Dataset for Approximating Reynolds-Averaged Navier-Stokes Solutions" paper, consisting of 1,000 simulations of steady-state aerodynamics over 2D airfoils in a subsonic flight regime. |
|
The temporal graph datasets from the "JODIE: Predicting Dynamic Embedding Trajectory in Temporal Interaction Networks" paper. |
|
The Wikidata-5M dataset from the "KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation" paper, containing 4,594,485 entities, 822 relations, 20,614,279 train triples, 5,163 validation triples, and 5,133 test triples. |
|
The Myket Android Application Install dataset from the "Effect of Choosing Loss Function when Using T-Batching for Representation Learning on Dynamic Networks" paper. |
|
The breast cancer (BRCA TCGA Pan-Cancer Atlas) dataset consisting of patients with survival information and gene expression data from cBioPortal and a network of biological interactions between those nodes from Pathway Commons. |
|
The NeuroGraph benchmark datasets from the "NeuroGraph: Benchmarks for Graph Machine Learning in Brain Connectomics" paper. |
|
The WebQuestionsSP dataset from the "The Value of Semantic Parse Labeling for Knowledge Base Question Answering" paper.
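Several entries above (for example WordNet18, FB15K237 and Wikidata-5M) describe knowledge graphs in terms of entities, relations and fact triples. As a rough illustration of that structure, the following standard-library sketch maps string triples to integer indices. The helper name `index_triples` and all triples except "furniture includes bed" are hypothetical and are not part of the library.

```python
# Toy knowledge graph as (head, relation, tail) triples.
triples = [
    ("furniture", "includes", "bed"),    # example taken from the WordNet18 entry
    ("furniture", "includes", "chair"),  # hypothetical triple for illustration
    ("bed", "is_a", "furniture"),        # hypothetical triple for illustration
]


def index_triples(triples):
    """Assign consecutive integer ids to entities and relations.

    Hypothetical helper: returns the two vocabularies and the triples
    rewritten as (head_id, relation_id, tail_id) tuples.
    """
    entity_ids, relation_ids = {}, {}
    indexed = []
    for head, rel, tail in triples:
        h = entity_ids.setdefault(head, len(entity_ids))
        r = relation_ids.setdefault(rel, len(relation_ids))
        t = entity_ids.setdefault(tail, len(entity_ids))
        indexed.append((h, r, t))
    return entity_ids, relation_ids, indexed


entity_ids, relation_ids, indexed = index_triples(triples)
print(len(entity_ids), "entities,", len(relation_ids), "relations,", len(indexed), "triples")
```

Counting the two vocabularies and the indexed list gives the same three statistics (entities, relations, triples) that the dataset descriptions report.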
Heterogeneous Datasets
The DBP15K dataset from the "Cross-lingual Entity Alignment via Joint Attribute-Preserving Embedding" paper, where Chinese, Japanese and French versions of DBpedia were linked to its English version. |
|
The heterogeneous AMiner dataset from the "metapath2vec: Scalable Representation Learning for Heterogeneous Networks" paper, consisting of nodes from type |
|
The |
|
A subset of the DBLP computer science bibliography website, as collected in the "MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding" paper. |
|
A heterogeneous rating dataset, assembled by GroupLens Research from the MovieLens web site, consisting of nodes of type |
|
The MovieLens 100K heterogeneous rating dataset, assembled by GroupLens Research from the MovieLens web site, consisting of movies (1,682 nodes) and users (943 nodes) with 100K ratings between them. |
|
The MovieLens 1M heterogeneous rating dataset, assembled by GroupLens Research from the MovieLens web site, consisting of movies (3,883 nodes) and users (6,040 nodes) with approximately 1 million ratings between them. |
|
A subset of the Internet Movie Database (IMDB), as collected in the "MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding" paper. |
|
A subset of the last.fm music website keeping track of users' listening information from various sources, as collected in the "MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding" paper. |
|
A variety of heterogeneous graph benchmark datasets from the "Are We Really Making Much Progress? Revisiting, Benchmarking, and Refining Heterogeneous Graph Neural Networks" paper. |
|
Taobao is a dataset of user behaviors from Taobao offered by Alibaba, provided by the Tianchi Alicloud platform. |
|
The user-item heterogeneous rating datasets |
|
A subset of the AmazonBook rating dataset from the "LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation" paper. |
|
The heterogeneous H&M dataset from the Kaggle H&M Personalized Fashion Recommendations challenge. |
|
A dataset describing the product ecology of Open Source Ecology's iconoclastic Global Village Construction Set. |
|
The risk commodity detection dataset (RCDD) from the "Datasets and Interfaces for Benchmarking Heterogeneous Graph Neural Networks" paper. |
|
The heterogeneous OPF data from the "Large-scale Datasets for AC Optimal Power Flow with Topological Perturbations" paper. |
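To make the notion of a heterogeneous rating graph more concrete, here is a minimal standard-library sketch that stores node counts per node type and computes how sparse a user-movie rating graph is. The node counts come from the MovieLens 100K entry above; treating "100K ratings" as exactly 100,000 edges, the dictionary layout, and the helper name `bipartite_density` are assumptions made for this example, not the library's data format.

```python
# Node counts per node type, as listed for MovieLens 100K above.
num_nodes = {"user": 943, "movie": 1682}

# Edge counts keyed by (source type, relation, destination type).
# The relation name "rates" and the exact count are assumptions.
num_edges = {("user", "rates", "movie"): 100_000}


def bipartite_density(num_src, num_dst, edge_count):
    """Fraction of all possible source-destination pairs that are connected."""
    return edge_count / (num_src * num_dst)


density = bipartite_density(
    num_nodes["user"],
    num_nodes["movie"],
    num_edges[("user", "rates", "movie")],
)
print(f"rating density: {density:.4f}")
```

Under these assumptions only about 6% of all possible user-movie pairs carry a rating, which is typical of the sparsity found in recommendation datasets.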
Hypergraph Datasets
A collection of temporal higher-order network datasets from the "Simplicial Closure and higher-order link prediction" paper. |
Synthetic Datasets
A fake dataset that returns randomly generated |
|
A fake dataset that returns randomly generated |
|
A synthetic graph dataset generated by the stochastic block model. |
|
The random partition graph dataset from the "How to Find Your Friendly Neighborhood: Graph Attention Design with Self-Supervision" paper. |
|
The MixHop synthetic dataset from the "MixHop: Higher-Order Graph Convolutional Architectures via Sparsified Neighborhood Mixing" paper, containing 10 graphs, each with a varying degree of homophily (ranging from 0.0 to 0.9). |
|
Generates a synthetic dataset for evaluating explainability algorithms, as described in the "GNNExplainer: Generating Explanations for Graph Neural Networks" paper. |
|
Generates a synthetic infection dataset for evaluating explainability algorithms, as described in the "Explainability Techniques for Graph Convolutional Networks" paper. |
|
The synthetic BA-2motifs graph classification dataset for evaluating explainability algorithms, as described in the "Parameterized Explainer for Graph Neural Network" paper. |
|
The synthetic BA-Multi-Shapes graph classification dataset for evaluating explainability algorithms, as described in the "Global Explainability of GNNs via Logic Combination of Learned Concepts" paper. |
|
The BA-Shapes dataset from the "GNNExplainer: Generating Explanations for Graph Neural Networks" paper, containing a Barabasi-Albert (BA) graph with 300 nodes and a set of 80 "house"-structured graphs connected to it. |
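As an illustration of the stochastic block model mentioned above, the following standard-library sketch samples an undirected graph in which the probability of an edge depends only on the blocks of its two endpoints. The function name `stochastic_block_model` and its parameters are hypothetical and chosen for this example; the library's dataset class may expose a different interface and sampling procedure.

```python
import random


def stochastic_block_model(block_sizes, edge_probs, seed=None):
    """Sample an undirected graph from a stochastic block model.

    block_sizes[b] is the number of nodes in block b, and
    edge_probs[a][b] is the probability of an edge between a node
    in block a and a node in block b (assumed symmetric).
    Returns per-node block labels and a list of (u, v) edges with u < v.
    """
    rng = random.Random(seed)
    labels = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(labels)
    edges = [
        (u, v)
        for u in range(n)
        for v in range(u + 1, n)
        if rng.random() < edge_probs[labels[u]][labels[v]]
    ]
    return labels, edges


# Two blocks that are fully connected internally and never connected to each other.
labels, edges = stochastic_block_model([3, 2], [[1.0, 0.0], [0.0, 1.0]], seed=0)
print(labels, edges)
```

With within-block probabilities well above the between-block ones, the sampled graph shows the community structure that such synthetic datasets are meant to provide.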
Graph Generators
An abstract base class for generating synthetic graphs. |
|
Generates random Barabasi-Albert (BA) graphs. |
|
Generates random Erdos-Renyi (ER) graphs. |
|
Generates two-dimensional grid graphs. |
|
Generates tree graphs. |
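As a rough illustration of what the Erdos-Renyi and grid generators above produce, here is a minimal standard-library sketch. The function names `erdos_renyi_edges` and `grid_edges`, their parameters, and the 4-neighbour grid connectivity are assumptions made for this example, not the library's API; the actual generator classes may use different arguments and a different neighbourhood.

```python
import random


def erdos_renyi_edges(num_nodes, edge_prob, seed=None):
    """Sample an undirected Erdos-Renyi graph.

    Each of the num_nodes * (num_nodes - 1) / 2 possible edges is kept
    independently with probability edge_prob.
    """
    rng = random.Random(seed)
    return [
        (u, v)
        for u in range(num_nodes)
        for v in range(u + 1, num_nodes)
        if rng.random() < edge_prob
    ]


def grid_edges(height, width):
    """Build a two-dimensional grid graph with 4-neighbour connectivity.

    Nodes are numbered row by row; each node is linked to its right
    and lower neighbour, so every undirected edge appears once.
    """
    edges = []
    for row in range(height):
        for col in range(width):
            node = row * width + col
            if col + 1 < width:
                edges.append((node, node + 1))
            if row + 1 < height:
                edges.append((node, node + width))
    return edges


er = erdos_renyi_edges(num_nodes=10, edge_prob=0.3, seed=0)
grid = grid_edges(height=3, width=4)
print(len(er), "random edges;", len(grid), "grid edges")
```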
Motif Generators
An abstract base class for generating a motif. |
|
Generates a motif based on a custom structure coming from a |
|
Generates the house-structured motif from the "GNNExplainer: Generating Explanations for Graph Neural Networks" paper, containing 5 nodes and 6 undirected edges. |
|
Generates the cycle motif from the "GNNExplainer: Generating Explanations for Graph Neural Networks" paper. |
|
Generates the grid-structured motif from the "GNNExplainer: Generating Explanations for Graph Neural Networks" paper. |
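The house motif described above has 5 nodes and 6 undirected edges. A minimal standard-library sketch of such a motif, together with a simple cycle motif, could look as follows. The node numbering, the constant `HOUSE_EDGES`, and the helpers `cycle_edges` and `to_directed_pairs` are hypothetical; the library's motif generators may lay out and store their graphs differently.

```python
# Assumed layout: nodes 0-3 form the square "walls", node 4 is the "roof".
HOUSE_EDGES = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 4), (1, 4)]


def cycle_edges(num_nodes):
    """Undirected edges of a cycle over num_nodes nodes (num_nodes >= 3)."""
    return [(i, (i + 1) % num_nodes) for i in range(num_nodes)]


def to_directed_pairs(edges):
    """Store every undirected edge in both directions, as sorted (u, v) pairs."""
    both = {(u, v) for u, v in edges} | {(v, u) for u, v in edges}
    return sorted(both)


house_nodes = {node for edge in HOUSE_EDGES for node in edge}
house_directed = to_directed_pairs(HOUSE_EDGES)
print(len(house_nodes), "nodes,", len(HOUSE_EDGES), "undirected edges,",
      len(house_directed), "directed pairs")
```

Storing each undirected edge in both directions doubles the count, which is why an undirected graph can be reported with twice as many edges as it has node pairs.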