torch_geometric.datasets.GraphLandDataset

class GraphLandDataset(root: str, name: str, split: str, numerical_features_transform: Optional[str] = 'default', fraction_features_transform: Optional[str] = 'default', categorical_features_transform: Optional[str] = 'one_hot_encoding', regression_targets_transform: Optional[str] = 'default', numerical_features_nan_imputation_strategy: Optional[str] = 'most_frequent', fraction_features_nan_imputation_strategy: Optional[str] = 'most_frequent', to_undirected: bool = True, transform: Optional[Callable] = None, pre_transform: Optional[Callable] = None, force_reload: bool = False)

Bases: InMemoryDataset

The graph datasets from the “GraphLand: Evaluating Graph Machine Learning Models on Diverse Industrial Data” paper.

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • name (str) – The name of the dataset ("hm-categories", "pokec-regions", "web-topics", "tolokers-2", "city-reviews", "artnet-exp", "web-fraud", "hm-prices", "avazu-ctr", "city-roads-M", "city-roads-L", "twitch-views", "artnet-views", "web-traffic").

  • split (str) – The type of dataset split/setting ("RL", "RH", "TH", "THI"). "RL" is the “random low” split: a 10%/10%/80% random stratified train/val/test split. "RH" is the “random high” split: a 50%/25%/25% random stratified train/val/test split. "TH" is the “temporal high” split: a 50%/25%/25% temporal train/val/test split. "THI" is the “temporal high” split in the inductive setting: the graph evolves over time, so val and test nodes are not seen at train time, and test nodes are not seen at val time. The "RL", "RH", and "TH" splits correspond to the transductive setting and return a dataset with a single graph and three masks (for train, val, and test nodes). In contrast, the "THI" split corresponds to the inductive setting and returns a dataset with three graphs (a train graph, a val graph, and a test graph), which are snapshots of an evolving network captured at different timestamps. Each of the three graphs has a mask specifying which nodes should be used for training (in the train graph) or evaluation (in the val and test graphs). The "TH" and "THI" splits are not available for the following datasets: "city-reviews", "city-roads-M", "city-roads-L", "web-traffic".

  • numerical_features_transform (str, optional) – A transform applied to numerical features (None, "standard_scaler", "min_max_scaler", "quantile_transform_normal", "quantile_transform_uniform", "default"). Since numerical features can have widely different scales and distributions, it is typically useful to apply some transform to them before passing them to a neural model. This transform is applied to all numerical features except for those that are also categorized as fraction features. The "default" value selects a dataset-specific transform from the other options that was determined to be a safe and likely optimal choice for this dataset based on experiments with various GNNs. (default: "default")

  • fraction_features_transform (str, optional) – A transform applied to fraction features (None, "standard_scaler", "min_max_scaler", "quantile_transform_normal", "quantile_transform_uniform", "default"). Fraction features are a subset of numerical features that represent fractions and are thus always in the [0, 1] range. Since their range is bounded, it is not necessary, but may still be useful, to apply some transform to them before passing them to a neural model. The "default" value selects a dataset-specific transform from the other options that was determined to be a safe and likely optimal choice for this dataset based on experiments with various GNNs. (default: "default")

  • categorical_features_transform (str, optional) – A transform applied to categorical features (None, "one_hot_encoding"). It is most often useful to apply one-hot encoding to categorical features before passing them to a neural model. (default: "one_hot_encoding")

  • regression_targets_transform (str, optional) – A transform applied to regression targets (None, "standard_scaler", "min_max_scaler", "default"). Depending on their range, it may or may not be useful to apply a transform to regression targets before fitting a neural model to them. The "default" value selects a dataset-specific transform from the other options that was determined to be a safe and likely optimal choice for this dataset based on experiments with various GNNs. This argument does not affect classification datasets. (default: "default")

  • numerical_features_nan_imputation_strategy (str, optional) – Defines which value to fill NaNs in numerical features with (None, "mean", "median", "most_frequent"). This imputation strategy is applied to all numerical features except for those that are also categorized as fraction features. (default: "most_frequent")

  • fraction_features_nan_imputation_strategy (str, optional) – Defines which value to fill NaNs in fraction features with (None, "mean", "median", "most_frequent"). (default: "most_frequent")

  • to_undirected (bool, optional) – Whether to convert a directed graph to an undirected one. Does not affect undirected graphs. (default: True)

  • transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • force_reload (bool, optional) – Whether to re-process the dataset. (default: False)
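To make the imputation and scaling options above more concrete, the following is a minimal pure-Python sketch of what the "mean", "median", and "most_frequent" strategies and a min-max scaler mean for a single feature column. It is an illustration only, not the code GraphLandDataset runs internally; the helper names impute_nans and min_max_scale are hypothetical, and the handling of ties and constant columns here is an assumption.

```python
import math
import statistics


def impute_nans(column, strategy="most_frequent"):
    """Fill NaNs in one feature column (hypothetical helper, illustration only)."""
    if strategy is None:
        # No imputation: NaNs are left in place.
        return list(column)
    observed = [v for v in column if not math.isnan(v)]
    if strategy == "mean":
        fill = statistics.fmean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "most_frequent":
        # Assumption: ties are resolved by taking the first mode encountered.
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"Unknown imputation strategy: {strategy!r}")
    return [fill if math.isnan(v) else v for v in column]


def min_max_scale(column):
    """Rescale a NaN-free column linearly to the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant column is mapped to all zeros.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Toy feature column with one missing value.
raw = [1.0, 1.0, 3.0, float("nan"), 7.0, 8.0]

filled_mean = impute_nans(raw, "mean")
filled_median = impute_nans(raw, "median")
filled_mode = impute_nans(raw, "most_frequent")
scaled = min_max_scale(filled_mean)
```

On this toy column the three strategies pick different fill values, which is why the choice can matter for skewed features.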

STATS:

Name            #nodes      #edges       is directed   task
hm-categories   46,563      21,461,990   False         multiclass
pokec-regions   1,632,803   30,622,564   True          multiclass
web-topics      2,890,331   12,895,369   True          multiclass
tolokers-2      11,758      1,038,000    False         binclass
city-reviews    148,801     2,330,830    False         binclass
artnet-exp      50,405      560,696      False         binclass
web-fraud       2,890,331   12,895,369   True          binclass
hm-prices       46,563      21,461,990   False         regression
avazu-ctr       76,269      21,968,154   False         regression
city-roads-M    57,073      132,571      True          regression
city-roads-L    142,257     279,062      True          regression
twitch-views    168,114     13,595,114   False         regression
artnet-views    50,405      560,696      False         regression
web-traffic     2,890,331   12,895,369   True          regression
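
As a usage sketch, the dataset names and tasks listed above can be combined with the split restrictions described for the split parameter to validate a configuration before any download starts. The helpers check_config and load_graphland and the constants below are hypothetical and not part of torch_geometric. Only the GraphLandDataset constructor arguments are taken from the signature above. The import is deferred so that the validation step runs without PyTorch Geometric installed.

```python
# Task per dataset, copied from the statistics table above.
TASKS = {
    "hm-categories": "multiclass", "pokec-regions": "multiclass",
    "web-topics": "multiclass", "tolokers-2": "binclass",
    "city-reviews": "binclass", "artnet-exp": "binclass",
    "web-fraud": "binclass", "hm-prices": "regression",
    "avazu-ctr": "regression", "city-roads-M": "regression",
    "city-roads-L": "regression", "twitch-views": "regression",
    "artnet-views": "regression", "web-traffic": "regression",
}
SPLITS = ("RL", "RH", "TH", "THI")
# Datasets for which the temporal splits "TH" and "THI" are not available.
NO_TEMPORAL_SPLITS = {"city-reviews", "city-roads-M", "city-roads-L", "web-traffic"}


def check_config(name, split):
    """Validate a (name, split) pair (hypothetical helper, not library API)."""
    if name not in TASKS:
        raise ValueError(f"Unknown dataset name: {name!r}")
    if split not in SPLITS:
        raise ValueError(f"Unknown split: {split!r}")
    if split in ("TH", "THI") and name in NO_TEMPORAL_SPLITS:
        raise ValueError(f"Split {split!r} is not available for {name!r}")
    inductive = split == "THI"
    return {
        "task": TASKS[name],
        "inductive": inductive,
        # "THI" yields train/val/test graphs; the other splits yield one graph.
        "num_graphs": 3 if inductive else 1,
    }


def load_graphland(root, name, split, **kwargs):
    """Validate the configuration, then construct the dataset (hypothetical wrapper)."""
    check_config(name, split)
    # Deferred import: requires a PyTorch Geometric version that ships this class.
    from torch_geometric.datasets import GraphLandDataset
    return GraphLandDataset(root=root, name=name, split=split, **kwargs)


info = check_config("tolokers-2", "RL")
```

A call such as load_graphland("data/graphland", "hm-prices", "THI", regression_targets_transform="standard_scaler") would then download and process the data on first use; the root path here is only a placeholder.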