torch_geometric.datasets.GraphLandDataset

class GraphLandDataset(root: str, name: str, split: str, numerical_features_transform: Optional[str] = 'default', fraction_features_transform: Optional[str] = 'default', categorical_features_transform: Optional[str] = 'one_hot_encoding', regression_targets_transform: Optional[str] = 'default', numerical_features_nan_imputation_strategy: Optional[str] = 'most_frequent', fraction_features_nan_imputation_strategy: Optional[str] = 'most_frequent', to_undirected: bool = True, transform: Optional[Callable] = None, pre_transform: Optional[Callable] = None, force_reload: bool = False)

Bases: InMemoryDataset

The graph datasets from the “GraphLand: Evaluating Graph Machine Learning Models on Diverse Industrial Data” paper.

Parameters:

root (str) – Root directory where the dataset should be saved.
name (str) – The name of the dataset ("hm-categories", "pokec-regions", "web-topics", "tolokers-2", "city-reviews", "artnet-exp", "web-fraud", "hm-prices", "avazu-ctr", "city-roads-M", "city-roads-L", "twitch-views", "artnet-views", "web-traffic").
split (str) – The type of dataset split/setting ("RL", "RH", "TH", "THI"). "RL" is the “random low” split: a 10%/10%/80% random stratified train/val/test split. "RH" is the “random high” split: a 50%/25%/25% random stratified train/val/test split. "TH" is the “temporal high” split: a 50%/25%/25% temporal train/val/test split. "THI" is the “temporal high” split in the inductive setting: the graph evolves over time, so val and test nodes are not seen at train time, and test nodes are not seen at val time. The "RL", "RH", and "TH" splits correspond to the transductive setting and return a dataset with a single graph and three masks (for train, val, and test nodes). The "THI" split corresponds to the inductive setting and returns a dataset with three graphs (a train graph, a val graph, and a test graph), which are snapshots of an evolving network captured at different timestamps. Each of the three graphs has a mask specifying which nodes should be used for training (in the train graph) or evaluation (in the val and test graphs). The "TH" and "THI" splits are not available for the following datasets: "city-reviews", "city-roads-M", "city-roads-L", "web-traffic".
numerical_features_transform (str, optional) – A transform applied to numerical features (None, "standard_scaler", "min_max_scaler", "quantile_transform_normal", "quantile_transform_uniform", "default"). Since numerical features can have widely different scales and distributions, it is typically useful to apply some transform to them before passing them to a neural model. This transform is applied to all numerical features except for those that are also categorized as fraction features. The "default" value selects a dataset-specific transform from the other options that was determined to be a safe and likely optimal choice for this dataset based on experiments with various GNNs. (default: "default")
fraction_features_transform (str, optional) – A transform applied to fraction features (None, "standard_scaler", "min_max_scaler", "quantile_transform_normal", "quantile_transform_uniform", "default"). Fraction features are a subset of numerical features that represent fractions and are thus always in the [0, 1] range. Since their range is bounded, it is not necessary but may still be useful to apply some transform to them before passing them to a neural model. The "default" value selects a dataset-specific transform from the other options that was determined to be a safe and likely optimal choice for this dataset based on experiments with various GNNs. (default: "default")
categorical_features_transform (str, optional) – A transform applied to categorical features (None, "one_hot_encoding"). It is most often useful to apply one-hot encoding to categorical features before passing them to a neural model. (default: "one_hot_encoding")
regression_targets_transform (str, optional) – A transform applied to regression targets (None, "standard_scaler", "min_max_scaler", "default"). Depending on their range, it may or may not be useful to apply a transform to regression targets before fitting a neural model to them. The "default" value selects a dataset-specific transform from the other options that was determined to be a safe and likely optimal choice for this dataset based on experiments with various GNNs. This argument does not affect classification datasets. (default: "default")
numerical_features_nan_imputation_strategy (str, optional) – The strategy used to fill NaNs in numerical features (None, "mean", "median", "most_frequent"). This imputation strategy is applied to all numerical features except for those that are also categorized as fraction features. (default: "most_frequent")
fraction_features_nan_imputation_strategy (str, optional) – The strategy used to fill NaNs in fraction features (None, "mean", "median", "most_frequent"). (default: "most_frequent")
to_undirected (bool, optional) – Whether to convert a directed graph to an undirected one. Does not affect undirected graphs. (default: True)
transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
force_reload (bool, optional) – Whether to re-process the dataset. (default: False)
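
The split rules above can be sketched as a small pre-check followed by a load. This is a minimal illustration, not part of the library: `check_split` and `load_graphland` are hypothetical helper names, and the sketch assumes that indexing the dataset yields its graphs in the train, val, test order in which they are listed above.

```python
# Split codes and dataset names are copied from the parameter list above.
VALID_SPLITS = {"RL", "RH", "TH", "THI"}
TEMPORAL_SPLITS = {"TH", "THI"}
NO_TEMPORAL_SPLIT = {"city-reviews", "city-roads-M", "city-roads-L", "web-traffic"}


def check_split(name: str, split: str) -> str:
    """Return the setting implied by `split`, or raise if it is unavailable.

    Hypothetical helper for illustration; not a torch_geometric function.
    """
    if split not in VALID_SPLITS:
        raise ValueError(f"unknown split {split!r}")
    if split in TEMPORAL_SPLITS and name in NO_TEMPORAL_SPLIT:
        raise ValueError(f"{name!r} has no temporal split; use 'RL' or 'RH'")
    # Only "THI" is inductive; "RL", "RH", and "TH" are transductive.
    return "inductive" if split == "THI" else "transductive"


def load_graphland(root: str, name: str, split: str):
    """Validate the request, then load the dataset and return its graphs as a list."""
    check_split(name, split)
    # Imported lazily so check_split works without torch_geometric installed.
    from torch_geometric.datasets import GraphLandDataset

    dataset = GraphLandDataset(root=root, name=name, split=split)
    # Per the description above: one graph for transductive splits,
    # three snapshot graphs for "THI" (order assumed: train, val, test).
    return [dataset[i] for i in range(len(dataset))]
```

For example, `load_graphland("data/GraphLand", "tolokers-2", "RL")` would download the data into the (arbitrarily chosen) `data/GraphLand` directory and return a one-element list, while `check_split("web-traffic", "TH")` raises before any download is attempted.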

STATS:

Name	#nodes	#edges	is directed	task
`hm-categories`	46,563	21,461,990	False	multiclass
`pokec-regions`	1,632,803	30,622,564	True	multiclass
`web-topics`	2,890,331	12,895,369	True	multiclass
`tolokers-2`	11,758	1,038,000	False	binclass
`city-reviews`	148,801	2,330,830	False	binclass
`artnet-exp`	50,405	560,696	False	binclass
`web-fraud`	2,890,331	12,895,369	True	binclass
`hm-prices`	46,563	21,461,990	False	regression
`avazu-ctr`	76,269	21,968,154	False	regression
`city-roads-M`	57,073	132,571	True	regression
`city-roads-L`	142,257	279,062	True	regression
`twitch-views`	168,114	13,595,114	False	regression
`artnet-views`	50,405	560,696	False	regression
`web-traffic`	2,890,331	12,895,369	True	regression
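
Because `regression_targets_transform` does not affect classification datasets, it can help to look up a dataset's task before setting that argument. The sketch below copies the task column of the table into a dictionary; the helper names `datasets_for_task` and `regression_targets_transform_applies` are hypothetical and exist only for this illustration.

```python
# Task type per dataset, copied from the table above.
TASKS = {
    "hm-categories": "multiclass",
    "pokec-regions": "multiclass",
    "web-topics": "multiclass",
    "tolokers-2": "binclass",
    "city-reviews": "binclass",
    "artnet-exp": "binclass",
    "web-fraud": "binclass",
    "hm-prices": "regression",
    "avazu-ctr": "regression",
    "city-roads-M": "regression",
    "city-roads-L": "regression",
    "twitch-views": "regression",
    "artnet-views": "regression",
    "web-traffic": "regression",
}


def datasets_for_task(task: str) -> list:
    """List dataset names with the given task, in table order."""
    return [name for name, t in TASKS.items() if t == task]


def regression_targets_transform_applies(name: str) -> bool:
    """True if `regression_targets_transform` has an effect for this dataset."""
    return TASKS[name] == "regression"
```

With this lookup, `regression_targets_transform_applies("hm-prices")` is true, while for `"hm-categories"`, which shares the same graph size but is a multiclass task, it is false.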