torch_geometric.datasets.GraphLandDataset
- class GraphLandDataset(root: str, name: str, split: str, numerical_features_transform: Optional[str] = 'default', fraction_features_transform: Optional[str] = 'default', categorical_features_transform: Optional[str] = 'one_hot_encoding', regression_targets_transform: Optional[str] = 'default', numerical_features_nan_imputation_strategy: Optional[str] = 'most_frequent', fraction_features_nan_imputation_strategy: Optional[str] = 'most_frequent', to_undirected: bool = True, transform: Optional[Callable] = None, pre_transform: Optional[Callable] = None, force_reload: bool = False)
Bases: InMemoryDataset
The graph datasets from the “GraphLand: Evaluating Graph Machine Learning Models on Diverse Industrial Data” paper.
- Parameters:
root (str) – Root directory where the dataset should be saved.
name (str) – The name of the dataset ("hm-categories", "pokec-regions", "web-topics", "tolokers-2", "city-reviews", "artnet-exp", "web-fraud", "hm-prices", "avazu-ctr", "city-roads-M", "city-roads-L", "twitch-views", "artnet-views", "web-traffic").
split (str) – The type of dataset split/setting ("RL", "RH", "TH", "THI"). "RL" is the “random low” split: a 10%/10%/80% random stratified train/val/test split. "RH" is the “random high” split: a 50%/25%/25% random stratified train/val/test split. "TH" is the “temporal high” split: a 50%/25%/25% temporal train/val/test split. "THI" is the “temporal high” split in the inductive setting, in which the graph evolves over time, so val and test nodes are not seen at train time and test nodes are not seen at val time. The "RL", "RH", and "TH" splits correspond to the transductive setting and thus return a dataset with a single graph and three masks (for train, val, and test nodes). In contrast, the "THI" split corresponds to the inductive setting and thus returns a dataset with three graphs (a train graph, a val graph, and a test graph), which are three snapshots of an evolving network captured at different timestamps. Each of the three graphs has a mask specifying which nodes should be used for training (in the train graph) or evaluation (in the val and test graphs). The "TH" and "THI" splits are not available for the following datasets: "city-reviews", "city-roads-M", "city-roads-L", "web-traffic".
numerical_features_transform (str, optional) – A transform applied to numerical features (None, "standard_scaler", "min_max_scaler", "quantile_transform_normal", "quantile_transform_uniform", "default"). Since numerical features can have widely different scales and distributions, it is typically useful to apply a transform to them before passing them to a neural model. This transform is applied to all numerical features except those that are also categorized as fraction features. The "default" value selects a dataset-specific transform from the other options that was determined to be a safe and likely optimal choice for this dataset based on experiments with various GNNs. (default: "default")
fraction_features_transform (str, optional) – A transform applied to fraction features (None, "standard_scaler", "min_max_scaler", "quantile_transform_normal", "quantile_transform_uniform", "default"). Fraction features are a subset of numerical features that represent fractions and thus always lie in the [0, 1] range. Since their range is bounded, applying a transform to them before passing them to a neural model is not necessary but may still be useful. The "default" value selects a dataset-specific transform from the other options that was determined to be a safe and likely optimal choice for this dataset based on experiments with various GNNs. (default: "default")
categorical_features_transform (str, optional) – A transform applied to categorical features (None, "one_hot_encoding"). It is most often useful to apply one-hot encoding to categorical features before passing them to a neural model. (default: "one_hot_encoding")
regression_targets_transform (str, optional) – A transform applied to regression targets (None, "standard_scaler", "min_max_scaler", "default"). Depending on their range, it may or may not be useful to apply a transform to regression targets before fitting a neural model to them. The "default" value selects a dataset-specific transform from the other options that was determined to be a safe and likely optimal choice for this dataset based on experiments with various GNNs. This argument does not affect classification datasets. (default: "default")
numerical_features_nan_imputation_strategy (str, optional) – Defines which value is used to fill NaNs in numerical features (None, "mean", "median", "most_frequent"). This imputation strategy is applied to all numerical features except those that are also categorized as fraction features. (default: "most_frequent")
fraction_features_nan_imputation_strategy (str, optional) – Defines which value is used to fill NaNs in fraction features (None, "mean", "median", "most_frequent"). (default: "most_frequent")
to_undirected (bool, optional) – Whether to convert a directed graph to an undirected one. Does not affect undirected graphs. (default: True)
transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
force_reload (bool, optional) – Whether to re-process the dataset. (default: False)
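The constraints on name and split described in the parameter list can be checked before any data is downloaded. The following is a minimal sketch, not part of the library: the helpers check_name_and_split, expected_num_graphs, and load_graphland are hypothetical names introduced here for illustration, and the directory passed as root is up to the caller. Only the constructor arguments root, name, and split come from the signature above.

```python
# Dataset names and splits as listed in the parameter descriptions.
DATASET_NAMES = (
    "hm-categories", "pokec-regions", "web-topics", "tolokers-2",
    "city-reviews", "artnet-exp", "web-fraud", "hm-prices", "avazu-ctr",
    "city-roads-M", "city-roads-L", "twitch-views", "artnet-views",
    "web-traffic",
)
SPLITS = ("RL", "RH", "TH", "THI")

# Datasets for which the temporal splits "TH" and "THI" are not available.
NO_TEMPORAL_SPLITS = frozenset(
    {"city-reviews", "city-roads-M", "city-roads-L", "web-traffic"}
)


def check_name_and_split(name: str, split: str) -> None:
    """Hypothetical helper: raise ValueError for an unsupported combination."""
    if name not in DATASET_NAMES:
        raise ValueError(f"Unknown dataset name: {name!r}")
    if split not in SPLITS:
        raise ValueError(f"Unknown split: {split!r}")
    if split in ("TH", "THI") and name in NO_TEMPORAL_SPLITS:
        raise ValueError(f"Split {split!r} is not available for {name!r}")


def expected_num_graphs(split: str) -> int:
    """Transductive splits give one graph; the inductive "THI" split gives three."""
    return 3 if split == "THI" else 1


def load_graphland(root: str, name: str, split: str):
    """Hypothetical wrapper: validate the arguments, then build the dataset."""
    check_name_and_split(name, split)
    # Imported lazily so the helpers above also work without PyTorch Geometric.
    from torch_geometric.datasets import GraphLandDataset
    return GraphLandDataset(root=root, name=name, split=split)
```

With PyTorch Geometric installed, a call such as load_graphland("data/graphland", "tolokers-2", "RL") would be expected to return a dataset holding a single graph, while the "THI" split would be expected to hold three snapshot graphs, in line with the description of the split parameter.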
STATS:
Name            #nodes      #edges       is directed  task
hm-categories   46,563      21,461,990   False        multiclass
pokec-regions   1,632,803   30,622,564   True         multiclass
web-topics      2,890,331   12,895,369   True         multiclass
tolokers-2      11,758      1,038,000    False        binclass
city-reviews    148,801     2,330,830    False        binclass
artnet-exp      50,405      560,696      False        binclass
web-fraud       2,890,331   12,895,369   True         binclass
hm-prices       46,563      21,461,990   False        regression
avazu-ctr       76,269      21,968,154   False        regression
city-roads-M    57,073      132,571      True         regression
city-roads-L    142,257     279,062      True         regression
twitch-views    168,114     13,595,114   False        regression
artnet-views    50,405      560,696      False        regression
web-traffic     2,890,331   12,895,369   True         regression
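For picking a dataset programmatically, the statistics above can be kept in a small lookup table. The sketch below is illustrative only: the STATS dictionary and the helpers datasets_for_task and same_size_groups are hypothetical names, and the numbers are copied from the table rather than read from the library.

```python
# (#nodes, #edges, is_directed, task) per dataset, copied from the STATS table.
STATS = {
    "hm-categories": (46_563, 21_461_990, False, "multiclass"),
    "pokec-regions": (1_632_803, 30_622_564, True, "multiclass"),
    "web-topics": (2_890_331, 12_895_369, True, "multiclass"),
    "tolokers-2": (11_758, 1_038_000, False, "binclass"),
    "city-reviews": (148_801, 2_330_830, False, "binclass"),
    "artnet-exp": (50_405, 560_696, False, "binclass"),
    "web-fraud": (2_890_331, 12_895_369, True, "binclass"),
    "hm-prices": (46_563, 21_461_990, False, "regression"),
    "avazu-ctr": (76_269, 21_968_154, False, "regression"),
    "city-roads-M": (57_073, 132_571, True, "regression"),
    "city-roads-L": (142_257, 279_062, True, "regression"),
    "twitch-views": (168_114, 13_595_114, False, "regression"),
    "artnet-views": (50_405, 560_696, False, "regression"),
    "web-traffic": (2_890_331, 12_895_369, True, "regression"),
}


def datasets_for_task(task: str) -> list[str]:
    """Return dataset names with the given task, in table order."""
    return [name for name, stats in STATS.items() if stats[3] == task]


def same_size_groups() -> list[list[str]]:
    """Group datasets whose node and edge counts are identical in the table."""
    groups: dict[tuple[int, int], list[str]] = {}
    for name, (num_nodes, num_edges, _, _) in STATS.items():
        groups.setdefault((num_nodes, num_edges), []).append(name)
    # Keep only sizes that occur for more than one dataset.
    return [names for names in groups.values() if len(names) > 1]
```

Here datasets_for_task("regression") lists the seven regression datasets, and same_size_groups() surfaces rows with matching node and edge counts, such as "hm-categories" and "hm-prices". Matching counts are only an observation about the table; the table itself does not state how the datasets relate to one another.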