torch_geometric.datasets.TAGDataset

class TAGDataset(root: str, dataset: InMemoryDataset, tokenizer_name: str, text: Optional[List[str]] = None, split_idx: Optional[Dict[str, Tensor]] = None, tokenize_batch_size: int = 256, token_on_disk: bool = False, text_on_disk: bool = False, force_reload: bool = False)[source]

Bases: InMemoryDataset

The Text Attributed Graph datasets from the “Learning on Large-scale Text-attributed Graphs via Variational Inference” paper. This dataset transforms ogbn-products and ogbn-arxiv into Text Attributed Graphs in which each node is associated with a raw text. The resulting dataset can be used with a DataLoader (for LM training) and a NeighborLoader (for GNN training). In addition, this class can be used as a wrapper that converts an InMemoryDataset, together with a tokenizer and raw texts, into a Text Attributed Graph.

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • dataset (InMemoryDataset) – The underlying graph dataset to wrap (e.g. "ogbn-products", "ogbn-arxiv").

  • tokenizer_name (str) – The tokenizer name for the language model. Be sure to use the same tokenizer name as the model id of your model repository on huggingface.co.

  • text (Optional[List[str]]) – A list of raw texts, one per node. The order of the list must align with the node ordering of the graph. (default: None)

  • split_idx (Optional[Dict[str, torch.Tensor]]) – An optional dictionary holding the split indices. It is required if your dataset does not provide a get_split_idx function. (default: None)

  • tokenize_batch_size (int, optional) – The batch size used when tokenizing the text; the tokenizing process runs on the CPU. (default: 256)

  • token_on_disk (bool, optional) – If set to True, saves the tokens as a .pt file on disk. (default: False)

  • text_on_disk (bool, optional) – If set to True, saves the given text (list of str) as a dataframe on disk. (default: False)

  • force_reload (bool, optional) – Whether to re-process the dataset. (default: False)

Note

See example/llm_plus_gnn/glem.py for example usage
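As a minimal sketch of the wrapper usage described above (assumed setup: the ogb package is installed, the dataset downloads on first use, and the tokenizer name "bert-base-uncased" is only illustrative — use whatever matches your language model):

```python
from ogb.nodeproppred import PygNodePropPredDataset
from torch_geometric.datasets import TAGDataset

# Load the underlying graph dataset (downloads on first use).
dataset = PygNodePropPredDataset('ogbn-arxiv', root='./data')
split_idx = dataset.get_idx_split()

# Placeholder text: in practice, load one raw text per node
# (e.g. paper titles/abstracts), aligned with the node ordering.
text = ['dummy text'] * dataset[0].num_nodes

# Wrap the graph dataset into a Text Attributed Graph.
tag_dataset = TAGDataset(
    root='./data',
    dataset=dataset,
    tokenizer_name='bert-base-uncased',  # must match your LM's model id
    text=text,
    split_idx=split_idx,
    tokenize_batch_size=256,
)
```

The wrapped object can then be iterated with a DataLoader over the tokenized node texts for LM training, or passed to a NeighborLoader for GNN training.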