torch_geometric.datasets.TAGDataset
- class TAGDataset(root: str, dataset: InMemoryDataset, tokenizer_name: str, text: Optional[List[str]] = None, split_idx: Optional[Dict[str, Tensor]] = None, tokenize_batch_size: int = 256, token_on_disk: bool = False, text_on_disk: bool = False, force_reload: bool = False)[source]
Bases:
InMemoryDataset
The Text-Attributed Graph datasets from the “Learning on Large-scale Text-attributed Graphs via Variational Inference” paper. This dataset transforms ogbn-products and ogbn-arxiv into text-attributed graphs in which each node is associated with a raw text string, so that the dataset can be adapted to a DataLoader (for LM training) and a NeighborLoader (for GNN training). In addition, this class can be used as a wrapper that converts an InMemoryDataset, together with a tokenizer and raw text, into a text-attributed graph.
- Parameters:
root (str) – Root directory where the dataset should be saved.
dataset (InMemoryDataset) – The dataset to wrap, e.g., "ogbn-products" or "ogbn-arxiv".
tokenizer_name (str) – The tokenizer name for the language model. Be sure to use the same tokenizer name as the model id of the model repository on huggingface.co.
text (List[str], optional) – A list of raw text associated with each node; the order of the list must align with the node ordering. (default: None)
split_idx (Optional[Dict[str, torch.Tensor]]) – An optional dictionary of split indices; it is required if your dataset does not provide a get_split_idx function. (default: None)
tokenize_batch_size (int) – The batch size for tokenizing the text; tokenization runs on the CPU. (default: 256)
token_on_disk (bool) – If set to True, saves the tokens as .pt files on disk. (default: False)
text_on_disk (bool) – If set to True, saves the given text (list of str) as a dataframe on disk. (default: False)
force_reload (bool) – Whether to re-process the dataset. (default: False)
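The tokenize_batch_size parameter controls how the raw-text list is chunked before each chunk is handed to the tokenizer on the CPU. The following is a minimal stdlib sketch of that chunking; batch_texts is a hypothetical helper, not part of the TAGDataset API:

```python
from typing import List

def batch_texts(text: List[str], tokenize_batch_size: int = 256) -> List[List[str]]:
    """Split a list of raw node texts into CPU-sized tokenization batches.

    Hypothetical helper illustrating how a `text` list of N strings is
    processed in ceil(N / tokenize_batch_size) consecutive chunks.
    """
    return [
        text[i:i + tokenize_batch_size]
        for i in range(0, len(text), tokenize_batch_size)
    ]
```

For example, 600 node texts with the default batch size of 256 are tokenized in three chunks of sizes 256, 256, and 88.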
Note
See example/llm_plus_gnn/glem.py for example usage
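A rough end-to-end sketch of wrapping an OGB node-property dataset into a TAGDataset is shown below. The root path and tokenizer id are placeholders, and the construction is kept inside a function because it triggers dataset downloads; see the glem.py example above for the authoritative usage:

```python
# Hypothetical usage sketch: wrap ogbn-arxiv into a TAGDataset so the same
# graph can feed both an LM DataLoader and a GNN NeighborLoader.
from typing import List, Optional

ROOT = './data/ogb'                     # placeholder download/cache directory
TOKENIZER_NAME = 'prajjwal1/bert-tiny'  # placeholder HuggingFace model id

def build_tag_dataset(text: Optional[List[str]] = None):
    """Construct a TAGDataset around ogbn-arxiv (downloads data when called)."""
    from ogb.nodeproppred import PygNodePropPredDataset
    from torch_geometric.datasets import TAGDataset

    dataset = PygNodePropPredDataset('ogbn-arxiv', root=ROOT)
    split_idx = dataset.get_idx_split()  # OGB's official train/valid/test split
    return TAGDataset(
        ROOT, dataset, TOKENIZER_NAME,
        text=text,                 # raw text per node, aligned with node ids
        split_idx=split_idx,
        tokenize_batch_size=256,   # CPU tokenization batch size
        token_on_disk=True,        # cache token tensors as .pt files
    )
```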