class PCQM4Mv2(root: str, split: str = 'train', transform: Optional[Callable] = None, backend: str = 'sqlite', from_smiles: Optional[Callable] = None)

Bases: OnDiskDataset

The PCQM4Mv2 dataset from the “OGB-LSC: A Large-Scale Challenge for Machine Learning on Graphs” paper. PCQM4Mv2 is a quantum chemistry dataset originally curated under the PubChemQC project. The task is to predict the DFT-calculated HOMO-LUMO energy gap of molecules given their 2D molecular graphs.


This dataset uses the OnDiskDataset base class to load data dynamically from disk.

  • root (str) – Root directory where the dataset should be saved.

  • split (str, optional) – If "train", loads the training dataset. If "val", loads the validation dataset. If "test", loads the test dataset. If "holdout", loads the holdout dataset. (default: "train")

  • transform (callable, optional) – A function/transform that takes in a Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • backend (str) – The database backend to use. (default: "sqlite")

  • from_smiles (callable, optional) – A custom function that takes a SMILES string and outputs a Data object. If not set, defaults to torch_geometric.utils.from_smiles(). (default: None)
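
The parameters above can be put together in a minimal usage sketch. It assumes torch_geometric is installed and that the class is importable from torch_geometric.datasets. The helper load_pcqm4mv2 and the constant VALID_SPLITS are hypothetical names introduced only for illustration. They are not part of the library.

```python
from typing import Callable, Optional

# Split names as listed in the parameter documentation above.
VALID_SPLITS = ("train", "val", "test", "holdout")


def load_pcqm4mv2(
    root: str,
    split: str = "train",
    transform: Optional[Callable] = None,
    backend: str = "sqlite",
):
    """Hypothetical helper: check `split`, then construct the dataset."""
    if split not in VALID_SPLITS:
        raise ValueError(f"split must be one of {VALID_SPLITS}, got {split!r}")

    # Imported lazily so the check above runs even without PyG installed.
    # Assumption: the class lives in torch_geometric.datasets.
    from torch_geometric.datasets import PCQM4Mv2

    # from_smiles is left unset, so the documented default conversion applies.
    return PCQM4Mv2(root, split=split, transform=transform, backend=backend)
```

Calling load_pcqm4mv2("data/PCQM4Mv2", split="val") would then return the validation split. The path here is only an example. Because the class derives from OnDiskDataset, individual molecules are expected to be read from the database on access rather than held in memory all at once.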