torch_geometric.datasets.MD17

class MD17(root: str, name: str, train: Optional[bool] = None, transform: Optional[Callable] = None, pre_transform: Optional[Callable] = None, pre_filter: Optional[Callable] = None, force_reload: bool = False)

Bases: InMemoryDataset

A variety of ab-initio molecular dynamics trajectories from the authors of sGDML. This class provides access to the original MD17 datasets, their revised versions, and the CCSD(T) trajectories.

For every trajectory, the dataset contains the Cartesian positions of the atoms (in Angstrom), their atomic numbers, the total energy (in kcal/mol), and the forces on each atom (in kcal/mol/Angstrom). The latter two are the regression targets for this collection.

Note

Data objects contain no edge indices as these are most commonly constructed via the torch_geometric.transforms.RadiusGraph transform, with its cut-off being a hyperparameter.
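As a rough illustration of the note above, the sketch below shows which atom pairs a radius-based graph connects, followed by one way the transform might be passed to the dataset. The helper names `radius_edges` and `load_with_radius_graph` and the 5.0 Angstrom cut-off are assumptions made for this example, not part of the library. The pure-Python helper also ignores the neighbor cap that RadiusGraph applies, and RadiusGraph itself may need the optional torch-cluster package depending on the installed version.

```python
import math
from typing import List, Sequence, Tuple


def radius_edges(pos: Sequence[Sequence[float]],
                 cutoff: float) -> List[Tuple[int, int]]:
    """Return directed (source, target) pairs of distinct atoms within `cutoff`."""
    edges = []
    for i, p in enumerate(pos):
        for j, q in enumerate(pos):
            # Skip self-loops; including the boundary (<=) is a choice of this sketch.
            if i != j and math.dist(p, q) <= cutoff:
                edges.append((i, j))
    return edges


def load_with_radius_graph(root: str, name: str, cutoff: float = 5.0):
    """Hypothetical helper: load one trajectory and add edges on every access.

    Calling this downloads and processes the chosen trajectory.
    """
    # Imported lazily so that `radius_edges` runs without PyG installed.
    from torch_geometric.datasets import MD17
    from torch_geometric.transforms import RadiusGraph

    return MD17(root, name=name, transform=RadiusGraph(r=cutoff))


# Three atoms on a line: only the first two lie within 1.5 Angstrom of each other.
toy_positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
toy_edges = radius_edges(toy_positions, cutoff=1.5)
```

Passing the transform as `pre_transform` instead would build the edges once at processing time, at the cost of fixing the cut-off in the files saved to disk.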

The original MD17 dataset contains ten molecule trajectories. This version of the dataset was found to suffer from high numerical noise. The revised MD17 dataset contains the same molecules, but the energies and forces were recalculated at the PBE/def2-SVP level of theory using very tight SCF convergence and a very dense DFT integration grid. The third version of the dataset contains fewer molecules, computed at the CCSD(T) level of theory. The benzene molecule at the DFT FHI-aims level of theory was released separately.

Check the table below for detailed information on the molecule, level of theory and number of data points contained in each dataset. Which trajectory is loaded is determined by the name argument. For the coupled cluster trajectories, the dataset comes with pre-defined training and testing splits which are loaded separately via the train argument.

Molecule	Level of Theory	Name	#Examples
Benzene	DFT	`benzene`	627,983
Uracil	DFT	`uracil`	133,770
Naphthalene	DFT	`naphthalene`	326,250
Aspirin	DFT	`aspirin`	211,762
Salicylic acid	DFT	`salicylic acid`	320,231
Malonaldehyde	DFT	`malonaldehyde`	993,237
Ethanol	DFT	`ethanol`	555,092
Toluene	DFT	`toluene`	442,790
Paracetamol	DFT	`paracetamol`	106,490
Azobenzene	DFT	`azobenzene`	99,999
Benzene (R)	DFT (PBE/def2-SVP)	`revised benzene`	100,000
Uracil (R)	DFT (PBE/def2-SVP)	`revised uracil`	100,000
Naphthalene (R)	DFT (PBE/def2-SVP)	`revised naphthalene`	100,000
Aspirin (R)	DFT (PBE/def2-SVP)	`revised aspirin`	100,000
Salicylic acid (R)	DFT (PBE/def2-SVP)	`revised salicylic acid`	100,000
Malonaldehyde (R)	DFT (PBE/def2-SVP)	`revised malonaldehyde`	100,000
Ethanol (R)	DFT (PBE/def2-SVP)	`revised ethanol`	100,000
Toluene (R)	DFT (PBE/def2-SVP)	`revised toluene`	100,000
Paracetamol (R)	DFT (PBE/def2-SVP)	`revised paracetamol`	100,000
Azobenzene (R)	DFT (PBE/def2-SVP)	`revised azobenzene`	99,988
Benzene	CCSD(T)	`benzene CCSD(T)`	1,500
Aspirin	CCSD	`aspirin CCSD`	1,500
Malonaldehyde	CCSD(T)	`malonaldehyde CCSD(T)`	1,500
Ethanol	CCSD(T)	`ethanol CCSD(T)`	2,000
Toluene	CCSD(T)	`toluene CCSD(T)`	1,501
Benzene	DFT FHI-aims	`benzene FHI-aims`	49,863
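The interplay of the name and train arguments described above can be sketched as follows. The helpers `md17_kwargs` and `load_md17` are hypothetical names introduced for this example. Detecting the coupled cluster trajectories by the substring "CCSD" is a convention of this sketch, based on the names in the table, and the checks are the sketch's own rather than a statement of how the class validates its arguments.

```python
from typing import Any, Dict, Optional


def md17_kwargs(name: str, train: Optional[bool] = None) -> Dict[str, Any]:
    """Build keyword arguments for MD17, passing `train` only where it applies."""
    # Assumption of this sketch: names containing "CCSD" ship with splits.
    has_split = "CCSD" in name
    if has_split and train is None:
        raise ValueError(
            f"'{name}' has pre-defined splits; set train=True or train=False")
    if not has_split and train is not None:
        raise ValueError(
            f"'{name}' has no pre-defined splits; leave train unset")
    kwargs: Dict[str, Any] = {"name": name}
    if has_split:
        kwargs["train"] = train
    return kwargs


def load_md17(root: str, name: str, train: Optional[bool] = None):
    """Hypothetical helper: load one trajectory (downloads data on first use)."""
    # Imported lazily so that `md17_kwargs` runs without PyG installed.
    from torch_geometric.datasets import MD17

    return MD17(root, **md17_kwargs(name, train))


# Example argument sets; calling load_md17 with them would trigger a download.
revised_args = md17_kwargs("revised ethanol")
ccsd_train_args = md17_kwargs("aspirin CCSD", train=True)
ccsd_test_args = md17_kwargs("aspirin CCSD", train=False)
```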

Warning

It is advised not to train a model on more than 1,000 samples from the original or revised MD17 datasets.
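One way to follow this advice is to draw a small random training subset and hold out the rest. The sketch below is a minimal example under stated assumptions: the helper name `small_split`, the validation size of 1,000, and the fixed seed are choices made for illustration, not values prescribed by the dataset.

```python
import random
from typing import List, Tuple


def small_split(num_samples: int, num_train: int = 1000, num_val: int = 1000,
                seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle indices reproducibly and cut them into train/val/test lists."""
    if num_train + num_val > num_samples:
        raise ValueError("num_train + num_val exceeds the number of samples")
    perm = list(range(num_samples))
    random.Random(seed).shuffle(perm)  # local generator, global state untouched
    train_idx = perm[:num_train]
    val_idx = perm[num_train:num_train + num_val]
    test_idx = perm[num_train + num_val:]
    return train_idx, val_idx, test_idx


# 100,000 matches the size of most revised trajectories in the table.
train_idx, val_idx, test_idx = small_split(100_000)

# With a loaded dataset, the subsets could then be taken by indexing, e.g.
#   train_set = dataset[train_idx]
# assuming the dataset accepts a list of integer indices.
```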

Parameters:

root (str) – Root directory where the dataset should be saved.
name (str) – Keyword of the trajectory that should be loaded.
train (bool, optional) – Determines whether the train or test split gets loaded for the coupled cluster trajectories. (default: None)
transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
pre_filter (callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)
force_reload (bool, optional) – Whether to re-process the dataset. (default: False)
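To make the transform parameter more concrete, here is a minimal sketch of a callable that rescales the regression targets from kcal/mol to eV. It assumes the data objects expose their targets as `energy` and `force` attributes. The function name `to_ev` and the rounded conversion factor are choices made for this example, and a `SimpleNamespace` stands in for a real data object so the snippet runs without downloading anything.

```python
from types import SimpleNamespace

# Approximate conversion factor: 1 eV is about 23.0605 kcal/mol.
KCAL_PER_MOL_TO_EV = 0.0433641


def to_ev(data):
    """Rescale energy to eV and forces to eV/Angstrom on a data-like object."""
    # Reassign rather than modify in place, so the original values are not
    # overwritten if they are shared with another object.
    data.energy = data.energy * KCAL_PER_MOL_TO_EV
    data.force = data.force * KCAL_PER_MOL_TO_EV
    return data


# Stand-in for a data object; real entries would hold tensors instead of floats.
sample = to_ev(SimpleNamespace(energy=23.0605, force=-46.121))

# Hypothetical usage with the dataset (triggers a download):
#   dataset = MD17(root, name="revised aspirin", transform=to_ev)
```

Because `transform` runs on every access, the files on disk keep the original units. Using the same callable as `pre_transform` would store the converted values instead.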

STATS:

Name	#graphs	#nodes	#features	#tasks
Benzene	627,983	12	1	2
Uracil	133,770	12	1	2
Naphthalene	326,250	18	1	2
Aspirin	211,762	21	1	2
Salicylic acid	320,231	16	1	2
Malonaldehyde	993,237	9	1	2
Ethanol	555,092	9	1	2
Toluene	442,790	15	1	2
Paracetamol	106,490	20	1	2
Azobenzene	99,999	24	1	2
Benzene (R)	100,000	12	1	2
Uracil (R)	100,000	12	1	2
Naphthalene (R)	100,000	18	1	2
Aspirin (R)	100,000	21	1	2
Salicylic acid (R)	100,000	16	1	2
Malonaldehyde (R)	100,000	9	1	2
Ethanol (R)	100,000	9	1	2
Toluene (R)	100,000	15	1	2
Paracetamol (R)	100,000	20	1	2
Azobenzene (R)	99,988	24	1	2
Benzene CCSD-T	1,500	12	1	2
Aspirin CCSD	1,500	21	1	2
Malonaldehyde CCSD-T	1,500	9	1	2
Ethanol CCSD-T	2,000	9	1	2
Toluene CCSD-T	1,501	15	1	2
Benzene FHI-aims	49,863	12	1	2