torch_geometric.datasets.MD17

class MD17(root: str, name: str, train: Optional[bool] = None, transform: Optional[Callable] = None, pre_transform: Optional[Callable] = None, pre_filter: Optional[Callable] = None)[source]

Bases: InMemoryDataset

A variety of ab-initio molecular dynamics trajectories from the authors of sGDML. This class provides access to the original MD17 datasets as well as all other datasets released by sGDML since then (15 in total).

For every trajectory, the dataset contains the Cartesian positions of atoms (in Angstrom), their atomic numbers, as well as the total energy (in kcal/mol) and the forces (in kcal/mol/Angstrom) on each atom. The latter two are the regression targets for this collection.
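Models that expect energies in eV need the targets rescaled. The following is a minimal sketch of that arithmetic in plain Python, assuming the conversion 1 eV = 23.060548 kcal/mol. The helper names and the sample values are hypothetical and are not part of the dataset API.

```python
# Conversion factor: 1 eV = 23.060548 kcal/mol.
KCAL_PER_MOL_PER_EV = 23.060548


def kcal_mol_to_ev(value: float) -> float:
    """Convert an energy in kcal/mol to eV.

    The same factor converts a force component in kcal/mol/Angstrom
    to eV/Angstrom, because the length unit is unchanged.
    """
    return value / KCAL_PER_MOL_PER_EV


def convert_forces(forces):
    """Convert a list of per-atom [fx, fy, fz] forces to eV/Angstrom."""
    return [[kcal_mol_to_ev(component) for component in atom] for atom in forces]


# Hypothetical values, chosen only to illustrate the arithmetic.
energy_kcal = -97208.4
forces_kcal = [[1.2, -0.5, 0.0], [-1.2, 0.5, 0.0]]

energy_ev = kcal_mol_to_ev(energy_kcal)
forces_ev = convert_forces(forces_kcal)
print(energy_ev)
print(forces_ev[0])
```

Positions stay in Angstrom, so only the energy and force values need rescaling.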

Note

Data objects contain no edge indices as these are most commonly constructed via the torch_geometric.transforms.RadiusGraph transform, with its cut-off being a hyperparameter.
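To illustrate what a radius graph is, here is a minimal pure-Python sketch that connects every pair of atoms whose distance is within a cut-off. It is a conceptual illustration only, not the RadiusGraph implementation: the real transform has further options (such as a cap on the number of neighbors) and may treat atoms exactly at the cut-off differently. The function name and the sample coordinates are hypothetical.

```python
import math


def radius_edges(positions, cutoff):
    """Return directed (source, target) pairs for atoms within `cutoff`.

    Self-loops are excluded, and every undirected pair appears in both
    directions, which is how edge lists are usually stored for graphs.
    """
    edges = []
    for i, p in enumerate(positions):
        for j, q in enumerate(positions):
            if i != j and math.dist(p, q) <= cutoff:
                edges.append((i, j))
    return edges


# Three hypothetical atoms on a line, at 0.0, 1.0 and 3.0 Angstrom.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]

print(radius_edges(positions, cutoff=1.5))       # → [(0, 1), (1, 0)]
print(len(radius_edges(positions, cutoff=5.0)))  # → 6
```

A larger cut-off gives a denser graph, which is why the cut-off is usually tuned together with the model.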

Some of the trajectories were computed at different levels of theory, and for most molecules there exist two versions: a long trajectory at the DFT level of theory and a short trajectory at the coupled cluster level of theory. Check the table below for detailed information on the molecule, level of theory and number of data points contained in each dataset. The name argument determines which trajectory is loaded. The coupled cluster trajectories come with pre-defined training and testing splits, which are loaded separately via the train argument.

When using these datasets, make sure to cite the appropriate publications listed on the sGDML website.

Molecule        Level of Theory  Name                   #Examples
--------------  ---------------  ---------------------  ---------
Benzene         DFT              benzene                627,983
Benzene         DFT FHI-aims     benzene FHI-aims       49,863
Benzene         CCSD(T)          benzene CCSD(T)        1,500
Uracil          DFT              uracil                 133,770
Naphthalene     DFT              napthalene             326,250
Aspirin         DFT              aspirin                211,762
Aspirin         CCSD             aspirin CCSD           1,500
Salicylic acid  DFT              salicylic acid         320,231
Malonaldehyde   DFT              malonaldehyde          993,237
Malonaldehyde   CCSD(T)          malonaldehyde CCSD(T)  1,500
Ethanol         DFT              ethanol                555,092
Ethanol         CCSD(T)          ethanol CCSD(T)        2,000
Toluene         DFT              toluene                442,790
Toluene         CCSD(T)          toluene CCSD(T)        1,501
Paracetamol     DFT              paracetamol            106,490
Azobenzene      DFT              azobenzene             99,999

Parameters
  • root (str) – Root directory where the dataset should be saved.

  • name (str) – Keyword of the trajectory that should be loaded.

  • train (bool, optional) – Determines whether the train or test split gets loaded for the coupled cluster trajectories. (default: None)

  • transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • pre_filter (callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)
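The parameters above can be combined as in the following minimal sketch. It assumes PyTorch Geometric is installed and that trajectory names are passed as listed in the table of trajectories. The helpers needs_split and load_md17, the "data/MD17" root and the 5.0 Angstrom cut-off are hypothetical choices for illustration. The up-front check on train is a convenience of this sketch, not documented library behavior.

```python
def needs_split(name: str) -> bool:
    """Heuristic used by this sketch: coupled cluster trajectories carry
    "CCSD" in their name and ship with pre-defined train/test splits."""
    return "CCSD" in name


def load_md17(root: str, name: str, train=None, cutoff: float = 5.0):
    """Load one trajectory and build edges on every access via RadiusGraph."""
    # Validate first, so a missing split choice fails before any download.
    if needs_split(name) and train is None:
        raise ValueError(
            f"'{name}' has pre-defined splits; pass train=True or train=False"
        )

    # Imported lazily so that needs_split() works without PyG installed.
    from torch_geometric.datasets import MD17
    from torch_geometric.transforms import RadiusGraph

    return MD17(root, name=name, train=train, transform=RadiusGraph(r=cutoff))


# Example calls (they download data, so they are left commented out):
# train_set = load_md17("data/MD17", "ethanol CCSD(T)", train=True)
# test_set = load_md17("data/MD17", "ethanol CCSD(T)", train=False)
# dft_set = load_md17("data/MD17", "ethanol")  # no pre-defined split
```

Passing RadiusGraph as transform rebuilds the edges on every access, so the cut-off can be changed without reprocessing the data. Passing it as pre_transform instead would store the edges on disk once.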

STATS:

Name                  #graphs  #nodes  #edges  #features  #classes
--------------------  -------  ------  ------  ---------  --------
Benzene FHI-aims      49,863   12      0       0          0
Benzene               627,983  12      0       0          0
Benzene CCSD-T        1,500    12      0       0          0
Uracil                133,770  12      0       0          0
Naphthalene           326,250  18      0       0          0
Aspirin               211,762  21      0       0          0
Aspirin CCSD          1,500    21      0       0          0
Salicylic acid        320,231  16      0       0          0
Malonaldehyde         993,237  9       0       0          0
Malonaldehyde CCSD-T  1,500    9       0       0          0
Ethanol               555,092  9       0       0          0
Ethanol CCSD-T        2,000    9       0       0          0
Toluene               442,790  15      0       0          0
Toluene CCSD-T        1,501    15      0       0          0
Paracetamol           106,490  20      0       0          0
Azobenzene            99,999   24      0       0          0