torch_geometric.datasets.MD17

class MD17(root: str, name: str, train: Optional[bool] = None, transform: Optional[Callable] = None, pre_transform: Optional[Callable] = None, pre_filter: Optional[Callable] = None, force_reload: bool = False)

Bases: InMemoryDataset

A variety of ab-initio molecular dynamics trajectories from the authors of sGDML. This class provides access to the original MD17 datasets, their revised versions, and the CCSD(T) trajectories.

For every trajectory, the dataset contains the Cartesian positions of atoms (in Angstrom), their atomic numbers, as well as the total energy (in kcal/mol) and forces (kcal/mol/Angstrom) on each atom. The latter two are the regression targets for this collection.
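Because energies are given in kcal/mol and forces in kcal/mol/Angstrom, results are often converted before being compared with work that reports errors in eV or meV. The following is a minimal sketch of that arithmetic, assuming the standard factor of 23.060548 kcal/mol per eV; the function names are illustrative and not part of torch_geometric.

```python
# Standard conversion factor: 1 eV corresponds to 23.060548 kcal/mol.
KCAL_PER_MOL_PER_EV = 23.060548


def kcal_per_mol_to_ev(value: float) -> float:
    """Convert an energy in kcal/mol to eV.

    The same factor converts forces from kcal/mol/Angstrom to eV/Angstrom,
    since the length unit is unchanged.
    """
    return value / KCAL_PER_MOL_PER_EV


def kcal_per_mol_to_mev(value: float) -> float:
    """Convert an energy in kcal/mol to meV."""
    return 1000.0 * kcal_per_mol_to_ev(value)
```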

Note

Data objects contain no edge indices as these are most commonly constructed via the torch_geometric.transforms.RadiusGraph transform, with its cut-off being a hyperparameter.
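To make the role of the cut-off concrete, the sketch below builds the edge list of a radius graph in plain Python and shows one way the RadiusGraph transform might be passed to the dataset. The helper names radius_edges and load_with_radius_graph are hypothetical, and the pure-Python loop is an illustration of the idea rather than the implementation used by torch_geometric (it also treats a distance exactly equal to the cut-off as connected, which is an assumption).

```python
import math
from typing import List, Sequence, Tuple


def radius_edges(pos: Sequence[Sequence[float]], cutoff: float) -> List[Tuple[int, int]]:
    """Return directed edges (i, j), i != j, for atoms at most `cutoff` apart."""
    edges = []
    for i, p in enumerate(pos):
        for j, q in enumerate(pos):
            # No self-loops; both directions (i, j) and (j, i) are emitted.
            if i != j and math.dist(p, q) <= cutoff:
                edges.append((i, j))
    return edges


def load_with_radius_graph(root: str, name: str, cutoff: float):
    """Load a trajectory with edges built on every access (hypothetical helper)."""
    # Imported lazily so that radius_edges can run without PyTorch Geometric.
    from torch_geometric.datasets import MD17
    from torch_geometric.transforms import RadiusGraph

    return MD17(root, name=name, transform=RadiusGraph(r=cutoff))
```

With three collinear atoms at 0, 1 and 3 Angstrom, a cut-off of 1.5 connects only the first pair, while a cut-off of 2.5 also connects the second and third atoms.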

The original MD17 dataset contains ten molecule trajectories. This version of the dataset was found to suffer from high numerical noise. The revised MD17 dataset contains the same molecules, but the energies and forces were recalculated at the PBE/def2-SVP level of theory using very tight SCF convergence and a very dense DFT integration grid. The third version of the dataset contains fewer molecules, computed at the CCSD(T) level of theory. The benzene molecule at the DFT FHI-aims level of theory was released separately.

Check the table below for detailed information on the molecule, level of theory and number of data points contained in each dataset. Which trajectory is loaded is determined by the name argument. For the coupled cluster trajectories, the dataset comes with pre-defined training and testing splits which are loaded separately via the train argument.

Molecule            Level of Theory     Name                    #Examples
Benzene             DFT                 benzene                 627,983
Uracil              DFT                 uracil                  133,770
Naphthalene         DFT                 napthalene              326,250
Aspirin             DFT                 aspirin                 211,762
Salicylic acid      DFT                 salicylic acid          320,231
Malonaldehyde       DFT                 malonaldehyde           993,237
Ethanol             DFT                 ethanol                 555,092
Toluene             DFT                 toluene                 442,790
Paracetamol         DFT                 paracetamol             106,490
Azobenzene          DFT                 azobenzene              99,999
Benzene (R)         DFT (PBE/def2-SVP)  revised benzene         100,000
Uracil (R)          DFT (PBE/def2-SVP)  revised uracil          100,000
Naphthalene (R)     DFT (PBE/def2-SVP)  revised napthalene      100,000
Aspirin (R)         DFT (PBE/def2-SVP)  revised aspirin         100,000
Salicylic acid (R)  DFT (PBE/def2-SVP)  revised salicylic acid  100,000
Malonaldehyde (R)   DFT (PBE/def2-SVP)  revised malonaldehyde   100,000
Ethanol (R)         DFT (PBE/def2-SVP)  revised ethanol         100,000
Toluene (R)         DFT (PBE/def2-SVP)  revised toluene         100,000
Paracetamol (R)     DFT (PBE/def2-SVP)  revised paracetamol     100,000
Azobenzene (R)      DFT (PBE/def2-SVP)  revised azobenzene      99,988
Benzene             CCSD(T)             benzene CCSD(T)         1,500
Aspirin             CCSD                aspirin CCSD            1,500
Malonaldehyde       CCSD(T)             malonaldehyde CCSD(T)   1,500
Ethanol             CCSD(T)             ethanol CCSD(T)         2,000
Toluene             CCSD(T)             toluene CCSD(T)         1,501
Benzene             DFT FHI-aims        benzene FHI-aims        49,863
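The interplay between the name and train arguments can be sketched as follows. The helpers md17_kwargs and load_md17 are hypothetical, and the rule that a coupled cluster trajectory is recognised by "CCSD" appearing in its name is an assumption drawn from the Name column of the table, not a documented guarantee.

```python
from typing import Any, Dict, Optional


def md17_kwargs(name: str, train: Optional[bool] = None) -> Dict[str, Any]:
    """Build keyword arguments for MD17, checking `train` against `name`."""
    # Assumption: coupled cluster trajectories are those whose name contains "CCSD".
    is_coupled_cluster = "CCSD" in name
    if is_coupled_cluster and train is None:
        raise ValueError(f"'{name}' has pre-defined splits; pass train=True or train=False")
    if not is_coupled_cluster and train is not None:
        raise ValueError(f"'{name}' has no pre-defined splits; leave train as None")
    kwargs: Dict[str, Any] = {"name": name}
    if train is not None:
        kwargs["train"] = train
    return kwargs


def load_md17(root: str, name: str, train: Optional[bool] = None):
    """Load one trajectory (hypothetical convenience wrapper)."""
    # Imported lazily so that md17_kwargs can be used without PyTorch Geometric.
    from torch_geometric.datasets import MD17

    return MD17(root, **md17_kwargs(name, train))
```

For instance, load_md17(root, "revised aspirin") would request the full revised trajectory, whereas load_md17(root, "benzene CCSD(T)", train=False) would request the test split of the coupled cluster trajectory.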

Warning

It is advised not to train a model on more than 1,000 samples from the original or revised MD17 dataset.
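One way to follow this advice is to draw a small, reproducible training subset and keep the remainder for validation and testing. The sketch below is only an illustration: the function name split_indices, the validation size and the seed are assumptions, not values prescribed by the dataset.

```python
import random
from typing import List, Tuple


def split_indices(
    num_examples: int,
    num_train: int = 1000,
    num_val: int = 1000,
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle all indices once and cut them into train, validation and test parts."""
    if num_train + num_val > num_examples:
        raise ValueError("num_train + num_val exceeds the number of examples")
    perm = list(range(num_examples))
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(perm)
    train_idx = perm[:num_train]
    val_idx = perm[num_train:num_train + num_val]
    test_idx = perm[num_train + num_val:]
    return train_idx, val_idx, test_idx
```

Given a loaded dataset, the resulting lists could then be used for indexing, for example dataset[train_idx], assuming the dataset accepts a list of integer indices.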

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • name (str) – Keyword of the trajectory that should be loaded.

  • train (bool, optional) – Determines whether the train or test split gets loaded for the coupled cluster trajectories. (default: None)

  • transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • pre_filter (callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)

  • force_reload (bool, optional) – Whether to re-process the dataset. (default: False)
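The callables accepted by transform, pre_transform and pre_filter are ordinary Python functions. The sketch below shows the expected shapes of such functions under the assumption that each data object exposes the atomic numbers as z and the total energy as energy; both function names and the energy threshold are hypothetical.

```python
from types import SimpleNamespace


def add_num_atoms(data):
    """Transform: attach the atom count and return the (modified) data object."""
    data.num_atoms = len(data.z)
    return data


def make_energy_filter(max_energy: float):
    """Build a pre_filter that keeps samples whose energy is at most `max_energy`."""

    def keep(data) -> bool:
        # float() accepts a plain number as well as a single-element tensor.
        return float(data.energy) <= max_energy

    return keep


# Stand-in object for illustration only; a real sample would be a Data object.
sample = SimpleNamespace(z=[6, 6, 1], energy=-10.0)
sample = add_num_atoms(sample)
keep_sample = make_energy_filter(-5.0)(sample)
```

Such functions could then be passed as, for example, transform=add_num_atoms and pre_filter=make_energy_filter(threshold) when constructing the dataset.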

STATS:

Name                  #graphs   #nodes  #edges  #features  #tasks
Benzene               627,983   12      0       1          2
Uracil                133,770   12      0       1          2
Naphthalene           326,250   18      0       1          2
Aspirin               211,762   21      0       1          2
Salicylic acid        320,231   16      0       1          2
Malonaldehyde         993,237   9       0       1          2
Ethanol               555,092   9       0       1          2
Toluene               442,790   15      0       1          2
Paracetamol           106,490   20      0       1          2
Azobenzene            99,999    24      0       1          2
Benzene (R)           100,000   12      0       1          2
Uracil (R)            100,000   12      0       1          2
Naphthalene (R)       100,000   18      0       1          2
Aspirin (R)           100,000   21      0       1          2
Salicylic acid (R)    100,000   16      0       1          2
Malonaldehyde (R)     100,000   9       0       1          2
Ethanol (R)           100,000   9       0       1          2
Toluene (R)           100,000   15      0       1          2
Paracetamol (R)       100,000   20      0       1          2
Azobenzene (R)        99,988    24      0       1          2
Benzene CCSD-T        1,500     12      0       1          2
Aspirin CCSD          1,500     21      0       1          2
Malonaldehyde CCSD-T  1,500     9       0       1          2
Ethanol CCSD-T        2,000     9       0       1          2
Toluene CCSD-T        1,501     15      0       1          2
Benzene FHI-aims      49,863    12      0       1          2