torch_geometric.datasets.MD17
- class MD17(root: str, name: str, train: Optional[bool] = None, transform: Optional[Callable] = None, pre_transform: Optional[Callable] = None, pre_filter: Optional[Callable] = None, force_reload: bool = False)
Bases: InMemoryDataset
A variety of ab-initio molecular dynamics trajectories from the authors of sGDML. This class provides access to the original MD17 datasets, their revised versions, and the CCSD(T) trajectories.
For every trajectory, the dataset contains the Cartesian positions of atoms (in Angstrom), their atomic numbers, as well as the total energy (in kcal/mol) and forces (kcal/mol/Angstrom) on each atom. The latter two are the regression targets for this collection.
Note
Data objects contain no edge indices as these are most commonly constructed via the torch_geometric.transforms.RadiusGraph transform, with its cut-off being a hyperparameter.

The original MD17 dataset contains ten molecule trajectories. This version of the dataset was found to suffer from high numerical noise. The revised MD17 dataset contains the same molecules, but the energies and forces were recalculated at the PBE/def2-SVP level of theory using very tight SCF convergence and a very dense DFT integration grid. The third version of the dataset contains fewer molecules, computed at the CCSD(T) level of theory. The benzene molecule at the DFT FHI-aims level of theory was released separately.
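To make the note on edge indices concrete, here is a minimal sketch. `radius_edges` is a hypothetical pure-Python helper that only illustrates what a radius graph is (every ordered pair of distinct atoms closer than the cut-off); it is not the library's implementation. `load_with_radius_graph` is likewise a hypothetical wrapper showing how the real `RadiusGraph` transform might be attached, and it assumes `torch_geometric` is installed.

```python
import math


def radius_edges(pos, cutoff):
    """Return directed (source, target) pairs of atoms closer than `cutoff`."""
    edges = []
    for i, p in enumerate(pos):
        for j, q in enumerate(pos):
            # Skip self-loops; keep both directions of every close pair.
            if i != j and math.dist(p, q) < cutoff:
                edges.append((i, j))
    return edges


def load_with_radius_graph(root, name, cutoff):
    """Load a trajectory with edges built on every access.

    Imports are local so that the toy example below runs without
    torch_geometric being installed.
    """
    from torch_geometric.datasets import MD17
    from torch_geometric.transforms import RadiusGraph

    return MD17(root=root, name=name, transform=RadiusGraph(r=cutoff))


# Toy example: three atoms on a line, 1.0 Angstrom apart.
toy_pos = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
toy_edges = radius_edges(toy_pos, cutoff=1.5)
print(toy_edges)  # [(0, 1), (1, 0), (1, 2), (2, 1)]
```

The real transform has further options such as `loop` and `max_num_neighbors`, so its output can differ from this sketch for dense systems or large cut-offs.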
Check the table below for detailed information on the molecule, level of theory and number of data points contained in each dataset. Which trajectory is loaded is determined by the name argument. For the coupled cluster trajectories, the dataset comes with pre-defined training and testing splits which are loaded separately via the train argument.

| Molecule | Level of Theory | Name | #Examples |
|---|---|---|---|
| Benzene | DFT | benzene | 627,983 |
| Uracil | DFT | uracil | 133,770 |
| Naphthalene | DFT | napthalene | 326,250 |
| Aspirin | DFT | aspirin | 211,762 |
| Salicylic acid | DFT | salicylic acid | 320,231 |
| Malonaldehyde | DFT | malonaldehyde | 993,237 |
| Ethanol | DFT | ethanol | 555,092 |
| Toluene | DFT | toluene | 442,790 |
| Paracetamol | DFT | paracetamol | 106,490 |
| Azobenzene | DFT | azobenzene | 99,999 |
| Benzene (R) | DFT (PBE/def2-SVP) | revised benzene | 100,000 |
| Uracil (R) | DFT (PBE/def2-SVP) | revised uracil | 100,000 |
| Naphthalene (R) | DFT (PBE/def2-SVP) | revised napthalene | 100,000 |
| Aspirin (R) | DFT (PBE/def2-SVP) | revised aspirin | 100,000 |
| Salicylic acid (R) | DFT (PBE/def2-SVP) | revised salicylic acid | 100,000 |
| Malonaldehyde (R) | DFT (PBE/def2-SVP) | revised malonaldehyde | 100,000 |
| Ethanol (R) | DFT (PBE/def2-SVP) | revised ethanol | 100,000 |
| Toluene (R) | DFT (PBE/def2-SVP) | revised toluene | 100,000 |
| Paracetamol (R) | DFT (PBE/def2-SVP) | revised paracetamol | 100,000 |
| Azobenzene (R) | DFT (PBE/def2-SVP) | revised azobenzene | 99,988 |
| Benzene | CCSD(T) | benzene CCSD(T) | 1,500 |
| Aspirin | CCSD | aspirin CCSD | 1,500 |
| Malonaldehyde | CCSD(T) | malonaldehyde CCSD(T) | 1,500 |
| Ethanol | CCSD(T) | ethanol CCSD(T) | 2,000 |
| Toluene | CCSD(T) | toluene CCSD(T) | 1,501 |
| Benzene | DFT FHI-aims | benzene FHI-aims | 49,863 |
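The interplay of name and train can be sketched as follows. `md17_kwargs` and `load_md17` are hypothetical helpers, not part of the library. They assume that coupled cluster trajectories can be recognised by `"CCSD"` appearing in the name, as in the table above. Raising an error when `train` is left unset for those trajectories is a choice made for this sketch, not documented library behaviour.

```python
def md17_kwargs(root, name, train=None):
    """Build keyword arguments for MD17; `train` is only passed for CCSD names."""
    kwargs = {"root": root, "name": name}
    if "CCSD" in name:
        # Assumption of this sketch: force an explicit choice of split.
        if train is None:
            raise ValueError(
                "choose train=True or train=False for coupled cluster data"
            )
        kwargs["train"] = train
    return kwargs


def load_md17(root, name, train=None):
    """Load one trajectory; needs torch_geometric, imported lazily."""
    from torch_geometric.datasets import MD17

    return MD17(**md17_kwargs(root, name, train))


# "data/MD17" is a placeholder root directory.
print(md17_kwargs("data/MD17", "ethanol CCSD(T)", train=True))
print(md17_kwargs("data/MD17", "revised ethanol"))
```

With `torch_geometric` installed, `load_md17("data/MD17", "ethanol CCSD(T)", train=True)` and the same call with `train=False` would be expected to return the two pre-defined splits.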
Warning
It is advised to not train a model on more than 1,000 samples from the original or revised MD17 dataset.
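One way to follow this advice is to draw a fixed random subset of indices before training. The helper below is a hypothetical sketch using only the standard library; the function name, the default of 1,000 samples, and the seed are assumptions for illustration.

```python
import random


def sample_indices(num_examples, num_train=1000, seed=0):
    """Pick `num_train` distinct indices from range(num_examples), reproducibly."""
    if num_train > num_examples:
        raise ValueError("num_train exceeds the number of available examples")
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return sorted(rng.sample(range(num_examples), num_train))


# 100,000 is the size of most revised MD17 trajectories in the table above.
train_idx = sample_indices(100_000)
print(len(train_idx))  # 1000

# With a loaded dataset, the subset could then be taken by index, e.g.:
#   train_set = dataset[train_idx]
```

The remaining indices can serve as validation and test data. Fixing the seed keeps the split reproducible across runs.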
- Parameters:
root (str) – Root directory where the dataset should be saved.
name (str) – Keyword of the trajectory that should be loaded.
train (bool, optional) – Determines whether the train or test split gets loaded for the coupled cluster trajectories. (default: None)
transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before every access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in a torch_geometric.data.Data object and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
pre_filter (callable, optional) – A function that takes in a torch_geometric.data.Data object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)
force_reload (bool, optional) – Whether to re-process the dataset. (default: False)
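As an illustration of the transform and pre_transform parameters, the sketch below converts the regression targets from kcal/mol to eV. It assumes the targets are stored as `energy` and `force` attributes on each data object; these attribute names are not stated on this page and should be checked against a loaded sample. A `SimpleNamespace` with plain floats stands in for a real `Data` object so the example runs without `torch_geometric`.

```python
from types import SimpleNamespace

# Approximate conversion factor: 1 kcal/mol expressed in eV.
KCAL_PER_MOL_TO_EV = 0.0433641


def to_ev(data):
    """Rescale energy (kcal/mol) and force (kcal/mol/Angstrom) to eV units.

    `energy` and `force` are assumed attribute names.
    """
    data.energy = data.energy * KCAL_PER_MOL_TO_EV
    data.force = data.force * KCAL_PER_MOL_TO_EV
    return data


# Stand-in for a Data object, with scalar placeholders for the targets.
sample = SimpleNamespace(energy=-100.0, force=2.0)
converted = to_ev(sample)
print(converted.energy, converted.force)
```

Passing such a callable as `transform` would apply it on every access, whereas passing it as `pre_transform` would apply it once before the processed data is saved to disk.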
STATS:

| Name | #graphs | #nodes | #edges | #features | #tasks |
|---|---|---|---|---|---|
| Benzene | 627,983 | 12 | 0 | 1 | 2 |
| Uracil | 133,770 | 12 | 0 | 1 | 2 |
| Naphthalene | 326,250 | 10 | 0 | 1 | 2 |
| Aspirin | 211,762 | 21 | 0 | 1 | 2 |
| Salicylic acid | 320,231 | 16 | 0 | 1 | 2 |
| Malonaldehyde | 993,237 | 9 | 0 | 1 | 2 |
| Ethanol | 555,092 | 9 | 0 | 1 | 2 |
| Toluene | 442,790 | 15 | 0 | 1 | 2 |
| Paracetamol | 106,490 | 20 | 0 | 1 | 2 |
| Azobenzene | 99,999 | 24 | 0 | 1 | 2 |
| Benzene (R) | 100,000 | 12 | 0 | 1 | 2 |
| Uracil (R) | 100,000 | 12 | 0 | 1 | 2 |
| Naphthalene (R) | 100,000 | 10 | 0 | 1 | 2 |
| Aspirin (R) | 100,000 | 21 | 0 | 1 | 2 |
| Salicylic acid (R) | 100,000 | 16 | 0 | 1 | 2 |
| Malonaldehyde (R) | 100,000 | 9 | 0 | 1 | 2 |
| Ethanol (R) | 100,000 | 9 | 0 | 1 | 2 |
| Toluene (R) | 100,000 | 15 | 0 | 1 | 2 |
| Paracetamol (R) | 100,000 | 20 | 0 | 1 | 2 |
| Azobenzene (R) | 99,988 | 24 | 0 | 1 | 2 |
| Benzene CCSD-T | 1,500 | 12 | 0 | 1 | 2 |
| Aspirin CCSD-T | 1,500 | 21 | 0 | 1 | 2 |
| Malonaldehyde CCSD-T | 1,500 | 9 | 0 | 1 | 2 |
| Ethanol CCSD-T | 2,000 | 9 | 0 | 1 | 2 |
| Toluene CCSD-T | 1,501 | 15 | 0 | 1 | 2 |
| Benzene FHI-aims | 49,863 | 12 | 0 | 1 | 2 |