GAP fit workflow with many wfl use-case examples#

This notebook walks though a lightweight workflow of fitting a MLIP to showcase usage of a wide range of wfl functions and ways they can be used.

The main steps are:

Create some molecules
Run GFN2-xTB MD
Filter by force components
Calculate global SOAP descriptor
Perform CUR decomposition to select diverse-ish training and testing sets
Fit a GAP potential
Evaluate structures with GAP
Plot atomization energy and force component correlation plots.

Imports#

In addition to standard packages or wfl dependencies, we make use of three external packages:

quip and quippy which provide interface for fitting and evaluating GAP.

Documentation: https://pypi.org/project/quippy-ase/

Installation: pip install quippy-ase
GFN2-xTB: a semi-empirical method designed for molecular systems, used as a reference method.

Documentation:
- https://xtb-docs.readthedocs.io/en/latest/contents.html
- https://xtb-python.readthedocs.io/en/latest/
Installation: conda install -c conda-forge xtb-python
RDKit: a chemoinformatics package that wfl uses to convert 2D SMILES strings (e.g. “CCO” for ethanol) into 3D Atoms objects.

Documentation: https://rdkit.org/

Installation: conda install -c conda-forge rdkit

import numpy as np

from ase import Atoms

from xtb.ase.calculator import XTB

from quippy.potential import Potential

from wfl.configset import ConfigSet, OutputSpec
from wfl.generate import md
import wfl.descriptors.quippy
import wfl.select.by_descriptor
import wfl.fit.gap.simple
from wfl.calculators import generic
from wfl.autoparallelize import AutoparaInfo
from wfl.autoparallelize import RemoteInfo
from wfl.generate import smiles
from wfl.utils.configs import atomization_energy
from wfl.select.simple import by_bool_func
from wfl.fit import error
import wfl.map
from pathlib import Path

from expyre.resources import Resources

# set random seed, so that MD runs, etc are reproducible and we can check for RMSEs.
rng = np.random.default_rng(20230301)

Reference calculator#

The calculator object given to autoparalellize-wrapped functions need to be pickle-able, so it can be executed on the parallel Python subprocesses with multiprocessing.pool. The calculators that can’t be pickled need to be given to workflow functions as

(Initalizer, [args], {kwargs})

e.g. xtb would normally be called with

xtb_calc = XTB(method="GFN2-xTB")

but instead in wfl scripts we define it as

xtb_calc = (XTB, [], {"method": "GFN2-xTB"})

Prepare isolated atoms#

GAP requires reference (e0) energies for fitting. We construct Atoms objects with a single atom, evaluate them with the reference GFN2-xTB method and store in a file to later combine them with the training set.

isolated_at_fname='isolated_atoms.xtb.xyz'
isolated_atoms = [Atoms(element, positions=[(0, 0, 0)], cell=[50, 50, 50]) for element in ["H", "C"]]
inputs = ConfigSet(isolated_atoms)
outputs=OutputSpec(isolated_at_fname, tags={"config_type": "isolated_atom"})

# calculate reference energy
isolated_atoms = generic.calculate(
    inputs=inputs,
    outputs=outputs,
    calculator=xtb_calc,
    properties=["energy"],
    output_prefix="xtb_")

Generate initial structures#

We build this example on a small number of hydrocarbon molecules. Their connectivity is represented as SMILES strings and use RDKit to them into reasonable 3D geometries to start the molecular dynamics simulation with.

all_smiles = [
    'CC1=CCC=CC(C)=C1C(C)C',
    'CC1(c2ccc(CC3CC=CC3)cc2)CC1',
    'C#CC[C@@H](CCC=C(C)C)C1CC1',
    'Cc1ccccc1CCCC1=CCCCC1',
    'C=CC1=CC[C@@H]2C[C@H]1C2(C)C',
    'C1=CCC(Cc2ccc(CC3CC3)cc2)C1',
    'C1=CC(c2ccccc2)=CCC1',
    'C/C=C/CCCC[C@H](C)C(C)(C)C',
    'C=C[C@@H]1C/C=C/CCCCCCCC1',
    'C[C@H](CC(C)(C)C)[C@@H](C)C(C)(C)C',
    'CC/C=C\\C[C@@H](C)c1cccc(C)c1C',
    'C=C1CC2c3ccccc3C1c1ccccc12']

outputs = OutputSpec("1.ch.rdkit.xyz")
smiles_configs = smiles.smiles(all_smiles, outputs=outputs)

# regenerate smiles_configs with a random seed set for testing purposes
# this cell is hidden from docs.

outputs = OutputSpec("1.ch.rdkit.xyz", overwrite=True)
# set seed for smiles generation to a value from our (reproducible) random number generator
smiles_configs = smiles.smiles(all_smiles, outputs=outputs, randomSeed=int(rng.integers(np.iinfo(np.int32).max,
                                                                                        dtype=np.int32)))

Run Molecular Dynamics simulation#

We run the MD at 300 K with an NVT Berendsen thermostat to collect a pool of structures from which we will select diverse structures for the training set.

outputs = OutputSpec("2.ch.rdkit.md.traj.xyz")

md_params = {
    "steps": 80,
    "dt": 0.5,  # fs
    "temperature": 300,  # K
    "temperature_tau": 500,
    "results_prefix": "xtb_",
    "traj_step_interval": 5}

remote_info = {
    "sys_name" : "github",
    "job_name" : "md",
    "resources" : {
        "max_time" : "15m",
        "num_cores" : 2,
        "partitions" : "standard"},
    "check_interval": 5,
    "num_inputs_per_queued_job" :20,
}

md_sample = md.md(
    inputs=smiles_configs,
    outputs=outputs,
    calculator=xtb_calc,
    autopara_info = AutoparaInfo(
        remote_info=remote_info,
        num_python_subprocesses=2),
    **md_params)

Filter structures#

While diverse training set leads to better model extrapolation, structures too dissimilar to the region of interest are fitted at the expense of accuracy elsewhere. One way to spot structures somewhat distant from equilibrium is by checking for high force components. Below we exclude such structures via one of wfl’s filtering functions.

def are_forces_reasonable(at):
    force_comps = at.arrays["xtb_forces"]
    return np.all(np.linalg.norm(force_comps, axis=1) < 8)

outputs = OutputSpec("3.ch.rdkit.md.traj.filtered.xyz")

md_sample = by_bool_func(
    inputs = md_sample,
    outputs = outputs,
    at_filter = are_forces_reasonable
)

Calculate SOAP descriptor#

outputs = OutputSpec("4.ch.rdkit.md.traj.local_soap.xyz")

descriptor_key = "SOAP"
# Descriptor string, just as it would go into quip.
# dictionary can have a descriptor per species, e.g.
# descriptor = {
#   "H": "soap ...",
#   "C": "soap ..."}
# `None` for dictionary keys just means that the same descriptor is used
# for all elements.
descriptor =   {
        None: "soap l_max=3 n_max=6 cutoff=4 delta=1 covariance_type=dot_product zeta=4 atom_gaussian_width=0.3"
    }

# this function isn't parallelised here, but can be
# by setting WFL_NUM_PYTHON_SUBPROCESSES or
# WFL_EXPYRE_INFO
md_soap_local = wfl.descriptors.quippy.calculate(
    inputs=md_sample,
    outputs=outputs,
    descs=descriptor,
    key=descriptor_key,
    per_atom=True)

def get_average_soap(at, descriptor_key):
    at_desc = at.arrays.pop(descriptor_key)
    at_desc = np.sum(at_desc, axis=0)
    at_desc /= np.linalg.norm(at_desc)
    at.info[descriptor_key] = at_desc
    return at

md_soap_global = wfl.map.map(
    inputs = md_soap_local,
    outputs = OutputSpec(),
    map_func = get_average_soap,
    args = [descriptor_key])

Sub-select with CUR#

Select diverse structures for training and testing sets with CUR.

outputs = OutputSpec("5.ch.rdkit.md.traj.soap.cur_selection.xyz")
cur_selection = wfl.select.by_descriptor.CUR_conf_global(
    inputs=md_soap_global,
    outputs=outputs,
    num=100,                    # target number of structures to pick
    at_descs_info_key="SOAP", rng=rng)


train_fname = "6.1.train.xyz"
test_fname = "6.2.test.xyz"
gap_fname='gap.xml'

def process(at):
    at.cell = [50, 50, 50]
    # For now, SOAP descriptor in atoms.info cannot be parsed by the xyz reader
    del at.info["SOAP"]
    return at


processed_cur_selection = wfl.map.map(
    cur_selection,
    OutputSpec(),
    map_func = process)


# label and save training and testing sets
train_inputs = ConfigSet(list(processed_cur_selection)[0::2])
test_inputs = ConfigSet(list(processed_cur_selection)[1::2])

OutputSpec(train_fname, tags={"config_type": "train"}).write(train_inputs)
OutputSpec(test_fname, tags={"config_type": "test"}).write(test_inputs)

Fit GAP#

The gap parameter dictionary is almost directly converted to a command for gap_fit.

gap_params = {
    "gap_file": gap_fname,
    "energy_parameter_name": "xtb_energy",
    "force_parameter_name": "xtb_forces",
    "default_sigma": [0.001, 0.01, 0.0, 0.0],
    "config_type_kernel_regularisation": {"isolated_atom":[0.0001,0.0001,0.0,0.0]},
    "_gap": [{
            "soap": True,
            "l_max": 3,
            "n_max": 6,
            "cutoff": 3,
            "delta": 0.1,
            "covariance_type": "dot_product",
            "zeta": 4,
            "n_sparse":20,
            "sparse_method": "cur_points",
            "atom_gaussian_width":0.3,
            "cutoff_transition_width": 0.5},
         {
            "soap": True,
            "l_max": 3,
            "n_max": 6,
            "cutoff": 6,
            "delta": 0.1,
            "covariance_type": "dot_product",
            "zeta": 4,
            "n_sparse":20,
            "sparse_method": "cur_points",
            "atom_gaussian_width":0.6,
            "cutoff_transition_width": 1},
        {
            "distance_2b": True,
            "cutoff": 7,
            "covariance_type": "ard_se",
            "delta": 1,
            "theta_uniform": 1.0,
            "sparse_method": "uniform",
            "n_sparse": 10
        }
    ]
}

remote_info = {
    "sys_name" : "github",
    "job_name" : "gap-fit",
    "resources" : {
        "max_time" : "15m",
        "num_cores" : 2,
        "partitions" : "standard"},
    "check_interval": 5,
}

train_configs = ConfigSet([train_fname, isolated_at_fname])
wfl.fit.gap.simple.run_gap_fit(
    fitting_configs=train_configs,
    fitting_dict=gap_params,
    stdout_file='gap_fit.out',
    skip_if_present=True,
    remote_info=remote_info)

Evaluate structures with GAP#

train_fn_with_gap = "7.1.train.gap.xyz"
test_fn_with_gap = "7.2.test.gap.xyz"
isolated_at_fn_with_gap = isolated_at_fname.replace('.xyz', '.gap.xyz')

inputs = ConfigSet([train_fname, test_fname, isolated_at_fname])
outputs = OutputSpec([train_fn_with_gap, test_fn_with_gap, isolated_at_fn_with_gap])

gap_calc = (Potential, [], {"param_filename":"gap.xml"})

resources = Resources(
    max_time = "15m",
    num_cores = 2,
    partitions = "standard")

# note - set OMP_NUM_THREADS to prevent GAP evaluation from using OpenMP,
# as it clashes with multiprocessing.pool, which wfl uses
remote_info = RemoteInfo(
    sys_name = "github",
    job_name = "gap-eval",
    resources = resources,
    check_interval = 10,
    env_vars = ["OMP_NUM_THREADS=1"],
    input_files = ["gap.xml*"])

gap_calc_autopara_info = AutoparaInfo(
    remote_info=remote_info)

generic.calculate(
    inputs=inputs,
    outputs=outputs,
    calculator=gap_calc,
    properties=["energy", "forces"],
    output_prefix="gap_",
    autopara_info=gap_calc_autopara_info,
    )

Evaluate error & plot correlation#

wfl has simple convenience functions to compare fitted model’s performance to the reference method. Here we calculate atomization energy, evaluate RMSE and plot the parity plots for atomization energy per atom and force components.

# calculate atomization energies
for fn in [train_fn_with_gap, test_fn_with_gap]:
    configset = ConfigSet([fn, isolated_at_fn_with_gap])
    for prop_prefix in ["xtb_", "gap_"]:
        configset = atomization_energy(
            inputs=configset,
            outputs=OutputSpec([fn, isolated_at_fn_with_gap], overwrite=True),
            prop_prefix=prop_prefix)

# calculate errors
inputs = ConfigSet([train_fn_with_gap, test_fn_with_gap])
errors, diffs, parity = error.calc(
    inputs=inputs,
    calc_property_prefix='gap_',
    ref_property_prefix='xtb_',
    config_properties=["atomization_energy/atom"],
    atom_properties=["forces/comp"])

print(error.errors_dumps(errors))

# plot parity and error plots
error.value_error_scatter(
    all_errors = errors,
    all_diffs=diffs,
    all_parity=parity,
    output="gap_rmses.png",
    ref_property_prefix="xtb_",
    calc_property_prefix="gap_"
)

GAP fit workflow with many wfl use-case examples

Contents