Selection or Sampling of Structures


Training machine learning potentials often requires picking a small number of structures out of a large database of configurations (for example, an MD trajectory). The Workflow package offers several ways to do this, depending on the chosen selection criterion (see wfl.select).

A set of mutually distinct structures can be selected by comparing descriptors of the configurations in the database (see wfl.select.by_descriptor). This module provides functions to process descriptors, as well as functions implementing two selection algorithms: leverage-score CUR and greedy farthest point sampling (FPS). Additional options, such as excluding a list of structures or taking previously selected structures into account, can be passed as arguments to the selection functions.
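To illustrate the idea behind greedy FPS, here is a minimal, library-independent sketch. It is not the wfl implementation: the function name greedy_fps, the choice of Euclidean distance, and the fixed starting index are all assumptions made for illustration, and wfl.select.by_descriptor may use a different similarity measure and starting point.

```python
import math


def greedy_fps(descriptors, num, first=0):
    """Return `num` indices chosen by greedy farthest point sampling.

    Illustrative sketch only (hypothetical helper, Euclidean distance assumed).
    """
    selected = [first]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(d, descriptors[first]) for d in descriptors]
    while len(selected) < min(num, len(descriptors)):
        # Pick the point that is farthest from everything selected so far.
        nxt = max(range(len(descriptors)), key=min_dist.__getitem__)
        selected.append(nxt)
        # Update nearest-selected distances with the newly added point.
        min_dist = [
            min(m, math.dist(d, descriptors[nxt]))
            for m, d in zip(min_dist, descriptors)
        ]
    return selected


# Toy 2D "descriptors": two near-duplicates and three well-separated points.
points = [[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
picked = greedy_fps(points, num=3)
print(picked)
```

In this toy set the near-duplicate of the starting point (index 1) is not picked, which is the behaviour that makes FPS useful for assembling diverse training sets.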

In this example, we show how FPS can be used to select n=8 data points from an MD trajectory by comparing the average SOAP descriptors of all configurations in the trajectory. This is done in two steps: (1) assigning a global descriptor to each configuration in the trajectory, and (2) calling the greedy_fps_conf_global function.

To assign a per-config descriptor, we calculate the average SOAP vector for every frame in the MD trajectory. The greedy FPS algorithm then uses these vectors to measure the similarity between data points and selects 8 mutually distinct structures.
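As a rough picture of what a global (per-config) descriptor is, the sketch below averages a set of per-atom vectors into a single normalized vector for one frame. This is a simplification under stated assumptions: average_descriptor is a hypothetical helper, and the descriptor code used in the example may perform the averaging at a different stage of the SOAP construction, so the numbers it produces need not match this sketch.

```python
import math


def average_descriptor(per_atom_descs, normalize=True):
    """Average per-atom vectors into one per-config vector (illustrative only)."""
    n_atoms = len(per_atom_descs)
    dim = len(per_atom_descs[0])
    # Component-wise mean over all atoms in the frame.
    avg = [sum(vec[i] for vec in per_atom_descs) / n_atoms for i in range(dim)]
    if normalize:
        norm = math.sqrt(sum(x * x for x in avg))
        if norm > 0:
            avg = [x / norm for x in avg]
    return avg


# One frame with two atoms, each carrying a 2-component toy descriptor.
frame = [[1.0, 0.0], [0.0, 1.0]]
global_desc = average_descriptor(frame)
print(global_desc)
```

Each frame then contributes exactly one vector, so frames with different numbers of atoms can still be compared directly.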

Overall, this requires two input files: the database (“md.traj”) and the descriptor parameters (“params.yaml”).

Tip: params.yaml can either be written by hand or generated automatically from a “Universal_SOAP-template” processed by the multi-stage GAP fit (see wfl.fit.gap.multistage).
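The example code expects the parsed params.yaml to be a list of descriptor dictionaries, from which only the SOAP entries are kept. The sketch below mimics that filtering on a hand-written list. The specific entries and values are hypothetical placeholders, not the contents of the params.yaml shipped with the example.

```python
# Hypothetical stand-in for the result of yaml.safe_load(params.yaml):
# a list of descriptor dictionaries (values are illustrative only).
parsed_params = [
    {"soap": True, "n_max": 8, "l_max": 4, "cutoff": 5.0, "atom_sigma": 0.5},
    {"distance_2b": True, "cutoff": 6.0},
]

# Keep only SOAP descriptors; copy each dict so the parsed input stays unchanged.
soap_only = [dict(d) for d in parsed_params if "soap" in d]

# Request a global (per-config) descriptor instead of per-atom ones.
for d in soap_only:
    d["average"] = True

print(soap_only)
```

Copying the dictionaries before modifying them is a design choice of this sketch; the example itself modifies the loaded dictionaries in place, which is equally valid when they are not reused.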

 
import numpy as np
import yaml
import pathlib
import wfl
from wfl.configset import ConfigSet, OutputSpec
from wfl.descriptors.quippy import calculate as calc_descriptors
from wfl.select.by_descriptor import greedy_fps_conf_global

work_dir = pathlib.Path(wfl.__file__).parents[1]/"docs/source/examples_files/select_fps"

# Step 1: Assign descriptors to the database
md        = ConfigSet(work_dir/"md.traj")
md_desc   = OutputSpec(files=work_dir/"md_desc.xyz")

with open(work_dir/'params.yaml', 'r') as f:
    desc_dict = yaml.safe_load(f)
desc_dicts = [d for d in desc_dict if 'soap' in d] # keep only the SOAP descriptors
per_atom = True
for param in desc_dicts:
    param['average']= True # to create global (per-conf) descriptors instead of local (per-atom)
    per_atom = False
md_desc = calc_descriptors(inputs=md, outputs=md_desc, descs=desc_dicts, key='desc', per_atom=per_atom)

# Step 2: Sampling
fps_out          = OutputSpec(files=work_dir/"out_fps.xyz")
nsamples         = 8
selected_configs = greedy_fps_conf_global(
    inputs=md_desc,
    outputs=fps_out,
    num=nsamples,
    at_descs_info_key='desc',
    keep_descriptor_info=False,
    rng=np.random.default_rng(),
)