Overall design

The main goal of the Workflow package is to make it easy to operate on sets of input atomic configurations, typically applying the same operation to each one and returning corresponding sets of output configurations. There are also functions that do not fit this structure but use the same data types, or are otherwise useful.

Most operations in Workflow take in

  • an iterator, usually a ConfigSet (see below), which returns ASE Atoms objects

  • an OutputSpec (see below) indicating where to store the returned ASE Atoms objects

and return

  • a ConfigSet containing the output configurations.
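
A minimal sketch of this calling convention is shown below. ConfigSet and OutputSpec are imported from wfl.configset; some_operation is a hypothetical placeholder for any Workflow operation that follows this convention, not a real function.

    from wfl.configset import ConfigSet, OutputSpec

    inputs = ConfigSet("initial_configs.xyz")      # iterator over the input Atoms
    outputs = OutputSpec("processed_configs.xyz")  # where the output Atoms will be stored

    # hypothetical operation following the convention; returns a ConfigSet of the outputs
    # processed = some_operation(inputs, outputs)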

These two classes abstract the storage of atomic configurations in memory or in files (CURRENTLY UNSUPPORTED: the ABCD database). A ConfigSet used for input may be initialised with any of the following (illustrated in the sketch after this list):

  • a list (or list of lists) of ASE Atoms objects in memory

  • one or more filenames that can be read by ase.io.read(), such as .extxyz

  • a list of ConfigSet objects that use the same type of storage

  • [CURRENTLY UNSUPPORTED] a query to an ABCD database
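
The sketch below illustrates these initialisation options, assuming the ConfigSet constructor accepts each of them as its first positional argument; the file names are placeholders.

    from ase.build import bulk
    from wfl.configset import ConfigSet

    # Atoms objects already in memory (a list, or list of lists)
    cs_mem = ConfigSet([bulk("Si"), bulk("Cu")])

    # one or more files readable by ase.io.read()
    cs_file = ConfigSet("configs.extxyz")
    cs_files = ConfigSet(["batch_1.extxyz", "batch_2.extxyz"])

    # other ConfigSet objects that use the same type of storage
    cs_combined = ConfigSet([cs_file, cs_files])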

Similarly, returned configurations can be held in memory or in file(s) [CURRENTLY UNSUPPORTED: the ABCD database], depending on the arguments to the OutputSpec constructor. The Workflow function returns a ConfigSet generated by OutputSpec.to_ConfigSet(), which can be used to access the output configs. This way, an operation may iterate over a ConfigSet and write Atoms to an OutputSpec, regardless of how the input configs were supplied or how and where the output configs are going to be collected.
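
As a sketch of this pattern, the toy operation below reads from any ConfigSet and writes to any OutputSpec; the store() and close() method names are assumed here for writing configurations, while to_ConfigSet() is described above.

    from wfl.configset import ConfigSet, OutputSpec

    def rattle_configs(inputs, outputs, stdev=0.05):
        """Sketch of an operation: perturb each input config and collect the results."""
        for atoms in inputs:
            at = atoms.copy()
            at.rattle(stdev=stdev)   # small random displacement of all atoms
            outputs.store(at)        # store() / close() are assumed OutputSpec methods
        outputs.close()
        return outputs.to_ConfigSet()

    # the loop above does not care whether storage is in memory or in files
    out_configs = rattle_configs(ConfigSet("in.xyz"), OutputSpec("out.xyz"))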

In addition to this abstraction of Atoms storage, Workflow makes it easy to parallelize operations over sets of configurations and/or run them as (possibly remote) queued jobs, and this has been implemented for most of its operations. This is achieved by wrapping the operation in a call to wfl.pipeline.autoparallelize. In addition to parallelizing over readily accessible cores, the operations may be executed as a number of independently queued jobs on an HPC cluster with the help of ExPyRe.

Some parts of Workflow (e.g. how many parallel processes to run) are controlled via environment variables. The most commonly used ones are

  • WFL_NUM_PYTHON_SUBPROCESSES, which controls how many Python processes (on the same node) are used to parallelize a single operation

  • WFL_EXPYRE_INFO, which controls what HPC resources will be used for a remote job
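
For example (a sketch only; these variables would more typically be exported in the shell or in a job script before the workflow script runs):

    import os

    # use 8 Python processes on the local node for each parallelized operation
    os.environ["WFL_NUM_PYTHON_SUBPROCESSES"] = "8"

    # WFL_EXPYRE_INFO selects the HPC resources used for remote jobs via ExPyRe;
    # its expected value format is not shown here
    # os.environ["WFL_EXPYRE_INFO"] = ...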