Overall design
The main goal of the Workflow package is to make it easy to operate on sets of input atomic configurations, typically applying the same operation to each one and returning corresponding sets of output configurations. There are also functions that do not fit this structure but use the same data types, or are otherwise useful.
Most operations in Workflow take in

- an iterator, usually a `ConfigSet` (see below), which returns ASE `Atoms` objects
- an `OutputSpec` (see below) indicating where to store the returned ASE `Atoms` objects

and return a `ConfigSet` containing the output configurations.
These two classes abstract the storage of atomic configurations in memory or in files (CURRENTLY UNSUPPORTED: or the ABCD database). A `ConfigSet` used for input may be initialised with

- a list (or list of lists) of ASE `Atoms` objects in memory
- one or more filenames that can be read by `ase.io.read()`, such as `.extxyz`
- a list of `ConfigSet` objects that use the same type of storage
- [CURRENTLY UNSUPPORTED] a query to an ABCD database
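To make the initialisation options concrete, here is a toy stand-in (hypothetical names, not the real wfl `ConfigSet`) that dispatches on the type of its source argument, with a `read_file` stub in place of `ase.io.read()`:

```python
def read_file(path):
    # Stand-in for ase.io.read(): here, each non-empty line is one "config".
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

class ToyConfigSet:
    """Minimal illustration of input dispatch; NOT the real wfl ConfigSet."""
    def __init__(self, source):
        if isinstance(source, str):
            source = [source]
        if source and all(isinstance(item, str) for item in source):
            # one or more filenames readable by the reader function
            self.items = [cfg for path in source for cfg in read_file(path)]
        else:
            # in-memory objects (ASE Atoms objects in the real package)
            self.items = list(source)

    def __iter__(self):
        return iter(self.items)
```

In the real package the same idea lets downstream code iterate over configurations without caring whether they came from memory or from files.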
Similarly, returned configurations can be held in memory or in file(s) [currently unsupported: ABCD], depending on the arguments to the `OutputSpec` constructor. The workflow function returns a `ConfigSet` generated by `OutputSpec.to_ConfigSet()`, which can be used to access the output configs. This way, an operation may iterate over a `ConfigSet` and write `Atoms` to an `OutputSpec`, regardless of how the input configs were supplied or how and where the output configs are going to be collected.
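The overall flow described above can be sketched with toy stand-ins (hypothetical classes, not the real wfl API): an operation iterates over its input, writes each result to the output object, and hands back a new input-like object via `to_ConfigSet()`:

```python
class ToyConfigSet:
    """Iterable over configurations; stand-in for wfl's ConfigSet."""
    def __init__(self, items):
        self.items = list(items)
    def __iter__(self):
        return iter(self.items)

class ToyOutputSpec:
    """Collects output configurations; stand-in for wfl's OutputSpec."""
    def __init__(self):
        self.items = []
    def write(self, config):
        self.items.append(config)
    def to_ConfigSet(self):
        return ToyConfigSet(self.items)

def scale_configs(inputs, outputs, factor=2.0):
    # "The same operation on each configuration": here just scaling numbers;
    # in wfl each config would be an ASE Atoms object.
    for config in inputs:
        outputs.write(config * factor)
    return outputs.to_ConfigSet()

result = scale_configs(ToyConfigSet([1.0, 2.0]), ToyOutputSpec())
```

The operation body never needs to know whether `inputs` came from memory or files, which is exactly the decoupling the two real classes provide.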
In addition to this abstraction of `Atoms` storage, Workflow makes it easy to parallelize operations over sets of configurations and/or run them as (possibly remote) queued jobs, and this has been implemented for most of its operations. It is achieved by wrapping the operation in a call to `wfl.pipeline.autoparallelize`. In addition to parallelising on readily accessible cores, the operations may be executed in a number of independently queued jobs on an HPC cluster with the help of ExPyRe.
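The autoparallelization idea can be illustrated with a plain-Python sketch (a hypothetical helper, not `wfl.pipeline.autoparallelize` itself): split the input iterable into chunks and apply the per-configuration operation to each chunk concurrently. Threads are used here for brevity; the real package uses Python subprocesses and can also dispatch chunks as queued jobs.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def toy_autoparallelize(op, inputs, num_workers=4, chunksize=2):
    # Apply op to every config, processing chunks concurrently.
    # The real wfl wrapper also handles OutputSpec storage and remote jobs.
    def do_chunk(chunk):
        return [op(config) for config in chunk]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(do_chunk, chunked(inputs, chunksize)))
    # flatten the per-chunk results back into one ordered list
    return [config for chunk in results for config in chunk]
```

`pool.map` preserves input order, so the flattened output corresponds one-to-one with the input configurations, matching the "same operation on each config" pattern described above.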
Some parts of Workflow (e.g. how many parallel processes to run) are controlled via environment variables. The most commonly used ones are

- `WFL_NUM_PYTHON_SUBPROCESSES`, which controls how many Python processes (on the same node) are used to parallelize a single operation
- `WFL_EXPYRE_INFO`, which controls what HPC resources will be used for a remote job
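As an illustration of the pattern (the exact defaults and fallback behaviour inside wfl may differ), such a variable could be consulted like this:

```python
import os

def num_python_subprocesses(default=1):
    # Fall back to `default` when WFL_NUM_PYTHON_SUBPROCESSES is unset or empty.
    value = os.environ.get("WFL_NUM_PYTHON_SUBPROCESSES", "")
    return int(value) if value.strip() else default
```

Setting the variable in the shell (e.g. `export WFL_NUM_PYTHON_SUBPROCESSES=8`) before running a script is then enough to change the degree of parallelism without touching the code.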