# Overall design
The main goal of the Workflow package is to make it easy to operate on sets of input atomic configurations, typically applying the same operation to each one and returning corresponding sets of output configurations. There are also functions that do not fit this structure but use the same data types, or are otherwise useful.
Most operations in Workflow take in

- an iterator, usually a `ConfigSet` (see below), which returns ASE `Atoms` objects
- an `OutputSpec` (see below) indicating where to store the returned ASE `Atoms` objects

and return

- a `ConfigSet` containing the output configurations.
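A sketch of this calling convention follows. The `tag_configs` operation is purely illustrative (it is not part of wfl), and the `OutputSpec` method names used inside it (`store()`, `end_write()`, `to_ConfigSet()`) are those of recent wfl versions:

```python
from wfl.configset import ConfigSet, OutputSpec

def tag_configs(inputs, outputs):
    """Illustrative Workflow-style operation: label each configuration
    with its index and pass it on to the output storage."""
    for i, atoms in enumerate(inputs):
        atoms.info["config_index"] = i
        outputs.store(atoms)
    outputs.end_write()
    return outputs.to_ConfigSet()

# inputs: an iterator over Atoms; outputs: where results should go;
# return value: a ConfigSet of the processed configurations
tagged = tag_configs(ConfigSet("input.xyz"), OutputSpec("tagged.xyz"))
```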
These two classes abstract the storage of atomic configurations in memory,
files (CURRENTLY UNSUPPORTED: or the ABCD database). A `ConfigSet` used for input may be
initialised with

- a list (or list of lists) of ASE `Atoms` objects in memory
- one or more filenames that can be read by `ase.io.read()`, such as `.extxyz`
- a list of `ConfigSet` objects that use the same type of storage
- [CURRENTLY UNSUPPORTED] a query to an ABCD database
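For instance, construction might look like the following sketch. The positional constructor argument is an assumption based on recent wfl releases; the exact arguments have changed between versions:

```python
from ase import Atoms
from wfl.configset import ConfigSet

# from a list of Atoms objects already in memory
cs_mem = ConfigSet([Atoms("H2"), Atoms("O2")])

# from one or more files readable by ase.io.read()
cs_file = ConfigSet("structures.extxyz")
cs_files = ConfigSet(["bulk.extxyz", "surfaces.extxyz"])

# from other ConfigSet objects that use the same type of storage
cs_combined = ConfigSet([cs_file, cs_files])
```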
Similarly, returned configurations can be held in memory or in
file(s) [currently unsupported: ABCD], depending on the arguments to the `OutputSpec`
constructor. The workflow function returns a `ConfigSet` generated by
`OutputSpec.to_ConfigSet()`, which can be used to access the output
configs. This way, an operation may iterate over a `ConfigSet`
and write `Atoms` to an `OutputSpec`, regardless of how the input
configs were supplied or how and where the output configs are going
to be collected.
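The sketch below illustrates this storage independence; it again assumes the `store()`, `end_write()` and `to_ConfigSet()` methods of recent wfl versions. The same operation body works whether the output is kept in memory or written to a file:

```python
from ase import Atoms
from wfl.configset import ConfigSet, OutputSpec

def double_cell(inputs, outputs):
    # the operation body does not depend on how the configs are stored
    for atoms in inputs:
        outputs.store(atoms.repeat((2, 2, 2)))
    outputs.end_write()
    return outputs.to_ConfigSet()

# input from memory, output kept in memory (OutputSpec with no file argument)
in_memory = double_cell(ConfigSet([Atoms("Si2", cell=[3, 3, 3], pbc=True)]),
                        OutputSpec())

# input from a file, output written to a file; the operation is unchanged
on_disk = double_cell(ConfigSet("in.extxyz"), OutputSpec("out.extxyz"))
```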
In addition to this abstraction of Atoms storage, the workflow makes
it easy to parallelize operations over sets of configurations and/or
run them as (possibly remote) queued jobs, and this has been implemented
for most of its operations. This is achieved by wrapping the operation in a
call to `wfl.pipeline.autoparallelize`. In addition to parallelizing
over locally available cores, the operations may be executed as a number
of independently queued jobs on an HPC cluster with the help of
ExPyRe.
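A minimal sketch of this wrapping pattern is given below. The import path follows the `wfl.pipeline.autoparallelize` name used above, but both the import path and the exact call signature vary between wfl versions, so treat the names here as indicative rather than definitive:

```python
from wfl.configset import ConfigSet, OutputSpec
# import path as referenced above; some wfl versions expose
# wfl.autoparallelize.autoparallelize instead
from wfl.pipeline import autoparallelize

def _rattle_chunk(atoms_list, stdev=0.01):
    """Operate on one chunk of configurations (a list of Atoms)
    and return the corresponding outputs."""
    for atoms in atoms_list:
        atoms.rattle(stdev=stdev)
    return atoms_list

def rattle(inputs, outputs, stdev=0.01, **kwargs):
    # sketch of the wrapping pattern; the exact autoparallelize
    # signature differs between wfl versions
    return autoparallelize(_rattle_chunk, inputs, outputs, stdev=stdev, **kwargs)

rattled = rattle(ConfigSet("in.xyz"), OutputSpec("rattled.xyz"), stdev=0.05)
```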
Some parts of Workflow (e.g. how many parallel processes to run) are controlled via environment variables. The most commonly used ones are:

- `WFL_NUM_PYTHON_SUBPROCESSES`, which controls how many Python processes (on the same node) are used to parallelize a single operation
- `WFL_EXPYRE_INFO`, which controls what HPC resources will be used for a remote job
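These can be set in the shell before launching a script, or from Python before any parallelized operation is called, as sketched below. The values shown are only illustrative, and treating `WFL_EXPYRE_INFO` as a path to a JSON resource specification is an assumption; its exact format is not covered here:

```python
import os

# limit each parallelized operation to 8 Python subprocesses on this node
os.environ["WFL_NUM_PYTHON_SUBPROCESSES"] = "8"

# WFL_EXPYRE_INFO describes the HPC resources for remote jobs; the value
# here (a path to a JSON specification) is assumed for illustration only
os.environ["WFL_EXPYRE_INFO"] = "remote_resources.json"
```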