The gap_fit program

In order to fit an interatomic potential to data with gap_fit, you need

  1. Data

  2. A definition of descriptors and kernels that serve as basis functions

  3. Some parameters that control the least squares fit

We will go over each of these in turn, and reference the relevant gap_fit options.


The input file to gap_fit is a series of atomic structures (they could be molecules, or periodic systems), with atomic numbers, cartesian positions of the atoms, and some associated data. The required file format is extended XYZ, which is similar to a concatenation of ordinary XYZ, except that the second line (after the first line that just contains the number of atoms) is not free format, but has to be a series of key=value pairs, with some mandatory keys and other restrictions. See the detailed definition [here].

The data associated with each structure is an energy, optionally forces, and optionally virial stress. All of these are specified in the extended XYZ file. Following the definition of the extended XYZ file, the energy needs to be in units of eV, the forces in units of eV/A and the virial stress in units of eV (this is the regular stress multiplied by the volume). The atomic positions need to be in Angstroms, and the periodic unit cell is specified by three cartesian lattice vectors.

An example structure in extended XYZ format:

config_type=isolated_atom gap_energy=-157.72725320 dft_virial="0.00000000       0.00000000       0.00000000       0.00000000       0.00000000       0.00000000       0.00000000       0.00000000       0.00000000" dft_energy=-158.54496821 nneightol=1.20000000 pbc="T T T" Lattice="20.00000000       0.00000000       0.00000000       0.00000000      20.00000000       0.00000000       0.00000000       0.00000000      20.00000000" Properties=species:S:1:pos:R:3:Z:I:1:map_shift:I:3:n_neighb:I:1:gap_force:R:3:dft_force:R:3
Si             10.00000000     10.00000000     10.00000000      14      -1      -1      -1       0      0.00000000      0.00000000      0.00000000      0.00000000      0.00000000      0.00000000```

The structure above has lots of data in it, not all of it is useful for fitting (e.g. gap_energy happens to be a prediction by some model, not relevant for making models, or nneightol which was inserted by some earlier manupulation of the file). Note how the value of the mandatory Properties key specifies the names and types of the columns in the rest of the configuration. S stands for ‘string’, R stands for ‘real’, I is for ‘integer’ and L is for logical. Columns species and pos (atom types and cartesian positions, respectively) are mandatory.

All structures in the extended XYZ format need a periodic lattice unit cell (and defined by the ``Lattice`` key), even if a particular direction (or any direction) is not supposed to be periodic. Just use a lattice vector which is bigger than twice the cutoff of any potential you plan to create or use.

To get good fits, it is extremely helpful of the energy and the forces are consistent, i.e. up to discretisation error (e.g. in k-space in case of a DFT code), the forces are the negative derivatives of the energy. E.g. if the data comes from density functional theory with a finite electronic temperature (as is almost always used to help convergence, even in the case of insulators, when there are defects), it is the free energy (i.e. including the electronic entropy) that is consistent with the Hellman-Feynman forces, rather than the potential energy. In ASE, this is obtained by specifying the force_consistent=true option to Atoms.get_potential_energy().

Fits are tremendously improved by including forces. For 3N atoms, there is only one energy per structure, but 3N force components (obtained at almost the same cost as the single energy), so therefore contributing a lot more data. Virial stresses are again really important, especially for obtaining elastic constants for periodic solids. This is because the forces are always zero for an isotropic change away from equilibrium, and give no information on the curvature of the potential energy surface.

The type of data included can vary from configuration to configuration, there is no restriction on having to include the same type of data for all configurations.

Data field names

Conceptually, the data the potential is fitted to are the energy and its derivatives (forces and stresses), but the names with which these data are referred to in the XYZ file are arbitrary (but have to be consistent in the whole XYZ file). Unless otherwise indicated, gap_fit looks for data with keys energy, force and virial, and this can be controlled using the command line options energy_parameter name, force_parameter_name, and virial_parameter_name, respectively. This facilitates specifying several types of energies and related quantities in the same XYZ, and selected the one you want to fit to on the command line. For example, energies computed with different DFT functionals can be stored in the same XYZ file, or even with a completely different method such as QMC.

The above example has the keys dft_energy for example, so when it is used in fitting, the appropriate key names need to be specified.

Isolated atom

Most of the time, it makes sense to fit an interatomic potential to the binding energy, i.e. the total energy minus the energy of isolated atoms. There are several mechanisms in gap_fit to aid this. The default behaviour is to look for a configuration among the input structures in the input XYZ file that contains only one atom, and has an energy specified for it. This energy will then be taken and used to calculate the binding energies of all configurations before the fit is made. We give it the name e0, and it is also stored in the XML file that specifies the potential. When predictions are made, the e0 value is added back to the prediction of the binding energy. If the input XYZ contains multiple atom types, each one needs to appear once by itself among the configurations.

Alternatively, if isolated atoms are not part of the input file, a value for e0 can be specified on the command line, in case of multiple elements as a series of numbers: e0={H:123.456:O:789.10...}. There is also an option to have an e0 value be taken as the average per-atom energy of the input data. This only makes sense if there is only one type of atom.

Note that if you create a dataset where the specified energies are already binding energies, you should still specify 0.0 as the isolated atom energy (either via a configuration, preferred, or on the command line), otherwise the interatomic potential might (depending on the descriptor) not formally be zero for isolated atoms.

Descriptors and Kernels

blah blah

blah blah

Global fit parameters

blah blah

blah blah

Command line example

Here is an annotated fitting example.

gap_fit # input data in extended XYZ format
  gap={                              # start of descriptor and kernel spec
  distance_Nb                       # first descriptor is interatomic distance based
  order=2                           # descriptor is 2-body (i.e. a pair potential)
  cutoff=5.0                        # distance cutoff in the kernel, in Angstrom
  n_sparse=15                       # number of representative points, M in Sec. II
  covariance_type=ard_se            # form of kernel: squared exponential (Gaussian)
  delta=2.0                         # scaling of kernel, per descriptor, here it is per atom pair, in eV
  theta_uniform=2.5                 # length scale in Gaussian kernel in Angstrom
  sparse_method=uniform             # choice of representative points, here a uniform grid up to the cutoff
  compact_clusters=T                # how cutoff is applied, here a spherical manner around each atom
  :                                 # separator between descriptors
  soap                              # second descriptor is a SOAP
  l_max=6 n_max=12                  # number of angular and radial basis functions for SOAP
  atom_sigma=0.5                    # Gaussian smearing width of atom density for SOAP, in Angstrom
  cutoff=5.0                        # distance cutoff in the kernel, in Angstrom
  radial_scaling=-0.5               # exponent of atom density scaling, power of distance
  cutoff_transition_width=1.0       # distance across which kernel is smoothly taken to zero, in Angstrom
  central_weight=1.0                # relative weight of central atom in atom density for SOAP
  n_sparse=8000                     # number of representative points, M in Sec. II
  delta=0.2                         # scaling of kernel, per descriptor, here for SOAP it is per atom, in eV
  covariance_type=dot_product       # form of kernel
  zeta=4                            # power kernel is raised to - together with dot_product gives a polynomial kernel
  sparse_method=cur_points          # choice of representative points, here CUR decomposition of descriptor matrix
  }                                 # end of descriptor and kernel spec
 default_sigma={0.002 0.2 0.2 0.0}  # default regularisation corresponding to energy, force, virial, hessian
 config_type_sigma={                # start of per configuration-group regularisation spec, using groups defined in the input data file
  }                                 # end of per configuration-group regularisation spec
 energy_parameter_name=energy       # name of the key in the input data file corresponding to the total energy
 force_parameter_name=forces        # name of the key in the input data file corresponding to the forces
 virial_parameter_name=virial       # name of the key in the input data file corresponding to the virial stress
 sparse_jitter=1.0e-8               # extra diagonal regulariser
 do_copy_at_file=F                  # copy input data into potential XML file?
 sparse_separate_file=T             # write representative point data into a separate file not in the main potential XML
 gp_file=gap.xml                    # name of output potential XML file
 core_param_file=P_r6_innercut.xml  # name of XML file containing the baseline potential (QUIP format)
 core_ip_args={IP Glue}             # initialisation string to call baseline potential

Command line options

GAP options

sparse_method options are:
  • RANDOM: default, chooses n_sparse random datapoints

  • PIVOT: based on the full covariance matrix finds the n_sparse “pivoting” points

  • CLUSTER: based on the full covariance matrix performs a k-medoid clustering into n_sparse clusters, returning the medoids

  • UNIFORM: makes a histogram of the data based on n_sparse and returns a data point from each bin

  • KMEANS: k-means clustering based on the data points

  • COVARIANCE: greedy data point selection based on the sparse covariance matrix, to minimise the GP variance of all datapoints

  • UNIQ: selects unique datapoints from the dataset

  • FUZZY: fuzzy k-means clustering

  • FILE: reads sparse points from a file

  • INDEX_FILE: reads indices of sparse points from a file

  • CUR_COVARIANCE: CUR, based on the full covariance matrix

  • CUR_POINTS: CUR, based on the datapoints