The following sections elaborate on the design principles on the software side. The goal is to make it clear how different modules in ESPEI fit together and where to find specific functionality to override or improve.
ESPEI provides tools to
- Parameterize CALPHAD models by optimizing the compromise between model accuracy and complexity. We typically call this parameter generation or model selection.
- Fit parameterized CALPHAD models to multi-phase or other custom data with uncertainty quantification via Markov chain Monte Carlo
ESPEI has two levels of API that users should expect to interact with:
- Input from YAML files on the command line (via
espei --input <input_file>) or by Python via the
- Work directly with the Python functions for parameter selection
YAML files are the recommended way to use ESPEI and should have a way to express most if not all of the options that
the Python functions support. The schema for YAML files is located in the root of the ESPEI directory as
and is validated in the
espei_script.py module by the Cerberus package.
- espei_script.py is the main entry point for the YAML input API.
- optimizers is a package that defines an OptimizerBase class for writing optimizers.
- error_functions is a package with modules for each type of likelihood function.
- paramselect.py is where parameter generation happens.
- mcmc.py creates the likelihood function and runs MCMC. Deprecated. In the future, users should use
- parameter_selection is a package with core pieces of parameter selection.
- utils.py contains utilities with reuse potential across several parts of ESPEI.
- plot.py holds plotting functions.
- datasets.py manages validating and loading datasets into a TinyDB in-memory database.
- core_utils.py contains legacy utility functions that should be refactored out to be closer to the individual modules and packages where they are used.
Parameter selection goes through the
generate_parameters function in the
The goal of parameter selection is to go through each phase (one at a time) and fit a CALPHAD model to the data.
For each phase, the endmembers are fit first, followed by binary and ternary interactions. For each individual endmember or interaction to fit, a series of candidate models are generated that have increasing complexity in both temperature and interaction order (an L0 excess parameter, L0 and L1, …).
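The way candidate models grow in complexity can be sketched in plain Python. This is only an illustration: the term list, the function names, and the exact enumeration order are assumptions made here and are not taken from ESPEI's source, which may build its candidates differently.

```python
# Hypothetical temperature terms, ordered from simplest to most complex.
TEMPERATURE_TERMS = ["1", "T", "T*ln(T)", "T**2"]


def temperature_candidates(terms):
    """Return nested feature sets: [1], [1, T], [1, T, T*ln(T)], ..."""
    return [list(terms[: n + 1]) for n in range(len(terms))]


def interaction_candidates(terms, max_order):
    """Build candidates that grow in interaction order (L0, then L0 and L1, ...).

    Each candidate is a list of (order, temperature_term) pairs. As a
    simplification, every interaction order shares the same temperature terms.
    """
    candidates = []
    for highest_order in range(max_order + 1):
        for temp_features in temperature_candidates(terms):
            candidates.append(
                [
                    (order, term)
                    for order in range(highest_order + 1)
                    for term in temp_features
                ]
            )
    return candidates


# Two temperature terms and interaction orders up to L1.
candidates = interaction_candidates(TEMPERATURE_TERMS[:2], max_order=1)
for candidate in candidates:
    print(candidate)
```

Each printed list is one candidate model; every candidate would then be fit and scored, as described next.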
Each model is then fit by
espei.parameter_selection.selection.fit_model, which currently uses a simple
pseudo-inverse linear model from scikit-learn. Then the tradeoff between the goodness of fit and the model complexity
is scored by the AICc in
The best-scoring model is accepted, and its fitted parameters become the degrees of freedom for the MCMC step.
The main principle is that ESPEI transforms the data and candidate models to vectors and matrices that fit a typical machine learning type problem of \(Ax = b\). Extending ESPEI to use different or custom models in the current scheme basically comes down to formulating candidate models in terms of this type of problem. The main way to improve on the fitting or scoring methods used in parameter selection is to override the fit and score functions.
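As a rough sketch of this fit-and-score loop, the example below solves \(Ax = b\) with NumPy's pseudo-inverse and ranks three candidate feature matrices with a common least-squares form of the AICc. The function names `fit_model` and `score_aicc`, the synthetic data, and the exact AICc expression are assumptions for illustration; ESPEI's own implementation may differ in detail.

```python
import numpy as np


def fit_model(feature_matrix, data):
    """Solve A x = b in the least-squares sense using the pseudo-inverse."""
    return np.linalg.pinv(feature_matrix) @ data


def score_aicc(feature_matrix, data, coefficients):
    """Corrected Akaike information criterion (lower is better).

    Uses the least-squares form AIC = 2k + n*ln(RSS/n), plus the
    small-sample correction (2k^2 + 2k) / (n - k - 1).
    """
    n, k = feature_matrix.shape
    rss = float(np.sum((feature_matrix @ coefficients - data) ** 2))
    aic = 2 * k + n * np.log(rss / n)
    return aic + (2 * k**2 + 2 * k) / (n - k - 1)


# Synthetic "data": linear in x with a small deterministic perturbation.
x = np.arange(10, dtype=float)
b = 2.0 + 3.0 * x + 0.1 * (-1.0) ** np.arange(10)

# Candidate feature matrices A of increasing complexity.
candidates = {
    "constant": np.column_stack([np.ones_like(x)]),
    "linear": np.column_stack([np.ones_like(x), x]),
    "quadratic": np.column_stack([np.ones_like(x), x, x**2]),
}

scores = {}
for name, A in candidates.items():
    coefficients = fit_model(A, b)
    scores[name] = score_aicc(A, b, coefficients)

# The candidate with the lowest AICc balances accuracy against complexity.
best = min(scores, key=scores.get)
print(best, scores)
```

Swapping in a different fitting or scoring method amounts to replacing one of the two functions while keeping the \(Ax = b\) formulation of the candidates.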
Currently the capabilities for providing custom models or contributions (e.g. magnetic data) in the form of generic pycalphad Model objects are limited. This is also true for custom types of data that one would use in fitting a custom model.
MCMC optimization and uncertainty quantification
Most of the Markov chain Monte Carlo optimization and uncertainty quantification happen in the
espei.optimizers.opt_mcmc.py module through the
EmceeOptimizer is a subclass of
OptimizerBase, which defines an interface for performing optimizations of parameters. It defines several methods:
fit takes a list of symbol names and datasets to fit to. It calls an
_fit method that returns an
OptNode representing the parameters that result from the fit to the datasets.
fit evaluates the parameters by calling the objective function on some parameters (an array of values) and a context in the
predict method, which is overridden by
There is also an interface for storing a history of successive fits to different parameter sets, using the
commit method, which will store the history of the calls to
fit in a graph of fitting steps.
The idea is that users can generate a graph of fitting results and go back to specific points on the graph and test fitting different sets of parameters or different datasets, creating a unique history of committed parameter sets and optimization paths, similar to a history in version control software like git.
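The fit and commit idea can be illustrated with a small stand-alone sketch. The classes `FitNode` and `SketchOptimizer`, and the placeholder "optimization" inside `_fit`, are hypothetical and written only for this example; they mimic the described behavior and are not ESPEI's actual classes.

```python
class FitNode:
    """Hypothetical record of one fit: its parameters and the node it started from."""

    def __init__(self, parameters, parent=None):
        self.parameters = dict(parameters)
        self.parent = parent

    def history(self):
        """Walk back to the root, returning nodes from oldest to newest."""
        node, path = self, []
        while node is not None:
            path.append(node)
            node = node.parent
        return path[::-1]


class SketchOptimizer:
    """Hypothetical optimizer with fit/commit semantics like those described above."""

    def __init__(self, initial_parameters):
        self.head = FitNode(initial_parameters)
        self.staged = None

    def _fit(self, symbols, target):
        # Placeholder "optimization": set each requested symbol to the target value.
        updated = dict(self.head.parameters)
        updated.update({symbol: target for symbol in symbols})
        return FitNode(updated, parent=self.head)

    def fit(self, symbols, target):
        """Run a fit starting from the current head; the result is staged, not accepted."""
        self.staged = self._fit(symbols, target)
        return self.staged

    def commit(self):
        """Accept the staged fit so that later fits build on it."""
        if self.staged is None:
            raise RuntimeError("nothing to commit; call fit() first")
        self.head, self.staged = self.staged, None


opt = SketchOptimizer({"A": 0.0, "B": 0.0})
opt.fit(["A"], 1.0)
opt.commit()
branch_point = opt.head

opt.fit(["B"], 2.0)
opt.commit()
first_branch = opt.head

# Go back to an earlier point in the graph and try a different fit.
opt.head = branch_point
opt.fit(["B"], 5.0)
opt.commit()
second_branch = opt.head
```

Both branches share `branch_point` as their parent, so each keeps its own history of committed parameter sets, much like two branches in git.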
The main reason ESPEI’s parameter selection and MCMC routines are split up is that custom Models or existing TDB files can be provided and fit. In other words, if you are using a model that doesn’t need parameter selection or is for a property that is not Gibbs energy, MCMC can fit it with uncertainty quantification.
The general process is
- Take a database with degrees of freedom as database symbols named
#### is a number, e.g.
0001. The symbols correspond to
FUNCTION in the TDB files.
- Initialize those degrees of freedom to a starting distribution for ensemble MCMC.
The starting distribution is controlled by the
EmceeOptimizer.initialize_new_chains function, which currently supports initializing the parameters to a Gaussian ball.
- Use the emcee package to run ensemble MCMC
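The initialization step in the list above can be sketched with the standard library alone. The function name `gaussian_ball`, the `relative_std` parameter, and the choice to scale the spread by each parameter's magnitude are assumptions made for this illustration; they are not a description of how `EmceeOptimizer.initialize_new_chains` is implemented.

```python
import random


def gaussian_ball(parameters, n_walkers, relative_std=0.1, seed=0):
    """Draw one starting point per walker from a Gaussian around the initial values.

    The standard deviation of each parameter is a fraction of its magnitude,
    so a parameter that starts at exactly zero gets no spread in this sketch.
    """
    rng = random.Random(seed)
    return [
        [rng.gauss(p, abs(p) * relative_std) for p in parameters]
        for _ in range(n_walkers)
    ]


# Hypothetical starting values for three degrees of freedom.
start = [-10000.0, 5.0, 0.25]

# Ensemble samplers need several walkers per parameter; two per parameter here.
walkers = gaussian_ball(start, n_walkers=2 * len(start))
for walker in walkers:
    print(walker)
```

The resulting list of walker positions (one row per walker, one column per parameter) is the kind of starting state an ensemble sampler would then evolve.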
ESPEI’s MCMC is quite flexible for customization. To fit a custom model, it just needs to be read by pycalphad and
have correctly named degrees of freedom (
To fit an existing or custom model to new types of data, just write a function that takes in datasets and the parameters
that are required to calculate the values (e.g. pycalphad Database, components, phases, …) and returns the error.
Then override the
EmceeOptimizer.predict function to include your custom error contribution.
There are examples of these functions in
espei.error_functions that ESPEI uses by default.
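A custom error contribution might look like the following sketch, which assumes the error is expressed as a Gaussian log-likelihood and that the total is the sum of all contributions. The toy linear model, the dataset layout, and every function name here are hypothetical; a real error function would compute its values with pycalphad and would follow the signatures used in `espei.error_functions`.

```python
import math


def gaussian_log_likelihood(calculated, observed, sigma):
    """Sum of independent Gaussian log-probabilities of the residuals."""
    return sum(
        -0.5 * ((c - o) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
        for c, o in zip(calculated, observed)
    )


def custom_property_error(parameters, dataset):
    """Hypothetical error function: compare a toy linear model with measured values."""
    slope, intercept = parameters
    calculated = [slope * x + intercept for x in dataset["x"]]
    return gaussian_log_likelihood(calculated, dataset["values"], dataset["sigma"])


def predict(parameters, datasets, error_functions):
    """Total log-likelihood: the sum of every error contribution."""
    return sum(func(parameters, data) for func, data in zip(error_functions, datasets))


# Hypothetical dataset generated by values = 2*x + 1.
dataset = {"x": [0.0, 1.0, 2.0], "values": [1.0, 3.0, 5.0], "sigma": 1.0}

good = predict((2.0, 1.0), [dataset], [custom_property_error])
bad = predict((0.0, 0.0), [dataset], [custom_property_error])
print(good, bad)
```

Parameters that reproduce the data receive the higher log-likelihood, which is what steers the MCMC walkers toward them.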
Modifications to how parameters are initialized can be made by subclassing
Many other modifications can be made by subclassing