Many applications in science require that computational models and data be combined. In a Bayesian framework, this is usually done by defining likelihoods based on the mismatch of model outputs and data. However, matching model outputs and data in this way can be unnecessary or impossible. For example, using large amounts of steady state data is unnecessary because these data are redundant. Assimilating data in chaotic systems is numerically difficult. It is often impossible to assimilate data of a complex system into a low-dimensional model. As a specific example, consider a low-dimensional stochastic model for the dipole of the Earth's magnetic field, in which other field components are ignored. The above issues can be addressed by selecting features of the data and defining likelihoods based on these features, rather than on the usual mismatch of model output and data. Our goal is to contribute to a fundamental understanding of such a feature-based approach that allows us to assimilate selected aspects of data into models. We also explain how the feature-based approach can be interpreted as a method for reducing an effective dimension, and we derive new noise models, based on perturbed observations, that lead to computationally efficient solutions. Numerical implementations of our ideas are illustrated in four examples.

The author's copyright for this publication is
transferred to the United States Government. The United States Government
retains and the publisher, by accepting the article for publication,
acknowledges that the United States Government retains a nonexclusive,
paid-up, irrevocable, worldwide license to publish or reproduce the published
form of this paper, or allow others to do so, for United States Government
purposes. The U.S. Department of Energy will provide public access to these
results of federally sponsored research in accordance with the DOE Public
Access Plan (

The basic idea of data assimilation is to update a computational model with
information from sparse and noisy data so that the updated model can be used
for predictions. Data assimilation is at the core of computational
geophysics, e.g., in numerical weather prediction

The posterior distribution is proportional to the product of a prior
distribution and a likelihood. The likelihood connects the model and its
parameters to the data and is often based on the mismatch of model output and
data. A typical example is the squared two-norm of the difference of model
output and data. However, estimating model parameters based on such a direct
mismatch of model outputs and data may not be required or feasible. It is not
required, for example, if the data are intrinsically low-dimensional, or if
the data are redundant (we discuss a specific example in
Sect.
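As a minimal illustration of such a mismatch-based likelihood (the toy linear model and all names below are our own, purely hypothetical), a Gaussian log-likelihood built from the squared two-norm of the difference of model output and data can be written as:

```python
import numpy as np

def log_likelihood(theta, data, model, sigma=1.0):
    """Gaussian log-likelihood based on the squared two-norm of the
    model-data mismatch (additive, independent noise, std. dev. sigma)."""
    residual = model(theta) - data
    return -0.5 * np.dot(residual, residual) / sigma**2

# Hypothetical linear model for illustration: output = theta * t
t = np.linspace(0.0, 1.0, 5)
model = lambda theta: theta * t
data = 2.0 * t  # noise-free synthetic data with true theta = 2
```

The posterior is then proportional to the prior times the exponential of this log-likelihood.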

The above issues can be addressed by adapting ideas from machine learning to
data assimilation. Machine learning algorithms expand the data into a
suitable basis of “feature vectors”

As a specific example, consider a viscously damped harmonic oscillator, defined by damping and stiffness coefficients (we assume we know its mass). An experiment may be to pull on the mass, release it, and measure the displacement of the mass from equilibrium as a function of time. These data can be compressed into features in various ways. For example, a feature could be the statement that “the system exhibits oscillations”. Based on this feature, one can infer that the damping coefficient is less than 1. Other features may be the decay rate or the observed oscillation frequency. If these quantities were known exactly, one could compute the damping and stiffness coefficients using classical formulas. The idea of feature-based data assimilation is to make such inferences in view of the uncertainties associated with the features.
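If the decay rate and oscillation frequency were known exactly, the classical underdamped-oscillator formulas invert directly. A minimal sketch, for one common normalization (unit mass, equation x'' + c x' + k x = 0; the function name is ours):

```python
def coefficients_from_features(decay_rate, frequency):
    """Recover the damping (c) and stiffness (k) of a unit-mass,
    underdamped oscillator x'' + c x' + k x = 0 from two noise-free
    features: the exponential decay rate lam = c/2 and the observed
    oscillation frequency omega_d = sqrt(k - (c/2)**2)."""
    c = 2.0 * decay_rate
    k = frequency**2 + decay_rate**2
    return c, k
```

In feature-based data assimilation, these noise-free inversions are replaced by inference that accounts for the uncertainty in the features.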

Another example is Lagrangian data assimilation for fluid flow, where the
data are trajectories of tracers and where a natural candidate for a feature
is a coherent structure

Our goal is to contribute to a fundamental understanding of the feature-based
approach to data assimilation and to extend the numerical framework for
solving feature-based data assimilation problems. We also discuss the
conditions under which the feature-based approach is appropriate. In this
context, we distinguish two problem classes. First, the compression of the
data into a feature may lead to no or little loss of information, in which
case the feature-based problem and the “original” problem, as well as their
solutions, are similar. Specific examples are intrinsically low-dimensional
data or redundant (steady state) data. Second, the features extracted from
the data may be designed to deliberately neglect information in the data.
This second case is more interesting because we can assimilate selected
aspects of data into low-dimensional models for complex systems and we can
formulate feature-based problems that lead to useful parameter estimates for
chaotic systems, for which a direct approach is computationally expensive or
infeasible. We give interpretations of these ideas in terms of effective
dimensions of data assimilation problems (

Nonetheless, the feature-based likelihood can be cumbersome to evaluate. The
reason is that an evaluation of a feature-based likelihood may involve
repeated solution of stochastic equations, followed by a compression of a
large amount of simulation data into features; moreover, it is unclear how to assess
the error statistics of the features. In fact, the inaccessible likelihood
prevents application of the typical numerical methods for data assimilation,
e.g., Monte Carlo sampling or optimization. We suggest overcoming this
difficulty by adapting ideas from stochastic ensemble Kalman filters

Details of the numerical solution of feature-based data assimilation problems
are discussed in the context of four examples, two of which involve “real”
data. Each example presents its own challenges, and we suggest appropriate
numerical techniques, including Markov Chain Monte Carlo (MCMC;

Ideas related to ours were recently discussed by

We briefly review the typical data assimilation problem formulation and several methods for its numerical solution. The descriptions of the numerical techniques may not be sufficient to fully comprehend the advantages or disadvantages of each method, but these are explained in the references we cite.

Suppose you have a mathematical or computational model

In addition to Eq. (

Data assimilation problems of this kind appear in science and engineering,
e.g., in numerical weather prediction, oceanography and geomagnetism

Computational methods for data assimilation can be divided into three groups.
The first group is based on the Kalman filter

In variational data assimilation one finds the parameter set

In MCMC, a Markov chain is generated by drawing a
new sample

In direct sampling (sometimes called importance sampling) one generates
independent samples using a proposal density
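As a point of reference for the sampling methods mentioned above, a minimal random-walk Metropolis MCMC sketch (our own illustration, not the specific samplers used in this paper) is:

```python
import numpy as np

def metropolis(log_post, theta0, n_steps, step=1.0, rng=None):
    """Random-walk Metropolis: propose a Gaussian perturbation of the
    current sample and accept with probability min(1, pi(new)/pi(old))."""
    rng = np.random.default_rng(0) if rng is None else rng
    chain = [theta0]
    lp = log_post(theta0)
    for _ in range(n_steps):
        prop = chain[-1] + step * rng.standard_normal()
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:
            chain.append(prop)
            lp = lp_prop
        else:
            chain.append(chain[-1])
    return np.array(chain)

# Illustrative target: a standard normal posterior
chain = metropolis(lambda x: -0.5 * x**2, 0.0, 5000)
```

The samples in `chain` are correlated, in contrast to the independent samples produced by direct sampling.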

The basic idea of feature-based data assimilation is to replace the data
assimilation problem defined by a prior

Evaluating the feature-based posterior distribution is difficult because
evaluating the feature-based likelihood is cumbersome. Even under simplifying
assumptions of additive Gaussian noise in Eq. (

Difficulties with evaluating the feature-based likelihood arise because we
assume that Eq. (

We explore this idea and consider an additive Gaussian noise model for the
feature. This amounts to replacing Eq. (

Our simplified approach requires that one define the distribution of the
errors

Note that the rank of the covariance
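Under the additive Gaussian noise model for the feature, the feature-based log-likelihood takes the same form as before, but in feature space. A minimal sketch (the toy feature map and names are hypothetical; a pseudo-inverse is used so that a rank-deficient feature-error covariance is handled gracefully):

```python
import numpy as np

def feature_log_likelihood(theta, feature_of_data, feature_model, cov):
    """Gaussian log-likelihood in feature space: mismatch between the
    feature of the data and the model-predicted feature, weighted by the
    (possibly rank-deficient) feature-error covariance via a pseudo-inverse."""
    r = feature_model(theta) - feature_of_data
    return -0.5 * r @ np.linalg.pinv(cov) @ r

# Hypothetical two-component feature and diagonal error covariance
cov = np.diag([0.1, 0.4])
fm = lambda theta: np.array([theta, theta**2])
f_data = np.array([1.0, 1.0])
```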

One may also question why

Feature-based data assimilation requires that one define and select
relevant features. In principle, much of the machine learning technology can
be applied to extract generic features from data. For example, one can define

The choice of the feature suggests the numerical methods for the solution of
the feature-based problem. One issue here is that, even with our simplifying
assumption of additive (Gaussian) noise in the feature, evaluating a
feature-based likelihood can be noisy. This happens in particular when the
feature is defined in terms of averages over solutions of stochastic or
chaotic equations. Due to limited computational budgets, such averages are
computed using a small sample size. Thus, sampling error is large and
evaluation of a feature-based likelihood is noisy, i.e., evaluations of the
feature-based likelihood, even for the same set of parameters
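The effect described above can be reproduced in a few lines: if the feature is a small-sample average over a (hypothetical) stochastic model, repeated evaluations at the same parameter yield different values.

```python
import numpy as np

def noisy_feature(theta, n_samples, rng):
    """Feature defined as an average over solutions of a stochastic model;
    with a small sample size the estimate carries Monte Carlo noise."""
    # Hypothetical stochastic model: theta plus unit Gaussian noise
    return np.mean(theta + rng.standard_normal(n_samples))

rng = np.random.default_rng(1)
f1 = noisy_feature(2.0, 10, rng)
f2 = noisy_feature(2.0, 10, rng)  # same parameter, different evaluation
```

A likelihood built on such features is therefore itself noisy, which rules out naive use of derivative-based optimization or standard MCMC.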

A natural question is the following:

It may be possible that data can be compressed into features without
significant loss of information, for example, if observations are collected
while a system is in steady state. Steady state data are redundant, make
negligible contributions to the likelihood and posterior distributions and,
therefore, can be ignored. This suggests that features can be based on
truncated data and that the resulting parameter estimates and posterior
distributions are almost identical to the estimates and posterior
distributions based on

In some applications a posterior distribution defined by all of the data may
not be practical or computable. An example is estimation of initial
conditions and other parameters based on (noisy) observations of a chaotic
system over long timescales. In a “direct” approach one tries to estimate
initial conditions that lead to trajectories that are near the observations
at all times. Due to the sensitivity to initial conditions a point-wise match
of model output and data is numerically difficult to achieve. In a
feature-based approach one does not insist on a point-by-point match of
model output and data, i.e., the feature-based approach

The feature-based approach is essential for problems for which the numerical
model and the data are characterized by different scales (spatial, temporal
or both). Features can be designed to filter out fine scales that may be
present in the data, but which are not represented by the numerical model.
This is particularly important when a low-dimensional model is used to
represent certain aspects of a complex system. Specific examples of
low-dimensional models for complex processes can be found in the modeling of
clouds or the geomagnetic dipole

Cases (i) and (ii) can be understood more formally using the concept of an
“effective dimension”. The basic idea is that a high-dimensional data
assimilation problem is more difficult than a low-dimensional problem.
However, it is not only the number of parameters that defines dimension in
this context, but rather a combination of the number of parameters, the
assumed distributions of errors and prior probability, as well as the number
of data points (

Case (i) above is characterized by features that do not change
(significantly) the posterior distribution and, hence, the features do not
alter the effective dimension of the problem. It follows that the computed
solutions and the required computational cost of the feature-based or
“direct” approach are comparable. In case (ii), however, using the feature
rather than the data themselves indeed changes the problem and its solution,
i.e., the feature-based posterior

We illustrate the above ideas with four numerical examples. In the examples,
we also discuss appropriate numerical techniques for solving feature-based
data assimilation problems. The first example illustrates that contributions
from redundant data are negligible. The second example uses “real data” and
a predator–prey model to illustrate the use of a PCA feature. Examples 1
and 2 are simple enough to solve by “classical” data assimilation, matching
model outputs and data directly, and serve as an illustration of problems of
type (i) in Sect.

We wish to remind the reader that the choices of prior distributions are critical for the Bayesian approach to parameter estimation. However, the focus of this paper is on new formulations of the likelihood using features. In the examples below we make reasonable choices for the priors, but other choices of priors will lead to different posterior distributions and, hence, different parameter estimates. In examples 1, 2 and 4, we do not have any information about the values of the parameters and we choose uniform priors over large intervals. In example 3, we use a sequential data assimilation approach and build priors informed by previous assimilations, as is typical in sequential data assimilation.

We illustrate that a data assimilation problem with fewer data points can be
as useful as one with significantly more, but redundant, data points. We
consider a mass–spring–damper system

We investigate this idea by solving data assimilation problems with
experiment durations between

Each experiment is in itself a random event because the measurement noise is
random. The KL divergence between the various posterior distributions is,
thus, also random; we address this issue by performing 1000 independent
experiments and averaging the KL divergences. Our results are shown in
Fig.
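A sketch of the KL computation for posteriors represented as histograms over a common grid (the small constant eps is our own regularization against empty bins):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q) between two discrete (histogram)
    distributions; p and q are nonnegative weights over the same bins."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```

In the experiments described above, this quantity is computed for each of the 1000 independent realizations and then averaged.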

We now consider a feature that compresses the data into two numbers. The
first component of our feature is the average of the last 50 data points.
This average is directly related to the natural frequency since

The covariance matrix

where

We solve this feature-based problem for an experiment of duration

Finally, we show triangle plots of the posterior distribution

We consider the Lotka–Volterra (LV) equations

We use the lynx and hare data of the Hudson's Bay Company

We define a feature
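As a generic sketch of a PCA compression of the kind used here (the specific feature for the lynx-hare data is defined in the text; this implementation is our own), the data can be projected onto their leading principal components:

```python
import numpy as np

def pca_feature(data, k=2):
    """Compress a data set (rows = observations) into the coefficients
    of its leading k principal components, obtained from the SVD of the
    centered data."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T
```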

Triangle plot of histograms of all one- and two-dimensional marginals of the feature-based posterior distribution.

We use the MATLAB implementation of the affine invariant ensemble sampler to
solve the feature-based data assimilation problem; see

We show a triangle plot of the feature-based posterior distribution,
consisting of histograms of all one- and two-dimensional marginals, in
Fig.

We observe that there is strong correlation between the parameters

We plot the trajectories of the LV equations corresponding to 100 samples of
the feature-based posterior distribution in Fig.

We note that the trajectories pass near the 22 original data points (shown as
orange dots in Fig.

Raw data (orange dots), trajectories corresponding to the
feature-based posterior mode (red) and 100 trajectories of hares (turquoise)
in panel

We consider the Earth's magnetic dipole field over timescales of tens of
millions of years. On such timescales, the geomagnetic dipole exhibits
reversals, i.e., the North Pole becomes the South Pole and vice versa. The
occurrence of dipole reversals is well documented over the past 150 Myr by
the “geomagnetic polarity timescale”

In both models, the drift,

The geomagnetic polarity timescale shows that the Earth's MCD varies over
the past 150 Myr. For example, there were 125 reversals between today and
30.9 Myr ago (

The feature we extract from the geomagnetic polarity timescale is the MCD,
which we compute by using a sliding window average over 10 Myr. We compute
the MCD every 1 Myr, so that the “feature data”,
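One possible implementation of such a sliding-window MCD (our own windowing convention, which may differ in detail from the one used for the geomagnetic polarity timescale):

```python
import numpy as np

def mcd_feature(reversal_times, t_grid, window=10.0):
    """Mean chron duration (MCD) in a sliding window: at each time in
    t_grid, average the durations of the chrons (intervals between
    successive reversals) whose midpoints lie within +/- window/2."""
    times = np.sort(np.asarray(reversal_times, dtype=float))
    durations = np.diff(times)
    midpoints = 0.5 * (times[:-1] + times[1:])
    mcd = np.full(len(t_grid), np.nan)
    for i, t in enumerate(t_grid):
        mask = np.abs(midpoints - t) <= window / 2.0
        if mask.any():
            mcd[i] = durations[mask].mean()
    return mcd
```

Windows that contain no complete chron are left as NaN, mirroring the special treatment of long chrons discussed below.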

The geomagnetic polarity timescale and the MCD feature are shown in
Fig.

We note that the averaging window of 10 Myr is too short during long chrons,
especially during the “Cretaceous superchron” that lasted almost 40 Myr
(from about 120 to 80 Myr ago). We set the MCD to be

To sequentially assimilate the feature data, we assume that the parameter

For the modified B13 model we add one more step. The numerical solutions of
this model tend to exhibit short chrons (a few thousand years) during a
“proper reversal,” i.e., when the state transitions from one polarity (

We investigate how to choose the random variable

For each value of

We base our feature-error model

MCD as a function of

A feature

Our results are illustrated in Fig.

We further illustrate the results of the feature-based data assimilation in
Fig.

The advantage of the feature-based approach in this problem is that it allows
us to calibrate the modified B13 and P09 models to yield a time-varying MCD
in good agreement with the data (geomagnetic polarity timescale), where
“good agreement” is to be interpreted in the feature-based sense. Our
approach may be particularly useful for studying how flow structure at the
core affects the occurrence of superchrons. A thorough investigation of what
our results imply about the physics of geomagnetic dipole reversals will be
the subject of future work. In particular, we note that other choices for the
standard deviation

We consider the Kuramoto–Sivashinsky equation

For computations we discretize the KS equation by the spectral method and
exponential time differencing with

The data are 100 snapshots of the solution of the KS equation obtained as
follows. For a given

The feature we extract from the data is as follows. We interpolate the
snapshots onto a coarser

We choose this feature because the parameter

It is important to note that the feature we construct does not depend on the
initial conditions. This is the main advantage of the feature-based approach.
Using the feature, rather than the trajectories, enables estimation of the
parameter

Illustration of the computed feature. Eigenvalues of covariance
matrices of snapshots (dots) and log-linear fit (solid lines). Blue dots and
red line correspond to a run with

The feature-based likelihood is defined by the equation

Draw random initial conditions and obtain 100 snapshots of the solution of
the KS equation with parameter

Interpolate snapshots onto

Compute largest eigenvalues of the sample covariance matrix and compute a log-linear fit.
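The last two steps can be sketched as follows (the interpolation onto the coarse grid is omitted, and the synthetic snapshots below are purely illustrative, not the KS data):

```python
import numpy as np

def spectrum_feature(snapshots, n_eig=5):
    """Feature from snapshot data: the leading eigenvalues of the sample
    covariance and the slope/intercept of a log-linear fit to their decay."""
    centered = snapshots - snapshots.mean(axis=0)
    cov = centered.T @ centered / (snapshots.shape[0] - 1)
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1][:n_eig]
    slope, intercept = np.polyfit(np.arange(n_eig), np.log(eigvals), 1)
    return slope, intercept

# Illustrative snapshots with a decaying variance spectrum
rng = np.random.default_rng(0)
snapshots = rng.standard_normal((200, 3)) * np.array([4.0, 2.0, 1.0])
slope, intercept = spectrum_feature(snapshots, 3)
```

The two fit coefficients, rather than the snapshots themselves, then enter the feature-based likelihood.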

Finally, we need to choose a covariance matrix

We address these issues by using a variational approach and compute an a
posteriori estimate of

We need to decide on a numerical method for solving the optimization problem.
Since the function

GP model of the function

A GP model for

To improve our GP model of

The updated GP model is illustrated in the right panel of
Fig.
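For reference, a minimal GP prediction step of the kind used as a surrogate here (zero-mean GP with a squared-exponential kernel; hyperparameters and names are our own): updating the GP model amounts to appending new evaluation points to the training set and recomputing the prediction.

```python
import numpy as np

def rbf(a, b, length=1.0):
    """Squared-exponential (RBF) kernel matrix between 1-D point sets."""
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / length**2)

def gp_predict(x_train, y_train, x_test, noise=1e-6):
    """Posterior mean of a zero-mean GP with an RBF kernel, conditioned
    on (x_train, y_train); a cheap surrogate for an expensive objective."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    k_star = rbf(x_test, x_train)
    return k_star @ np.linalg.solve(K, y_train)
```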

We have discussed a feature-based approach to data assimilation. The basic
idea is to compress the data into features and to compute parameter estimates
on posterior distributions defined in terms of the features, rather than the
raw data. The feature-based approach has the advantage that one can calibrate
numerical models to selected aspects of the data which can help to bridge
gaps between low-dimensional models for complex processes. The feature-based
approach can also break computational barriers in data assimilation with
chaotic systems. Our main conclusions are as follows.

Constructing noise models directly for the features leads to straightforward numerical implementation of the feature-based approach and enables the use of numerical methods familiar from data assimilation.

The feature-based approach can reduce computational requirements by reducing an effective dimension. This reduction in complexity comes at the expense of relaxing how much the data constrain the parameters.

Code for the numerical examples will be made available on
GitHub:

The authors declare that they have no conflict of interest.

The views expressed in the article do not necessarily represent the views of the U.S. Department of Energy or the United States Government. DOE/NV/25946-3357.

We thank Alexandre J. Chorin and John B. Bell from Lawrence Berkeley National Laboratory for interesting discussion and encouragement. We thank Bruce Buffett of the University of California at Berkeley for inspiration, encouragement and for providing his code for the B13 model. We thank Joceline Lega of the University of Arizona for providing code for numerical solution of the Kuramoto–Sivashinsky equation.

We thank three anonymous reviewers for insightful and careful comments which improved the paper.

Matthias Morzfeld, Spencer Lunderman and Rafael Orozco gratefully acknowledge support by the National Science Foundation under grant DMS-1619630.

Matthias Morzfeld acknowledges support by the Office of Naval Research (grant number N00173-17-2-C003) and by the Alfred P. Sloan Foundation.

Matthias Morzfeld and Jesse Adams were supported, in part, by National Security Technologies, LLC, under contract no. DE-AC52-06NA25946 with the U.S. Department of Energy, National Nuclear Security Administration, Office of Defense Programs, and supported by the Site-Directed Research and Development Program.

Edited by: Amit Apte

Reviewed by: three anonymous referees