Introduction
Understanding and forecasting the evolution of a given system is a crucial
topic in an ever-increasing number of application domains. To achieve this
goal, one can rely on multiple sources of information, namely observations of
the system, a numerical model describing its behavior and additional
a priori knowledge such as statistical information or previous
forecasts. To combine these heterogeneous sources of information it is common
practice to use so-called data assimilation methods (e.g., see
reference books ). They have multiple
aims: finding the initial and/or boundary conditions, parameter
estimation, reanalysis, and so on. They are extensively used in numerical
weather forecasting, for instance (e.g., see reviews in the books
).
The estimation of the different elements to be sought, the control vector, is
performed using data assimilation through the comparison between the
observations and their model counterparts. The control vector should be
adjusted such that its model outputs would fit the observations, while taking
into account that these observations are imperfect and corrupted by noise and
errors.
Data assimilation methods are divided into three distinct classes. First,
there is statistical filtering based on Kalman filters. Then, there are variational
data assimilation methods based on optimal control theory. More recently
hybrids of both approaches have been developed
. In this paper we focus on
variational data assimilation. It consists in minimizing a cost function
written as the distance between the observations and their model
counterparts. A Tikhonov regularization term is also added to the cost
function as a distance between the control vector and a background state
carrying a priori information.
Thus, the cost function contains the misfit between the data (a priori and observations) and their control and model counterparts.
Minimizing the cost function aims at reaching a compromise in which these
errors are as small as possible. The errors can be decomposed into amplitude
and position errors. Position errors mean that the structural elements are
present in the data, but misplaced. Some methods have been proposed in order
to deal with position errors . These involve
a preprocessing step which consists in displacing the different data so they
fit better with each other. Then the data assimilation is performed
accounting for those displaced data.
A distance has to be chosen in order to compare the different data and
measure the misfits. Usually, a Euclidean distance is used, often weighted to
take into account the statistical errors. But Euclidean distances have
trouble capturing position errors. This is illustrated in Fig. , which shows two curves ρ0 and ρ1. The
second curve ρ1 can be seen as the first one ρ0 with position
error. The minimizer of the cost function ‖ρ − ρ0‖² + ‖ρ − ρ1‖² is given by ρ* = (ρ0 + ρ1)/2,
plotted with violet stars in Fig. . It is the average
of curves ρ0 and ρ1 with respect to the L2
distance. As we can see in Fig. , it does not correct
for position error, but instead creates two smaller amplitude curves. We
investigate in this article the idea of using instead a distance stemming
from optimal transport theory – the Wasserstein distance, which can take into
account position errors. In Fig. we plot (green dots)
the average of ρ0 and ρ1 with respect to the Wasserstein
distance. Contrary to the L2 average, the Wasserstein average is
what we want it to be: same shape, same amplitude, located in-between. It
conserves the shape of the data. This is what we want to achieve when dealing
with position errors.
Wasserstein (W) and Euclidean (L2) averages
of two curves ρ0 and ρ1.
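To make this comparison concrete, the following sketch (illustrative Python of our own; the grid, bump widths and positions are assumptions, not values from the paper) computes both averages for two translated bumps: the L2 average is the pointwise mean, while the one-dimensional Wasserstein average is obtained by averaging the quantile functions (inverse cumulative distribution functions) of the two normalized curves.

```python
import numpy as np

# Hypothetical grid and two translated bumps of unit mass, mimicking rho0 and rho1
x = np.linspace(0.0, 1.0, 200)
bump = lambda c: np.exp(-0.5 * ((x - c) / 0.03) ** 2)
rho0 = bump(0.3); rho0 /= np.trapz(rho0, x)
rho1 = bump(0.7); rho1 /= np.trapz(rho1, x)

# L2 average: pointwise mean -> two half-amplitude bumps
rho_l2 = 0.5 * (rho0 + rho1)

# Wasserstein average in 1-D: average the quantile functions (inverse CDFs)
q = np.linspace(0.0, 1.0, 200)            # probability levels
F0 = np.cumsum(rho0) / np.sum(rho0)       # discrete CDFs
F1 = np.cumsum(rho1) / np.sum(rho1)
x0_of_q = np.interp(q, F0, x)             # quantile function of rho0
x1_of_q = np.interp(q, F1, x)             # quantile function of rho1
x_bar = 0.5 * (x0_of_q + x1_of_q)         # averaged quantiles
# Recover the density of the barycenter by differentiating its CDF
F_bar = np.interp(x, x_bar, q)
rho_w = np.gradient(F_bar, x)             # a single bump centered near x = 0.5
```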
Optimal transport theory was pioneered by , who searched for the optimal way of displacing sand piles
onto holes of the same volume while minimizing the total cost of displacement.
This can be seen as a transportation problem between two probability
measures. A modern presentation can be found in and will be
discussed in Sect. .
Optimal transport has a wide spectrum of applications: from pure mathematical
analysis on Riemannian spaces to applied economics; from functional
inequalities to the semi-geostrophic
equations ; and in astrophysics ,
medicine , crowd motion or urban
planning . From optimal transport theory several distances
can be derived, with the most widely known being the Wasserstein distance (denoted
W) which is sensitive to misplaced features and is the primary
focus of this paper. This distance is also widely used in computer vision,
for example in classification of images ,
interpolation or movie reconstruction .
More recently, used the Wasserstein distance to compare
observation and model simulations in an air pollution context, which is a
first step toward data assimilation.
Actual use of optimal transport in a variational data assimilation has been
proposed by to tackle model error. The authors use the
Wasserstein distance instead of the classical L2 norm for
model error control in the cost function, and they offer promising results.
Our contribution is in essence similar to theirs, in that the
Wasserstein distance is proposed in place of the L2 distance.
Looking more closely, we investigate a different question, namely the idea of
using the Wasserstein distance to measure the observation misfit. Also,
we underline and investigate the impact of the choice of the scalar products,
gradient formulations and minimization algorithm choices on the
assimilation performance, which is not discussed in . These
particularly subtle mathematical considerations are indeed crucial for the
algorithm convergence, as will be shown in this paper, and are our main
contribution.
The goal of the paper is to perform variational data assimilation with a cost
function written with the Wasserstein distance. The approach may be extended to other
types of data assimilation methods, such as filtering methods, but this largely
exceeds the scope of this paper.
The present paper is organized as follows: first, in Sect. ,
variational data assimilation as well as the Wasserstein distance are defined,
and the ingredients required in the following are presented. The core of our
contribution lies in Sect. : we first present the Wasserstein
cost function and then propose two choices for its gradients, as well as two
optimization strategies for the minimization. In Sect. we
present numerical illustrations, discuss the choices for the gradients and
compare the optimization methods. Also, some difficulties related to the use
of optimal transport will be pointed out and solutions will be proposed.
Materials and methodology
This section deals with the presentation of the variational data assimilation
concepts and method on the one hand and optimal transport and Wasserstein
distance concepts, principles and main theorems on the other hand. Section will combine both worlds and will
constitute the core of our original contribution.
Variational data assimilation
This paper focuses on variational data assimilation in the framework of
initial state estimation. Let us assume that a system state is described by a
variable x,
denoted x0 at initial time. We are also
given observations yobs of the system, which might be
indirect, incomplete and approximate. The initial state and the observations are
linked by operator G, mapping the system initial state x0 to the observation space, so that G(x0) and
yobs belong to the same space. Usually G is
defined using two other operators, namely the model M which gives
the model state as a function of the initial state and the observation
operator H which maps the system state to the observation space,
such that G=H∘M.
Data assimilation aims to find a good estimate of x0 using the
observations yobs and the knowledge of the operator
G. Variational data assimilation methods do so by finding the
minimizer x0 of the misfit function J (the cost
function) between the observations yobs and their computed
counterparts G(x0),
J(x0) = dR(G(x0), yobs)²,
with dR some distance to be defined. Generally, this problem is ill-posed.
For the minimizer of J to be unique, a background term is added
and acts like a Tikhonov regularization. This background term is generally
expressed as the distance with a background term xb, which
contains a priori information. The actual cost function then reads
J(x0) = dR(G(x0), yobs)² + dB(x0, xb)²,
with dB another distance to be specified. The control of x0 is done by the minimization of J.
Such minimization is generally carried out numerically using gradient descent methods. Section will give
more details about the minimization process.
The distances to the observations dR and to the background term dB have
to be chosen in this formulation. Usually, Euclidean distances (L2 distances, potentially weighted) are chosen, giving the following
Euclidean cost function
J(x0) = ‖G(x0) − yobs‖2² + ‖x0 − xb‖2²,
with ‖⋅‖2 the L2 norm defined by
‖a‖2² := ∫ |a(x)|² dx.
Euclidean distances, such as the L2 distance, are local metrics.
In the following we will investigate the use of a non-local metric, the
Wasserstein distance W, in place of dR and dB in
Eq. (). Such a cost function will be presented in Sect. . The Wasserstein distance is presented and defined in the
following subsection.
Optimal transport and Wasserstein distance
The essentials of optimal transport theory and Wasserstein distance required
for data assimilation are presented.
We define, in this order, the space of mass functions where the Wasserstein
distance is defined, then the Wasserstein distance and finally the
Wasserstein scalar product, a key ingredient for variational assimilation.
Mass functions
We consider the case where the observations can be represented as positive
fields that we will call “mass functions”. A mass function is a
nonnegative function of space. For example, a grey-scaled image is a mass
function; it can be seen as a function of space to the interval [0,1] where
0 encodes black and 1 encodes white.
Definition
Let Ω be a closed, convex, bounded set of Rd and let the set of mass functions P(Ω) be the set of
nonnegative functions of total mass 1:
P(Ω) := { ρ ≥ 0 : ∫Ω ρ(x) dx = 1 }.
Let us remark here that, in the mathematical framework of optimal transport,
mass functions are continuous and they are called “probability densities”.
In the data assimilation framework the concept of probability densities is
mostly used to represent errors. Here, the positive functions we consider
actually serve as observations or state vectors, so we chose to
call them mass functions to avoid any possible confusion with state or
observation error probability distributions.
Wasserstein distance
Given the set of all transportations between two mass functions, the optimal transport is the one minimizing the kinetic energy. A
transportation between two mass functions ρ0 and ρ1 is given by a
time path ρ(t,x) such that ρ(t=0) = ρ0 and ρ(t=1) = ρ1, together with a velocity field v(t,x), such that the continuity
equation holds,
∂ρ/∂t + div(ρv) = 0.
Such a path ρ(t) can be seen as interpolating ρ0 and ρ1. For
ρ(t) to stay in P(Ω), a sufficient condition is that the
velocity field v(t,x) should be tangent to the domain boundary,
meaning that ρ(t,x)v(t,x)⋅n(x)=0 for almost all
(t,x)∈[0,1]×∂Ω. With this condition, the support of
ρ(t) remains in Ω.
Let us be clear here that the time t is fictitious and has no relationship
whatsoever with the physical time of data assimilation. It is purely used to
define the Wasserstein distance and some mathematically related objects.
The Wasserstein distance W is hence defined through the minimum kinetic energy among all the transportations between ρ0 and ρ1,
W(ρ0,ρ1)² = min(ρ,v)∈C(ρ0,ρ1) ∬[0,1]×Ω ρ(t,x) |v(t,x)|² dt dx,
with C(ρ0,ρ1) representing the set of continuous transportations
between ρ0 and ρ1 described by a velocity field v
tangent to the boundary of the domain,
C(ρ0,ρ1) := { (ρ,v) s.t. ∂tρ + div(ρv) = 0, ρ(t=0) = ρ0, ρ(t=1) = ρ1, ρv⋅n = 0 on ∂Ω }.
This definition of the Wasserstein distance is the Benamou–Brenier
formulation . There exist other definitions, based on the
transport map or the transference plans, but this is slightly out of the scope of
this article. See the introduction of for more details.
A remarkable property is that the optimal velocity field v is of
the form
v(t,x) = ∇Φ(t,x), with Φ following the Hamilton–Jacobi equation
∂tΦ + |∇Φ|²/2 = 0.
The equation of the optimal ρ is the continuity equation using this
velocity field. Moreover, the function Ψ defined by
Ψ(x):=-Φ(t=0,x)
is said to be the Kantorovich potential of the transport between ρ0 and ρ1. It is a useful feature in the derivation
of the Wasserstein cost function presented in Sect. .
A remarkable property of the Kantorovich potential allows the computation of the
Wasserstein distance, which is the Benamou–Brenier formula (see
or Theorem 8.1), given by
W(ρ0,ρ1)² = ∫Ω ρ0(x) |∇Ψ(x)|² dx.
Example
The classical example for optimal transport is the transport of
Gaussian mass functions. For Ω = Rd, let us consider two
Gaussian mass functions ρi of mean μi and variance σi²,
for i = 0 and i = 1. Then the optimal transport ρ(t) between ρ0
and ρ1 is a translation–dilation of ρ0 toward ρ1.
More precisely, ρ(t) is a Gaussian mass function whose mean is μ0 + t(μ1 − μ0) and whose variance is (σ0 + t(σ1 − σ0))².
The corresponding computed Kantorovich potential is (up to a constant)
Ψ(x) = (σ1/σ0 − 1) |x|²/2 + (μ1 − (σ1/σ0) μ0) ⋅ x.
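As a short worked check (added here for the one-dimensional case d = 1; the sign convention of Ψ plays no role since only |∇Ψ|² enters), plugging this potential into the Benamou–Brenier formula above yields the familiar closed form
W(ρ0,ρ1)² = ∫Ω ρ0(x) |∇Ψ(x)|² dx = (μ1 − μ0)² + (σ1 − σ0)²,
i.e., the squared distance between the means plus the squared difference of the standard deviations.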
Finally, a few words should be said about the numerical computation of the
Wasserstein distance. In one dimension, the optimal transport ρ(t,x) is
easy to compute as the Kantorovich potential has an exact formulation: the
Kantorovich potential of the transport between two mass functions ρ0
and ρ1 is the only function Ψ such that
F1(x-∇Ψ(x))=F0(x),∀x,
with Fi being the cumulative distribution function of ρi. Numerically we
fix x and solve iteratively Eq. () using a binary search
to find ∇Ψ. Then, we obtain Ψ thanks to numerical
integration. Finally, Eq. () gives the
Wasserstein distance.
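As an illustration of this one-dimensional procedure, the following sketch (our own illustrative Python code, not the authors' implementation; the tolerance, bracketing and clamping choices are assumptions) recovers ∇Ψ by a binary search on the relation F1(x − ∇Ψ(x)) = F0(x) and then evaluates the squared Wasserstein distance with the Benamou–Brenier formula.

```python
import numpy as np

def wasserstein_1d_squared(rho0, rho1, x, tol=1e-10):
    """Squared 1-D Wasserstein distance via the Kantorovich potential gradient.

    For each grid point x_i, solve F1(x_i - gradPsi_i) = F0(x_i) by binary
    search (F1 is non-decreasing), then W^2 = integral of rho0 * gradPsi^2."""
    dx = x[1] - x[0]
    F0 = np.cumsum(rho0) * dx
    F1 = np.cumsum(rho1) * dx
    cdf1 = lambda s: np.interp(s, x, F1, left=0.0, right=F1[-1])

    grad_psi = np.zeros_like(x)
    for i, xi in enumerate(x):
        lo, hi = xi - x[-1], xi - x[0]     # so that xi - gradPsi stays inside the domain
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if cdf1(xi - mid) > F0[i]:
                lo = mid                   # current gradPsi too small: increase it
            else:
                hi = mid
        grad_psi[i] = 0.5 * (lo + hi)
    return np.sum(rho0 * grad_psi**2) * dx
```

For two narrow Gaussian bumps this reproduces, up to discretization error, the closed form (μ1 − μ0)² + (σ1 − σ0)² of the worked Gaussian example above.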
For two- or three-dimensional problems, there exists no general formula for
the Wasserstein distance and more complex algorithms have to be used, such as
the (iterative) primal-dual one or the
semi-discrete one . In the former, an approximation of the
Kantorovich potential is directly read in the so-called dual variable.
Wasserstein inner product
The scalar product between two functions is required for data assimilation
and optimization: as we will recall later, the scalar product choice is used
to define the gradient value. This paper will consider the classical
L2 scalar product as well as the one associated with the
Wasserstein distance. A scalar product defines the angle and norm of vectors
tangent to P(Ω) at a point ρ0. First, a tangent vector
in ρ0 is the derivative of a curve ρ(t) passing through ρ0.
As a curve ρ(t) can be described by a continuity equation, the space of
tangent vectors, the tangent space, is formally defined (cf. ) by
Tρ0P = { η ∈ L2(Ω) s.t. η = −div(ρ0∇Φ), with Φ s.t. ρ0 ∂Φ/∂n = 0 on ∂Ω }.
Let us first recall that the Euclidean, or L2, scalar product
〈⋅,⋅〉2 is defined on Tρ0P by
∀η,η′∈Tρ0P(Ω),〈η,η′〉2:=∫Ωη(x)η′(x)dx.
The Wasserstein inner product 〈⋅,⋅〉W is defined for
η=-div(ρ0∇Φ),η′=-div(ρ0∇Φ′)∈Tρ0P by
〈η,η′〉W:=∫Ωρ0∇Φ⋅∇Φ′dx.
One has to note that the inner product is dependent on ρ0∈P(Ω). Finally, the norm associated with a tangent vector
η=-div(ρ0∇Φ)∈Tρ0P is
‖η‖W² = ∫Ω ρ0 |∇Φ|² dx,
which is the kinetic energy of the small displacement η. This makes
the link between this inner product and the Wasserstein distance.
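As a small numerical illustration (our own sketch; the discrete potentials and grid are assumptions), the Wasserstein inner product of two tangent vectors η = −div(ρ0∇Φ) and η′ = −div(ρ0∇Φ′) can be evaluated directly from ρ0 and the potentials on a one-dimensional grid:

```python
import numpy as np

def wasserstein_inner(rho0, phi, phi_prime, x):
    """<eta, eta'>_W = integral of rho0 * grad(phi) * grad(phi') dx (1-D grid)."""
    dphi = np.gradient(phi, x)
    dphi_p = np.gradient(phi_prime, x)
    return np.trapz(rho0 * dphi * dphi_p, x)

# Example: W-norm of a tangent vector, i.e. the kinetic energy of a small displacement
x = np.linspace(0.0, 1.0, 200)
rho0 = np.exp(-0.5 * ((x - 0.5) / 0.05) ** 2)
rho0 /= np.trapz(rho0, x)
phi = x                                       # grad(phi) = 1: uniform rightward shift
print(wasserstein_inner(rho0, phi, phi, x))   # ~1.0, the squared W-norm of eta
```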
Optimal transport-based data assimilation
This section is our main contribution. First, we will consider the
Wasserstein distance to compute the observation term of the cost
function; second, we will discuss the choices of the scalar product and the
gradient descent method and their impact on the assimilation algorithm
efficiency.
Wasserstein cost function
In the framework of Sect. we will define the
data assimilation cost function using the Wasserstein distance. For this cost
function to be well defined we assume that the control variables belong to P(Ω) and that the observation
variables belong to another space P(Ωo) with Ωo a
closed, convex, bounded set of Rd′. Let us recall that this
means that they are all nonnegative functions with integral equal to 1.
Having elements with integral 1 (or constant integral) may seem restrictive.
Removing it is possible by using a modified version of the Wasserstein
distance, presented for example in or . For
simplicity we do not consider this possible generalization and all data have
the same integral. The cost function Eq. () is rewritten using the
Wasserstein distance defined in Sect. ,
JW(x0) = (1/2) ∑i=1..Nobs W(Gi(x0), yiobs)² + (ωb/2) W(x0, x0b)²,
with Gi:P(Ω)→P(Ωo) the
observation operator computing the yiobs counterpart from
x0 and ωb a scalar weight associated with the background
term.
The variables x0 and yiobs may be vectors whose components are functions belonging to P(Ω) and
P(Ωo) respectively. The Wasserstein distance between two such vectors is the sum of the distances between their
components. The remainder of the article is easily adaptable to this case, but for simplicity we set
x0=ρ0∈P(Ω) and yiobs=ρiobs∈P(Ω).
The Wasserstein cost function Eq. () then becomes
JW(ρ0) = (1/2) ∑i=1..Nobs W(Gi(ρ0), ρiobs)² + (ωb/2) W(ρ0, ρ0b)².
As for the classical L2 cost function,
JW is convex with respect to the Wasserstein
distance in the linear case and has a unique minimizer. In the nonlinear
case, the uniqueness of the minimizer relies on the regularization term
(ωb/2) W(ρ0, ρ0b)².
To find the minimum of JW, a gradient descent method is applied. It is presented in
Sect. . As this type of algorithm requires the gradient of the cost function, computation of the
gradient of JW is the focus of the next section.
Gradient of JW
If JW is differentiable, its gradient g is given by
∀ η ∈ Tρ0P,   limϵ→0 [JW(ρ0 + ϵη) − JW(ρ0)]/ϵ = 〈η, g〉,
where 〈⋅,⋅〉 represents the scalar product. The scalar
product is not unique, so as a consequence neither is the gradient. In this
work we decided to study and compare two choices for the scalar product – the
natural one W and the usual one L2. W is
clearly the ideal candidate for a good scalar product. However, we also
decided to study the L2 scalar product because it is the usual
choice in optimization. Numerical comparison is done in Sect. .
The associated gradients are respectively denoted as
gradWJW(ρ0) and
grad2JW(ρ0) and are the only elements of
the tangent space Tρ0P of ρ0∈P(Ω)
such that
∀ η ∈ Tρ0P,   limϵ→0 [JW(ρ0 + ϵη) − JW(ρ0)]/ϵ = 〈gradWJW(ρ0), η〉W = 〈grad2JW(ρ0), η〉2.
Here in the notations, the term “grad” is used for the gradient of a function while the spatial gradient is denoted by the nabla sign
∇. The gradients of JW are elements of Tρ0P and hence functions of space.
The following theorem allows the computation of both gradients of
JW.
Theorem
For i∈{1,…,Nobs}, let Ψi be the Kantorovich
potential (see Eq. ) of the transport
between Gi(ρ0) and ρiobs. Let Ψb be the
Kantorovich potential of the transport map between ρ0 and ρ0b.
Then,
grad2JW(ρ0) = ωb Ψb + ∑i=1..Nobs Gi*(ρ0).Ψi + c,
with c such that the integral of
grad2JW(ρ0) is zero and Gi* the adjoint of Gi with respect to the L2 inner product
(see definition reminder below). Assuming that
grad2JW(ρ0) has the no-flux boundary
condition (see comment about this assumption below)
ρ0 ∂[grad2JW(ρ0)]/∂n = 0 on ∂Ω,
then the gradient with respect to the Wasserstein inner product is
gradWJW(ρ0)=-div(ρ0∇[grad2JW(ρ0)]).
(A proof of this Theorem can be found in Appendix .)
The adjoint Gi*(ρ0) is defined by the classical equality
∀η,μ∈Tρ0P,〈Gi*(ρ0).μ,η〉2=〈μ,Gi(ρ0).η〉2,
where Gi(ρ0) is the tangent model, defined by
∀ η ∈ Tρ0P,   Gi(ρ0).η := limϵ→0 [Gi(ρ0 + ϵη) − Gi(ρ0)]/ϵ.
Note that the no-flux boundary condition assumption for
grad2JW(ρ0), that is
ρ0 ∂[grad2JW(ρ0)]/∂n = 0 on ∂Ω,
is not necessarily satisfied. The Kantorovich potentials respect this condition: their spatial gradients are velocities
tangent to the boundary (see the end of Sect. ). However, the condition may not be preserved by the mapping with the
adjoint model Gi*(ρ0). In the case where Gi*(ρ0) does not preserve this condition, the Wasserstein
gradient is not of integral zero. A possible workaround is to use a scalar product coming from the unbalanced Wasserstein distance of .
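As a minimal illustration of the Theorem (an added sketch of ours, assuming a single observation, a one-dimensional domain and an identity observation operator so that G*(ρ0).Ψ = Ψ), both gradients can be assembled from the Kantorovich potential gradients; the inputs grad_psi_obs and grad_psi_b could come, e.g., from the binary-search procedure sketched in Sect. .

```python
import numpy as np

def grad_JW(rho0, grad_psi_obs, grad_psi_b, x, omega_b=1.0):
    """L2 and Wasserstein gradients of J_W for one observation and G = identity.

    grad_psi_obs, grad_psi_b: gradients of the Kantorovich potentials of the
    transports from rho0 toward the observation and the background."""
    dx = x[1] - x[0]
    # Integrate the potential gradients; the integration constant is absorbed in c
    psi_obs = np.cumsum(grad_psi_obs) * dx
    psi_b = np.cumsum(grad_psi_b) * dx
    # L2 gradient: omega_b * Psi_b + Psi_obs + c, with c enforcing a zero integral
    g2 = omega_b * psi_b + psi_obs
    g2 -= np.trapz(g2, x) / (x[-1] - x[0])
    # Wasserstein gradient: -div(rho0 * grad(g2))
    gW = -np.gradient(rho0 * np.gradient(g2, x), x)
    return g2, gW
```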
Minimization of JW
The minimizer of JW defined in
Eq. ()
is expected to be a good trade-off between both the observations and the
background with respect to the Wasserstein distance and to have good
properties, as shown in Fig. . It can be computed
through an iterative gradient-based descent method. Such methods start from a
control state ρ00 and step-by-step update it using an iteration of the
form
ρ0n+1=ρ0n-αndn,
where αn is a real number (the step) and dn is a function (the
descent direction), chosen such that
JW(ρ0n+1)<JW(ρ0n).
In gradient-based descent methods, dn can be equal to the gradient of
JW (steepest descent method) or to a function of
the gradient and dn-1 (conjugate gradient, CG; quasi-Newton methods; etc.).
Under sufficient conditions on (αn), the sequence (ρ0n)
converges to a local minimizer. See for more details.
We will now explain how to adapt the gradient descent to the optimal
transport framework. With the Wasserstein gradient Eq. (), the
descent of JW follows an iteration scheme of the
form
ρ0n+1=ρ0n+αndiv(ρ0n∇Φn),
with αn>0 to be chosen.
The inconveniences of this iteration are twofold. First, for ρ0n+1
to be nonnegative, αn may have to be very small. Second, the
supports of functions ρ0n+1 and ρ0n are the same. A more
transport-like iteration could be used instead, by making ρ0n
follow the geodesics in the Wasserstein space. All geodesics ρ(α)
starting from ρ0n are solutions of the set of partial differential
equations
∂αρ + div(ρ∇Φ) = 0,   ρ(α=0) = ρ0n,
∂αΦ + |∇Φ|²/2 = 0,
see Eq. (). Furthermore, two different
values of Φ(α=0) give two different geodesics. In the optimal
transport theory community, the geodesic ρ(α) starting from
ρ0n with initial condition Φ(α=0)=Φ0 would be written
with the following notation:
ρ(α)=(I-α∇Φ0)#ρ0n
(see Sect. 8.2 for more details).
For the gradient iteration, we choose the geodesic starting from ρ0n
with initial condition Φ(α=0)=Φn; i.e., using the optimal
transport notation ρ0n+1 is given by
ρ0n+1=(I-αn∇Φn)#ρ0n,
with αn>0 to be chosen. This descent is consistent with
Eq. () because
Eq. ()
is the first-order discretization of Eq. () with
Φ(α=0)=Φn. Therefore, Eqs. () and () are equivalent when αn→0.
The comparison of Eqs. () and () is shown
in Fig. for simple ρ0n and Φ.
This comparison depicts the usual advantage of using
Eq. () instead of
Eq. (): the former is always in P(Ω) and supports of functions change. Iteration
Eq. () is the one used in the following
numerical experiments.
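A possible one-dimensional implementation of the pushforward update (an illustrative sketch under our own discretization choices, using the fact that a monotone map transports the cumulative distribution function) reads:

```python
import numpy as np

def pushforward_1d(rho, grad_phi, alpha, x):
    """Compute (I - alpha*grad_phi)#rho on the grid x (1-D, monotone map assumed).

    The map T(x) = x - alpha*grad_phi(x) transports the CDF: F_new(T(x)) = F(x).
    The new density is recovered by interpolating F_new back onto the grid and
    differentiating."""
    dx = x[1] - x[0]
    T = x - alpha * grad_phi
    if np.any(np.diff(T) <= 0):
        raise ValueError("alpha too large: the map is not monotone")
    F = np.cumsum(rho) * dx                       # CDF of rho
    F_new = np.interp(x, T, F, left=0.0, right=F[-1])
    rho_new = np.gradient(F_new, x)
    return rho_new / np.trapz(rho_new, x)         # re-normalize to unit mass
```

Contrary to the additive update, the result stays nonnegative by construction and its support is allowed to move, at the price of requiring αn small enough for the map to remain monotone.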
Comparison of iteration Eqs. () and () with ρ0 of limited support and
Φ such that ∇Φ is constant on the support of ρ0.
Numerical illustrations
Let us recall that in the data assimilation vocabulary, the word “analysis”
refers to the minimizer of the cost function at the end of the data
assimilation process.
In this section the analyses resulting from the minimization of the
Wasserstein cost function defined previously in Eq. () are
presented, in particular when position errors occur. Results are compared
with the results given by the L2 cost function defined in
Eq. ().
The experiments are all one-dimensional and Ω is the interval [0,1].
A discretization of Ω is performed and involves 200 uniformly
distributed discretization points. A first, simple experiment uses a linear
operator G. In a second experiment, the operator is nonlinear.
Only a single variable is controlled. This variable ρ0 represents the
initial condition of an evolution problem. It is an element of P(Ω), and observations are also elements of P(Ω).
In this paper we chose to work in the twin experiments framework. In this
context the true state, denoted ρ0t, is known and used to generate the
observations: ρiobs=Gi(ρ0t) at various times
(ti)i=1..Nobs. Observations are first perfect, that is
noise-free and available everywhere in space. Then in Sect. , we will add noise in the observations. The background term
is supposed to have position errors only and no amplitude error. The data
assimilation process aims to recover a good estimation of the true state,
using the cost function involving the simulated observations and the
background term. The analysis obtained after convergence can then be compared
to the true state and effectiveness diagnostics can be made.
Both the Wasserstein Eq. () and L2
Eq. () cost functions are minimized through a steepest gradient
method. The L2 gradient is used to minimize the L2
cost function. Both the L2 and W gradients are used
for the Wasserstein cost function (cf. the Theorem of Sect. 
for the expressions of both gradients), giving respectively, with Φn := grad2JW(ρ0n), the iterations
ρ0n+1 = ρ0n − αn Φn,
ρ0n+1 = (I − αn∇Φn)#ρ0n.
The value of αn is chosen close to optimal using a line search
algorithm and the descent stops when the decrement of J between
two iterations is lower than 10⁻⁶. Algorithms using the iterations
described by Eqs. (29) and (30) will be referred to as (DG2) and (DG#), respectively.
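For completeness, a bare-bones version of the (DG#) loop could look as follows (a sketch under our own assumptions: a user-supplied cost function, a routine returning Φn = grad2JW(ρ0n), the pushforward of the previous sketch, and a simple backtracking line search standing in for the line search actually used):

```python
import numpy as np

def minimize_DG_sharp(rho0, cost, grad2_JW, pushforward, x,
                      alpha0=1.0, tol=1e-6, max_iter=500):
    """Steepest descent on J_W along Wasserstein geodesics (DG# iteration)."""
    J_old = cost(rho0)
    for _ in range(max_iter):
        phi = grad2_JW(rho0)                 # Phi^n = grad_2 J_W(rho_0^n)
        grad_phi = np.gradient(phi, x)
        alpha = alpha0
        # backtracking line search: shrink alpha until the cost decreases
        while alpha > 1e-12:
            rho_try = pushforward(rho0, grad_phi, alpha, x)
            J_try = cost(rho_try)
            if J_try < J_old:
                break
            alpha *= 0.5
        if J_old - J_try < tol:              # stop on a small decrement of J
            break
        rho0, J_old = rho_try, J_try
    return rho0
```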
Linear example
The first example involves linear evolution operators (Gi)i=1..Nobs, with the number of observations Nobs
equal to 5. Each operator Gi maps an initial condition
ρ0 to ρ(ti) according to the following continuity equation
defined on Ω = [0,1]:
∂tρ + u⋅∇ρ = 0, with u = 1.
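Since the velocity is constant, each Gi admits the exact solution ρ(ti, x) = ρ0(x − u ti); the sketch below (ours; the observation times and the shape of the true state are assumed values, not those of the paper) uses this to generate the twin-experiment observations.

```python
import numpy as np

def G_i(rho0, t_i, x, u=1.0):
    """Linear observation operator of the first experiment (illustrative):
    constant-speed advection, so rho(t_i, x) = rho0(x - u * t_i)."""
    return np.interp(x - u * t_i, x, rho0, left=0.0, right=0.0)

# Hypothetical twin-experiment setup with 5 observation times
x = np.linspace(0.0, 1.0, 200)
rho0_true = np.exp(-0.5 * ((x - 0.2) / 0.03) ** 2)
rho0_true /= np.trapz(rho0_true, x)
times = np.linspace(0.05, 0.25, 5)            # assumed values, not from the paper
observations = [G_i(rho0_true, t, x) for t in times]
```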
The operator Gi is linear. We control ρ0 only. The true
state ρ0t∈P(Ω) is a localized mass function, similar
to the background term ρ0b but located at a different place, as if it
had position errors. The true and background states as well as the
observations at various times are plotted in Fig. (top). The
computed analysis ρ0a,2 for the L2 cost function is
shown in Fig. (bottom left). This Figure also shows the
analysis ρ0a,W corresponding to both (DG2) and (DG#) algorithms minimizing the same Wasserstein
JW cost function.
As anticipated in the introduction (see, e.g., Fig. ), minimizing J2 leads to an analysis
ρ0a,2 that is the L2 average of the background and true
states (hence two small localized mass functions), while
JW leads to a satisfactorily shaped analysis
ρ0a,W located in-between the background and true states.
The amplitude issue of the analysis ρ0a,2 and the position issue of ρ0a,W are not corrected by
the time evolution of the model, as shown in Fig. (bottom right). At the end of the assimilation window, each
of the analyses still shows discrepancies with the observations.
(a) The twin experiments' ingredients are plotted, namely true
initial condition ρ0t, background term ρ0b and observations at
different times. (b) We plot the analyses obtained after each
proposed method, compared to ρ0b and ρ0t: ρ0a,2
corresponds to J2 while ρ0a,W to both (DG2) and (DG#). (c) Fields at final time, ρt,
ρa,2 and ρa,W, when taking respectively ρ0t,
ρ0a,2 and ρ0a,W as initial condition.
Both of the algorithms (DG2) and (DG#) give the
same analysis – the minimum of JW. However, the
convergence speeds are very different. The values of
JW throughout the algorithm are plotted in Fig. . It can be seen that (DG#) converges in a
couple of iterations while (DG2) needs more than 2000 iterations
to converge. It is a very slow algorithm because it does not provide the
steepest descent associated with the Wasserstein metric. The figure also shows
that, even with a conjugate gradient version of (DG2), the
descent is still quite slow (it needs ∼100 iterations to converge).
This comparison highlights the need for a well-suited inner product and, more
precisely, shows that the L2 inner product is not well fitted to the Wasserstein
distance.
Decreasing of JW through the iterations of
(DG#) and (DG2), and a conjugate gradient version
(CG) of (DG2).
As a conclusion of this first test case, we managed to write and minimize a
cost function which gives a relevant analysis, contrary to what we obtain
with the classical Euclidean cost function, in the case of position errors. We
also noticed that the success of the minimization of
JW was clearly dependent on the scalar product
choice.
Nonlinear example
Further results are shown when a nonlinear model is used in place of
G. The framework and procedure are the same as the first test
case (see the beginning of Sect. and
for details). The nonlinear model used is the shallow-water system described
by
∂th + ∂x(hu) = 0,
∂tu + u ∂xu + g ∂xh = 0,
subject to initial conditions h(0)=h0 and u(0)=u0, with reflective
boundary conditions (u|∂Ω=0), where the constant g is
the gravity acceleration. The variable h represents the water surface
elevation, and u is the current velocity. If h0 belongs to P(Ω), then the corresponding solution h(t) belongs to P(Ω).
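For illustration, one possible (not the authors') discretization of this system is a simple explicit Lax–Friedrichs scheme with reflective boundaries; the time step, the value of g and the node-based treatment of the boundary are assumptions of this sketch.

```python
import numpy as np

def shallow_water_step(h, u, dt, dx, g=9.81):
    """One Lax-Friedrichs step of h_t + (hu)_x = 0, u_t + u u_x + g h_x = 0
    with reflective boundaries (u = 0 on the boundary)."""
    # ghost cells: mirror h, reflect u
    H = np.concatenate(([h[0]], h, [h[-1]]))
    U = np.concatenate(([-u[0]], u, [-u[-1]]))
    q = H * U
    h_new = 0.5 * (H[2:] + H[:-2]) - dt / (2 * dx) * (q[2:] - q[:-2])
    u_new = (0.5 * (U[2:] + U[:-2])
             - dt / (2 * dx) * (U[1:-1] * (U[2:] - U[:-2]) + g * (H[2:] - H[:-2])))
    u_new[0] = u_new[-1] = 0.0               # enforce u = 0 on the boundary
    return h_new, u_new

def G(h0, u0, t_obs, dx, dt=1e-4):
    """Map the initial condition to the surface elevation h at time t_obs."""
    h, u = h0.copy(), u0.copy()
    for _ in range(int(round(t_obs / dt))):
        h, u = shallow_water_step(h, u, dt, dx)
    return h
```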
(a) Ingredients of the second experiment: true initial condition
h0t, background h0b and 2 of the 10 observations at different times.
(b) The true and background initial conditions are shown and also
the analyses h0a,2 and h0a,W corresponding respectively to the
Euclidean and Wasserstein cost functions. On the right we show the same plots
(except the background one) but at the end of the assimilation window.
(a) Plot of an example of noise-free observations used in the Sect. experiment, equal to the true surface elevation ht at a
given time. Plot of the corresponding observations with added noise, as
described in Sect. . (b) Analyses from the
L2 cost function using perfect observations and observations with
noise. (c) Likewise with the Wasserstein cost function.
The true state is (h0t,u0t), where velocity u0t is equal to 0 and
surface elevation h0t is a given localized mass function. The initial
velocity field is supposed to be known and therefore not included in the
control vector. Only h0 is controlled, using Nobs=5 direct
observations of h and a background term h0b, which is also a localized mass
function like h0t.
Data assimilation is performed by minimizing either the J2 or the
JW cost functions described above. Thanks to the
experience gained during the first experiment, only the (DG#)
algorithm is used for the minimization of JW.
In Fig. (top) we present the initial surface elevation h0t, the background
h0b, as well as 2 of the 10 observations used for the experiment. In Fig. (bottom left), the analyses
corresponding to J2 and
JW are shown: h0a,2 and h0a,W,#.
Analysis h0a,2 is close to the L2 average of the true and
background states, even at time t>0, while h0a,W,# lies close to the
Wasserstein average between the background and true states, and hence has the
same shape as them (see Fig. ).
Figure (bottom right) shows that, at the end of the assimilation
window, the surface elevation ha,W,#=G(h0a,W,#) is still
more realistic than ha,2=G(h0a,2), when compared to the
true state ht=G(h0t).
The conclusion of this second test case is that, even with nonlinear models,
our Wasserstein-based algorithm can give interesting results in the case of
position errors.
Robustness to observation noise
In this section, noise in position and shape has been added to the
observations. This type of noise typically occurs in images from satellites.
For example, Fig. (top) shows an observation from the
previous experiment where the peaks have been displaced and resized randomly. For
each structure of each observation, the displacements and amplitude changes
are independent and uncorrelated. This perturbation is done so that the total
mass is preserved.
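As an illustration of this perturbation (our own sketch; representing each observation as a list of separate localized structures is a simplifying assumption), one could write:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_observation(structures, x, max_shift=0.05, max_scale=0.2):
    """Illustrative position/amplitude noise: each localized structure of an
    observation is displaced and rescaled independently, and the sum is
    re-normalized so that the total mass of the observation is preserved.
    `structures` is a list of 1-D fields whose sum is the clean observation."""
    total_mass = np.trapz(np.sum(structures, axis=0), x)
    noisy = np.zeros_like(x)
    for s in structures:
        shift = rng.uniform(-max_shift, max_shift)        # position noise
        scale = 1.0 + rng.uniform(-max_scale, max_scale)  # amplitude noise
        noisy += scale * np.interp(x - shift, x, s, left=0.0, right=0.0)
    return noisy * total_mass / np.trapz(noisy, x)        # preserve total mass
```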
Analyses of this noisy experiment using L2 Eq. () and
Wasserstein Eq. () cost functions are compared to analyses from
the last experiment where no noise was present.
For the L2 cost function, surface elevation analyses h0a,2
are shown in Fig. (bottom left). We see that adding
such a noise in the observations degrades the analysis. In particular, the
right peak (associated with the observations) is more widely spread: this is a
consequence of the fact that the L2 distance is a local-in-space
distance.
For the Wasserstein cost function, analyses h0a,W are shown in Fig. (bottom right). The analysis does not change much
with the presence of noise and remains similar to the one obtained in the
previous experiment. This is a consequence of a property of the Wasserstein
distance: the Wasserstein barycenter of several Gaussians is a Gaussian with
averaged position and variance (see the example in Sect. ).
This example shows that the Wasserstein cost function is more robust than
L2 to such noise. This is quite a valuable feature for realistic
applications.
Conclusions
We showed through some examples that, if not taken into account, position
errors can lead to unrealistic initial conditions when using classical
variational data assimilation methods. Indeed, such methods use the Euclidean
distance which can behave poorly under position errors. To tackle this issue,
we proposed instead the use of the Wasserstein distance to define the related
cost function. The associated minimization algorithm was discussed and we
showed that using descent iterations following Wasserstein geodesics leads to
more consistent results.
On academic examples the corresponding cost function produces an analysis
lying close to the Wasserstein average between the true and background
states; it therefore has the same shape as them and is well suited to correcting
position errors. This also gives more realistic predictions. This is a
preliminary study and some issues have yet to be addressed for realistic
applications, such as relaxing the constant-mass and positivity hypotheses
and extending the problem to 2-D applications.
Also, the interesting question of transposing this work into the filtering
community (Kalman filter, EnKF; particle filters; etc.) raises the issue of
writing a probabilistic interpretation of the Wasserstein cost function,
which is out of the scope of our study for now.
In particular the important theoretical aspect of the representation of error
statistics still needs to be thoroughly studied. Indeed classical
implementations of variational data assimilation generally make use of
L2 distances weighted by inverses of error covariance matrices.
An analogy with Bayes' formula allows the minimization of the
cost function to be interpreted as a maximum likelihood estimation. Such an analogy is not
straightforward with Wasserstein distances. Some possible research directions
are given in but this is beyond the scope of this
paper. The ability to account for error statistics would also open the way
for a proper use of the Wasserstein distance in Kalman-based data
assimilation techniques.