Optimal transport for variational data assimilation

Abstract. Data assimilation methods usually evaluate observation-model misfits using weighted L2 distances. However, these distances are not well suited when observed features are present in the model with position error. In this context, the Wasserstein distance, stemming from optimal transport theory, is more relevant. This paper proposes to adapt variational data assimilation to the use of such a measure. It provides a short introduction to optimal transport theory and discusses the importance of a proper choice of scalar product to compute the cost function gradient. It also extends the discussion to the way the descent is performed within the minimisation process. These algorithmic changes are tested on a non-linear shallow-water model, leading to the conclusion that optimal transport-based data assimilation is promising for capturing position errors in the model trajectory.


Introduction
Understanding and forecasting the evolution of a given system is a crucial topic in an ever-increasing number of application domains. To achieve this goal, one can rely on multiple sources of information, namely observations of the system, a numerical model describing its behavior and additional a priori knowledge such as statistical information or previous forecasts. To combine these heterogeneous sources of information it is common practice to use so-called data assimilation methods (e.g., see the reference books Lewis et al., 2006; Law et al., 2015; Asch et al., 2016). They have multiple aims: finding the initial and/or boundary conditions, parameter estimation, reanalysis, and so on. They are extensively used in numerical weather forecasting, for instance (e.g., see reviews in the books Park and Xu, 2009, 2013).
The estimation of the different elements to be sought, the control vector, is performed through data assimilation by comparing the observations with their model counterparts. The control vector should be adjusted such that its model outputs fit the observations, while taking into account that these observations are imperfect and corrupted by noise and errors.
Data assimilation methods are divided into three distinct classes. First, there is statistical filtering, based on Kalman filters. Then, there are variational data assimilation methods, based on optimal control theory. More recently, hybrids of both approaches have been developed (Hamill and Snyder, 2000; Buehner, 2005; Bocquet and Sakov, 2014). In this paper we focus on variational data assimilation. It consists in minimizing a cost function written as the distance between the observations and their model counterparts. A Tikhonov regularization term is also added to the cost function, as a distance between the control vector and a background state carrying a priori information.
Thus, the cost function contains the misfit between the data (a priori and observations) and their control and model counterparts. Minimizing the cost function aims at reaching a compromise in which these errors are as small as possible. The errors can be decomposed into amplitude and position errors. Position errors mean that the structural elements are present in the data, but misplaced. Some methods have been proposed in order to deal with position errors (Hoffman and Grassotti, 1996; Ravela et al., 2007). These involve a preprocessing step which consists in displacing the different data so that they fit better with each other. Then the data assimilation is performed on those displaced data.
A distance has to be chosen in order to compare the different data and measure the misfits. Usually, a Euclidean distance is used, often weighted to take into account the statistical errors. But Euclidean distances have trouble capturing position errors. This is illustrated in Fig. 1, which shows two curves ρ_0 and ρ_1. The second curve ρ_1 can be seen as the first one ρ_0 with position error. The minimizer of the cost function

\|\rho - \rho_0\|_2^2 + \|\rho - \rho_1\|_2^2

is given by ρ* = (ρ_0 + ρ_1)/2, plotted with violet stars in Fig. 1. It is the average of the curves ρ_0 and ρ_1 with respect to the L2 distance. As we can see in Fig. 1, it does not correct for position error, but instead creates two smaller-amplitude curves. We investigate in this article the idea of using instead a distance stemming from optimal transport theory, the Wasserstein distance, which can take position errors into account. In Fig. 1 we plot (green dots) the average of ρ_0 and ρ_1 with respect to the Wasserstein distance. Contrary to the L2 average, the Wasserstein average is what we want it to be: same shape, same amplitude, located in-between. It conserves the shape of the data. This is what we want to achieve when dealing with position errors.
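To make this contrast concrete, both averages can be computed in one dimension with a few lines of NumPy. The sketch below is our own illustration (the grid, curve shapes and function names are ours, not the paper's); it exploits the fact that, in 1-D, the Wasserstein barycentre is obtained by averaging the quantile functions F_0^{-1} and F_1^{-1}:

```python
import numpy as np

def l2_average(rho0, rho1):
    """Pointwise average: minimiser of ||rho - rho0||^2 + ||rho - rho1||^2."""
    return 0.5 * (rho0 + rho1)

def wasserstein_average(rho0, rho1, x):
    """1-D Wasserstein barycentre: average the quantile functions F0^-1, F1^-1."""
    s = np.linspace(0.0, 1.0, 501)[1:-1]           # probability levels in (0, 1)
    F0 = np.cumsum(rho0); F0 /= F0[-1]             # discrete CDFs
    F1 = np.cumsum(rho1); F1 /= F1[-1]
    qm = 0.5 * (np.interp(s, F0, x) + np.interp(s, F1, x))  # averaged quantiles
    Fm = np.interp(x, qm, s, left=0.0, right=1.0)  # CDF of the barycentre
    rho = np.gradient(Fm, x)                       # back to a density
    return rho / np.trapz(rho, x)

x = np.linspace(0.0, 1.0, 200)
bump = lambda m: np.exp(-0.5 * ((x - m) / 0.03) ** 2)
rho0 = bump(0.3) / np.trapz(bump(0.3), x)          # curve with position error
rho1 = bump(0.7) / np.trapz(bump(0.7), x)
rho_l2 = l2_average(rho0, rho1)                    # two half-amplitude bumps
rho_w = wasserstein_average(rho0, rho1, x)         # one bump, shape preserved
print(x[np.argmax(rho_w)])                         # peak lies in-between, ~0.5
```

The L2 average keeps both bumps at half amplitude, while the Wasserstein average is a single bump of the original shape, located in-between.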
Optimal transport theory was pioneered by Monge (1781). He searched for the optimal way of displacing sand piles onto holes of the same volume, minimizing the total cost of displacement. This can be seen as a transportation problem between two probability measures. A modern presentation can be found in Villani (2003) and will be discussed in Sect. 2.2.
Optimal transport has a wide spectrum of applications: from pure mathematical analysis on Riemannian spaces to applied economics; from functional inequalities (Cordero-Erausquin et al., 2004) to the semi-geostrophic equations (Cullen and Gangbo, 2001); and in astrophysics (Brenier et al., 2003), medicine (Ratner et al., 2015), crowd motion (Maury et al., 2010) and urban planning (Buttazzo and Santambrogio, 2005). From optimal transport theory several distances can be derived, the most widely known being the Wasserstein distance (denoted W), which is sensitive to misplaced features and is the primary focus of this paper. This distance is also widely used in computer vision, for example in classification of images (Rubner et al., 1998, 2000), interpolation (Bonneel et al., 2011) or movie reconstruction (Delon and Desolneux, 2010). More recently, Farchi et al. (2016) used the Wasserstein distance to compare observations and model simulations in an air pollution context, which is a first step toward data assimilation.
An actual use of optimal transport in variational data assimilation was proposed by Ning et al. (2014) to tackle model error. The authors use the Wasserstein distance instead of the classical L2 norm for model error control in the cost function, and they report promising results. Our contribution is in essence similar to theirs in that the Wasserstein distance is proposed in place of the L2 distance. Looking more closely, we investigate a different question, namely the idea of using the Wasserstein distance to measure the observation misfit. Also, we underline and investigate the impact of the choice of the scalar product, the gradient formulation and the minimization algorithm on the assimilation performance, which is not discussed in Ning et al. (2014). These particularly subtle mathematical considerations are indeed crucial for the convergence of the algorithm, as will be shown in this paper, and are our main contribution.
The goal of the paper is to perform variational data assimilation with a cost function written with the Wasserstein distance. The approach may be extended to other types of data assimilation methods, such as filtering methods, but that largely exceeds the scope of this paper.
The present paper is organized as follows: first, in Sect. 2, variational data assimilation as well as the Wasserstein distance are defined, and the required ingredients are presented. The core of our contribution lies in Sect. 3: we first present the Wasserstein cost function and then propose two choices for its gradient, as well as two optimization strategies for the minimization. In Sect. 4 we present numerical illustrations, discuss the choices for the gradients and compare the optimization methods. Some difficulties related to the use of optimal transport are also pointed out and solutions are proposed.

Materials and methodology
This section presents the concepts and methods of variational data assimilation on the one hand, and the concepts, principles and main theorems of optimal transport and the Wasserstein distance on the other hand. Section 3 will combine both worlds and constitutes the core of our original contribution.

Variational data assimilation
This paper focuses on variational data assimilation in the framework of initial state estimation. Let us assume that a system state is described by a variable x, denoted x_0 at initial time. We are also given observations y^obs of the system, which might be indirect, incomplete and approximate. The initial state and the observations are linked by an operator G, mapping the system initial state x_0 to the observation space, so that G(x_0) and y^obs belong to the same space. Usually G is defined using two other operators, namely the model M, which gives the model state as a function of the initial state, and the observation operator H, which maps the system state to the observation space, such that G = H ∘ M. Data assimilation aims to find a good estimate of x_0 using the observations y^obs and the knowledge of the operator G. Variational data assimilation methods do so by finding the minimizer x_0 of the misfit function J (the cost function) between the observations y^obs and their computed counterparts G(x_0),

J(x_0) = d_R(G(x_0), y^{obs}),

with d_R some distance to be defined. Generally, this problem is ill-posed. For the minimizer of J to be unique, a background term is added and acts like a Tikhonov regularization. This background term is generally expressed as the distance to a background state x^b, which contains a priori information. The actual cost function then reads

J(x_0) = d_B(x_0, x^b) + d_R(G(x_0), y^{obs}),    (1)

with d_B another distance to be specified. The control of x_0 is done by the minimization of J. Such minimization is generally carried out numerically using gradient descent methods. Section 3.3 will give more details about the minimization process.
The distances d_R to the observations and d_B to the background term have to be chosen in this formulation. Usually, Euclidean distances (L2 distances, potentially weighted) are chosen, giving the following Euclidean cost function:

J_2(x_0) = \|x_0 - x^b\|_2^2 + \|G(x_0) - y^{obs}\|_2^2,    (2)

with ‖·‖_2 the L2 norm. Euclidean distances, such as the L2 distance, are local metrics. In the following we will investigate the use of a non-local metric, the Wasserstein distance W, in place of d_R and d_B in Eq. (1). Such a cost function will be presented in Sect. 3. The Wasserstein distance is presented and defined in the following subsection.

Optimal transport and Wasserstein distance
The essentials of optimal transport theory and Wasserstein distance required for data assimilation are presented.
We define, in this order, the space of mass functions where the Wasserstein distance is defined, then the Wasserstein distance and finally the Wasserstein scalar product, a key ingredient for variational assimilation.

Mass functions
We consider the case where the observations can be represented as positive fields that we will call "mass functions". A mass function is a nonnegative function of space. For example, a grey-scale image is a mass function; it can be seen as a function from space to the interval [0, 1], where 0 encodes black and 1 encodes white.

Definition
Let Ω be a closed, convex, bounded subset of R^d and let the set of mass functions P(Ω) be the set of nonnegative functions of total mass 1:

P(\Omega) = \{ \rho : \Omega \to \mathbb{R}_+ \mid \int_\Omega \rho(x)\, dx = 1 \}.

Let us remark here that, in the mathematical framework of optimal transport, mass functions are continuous and called "probability densities". In the data assimilation framework the concept of probability densities is mostly used to represent errors. Here, the positive functions we consider actually serve as observations or state vectors, so we chose to call them mass functions to avoid any possible confusion with state or observation error probability distributions.

Wasserstein distance
Given the set of all transportations between two mass functions, the optimal transport is the one minimizing the kinetic energy. A transportation between two mass functions ρ_0 and ρ_1 is given by a time path ρ(t, x) such that ρ(t = 0) = ρ_0 and ρ(t = 1) = ρ_1, together with a velocity field v(t, x) such that the continuity equation holds:

\partial_t \rho + \mathrm{div}(\rho v) = 0.

Such a path ρ(t) can be seen as interpolating ρ_0 and ρ_1. For ρ(t) to stay in P(Ω), a sufficient condition is that the velocity field v(t, x) be tangent to the domain boundary, meaning that

v(t, x) \cdot n(x) = 0 \quad \text{for } x \in \partial\Omega,

with n the outward normal to ∂Ω. Let us be clear here that the time t is fictitious and has no relationship whatsoever with the physical time of data assimilation. It is purely used to define the Wasserstein distance and some mathematically related objects.
The Wasserstein distance W is hence the minimum, in terms of kinetic energy, among all the transportations between ρ_0 and ρ_1:

W(\rho_0, \rho_1)^2 = \min_{(\rho, v) \in C(\rho_0, \rho_1)} \int_0^1 \int_\Omega \rho(t, x)\, |v(t, x)|^2 \, dx \, dt,

with C(ρ_0, ρ_1) the set of continuous transportations between ρ_0 and ρ_1 described by a velocity field v tangent to the boundary of the domain. This definition of the Wasserstein distance is the Benamou-Brenier formulation (Benamou and Brenier, 2000). There exist other definitions, based on the transport map or on transference plans, but this is slightly out of the scope of this article. See the introduction of Villani (2003) for more details.
A remarkable property is that the optimal velocity field v is of the form v(t, x) = ∇Φ(t, x), with Φ following the Hamilton-Jacobi equation (Benamou and Brenier, 2000)

\partial_t \Phi + \frac{1}{2} |\nabla \Phi|^2 = 0.    (8)

The equation of the optimal ρ is the continuity equation using this velocity field. Moreover, the function Ψ defined by

\Psi(x) = -\Phi(t = 0, x)    (9)

is said to be the Kantorovich potential of the transport between ρ_0 and ρ_1. It is a useful feature in the derivation of the Wasserstein cost function presented in Sect. 3. A remarkable property of the Kantorovich potential allows the computation of the Wasserstein distance through the Benamou-Brenier formula (see Benamou and Brenier, 2000, or Villani, 2003, Theorem 8.1):

W(\rho_0, \rho_1)^2 = \int_\Omega |\nabla \Psi(x)|^2 \, \rho_0(x) \, dx.    (10)

Example
The classical example for optimal transport is the transport of Gaussian mass functions. For Ω = R^d, let us consider two Gaussian mass functions ρ_i of mean μ_i and variance σ_i^2, for i = 0 and i = 1. Then the optimal transport ρ(t) between ρ_0 and ρ_1 is a translation-dilation of ρ_0 toward ρ_1. More precisely, ρ(t) is a Gaussian mass function whose mean is μ(t) = (1 − t)μ_0 + tμ_1 and whose standard deviation is σ(t) = (1 − t)σ_0 + tσ_1. Finally, a few words should be said about the numerical computation of the Wasserstein distance. In one dimension, the optimal transport ρ(t, x) is easy to compute as the Kantorovich potential has an exact characterization: the Kantorovich potential Ψ of the transport between two mass functions ρ_0 and ρ_1 is the only function such that

F_1(x - \nabla \Psi(x)) = F_0(x),    (11)

with F_i the cumulative distribution function of ρ_i. Numerically, we fix x and solve Eq. (11) iteratively, using a binary search, to find ∇Ψ. Then we obtain Ψ through numerical integration. Finally, Eq. (10) gives the Wasserstein distance.
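On a grid, combining Eqs. (10) and (11) gives the well-known 1-D closed form W(ρ_0, ρ_1)^2 = ∫_0^1 (F_0^{-1}(s) − F_1^{-1}(s))^2 ds, so the binary search can even be bypassed. A minimal NumPy sketch (our own illustration, not the paper's code):

```python
import numpy as np

def wasserstein2_1d(rho0, rho1, x):
    """W_2 distance between two 1-D mass functions, via the closed form
    W_2^2 = \\int_0^1 (F_0^{-1}(s) - F_1^{-1}(s))^2 ds."""
    s = np.linspace(0.0, 1.0, 2001)[1:-1]   # probability levels in (0, 1)
    F0 = np.cumsum(rho0); F0 /= F0[-1]      # discrete CDFs
    F1 = np.cumsum(rho1); F1 /= F1[-1]
    q0 = np.interp(s, F0, x)                # quantile functions
    q1 = np.interp(s, F1, x)
    return np.sqrt(np.mean((q0 - q1) ** 2))

x = np.linspace(0.0, 1.0, 400)
bump = lambda m: np.exp(-0.5 * ((x - m) / 0.02) ** 2)
rho0 = bump(0.3) / np.trapz(bump(0.3), x)
rho1 = bump(0.6) / np.trapz(bump(0.6), x)
# for two equal-shape mass functions, W_2 is the distance between the peaks
print(wasserstein2_1d(rho0, rho1, x))   # ~0.3
```

This matches the Gaussian example above: for σ_0 = σ_1, the distance reduces to |μ_0 − μ_1|.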
For two- or three-dimensional problems, there exists no general formula for the Wasserstein distance, and more complex algorithms have to be used, such as the (iterative) primal-dual algorithm (Papadakis et al., 2014) or the semi-discrete one (Mérigot, 2011). In the former, an approximation of the Kantorovich potential can be read directly from the so-called dual variable.

Wasserstein inner product
A scalar product between tangent vectors is required for data assimilation and optimization: as we will recall later, the choice of scalar product defines the value of the gradient. This paper considers the classical L2 scalar product as well as the one associated with the Wasserstein distance. A scalar product defines the angle and norm of vectors tangent to P(Ω) at a point ρ_0. A tangent vector at ρ_0 is the derivative of a curve ρ(t) passing through ρ_0. As a curve ρ(t) can be described by a continuity equation, the space of tangent vectors, the tangent space, is formally defined by (cf. Otto, 2001)

T_{\rho_0} P = \{ \eta = -\mathrm{div}(\rho_0 \nabla \Phi), \ \Phi : \Omega \to \mathbb{R} \text{ with } \nabla \Phi \cdot n = 0 \text{ on } \partial\Omega \}.

Let us first recall the Euclidean, or L2, scalar product:

\langle \eta_1, \eta_2 \rangle_2 = \int_\Omega \eta_1(x)\, \eta_2(x) \, dx.

The Wasserstein inner product between η_1 = −div(ρ_0∇Φ_1) and η_2 = −div(ρ_0∇Φ_2) is

\langle \eta_1, \eta_2 \rangle_{W, \rho_0} = \int_\Omega \rho_0(x)\, \nabla \Phi_1(x) \cdot \nabla \Phi_2(x) \, dx.

One has to note that this inner product depends on ρ_0 ∈ P(Ω). Finally, the norm associated with a tangent vector η = −div(ρ_0∇Φ) ∈ T_{ρ_0}P is

\| \eta \|_{W, \rho_0}^2 = \int_\Omega \rho_0(x)\, |\nabla \Phi(x)|^2 \, dx,

hence the kinetic energy of the small displacement η. This makes the link between this inner product and the Wasserstein distance.
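In 1-D, the elliptic relation η = −(ρ_0 Φ')' can be inverted by a single integration (using the no-flux condition ρ_0 Φ' = 0 at the boundary), which gives a direct way of evaluating the Wasserstein norm of a tangent vector. The sketch below is our own illustration (discretization choices are ours); it checks the norm on an infinitesimal translation of ρ_0, whose kinetic energy should equal the total mass, i.e., 1:

```python
import numpy as np

def wasserstein_norm_sq(eta, rho0, x):
    """||eta||_{W,rho0}^2 for eta = -(rho0 * Phi')' in 1-D.
    One integration gives rho0 * Phi' = -int_0^x eta, hence
    ||eta||^2 = int (int_0^x eta)^2 / rho0 dx."""
    dx = x[1] - x[0]
    cum = np.cumsum(eta) * dx            # int_0^x eta (eta has zero mean)
    return np.trapz(cum ** 2 / rho0, x)

x = np.linspace(0.0, 1.0, 2000)
rho0 = np.exp(-0.5 * ((x - 0.5) / 0.05) ** 2)
rho0 /= np.trapz(rho0, x)
eta = -np.gradient(rho0, x)              # infinitesimal unit translation of rho0
print(wasserstein_norm_sq(eta, rho0, x)) # ~1.0: kinetic energy of the shift
```

For this η the potential satisfies Φ' = 1 on the support of ρ_0, so the norm is ∫ρ_0 = 1, as the code confirms.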
This section is our main contribution. First, we will consider the Wasserstein distance to compute the observation term of the cost function; second, we will discuss the choices of the scalar product and the gradient descent method and their impact on the efficiency of the assimilation algorithm.

Wasserstein cost function
In the framework of Sect. 2.2 we define the data assimilation cost function using the Wasserstein distance. For this cost function to be well defined we assume that the control variables belong to P(Ω) and that the observation variables belong to another space P(Ω_o), with Ω_o a closed, convex, bounded subset of R^d. Let us recall that this means that they are all nonnegative functions with integral equal to 1. Having elements with integral 1 (or constant integral) may seem restrictive. Removing this constraint is possible by using a modified version of the Wasserstein distance, presented for example in Chizat et al. (2015) or Farchi et al. (2016). For simplicity we do not consider this possible generalization, and all data have the same integral. The cost function Eq. (1) is rewritten using the Wasserstein distance defined in Sect. 2.2:

J_W(x_0) = \omega_b W(x_0, x^b)^2 + \sum_{i=1}^{N_{obs}} W(G_i(x_0), y_i^{obs})^2,    (16)

with G_i : P(Ω) → P(Ω_o) the observation operator computing the counterpart of y_i^obs from x_0, and ω_b a scalar weight associated with the background term.
The variables x_0 and y_i^obs may be vectors whose components are functions belonging to P(Ω) and P(Ω_o) respectively. The Wasserstein distance between two such vectors is the sum of the distances between their components. The remainder of the article is easily adaptable to this case, but for simplicity we set x_0 = ρ_0 ∈ P(Ω) and y_i^obs = ρ_i^obs ∈ P(Ω). The Wasserstein cost function Eq. (16) then becomes

J_W(\rho_0) = \omega_b W(\rho_0, \rho_0^b)^2 + \sum_{i=1}^{N_{obs}} W(G_i(\rho_0), \rho_i^{obs})^2.    (17)

As for the classical L2 cost function, J_W is convex with respect to the Wasserstein distance in the linear case and has a unique minimizer. In the nonlinear case, the uniqueness of the minimizer relies on the regularization term ω_b W(ρ_0, ρ_0^b)^2. To find the minimum of J_W, a gradient descent method is applied; it is presented in Sect. 3.3. As this type of algorithm requires the gradient of the cost function, the computation of the gradient of J_W is the focus of the next section.

Gradient of J_W
The gradient of J_W at ρ_0 with respect to a given scalar product ⟨·, ·⟩ is defined through the first-order expansion

J_W(\rho_0 + \varepsilon \eta) = J_W(\rho_0) + \varepsilon \langle \mathrm{grad}\, J_W(\rho_0), \eta \rangle + o(\varepsilon),

where ⟨·, ·⟩ represents the scalar product. The scalar product is not unique; as a consequence, neither is the gradient.
In this work we decided to study and compare two choices for the scalar product: the natural one, W, and the usual one, L2. W is clearly the ideal candidate for a good scalar product. However, we also decided to study the L2 scalar product because it is the usual choice in optimization. A numerical comparison is done in Sect. 4. The associated gradients are respectively denoted grad_W J_W(ρ_0) and grad_2 J_W(ρ_0), and are the only elements of the tangent space T_{ρ_0}P at ρ_0 ∈ P(Ω) such that

\forall \eta \in T_{\rho_0} P, \quad \lim_{\varepsilon \to 0} \frac{J_W(\rho_0 + \varepsilon \eta) - J_W(\rho_0)}{\varepsilon} = \langle \mathrm{grad}\, J_W(\rho_0), \eta \rangle.

In the notations, the term "grad" is used for the gradient of a function, while the spatial gradient is denoted by the nabla sign ∇. The gradients of J_W are elements of T_{ρ_0}P and hence functions of space.
The following theorem allows the computation of both gradients of J W .

Theorem
For i ∈ {1, ..., N_obs}, let Ψ_i be the Kantorovich potential (see Eq. 9) of the transport between G_i(ρ_0) and ρ_i^obs. Let Ψ_b be the Kantorovich potential of the transport between ρ_0 and ρ_0^b. Then

\mathrm{grad}_2 J_W(\rho_0) = \omega_b \Psi_b + \sum_{i=1}^{N_{obs}} G_i^*[\rho_0](\Psi_i) + c,

with c such that the integral of grad_2 J_W(ρ_0) is zero, and G_i^* the adjoint of G_i with respect to the L2 inner product (see the definition reminder below). Assuming that grad_2 J_W(ρ_0) satisfies the no-flux boundary condition (see the comment about this assumption below), the gradient with respect to the Wasserstein inner product is

\mathrm{grad}_W J_W(\rho_0) = -\mathrm{div}\left( \rho_0 \, \nabla\, \mathrm{grad}_2 J_W(\rho_0) \right).    (21)

(A proof of this theorem can be found in Appendix A.) The adjoint G_i^*[ρ_0] is defined by the classical equality

\langle G_i^*[\rho_0](y), \eta \rangle_2 = \langle y, G_i'[\rho_0](\eta) \rangle_2,

where G_i'[ρ_0] is the tangent model, defined by

G_i'[\rho_0](\eta) = \lim_{\varepsilon \to 0} \frac{G_i(\rho_0 + \varepsilon \eta) - G_i(\rho_0)}{\varepsilon}.

Note that the no-flux boundary condition for grad_2 J_W(ρ_0), that is,

\nabla\, \mathrm{grad}_2 J_W(\rho_0) \cdot n = 0 \quad \text{on } \partial\Omega,

is not necessarily satisfied. The Kantorovich potentials respect this condition: their spatial gradients are velocities tangent to the boundary (see the end of Sect. 2.2). But the condition may not be conserved through the mapping with the adjoint model G_i^*[ρ_0]. In the case where G_i^*[ρ_0] does not preserve this condition, the Wasserstein gradient is not of integral zero. A possible workaround is to use a product coming from the unbalanced Wasserstein distance of Chizat et al. (2015).
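In the simplest setting, a single direct observation with G = Id and no background term, grad_2 J_W reduces (up to the constant c) to the Kantorovich potential Ψ itself, and Eq. (11) gives its spatial gradient in closed form. The sketch below is our own 1-D illustration under these assumptions (function names and the test case are ours):

```python
import numpy as np

def kantorovich_grad(rho_from, rho_to, x):
    """grad Psi in 1-D, from Eq. (11): grad Psi(x) = x - F_to^{-1}(F_from(x))."""
    dx = x[1] - x[0]
    Ff = np.cumsum(rho_from) * dx; Ff /= Ff[-1]
    Ft = np.cumsum(rho_to) * dx;  Ft /= Ft[-1]
    return x - np.interp(Ff, Ft, x)

def grad_W(rho0, grad_psi, x):
    """Wasserstein gradient, Eq. (21): -div(rho0 * grad(grad_2 J_W))."""
    return -np.gradient(rho0 * grad_psi, x)

x = np.linspace(0.0, 1.0, 1000)
bump = lambda m: np.exp(-0.5 * ((x - m) / 0.04) ** 2)
rho0 = bump(0.4) / np.trapz(bump(0.4), x)      # current control state
rho_obs = bump(0.6) / np.trapz(bump(0.6), x)   # single direct observation

gp = kantorovich_grad(rho0, rho_obs, x)   # spatial gradient of grad_2 J_W
gw = grad_W(rho0, gp, x)                  # Wasserstein gradient, integral ~0
print(gp[500])                            # ~ -0.2 on the support of rho0
```

For these two equal-shape bumps the optimal map is a translation by 0.2, so ∇Ψ ≈ −0.2 on the support of ρ_0, and the Wasserstein gradient integrates to (numerically) zero, consistent with the no-flux condition.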

Minimization of J W
The minimizer of J_W defined in Eq. (17) is expected to be a good trade-off between the observations and the background with respect to the Wasserstein distance, and to have good properties, as shown in Fig. 1. It can be computed through an iterative gradient-based descent method. Such methods start from a control state ρ_0^0 and update it step by step using an iteration of the form

\rho_0^{n+1} = \rho_0^n - \alpha_n d_n,

where α_n is a real number (the step) and d_n is a function (the descent direction), chosen such that J_W(ρ_0^{n+1}) < J_W(ρ_0^n). In gradient-based descent methods, d_n can be equal to the gradient of J_W (steepest descent method) or to a function of the gradient and d_{n−1} (conjugate gradient, CG; quasi-Newton methods; etc.). Under sufficient conditions on (α_n), the sequence (ρ_0^n) converges to a local minimizer. See Nocedal and Wright (2006) for more details.
We will now explain how to adapt the gradient descent to the optimal transport framework. With the Wasserstein gradient Eq. (21), the descent of J_W follows an iteration scheme of the form

\rho_0^{n+1} = \rho_0^n + \alpha_n \, \mathrm{div}(\rho_0^n \nabla \Phi_n), \quad \Phi_n := \mathrm{grad}_2 J_W(\rho_0^n),    (25)

with α_n > 0 to be chosen. The inconveniences of this iteration are twofold. First, for ρ_0^{n+1} to be nonnegative, α_n may have to be very small. Second, the supports of ρ_0^{n+1} and ρ_0^n are the same. A more transport-like iteration can be used instead, by making ρ_0^n follow the geodesics in the Wasserstein space. All geodesics ρ(α) starting from ρ_0^n are solutions of the set of partial differential equations

\partial_\alpha \rho + \mathrm{div}(\rho \nabla \Phi) = 0, \qquad \partial_\alpha \Phi + \frac{1}{2} |\nabla \Phi|^2 = 0,    (26)

see Eq. (8). Furthermore, two different values of Φ(α = 0) give two different geodesics. Figure 2 compares iterations (25) and (28) for a ρ_0 of limited support, with ∇Φ constant on the support of ρ_0.
For the gradient iteration, we choose the geodesic starting from ρ_0^n with initial condition Φ(α = 0) = −Φ_n; i.e., using the pushforward notation of optimal transport, ρ_0^{n+1} is given by

\rho_0^{n+1} = (\mathrm{Id} - \alpha_n \nabla \Phi_n)_\# \, \rho_0^n,    (28)

with α_n > 0 to be chosen. This descent is consistent with Eq. (25), because Eq. (25) is the first-order discretization of Eq. (26) with Φ(α = 0) = −Φ_n. Therefore, Eqs. (28) and (25) are equivalent when α_n → 0. The comparison of Eqs. (28) and (25) is shown in Fig. 2 for simple ρ_0^n and Φ. This comparison depicts the usual advantage of using Eq. (28) instead of Eq. (25): the former always stays in P(Ω), and the supports of the iterates can change. Iteration Eq. (28) is the one used in the following numerical experiments.
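The difference between the two iterations can be checked numerically in 1-D. In the sketch below (our own illustration; the constant ∇Φ and the deliberately large step α = 1 are chosen to make the contrast stark), the additive update Eq. (25) produces negative values, while the pushforward Eq. (28), realised through the CDF relation F_{n+1}(x − α∇Φ(x)) = F_n(x), simply translates the mass:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 1000)
dx = x[1] - x[0]
rho = np.exp(-0.5 * ((x - 0.4) / 0.04) ** 2)
rho /= np.trapz(rho, x)
grad_phi = np.full_like(x, -0.2)   # constant velocity field: a pure translation
alpha = 1.0                        # deliberately large step

# additive iteration, Eq. (25): rho + alpha * d/dx(rho * grad_phi)
rho_add = rho + alpha * np.gradient(rho * grad_phi, x)

# pushforward iteration, Eq. (28), via the CDF: F_new(x - alpha*grad_phi) = F(x)
F = np.cumsum(rho) * dx; F /= F[-1]
F_new = np.interp(x, x - alpha * grad_phi, F, left=0.0, right=1.0)
rho_push = np.gradient(F_new, x)

print(rho_add.min() < 0.0)     # True: the additive step leaves P(Omega)
print(x[np.argmax(rho_push)])  # ~0.6: the pushforward moved the whole bump
```

With a small step both updates agree to first order, as stated above; with a large step only the pushforward remains a valid mass function.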

Numerical illustrations
Let us recall that in the data assimilation vocabulary, the word "analysis" refers to the minimizer of the cost function at the end of the data assimilation process.
In this section the analyses resulting from the minimization of the Wasserstein cost function defined in Eq. (16) are presented, in particular when position errors occur. Results are compared with those given by the L2 cost function defined in Eq. (2).
The experiments are all one-dimensional and Ω is the interval [0, 1]. The discretization of Ω uses 200 uniformly distributed points. A first, simple experiment uses a linear operator G. In a second experiment, the operator is nonlinear.
Only a single variable is controlled. This variable ρ_0 represents the initial condition of an evolution problem. It is an element of P(Ω), and the observations are also elements of P(Ω).
In this paper we chose to work in the twin experiment framework. In this context the true state, denoted ρ_0^t, is known and used to generate the observations ρ_i^obs = G_i(ρ_0^t) at various times (t_i)_{i=1..N_obs}. The observations are first perfect, that is, noise-free and available everywhere in space. Then, in Sect. 4.3, we add noise to the observations. The background term is supposed to have position errors only and no amplitude error. The data assimilation process aims to recover a good estimate of the true state, using the cost function involving the simulated observations and the background term. The analysis obtained after convergence can then be compared to the true state and effectiveness diagnostics can be made.
Both the Wasserstein Eq. (17) and L2 Eq. (2) cost functions are minimized through a steepest descent method. The L2 gradient is used to minimize the L2 cost function. Both the L2 and W gradients are used for the Wasserstein cost function (cf. the Theorem above for the expressions of both gradients), giving respectively, with Φ_n := grad_2 J_W(ρ_0^n), the iterations

\rho_0^{n+1} = \rho_0^n - \alpha_n \Phi_n,    (29)

\rho_0^{n+1} = (\mathrm{Id} - \alpha_n \nabla \Phi_n)_\# \, \rho_0^n.    (30)

The value of α_n is chosen close to optimal using a line search algorithm, and the descent stops when the decrement of J between two iterations is lower than 10^{-6}. Algorithms using the iterations described by Eqs. (29) and (30) will be referred to as (DG2) and (DG#), respectively.
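A toy version of the (DG#) iteration can be written in a few lines for the simplest case G = Id, a single observation and no background term. This is our own sketch: it uses a fixed step instead of the paper's line search, and the monotonicity guard on the transport map is a purely numerical safeguard:

```python
import numpy as np

def cdf(rho, x):
    F = np.cumsum(rho) * (x[1] - x[0])
    return F / F[-1]

def dgw_minimise(rho_start, rho_obs, x, alpha=0.5, n_iter=20):
    """Toy (DG#)-style descent for J_W(rho) = W(rho, rho_obs)^2 with G = Id.
    Each step transports rho along -grad Psi (cf. Eq. 30), fixed step alpha."""
    rho = rho_start.copy()
    for _ in range(n_iter):
        F, Fo = cdf(rho, x), cdf(rho_obs, x)
        grad_psi = x - np.interp(F, Fo, x)       # Eq. (11) in closed form
        y = x - alpha * grad_psi                 # displaced particle positions
        y = np.maximum.accumulate(y)             # numerical monotonicity guard
        F_new = np.interp(x, y, F, left=0.0, right=1.0)
        rho = np.clip(np.gradient(F_new, x), 0.0, None)
        rho /= np.trapz(rho, x)                  # renormalise grid errors away
    return rho

x = np.linspace(0.0, 1.0, 1000)
bump = lambda m: np.exp(-0.5 * ((x - m) / 0.04) ** 2)
rho_b = bump(0.3) / np.trapz(bump(0.3), x)   # background-like first guess
rho_o = bump(0.7) / np.trapz(bump(0.7), x)   # observation
rho_a = dgw_minimise(rho_b, rho_o, x)
print(x[np.argmax(rho_a)])                   # the analysis peaks on the observation, ~0.7
```

Each iteration moves the mass a fraction α of the way along the optimal map, so the bump travels from 0.3 to 0.7 while keeping its shape, as the experiments below illustrate on full models.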

Linear example
The first example involves a linear evolution model defining the operators (G_i)_{i=1..N_obs}, with the number of observations N_obs equal to 5. Every operator G_i maps an initial condition ρ_0 to ρ(t_i) according to the following continuity equation, defined on Ω = [0, 1]:

\partial_t \rho + \partial_x (\rho v) = 0,

with a prescribed velocity field v. The operator G_i is linear. We control ρ_0 only. The true state ρ_0^t ∈ P(Ω) is a localized mass function, similar to the background term ρ_0^b but located at a different place, as if it had position errors. The true and background states, as well as the observations at various times, are plotted in Fig. 3 (top). The computed analysis ρ_0^{a,2} for the L2 cost function is shown in Fig. 3 (bottom left). This figure also shows the analysis ρ_0^{a,W} corresponding to both the (DG2) and (DG#) algorithms minimizing the same Wasserstein cost function J_W.
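A model of this type can be sketched with a first-order upwind discretization of the continuity equation. The velocity field, grid and boundary handling below are our own illustrative choices, not those of the paper:

```python
import numpy as np

def advect(rho0, v, x, t_end, cfl=0.4):
    """Toy forward model G: integrate d_t rho + d_x(rho v) = 0 with a
    first-order upwind flux (v > 0 assumed, mass kept away from the edges)."""
    dx = x[1] - x[0]
    dt = cfl * dx / np.max(np.abs(v))
    rho = rho0.copy()
    t = 0.0
    while t < t_end - 1e-12:
        tau = min(dt, t_end - t)
        flux = rho * v                        # upwind flux for v > 0
        rho = rho - (tau / dx) * (flux - np.roll(flux, 1))
        rho[0] = 0.0                          # crude inflow boundary: no incoming mass
        t += tau
    return rho

x = np.linspace(0.0, 1.0, 200)
bump = np.exp(-0.5 * ((x - 0.3) / 0.05) ** 2)
rho0 = bump / np.trapz(bump, x)
rho_t = advect(rho0, np.full_like(x, 0.5), x, 0.4)   # G maps rho0 to rho(t=0.4)
print(x[np.argmax(rho_t)])   # the bump is transported to ~0.5 (some upwind diffusion)
```

Since the scheme only transports mass, an initial condition in P(Ω) stays in P(Ω), which is the property the Wasserstein cost function relies on.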
As expected from the introduction (see, e.g., Fig. 1), minimizing J_2 leads to an analysis ρ_0^{a,2} that is the L2 average of the background and true states (hence two small localized mass functions), while J_W leads to a satisfactorily shaped analysis ρ_0^{a,W} in-between the background and true states. The amplitude issue of ρ_0^{a,2} and the position issue of ρ_0^{a,W} are not corrected by the time evolution of the model, as shown in Fig. 3 (bottom right). At the end of the assimilation window, each analysis still has discrepancies with the observations. Both algorithms (DG2) and (DG#) give the same analysis, the minimum of J_W. However, the convergence speed is not the same at all. The values of J_W throughout the minimization are plotted in Fig. 4. It can be seen that (DG#) converges in a couple of iterations while (DG2) needs more than 2000 iterations to converge. It is a very slow algorithm because it does not provide the steepest descent associated with the Wasserstein metric. The figure also shows that, even in a conjugate gradient version of (DG2), the descent is still quite slow (∼100 iterations to converge). This comparison highlights the need for a well-suited inner product and, more precisely, shows that the L2 one is not fit for the Wasserstein distance.
As a conclusion of this first test case, we wrote and minimized a cost function that, unlike the classical Euclidean cost function, gives a relevant analysis in the case of position errors. We also noticed that the success of the minimization of J_W clearly depends on the choice of scalar product.

Nonlinear example
Further results are shown when a nonlinear model is used in place of G. The framework and procedure are the same as in the first test case (see the beginning of Sect. 4 and Sect. 4.1 for details). The nonlinear model used is the shallow-water system

\partial_t h + \partial_x (h u) = 0, \qquad \partial_t u + u\, \partial_x u + g\, \partial_x h = 0,

subject to initial conditions h(0) = h_0 and u(0) = u_0, with reflective boundary conditions (u|_{∂Ω} = 0), where the constant g is the gravity acceleration. The variable h represents the water surface elevation and u the current velocity. If h_0 belongs to P(Ω), then the corresponding solution h(t) belongs to P(Ω).
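A minimal solver for this system can be written with a Lax-Friedrichs scheme in the conservative variables (h, hu). The sketch below is our own illustration: for numerical robustness the bump sits on a constant wet layer (so h is not normalised to unit mass here), and the reflective walls are implemented with ghost cells (h even, q odd across the boundary):

```python
import numpy as np

def shallow_water(h0, u0, x, t_end, g=9.81, cfl=0.4):
    """1-D shallow water in conservative variables (h, q = h*u):
    d_t h + d_x q = 0,  d_t q + d_x(q^2/h + g h^2/2) = 0,
    advanced with a Lax-Friedrichs scheme and reflective walls."""
    dx = x[1] - x[0]
    h, q = h0.copy(), h0 * u0
    t = 0.0
    while t < t_end - 1e-12:
        c = np.max(np.abs(q / h) + np.sqrt(g * h))       # fastest wave speed
        dt = min(cfl * dx / c, t_end - t)
        H = np.concatenate(([h[0]], h, [h[-1]]))         # ghost cells: h even
        Q = np.concatenate(([-q[0]], q, [-q[-1]]))       # ghost cells: q odd
        Fh = Q
        Fq = Q ** 2 / H + 0.5 * g * H ** 2
        h = 0.5 * (H[2:] + H[:-2]) - 0.5 * dt / dx * (Fh[2:] - Fh[:-2])
        q = 0.5 * (Q[2:] + Q[:-2]) - 0.5 * dt / dx * (Fq[2:] - Fq[:-2])
        t += dt
    return h, q / h

x = np.linspace(0.0, 1.0, 200)
h0 = 1.0 + 0.1 * np.exp(-0.5 * ((x - 0.5) / 0.05) ** 2)  # bump on a wet layer
u0 = np.zeros_like(x)
h, u = shallow_water(h0, u0, x, 0.05)
print(np.trapz(h, x) - np.trapz(h0, x))   # ~0: mass is conserved by the scheme
```

The reflective ghost cells make the discrete total mass exactly conserved, mirroring the fact that the continuous solution stays in (a rescaled) P(Ω).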
The true state is (h_0^t, u_0^t), where the velocity u_0^t is equal to 0 and the surface elevation h_0^t is a given localized mass function. The initial velocity field is supposed to be known and is therefore not included in the control vector. Only h_0 is controlled, using N_obs direct observations of h and a background term h_0^b, which is also a localized mass function like h_0^t. Data assimilation is performed by minimizing either the J_2 or the J_W cost function described above. Given the experience gained during the first experiment, only the (DG#) algorithm is used for the minimization of J_W. In Fig. 5 (top) we present the initial surface elevations h_0^t and h_0^b as well as 2 of the 10 observations used for the experiment. In Fig. 5 (bottom left), the analyses corresponding to J_2 and J_W are shown: h_0^{a,2} and h_0^{a,W,#}. The analysis h_0^{a,2} is close to the L2 average of the true and background states, even at time t > 0, while h_0^{a,W,#} lies close to the Wasserstein average between the background and true states, and hence has the same shape as them (see Fig. 1).
Figure 5 (bottom right) shows that, at the end of the assimilation window, the surface elevation h^{a,W,#} = G(h_0^{a,W,#}) is still more realistic than h^{a,2} = G(h_0^{a,2}), when compared to the true state h^t = G(h_0^t). The conclusion of this second test case is that, even with nonlinear models, our Wasserstein-based algorithm can give interesting results in the case of position errors.

Robustness to observation noise
In this section, noise in position and shape has been added to the observations. This type of noise typically occurs in satellite images. For example, Fig. 6 (top) shows an observation from the previous experiment in which the peaks have been displaced and resized randomly. For each structure of each observation, the displacements and amplitude changes are independent and uncorrelated. The perturbation is constructed so that the total mass is preserved.
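Such a mass-preserving perturbation can be sketched as a random shift-and-dilation of the field: push the mass through the affine map x ↦ s(x − m) + m + d, which moves and rescales a structure without changing its integral. The parameters and implementation below are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_position_shape(rho, x, sigma_pos=0.05, sigma_width=0.1):
    """Mass-preserving noise: push rho through x -> s*(x - m) + m + d,
    a random shift d and dilation s about the centre of mass m."""
    mass = np.trapz(rho, x)
    m = np.trapz(x * rho, x) / mass              # centre of mass
    d = rng.normal(0.0, sigma_pos)               # random displacement
    s = np.exp(rng.normal(0.0, sigma_width))     # random width factor (> 0)
    xs = (x - m - d) / s + m                     # preimage under the random map
    rho_new = np.interp(xs, x, rho, left=0.0, right=0.0) / s
    return rho_new * (mass / np.trapz(rho_new, x))   # renormalise residual error

x = np.linspace(0.0, 1.0, 500)
peak = np.exp(-0.5 * ((x - 0.5) / 0.04) ** 2)
rho = peak / np.trapz(peak, x)
rho_noisy = perturb_position_shape(rho, x)
print(np.trapz(rho_noisy, x))   # ~1.0: the total mass is unchanged
```

The peak of rho_noisy is displaced and widened or narrowed at random, but the field stays a valid mass function.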
The analyses of this noisy experiment, using the L2 Eq. (2) and Wasserstein Eq. (17) cost functions, are compared to the analyses from the previous experiment, where no noise was present.
For the L2 cost function, the surface elevation analyses h_0^{a,2} are shown in Fig. 6 (bottom left). We see that adding such noise to the observations degrades the analysis. In particular, the right peak (associated with the observations) is more widely spread: this is a consequence of the fact that the L2 distance is a local-in-space distance.
For the Wasserstein cost function, the analyses h^{a,W}_0 are shown in Fig. 6 (bottom right). The analysis does not change much in the presence of noise and remains similar to the one obtained in the previous experiment. This is a consequence of a property of the Wasserstein distance: the Wasserstein barycenter of several Gaussians is a Gaussian with averaged position and variance (see Example 2.2). This shows that the Wasserstein cost function is more robust than L² to such noise, which is quite a valuable feature for realistic applications.
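The barycenter property invoked here is easy to check numerically in 1-D, where W₂ reduces to the L² distance between quantile functions. Below is a sketch with made-up Gaussian parameters; scipy.stats.norm is used only for its inverse CDF:

```python
import numpy as np
from scipy.stats import norm

# Two 1-D Gaussians: think of them as two noisy realisations of
# the same peak, differing in position and width.
m1, s1 = -1.0, 0.5   # hypothetical peak 1
m2, s2 = 2.0, 0.8    # hypothetical peak 2

# In 1-D, W2^2 = ∫_0^1 (F1^{-1}(q) - F2^{-1}(q))^2 dq,
# the squared L2 distance between quantile functions.
q = np.linspace(1e-6, 1 - 1e-6, 200_000)
q1 = norm.ppf(q, loc=m1, scale=s1)
q2 = norm.ppf(q, loc=m2, scale=s2)
w2_num = np.sqrt(np.mean((q1 - q2) ** 2))

# Closed form for Gaussians: W2^2 = (m1 - m2)^2 + (s1 - s2)^2.
w2_exact = np.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)

# The equal-weight Wasserstein barycenter averages the quantile
# functions, i.e. it is the Gaussian with averaged mean and
# averaged standard deviation.
q_bar = 0.5 * (q1 + q2)
bary_exact = norm.ppf(q, loc=0.5 * (m1 + m2), scale=0.5 * (s1 + s2))

print("numerical vs closed-form W2:", w2_num, w2_exact)
print("barycenter is the averaged Gaussian:", np.allclose(q_bar, bary_exact))
```

Averaging quantile functions is what makes the analysis insensitive to zero-mean position noise: random displacements cancel in the barycenter rather than flattening the peak.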

Conclusions
We showed through examples that, if not taken into account, position errors can lead to unrealistic initial conditions when using classical variational data assimilation methods. Indeed, such methods use the Euclidean distance, which can behave poorly under position errors. To tackle this issue, we proposed instead the use of the Wasserstein distance to define the cost function. The associated minimization algorithm was discussed, and we showed that performing descent iterations along Wasserstein geodesics leads to more consistent results.
On academic examples, the corresponding cost function produces an analysis lying close to the Wasserstein average between the true and background states; it therefore has the same shape as them and is well suited to correcting position errors. This also gives more realistic predictions. This is a preliminary study, and some issues have yet to be addressed for realistic applications, such as relaxing the constant-mass and positivity hypotheses and extending the problem to 2-D applications.
Also, the interesting question of transposing this work to the filtering community (Kalman filter, EnKF, particle filters, etc.) raises the issue of giving a probabilistic interpretation of the Wasserstein cost function, which is beyond the scope of our study for now.
In particular, the important theoretical aspect of the representation of error statistics still needs to be thoroughly studied. Indeed, classical implementations of variational data assimilation generally make use of L² distances weighted by the inverses of error covariance matrices. An analogy with Bayes' formula then allows the minimization of the cost function to be interpreted as a maximum likelihood estimation. Such an analogy is not straightforward with Wasserstein distances. Some possible research directions are given in Feyeux (2016), but this is beyond the scope of this paper. The ability to account for error statistics would also open the way for a proper use of the Wasserstein distance in Kalman-based data assimilation techniques.
Data availability. No data sets were used in this article.

Figure 2. Comparison of iteration Eqs. (25) and (28) with ρ_0 of limited support and such that ∇ is constant on the support of ρ_0.

Figure 3. Top: the twin experiment ingredients are plotted, namely the true initial condition ρ^t_0, the background term ρ^b_0, and observations at different times. Bottom left: the analyses obtained with each proposed method, compared to ρ^b_0 and ρ^t_0; ρ^{a,2}_0 corresponds to J_2, while ρ^{a,W}_0 corresponds to both (DG2) and (DG#). Bottom right: the fields at final time, ρ^t, ρ^{a,2} and ρ^{a,W}, obtained when taking respectively ρ^t_0, ρ^{a,2}_0 and ρ^{a,W}_0 as initial conditions.

Figure 5. Top: ingredients of the second experiment: the true initial condition h^t_0, the background h^b_0, and 2 of the 10 observations at different times. Bottom left: the true and background initial conditions are shown, together with the analyses h^{a,2}_0 and h^{a,W}_0 corresponding respectively to the Euclidean and Wasserstein cost functions. Bottom right: the same plots (except the background) at the end of the assimilation window.

Figure 6. Top: plot of an example of the noise-free observations used in the Sect. 4.2 experiment, equal to the true surface elevation h^t at a given time, together with the corresponding observation with added noise, as described in Sect. 4.3. Bottom left: analyses from the L² cost function using perfect observations and observations with noise. Bottom right: likewise with the Wasserstein cost function.