The Onsager--Machlup functional for data assimilation

When model error is taken into account in data assimilation, one needs to evaluate the prior distribution represented by the Onsager--Machlup functional. Through numerical experiments, this study clarifies how the prior distribution should be incorporated into cost functions for discrete-time estimation problems. Consistent with previous theoretical studies, the divergence of the drift term is essential in weak-constraint 4D-Var (w4D-Var), but it is not necessary in Markov chain Monte Carlo with the Euler scheme. Although the former property may cause difficulties when implementing w4D-Var in large systems, this paper proposes a new technique for estimating the divergence term and its derivative.


Introduction
In traditional weak-constraint 4D-Var settings (e.g. Zupanski, 1997; Trémolet, 2006), a quadratic cost function is defined as the negative logarithm of the probability for each sample path, which is suitable for path sampling (e.g. Zinn-Justin, 2002). The optimisation problem is naively described as finding the most probable path by minimising the quadratic cost function. However, the term "the most probable path" does not make sense in this context, because the paths are not countable. One should note that the concern is not about ranking the individual path probabilities, but about seeking the route with the densest path population. To define the optimisation problem properly, one should introduce a macroscopic variable φ = φ(t) that represents a smooth curve, and introduce a measure that accounts for how densely the paths are populated in the ε-neighbourhood centred at φ, which can be termed "the tube". Then the problem is defined as finding the most probable tube φ, which represents the maximum a posteriori (MAP) estimate of the path distribution. Mathematicians pioneering the theory of stochastic differential equations (SDEs) (e.g. Ikeda and Watanabe, 1981; Zeitouni, 1989) have been aware of this subtle point since the 1980s, and established the proper form of the cost function as the Onsager-Machlup (OM) functional (Onsager and Machlup, 1953) for the tube.
The aim of this work is to organise existing knowledge about the OM functional into a form that can be used to represent model errors in data assimilation, i.e. numerical evaluation of non-linear smoothing problems.
Throughout this article, we consider non-linear smoothing problems of the form (1) where t is time, x is a D-dimensional stochastic process, w is a D-dimensional Wiener process, x_b ∈ R^D is the background value of the initial condition, σ_b > 0 is the standard deviation of the background value, y_m ∈ R^D is observational data at time t_m, x_m = x_{t_m}, t_m = mδ_t, M is the set of observation times, σ_o > 0 is the standard deviation of the observational data, and σ > 0 is the noise intensity. Note that there is no need to distinguish the Ito integral from the Stratonovich integral with regard to the discretisation of the SDE, because the noise intensity is constant. Before moving on to its applications, here we review the concept of the OM functional. To keep the presentation simple, we assume that D = 1 and σ = 1, and concentrate on the formulation of the prior distribution in the two subsequent Sects. 1.1 and 1.2.

OM functional for path sampling
The model Eq. (1) is discretised with the Euler scheme (with the drift term at the previous time) as where δ_t is the time increment, and each ξ_{n−1} obeys N(0, δ_t). Equation (4) can be considered a non-linear mapping F_1 : ξ → x from the noise vector ξ = (ξ_0, ξ_1, · · · , ξ_{N−1})^T to the state vector x = (x_1, x_2, · · · , x_N)^T. The inverse of the mapping is linearised as where f′ is the derivative of f, and the Jacobian is |DF_1^{−1}| = |dξ/dx| = 1.
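As a concrete illustration, the forward mapping F_1 and its inverse can be sketched in a few lines (a minimal sketch; the drift f = tanh, the step size, and the path length are illustrative assumptions, not taken from the text):

```python
import numpy as np

def euler_path(f, x0, xi, dt):
    """Forward mapping F1: noise vector xi -> state vector (x_1, ..., x_N)."""
    x = np.empty(len(xi) + 1)
    x[0] = x0
    for n in range(1, len(x)):
        x[n] = x[n-1] + f(x[n-1]) * dt + xi[n-1]
    return x[1:]

def euler_noise(f, x0, x, dt):
    """Inverse mapping F1^{-1}: x -> xi.  Its Jacobian is lower triangular
    with unit diagonal, hence |d xi / d x| = 1."""
    xfull = np.concatenate(([x0], x))
    return np.diff(xfull) - f(xfull[:-1]) * dt

rng = np.random.default_rng(0)
dt, N = 0.05, 100
xi = rng.normal(0.0, np.sqrt(dt), size=N)   # each xi_{n-1} ~ N(0, dt)
x = euler_path(np.tanh, 0.0, xi, dt)
```

The triangular structure of the inverse mapping is what makes its Jacobian determinant equal to one, so the path measure inherits the Gaussian density of the noise directly.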
It is also discretised with the trapezoidal scheme (with the drift term at the midpoint) as which defines a mapping F_2 : ξ → x. The inverse of the mapping is linearised as whose Jacobian is Generally, we can assign a measure μ_0 to a cylinder set Ω ≡ Ω_0 × Ω_1 × · · · × Ω_{N−1} in the noise space using a density g as follows.
where λ is the Lebesgue measure on R^N. In our case, we can see that a small area dξ in the noise space is equipped with a measure: Suppose we have a cylinder set Ω ≡ Ω_1 × Ω_2 × · · · × Ω_N in the state space, where each Ω_n ⊂ R^1 is on the time slice t = nδ_t. Now, the mapping F_1 (or F_2) induces a measure through the change of variables from ξ to x with respect to the measure μ_0 as In our case, each mapping assigns the following measure to a small area dx in the corresponding state space: where f(x_{n−1/2}) ≡ (f(x_n) + f(x_{n−1}))/2. Measures μ_1 and μ_2 represent the occurrence probability of the noise seen from the state space, and thus can be used for path sampling.
The change-of-measure argument (Appendix B1) or the path integral argument (e.g. Zinn-Justin, 2002) shows that similar forms are available for time-continuous and multi-dimensional processes, except that the term f′(x_t) is promoted to div f(x_t).

OM functional for mode estimate
If we perform path sampling with a sufficient number of paths, in theory we can find the mean of the distribution by averaging the samples, or the mode of the distribution by organising them into a histogram. Still, in some practical applications, we must efficiently find the mode of the distribution by variational methods; computationally, this approach is much cheaper than path sampling. For that purpose, we are tempted to use a quadratic cost function for the minimisation. However, we can illustrate a simple example against maximising the path probability (11) to obtain the mode of the distribution. Suppose we have a discrete-time stochastic system in R^1, starting from x_0 = 0, and we move forward two time steps via x_n = x_{n−1} + x_{n−1}^2 δ_t + ξ_{n−1}, where ξ_0 and ξ_1 obey independent normal distributions N(0, δ_t). It may be seen as a discrete version of dx_t = x_t^2 dt + dw_t. It is easy to notice that the mode of the distribution of (x_1, x_2) is not (0, 0) owing to the non-linear term ξ_0^2 δ_t. On the other hand, according to the path probability (11), the best trajectory is (x_1, x_2) = (0, 0), which has no noise: (ξ_0, ξ_1) = (0, 0). We expect a path with the highest probability at (x_1, x_2) = (0, 0), but it is not the route where the paths are most concentrated. Motivated by this example, we shall investigate a proper strategy to find the route that maximises the density of paths. In this regard, we ask how densely the paths populate the small neighbourhood of a curve φ = φ(t) in the state space.
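A quick Monte Carlo sketch of this two-step example (assuming the recursion x_n = x_{n−1} + x_{n−1}^2 δ_t + ξ_{n−1} implied by the text; the step size and sample count are illustrative) shows the asymmetry introduced by the quadratic term:

```python
import numpy as np

# Monte Carlo check of the two-step example, assuming the recursion
# x_n = x_{n-1} + x_{n-1}**2 * dt + xi_{n-1} implied by the text, with x_0 = 0.
rng = np.random.default_rng(1)
dt = 0.05
n_samples = 2_000_000
xi0 = rng.normal(0.0, np.sqrt(dt), n_samples)
xi1 = rng.normal(0.0, np.sqrt(dt), n_samples)
x1 = xi0
x2 = x1 + x1**2 * dt + xi1
# E[x_2] = dt * E[xi_0**2] = dt**2 > 0: the distribution of (x_1, x_2) is
# skewed by the quadratic term, so its mode cannot sit at the origin even
# though the zero-noise path maximises the per-path density.
```

The sample mean of x_2 sits near δ_t^2 rather than zero, so the density of (x_1, x_2) is not centred on the zero-noise trajectory.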
Assuming that f and φ are twice continuously differentiable, we evaluate the density of paths in the ε-neighbourhoods around a curve φ connecting points {φ_n, n = 1, 2, · · · , N} with the following integral: By regarding v_n in Eq. (16) as being generated according to the probability, we can interpret the integration as a weighted ensemble averaging of a random function up to a numerical constant. The sequence v_n can be set as a random walk v_0 = 0, v_n = Σ_{k=1}^{n} ξ_k, where the ξ_k are independent normal random variables obeying N(0, δ_t). For simplicity, we rather assume that each ξ_k takes the values ±√δ_t with probability 0.5 each, because Donsker's theorem ensures that this has the same probability law as the former when δ_t is sufficiently small. We suppose √δ_t < ε so that no single step of the random walk escapes from the ε-neighbourhood. Accordingly, the integral is expressed as the ensemble average with respect to random walks confined in the tube where E_v denotes the ensemble averaging of the random walks denoted by v, each of which follows the route (v_0, v_1, · · · , v_N) and satisfies |v_n| < ε for all n.
Because v_{n−1} is small, we can apply the expansion where f′ is the derivative of f. Let us accept that the following average containing the higher-order terms O(v^2) converges (see Eq. B20).
As shown in Appendix B2, the remaining terms in the exponent −J(φ, v) are smaller than O(ε), except for the following one.
Consequently, we obtain the asymptotic expression for the ensemble average when ε is small and δ_t < ε^2: Appendix B2 shows that a similar form is available for time-continuous and multi-dimensional processes, except that the term f′(φ(t)) is promoted to div f(φ(t)). Importantly, the control variable for the optimisation has changed from x to φ.

Probabilistic description of data assimilation
Using the OM functional derived in Sect. 1.1 and 1.2 as a model error term, we shall develop a probabilistic description of data assimilation.
Following the derivation in Sect. 2.3 of Law et al. (2015), we can assign each path a posterior probability According to Eq. (2), the prior probability for the initial condition is given as According to Eq. (3), the likelihood of the state x_m, given observation y_m, is Based on the argument in Sect. 1.1, Eq. (4) yields the transition probability at discrete time steps under the Euler scheme, which uses the drift f(x_{n−1}) at the previous time step. Section 1.1 also shows that this transition probability has another expression: which can be called the trapezoidal scheme because the integral is evaluated with the drift terms at both ends of each interval. The transition probability leads to the prior probability P(x|x_0) of a path x = {x_n}_{0≤n≤N} as follows: where the "≃" sign indicates that, if δ_t is sufficiently small, the expressions on both sides are compatible.
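Under the Euler scheme, the resulting negative log prior of a path is a sum of quadratic increment residuals; a minimal sketch (the drift f = tanh, noise intensity, and test trajectory are illustrative assumptions):

```python
import numpy as np

def neg_log_prior_euler(f, x0, x, dt, sigma=1.0):
    """Euler-scheme negative log prior of a path, up to an additive constant:
    sum_n (x_n - x_{n-1} - f(x_{n-1}) dt)**2 / (2 sigma**2 dt)."""
    xfull = np.concatenate(([x0], x))
    resid = np.diff(xfull) - f(xfull[:-1]) * dt
    return np.sum(resid**2) / (2.0 * sigma**2 * dt)

# The noise-free Euler trajectory attains zero prior cost by construction.
dt, N, x0 = 0.05, 100, 0.5
x_det = np.empty(N)
prev = x0
for n in range(N):
    prev = prev + np.tanh(prev) * dt
    x_det[n] = prev
```

That the deterministic trajectory minimises this quadratic cost is exactly the naive "most probable path" notion criticised in Sect. 1.2: the minimiser of the per-path density need not be the centre of the most probable tube.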
On the other hand, based on the argument in Sect. 1.2, we can also define the probability P(U_φ|φ_0) for a smooth tube that represents its neighbouring paths, U_φ = {ω : (∀n) |φ_n − x_n(ω)| < ε}: The scaling argument for a smooth curve in Appendix A allows us to use the drift term f(φ_{n−1/2}) instead in Eq. (35): The corresponding posterior probabilities are thus given as follows: for a sample path, and for a smooth tube. Note that different pairs of time-discretisation schemes of the OM functional apply to sample paths and to smooth tubes.

Four schemes for OM
In the argument in Sects. 1.1 and 1.2, the prior probability has the form P(x|x_0) ∝ exp(−δ_t Σ_{n=1}^{N} L_OM), where L_OM is the OM functional. As a proof of concept of the arguments in these sections, we will test all the cases with conceivable combinations of the timing of the drift term f(x_t) and the presence or absence of the divergence term. Including those shown in Eqs. (38), (39), (41), and (42), as well as those that are potentially incorrect, the possible candidates for the discretisation schemes of the OM functional are as follows, where the symbol ψ represents either φ for a smooth curve or x for a sample path.
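The four combinations can be sketched as cost functions in the one-dimensional, σ = 1 setting (a hedged sketch: the scheme labels E/ED/T/TD and the placement of the divergence term follow the combinations described above, but the exact notation of Eqs. (38)-(42) is not reproduced here):

```python
import numpy as np

def om_cost(psi, psi0, f, fprime, dt, scheme="TD"):
    """Discretised OM cost for D = 1, sigma = 1 (labels assumed:
    E/T = drift at the previous step / trapezoidal average; a trailing D
    adds the divergence term, which in one dimension is just f')."""
    p = np.concatenate(([psi0], psi))
    incr = np.diff(p) / dt
    if scheme in ("E", "ED"):
        drift, divf = f(p[:-1]), fprime(p[:-1])
    elif scheme in ("T", "TD"):
        drift = 0.5 * (f(p[:-1]) + f(p[1:]))
        divf = 0.5 * (fprime(p[:-1]) + fprime(p[1:]))
    else:
        raise ValueError(scheme)
    cost = 0.5 * np.sum((incr - drift)**2) * dt
    if scheme.endswith("D"):
        cost += 0.5 * np.sum(divf) * dt   # divergence term
    return cost

rng = np.random.default_rng(0)
dt = 0.05
psi = rng.normal(size=50)   # an arbitrary test path
```

For a linear drift the divergence term is constant, so E/ED (and T/TD) differ by a fixed offset and share the same minimiser; the schemes genuinely diverge only for non-linear drifts such as those in the examples below.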

Data assimilation algorithms
By using one of the above schemes for the model error term in the cost function, we can apply a data assimilation algorithm: either Markov chain Monte Carlo (MCMC) (e.g. Metropolis et al., 1953) or four-dimensional variational data assimilation (4D-Var) (e.g. Zupanski, 1997). Among versions of MCMC, we focus on the Metropolis-adjusted Langevin algorithm (MALA) (e.g. Roberts and Rosenthal, 1998; Cotter et al., 2013). MALA samples the paths x^{(k)} = {x_n(ω_k)}_{0≤n≤N} according to the distribution P_path by iterating with the Metropolis rejection step for adjustment, to obtain an ensemble of sample paths according to the posterior probability, while 4D-Var seeks the centre of the most probable tube φ = {φ_n}_{0≤n≤N} by iterating: Note that if the OM functional of type OM_ED is used, the gradient is of the form where the term containing (∂f/∂φ_n)(φ_n)^T is an adjoint integration starting from the subsequent term, which is typical of gradient calculations in 4D-Var. In comparison, the term (∂/∂φ_n) div f(φ_n) requires the second derivative of f, which is not typical in 4D-Var and could be difficult to implement in high-dimensional systems.
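A generic MALA step for a target density proportional to exp(−J) can be sketched as follows (the step size, the toy target, and the scalar state are illustrative assumptions; in the text the state would be the whole discretised path and J the OM cost):

```python
import numpy as np

def mala_step(x, J, grad_J, h, rng):
    """One Metropolis-adjusted Langevin step for a target prop. to exp(-J(x))."""
    def log_q(b, a):                      # log proposal density q(b | a)
        mean = a - 0.5 * h * grad_J(a)
        return -np.sum((b - mean)**2) / (2.0 * h)
    prop = x - 0.5 * h * grad_J(x) + np.sqrt(h) * rng.normal(size=x.shape)
    log_alpha = (J(x) - J(prop)) + log_q(x, prop) - log_q(prop, x)
    if np.log(rng.uniform()) < log_alpha:
        return prop, True                 # accepted
    return x, False                       # rejected

# Toy usage: sample a standard normal target, J(x) = x**2 / 2.
rng = np.random.default_rng(0)
J = lambda x: 0.5 * np.sum(x**2)
grad_J = lambda x: x
x = np.zeros(1)
samples = []
for k in range(20000):
    x, _ = mala_step(x, J, grad_J, 0.5, rng)
    if k >= 2000:                         # discard burn-in
        samples.append(x[0])
samples = np.array(samples)
```

The Metropolis correction makes the chain exact for any step size h; the Langevin drift merely improves mixing over a random-walk proposal.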
To investigate the applicability of the four candidate schemes in Sect. 2.1, we use them in these algorithms.
The results should be checked against "the correct answer". The reference solution that approximates the correct answer is provided by a particle smoother (PS) (e.g. Doucet et al., 2000), which does not involve the explicit computation of the prior probability. When we have observations only at the end of the assimilation window, the PS algorithm is as follows.

Generate samples of initial and model errors, integrate M copies of the model, and use them to obtain a Monte Carlo approximation of the prior distribution: where χ_n^{(m)} is the state of member m at time n.
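This step, together with terminal-likelihood reweighting, can be sketched as follows (a minimal one-dimensional sketch with a single observation at the end of the window; the drift and all parameters are illustrative assumptions):

```python
import numpy as np

def particle_smoother(f, xb, sigma_b, y, sigma_o, dt, N, M, rng):
    """Weight whole Euler trajectories by the likelihood of the final state."""
    chi = np.empty((M, N + 1))
    chi[:, 0] = xb + sigma_b * rng.normal(size=M)          # initial errors
    for n in range(1, N + 1):                              # model errors
        chi[:, n] = (chi[:, n-1] + f(chi[:, n-1]) * dt
                     + np.sqrt(dt) * rng.normal(size=M))
    logw = -(y - chi[:, -1])**2 / (2.0 * sigma_o**2)       # terminal likelihood
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return w @ chi                                         # posterior-mean path

rng = np.random.default_rng(0)
mean_path = particle_smoother(lambda x: -x, 0.0, 1.0, 0.0, 0.5, 0.05, 100, 20000, rng)
```

Because the prior is represented purely by forward simulations, no OM functional (and in particular no divergence term) ever needs to be written down, which is what makes the PS a suitable reference solution.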

Example A (hyperbolic model)
In our first example, we solve the non-linear smoothing problem for the hyperbolic model (Daum, 1986), which is a simple problem with a one-dimensional state space but a non-linear drift term. We want to find the probability distribution of the paths described by subject to an observation y: The setting follows Daum (1986). In this case, div f(x) = 1/cosh^2(x) imposes a penalty for small x. The total time duration T = 5 is divided into N = 100 segments with δ_t = 5 × 10^{−2}. Figure 1 shows the probability densities of paths normalised on each time slice, P_{t=n}(φ) = ∫ P(U_φ|y) Π_{t′≠n} dφ_{t′}, derived by MCMC and PS. PS is performed with 5.1 × 10^{10} particles. It is clear that MCMC with E or TD provides the proper distribution matching that of PS; this is also clear from the expected paths yielded by these experiments, as shown in Fig. 2. These schemes correspond to the candidates in Eqs. (38) and (39). The expected path by ED bends towards a larger x, which should be caused by the extra penalty for small x. The expected path by T bends towards a smaller x, which should be caused by the lack of a penalty for small x.
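In one dimension div f is simply f′, and the stated divergence 1/cosh^2(x) is consistent with a drift f(x) = tanh(x) (an assumption made here for illustration); a central finite difference confirms the identity:

```python
import numpy as np

# Central finite-difference check that d/dx tanh(x) = 1 / cosh(x)**2,
# i.e. the one-dimensional divergence quoted for the hyperbolic model
# (the drift f(x) = tanh(x) is assumed here for illustration).
x = np.linspace(-3.0, 3.0, 61)
h = 1e-6
fd = (np.tanh(x + h) - np.tanh(x - h)) / (2.0 * h)
div_f = 1.0 / np.cosh(x)**2
```

Since 1/cosh^2(x) peaks at x = 0, the divergence term indeed penalises paths near small x, as stated above.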
The results of 4D-Var, which represent the MAP estimates, are shown in Fig. 3. ED and TD provide the proper MAP estimate. These schemes correspond to the candidates in Eqs. (41) and (42). The expected paths by E and T bend towards a smaller φ, which should be caused by the lack of a penalty for small φ.
Example B (Rössler model)

In this case, div f(x) = x_1 + a − c imposes a penalty for large x_1. The total time duration T = 0.4 is divided into N = 800 segments with δ_t = 5 × 10^{−4}. The results by MCMC and 4D-Var for the Rössler model are shown in Figs. 4 and 5, respectively. The state variable x_1 is chosen for the vertical axes. PS is performed with 3 × 10^{12} particles. The curve for PS in Fig. 5 indicates φ̂ = argmax_φ P(U_φ|y), where U_φ represents the tube centred at φ with radius 0.03. Figure 4 shows that, just as for the hyperbolic model, E and TD provide the proper expected path. Figure 5 shows that ED and TD provide the proper MAP estimate.

Towards application to large systems
When one computes the cost value J(x), the negative logarithm of the posterior probability, in data assimilation, the value f(x) is explicitly computed by the numerical model while div f(x) is not. If the dimension D of the state space is large and f is complicated, the algebraic expression of div f(x) can be difficult to obtain. The gradient of the cost function ∇J(x) contains the derivative of f(x), which can be implemented as the adjoint model by symbolic differentiation (e.g. Giering and Kaminski, 1998). However, schemes with the divergence term require the calculation of the second derivative of f(x), for which the algebraic expression can be even more difficult to obtain. Still, there may be a way to circumvent this difficulty by utilising Hutchinson's trace estimator (Hutchinson, 1990) (see Appendix C). It is also clear that the Euler scheme without the divergence term is more convenient for implementing path sampling, because it does not require cumbersome calculation of the divergence term.
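A sketch of the idea (the probe count, the finite-difference Jacobian-vector product, and the test drifts are illustrative assumptions; Appendix C presumably gives the authors' actual construction):

```python
import numpy as np

def hutchinson_div(f, x, n_probes=1000, eps=1e-6, rng=None):
    """Estimate div f(x) = tr(J_f(x)) as the probe average of z . (J_f(x) z),
    with the Jacobian-vector product formed by a central finite difference,
    so that only forward evaluations of the model f are needed."""
    rng = rng if rng is not None else np.random.default_rng(0)
    total = 0.0
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=x.shape)   # Rademacher probe
        jvp = (f(x + eps * z) - f(x - eps * z)) / (2.0 * eps)
        total += z @ jvp
    return total / n_probes
```

Each probe costs two model evaluations regardless of D, which is what makes the estimator attractive when the algebraic divergence (or its derivative) is out of reach.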

Conclusions
We examined several discretisation schemes of the OM functional by matching the answers given by MCMC and 4D-Var with those given by PS, taking the hyperbolic model and the Rössler model as examples. Table 1 lists the discretisation schemes which were found to be applicable, i.e. those expected to converge to the same result as the reference solution. These results are consistent with the literature (e.g. Apte et al., 2007; Malsom and Pinski, 2016; Dutra et al., 2014; Stuart et al., 2004). This justifies, for instance, the use of the following cost function for the MAP estimate given by 4D-Var: where n is the time index, δ_t is the time increment, x_b is the background value, σ_b is the standard deviation of the background value, y is the observational data, σ_o is the standard deviation of the observational data, and σ is the noise intensity. However, the divergence term above should be excluded when assigning path probabilities in MCMC. For application in large systems, the Euler scheme without the divergence term is preferred for path sampling because it avoids the cumbersome calculation of the divergence term. In 4D-Var, the divergence term can be incorporated into the cost function by utilising Hutchinson's trace estimator.
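Collecting the symbols listed above, the cost function referred to presumably takes a form like the following (a hedged reconstruction in the one-dimensional notation of Sect. 1, with the midpoint drift and the divergence term; the authors' exact equation may differ in normalisation):

```latex
J(\phi) \;=\; \frac{(\phi_0 - x_b)^2}{2\sigma_b^2}
\;+\; \sum_{m \in M} \frac{(y_m - \phi_m)^2}{2\sigma_o^2}
\;+\; \delta_t \sum_{n=1}^{N} \left[
      \frac{1}{2\sigma^2}
      \left( \frac{\phi_n - \phi_{n-1}}{\delta_t} - f\!\left(\phi_{n-\frac12}\right) \right)^{\!2}
      + \frac{1}{2}\,\operatorname{div} f\!\left(\phi_{n-\frac12}\right)
  \right]
```

The first two terms are the familiar background and observation terms; the bracketed sum is the OM model-error term, whose divergence part is the one to drop when the same expression is reused as a path probability in MCMC.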

Appendix A: Scaling of the terms
Taylor expansion of the f(ψ_{n−1}) term around ψ_{n−1/2} in scheme E gives where we assume order-one fluctuations, σ = O(1), and the symbol ψ represents either φ for a smooth curve or x for a sample path. For a sample path of the stochastic process, the scaling is The shift term induces a Jacobian that coincides with the divergence term in TD (Zinn-Justin, 2002).
In the case of a smooth curve, there is no stochastic term, and thus ψ_n − ψ_{n−1} is the product of a bounded function f(ψ_{n−1}) and δ_t, which results in a value of O(δ_t). This leads to The shift term is negligible, but the divergence term is not.
Appendix B: Divergence term

B1 Divergence term in a trapezoidal scheme

Consider two stochastic processes (cf. Sect. 6.3.2 of Law et al., 2015): where Eq. (B1) has measure μ and Eq. (B2) has measure μ_0 (the Wiener measure). By the Girsanov theorem, the Radon-Nikodym derivative of μ with respect to μ_0 is If we define F(x_T) − F(x_0) = ∫_{x_0}^{x_T} f(x) ∘ dx with the Stratonovich integral, then by Ito's formula, dF = f · dx + (1/2) div(f) dt. (B4) Eliminating f · dx in Eq. (B3) using Eq. (B4), we obtain If we write the Wiener measure formally as μ_0(dx) = exp(−(1/2) ∫_0^T (dx/dt)^2 dt) dx, we get the following from Eq. (B3), and the following from Eq. (B6), where the integrals should be interpreted in the Ito sense and in the Stratonovich sense, respectively.

B2 Divergence term for smooth tube
When weight is assigned to smooth tubes, there should always be a divergence term, for the following reason. Let x be a diffusion process that follows the stochastic differential equation where w is a Wiener process. To investigate paths near a smooth curve φ, let us consider the following stochastic process x_t − φ(t) (Ikeda and Watanabe, 1981; Zeitouni, 1989):

d(x_t − φ(t)) = (f(x_t − φ(t) + φ(t)) − φ̇(t)) dt + dw_t. (B10)
This means that if the drift f is applied to the Wiener process and the reference frame is shifted by φ, we obtain the process x_t − φ(t), which has the drift f(· + φ) − φ̇. The weight relative to the Wiener measure can be calculated by Girsanov's formula as follows.
where the expectation is taken with respect to the Wiener process w conditioned on ‖w‖_T ≡ sup_{0<t<T} |w_t| < ε. We are going to evaluate the terms containing w_t in the exponent on the RHS of Eq. (B11).
1. If we assume φ is a twice continuously differentiable function, then by applying Ito's product rule to φ̇(t) · w_t, and using (∀t) |w_t| < ε,