Hybrid Levenberg-Marquardt and weak-constraint ensemble Kalman smoother method

Abstract. The ensemble Kalman smoother (EnKS) is used as a linear least-squares solver in the Gauss–Newton method for the large nonlinear least-squares system in incremental 4DVAR. The ensemble approach is naturally parallel over the ensemble members and no tangent or adjoint operators are needed. Furthermore, adding a regularization term results in replacing the Gauss–Newton method, which may diverge, by the Levenberg–Marquardt method, which is known to be convergent. The regularization is implemented efficiently as an additional observation in the EnKS. The method is illustrated on the Lorenz 63 model and a two-level quasi-geostrophic model.


Introduction
Four-dimensional variational data assimilation (4DVAR) is a dominant data assimilation method used in weather forecasting centers worldwide. 4DVAR attempts to reconcile model and data variationally, by solving a large weighted nonlinear least-squares problem. The unknown is a vector of system states over the discrete points in time when the data are given. The objective function minimized is the sum of the squares of the differences of the initial state from a known background state at the initial time and the differences of the values of the observation operator and the data at every given time point. In the weak-constraint 4DVAR (Trémolet, 2007), considered here, the model error is accounted for by allowing the ending and starting states of the model at every given time point to be different, and also adding to the objective function the sums of the squares of those differences. The sums of the squares are weighted by the inverses of the appropriate error covariance matrices, and much of the work in applications of 4DVAR goes into modeling those covariance matrices.
In the incremental approach (Courtier et al., 1994), the nonlinear least-squares problem is solved iteratively by solving a succession of linearized least-squares problems. The major cost in 4DVAR iterations is evaluating the model, the tangent and adjoint operators, and solving the large linear least squares. A significant software development effort is needed for the additional code to implement the tangent and adjoint operators of the model and the observation operators. Straightforward linearization leads to the Gauss-Newton method for nonlinear least squares (Bell, 1994; Tshimanga et al., 2008). Gauss-Newton iterations are not guaranteed to converge, not even locally, though a careful design of an application system may avoid divergence in practice. Finally, while the evaluation of the model operator is typically parallelized on modern computer architectures, there is a need to further parallelize the 4DVAR process itself.
The Kalman filter is a sequential Bayesian estimation of the Gaussian state of a linear system at a sequence of discrete time points. At each of the time points, the use of the Bayes theorem results in an update of the state, represented by its mean and covariance. The Kalman smoother considers all states within an assimilation time window to be a large composite state. Consequently, the Kalman smoother can be obtained from the Kalman filter by simply applying the same update as in the filter to the past states as well. However, historically, the focus was on efficient short recursions (Rauch et al., 1965; Strang and Borre, 1997), similarly as in the Kalman filter.

Published by Copernicus Publications on behalf of the European Geosciences Union & the American Geophysical Union.
It is well known that weak-constraint 4DVAR is equivalent to the Kalman smoother in the linear case and when all observations are in the assimilation window. Use of the Kalman smoother to solve the linear least squares in the Gauss-Newton method is known as the iterated Kalman smoother, and considerable improvements can be obtained over running the Kalman smoother only once (Bell, 1994; Fisher et al., 2005).
The Kalman filter and smoother require maintenance of the covariance of the state, which is not feasible for large systems, such as in numerical weather prediction. Hence, the ensemble Kalman filter (EnKF) and ensemble Kalman smoother (EnKS) (Evensen, 2009) use a Monte Carlo approach for large systems, representing the state by an ensemble of simulations and estimating the state covariance from the ensemble. The implementation of the EnKS in Stroud et al. (2010) uses the adjoint model explicitly, with the short recursions and a forward and backward pass, as in the Kalman smoother. However, the implementations in Khare et al. (2008) and Evensen (2009) do not depend on the adjoint model and simply apply EnKF algorithms to the composite state over multiple time points. Such composite variables are also called 4-D vectors (e.g., Desroziers et al., 2014). We use the latter approach in the computations reported here.
In this paper, we use the EnKS as a linear least-squares solver in 4DVAR. The EnKS is implemented in the physical space and with randomization. The ensemble approach is naturally parallel over the ensemble members. The rest of the computational work is relatively cheap compared to the ensemble of simulations, and parallel dense linear algebra libraries can be used; however, in high-dimensional systems or for a large lag, the storage requirements can be prohibitive (e.g., Cosme et al., 2010). The proposed approach uses finite differences from the ensemble, and no tangent or adjoint operators are needed. To stabilize the method and ensure convergence, a Tikhonov regularization term is added to the linear least squares, and the Gauss-Newton method becomes the Levenberg-Marquardt method (Levenberg, 1944; Marquardt, 1963). The Tikhonov regularization is implemented within the EnKS as an independent observation, following Johns and Mandel (2008), in a computationally cheap additional analysis step, which is statistically correct because the smoother operates only on the linearized problem. A new probabilistic ensemble is generated in every iteration, so the minimization is not restricted to the combinations of a single ensemble. We use finite differences from the ensemble mean towards the ensemble members to linearize the model and observation operators. The iterations can be proven to converge to incremental 4DVAR iterations for small finite difference steps and large ensemble sizes (Bergou et al., 2014). Thus, in the limit, the method performs actual minimization of the weak-constraint objective function and inherits the advantages of 4DVAR in handling nonlinear problems. We call the resulting method EnKS-4DVAR.
Combinations of ensemble and variational approaches have been of considerable recent interest. Estimating the background covariance for 4DVAR from an ensemble was one of the first connections (Hamill and Snyder, 2000b). It is now standard and has become operational (Wang, 2010). Zhang et al. (2009) use a two-way connection between the EnKF and 4DVAR: the EnKF provides the covariance for 4DVAR, and 4DVAR feeds the mean analysis into the EnKF. The EnKF is operational at the National Centers for Environmental Prediction (NCEP) as part of its Global Forecast System Hybrid Variational Ensemble Data Assimilation System (GDAS), together with the Gridpoint Statistical Interpolation (GSI) variational data assimilation system (Developmental Testbed Center, 2015).
The first methods that used ensembles for more than computing the covariance minimized the 3DVAR objective function in the analysis step. The maximum likelihood ensemble filter (MLEF) method by Zupanski (2005) works in the ensemble space, i.e., minimizing in the span of the ensemble members, with the control variables being the coefficients of a linear combination of the ensemble members. Gu and Oliver (2007) use an iterated ensemble Kalman filter (with randomization) in the state space, with a linearization of the observation operator obtained by a regression on the increments given by the ensemble. This approach was extended by Chen and Oliver (2013) to a Levenberg-Marquardt method, with the regularization done by a multiplicative inflation of the covariance in the linearized problem rather than by adding a Tikhonov regularization term. Liu et al. (2008, 2009) and Liu and Xiao (2013) minimize the (strong-constraint) 4DVAR objective function over linear combinations of the ensemble by computations in the observation space.
The iterated ensemble Kalman filter by Sakov et al. (2012), called IEnKF, minimizes the lag-one 4DVAR objective function in the ensemble space, using the square root EnKF as a linear solver in the Gauss-Newton method, and rescaling the ensemble to approximate the tangent operators, which is similar to the use of finite differences and the EnKS here. Bocquet and Sakov (2012) combined the IEnKF method of Sakov et al. (2012) with an inflation-free approach to obtain a 4-D ensemble variational method, and with the Levenberg-Marquardt method by adding a diagonal regularization to the Hessian. Bocquet and Sakov (2012) and Chen and Oliver (2013) used Levenberg-Marquardt for faster convergence, as an adaptive method between the steepest descent and the Gauss-Newton method, rather than to overcome divergence. Bocquet and Sakov (2012) also considered scaling the ensemble to approximate the tangent operators (the "bundle variant"), as in Sakov et al. (2012). Bocquet and Sakov (2013) extended IEnKF to a smoother (IEnKS) with fixed lag and a moving window and noted that Gauss-Newton can be replaced by Levenberg-Marquardt. The method is formulated in terms of the composite model operator, i.e., with strong constraints. Bocquet and Sakov (2014) developed the method further, including cycling. Bocquet and Sakov (2012, 2013, 2014) note that various optimizers could be used in IEnKF/IEnKS; the present method can be understood as the EnKS used as such an optimizer.
It is well known that for good practical performance, ensemble methods need to be modified by localization to reduce the sampling error. Ensemble methods can be localized in multiple ways (Sakov and Bertino, 2011). For methods operating in the physical space, localization can be achieved, e.g., by tapering of the covariance matrix (Furrer and Bengtsson, 2007) or by replacing the sample covariance by its diagonal in a spectral space (Kasanický et al., 2015). This is not completely straightforward for the EnKS, but implementations of the EnKS based on the Bryson-Frazier version of the classical formulation of the Kalman smoother, with a forward and backward pass, are more flexible (Butala, 2012). Methods in the ensemble space can be modified to update only nodes in a neighborhood of the observation (e.g., Ott et al., 2004). The 4DEnVAR method of Desroziers et al. (2014) uses ensemble-derived background covariance, and the authors propose several methods to solve the linearized problem in each iteration by combinations of ensemble members with the weights allowed to vary spatially. Lorenc et al. (2014) compare the hybrid 4DEnVAR and hybrid 4DVAR for operational weather forecasts. "Hybrid" refers to a combination of a fixed climatological model of the background error covariances and localized covariances obtained from ensembles.
The paper is organized as follows. In Sect. 2, we review the formulation of 4DVAR. The EnKF and the EnKS are reviewed in Sect. 3. The proposed method is described in Sect. 4. Section 5 contains the results of the computational experiments, and Sect. 6 is the conclusion.

Incremental 4DVAR
For vectors u_i, i = 0, ..., L, denote the composite (column) 4-D vector

u_{0:L} = [u_0^T, ..., u_L^T]^T,

where L is the number of cycles in the assimilation window. We want to estimate x_0, ..., x_L, where x_i is the state at time i, from the background state, x_0 ≈ x_b, the model, x_i ≈ M_i(x_{i−1}), and the observations, H_i(x_i) ≈ y_i, where M_i is the model operator and H_i is the observation operator. Quantifying the uncertainty by covariances, with x_0 ≈ x_b taken as (x_0 − x_b)^T B^{−1} (x_0 − x_b) ≈ 0, etc., we get the nonlinear least-squares problem called weak-constraint 4DVAR (Trémolet, 2007),

min_{x_{0:L}} (1/2) ||x_0 − x_b||²_{B^{−1}} + (1/2) Σ_{i=1}^{L} ||x_i − M_i(x_{i−1})||²_{Q_i^{−1}} + (1/2) Σ_{i=1}^{L} ||H_i(x_i) − y_i||²_{R_i^{−1}},   (1)

where ||v||²_A = v^T A v. Originally, in 4DVAR, x_i = M_i(x_{i−1}); the weak constraint x_i ≈ M_i(x_{i−1}) accounts for model error.
The least-squares problem (Eq. 1) is solved iteratively by linearization. In each iteration x_{0:L} ← x_{0:L} + δx_{0:L}, one solves the auxiliary linear least-squares problem for the increments δx_{0:L},

min_{δx_{0:L}} (1/2) ||x_0 + δx_0 − x_b||²_{B^{−1}} + (1/2) Σ_{i=1}^{L} ||x_i + δx_i − M_i(x_{i−1}) − M_i'(x_{i−1}) δx_{i−1}||²_{Q_i^{−1}} + (1/2) Σ_{i=1}^{L} ||H_i(x_i) + H_i'(x_i) δx_i − y_i||²_{R_i^{−1}}.   (2)

This is the Gauss-Newton method (Bell, 1994; Tshimanga et al., 2008) for nonlinear least squares, known in 4DVAR as the incremental approach (Courtier et al., 1994). Write the auxiliary linear least-squares problem (Eq. 2) for δx_{0:L} as

min_{δx_{0:L}} (1/2) ||δx_0 − b||²_{B^{−1}} + (1/2) Σ_{i=1}^{L} ||δx_i − M_i δx_{i−1} − m_i||²_{Q_i^{−1}} + (1/2) Σ_{i=1}^{L} ||H_i δx_i − d_i||²_{R_i^{−1}},   (3)

where

M_i = M_i'(x_{i−1}),  H_i = H_i'(x_i),  b = x_b − x_0,  m_i = M_i(x_{i−1}) − x_i,  d_i = y_i − H_i(x_i).   (4)

The function minimized in Eq. (3) is the same as the one minimized in the Kalman smoother (Bell, 1994).
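As a concrete illustration of the incremental (Gauss-Newton) iteration, the following sketch minimizes a toy weak-constraint objective with a single cycle (L = 1); the scalar model M, observation operator H, and unit weights are hypothetical choices for illustration, not the configuration used in the paper.

```python
import numpy as np

# Toy weak-constraint problem with scalar states x0, x1 (L = 1):
# background x0 ~ xb, model x1 ~ M(x0), observation H(x1) ~ y1, unit weights.
xb, y1 = 0.5, 1.2
M = lambda x: x + 0.1 * np.sin(x)      # hypothetical model operator
H = lambda x: x ** 2                   # hypothetical observation operator
Mp = lambda x: 1 + 0.1 * np.cos(x)     # tangent (derivative) of M
Hp = lambda x: 2 * x                   # tangent of H

def residual(x):
    x0, x1 = x
    return np.array([x0 - xb, x1 - M(x0), H(x1) - y1])

def jacobian(x):
    x0, x1 = x
    return np.array([[1.0, 0.0],
                     [-Mp(x0), 1.0],
                     [0.0, Hp(x1)]])

x = np.array([xb, M(xb)])              # start from the background trajectory
for _ in range(20):
    # Gauss-Newton: solve the linearized least squares for the increment
    dx, *_ = np.linalg.lstsq(jacobian(x), -residual(x), rcond=None)
    x = x + dx

# at a fixed point of the iteration, the gradient J^T r of the objective vanishes
assert np.linalg.norm(jacobian(x).T @ residual(x)) < 1e-6
```

At a fixed point, the increment solves the normal equations of the linearized problem, so the final assertion checks stationarity of the nonlinear objective.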

Ensemble Kalman filter and smoother
We present the EnKF and EnKS algorithms, essentially following Evensen (2009), in a form suitable for our purposes. We start with a formulation of the EnKF, in a notation useful for the extension to the EnKS. The ensemble of states of the linearized model at time i, conditioned on data up to time j (that is, with the data up to time j already ingested), is denoted by X^N_{i|j} = [x^ℓ_{i|j}], where the ensemble member index ℓ always runs over ℓ = 1, ..., N, and similarly for other ensembles. Assume for the moment that the observation operator H_i is linear; that is, H_i(u) = H_i u. The EnKF algorithm consists of the following steps.
1. Initialize

x^ℓ_{0|0} ∼ N(x_b, B),  ℓ = 1, ..., N.   (5)

2. For i = 1, 2, ..., L:

(a) Advance in time:

x^ℓ_{i|i−1} = M_i x^ℓ_{i−1|i−1} + v^ℓ_i,  v^ℓ_i ∼ N(0, Q_i).   (6)

(b) The analysis step is

x^ℓ_{i|i} = x^ℓ_{i|i−1} + P^N_i H_i^T (H_i P^N_i H_i^T + R_i)^{−1} (y_i + e^ℓ_i − H_i x^ℓ_{i|i−1}),  e^ℓ_i ∼ N(0, R_i),   (7)

where P^N_i is the sample covariance of the forecast ensemble X^N_{i|i−1}. Denote by A^N_i the matrix of anomalies of the ensemble,

A^N_i = [x^1_{i|i−1} − x̄^N_{i|i−1}, ..., x^N_{i|i−1} − x̄^N_{i|i−1}].   (8)

Then P^N_i = A^N_i (A^N_i)^T / (N − 1), and we can write the matrices in Eq. (7) as

P^N_i H_i^T = (1/(N − 1)) A^N_i (H_i A^N_i)^T,  H_i P^N_i H_i^T = (1/(N − 1)) (H_i A^N_i)(H_i A^N_i)^T.   (9)

In particular, the matrix H_i is used here only in the matrix-vector multiplications H_i x^ℓ_{i|i−1} and H_i A^N_i, which allows the matrix-vector multiplication to be replaced by the use of a possibly nonlinear observation operator H_i evaluated on the ensemble members only (Eq. 18 below).
This technique is commonly used for nonlinear observation operators. With

H_i A^N_i ≈ [H_i(x^1_{i|i−1}) − H̄_i, ..., H_i(x^N_{i|i−1}) − H̄_i],  H̄_i = (1/N) Σ_{k=1}^{N} H_i(x^k_{i|i−1}),   (10)

Eq. (9) becomes an expression entirely in terms of the ensemble members. Also, from Eqs. (7) and (9), and writing the matrix of anomalies in the form

A^N_i = X^N_{i|i−1} (I_N − (1/N) 1 1^T),   (11)

where 1 is the column vector of all ones of length N, it follows that the analysis ensemble X^N_{i|i} consists of linear combinations of the forecast ensemble. Hence, it can be written as multiplying the forecast ensemble by a suitable transformation matrix T^N_i,

X^N_{i|i} = X^N_{i|i−1} T^N_i.   (12)

The EnKS is obtained by applying the same analysis step as in the EnKF (Eq. 7) to the ensemble X^N_{0:i|i−1} of 4-D composite states from time 0 to i, conditioned on data up to time i − 1, in place of X^N_{i|i−1}, with the observation matrix H_{0:i} = [0, ..., 0, H_i]. Then Eq. (7) becomes

x^ℓ_{0:i|i} = x^ℓ_{0:i|i−1} + P^N_{0:i,0:i} H_{0:i}^T (H_{0:i} P^N_{0:i,0:i} H_{0:i}^T + R_i)^{−1} (y_i + e^ℓ_i − H_i x^ℓ_{i|i−1}),   (13)

where P^N_{0:i,0:i} is the sample covariance matrix of X^N_{0:i|i−1}. Fortunately, the matrix-vector and matrix-matrix products can be simplified:

P^N_{0:i,0:i} H_{0:i}^T = (1/(N − 1)) A^N_{0:i} (H_i A^N_i)^T,  H_{0:i} P^N_{0:i,0:i} H_{0:i}^T = (1/(N − 1)) (H_i A^N_i)(H_i A^N_i)^T,   (14)

which is the same expression as in Eq. (9). Also using Eq. (11), we obtain the EnKS algorithm.
1. Initialize:

x^ℓ_{0|0} ∼ N(x_b, B),  ℓ = 1, ..., N.   (15)

2. For i = 1, ..., L:

(a) Advance in time:

x^ℓ_{i|i−1} = M_i x^ℓ_{i−1|i−1} + v^ℓ_i,  v^ℓ_i ∼ N(0, Q_i).   (16)

(b) Compute the anomalies of the ensemble in the state space and in the observation space,

A^N_{0:i} = X^N_{0:i|i−1} (I_N − (1/N) 1 1^T),   (17)

H_i A^N_i ≈ [H_i(x^1_{i|i−1}), ..., H_i(x^N_{i|i−1})] (I_N − (1/N) 1 1^T),   (18)

with the possibly nonlinear observation operator H_i evaluated on the ensemble members only.
(c) The analysis step:

x^ℓ_{0:i|i} = x^ℓ_{0:i|i−1} + (1/(N − 1)) A^N_{0:i} (H_i A^N_i)^T ((1/(N − 1)) (H_i A^N_i)(H_i A^N_i)^T + R_i)^{−1} (y_i + e^ℓ_i − H_i x^ℓ_{i|i−1}),  e^ℓ_i ∼ N(0, R_i).   (19)

Comparing Eqs. (7) and (19), we see that the EnKS can be implemented in a straightforward manner by applying the same transformation as in the EnKF to the composite 4-D state vector from times 0 to i, X^N_{0:i|i} = X^N_{0:i|i−1} T^N_i, where T^N_i is the transformation matrix in Eq. (12) (Brusdal et al., 2003, Eq. 20).
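To make the anomaly-based formulas concrete, here is a minimal numpy sketch of the stochastic analysis step, with all products formed from ensemble anomalies, together with a check that applying the same step to a composite two-time ensemble (the EnKS view) reproduces the single-time EnKF update at the current time. The dimensions, observation operator, and data below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, N = 3, 2, 50                       # state dim, obs dim, ensemble size
Hmat = np.hstack([np.eye(m), np.zeros((m, n - m))])   # observe first 2 components
R = 0.5 * np.eye(m)
y = np.array([1.0, -0.5])

Xpast = rng.standard_normal((n, N))                   # ensemble of an earlier state
Xf = Xpast + 0.1 * rng.standard_normal((n, N))        # forecast ensemble at time i
D = y[:, None] + rng.multivariate_normal(np.zeros(m), R, N).T   # perturbed data

def analysis(X, HX, D, R):
    """Stochastic analysis X + P H^T (H P H^T + R)^{-1} (D - H X),
    with all products formed from ensemble anomalies."""
    Nn = X.shape[1]
    A = X - X.mean(axis=1, keepdims=True)             # state anomalies
    HA = HX - HX.mean(axis=1, keepdims=True)          # observation-space anomalies
    S = HA @ HA.T / (Nn - 1) + R
    return X + (A @ HA.T / (Nn - 1)) @ np.linalg.solve(S, D - HX)

HX = Hmat @ Xf
Xa = analysis(Xf, HX, D, R)              # EnKF: update the current state only

# EnKS: apply the same step to the composite 4-D ensemble; the observation
# touches only time i, so HX is unchanged and past states get smoothed too.
Xc = np.vstack([Xpast, Xf])
Xc_a = analysis(Xc, HX, D, R)

assert np.allclose(Xc_a[n:], Xa)         # current-time block matches the EnKF
```

The final check illustrates the statement above: the EnKS is the EnKF transformation applied to the composite state over multiple times.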

EnKS-4DVAR
We apply the EnKS algorithm (Eqs. 15-19) with the increments δx in place of x to solve the linearized auxiliary least-squares problem (Eq. 3). Approximating by finite differences based at x_{i−1} with step τ > 0, we get the action of the linearized model operator,

M_i'(x_{i−1}) δx ≈ (M_i(x_{i−1} + τ δx) − M_i(x_{i−1})) / τ,   (20)

and the linearized observation operator,

H_i'(x_i) δx ≈ (H_i(x_i + τ δx) − H_i(x_i)) / τ.   (21)

The Gauss-Newton method may diverge, but convergence to a stationary point of Eq. (1) can be recovered by a control of the step δx. Adding a constraint of the form ||δx_i|| ≤ ε leads to globally convergent trust region methods (Gratton et al., 2013). Here, we add to Eq. (3) a Tikhonov regularization term of the form γ ||δx_i||²_{S_i^{−1}}, which controls the step size as well as rotates the step direction towards the steepest descent, and obtain the Levenberg-Marquardt method (Levenberg, 1944; Marquardt, 1963) x_{0:L} ← x_{0:L} + δx_{0:L}, where δx_{0:L} solves

min_{δx_{0:L}} (1/2) ||δx_0 − b||²_{B^{−1}} + (1/2) Σ_{i=1}^{L} ||δx_i − M_i δx_{i−1} − m_i||²_{Q_i^{−1}} + (1/2) Σ_{i=1}^{L} ||H_i δx_i − d_i||²_{R_i^{−1}} + (γ/2) Σ_{i=1}^{L} ||δx_i||²_{S_i^{−1}}.   (22)

Under suitable technical assumptions, the Levenberg-Marquardt method is guaranteed to converge globally if the regularization parameter γ ≥ 0 is large enough (Gill and Murray, 1978; Osborne, 1976). Estimates for the convergence of the Levenberg-Marquardt method in the case when the linear system is solved only approximately exist (Wright and Holt, 1985).
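The finite-difference action of the tangent operator can be sketched as follows; the model M below is a hypothetical smooth map, used only to check the approximation against a hand-coded Jacobian.

```python
import numpy as np

# Finite-difference action of the tangent operator (as in Eq. 20), assuming a
# hypothetical smooth model M; no adjoint or hand-coded tangent is needed.
M = lambda x: np.array([x[0] * x[1], np.sin(x[1])])

def tangent_action(M, x, dx, tau=1e-6):
    """Approximate M'(x) dx by a one-sided finite difference with step tau."""
    return (M(x + tau * dx) - M(x)) / tau

x = np.array([1.0, 2.0])
dx = np.array([0.3, -0.1])
# exact Jacobian of M at x, applied to dx, for comparison
exact = np.array([[x[1], x[0]], [0.0, np.cos(x[1])]]) @ dx
approx = tangent_action(M, x, dx)
assert np.linalg.norm(approx - exact) < 1e-4
```

Only evaluations of M itself are used, which is the point of the method: the ensemble supplies the directions δx, and the tangent and adjoint codes are never needed.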
Similarly as in Johns and Mandel (2008), we interpret the regularization term γ ||δx_i||²_{S_i^{−1}} in Eq. (22) as arising from additional independent observations δx_i ≈ 0 with covariance γ^{−1} S_i. The independent observations can be assimilated separately, resulting in a mathematically equivalent but often more efficient two-stage method: simply run the EnKF analysis twice. With the choice of S_i as an identity or, more generally, a diagonal matrix, the implementation of these large observations can be made efficient (Mandel et al., 2009). We use the notation δx_{0:i|i−1/2} for the increments after the first half-step, conditioned on the original observations only, and δx_{0:i|i} for the increments conditioned also on the regularization δx_i ≈ 0. Note that, unlike in Johns and Mandel (2008), where the regularization was applied to a nonlinear problem and thus the sequential data assimilation was only approximate, here the EnKS is run on the auxiliary linearized problem, so all distributions are Gaussian and the equivalence of assimilating the observations at the same time and sequentially is statistically exact.
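The statistical equivalence of joint and sequential assimilation can be checked directly with the exact (mean and covariance) Kalman update on a small Gaussian example; the ensemble version approximates the same identity. All dimensions and values below are illustrative.

```python
import numpy as np

# For a Gaussian prior and linear observations, assimilating two independent
# observations jointly or one after the other gives the same posterior; the
# Tikhonov term acts as the second "observation" dx ~ 0 with covariance S/gamma.
rng = np.random.default_rng(1)
n, gamma = 3, 0.5
P = 2.0 * np.eye(n)                       # prior covariance of the increment
xm = rng.standard_normal(n)               # prior mean
H = rng.standard_normal((2, n)); R = np.eye(2); y = rng.standard_normal(2)
S = np.eye(n)                             # regularization covariance

def kalman(xm, P, H, R, y):
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    return xm + K @ (y - H @ xm), (np.eye(len(xm)) - K @ H) @ P

# Stage 1: real observation; stage 2: regularization dx ~ 0 with covariance S/gamma.
x1, P1 = kalman(xm, P, H, R, y)
x2, P2 = kalman(x1, P1, np.eye(n), S / gamma, np.zeros(n))

# Joint assimilation of both observations at once.
Hj = np.vstack([H, np.eye(n)])
Rj = np.block([[R, np.zeros((2, n))], [np.zeros((n, 2)), S / gamma]])
xj, Pj = kalman(xm, P, Hj, Rj, np.concatenate([y, np.zeros(n)]))

assert np.allclose(x2, xj) and np.allclose(P2, Pj)
```

Because the linearized problem is Gaussian, this equivalence is exact, which is why the second analysis step in the method is statistically correct.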

1. Initialize x_0, ..., x_L, e.g., by the model run x_i = M_i(x_{i−1}) starting from x_0 = x_b, if not given already.

2. Incremental 4DVAR iteration (Eq. 2): given x_0, ..., x_L, initialize the ensemble of increments, δx^ℓ_{0|0} ∼ N(0, B), ℓ = 1, ..., N.

(a) For i = 1, ..., L:

i. Advance the ensemble of increments δx in time following Eq. (16), with the linearized operator approximated from Eq. (20).

ii. Compute the anomalies of the ensemble in the 4-D state space and in the observation space.

iii. The first analysis step: assimilate the observation d_i as in Eq. (19), giving δx^ℓ_{0:i|i−1/2}.

iv. If γ > 0, compute the anomalies Z^N_i of the ensemble in the 4-D state space. The observation operator for the regularization is the identity, so the anomalies in the observation space are simply Z^N_i.

v. If γ > 0, perform the regularization as the second analysis step with zero data and data covariance γ^{−1} S_i, giving δx^ℓ_{0:i|i}; otherwise, δx^ℓ_{0:i|i} = δx^ℓ_{0:i|i−1/2}, ℓ = 1, ..., N.

(b) Complete the approximate incremental 4DVAR iteration: update x_{0:L} ← x_{0:L} + δx̄_{0:L|L}, where δx̄_{0:L|L} is the ensemble mean of the increments.

Note that for small γ → 0, the second analysis step asymptotically has no effect: δx^ℓ_{0:i|i} → δx^ℓ_{0:i|i−1/2}. The computational cost of EnKS-4DVAR is one evaluation of the model M_i for the initialization and N + 1 evaluations of the model M_i and N evaluations of the observation operator H_i in each incremental 4DVAR iteration, in each of the L observation periods. In comparison, the cost of the EnKF is N evaluations of the model M_i and of the observation operator H_i in each observation period. In a reasonably efficient EnKF/EnKS implementation, running the model and evaluating the observation operator are the major costs in practical problems such as weather models, rather than the linear algebra of the EnKS itself.
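The whole loop can be sketched compactly on a hypothetical scalar problem with L = 2 cycles; the model, observation operator, weights, and the choices of γ, τ, and N below are illustrative, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, tau, gamma = 200, 2, 1e-4, 0.1          # ensemble size, cycles, FD step, LM parameter
xb, B, Q, R, S = 0.8, 1.0, 0.1, 0.1, 1.0      # scalar problem: covariances are numbers
M = lambda x: x + 0.05 * np.sin(x)            # hypothetical scalar model
H = lambda x: x ** 2                          # hypothetical observation operator

truth = [0.9]
for _ in range(L):
    truth.append(M(truth[-1]))
y = [H(t) for t in truth[1:]]                 # noise-free data, for a clean check

def analysis(X, HX, d, r):
    """Stochastic analysis on the composite increment ensemble X (rows = times)."""
    n = X.shape[1]
    A = X - X.mean(axis=1, keepdims=True)
    HA = HX - HX.mean(axis=1, keepdims=True)
    s = (HA @ HA.T).item() / (n - 1) + r      # scalar innovation covariance
    K = (A @ HA.T) / (n - 1) / s
    D = d + np.sqrt(r) * rng.standard_normal(n)
    return X + K * (D - HX)

def J(x):                                     # weak-constraint objective (Eq. 1 form)
    return ((x[0] - xb) ** 2 / B
            + sum((x[i] - M(x[i - 1])) ** 2 / Q for i in range(1, L + 1))
            + sum((y[i - 1] - H(x[i])) ** 2 / R for i in range(1, L + 1)))

x = np.array([xb, M(xb), M(M(xb))])           # initial trajectory: free background run
J0 = J(x)
for _ in range(10):                           # outer Levenberg-Marquardt iterations
    dX = np.sqrt(B) * rng.standard_normal((1, N))          # increments at time 0
    for i in range(1, L + 1):
        Mdx = (M(x[i - 1] + tau * dX[-1]) - M(x[i - 1])) / tau   # FD tangent action
        new = Mdx + (M(x[i - 1]) - x[i]) + np.sqrt(Q) * rng.standard_normal(N)
        dX = np.vstack([dX, new])
        Hdx = (H(x[i] + tau * dX[-1]) - H(x[i])) / tau
        dX = analysis(dX, Hdx[None, :], y[i - 1] - H(x[i]), R)   # observations
        dX = analysis(dX, dX[-1:].copy(), 0.0, S / gamma)        # regularization dx_i ~ 0
    x = x + dX.mean(axis=1)

assert J(x) < J0                              # the iterations decrease the objective
```

Each outer iteration rebuilds the increment ensemble from scratch, runs the EnKS with the finite-difference linearization, applies the regularization as a second analysis with zero data, and updates the trajectory by the ensemble mean of the increments.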
It can be proven that for small τ and large N, the iterates x_{0:L} converge to those of incremental 4DVAR (Bergou et al., 2014). Surprisingly, it turns out that in the case τ = 1 we recover the standard EnKS applied directly to the nonlinear problem (Eq. 1). In particular, EnKS-4DVAR does not converge when τ = 1 for nonlinear problems, because the result of each iteration is determined only by the starting value x_0. It is interesting that the ensemble transform approach in Sakov et al. (2012) and Bocquet and Sakov (2012, 2013, 2014) corresponds to our τ = 1, but it does not seem to reduce to the standard EnKS.

Computational results
In this section, we investigate the performance of the EnKS-4DVAR method described in this paper by solving the nonlinear least-squares problem (Eq. 1), in which the dynamical model is either the Lorenz 63 system (Lorenz, 1963) or the two-level quasi-geostrophic model (Fandry and Leslie, 1984). Most of the experiments assess the convergence of the incremental 4DVAR iterations, with the EnKS as the linear solver in a single assimilation cycle (Sects. 5.1.1, 5.1.2). We also demonstrate the overall long-term performance over a large number of assimilation cycles on the Lorenz 63 model in Sect. 5.1.3. We first consider experiments where the regularization is not necessary to guarantee convergence (i.e., γ = 0). The Lorenz 63 equations are used as the forecast model for these experiments. Section 5.1 describes the Lorenz 63 model and presents numerical results on the convergence. Using the same model, in Sect. 5.1.2, we investigate the impact of the finite difference parameter τ, used to approximate the derivatives of the model and observation operators, along the iterations.
Experiments where the regularization is necessary to guarantee convergence are shown in Sect. 5.2, where we analyze the impact of the regularization parameter γ in the application to the two-level quasi-geostrophic model.
Note that for the experiments presented here, we do not use localization; hence, we choose large ensemble sizes. In all experiments, the regularization covariance is S_i = I.

Numerical tests using the Lorenz 63 model
The Lorenz 63 equations (Lorenz, 1963) are given by the nonlinear system

dx/dt = σ(y − x),  dy/dt = x(ρ − z) − y,  dz/dt = xy − βz,   (32)

where x = x(t), y = y(t), z = z(t), and σ, ρ, β are parameters whose values are chosen as 10, 28, and 8/3, respectively, for the experiments described in this paper. These values result in chaotic behavior with two regimes, as illustrated in Fig. 1. This figure shows the Lorenz attractor, which has two lobes connected near the origin; the trajectories of the system in this saddle region are particularly sensitive to perturbations. Hence, slight perturbations can alter the subsequent path from one lobe to the other. The state at time t is denoted by (x(t), y(t), z(t)).

To evaluate the performance of the EnKS-4DVAR method, we test it using the classical twin experiment technique, which consists of fixing an initial true state, denoted by truth_0, and then integrating the initial truth in time using the model to obtain the true state truth_i = M(truth_{i−1}) at each cycle i. We then build the data y_i by applying the observation operator H_i to the truth at time i and adding a Gaussian perturbation N(0, R_i). Similarly, the background x_b is sampled from the Gaussian distribution with mean truth_0 and covariance matrix B. Then, we try to recover the truth using the observations and the background. The background covariance is chosen as the identity matrix of order 3, i.e., B = I_3. The model is advanced in cycles of 0.1 time unit. Within each cycle, the differential equations are solved by the adaptive Runge-Kutta method implemented as the MATLAB function ode45, with default parameter values. The assimilation time window length is L = 50 cycles (5 time units total). The observation operator is defined as H_i(x, y, z) = (x², y², z²). At each time i, the observations are constructed as y_i = H_i(truth_i) + v_i, where v_i is sampled from N(0, R) with R = I_3. Observations are taken at each cycle (i = 1, ..., 50). The ensemble size is fixed to N = 100.
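A minimal version of this twin-experiment setup can be written as follows; a fixed-step RK4 integrator stands in for MATLAB's adaptive ode45, and the step size is an illustrative choice.

```python
import numpy as np

# Lorenz 63 with the standard parameters; RK4 steps over 0.1-time-unit cycles.
sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0

def f(u):
    x, y, z = u
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def advance(u, T=0.1, dt=0.001):
    """Classical 4th-order Runge-Kutta over one assimilation cycle."""
    for _ in range(int(round(T / dt))):
        k1 = f(u); k2 = f(u + dt / 2 * k1)
        k3 = f(u + dt / 2 * k2); k4 = f(u + dt * k3)
        u = u + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return u

rng = np.random.default_rng(0)
H = lambda u: u ** 2                     # observation operator of the experiment
truth = np.array([1.0, 1.0, 1.0])        # illustrative initial truth
obs = []
for i in range(50):                      # L = 50 cycles of 0.1 time unit
    truth = advance(truth)
    obs.append(H(truth) + rng.standard_normal(3))   # data with R = I_3
```

The sequence `obs` plays the role of the data y_i in the twin experiment; the truth trajectory stays on the bounded Lorenz attractor.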

Convergence of the iterations
Figure 2 shows the root square error (RSE) for the first five iterations, defined as

RSE_i^{(j)} = ||truth_i − x_i^{(j)}|| / √n,

where truth_i is the true state vector at time i, x_i^{(j)} is the j-th iterate at time i, and n is the length of x_i. Table 1 shows the root mean square error (RMSE) for each iterate, given by

RMSE^{(j)} = ( (1/(L + 1)) Σ_{i=0}^{L} (RSE_i^{(j)})² )^{1/2}.

From Table 1 and Fig. 2, it can be seen that the iterates converge to the solution without using regularization. For these experiments, we observe that the RMSE is reduced significantly in five iterations. Note that the error does not converge to zero because of the approximation and variability inherent in the ensemble approach.
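The two error measures can be computed as below; averaging the squared RSE over the times in the window is an assumption consistent with the definitions above.

```python
import numpy as np

def rse(truth, x):
    """Root square error of one state: ||truth - x|| / sqrt(n)."""
    return np.linalg.norm(truth - x) / np.sqrt(len(x))

def rmse(truths, xs):
    """RMSE over the window: quadratic mean of the per-time RSE values."""
    return np.sqrt(np.mean([rse(t, x) ** 2 for t, x in zip(truths, xs)]))

# tiny illustrative check: RSE 1 at one time, 0 at the other, gives sqrt(1/2)
truths = [np.zeros(3), np.ones(3)]
xs = [np.ones(3), np.ones(3)]
assert abs(rmse(truths, xs) - np.sqrt(0.5)) < 1e-12
```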

The impact of the finite difference parameter
Now we investigate the influence of the finite difference parameter τ used to approximate the derivatives of the model and observation operators. We use the same experimental setup as described in the previous section. The numerical results are based on 30 runs with eight iterations for the Lorenz 63 problem, with the following choices of the parameter τ: 1, 10^{−1}, 10^{−2}, 10^{−3}, 10^{−4}, 10^{−5}, and 10^{−6}. Table 2 shows the mean of the objective function value as a function of the finite difference step τ and the number of iterations. When τ = 1, the iterations after the first one do not improve the objective function. However, when τ ≤ 10^{−1}, the objective function was overall decreasing along the iterations after a large initial increase. Because of the stochastic nature of the algorithm, the objective function does not necessarily decrease in every iteration, and its values eventually fluctuate randomly around a limit value. This stage was reached after at most six iterations, so only eight iterations are shown; further lines (not shown) exhibit the same fluctuating pattern in all columns. This limit value of the objective function decreases with smaller τ until it stabilizes for τ ≤ 10^{−3}. Figures 3 and 4 show more details of the statistics as box plots of the objective function values. Each panel corresponds to one line of Table 2.
We can conclude that, for this toy test case at least, the method was insensitive to the choice of τ ≤ 10^{−3}. This is a similar conclusion to that in Bocquet and Sakov (2014); the parameter τ here plays the same role as their ε. It should be noted that a very small τ, when the problem solved by the smoother is essentially the tangent problem, results in a large increase in the value of the objective function in the first iteration. This is not uncommon in Newton-type methods on highly nonlinear problems. Hence, an adaptive method, which decreases τ adaptively, may be of interest. This issue will be studied elsewhere.

Cycling
So far, we have studied the impact of the use of the stochastic solver for a single assimilation window only. Now we test the overall long-term performance. Consider again the Lorenz 63 model (Eq. 32), with the parameters σ = 10, ρ = 28, β = 8/3. This time, we use the Runge-Kutta method of order 4 with a time step of 0.01 time unit. This is the same parameter setup as the one used in Bocquet and Sakov (2012). We then proceed with testing similar to that in Metref et al. (2014).

(Displaced caption of the box plots in Figs. 3 and 4: The assimilation window is 50 cycles. In each box, the central line is the median (red), the edges are the 25th and 75th percentiles (blue), the whiskers extend to the most extreme data points that the plot algorithm does not consider outliers (black), and the outliers are plotted individually (red dots).)
We perform the usual twin model experiment. The initial truth state Y_0 is generated from the N(0, I_3) distribution, and the initial forecast state is then simulated by sampling from N(Y_0, I_3). Both states are advanced for a burn-in period of 50 000 model time steps. We use the nonlinear observation operator h(x, y, z) = (x³, y³, z³) with observational error generated from N(0, σ² I_3) with σ² = 8, and τ = 10^{−4}. The cycle length Δt between two available observations varies from 0.05 time unit, when the model is nearly linear, to 0.55 time unit, when the model is strongly nonlinear. We use ensemble size 10. After running multiple simulations, we have found suitable values of the parameters of the method to be 25 iterations and the penalty coefficient γ = 10^{−9} when Δt = 0.05 and γ = 1000 otherwise. The length of the assimilation window is L = 6, i.e., assimilating six observation vectors at once. Each observation vector is assimilated only once; i.e., the assimilation windows do not overlap. To create the initial ensemble at the beginning of each iteration, we use the background covariance created as a weighted average of the sample covariance from the last iteration in the previous assimilation window and the identity matrix, similarly as in Hamill and Snyder (2000a). The weights are 0.99 for the sample covariance and 0.01 for the identity. The model error covariance in each cycle is Q = 0.01 I_3. The experiment was run for 100 000 observation cycles. We also compare the proposed method with the standard EnKF with ensemble size 10, where the initial ensemble is created after the burn-in period by adding perturbations sampled from N(0, I_3). For stability reasons and to preserve the covariance between ensemble members, we add noise sampled from N(0, 0.01 I_3) after advancing the ensemble. The necessity of related covariance inflation was also pointed out in Bocquet and Sakov (2012). The EnKF algorithm is run every time new observations are available.

(Displaced fragment of Table 2, mean objective function values; columns τ = 1, 10^{−1}, 10^{−2}, 10^{−3}, 10^{−4}, 10^{−5}, 10^{−6}:
Init: 5.61 × 10^6 in all columns;
iteration 1: 1.02 × 10^6, 1.39 × 10^9, 3.21 × 10^9, 3.54 × 10^9, 3.58 × 10^9, 3.58 × 10^9, 3.58 × 10^9;
iteration 2: 1.39 × 10^6, 5.27 × 10^7, 1.70 × 10^8, 1.93 × 10^8, 1.96 × 10^8, 1.96 × 10^8, 1.96 × 10^8;
iteration 3: 1.32 × 10^6, 4.14 × 10^6, 2.99 × 10^6, 3.69 × 10^6, 3.76 × 10^6, 3.77 × 10^6, 3.77 × 10^6.)
Figure 5 shows that the proposed method has a significantly smaller RMSE than the EnKF when the time between observations is larger and the behavior of the model is thus nonlinear. Only when the cycle length between observations is 0.05 time unit, i.e., when the model behavior is nearly linear, does the EnKF give results comparable to the proposed method.

Numerical tests using a two-layer quasi-geostrophic model (QG)
The EnKS-4DVAR algorithm has been implemented in the Object Oriented Prediction System (OOPS) (Trémolet, 2013), a data assimilation framework developed by the European Centre for Medium-Range Weather Forecasts (ECMWF). Numerical experiments are performed with the simple two-layer quasi-geostrophic model in the OOPS platform, solving the weak-constraint data assimilation problem (Eq. 1) by EnKS-4DVAR with regularization. Numerical results are presented in Sect. 5.2.3.

A two-layer quasi-geostrophic model
The two-layer quasi-geostrophic channel model is widely used in theoretical atmospheric studies, since it is simple enough for numerical calculations and it adequately captures an important aspect of large-scale dynamics in the atmosphere.
The two-layer quasi-geostrophic model equations are based on the non-dimensional quasi-geostrophic potential vorticity, whose evolution represents large-scale circulations of the atmosphere. The quasi-geostrophic potential vorticity on the first (upper) and second (lower) layers can be written in non-dimensional form (Fandry and Leslie, 1984; Pedlosky, 1979), respectively, as

q_1 = ∇²ψ_1 − F_1 (ψ_1 − ψ_2) + βy,
q_2 = ∇²ψ_2 − F_2 (ψ_2 − ψ_1) + βy + R_s,   (35)

with

F_1 = f_0² L² / (g' H_1),  F_2 = f_0² L² / (g' H_2),  β = β_0 L² / Ū,

where ψ_1 and ψ_2 are the stream functions, ∇² is the 2-D Laplacian, R_s represents orography or heating, β is the (non-dimensionalized) northward variation of the Coriolis parameter at the fixed latitude y, and f_0 is the Coriolis parameter at the southern boundary of the domain. L is the typical length scale of the motion we wish to describe, H_1 and H_2 are the depths of the two layers, and g' = g Δθ/θ̄ is the reduced gravity, where θ̄ is the mean potential temperature and Δθ is the difference in potential temperature across the layer interface. Here t denotes time, Ū is a typical velocity scale, x and y are the eastward and northward coordinates, respectively, u and v are the horizontal velocity components, β_0 is the northward derivative of the Coriolis parameter, and the tilde notation refers to the dimensionalized parameters. Potential vorticity in each layer is conserved and thus is described by

D_i q_i / Dt = 0,  i = 1, 2,   (36)

where D_i/Dt is the total derivative, defined by

D_i/Dt = ∂/∂t + u_i ∂/∂x + v_i ∂/∂y,   (37)

and

u_i = −∂ψ_i/∂y,  v_i = ∂ψ_i/∂x   (38)

are the horizontal velocity components in each layer. Therefore, the potential vorticity at each time step is determined by using the conservation of potential vorticity given by Eq. (36). In this process, time stepping consists of a simple first-order semi-Lagrangian advection of potential vorticity.
Given the potential vorticity at a fixed time, Eq. (35) can be solved for the stream function at each grid point, and the velocity fields are then obtained through Eq. (38). The equations are solved using periodic boundary conditions in the west-east direction and Dirichlet boundary conditions in the north-south direction. For the experiments in this paper, we choose L = 10^6 m, Ū = 10 m s^{−1}, H_1 = 6000 m, H_2 = 4000 m, f_0 = 10^{−4} s^{−1}, and β_0 = 1.5 × 10^{−11} s^{−1} m^{−1}. For more details on the model and its solution, we refer to Fisher et al. (2011).
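The first-order semi-Lagrangian step can be illustrated in one dimension on a periodic grid: each grid point is traced back along the (here constant) velocity, and the advected field is obtained by linear interpolation at the departure point. This is an illustrative stand-in for the model's 2-D potential-vorticity step, not the OOPS implementation.

```python
import numpy as np

def semi_lagrangian_step(q, u, dt, dx):
    """Advect q by constant velocity u on a periodic grid: trace each grid
    point back by u*dt and interpolate q linearly at the departure point."""
    n = len(q)
    depart = (np.arange(n) - u * dt / dx) % n     # departure points (grid units)
    i0 = np.floor(depart).astype(int)
    w = depart - i0                               # linear interpolation weight
    return (1 - w) * q[i0 % n] + w * q[(i0 + 1) % n]

n, dx, dt, u = 100, 1.0, 0.5, 1.0
q = np.exp(-0.5 * ((np.arange(n) - 50) / 5.0) ** 2)   # smooth initial profile
q1 = semi_lagrangian_step(q, u, dt, dx)
# the profile moves downstream by u*dt; first-order interpolation damps the peak slightly
assert abs(int(np.argmax(q1)) - 50) <= 1
```

The scheme is unconditionally stable in the advective sense, which is one reason semi-Lagrangian time stepping is attractive for large time steps, at the price of some numerical damping.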
The domain for the experiments is 12 000 km by 6300 km for both layers. The horizontal discretization consists of 40 × 20 points, so that the east-west and north-south resolution is approximately 300 km. The dimension of the state vector of the model is then 1600. Note that the state vector is defined only in terms of the stream function.

Experimental setup
The performance of EnKS-4DVAR with regularization is analyzed by using twin experiments (Sect. 5.1).
The truth is generated from a model with layer depths of D_1 = 6000 m and D_2 = 4000 m and a time step of 300 s, whereas the assimilating model has layer depths of D_1 = 5500 m and D_2 = 4500 m and a time step of 3600 s. These differences in the layer depths and the time step provide a source of model error.
For all the experiments presented here, observations of the non-dimensional stream function, vector wind, and wind speed were taken from the truth run of the model at 100 points randomly distributed over both levels. Observations were taken every 12 h. We note that the number of observations is much smaller than the dimension of the state vector. Observation errors were assumed to be mutually independent and uncorrelated in time. The standard deviations (SD) were chosen to be 0.4 for the stream function observation error, 0.6 for the vector wind, and 1.2 for the wind speed. The observation operator is the bilinear interpolation of the model fields to the horizontal observation locations.
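A bilinear-interpolation observation operator of this kind can be sketched as below; the helper name, the grid-index conventions, and the synthetic-observation snippet are illustrative assumptions (only the 100 random locations and the stream-function error SD of 0.4 come from the text):

```python
import numpy as np

def obs_operator(psi, xy_obs, dx, dy):
    """Bilinearly interpolate a gridded field psi (shape ny x nx) to the
    physical locations xy_obs (shape m x 2, columns x and y).
    Grid-index arithmetic and names are illustrative assumptions."""
    ny, nx = psi.shape
    xf = xy_obs[:, 0] / dx              # fractional column index
    yf = xy_obs[:, 1] / dy              # fractional row index
    i0 = np.clip(np.floor(xf).astype(int), 0, nx - 2)
    j0 = np.clip(np.floor(yf).astype(int), 0, ny - 2)
    fx = xf - i0
    fy = yf - j0
    return ((1 - fx) * (1 - fy) * psi[j0, i0]
            + fx * (1 - fy) * psi[j0, i0 + 1]
            + (1 - fx) * fy * psi[j0 + 1, i0]
            + fx * fy * psi[j0 + 1, i0 + 1])

# Synthetic observations: interpolate a truth field to 100 random locations
# and add Gaussian noise with the stated SD (0.4 for the stream function).
rng = np.random.default_rng(0)
ny, nx, dx, dy = 20, 40, 300e3, 300e3
psi_true = rng.standard_normal((ny, nx))
locs = np.column_stack([rng.uniform(0, (nx - 1) * dx, 100),
                        rng.uniform(0, (ny - 1) * dy, 100)])
y_obs = obs_operator(psi_true, locs, dx, dy) + 0.4 * rng.standard_normal(100)
```

Bilinear interpolation reproduces any field that is linear in x and y exactly, which makes the operator easy to validate.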
The background error covariance matrix (matrix B) and the model error covariances (matrices Q_i) used in these experiments correspond to vertical and horizontal correlations. The vertical and horizontal structures are assumed to be separable. In the horizontal plane, the covariance matrices correspond to isotropic, homogeneous correlations of the stream function with Gaussian spatial structure, obtained from a fast Fourier transform approach (Dietrich and Newsam, 1997; Nowak et al., 2003). For the background covariance matrix B, the SD and the horizontal correlation length scale were set to 0.8 and 10^6 m, respectively. For the model error covariance matrices Q_i, the SD and the horizontal correlation length scale were set to 0.2 and 2 × 10^6 m, respectively. The vertical correlation is assumed to be constant over the horizontal grid, and the correlation coefficient between the two layers was taken as 0.5 for Q_i and 0.2 for B.
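As a rough illustration of the FFT-based construction behind these covariances (here reduced to a periodic 1-D grid; the paper's fields are 2-D with a separable vertical part), one can draw samples with a Gaussian covariance via the spectral decomposition of a circulant matrix:

```python
import numpy as np

def sample_gaussian_field(n, dx, sigma, ell, rng):
    """Draw one sample of a stationary Gaussian random field on a periodic
    1-D grid of n points with spacing dx and covariance
    C(d) = sigma^2 * exp(-d^2 / (2 ell^2)), using the FFT diagonalization
    of the circulant covariance matrix (the idea behind the approach of
    Dietrich and Newsam, 1997). A 1-D illustrative sketch only."""
    k = np.arange(n)
    d = np.minimum(k, n - k) * dx                 # periodic distance to point 0
    c = sigma**2 * np.exp(-d**2 / (2 * ell**2))   # first row of circulant C
    lam = np.maximum(np.fft.fft(c).real, 0.0)     # eigenvalues of C (clip tiny negatives)
    xi = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    z = np.fft.fft(np.sqrt(lam) * xi) / np.sqrt(n)
    return z.real                                 # Re(z) has covariance C
```

Both the real and the imaginary parts of z are valid samples, so two independent draws come at the price of one FFT.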

Numerical results
We perform one cycle for the experiments. The window length is set to 10 days, for which the nonlinearity is already significant (Fisher et al., 2011, Fig. 2). Figure 6 shows the objective function values along the iterations of the incremental 4DVAR method. The objective function oscillates with the iteration number; therefore, the incremental 4DVAR method without regularization diverges. This divergence is due to the highly nonlinear behavior of the model over a long window (10 days). In such a case, as explained in Sect. 4, convergence to a stationary point can be recovered by controlling the step, which is done here by introducing an additional regularization term. In order to see the effect of this regularization, we performed EnKS-4DVAR with different values of the regularization parameter γ. Figure 7 shows the objective function values along the iterations for eight different choices of γ. RMSE values along the iterations for the same experiments performed with 4DVAR and EnKS-4DVAR are presented in Table 3.
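The step-control mechanism can be written schematically in generic notation (H, d, and R below stand for the linearized observation operator, the innovation, and the observation error covariance of the subproblem; this is a sketch, not the paper's exact formulation). The regularized linearized subproblem

\[
\min_{\delta x} \; \|H\,\delta x - d\|_{R^{-1}}^{2} + \gamma\,\|\delta x\|^{2}
\]

is the same least-squares problem as the original one augmented with the pseudo-observation

\[
0 = \delta x + \eta, \qquad \eta \sim N\!\left(0, \gamma^{-1} I\right),
\]

so the Levenberg-Marquardt regularization can be absorbed by the EnKS as one additional observation of the increment.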
It can be seen from Fig. 7 that when γ = 0, the iterations diverge as expected, since no regularization is used and the linearized subproblem is only approximated by ensembles. For small values of γ (e.g., γ ≤ 10^{-1}), the objective function is not monotonically decreasing; hence, the iterations still diverge even though regularization is used. Therefore, small values of γ cannot guarantee convergence. For large values of γ (e.g., γ ≥ 10), we observe a decrease in the objective function along the iterations. Moreover, the fastest decrease in the objective function is obtained for γ = 10. Looking at the RMSE values in Table 3, we see that increasing γ beyond an optimal value results in higher RMSE values, and the reduction in RMSE becomes very slow. In any case, the RMSE values oscillate along the iterations. We note that all RMSE values are lower than the initial RMSE value.
In conclusion, when regularization is used, the choice of the regularization parameter γ is crucial to ensure convergence. For small values of γ, the method can still diverge, while for large values of γ, the objective function decreases, but slowly (and many iterations may be needed to attain a predefined decrease). On the other hand, small values of γ result in small RMSE values, with oscillation along the iterations, whereas the RMSE decreases slowly for larger values of γ. Therefore, the regularization parameter should be neither "very small" nor "very large". An adaptive γ over the iterations can be a better compromise, which will be explored in future studies.
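One conventional way to adapt γ is the trust-region-style Levenberg-Marquardt update; the sketch below is a generic textbook rule (the 0.25/0.75 thresholds and the grow/shrink factors are conventional choices, not values from this study):

```python
def lm_gamma_update(gamma, rho, shrink=0.5, grow=2.0):
    """One generic Levenberg-Marquardt update of the regularization
    parameter, driven by rho = (actual decrease) / (predicted decrease)
    of the objective for the trial step.
    Returns (new gamma, whether to accept the step)."""
    if rho < 0.25:                       # poor agreement: regularize more
        return gamma * grow, rho > 0.0   # accept only if some actual decrease
    if rho > 0.75:                       # very good agreement: regularize less
        return gamma * shrink, True
    return gamma, True                   # in between: keep gamma, accept
```

With such a rule, γ grows automatically when the linearization is poor (preventing the divergence seen for small γ) and shrinks when steps are reliable (avoiding the slow decrease seen for large γ).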

Conclusions
We have proposed a stochastic solver for the incremental weak-constraint 4DVAR method. The regularization term added to the Gauss-Newton method, resulting in a globally convergent Levenberg-Marquardt method, maintains the structure of the linearized least-squares subproblem, enabling us to use an ensemble Kalman smoother as a linear solver while simultaneously controlling the convergence. We have formulated the EnKS-4DVAR method and have shown that it is capable of handling strongly nonlinear problems. We have demonstrated that the randomness of the EnKS version used (with perturbed data) eventually limits the convergence to a minimum, but a sufficiently large decrease in the objective function can be achieved for successful data assimilation. On the other hand, we suspect that the randomization may help to increase the supply of search directions over the iterations, as opposed to deterministic methods locked into one low-dimensional subspace, such as the span of a given ensemble.
We have numerically illustrated the new method on the Lorenz 63 model and the two-level quasi-geostrophic model. We have analyzed the impact of the finite differences parameter τ used to approximate the derivatives of the model and observation operators. We have shown that for τ = 1, the iterates obtained from EnKS-4DVAR are equivalent to those obtained from the standard EnKS. Based on computational experiments, it may be better to start with the EnKS (i.e., τ = 1) and then decrease τ in further iterations.
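The role of τ can be made concrete with a generic directional finite-difference sketch (the helper name fd_tangent, the column convention for the perturbations, and the toy model in the comments are illustrative, not the paper's code):

```python
import numpy as np

def fd_tangent(model, x, dX, tau):
    """Directional finite-difference approximation of the tangent (Jacobian
    action) of a nonlinear `model` at state x, applied to each perturbation
    column of dX:

        M'(x) dx  ~  (M(x + tau * dx) - M(x)) / tau.

    For tau = 1 this is just a difference of nonlinear model runs, which is
    how the standard EnKS treats ensemble perturbations; decreasing tau
    tightens the linearization. Names and shapes are illustrative."""
    Mx = model(x)
    return np.column_stack([(model(x + tau * dx) - Mx) / tau for dx in dX.T])
```

Each column requires one extra model run, and the runs are mutually independent, so the evaluation parallelizes naturally over the ensemble members.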
We have demonstrated the long-term stability of the method on the Lorenz 63 model and shown that it achieves a lower RMSE than the standard EnKF for a highly nonlinear problem. This, however, required some parameter tuning, in particular of the data error variance.
For the second part of the experiments, we have shown the performance of the EnKS-4DVAR method with regularization on the two-level quasi-geostrophic problem, one of the widely used models in theoretical atmospheric studies, since it is simple enough for numerical calculations yet adequately captures an important aspect of large-scale dynamics in the atmosphere. We have observed that the incremental 4DVAR method does not converge for a long assimilation window, and that regularization is necessary to guarantee convergence. We have concluded that the choice of the regularization parameter is crucial to ensure convergence, and different choices of this parameter change the rate of decrease of the objective function. In summary, an adaptive regularization parameter can be a better compromise to achieve an approximate solution in a reasonable number of iterations.
The choice of the parameters used in our approach is of crucial importance for the computational cost of the algorithm, for instance, for the number of iterations needed to obtain a desired reduction. A more detailed exploration of the best strategies to adapt these parameters over the course of the iterations will be studied elsewhere.
The base method used in the computational experiments here employs the sample covariance. However, nothing a priori prevents the use of more sophisticated variants of the EnKS, with localization and covariance inflation, or of square root filters instead of the EnKS with data perturbation, as is done in related methods in the literature. These issues, as well as the performance on larger and more realistic problems, will be studied elsewhere.

Figure 2. Root square error given by Eq. (33) for the first five Gauss-Newton iterations from the Lorenz 63 problem. The initial conditions for the truth are x(0) = 1, y(0) = 1, and z(0) = 1. The cycle length is dt = 0.1 time unit. The observations are the full state at each time step. The ensemble size is N = 100. The assimilation window length is L = 50 cycles. The finite differences parameter is τ = 10^{-3}.

Figure 3. Box plots of objective function values for the Lorenz 63 problem. From left to right and top to bottom, the panels correspond to the results of the first, second, third, and fourth iterations, respectively. The whole state is observed. The ensemble size is 50. The assimilation window is 50 cycles. In each box, the central line marks the median (red line), the edges are the 25th and 75th percentiles (blue lines), the whiskers extend to the most extreme data points that the plotting algorithm does not consider outliers (black lines), and the outliers are plotted individually (red dots).

Figure 4. Same as Fig. 3, but for the fifth, sixth, seventh, and eighth iterations, respectively.

Figure 5. Comparison of the RMSE between EnKF and EnKS-4DVAR from the twin experiment for the Lorenz 63 model. EnKS-4DVAR performs better for larger time intervals between the observations, as the model becomes more nonlinear. See Sect. 5.1.3 for further details.

Table 1. The root mean square error given by Eq. (34) for the first six Gauss-Newton iterations, for the Lorenz 63 problem. The whole state is observed. The ensemble size is 100. The assimilation window length is 50 cycles. The finite differences parameter is 10^{-3}.

Table 2. Mean of the objective function from 30 runs of the EnKS-4DVAR algorithm for the Lorenz 63 problem and for different values of τ (the finite differences parameter). The whole state is observed. The ensemble size is 50. The assimilation window length is 50 cycles.

Table 3. RMSE values calculated by Eq. (34) along the incremental 4DVAR and EnKS-4DVAR iterations for different values of the regularization parameter γ, for the two-level quasi-geostrophic model (Sect. 5.2.2).

No localization is used in the experiments; as a result, the ensemble size is chosen to be large enough, N = 30 000. Therefore, this test is only a partial assessment; localization and cycling in the QG model are beyond the scope of this paper. For the finite difference approximation, the parameter τ is set to 10^{-4} for all experiments. We have performed experiments for both incremental 4DVAR and EnKS-4DVAR. The incremental 4DVAR method used conjugate gradients to solve the linearized problem, with exact tangent and adjoint models in each iteration and no ensembles involved. The numerical results are presented as follows.