Implicit particle filtering for models with partial noise, and an application to geomagnetic data assimilation

Implicit particle filtering is a sequential Monte Carlo method for data assimilation, designed to keep the number of particles manageable by focussing attention on regions of large probability. These regions are found by minimizing, for each particle, a scalar function F of the state variables. Some previous implementations of the implicit filter rely on finding the Hessians of these functions. The calculation of the Hessians can be cumbersome if the state dimension is large or if the underlying physics are such that derivatives of F are difficult to calculate. This is the case in many geophysical applications, in particular for models with partial noise, i.e. with a singular state covariance matrix. Examples of models with partial noise include stochastic partial differential equations driven by spatially smooth noise processes and models for which uncertain dynamic equations are supplemented by conservation laws with zero uncertainty. We make the implicit particle filter applicable to such situations by combining gradient descent minimization with random maps and show that the filter is efficient, accurate and reliable because it operates in a subspace whose dimension is smaller than the state dimension. As an example, we assimilate data for a system of nonlinear partial differential equations that appears in models of geomagnetism.


Introduction
The task in data assimilation is to use available data to update the forecast of a numerical model. The numerical model is typically given by a discretization of a stochastic differential equation (SDE),

x^{n+1} = R(x^n, t_n) + G(x^n, t_n) ∆W^{n+1},   (1)

where x is an m-dimensional vector, called the state, t_n, n = 0, 1, 2, …, is a sequence of times, R is an m-dimensional vector function, G is an m × m matrix and ∆W^{n+1} is an m-dimensional vector whose elements are independent standard normal variates. The random vectors G(x^n, t_n) ∆W^{n+1} represent the uncertainty in the system; however, even for G = 0 the state x^n may be random for any n, because the initial state x^0 can be random. The data

z^l = h(x^{q(l)}, t_{q(l)}) + Q(x^{q(l)}, t_{q(l)}) V^l,   (2)

are collected at times t_{q(l)}, l = 1, 2, …; for simplicity, we assume that the data are collected at a subset of the model steps, i.e. q(l) = rl, with r ≥ 1 being a constant. In the above equation, z is a k-dimensional vector (k ≤ m), h is a k-dimensional vector function, V is a k-dimensional vector whose components are independent standard normal variates, and Q is a k × k matrix. Throughout this paper, we write x^{0:n} for the sequence of vectors x^0, …, x^n. Data assimilation is necessary in many areas of science and engineering and is essential in geophysics, for example in oceanography, meteorology, geomagnetism and atmospheric chemistry (see e.g. the reviews [5,18,20,27,28,39]). What makes the assimilation of data in geophysical applications difficult is the complicated underlying physics, which lead to a large state dimension m and a nonlinear function R in equation (1).
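For concreteness, the model step (1) and the observation equation (2) can be sketched in a few lines of code. The function names `model_step` and `observe`, and the toy linear model used below, are our own illustrative choices, not part of the paper.

```python
import numpy as np

def model_step(x, t, R, G, rng):
    """One step of the discrete model (1): x^{n+1} = R(x^n, t_n) + G(x^n, t_n) dW."""
    return R(x, t) + G(x, t) @ rng.normal(size=x.shape)

def observe(x, t, h, Q, rng):
    """One observation (2): z = h(x, t) + Q V, with V standard normal."""
    return h(x, t) + Q @ rng.normal(size=Q.shape[0])

# illustrative linear model: R(x, t) = 0.5 x with small additive noise,
# observing only the first state component
rng = np.random.default_rng(0)
R = lambda x, t: 0.5 * x
G = lambda x, t: 0.01 * np.eye(3)
x1 = model_step(np.ones(3), 0.0, R, G, rng)
z1 = observe(x1, 1.0, lambda x, t: x[:1], 0.1 * np.eye(1), rng)
```

Setting G to zero recovers a deterministic ("perfect") model, a special case discussed later in the paper.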
If the model (1) as well as h in (2) are linear and if, in addition, the initial state x^0 is Gaussian, then the probability density function (pdf) of the state x^n is Gaussian for any n and can be characterized in full by its mean and covariance. The Kalman filter (KF) sequentially computes the mean of the model (1), conditioned on the observations, and thus provides the best linear unbiased estimate of the state [23]. The ensemble Kalman filter (EnKF) is a Monte Carlo approximation of the Kalman filter and can be obtained by replacing the state covariance matrix by the sample covariance matrix in the Kalman formalism. The state covariance is the covariance matrix of the pdf of the current state conditioned on the previous state, which we calculate from the model (1) to be

p(x^{n+1} | x^n) ∼ N(R(x^n, t_n), G(x^n, t_n) G(x^n, t_n)^T),

where N(µ, Σ) denotes a Gaussian with mean µ and covariance matrix Σ.
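To make the EnKF update concrete, here is a minimal sketch of a perturbed-observation EnKF analysis step for a linear observation operator. The function name and the toy numbers are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

def enkf_analysis(ensemble, z, H, R_obs, rng):
    """One perturbed-observation EnKF analysis step.
    ensemble is (M, m); H is (k, m); R_obs is the (k, k) observation covariance."""
    M, m = ensemble.shape
    A = ensemble - ensemble.mean(axis=0)          # ensemble anomalies
    P = A.T @ A / (M - 1)                         # sample state covariance
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R_obs)   # Kalman gain
    # each member is updated with its own perturbed copy of the observation
    zs = z + rng.multivariate_normal(np.zeros(len(z)), R_obs, size=M)
    return ensemble + (zs - ensemble @ H.T) @ K.T

rng = np.random.default_rng(0)
ens = rng.normal(size=(500, 2)) + np.array([1.0, -1.0])   # prior ensemble
H = np.array([[1.0, 0.0]])                                # observe first component
R_obs = np.array([[0.01]])
updated = enkf_analysis(ens, np.array([2.0]), H, R_obs, rng)
```

With an accurate observation (variance 0.01) of the first component, the updated ensemble mean of that component is pulled close to the observed value 2.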
To streamline the notation, we write

Σ^n_x = G(x^n, t_n) G(x^n, t_n)^T   (4)

for the state covariance, where T denotes a transpose. In the EnKF, the sample covariance matrix is computed from an "ensemble," obtained by running the model (1) for different realizations of the noise process ∆W. The Monte Carlo approach avoids the computationally expensive step of updating the state covariance in the Kalman formalism. Both KF and EnKF have extensions to nonlinear, non-Gaussian models; however, these rely on linearity and Gaussianity approximations [22]. Variational methods [3,11,12,36-38,42] aim at assimilating the observations within a given time window by computing the state trajectory of maximum probability. The trajectory is computed by minimizing a suitable cost function which is, up to a normalization constant, the negative logarithm of the pdf of the state trajectory x^{0:q(l)} given the set of observations z^{1:l}, p(x^{0:q(l)} | z^{1:l}). In particular, 3D-Var methods assume a static model [36]. Strong constraint 4D-Var determines an optimal initial state given a "perfect" dynamic model, i.e. G = 0, and a Gaussian initial uncertainty, i.e. x^0 ∼ N(µ_0, Σ_0) [11,12,36,37]. Uncertain models with G ≠ 0 are tackled with a weak constraint 4D-Var approach [3,38,42]. Many implementations of variational methods compute the gradient of the cost function from tangent linear adjoint equations and rely on linear approximations.
The pdf's needed for sequential data assimilation satisfy the recursion

p(x^{0:q(l+1)} | z^{1:l+1}) ∝ p(x^{0:q(l)} | z^{1:l}) p(z^{l+1} | x^{q(l+1)}) p(x^{q(l)+1:q(l+1)} | x^{q(l)}).   (5)

In the above equation, p(x^{0:q(l+1)} | z^{1:l+1}) is the pdf of the state trajectory up to time t_{q(l+1)} given all available observations up to time t_{q(l+1)} and is called the target density; p(z^{l+1} | x^{q(l+1)}) is the probability density of the current observation given the current state and can be obtained from (2):

p(z^{l+1} | x^{q(l+1)}) ∼ N(h(x^{q(l+1)}, t_{q(l+1)}), Σ^{q(l+1)}_z),  with Σ^n_z = Q(x^n, t_n) Q(x^n, t_n)^T.
The pdf p(x^{q(l)+1:q(l+1)} | x^{q(l)}) is the density of the state trajectory from the previous assimilation step to the current observation, conditioned on the state at the previous assimilation step, and is determined by the model (1). A standard version of the sampling-importance-resampling (SIR) particle filter (also called the bootstrap filter, see e.g. [14]) generates, at each step, samples from p(x^{q(l)+1:q(l+1)} | x^{q(l)}) (the prior density) by running the model. These samples (particles) are weighted by the observations with weights w ∝ p(z^{l+1} | x^{q(l+1)}), to yield a posterior density that approximates the target density p(x^{0:q(l+1)} | z^{1:l+1}). One then removes particles with a small weight by "resampling" (see e.g. [1] for resampling algorithms) and repeats the procedure when the next observation becomes available. This SIR filter is straightforward to implement; the catch is that many particles have small weights, because the particles are generated without using information from the data. If many particles have a small weight, the approximation of the target density is poor, and the number of particles required for a good approximation can grow catastrophically with the dimension of the state [4,34]. Various methods, e.g. different prior densities and weighting schemes, have been invented to ameliorate this problem (see e.g. [14,39-41]).
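The SIR step just described can be sketched as follows. The helper names (`propagate`, `loglik`), the toy random-walk model, and the resampling threshold of half the ensemble size are illustrative assumptions, not prescriptions from the paper.

```python
import numpy as np

def sir_step(particles, weights, propagate, loglik, rng):
    """One SIR step: propagate with the model, reweight by the observation
    likelihood, and resample if the effective sample size is low."""
    particles = propagate(particles)
    logw = np.log(weights) + loglik(particles)
    logw -= logw.max()                            # avoid underflow
    w = np.exp(logw)
    w /= w.sum()
    if 1.0 / np.sum(w ** 2) < 0.5 * len(w):       # effective sample size too small
        idx = rng.choice(len(w), size=len(w), p=w)   # multinomial resampling
        particles, w = particles[idx], np.full(len(w), 1.0 / len(w))
    return particles, w

# toy scalar model: random walk, observed directly with noise variance 0.5
rng = np.random.default_rng(1)
x = rng.normal(size=5000)                         # particles at time n
w = np.full(5000, 1.0 / 5000)
z = 2.0                                           # observation at time n+1
x, w = sir_step(x, w,
                lambda p: p + 0.5 * rng.normal(size=p.shape),
                lambda p: -0.5 * (z - p) ** 2 / 0.5,
                rng)
est = np.average(x, weights=w)
```

Note that the data enter only through the weights; the particle positions themselves are generated blindly by the model, which is exactly the weakness the implicit filter addresses.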
The basic idea of implicit particle filters [7,8,31] is to use the available observations to find regions of high probability in the target density and to look for samples within these regions. This implicit sampling strategy generates a thin particle beam within the high-probability domain and, thus, keeps the number of particles required manageable, even if the state dimension is large. The focussing of particles is achieved by setting up an underdetermined algebraic equation that depends on the model (1) as well as on the data (2), and whose solution generates a high-probability sample of the target density. We review the implicit filter in the next section, and it will become evident that the construction assumes that the state covariance Σ^n_x in (4) is nonsingular. This condition is often not satisfied. If, for example, one wants to assimilate data into a stochastic partial differential equation (SPDE) driven by spatially smooth noise, then the continuous-time noise process can be represented by a series with rapidly decaying coefficients, leading to a singular or ill-conditioned state covariance Σ^n_x in discrete time and space (see Sections 3.1 and 4, as well as [10,21,26]). A second important class of models with partial noise are uncertain dynamic equations supplemented by conservation laws (e.g. conservation of mass) with zero uncertainty. Such models often appear in data assimilation for fluid dynamics problems [25].
The purpose of the present paper is two-fold. First, in Section 2, we present a new implementation of the implicit particle filter. Most previous implementations of the implicit filter [7,31] rely in one way or another on finding the Hessians of scalar functions in rm variables. For systems with very large state vectors and considerable gaps between observations, memory constraints may forbid a computation of these Hessians. Our new implementation combines gradient descent minimization with random maps [31] to avoid the calculation of Hessians, and thus reduces the memory requirements.
The second objective is to consider models with a singular or ill-conditioned state covariance Σ n x where previous implementations of the implicit filter, as described in [7,8,31], are not applicable. In Section 3, we make the implicit filter applicable to models with partial noise and show that our approach is then particularly efficient, because the filter operates in a space whose dimension is determined by the rank of Σ n x , rather than by the model dimension. We compare the new implicit filter to SIR, EnKF and variational methods, in particular with respect to how information is propagated from observed variables to unobserved ones.
In Section 4, we illustrate the theory with an application in geomagnetism and consider two coupled nonlinear SPDE's with partial noise. We observe that the implicit filter gives good results with very few (4-10) particles, while EnKF and SIR require hundreds to thousands of particles for similar accuracy.

Implicit sampling with random maps
We first follow [31] closely to review implicit sampling with random maps. Suppose we are given a collection of M particles X^{q(l)}_j, j = 1, 2, …, M, whose empirical distribution approximates the target density at time t_{q(l)}, where q(l) = rl, and suppose that an observation z^{l+1} is available after r steps at time t_{q(l+1)} = t_{r(l+1)}. From (5) we find, by repeatedly using Bayes' theorem, that, for each particle,

p(X^{q(l)+1:q(l+1)}_j | X^{0:q(l)}_j, z^{1:l+1}) ∝ p(z^{l+1} | X^{q(l+1)}_j) p(X^{q(l)+1:q(l+1)}_j | X^{q(l)}_j).

Implicit sampling is a recipe for computing high-probability samples from the above pdf. To draw a sample we define, for each particle, a function F_j by

exp(−F_j(X_j)) = p(X_j | X^{q(l)}_j, z^{l+1}),   (9)

where X_j is shorthand for the state trajectory X^{q(l)+1:q(l+1)}_j. Specifically, we have

F_j(X_j) = (1/2) Σ_{n=q(l)}^{q(l+1)−1} (X^{n+1}_j − R^n_j)^T (Σ^n_x)^{−1} (X^{n+1}_j − R^n_j) + (1/2) (z^{l+1} − h(X^{q(l+1)}_j, t_{q(l+1)}))^T Σ_z^{−1} (z^{l+1} − h(X^{q(l+1)}_j, t_{q(l+1)})) + log Z_j,   (10)

where R^n_j is shorthand notation for R(X^n_j, t_n) and where Z_j is a positive number that can be computed from the normalization constants of the various pdf's in the definition of F_j in (9). With this F_j, we solve the algebraic equation

F_j(X_j) − φ_j = (1/2) ξ^T_j ξ_j,   (11)

where ξ_j is a realization of the rm-dimensional reference variable ξ ∼ N(0, I), and where φ_j = min F_j.
The choice of a Gaussian reference variable does not imply linearity or Gaussianity assumptions, and other choices are certainly possible. We find solutions of (11) by using the random map

X_j = µ_j + λ_j L_j η_j,   (13)

where λ_j is a scalar, µ_j is an rm-dimensional column vector which represents the location of the minimum of F_j, i.e. µ_j = argmin F_j, L_j is a deterministic rm × rm matrix we can choose, and η_j = ξ_j / √(ξ^T_j ξ_j) is uniformly distributed on the unit rm-sphere. Upon substitution of (13) into (11), we can find a solution of (11) by solving a single algebraic equation in the variable λ_j. The weight of the particle can be shown to be (see [31], Section 3)

w_j ∝ exp(−φ_j) |det L_j| λ^{rm−1}_j ρ^{(2−rm)/2}_j |∂λ_j/∂ρ_j|,   (14)

where ρ_j = ξ^T_j ξ_j and det L_j denotes the determinant of the matrix L_j (see [31] for details of the calculation). An expression for the scalar derivative ∂λ_j/∂ρ_j can be obtained by implicit differentiation of (11):

∂λ_j/∂ρ_j = 1 / (2 ∇F_j L_j η_j),   (15)

where ∇F_j denotes the gradient of F_j (an rm-dimensional row vector). The weights are normalized so that their sum equals one. The weighted positions X_j of the particles approximate the target pdf. We compute the mean of the X_j with weights w_j as the state estimate and then proceed to assimilate the next observation. The method just described makes use of only one set of observations per assimilation step; however, an extension to multiple observation sets per assimilation step (smoothing) is straightforward.
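The heart of the construction, solving (11) along the random ray (13) with L_j = I, can be sketched for a single particle as follows. The toy convex function F below stands in for F_j, and bisection replaces the Newton iteration used later in the paper; both are illustrative choices of ours.

```python
import numpy as np

def implicit_sample(F, mu, phi, xi):
    """Solve F(mu + lam * eta) - phi = 0.5 * rho for the scalar lam,
    with rho = xi^T xi and eta = xi / sqrt(rho), by bisection (L_j = I)."""
    rho = xi @ xi
    eta = xi / np.sqrt(rho)
    g = lambda lam: F(mu + lam * eta) - phi - 0.5 * rho
    # bracket the root: F grows away from its minimum, so g(0) = -rho/2 < 0
    hi = 1.0
    while g(hi) < 0.0:
        hi *= 2.0
    lo = 0.0
    for _ in range(60):                           # bisection to machine precision
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < 0.0 else (lo, mid)
    return mu + 0.5 * (lo + hi) * eta

# toy F: convex but non-Gaussian, minimum phi = 0 at the origin
F = lambda x: 0.5 * x @ x + 0.25 * (x @ x) ** 2
xi = np.random.default_rng(2).normal(size=3)
X = implicit_sample(F, np.zeros(3), 0.0, xi)
```

By construction the sample satisfies F(X) − φ = ξ^T ξ / 2, i.e. it inherits the probability of the Gaussian reference sample and therefore lands in the high-probability region of exp(−F).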
2.1 Implementation of an implicit particle filter with gradient descent minimization and random maps

An algorithm for data assimilation with implicit sampling and random maps was presented in [31]. This algorithm relies on the calculation of the Hessians of the F_j's, which are used for minimizing the F_j's with Newton's method and for setting up the random map. The calculation of the Hessians, however, may not be easy in some applications, because of a very large state dimension or because the second derivatives are hard to calculate, as is the case for models with partial noise (see Section 3). To avoid the calculation of Hessians, we propose to use a gradient descent algorithm with line search to minimize the F_j's (see e.g. [32]), along with simple random maps. Of course, other minimization techniques, in particular quasi-Newton methods (see e.g. [16,32]), can also be applied here and may speed up the minimization. However, we decided to use gradient descent with line search to keep the minimization as simple as possible. For simplicity, we assume that G and Q in (1)-(2) are constant matrices and calculate the gradient of F_j from (10):

∇F_j = (∂F_j/∂X^{q(l)+1}, …, ∂F_j/∂X^{q(l+1)}),

with

∂F_j/∂X^k = (X^k − R^{k−1})^T Σ_x^{−1} − (X^{k+1} − R^k)^T Σ_x^{−1} (∂R^k/∂X^k)   (17)

for k = q(l)+1, q(l)+2, …, q(l+1)−1, where R^n is shorthand for R(X^n, t_n), and where

∂F_j/∂X^{q(l+1)} = (X^{q(l+1)} − R^{q(l+1)−1})^T Σ_x^{−1} − (z^{l+1} − h(X^{q(l+1)}, t_{q(l+1)}))^T Σ_z^{−1} (∂h/∂x)|_{X^{q(l+1)}}.   (18)

Here, we dropped the index j for the particles for notational convenience. We initialize the minimization using the result of a simplified implicit particle filter (see the next subsection). Once the minimum is obtained, we substitute the random map (13) with L_j = I, where I is the identity matrix, into (11) and solve the resulting scalar equation by Newton's method. The scalar derivative we need for the Newton steps is computed numerically. We initialize this iteration with λ_j = 0. Finally, we compute the weights according to (14). If some weights are small, as indicated by a small effective sample size [1], we resample using Algorithm 2 in [1].
The implicit filtering algorithm with gradient descent minimization and random maps is summarized in pseudocode in Algorithm 1.

Algorithm 1 Implicit Particle Filter with Random Maps and Gradient Descent Minimization
for each particle j = 1, …, M do
• Set up and minimize F_j using gradient descent to compute φ_j and µ_j
• Sample reference density ξ_j ∼ N(0, I)
• Compute ρ_j = ξ^T_j ξ_j and η_j = ξ_j / √ρ_j
• Solve (11) using the random map (13) with L_j = I
• Compute weight of the particle using (14)
• Save particle X_j and weight w_j
end for
• Normalize the weights so that their sum equals 1
• Compute state estimate from X_j weighted with w_j (e.g. the mean)

This implicit filtering algorithm shares with weak constraint 4D-Var that a "cost function" (here F_j) is minimized by gradient descent. However, most 4D-Var implementations use tangent linear adjoint equations to compute the gradient; in Algorithm 1, we do a fully nonlinear calculation of the gradient. Two further differences between 4D-Var and Algorithm 1 are: (i) 4D-Var does not update the state sequentially, but the implicit particle filter does and, thus, reduces memory requirements; (ii) 4D-Var computes the most likely state by minimizing the cost function, and this estimate can be biased; the implicit particle filter approximates the target density and, thus, can compute other statistics as state estimates, in particular the conditional expectation, which is, under wide conditions, the optimal state estimate (see e.g. [9]).
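The minimization step of Algorithm 1 needs only F_j and its gradient. A minimal gradient descent with a backtracking (Armijo) line search, applied here to a toy quadratic standing in for F_j, might look like this; the function names and constants are our own illustrative choices.

```python
import numpy as np

def minimize_gd(F, gradF, x0, tol=1e-8, max_iter=10_000):
    """Gradient descent with a backtracking (Armijo) line search.
    Returns the minimum value phi and its location mu."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = gradF(x)
        if np.linalg.norm(g) < tol:
            break
        step, fx = 1.0, F(x)
        # shrink the step until the sufficient-decrease condition holds
        while F(x - step * g) > fx - 1e-4 * step * (g @ g):
            step *= 0.5
        x = x - step * g
    return F(x), x

# toy stand-in for F_j: quadratic with minimum 0 at (1, 2)
F = lambda x: 0.5 * ((x[0] - 1.0) ** 2 + 2.0 * (x[1] - 2.0) ** 2)
gradF = lambda x: np.array([x[0] - 1.0, 2.0 * (x[1] - 2.0)])
phi, mu = minimize_gd(F, gradF, np.zeros(2))
```

Only first derivatives appear, which is the point: no Hessian storage or second-derivative code is needed, in contrast to the Newton-based implementations of [31].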

A simplified implicit particle filtering algorithm with random maps and gradient descent minimization
We wish to simplify the implicit particle filtering algorithm by reducing the dimension of the function F_j. The idea is to do an implicit sampling step only at times t_{q(l+1)}, i.e. when an observation becomes available. The state trajectory of each particle from time t_{q(l)} (the last time an observation became available) to t_{q(l+1)−1} is generated using the model equations (1). This approach reduces the dimension of F_j from rm to m (the state dimension). The simplification is thus very attractive if the number of steps between observations, r, is large. However, difficulties can also be expected for large r: the state trajectories up to time t_{q(l+1)−1} are generated by the model alone and, thus, may not have a high probability with respect to the observations at time t_{q(l+1)}. The focussing effect of implicit sampling can be expected to be weaker, and the number of particles required may grow as the gap between observations becomes larger. Whether or not the simplification described here reduces the computational cost is problem dependent, and we illustrate advantages and disadvantages in the examples in Section 4. Suppose we are given a collection of M particles X^{q(l)}_j, j = 1, 2, …, M, whose empirical distribution approximates the target density at time t_{q(l)}, and that the next observation, z^{l+1}, is available after r steps at time t_{q(l+1)}. For each particle, we run the model for r−1 steps to obtain X^{q(l)+1}_j, …, X^{q(l+1)−1}_j. We then define, for each particle, a function F_j by

F_j(X^{q(l+1)}_j) = (1/2) (X^{q(l+1)}_j − R^{q(l+1)−1}_j)^T Σ_x^{−1} (X^{q(l+1)}_j − R^{q(l+1)−1}_j) + (1/2) (z^{l+1} − h(X^{q(l+1)}_j, t_{q(l+1)}))^T Σ_z^{−1} (z^{l+1} − h(X^{q(l+1)}_j, t_{q(l+1)})) + log Z_j,

whose gradient is given by (18). The algorithm then proceeds as Algorithm 1 in the previous section: we find the minimum of F_j using gradient descent and solve (11) with the random map (13) with L_j = I. The weights are calculated by (14) with r = 1, and the mean of the X_j weighted by w_j is the state estimate at time t_{q(l+1)}.
This simplified implicit filter simplifies further if the observation function is linear, i.e. h(x) = Hx, where H is a k × m matrix. One can show [31] that the minimum of F_j is

φ_j = (1/2) (z^{l+1} − H R_j)^T (H Σ_x H^T + Σ_z)^{−1} (z^{l+1} − H R_j),

with R_j = R(X^{q(l+1)−1}_j, t_{q(l+1)−1}), so that a numerical approximation of the minimum is not required. The location of the minimum is

µ_j = Σ_j (Σ_x^{−1} R_j + H^T Σ_z^{−1} z^{l+1}),

with

Σ_j = (Σ_x^{−1} + H^T Σ_z^{−1} H)^{−1}.

Moreover, if L_j is a Cholesky factor of Σ_j, then X_j = µ_j + L_j ξ_j solves (11), and the weights simplify to

w_j ∝ exp(−φ_j).

For the special case of a linear observation function and observations available at every model step (r = 1), the simplified implicit filter is the full implicit filter and reduces to a version of optimal importance sampling [1,5,7,31].
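For the linear-observation case, the closed-form step above can be sketched directly; these are the standard optimal-proposal (Kalman-type) identities, and the function name and toy numbers are our own illustrative choices.

```python
import numpy as np

def implicit_step_linear_h(Rj, Sx, H, Sz, z, rng):
    """Closed-form implicit sampling for one particle with linear h:
    prior N(Rj, Sx), observation z = H x + N(0, Sz)."""
    Sj = np.linalg.inv(np.linalg.inv(Sx) + H.T @ np.linalg.inv(Sz) @ H)
    mu = Sj @ (np.linalg.inv(Sx) @ Rj + H.T @ np.linalg.inv(Sz) @ z)
    v = z - H @ Rj                                # innovation
    phi = 0.5 * v @ np.linalg.inv(H @ Sx @ H.T + Sz) @ v
    L = np.linalg.cholesky(Sj)
    X = mu + L @ rng.normal(size=len(mu))         # sample of the optimal proposal
    return X, np.exp(-phi)                        # particle and unnormalized weight

rng = np.random.default_rng(3)
X, w = implicit_step_linear_h(np.zeros(2), np.eye(2),
                              np.array([[1.0, 0.0]]), np.eye(1),
                              np.array([2.0]), rng)
```

No iterative minimization appears anywhere: φ_j, µ_j and Σ_j are all available in closed form, and the weight depends on the particle only through φ_j.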
3 Implicit particle filtering for equations with partial noise

We consider the case of a singular state covariance matrix Σ_x in the context of implicit particle filtering. We start with an example taken from [21] to demonstrate how a singular state covariance appears naturally in the context of SPDE's driven by spatially smooth noise. The example serves as a motivation for more general developments in later sections. Another class of models with partial noise consists of dynamical equations supplemented by conservation laws. The dynamics are often uncertain and thus driven by noise processes; however, there is typically zero uncertainty in the conservation laws (e.g. conservation of mass), so that the full model (dynamics and conservation laws) is subject to partial noise [25].
Consider the SPDE

∂u/∂t = ∆u + Γ(u) + ∂W_t/∂t,   (26)

where Γ is a continuous function and W_t is a cylindrical Brownian motion (BM) [21]. The derivative ∂W_t/∂t in (26) is formal only (it does not exist in the usual sense). Equation (26) is supplemented by homogeneous Dirichlet boundary conditions and the initial value u(x, 0) = u_o(x). We expand the cylindrical BM W_t in the eigenfunctions e_k of the Laplace operator,

W_t = Σ_{k=1}^∞ √(q_k) e_k β^k_t,   (27)

where the β^k_t denote independent BM's and where the coefficients q_k ≥ 0 must be chosen so that, for some γ ∈ (0, 1), a weighted sum (28) of the q_k in the eigenvalues λ_k of the Laplace operator remains finite [21]. If the coefficients q_k decay fast enough, then, by (27) and basic properties of Fourier series, the noise is smooth in space and, in addition, the sum (28) remains finite, as is required. For example, one may be interested in problems where q_k = 0 for all k > c, for some c > 0. The continuous equation must be discretized for computations, and here we consider the Galerkin projection of the SPDE onto the m-dimensional space spanned by the first m eigenfunctions e_k of the Laplace operator. After multiplying with the basis functions and integrating over the spatial domain, we are left with a set of m stochastic ordinary differential equations,

dx = f(x, t) dt + G dW,   (31)

where x is an m-dimensional state vector, f is a nonlinear vector function and W is a BM. In particular, we calculate from (31):

G = diag((√q_1, √q_2, …, √q_m)),   (32)

where diag(a) is a diagonal matrix whose diagonal elements are the components of the vector a. Upon time discretization using, for example, a stochastic version of forward Euler with time step δ [24], we arrive at (1) with

R(x^n, t_n) = x^n + δ f(x^n, t_n),  G(x^n, t_n) = √δ G.

It is now clear that the state covariance matrix Σ_x = GG^T is singular for c < m.
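A small numerical illustration of how a truncated noise expansion produces a singular state covariance; the dimensions and the decaying coefficients below are arbitrary illustrative choices.

```python
import numpy as np

m, c = 16, 5                                  # state dimension, number of driven modes
q = np.array([2.0 ** -k for k in range(1, c + 1)])   # decaying coefficients sqrt(q_k)
G = np.zeros((m, m))
G[:c, :c] = np.diag(q)                        # only the first c modes carry noise
Sx = G @ G.T                                  # state covariance Sigma_x = G G^T
rank = np.linalg.matrix_rank(Sx)              # rank c < m: Sigma_x is singular
```

Because only c of the m modes are forced, Σ_x has rank c and cannot be inverted, which is exactly the obstruction for the F_j's of Section 2.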
A singular state covariance causes no problems for running the discrete-time model (1) forward in time. However, problems do arise if we want to know the pdf of the current state given the previous one. For example, the functions F_j in the implicit particle filter algorithms (either those in Section 2, or those in [7,8,31]) are not defined for singular Σ_x. If c ≥ m, then Σ_x is ill-conditioned, which causes a number of numerical issues in the implementation of these implicit particle filtering algorithms and, ultimately, makes the algorithms fail.
3.2 Implicit particle filtering of models with partial noise, supplemented by densely available data

We start deriving the implicit filter for models with partial noise by considering the special case in which observations are available at every model step (r = 1). For simplicity, we assume that the noise is additive, i.e. G(x^n, t_n) = G = constant, and that Q in (2) is a constant matrix. Under these assumptions, we can use a linear coordinate transformation to diagonalize the state covariance matrix and rewrite the model (1) and the observations (2) as

x^{n+1} = f(x^n, y^n, t_n) + Σ̂_x^{1/2} ∆W^{n+1},   (35)
y^{n+1} = g(x^n, y^n, t_n),   (36)
z^l = h(x^{q(l)}, y^{q(l)}, t_{q(l)}) + Q V^l,   (37)

where x is now a p-dimensional column vector, p < m is the rank of the state covariance matrix (4), f is a p-dimensional vector function, Σ̂_x is a non-singular, diagonal p × p matrix, y is an (m − p)-dimensional vector, and g is an (m − p)-dimensional vector function. For ease of notation, we drop the hat above the "new" state covariance matrix Σ̂_x in (35) and, for convenience, we refer to the sets of variables x and y as the "forced" and "unforced" variables, respectively.
The key to filtering this system is the observation that the unforced variables at time t_{n+1}, given the state at time t_n, are not random. To be sure, y^n is random for any n, due to the nonlinear coupling g(x, y), but the conditional pdf p(y^{n+1} | x^n, y^n) is a delta-distribution. For a given (not random) initial state x^0, y^0, the target density is

p(x^{0:n+1}, y^{0:n+1} | z^{1:n+1}) ∝ p(x^{0:n}, y^{0:n} | z^{1:n}) p(z^{n+1} | x^{n+1}, y^{n+1}) p(x^{n+1} | x^n, y^n),   (38)

with y^{n+1} = g(x^n, y^n, t_n). Suppose we are given a collection of M particles X^n_j, Y^n_j, j = 1, 2, …, M, whose empirical distribution approximates the target density p(x^{0:n}, y^{0:n} | z^{1:n}) at time t_n. The pdf for each particle at time t_{n+1} is thus given by (38) with the substitution of X_j for x and Y_j for y. In agreement with the definition of F_j in previous implementations of the implicit filter, we define F_j here by

exp(−F_j(X^{n+1}_j)) = p(z^{n+1} | X^{n+1}_j, Y^{n+1}_j) p(X^{n+1}_j | X^n_j, Y^n_j).   (39)

More specifically,

F_j(X^{n+1}_j) = (1/2) (X^{n+1}_j − f^n_j)^T Σ_x^{−1} (X^{n+1}_j − f^n_j) + (1/2) (z^{n+1} − h(X^{n+1}_j, Y^{n+1}_j))^T Σ_z^{−1} (z^{n+1} − h(X^{n+1}_j, Y^{n+1}_j)) + log Z_j,   (40)

where f^n_j is shorthand notation for f(X^n_j, Y^n_j, t_n). With this F_j, we can use Algorithm 1 to construct the implicit filter. For this algorithm we need the gradient of F_j:

∇F_j = (X^{n+1}_j − f^n_j)^T Σ_x^{−1} − (z^{n+1} − h(X^{n+1}_j, Y^{n+1}_j))^T Σ_z^{−1} (∂h/∂x)|_{X^{n+1}_j}.   (41)

Note that Y^{n+1}_j is fixed for each particle if its previous state (X^n_j, Y^n_j) is known, so that the filter only updates X^{n+1}_j when the observation z^{n+1} becomes available. The unforced variables of the particles, Y^{n+1}_j, are moved forward in time using the model, as they should be, since there is no uncertainty in y^{n+1} given x^n, y^n. The data are used in the state estimation of y indirectly, through the weights and through the nonlinear coupling between the forced and unforced variables of the model. If one observes only the unforced variables, i.e. h(x, y) = h(y), then the data are not used directly when generating the forced variables X^{n+1}_j, because the second term in (40) is merely a constant. In this case, the implicit filter becomes equivalent to a standard SIR filter, with weights w^{n+1}_j = w^n_j exp(−φ_j).
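One assimilation step of this filter can be sketched for a linear observation of the forced variables, so that the minimization has the closed form of Section 2.2; the function and the toy model below are illustrative assumptions of ours.

```python
import numpy as np

def partial_noise_step(Xp, Yp, f, g, Sx, H, Sz, z, rng):
    """One step (r = 1) for a model with partial noise and a linear
    observation z = H x + noise of the forced variables only:
    sample the forced variables x implicitly, move the unforced
    variables y forward deterministically with the model."""
    fp = f(Xp, Yp)                                # prior mean of x^{n+1}
    Si = np.linalg.inv(np.linalg.inv(Sx) + H.T @ np.linalg.inv(Sz) @ H)
    mu = Si @ (np.linalg.inv(Sx) @ fp + H.T @ np.linalg.inv(Sz) @ z)
    X = mu + np.linalg.cholesky(Si) @ rng.normal(size=len(mu))
    v = z - H @ fp                                # innovation
    phi = 0.5 * v @ np.linalg.inv(H @ Sx @ H.T + Sz) @ v
    Y = g(Xp, Yp)                                 # delta-distributed: no noise added
    return X, Y, np.exp(-phi)

# toy model with one forced and one unforced variable
rng = np.random.default_rng(4)
f = lambda x, y: 0.9 * x + 0.1 * y
g = lambda x, y: x + 0.0 * y
X, Y, w = partial_noise_step(np.array([1.0]), np.array([0.0]), f, g,
                             np.array([[1.0]]), np.array([[1.0]]),
                             np.array([[1.0]]), np.array([1.9]), rng)
```

The filter works entirely in the p-dimensional space of the forced variables; the unforced variables are simply carried along by the model, as the text prescribes.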
This implementation of the implicit filter is numerically effective for filtering systems with partial noise, because the filter operates in a space of dimension p (the rank of the state covariance matrix), which typically is less than the state dimension (see the example in Section 4). The use of a gradient descent algorithm and random maps further makes the often costly computation of the Hessian of F j unnecessary. If h is linear no iterative minimization is required.
If the state covariance matrix is ill-conditioned, a direct implementation of Algorithm 1 is not possible. We propose to diagonalize the state covariance and set all eigenvalues below a certain threshold to zero, so that a model of the form (35)-(37) is obtained. In our experience, such approximations are accurate, and the filter of this section can then be used.

3.3 Implicit particle filtering for models with partial noise, supplemented by sparsely available data

We extend the results of Section 3.2 to the more general case of observations that are sparse in time. Again, the key is to realize that y^{n+1} is fixed given x^n, y^n. For simplicity, we assume additive noise and a constant Q in (2), so that the target density becomes

p(x^{0:q(l+1)}, y^{0:q(l+1)} | z^{1:l+1}) ∝ p(x^{0:q(l)}, y^{0:q(l)} | z^{1:l}) p(z^{l+1} | x^{q(l+1)}, y^{q(l+1)}) ∏_{n=q(l)}^{q(l+1)−1} p(x^{n+1} | x^n, y^n).   (42)

Given a collection of M particles X^n_j, Y^n_j, j = 1, 2, …, M, whose empirical distribution approximates the target density p(x^{0:q(l)}, y^{0:q(l)} | z^{1:l}) at time t_{q(l)}, we define, for each particle, the function F_j by

exp(−F_j(X_j)) = p(z^{l+1} | X^{q(l+1)}_j, Y^{q(l+1)}_j) ∏_{n=q(l)}^{q(l+1)−1} p(X^{n+1}_j | X^n_j, Y^n_j),

where X_j is shorthand for X^{q(l)+1:q(l+1)}_j, so that

F_j(X_j) = (1/2) Σ_{n=q(l)}^{q(l+1)−1} (X^{n+1}_j − f^n_j)^T Σ_x^{−1} (X^{n+1}_j − f^n_j) + (1/2) (z^{l+1} − h(X^{q(l+1)}_j, Y^{q(l+1)}_j))^T Σ_z^{−1} (z^{l+1} − h(X^{q(l+1)}_j, Y^{q(l+1)}_j)) + log Z_j.   (43)

At each model step, the unforced variables of each particle depend on the forced and unforced variables of the particle at the previous time step, so that the unforced variables Y^{q(l)+1:q(l+1)}_j are functions of the forced variables X_j. The function F_j thus depends on the forced variables only. However, the appearances of the unforced variables in F_j make its derivatives rather difficult to compute. The implicit filter with gradient descent minimization and random maps (see Algorithm 1) is thus a good filter for this problem, because it only requires computation of the first derivatives of F_j, while other implementations (see [7,31]) require second derivatives as well.
The gradient of F_j is the rp-dimensional row vector with components, for k = q(l)+1, …, q(l+1)−1,

∂F_j/∂X^k_j = (X^k_j − f^{k−1}_j)^T Σ_x^{−1} − (X^{k+1}_j − f^k_j)^T Σ_x^{−1} (∂f/∂x)|_k − Σ_{i=k+1}^{q(l+1)−1} (X^{i+1}_j − f^i_j)^T Σ_x^{−1} (∂f/∂y)|_i (∂y^i/∂X^k_j) − (z^{l+1} − h^{q(l+1)}_j)^T Σ_z^{−1} (∂h/∂y)|_{q(l+1)} (∂y^{q(l+1)}/∂X^k_j),   (44)

and, for k = q(l+1),

∂F_j/∂X^{q(l+1)}_j = (X^{q(l+1)}_j − f^{q(l+1)−1}_j)^T Σ_x^{−1} − (z^{l+1} − h^{q(l+1)}_j)^T Σ_z^{−1} (∂h/∂x)|_{q(l+1)},   (45)

where (·)|_k denotes "evaluate at time t_k" and h^{q(l+1)}_j is shorthand for h(X^{q(l+1)}_j, Y^{q(l+1)}_j). The derivatives ∂y^i/∂X^k_j, i = k+1, …, q(l+1), can be computed recursively while constructing the sum, starting with

∂y^{k+1}/∂X^k_j = (∂g/∂x)|_k,   (46)

and then using

∂y^{i+1}/∂X^k_j = (∂g/∂y)|_i (∂y^i/∂X^k_j).   (47)

The minimization of F_j for each particle is initialized with a free model run for r steps, with initial conditions given by the final position of the j-th particle at the previous assimilation step. With this initial guess, we compute the gradient using (44)-(47) and, after a line search and one step of gradient descent, obtain a new set of forced variables. We use this result to update the unforced variables by the model and proceed to the next iteration. Once the minimum φ_j and its location µ_j are found, we use the random map (13) with L_j = I to compute X^{q(l)+1:q(l+1)}_j for this particle, and then use these forced variables to compute Y^{q(l)+1:q(l+1)}_j. We do this for all particles, compute the weights from (14) with m = p, and normalize the weights so that their sum equals one, thereby obtaining an approximation of the target density. We resample if the effective sample size M_eff is below a threshold and move on to assimilate the next observation. The implicit filtering algorithm is summarized in pseudocode in Algorithm 2.

Algorithm 2 Implicit Particle Filter with Random Maps and Gradient Descent Minimization for Models with Partial Noise
for each particle j = 1, …, M do
• Set up and minimize F_j using gradient descent:
    Initialize minimization with a free model run
    while convergence criteria not satisfied do
        Compute gradient by (44)
        Do a line search
        Compute next iterate by gradient descent step
        Use results to update unforced variables using the model
        Check if convergence criteria are satisfied
    end while
• Sample reference density ξ_j ∼ N(0, I)
• Compute ρ_j = ξ^T_j ξ_j and η_j = ξ_j / √ρ_j
• Solve (11) using random map (13) with L_j = I to compute X_j
• Use this X_j and the model to compute the corresponding Y_j
• Compute weight of the particle using (14)
• Save particle (X_j, Y_j) and weight w_j
end for
• Normalize the weights so that their sum equals 1
• Compute state estimate from X_j weighted with w_j (e.g. the mean)

Note that all state variables are computed by using both the data and the model, regardless of which set of variables (the forced or the unforced ones) is observed. The reason is that sparse observations induce a nonlinear coupling, through f and g in (35)-(37), between the unforced and forced variables at the various model steps. It should also be noted that the function F_j is a function of rp variables (rather than rm), because the filter operates in the subspace of the forced variables. If the minimization is computationally too expensive, because p or r is extremely large, then one can easily adapt the "simplified" implicit particle filter of Section 2.2 to the situation of partial noise using the methods we have described above. If h is nonlinear, this simplified filter requires a minimization of a p-dimensional function for each particle. If h is linear, no numerical minimization is required.

Discussion
We wish to point out similarities and differences between the implicit filter and three other data assimilation methods. In particular, we discuss how data are used in the generation of the state estimates. It is clear that the implicit filter uses the available data as well as the model to generate the state trajectories for each particle, i.e. it makes use of the nonlinear coupling between forced and unforced variables.
SIR and EnKF make less direct use of the data. In SIR, the particle trajectories are generated using the model alone and only later weighted by the observations. Data thus propagate to the SIR state estimates indirectly through the weights. In EnKF, the state trajectories are generated using the model and only the states at times t q(l) (when data are available) are updated by data. Thus, EnKF uses the data only to update its state estimates at times for which data are actually available.
A weak constraint 4D-Var method is perhaps closest in spirit to the implicit filter. In weak constraint 4D-Var, a cost function similar to F_j is minimized (typically by gradient descent) to find the state trajectory with maximum probability given data and model. If one picks the time window for a 4D-Var assimilation from one observation z^l to the next, z^{l+1}, then the use of the data is similar to the use of the data in an implicit filter, because, in both algorithms, the model as well as the data are used to generate the state trajectories. In fact, one can view the implicit particle filter as a randomized and sequential version of weak constraint 4D-Var (or, one may interpret weak constraint 4D-Var as an implicit smoother with a single particle). These issues will be taken up in more detail in a subsequent paper [2].
Finally, we would like to discuss the implicit filtering algorithm for the special case of a perfect model, i.e. G = 0, so that the model reduces to

y^{n+1} = g(y^n, t_n).
Following the steps above, and assuming we are given a collection of M particles Y^n_j, j = 1, 2, …, M, whose empirical distribution approximates the target density p(y^{0:q(l)} | z^{1:l}) at time t_{q(l)}, we define, for each particle, the function F_j by

exp(−F_j) = p(z^{l+1} | Y^{q(l+1)}_j),

so that

F_j = (1/2) (z^{l+1} − h(Y^{q(l+1)}_j))^T Σ_z^{−1} (z^{l+1} − h(Y^{q(l+1)}_j)) + log Z_j.

Since Y^{q(l)}_j is fixed for each particle, F_j is merely a constant that is used to weigh the particle trajectory by the weight

w^{l+1}_j ∝ w^l_j exp(−F_j).

The data are used only indirectly here, because the initial condition determines the full state trajectory, and this initial condition is fixed for each particle. For a perfect model, strong constraint 4D-Var makes more efficient use of the available data by using them to find an "optimal initial condition" compatible with the data.
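For the perfect-model case, the whole filter therefore reduces to weighting free model runs by their data misfit. A minimal sketch, with a deterministic toy model of our own choosing:

```python
import numpy as np

def perfect_model_filter(inits, run_model, H, Sz, z):
    """Particle filtering for a perfect model (G = 0): each particle is a
    free model run from an initial condition; only the weights use the
    data, through the misfit of the run's endpoint with the observation."""
    particles, logw = [], []
    for y0 in inits:
        Y = run_model(y0)                         # deterministic trajectory endpoint
        v = z - H @ Y
        logw.append(-0.5 * v @ np.linalg.inv(Sz) @ v)   # -F_j up to a constant
        particles.append(Y)
    logw = np.array(logw)
    w = np.exp(logw - logw.max())                 # avoid underflow
    return np.array(particles), w / w.sum()

# toy deterministic model: the endpoint is twice the initial condition
inits = np.linspace(0.5, 1.5, 101)[:, None]
parts, w = perfect_model_filter(inits, lambda y0: 2.0 * y0,
                                np.array([[1.0]]), np.array([[0.01]]),
                                np.array([2.0]))
est = (parts[:, 0] * w).sum()
```

Because the trajectories themselves never see the data, initial conditions far from the observation receive negligible weight, which is the inefficiency that strong constraint 4D-Var avoids by optimizing the initial condition directly.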

Application to Geomagnetism
Data assimilation has recently been applied to geomagnetic problems and there is a need to find out which data assimilation technique is most suitable [18]. Thus far, a strong constraint 4D-Var approach [17] and a Kalman filter approach [35] have been considered. Here, we apply the implicit particle filter to a test problem very similar to the one first introduced by Fournier and his colleagues in [17]. The model is given by two SPDE's where g u , g b are scalars and W is a stochastic process (the derivative here is formal and may not exist in the usual sense). We study the above equations with ν = 10 −3 as in [17], and with g u = 0.01, g b = 1, so that the uncertainty in the observed quantity b is much larger than the uncertainty in the unobserved quantity u. We consider the above equations on the strip 0 ≤ t ≤ T , −1 ≤ x ≤ 1, with prescribed boundary conditions. Physically, u represents the velocity field and b represents the secular variation of the magnetic field. The model is essentially the model proposed in [17], but with additive noise, where w 1 k , w 2 k are independent Brownian motions (BMs). Here, we are content with this simple noise model, which represents a small uncertainty at the boundaries of both fields and is spatially smooth. However, it is straightforward to incorporate more information about the spatial distribution of the uncertainty by picking suitable coefficients α k , β k . An illustration of the noise process is given in Figure 1.

Discretization of the dynamical equations
We follow [17] in the discretization of the dynamical equations; however, we present some details of the discretization here to explain how the noise process W comes into play.
For both fields, we use Legendre spectral elements of order N (see e.g. [6,13]), so that where ψ j are the characteristic Lagrange polynomials of order N , centered at the jth Gauss-Lobatto-Legendre (GLL) node ξ j . We substitute the expansions into the weak form of (53) and (54) (no integration by parts) and evaluate the integrals by GLL quadrature with the corresponding weights w j . Making use of the orthogonality of the basis functions, ψ j (ξ k ) = δ j,k , we obtain a set of SDE's, where • denotes the Hadamard product ((û • b̂) k = û k b̂ k ) and û, b̂, Ŵ u , Ŵ b are m = (N − 2)-dimensional column vectors whose components are the coefficients in the series expansions of u, b, W u and W b , respectively. The quantities Ψ B x = diag (∂ x ψ j (ξ 1 ), . . . , ∂ x ψ j (ξ N −1 )) and Ψ B xx = (∂ xx ψ 2 (ξ 1 ), . . . , ∂ xx ψ N −1 (ξ N −1 )) T are a diagonal m × m matrix and an m-dimensional column vector, respectively, which ensure that our approximation satisfies the boundary conditions. In the above equations, the m × m matrices M , D and D 2 are mass and differentiation matrices assembled from the quadrature weights and the derivatives of the basis functions at the GLL nodes.

We apply a first-order implicit-explicit method with time step δ for the time discretization and obtain the discrete-time and discrete-space equations with F s = (sin(π), sin(2π), . . . , sin(mπ))(ξ 1 , ξ 2 , . . . , ξ m ) T and F c = (cos(π/2), cos(3π/2), . . . , cos(mπ/2))(ξ 1 , ξ 2 , . . . , ξ m ) T (64). For our choice of α k , β k in (58), the state covariance matrices Σ u and Σ b are singular if m > 10. To diagonalize the state covariances, we solve the symmetric eigenvalue problems [33] and define linear coordinate transformations in which the columns of the m × m matrices V u and V b are the eigenvectors of Σ u and Σ b , respectively.
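The GLL nodes and quadrature weights used in this discretization can be computed numerically. The following NumPy sketch is illustrative and not tied to the implementation used here; it finds the interior nodes as the roots of P N ′ and uses the standard weight formula w j = 2/(N (N + 1) P N (ξ j ) 2 ).

```python
import numpy as np
from numpy.polynomial import legendre as L

def gll_nodes_weights(N):
    """Gauss-Lobatto-Legendre nodes and weights on [-1, 1] for order N
    (N + 1 nodes, exact for polynomials up to degree 2N - 1)."""
    # Legendre-series coefficients of P_N
    c = np.zeros(N + 1)
    c[N] = 1.0
    # interior nodes: roots of the derivative P_N'
    interior = np.sort(L.legroots(L.legder(c)))
    nodes = np.concatenate(([-1.0], interior, [1.0]))
    PN = L.legval(nodes, c)
    weights = 2.0 / (N * (N + 1) * PN ** 2)
    return nodes, weights
```

The clustering of these nodes near ±1, noted later in connection with the CFL condition and the boundary layers, is visible directly in the output of this routine.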
The discretization using Legendre spectral elements works in our favor here because the matrices M and D 2 are symmetric, so that we can diagonalize the left-hand side simultaneously with the state covariance matrix to obtain equations in which f u , f b are 10-dimensional vector functions and g u , g b are (m − 10)-dimensional vector functions.

We test the convergence of our approximation as follows. To assess the convergence in the number of grid-points in space, we define a reference solution using N = 2000 grid-points and a time step of δ = 0.002. We compute another approximation of the solution, using the same (discrete) BM as in the reference solution, but with a different number of grid-points, say N = 500. We compute the error at t = T = 0.2, e x = || u 500 (x, T ) − u 2000 (x, T ) || / || u 2000 (x, T ) ||, where || · || denotes the Euclidean norm, and store it. We repeat this procedure 500 times and compute the mean of the error norms. The results are shown in the left panel of Figure 2. We observe super-algebraic convergence, as expected from a spectral method.
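The role of the singular state covariance can be illustrated with a small numerical sketch (all sizes and names here are hypothetical stand-ins): a covariance built from 10 smooth spatial modes has rank 10, and its symmetric eigendecomposition yields coordinates in which only 10 components are driven by noise.

```python
import numpy as np

# Hypothetical illustration: a smooth noise process expanded in r = 10 spatial
# modes yields a rank-deficient m x m state covariance matrix.
m, r = 40, 10
rng = np.random.default_rng(0)
modes = rng.standard_normal((m, r))   # stand-in for the alpha_k, beta_k modes
sigma = modes @ modes.T               # rank-r covariance, singular since r < m

lam, V = np.linalg.eigh(sigma)        # eigenvalues ascending, V orthogonal
rank = int(np.sum(lam > 1e-10 * lam.max()))  # number of "forced" directions
# In the rotated coordinates y = V^T x, only `rank` components carry noise,
# so the filter needs to operate only in that low-dimensional subspace.
```

This is the mechanism that later lets the implicit filter work in a 20-dimensional subspace (10 modes for each field) rather than in the full 600-dimensional state space.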
Similarly, we check the convergence of the approximation in the time step by computing a reference solution with N Ref = 1000 and δ Ref = 2 −12 . Using the same BM as in the reference solution, we compute an approximation with time step δ, compute the error at t = T = 0.2 in the same relative Euclidean norm as before, and store it. We repeat this procedure 500 times and then compute the mean of these error norms. The results are shown in the right panel of Figure 2. We observe a first-order decay in the error, as expected. The scheme has converged for time steps smaller than δ = 0.002, so that a higher resolution in time does not improve the accuracy of the approximation. Moreover, the Courant-Friedrichs-Lewy condition limits the time step for a given number of nodes. The limit here is quite strict because the Legendre elements accumulate grid-points close to the boundaries, so that the smallest spacing between grid-points is very small even for a moderate number of nodes. Here we are satisfied with an approximation with δ = 0.002 and N = 300 grid-points in space, as in [17]. The relatively small number of spatial grid-points is sufficient because the noise is very smooth in space and because the Legendre spectral elements accumulate nodes close to the boundaries and, thus, represent the steep boundary layer characteristic of (53)-(54) well, even if N is small (see also [17]).
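The convergence test described above, i.e. fixing the Brownian increments, coarsening the resolution and comparing at t = T, can be sketched on a scalar surrogate SDE. The equation dx = −x dt + dW used below is purely illustrative and stands in for the discretized model; with additive noise, the Euler-Maruyama scheme is strongly first-order, matching the observed decay.

```python
import numpy as np

def euler_maruyama(x0, dW, dt):
    """Euler-Maruyama for dx = -x dt + dW (additive noise)."""
    x = x0
    for inc in dW:
        x = x + (-x) * dt + inc
    return x

rng = np.random.default_rng(2)
T, n_ref = 1.0, 2 ** 12
dt_ref = T / n_ref
dW_ref = np.sqrt(dt_ref) * rng.standard_normal(n_ref)  # reference BM increments
x_ref = euler_maruyama(1.0, dW_ref, dt_ref)

errors = []
for coarsen in (2, 4, 8):
    # aggregate the SAME Brownian increments onto the coarser time grid
    dW = dW_ref.reshape(-1, coarsen).sum(axis=1)
    errors.append(abs(euler_maruyama(1.0, dW, coarsen * dt_ref) - x_ref))
# the error grows roughly linearly with the coarsened time step
```

Averaging such errors over many realizations, as done in the text, produces the first-order decay curve of Figure 2.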

Filtering results
We apply the implicit particle filter with gradient descent minimization and random maps (see Algorithm 2 in Section 3), the simplified implicit particle filter (see Section 2.2) adapted to models with partial noise, a standard EnKF (without localization or inflation), as well as a standard SIR filter to the test problem (53)-(54). The numerical model is given by the discretization described in the previous section with a random initial state. The distribution of the initial state is Gaussian with mean u(x, 0), b(x, 0) as in (55)-(56) and with covariances Σ u , Σ b given by (60)-(61). In Figure 3, we illustrate the uncertainty in the initial state and plot 10 realizations of the initial state (grey lines) along with its mean (black lines). We observe that the uncertainty in u 0 is small compared to the uncertainty in b 0 .
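Drawing such random initial states from a Gaussian with a singular covariance can be done in the eigenbasis, sampling only along the directions with nonzero eigenvalues. The following sketch (function name and sizes are hypothetical) illustrates the idea.

```python
import numpy as np

def sample_singular_gaussian(mean, sigma, n_samples, rng):
    """Draw samples from N(mean, sigma) when sigma may be singular.

    Works in the eigenbasis: directions with zero eigenvalue receive
    no noise, so the samples stay in the range of sigma."""
    lam, V = np.linalg.eigh(sigma)
    lam = np.clip(lam, 0.0, None)  # discard tiny negative round-off
    noise = rng.standard_normal((n_samples, lam.size)) * np.sqrt(lam)
    return mean + noise @ V.T
```

This is the natural companion of the coordinate transformations defined by V u and V b : the same eigendecomposition serves both sampling and filtering.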
The data are the values of the magnetic field b, measured at k equally spaced locations and perturbed by noise with standard deviation s = 0.001, where H is a k × m matrix that maps the numerical approximation of b (defined at the GLL nodes) to the locations where data are collected. We consider data that are dense in time (r = 1) as well as sparse in time (r > 1). The data are sparse in space and we consider two cases: (i ) we collect the magnetic field at 200 equally spaced locations; and (ii ) we collect b at 20 equally spaced locations. The variables u are unobserved and it is of interest to study how the various data assimilation techniques make use of the information in b to update the unobserved variables u [17,18].
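A sketch of such an observation operator follows; the construction is hypothetical (plain linear interpolation stands in for a spectral interpolant, and the node distribution is only a stand-in for the GLL nodes), but it shows how H maps the field at the m nodes to k equally spaced measurement sites and how synthetic data z = H b + s V are generated.

```python
import numpy as np

def observation_matrix(xi, k):
    """k x m matrix interpolating a field known at nodes `xi` (ascending,
    on [-1, 1]) to k equally spaced measurement locations, by linear
    interpolation between neighboring nodes."""
    xs = np.linspace(-1.0, 1.0, k)
    H = np.zeros((k, xi.size))
    for i, x in enumerate(xs):
        j = min(max(np.searchsorted(xi, x), 1), xi.size - 1)
        t = (x - xi[j - 1]) / (xi[j] - xi[j - 1])
        H[i, j - 1], H[i, j] = 1.0 - t, t  # hat-function weights
    return H

# nodes clustered near the boundaries (illustrative stand-in for GLL nodes)
xi = np.cos(np.linspace(np.pi, 0, 300))
H = observation_matrix(xi, 20)
s = 0.001
rng = np.random.default_rng(1)
b = np.sin(np.pi * xi)                      # some field at the nodes
z = H @ b + s * rng.standard_normal(20)     # noisy observations of b
```

Each row of H sums to one, so constant fields are observed exactly; the spectral interpolant used in practice shares this property.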
To assess the performance of the filters, we ran 100 twin experiments. A twin experiment amounts to: (i ) drawing a sample from the initial state and running the model forward in time until T = 0.2 (one fifth of a magnetic diffusion time [17]); (ii ) collecting the data from this free model run; and (iii ) using the data as the input to a filter and reconstructing the state trajectory. Figure 4 shows the result of one twin experiment for r = 4.
For each twin experiment, we calculate and store the errors at T = 0.2 in the velocity, e u = || u(x, T ) − u F ilter (x, T ) || / || u(x, T ) ||, and in the magnetic field, e b = || b(x, T ) − b F ilter (x, T ) || / || b(x, T ) ||. After running the 100 twin experiments, we calculate the mean of the error norms (not the mean error) and the variance of the error norms (not the variance of the error). All filters we tested were "untuned," i.e. we did not adjust or insert any free parameters to boost the performance of the filters. Figure 5 shows the results for the implicit particle filter, the EnKF and the SIR filter for 200 measurement locations and for r = 10. It is evident from this figure that the implicit particle filter requires only very few particles to yield accurate state estimates, with less than 1% error in the observed variables. The SIR filter with 1000 particles gives significantly larger errors (about 10% in the observed variables) and much larger variances in the errors. The EnKF requires about 500 particles to come close to the accuracy of the implicit filter with only 10 particles.
In the experiments, we observed that the minimization in implicit particle filtering typically converged after 4-10 steps (depending on r, the gap in time between observations). The convergence criterion was to stop the iteration when the change in F j was less than 10%. A more accurate minimization did not improve the results significantly, so that we were satisfied with a relatively crude estimate of the minimum in exchange for a speed-up of the algorithm. We found λ by solving (11) with Newton's method, using λ 0 = 0 as the initial guess, and observed that it converged after about eight steps. The convergence criterion was to stop the iteration when |F (λ) − φ − ρ| ≤ 10 −3 , because an accurate solution of this scalar equation is numerically inexpensive. We resampled using Algorithm 2 in [1] whenever the effective sample size M Ef f in (19) was less than 90% of the number of particles.
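The resampling criterion just described can be sketched as follows, with the usual definition of the effective sample size, M Ef f = 1/Σ w j 2 for normalized weights w j (function names here are illustrative).

```python
import numpy as np

def effective_sample_size(weights):
    """M_eff = 1 / sum(w_j^2) for normalized particle weights."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

def needs_resampling(weights, threshold=0.9):
    """Resample when M_eff drops below `threshold` * M (90% in the text)."""
    return effective_sample_size(weights) < threshold * len(weights)
```

Uniform weights give M Ef f = M (no resampling), while a single dominant weight drives M Ef f toward 1 and triggers resampling.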
To further investigate the performance of the filters, we run more numerical experiments and vary the availability of the data in time, as well as  the number of particles. Figure 6 shows the results for the implicit particle filter, the simplified implicit particle filter, the EnKF and the SIR filter for 200 measurement locations and for r = 1, 2, 4, 10.
We observe from Figure 6, that the error statistics of the implicit particle filter have converged, so that there is no significant improvement when we increase the number of particles to more than 10. In fact, the numerical experiments suggest that no more than 4 particles are required here. Independent of the gap between the observations in time, we observe an error of less than 1% in the observed variable. The error in the unobserved variable u depends strongly on the gap between observations and, for a large gap, is about 12%.
The reconstructions of the observed variable by the simplified implicit particle filter are rather insensitive to the availability of data in time and, with 20 particles, the simplified filter gives an error in b of less than 1%. The errors in the unobserved quantity depend strongly on the gap between the observations and can be as large as 15%. Here, we need more particles and observe a larger error and a larger sensitivity of the errors to the availability of the data, because the simplified filter makes less direct use of the data than the "full" implicit filter: it generates the state trajectories using the model, and only the final position of each particle is updated by the data. Thus, the error increases as the gap in time between the observations becomes larger. Again, the error statistics have converged and only minor improvements can be expected if the number of particles is increased beyond 20.
The SIR filter also makes less efficient use of the data, so that it requires significantly more particles and we observe larger errors, as well as a stronger dependence of the errors on the availability of data in time, for both the observed and unobserved quantities. With 1000 particles and for a large gap (r = 10), the SIR filter gives mean errors of 10% for the observed quantity and 22% for the unobserved quantity. An increase in the number of particles did not decrease these errors. The EnKF performs well and, for about 500 particles, gives results that are comparable to those of the implicit particle filter. The reason for the large number of particles is, again, the indirect use of the data in EnKF.
The errors in the reconstructions of the various filters are not Gaussian, so that an assessment of the errors based on the first two moments is incomplete. In the two panels on the right of Figure 7, we show histograms of the errors of the implicit filter (10 particles), the simplified implicit filter (20 particles), the EnKF (500 particles) and the SIR filter (1000 particles) for r = 10 model steps between observations. We observe that the errors of the implicit filter, simplified implicit filter and EnKF are centered at around 10% in the unobserved quantity u and about 1% in the observed quantity b, and show a considerably smaller spread than the errors of the SIR filter, which are centered at much larger values (20% in the unobserved quantity u and about 9% in the observed quantity b). A closer look at the distribution of the errors thus confirms the conclusions we drew from an analysis based on the first two moments.

We decrease the spatial resolution of the data to 20 measurement locations and show filtering results from 100 twin experiments in Figure 8. The results are qualitatively similar to those obtained at the higher spatial resolution of 200 data points per observation. For the implicit particle filter, we observe that the errors in the unobserved quantity are insensitive to the spatial resolution of the data, while the errors in the observed quantity are determined by the spatial resolution of the data and are rather insensitive to its temporal resolution. These observations are in line with those reported in connection with a strong constraint 4D-Var algorithm in [17]. All other filters we tried show a dependence of the errors in the observed quantity on the temporal resolution of the data. Again, the reason for the good performance of the implicit particle filter is its direct use of the data.
The two panels on the left of Figure 7 show histograms of the errors of the implicit filter (10 particles), the simplified implicit filter (20 particles), the EnKF (500 particles) and the SIR filter (1000 particles) for r = 10 model steps between observations. The results are qualitatively similar to those we obtained at the higher spatial resolution of the data, and a closer look at the distributions of the errors confirms the conclusions we drew from an analysis of the first two moments.
In summary, we observe that the implicit particle filter yields the lowest errors with a small number of particles in all examples we considered, and performs well and reliably in this application. The SIR and simplified implicit particle filters could not reach the accuracy of the implicit particle filter, even with a very large number of particles. The EnKF requires about 500 particles to come close to the accuracy of the implicit particle filter with only 4 particles. Although the particles of the implicit filter are the most expensive computationally, the small number of particles required for a very high accuracy makes the implicit filter the most efficient filter for this problem. The partial noise works in our favor here, because the dimension of the space in which the implicit filter operates is 20, rather than the state dimension 600.
Finally, we wish to compare our results with those in [17], where a strong constraint 4D-Var algorithm was applied to the deterministic version of the test problem. Fournier and colleagues used "perfect data," i.e. observations not corrupted by noise, and applied a conjugate gradient algorithm to minimize the 4D-Var cost function. The iterative minimization was stopped after 5000 iterations. With 20 observations in space and a gap of r = 5 model steps between observations, an error of about 1.2% in u and 4.7% in b was achieved. With the implicit filter, we can reach a similar accuracy at the same spatial resolution of the data, but with a larger gap of r = 10 model steps between observations. Moreover, the data assimilation problem we solve here is somewhat harder than the strong constraint 4D-Var problem because we allow for model errors. The implicit particle filter also reduces the memory requirements because it operates in the 20-dimensional subspace of the forced variables. Each minimization is thus not as costly as a 600-dimensional strong constraint 4D-Var minimization.

Conclusions
We considered implicit particle filters for data assimilation. Previous implementations of the implicit particle filter rely on finding the Hessians of functions F j of the state variables. Finding these Hessians can be expensive if the state dimension is large and can be cumbersome if the second derivatives of the F j 's are hard to calculate. We presented a new implementation of the implicit filter combining gradient descent minimization with random maps. This new implementation avoids the often costly calculation of the Hessians and, thus, reduces the memory requirements compared to earlier implementations of the filter. We have considered models for which the state covariance matrix is singular or ill-conditioned. This happens often, for example, in geophysical applications in which the noise is smooth in space or if the model includes conservation laws with zero uncertainty. Previous implementations of the implicit filter are not applicable here and we have shown how to use our new implementation in this situation. The implicit filter is found to be more efficient than competing methods because it operates in a space whose dimension is given by the rank of the state covariance matrix rather than the model dimension.
We applied the implicit filter in its new implementation to a test problem in geomagnetic data assimilation. The implicit filter performed well in comparison to other data assimilation methods (SIR, EnKF and 4D-Var) and gave accurate state estimates with a small number of particles and at a low computational cost. We have studied how the various data assimilation techniques use the available data to propagate information from observed to unobserved quantities and found that the implicit particle filter uses the data in a direct way, propagating information to unobserved quantities faster than competing methods. The direct use of the data is the reason for the very small errors in reconstructions of the state.