Interactive comment on “ Improved singular spectrum analysis for time series with missing data ”

the abbreviation are not proper. For example improved SSA ISSA or similarly SSAM. These should be changed as it is not common. The introduction is very poor. They need to inform what are the novelties of the proposed technique and why its work better than the previous approach. The definition and explanation in page 1953, just before section 3 should go to introduction as explained above. In fact, this motivates your work. Of course, it needs to be expended. Page 1954: We use the 30 h window size (L = 120),: This is very important issue. Window length. You did not mention about selection of window length and moreover, the sensitivity of your proposed method to L. The following source might are related to window length selection among many papers on this issue: 1Multivariate Singular Spectrum Analysis: A General View and New Vector Forecasting Approach. International Journal of Energy and Statistics, 01, 5583. 2On the Separability Between Signal and Noise in Singular Spectrum Analysis. C862 NPGD 1, C862–C863, 2015


Introduction
Singular spectrum analysis (SSA) introduced by Broomhead and King (1986) for studying dynamical systems is a powerful toolkit for extracting short, noisy and chaotic signals (Vautard et al., 1992).SSA first transfers a time series into a trajectory matrix, and carries out the principal component analysis to pick out the dominant components of the trajectory matrix.Based on these dominant components, the time series is reconstructed.Therefore the reconstructed time series improves the signal-to-noise ratio and reveals the characteristics of the original time series.SSA has been widely used in geosciences to analyse a variety of time series, such as the stream flow and sea-surface temperature (Robertson and Mechoso, 1998;Kondrashov and Ghil, 2006), the seismic tomography (Oropeza and Sacchi, 2011) and the monthly gravity field (Zotova and Shum, 2010).Schoellhamer (2001) developed a modified SSA for time series with missing data (SSAM), which was successfully applied to analyse the time series of suspended-sediment concentration (SSC) in San Francisco Bay (Schoellhamer, 2002).This SSAM approach does not need to fill missing data.Instead, it computes each principal component (PC) with observed data and a scale factor related to the number of missing data.Shen et al. (2014) developed a new principal component analysis approach for extracting common mode errors from the time series with missing data of a regional station network.The other kind of SSA approach process the time series with missing data by filling the data gaps recursively or iteratively, such as the "Caterpillar" SSA method (Golyandina and Osipov, 2007), the imputation method (Rodrigues and Carvalho, 2013) or the iterative method (Kondrashov and Ghil, 2006).
This paper is motivated by Schoellhamer (2001) and Shen et al. (2014) and develops an improved SSA (ISSA) approach.In our ISSA, the lagged correlation matrix is computed in the same way as by Schoellhamer (2001) -the PCs are directly computed with both the eigenvalues and eigenvectors of the lagged correlation matrix.However, the PCs in Schoellhamer (2001) were calculated with the eigenvec-Published by Copernicus Publications on behalf of the European Geosciences Union & the American Geophysical Union.
tors and a scale factor to compensate for the missing value.Moreover, we do not need to fill in the missing data recursively and iteratively as in Golyandina and Osipov (2007).The rest of this paper is organized as follows: the improvement of SSA for time series with missing data follows in Sect.2, synthetic and real numerical examples are presented in Sects.3 and 4 respectively, and then conclusions are given in Sect. 5.

Improved singular spectrum analysis for time series with missing data
For a stationary time series x i (1 ≤ i ≤ N), we can construct an L × (N − L + 1) trajectory matrix with a window size L.
Its Toeplitz lagged correlation matrix C is formulated by Each element c(j ) is computed by (2) For matrix C, we can compute its eigenvalues λ k and the corresponding eigenvectors v k in descending order of λ k (1 ≤ k ≤ L).Then the ith element of the kth principal component (PC) a k is computed by where v j,k is the j th element of v k .We compute the kth reconstructed components (RCs) of the time series with the kth PC as (Vautard et al., 1992) (4) Since λ k , the variance of the kth RC, is sorted in descending order, the first several RCs contain most of the signals of the time series, while the remaining RCs contain mainly the noises of time series.Thus the original time series is reconstructed with the first several RCs.
The SSAM approach developed by Schoellhamer (2001) computes the elements c(j ) of the lagged correlation matrix by where both x i and x i+j must be observed rather than missed, and N j is the number of the products of x i and x i+j within the sample index i ≤ N − j .Then we compute the eigenvalues and eigenvectors from the lagged correlation matrix C.
The PCs are also calculated with observed data: where L i is the number of observed data within the sample index from i to i + L − 1.The reconstruction procedure of time series from PCs is the same as SSA.The scale factor L/L i is used to compensate for the missing value.
In order to derive the expression of computing PCs for the time series with missing data, Eq. ( 3) is reformulated as where 1 ≤ i ≤ N − L + 1, and S i and S i are the index sets of sampling data and missing data respectively within the in- PCs are available, we can reproduce the missing values.Therefore, the missing values in Eq. ( 7) can be substituted with PCs as Substituting Eq. ( 8) into the second term of the right-hand side of Eq. ( 7) yields Collecting all equations of Eq. ( 9) for k = 1, 2, . . ., L, we have where Nonlin.Processes Geophys., 22, 371-376, 2015 www.nonlin-processes-geophys.net/22/371/2015/ Since G i is a symmetric and rank-deficient matrix with the number of rank deficiency equaling the number of missing data within the interval [x i , x i+L−1 ], the PCs a k,i (k = 1, 2, . . ., L) are solved with Eq. ( 10) based on the following criterion (Shen et al. 2014): min where is diagonal matrix of eigenvalues λ k , which is the covariance matrix of PCs.The solution of Eq. ( 10) is as follows: The symbol "-" denotes the pseudo-inverse of a matrix.
If the non-diagonal elements of G i are all set to zero, Eq. ( 14) can be further simplified as L at the missing data points, the solution of Eq. ( 15) will be reduced to Eq. ( 6).Therefore, the SSAM approach is a special case of our ISSA approach.The first several PCs contain most variance; the element x i+j −1 can be approximately reproduced with the first several PCs in Eq. ( 8).
The main difference of our ISSA approach from the SSAM approach of Schoellhamer (2001) is in calculating the PCs.We produce the PCs from observed data with Eq. ( 14) according to the power spectrum (eigenvalues) and eigenvectors of the PCs, while Schoellhamer (2001) calculates the PCs from observed data with Eq. ( 6) only according to the eigenvectors and uses the scale factor L/L i to compensate the missing value.We have pointed out that this scale factor can be derived from Eq. ( 15), which is the simplified version of our ISSA approach, by supposing the missing data points with the same eigenvector elements.Therefore the performance of our ISSA approach is better than SSAM of Schoellhamer (2001).The only disadvantage of our method is that it will cost more computational effort.

Performance of ISSA with synthetic time series
The same synthetic time series as in Schoellhamer (2001) are used to analyse the performance of ISSA compared to SSAM.The synthetic SSC time series is expressed as where R(t) is a time series of Gaussian white noise with zero mean and unit standard deviation; c s (t) is the periodic signal expressed as The periodic signal oscillates about the mean value 100 mg L −1 including the signals with seasonal frequency ω s = 2π/365 day −1 , spring/neap angular frequency ω sn = 2π/14 day −1 and advection angular frequency ω a = 2π/12.5/24day −1 .The 1 year of synthetic SSC time series c(t), starting at 1 October with 15 min time step, is presented at the bottom of Fig. 1, the corresponding periodic signal c s (t) is shown at the top of Fig. 1.
Although the selection of window length is an important issue for SSA (Hassani et al., 2012;Hassani and Mehmoudvand, 2013), this paper chooses the same window length (L = 120) as that in Schoellhamer (2001) in order to compare the performance of the proposed method with that of Schoellhamer.Using the synthetic time series we compute the lagged correlation matrix and the variances of each mode.The first four modes contain the periodic components, which account for 72.3 % of the total variance; in particular, the first mode contains 50.2 % of the total variance.In order to evaluate the accuracies of reconstructed PCs from the time series with different percentages of missing data, following the approach of Shen et al. (2014), we compute the relative errors of the first four modes derived by ISSA and SSAM with the following expression: where the symbol "T " denotes the transpose of a matrix, p denotes relative error, N is the number of repeated experiments, a i is the reconstructed PCs of the ith experiment from data missing time series, and a 0 denotes the PCs reconstructed from the time series without missing data.We design the experiment of missing data by randomly deleting the data from the synthetic time series.The percentage of deleted data is from 10 to 60 % with an increase of 10 % each time.Then, we reconstruct the first four PCs from the data-deleted synthetic time series using both SSAM and ISSA, and repeat the experiments 50 times.The relative errors of the first four PCs are presented in Fig. 2, from which we clearly see that the accuracies of reconstructed PCs by our ISSA are obviously higher than those by SSAM, especially for the second and fourth PCs.In the case of 60 % missing data, the accuracy improvements are up to 19.64, 41.34, 23.27 and 50.30% for the first four PCs, respectively.We reconstruct the time series ĉ(t) using the first four PC modes and then evaluate the quality of reconstructed series by examining the error ĉ(t) = ĉ(t) − c s (t).For the cases whose missing data are between 10 to 50 % over the whole time series, the reconstructed component of the time series is calculated only when the percentage of missing data in the window size is less than 50 %; while for the cases whose overall missing data already reach 60 %, 60 % missing data is allowed in the window size.In Fig. 3, we demonstrate the root mean squared errors (RMSEs) of each experiment of different percentages of missing data.The RMSE is computed with ĉ(t) as where M is the number of data points involved in the experiment.
As can be seen from the Fig. 3, the RMSEs of ISSA are much smaller than those of SSAM for the same experiment scenarios.In Table 1, we present the mean absolute reconstruction error (MARE) and mean root mean squared error (MRMSE) of 50 experiments with different percentages of missing data.
Obviously, if there are no missing data, the ISSA coincides with SSAM.If the percentage of missing data increases, both MARE and MRMSE will become larger.In Table 1, all the MARE and MRMSE of ISSA are smaller than those of SSAM.When the percentage of missing data reaches 50 %, the MARE and MRMSE are 3.17 and 4.14 mg L −1 for ISSA, and 4.57 and 5.89 mg L −1 for SSAM, respectively.The im-

Performance of ISSA with real time series
The mid-depth SSC time series at San Mateo Bridge is presented in Fig. 4, which contains about 61 % missing data.This time series was reported by Buchanan and Schoellhamer (1999) and Buchanan and Ruhl (2000), and analysed by Schoellhamer (2001) using SSAM.We analyse this time series using our ISSA with the window size of 30 h (L = 120) comparing with SSAM.The first 10 modes represent dominant periodic components as shown in Schoellhamer (2001) which contain 89.1 % of the total variance.Therefore, we reconstruct the time series with the first 10 modes when the missing data in a window size is less than 50 %.The residual time series, e.g. the differences of observed minus reconstructed data, are presented in Fig. 5.The maximum, minimum and mean absolute residuals as well as the SD are presented in Table 2.It is clear that both maximum and minimum residuals are significantly reduced by using ISSA approach.The SD of our ISSA is reduced by 8.6 %.The squared correlation coefficients between the observations and the reconstructed data from ISSA and SSAM are 0.9178 and 0.9046, respectively, which reflect that the reconstructed time series with our ISSA can indeed, to very large extent, specify the real time series.

Conclusions
We have developed the ISSA approach in this paper for processing the incomplete time series by using the principle that a time series can be reproduced using its principal components.We prove that the SSAM developed by Schoellhamer ( 2001) is a special case of our ISSA.The performances of ISSA and SSAM are demonstrated with a synthetic time series, and the results show that the relative errors of the first four principal components by ISSA are sig- nificantly smaller than those by SSAM.As the fraction of missing data increases, the improvement of the relative error becomes greater.When the percentage of missing data reaches 60 %, the improvements of the first four principal components are up to 19.64, 41.34, 23.27

Figure 1 .
Figure 1.Periodic signal c s (t) (top panel) and Synthetic time series (bottom panel).

Table 1 .
Mean absolute reconstruction error and mean root mean squared error of simulated time series with different percentage of missing data (mg L −1 ).

Table 2 .
Maximum, minimum and mean absolute residuals of SSAM and ISSA.

Table 1 .
As the amount of missing data increases, the IMPs of both MARE and MRMSE increase as well.Moreover, when the synthetic time series with the missing data is same as the real SSC time series of Fig.4, the IMPs of MARE and MRMSE are 8.87 and 15.19 %, respectively.

The Supplement related to this article is available online at doi:10.5194/npg-22-371-2015-supplement.
and 50.30%, respectively.Moreover, when the missing data account for 60 %, the MARE and MRMSE derived by ISSA are 3.52 and 4.60 mg L −1 , and by SSAM are 5.37 and 6.96 mg L −1 .The corresponding improvements of ISSA with respect to SSAM are 34.45 and 33.91 %.When the missing data of synthetic time series is the same as the real SSC time series, the improvements of MARE and MRMSE are 8.87 and 15.19 %, respectively.The SD derived from the real SSC time series at San Mateo Bridge by ISSA and SSAM are 12.27 and 13.48 mg L −1 , and the squared correlation coefficients between the observations and the reconstructed data from ISSA and SSAM are 0.9178 and 0.9046, respectively.Therefore, ISSA can indeed, to a great extent, retrieve the informative signals from the original incomplete time series.Author contributions.Y. Shen proposed the improved singular spectrum analysis and F. Peng wrote the FORTRAN program and performed the simulations.Y. Shen, F. Peng and B. Li prepared the paper.