Optimal heavy tail estimation, Part I: Order selection

The tail probability, P, of the distribution of a variable is important for risk analysis of extremes. Many variables in complex geophysical systems show heavy tails, where P decreases with the value, x, of a variable as a power law with characteristic exponent, α. Accurate estimation of α on the basis of data is currently hindered by the problem of the selection of the order, that is, the number of largest x-values to utilize for the estimation. This paper presents a new, widely applicable, data-adaptive order selector, which is based on computer simulations and brute force search. It is the first in a set of papers on optimal heavy tail estimation. The new selector outperforms competitors in a Monte Carlo experiment, where simulated data are generated from stable distributions with AR(1) serial dependence. We calculate error bars for the estimated α by means of simulations. We illustrate the method on an artificial time series. We apply it to an observed, hydrological time series from the River Elbe and find an estimated characteristic exponent of 1.48 ± 0.13. This result indicates a finite mean but infinite variance of the statistical distribution of river runoff.


Introduction
Not all geophysical variables obey a Gaussian (normal) distribution. This is true not only for the central part, but also for the extremal part (tail) of a distribution. Instead of a Gaussian exponential behaviour, we often observe a Pareto tail (power law) with a distribution function, F, of a variable, X,

P = P(X > x) = 1 − F(x) ∝ x^(−α),   (1)

which holds above some threshold, x > u ≥ 0. The characteristic exponent or heavy tail index parameter, α > 0, determines the probability, P, of observing extreme values. Its knowledge is of crucial importance in applied risk analysis, for example, of floods (Jongman et al., 2014). Theoretical explanations of the heavy tail behaviour rest on multiplicative or non-linear interaction of variables in complex geophysical systems. Such derivations exist, for example, for the variables rainfall (Wilson and Toumi, 2005) and air pressure (Sardeshmukh and Sura, 2009). Other complex systems, such as finance (Malevergne and Sornette, 2006) or society (Barabási, 2005; Helbing, 2013), may also exhibit heavy tail phenomena.
Many statistical distributions have heavy tails. A particularly useful class of those are the stable distributions (Nolan, 2003), for which the distribution of the sum has the same shape (up to scale and shift) as each of the independent, identically distributed stable summands. Stable distributions have a heavy tail index between 0 and 2. They include Gaussian (α = 2), Cauchy (α = 1) and Lévy (α = 1/2) distributions. The fact that for other α-values no analytical expression of the distribution exists does not reduce the usefulness for analysing extremes. A more serious point is that heavy tail distributions in general, not only stable distributions, may have infinite statistical moments: of first order (mean) for α < 1 and of second order (variance) for α < 2. Therefore, in analytical practice, research questions arise such as the following. (1) How realistic are infinite moments for the studied geophysical variable? (2) How serious are the consequences of heavy tails and the resulting inflated estimation accuracy of geophysical system parameters? (3) Does the heavy tail law (Eq. 1) hold, not over the full but just a restricted x-range, which may then be compatible with finite moments (Mantegna and Stanley, 1995)?
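Although no closed-form density exists for general α, stable variates can be simulated directly. The following Python sketch (function name is ours) uses the classical Chambers–Mallows–Stuck method for the symmetric (β = 0) case, which contains the Gaussian (α = 2) and Cauchy (α = 1) members:

```python
import numpy as np

def stable_sym(alpha, n, rng):
    """Symmetric (beta = 0) stable random variates, 0 < alpha <= 2,
    via the Chambers-Mallows-Stuck method."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, n)   # uniform angle
    w = rng.exponential(1.0, n)                 # unit exponential
    # at alpha = 1 the second factor equals 1, and this reduces to tan(u)
    # (standard Cauchy); at alpha = 2 it reduces to 2 sin(u) sqrt(w),
    # i.e. a N(0, 2) variate in the standard parameterization
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))
```

This is a sketch of the simulation idea only; the paper's computations use the algorithm by Nolan (1997), which builds on the same construction.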
The accurate statistical estimation of the heavy tail index on the basis of a set of data, {x(i)}_{i=1}^{n}, of size n, is therefore important. An estimator, α̂, which may be called "classic", was devised by Hill (1975). Let x′(i) denote the x-values sorted according to descending size, x′(1) ≥ x′(2) ≥ … ≥ x′(n). We assume a zero sample mean (via mean subtraction). Without loss of generality we consider the right tail (positive values). Let there be K ≥ 2 positive x′-values. The Hill estimator is

α̂_k = [ (1/k) Σ_{i=1}^{k} ln( x′(i) / x′(k + 1) ) ]^(−1),   (2)

where k ≤ K − 1 is denoted as order parameter. If K < 2, then the Hill estimator cannot be applied. The selection of k completes the estimation. Order selection has a decisive influence on α̂. However, it is still unclear how best to achieve this, and order selection has been called the "Achilles' heel" of heavy tail index estimation (Resnick, 2007).
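As a concrete reading of the Hill estimator (Eq. 2), a minimal Python implementation might look as follows (function name and interface are ours):

```python
import numpy as np

def hill_estimator(x, k):
    """Hill (1975) estimate of the heavy tail index alpha (Eq. 2).

    x : data values (assumed mean-subtracted); only the K positive
        values enter, sorted to descending order x'(1) >= ... >= x'(K).
    k : order, the number of largest values used, 1 <= k <= K - 1.
    """
    xs = np.sort(x[x > 0])[::-1]            # positive values, descending
    K = xs.size
    if K < 2 or not (1 <= k <= K - 1):
        raise ValueError("need K >= 2 positive values and 1 <= k <= K - 1")
    # inverse of the mean log-excess over the (k+1)-th largest value
    return 1.0 / np.mean(np.log(xs[:k] / xs[k]))
```

On exact Pareto data the estimator is consistent; for example, with 10 000 draws at α = 1.5 and k = 1000, the estimate lands close to 1.5.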
Order selection constitutes a statistical trade-off problem (Hill, 1975).Large k leads to usage of many data points and a small estimation variance.However, the risk then is that points are included for which Eq. (1) does not hold (i.e.bias).On the other hand, small k leads to a small estimation bias and a large variance.
We assess as minor the fact that the Hill estimator is not translation invariant (to shifts in x) (Resnick, 2007), since there is a natural choice in the form of zero sample mean.In geophysical analyses, such as the estimation of trend parameters on climate time series (Mudelsee, 2014), a by-product is the series of residuals (data minus fit).These values are realizations of the noise process, and they are subjected to various forms of residual tests regarding distributional shape and persistence.By virtue of their construction, the residuals have zero sample mean.
There exist other estimators of α (Resnick, 2007), but here we consider Hill and focus on order selection.We introduce an order selector that is optimal in the sense that, for a given estimation problem, it minimizes a root mean squared error (RMSE) measure for α in an internal (i.e.within the algorithm) simulation loop.In our approach, the search for optimal k is performed in a brute force manner and adaptively for a data set.Hence, the approach is computationally intensive.
Processes in complex geophysical systems may exhibit not only heavy tail behaviour, but also persistence in the time domain. Let t(i) denote a time value and {t(i), x(i)}_{i=1}^{n} a time series, that is, a sample of the dynamics of a system. Many geophysical time series have an uneven time spacing (e.g. proxy series of paleoclimate obtained from natural archives). Therefore we model the persistence in its simplest form as a first-order autoregressive or AR(1) process on an unevenly spaced time grid (Mudelsee, 2014),

X(i) = exp{−[T(i) − T(i − 1)]/τ} · X(i − 1) + (1 − exp{−2[T(i) − T(i − 1)]/τ})^(1/2) · E(i),  i = 2, …, n.   (3)

T(i) is the discrete time variable, assumed to increase strictly monotonically; E is an independent, identically distributed random innovation with zero mean and variance σ²; the parameter τ > 0 is called persistence time. In the case of even spacing (T(i) − T(i − 1) = d(i) = d = const.), Eq. (3) corresponds to the more familiar formulation with an AR(1) parameter a = exp(−d/τ). The heteroscedasticity in Eq. (3) ensures stationarity of the AR(1) process for σ² < ∞. It is thought to be of no harm in case E is heavy tailed with infinite variance. The persistence time can be estimated using a least-squares criterion and numerical techniques. See Mudelsee (2014) for more details.
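A simulation sketch of the AR(1) model (Eq. 3) on an uneven grid, here with Gaussian innovations as a stand-in (the paper's experiments use stable innovations; the function name is ours):

```python
import numpy as np

def ar1_uneven(t, tau, rng, sigma=1.0):
    """Stationary AR(1) process on an uneven time grid (Eq. 3):
    X(i) = exp(-d(i)/tau) X(i-1) + sqrt(1 - exp(-2 d(i)/tau)) E(i),
    with d(i) = t(i) - t(i-1); the heteroscedastic innovation scaling
    keeps Var[X(i)] = sigma^2 on any grid."""
    n = len(t)
    x = np.empty(n)
    x[0] = rng.normal(0.0, sigma)
    for i in range(1, n):
        a = np.exp(-(t[i] - t[i - 1]) / tau)   # effective AR(1) parameter
        x[i] = a * x[i - 1] + rng.normal(0.0, sigma * np.sqrt(1.0 - a * a))
    return x
```

On an even grid with spacing d this reduces to the familiar formulation with a = exp(−d/τ).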
We aim here for a heavy tail parameter estimation that is accurate, widely applicable and robust (i.e. reliable even when some underlying assumptions are not met). The selection 0 ≤ α ≤ 2 and k ≤ K − 1 allows a wide range of possible distributions of the data-generating process. The full distribution does not need to follow Eq. (1), just the extremal part ("distributional robustness"). The adoption of an AR(1) model for uneven spacing (Eq. 3) ensures that for many time series, the persistence dynamics is captured at least to first order ("persistence robustness"). Notably included is the persistence-free case, where time is irrelevant and only the observed values are required. This analytical design and the presented method are therefore applicable to many different types of data from geophysics and disciplines beyond. We detail the new order selector (Sect. 2) and show its superiority in a Monte Carlo simulation experiment (Sect. 3). Simulation is also the approach to construct error bars for α̂ (Sect. 4). We illustrate the method via applications to artificial (Sect. 5) and observed, hydrological time series (Sect. 6). The conclusions (Sect. 7) address practitioners of risk analysis.

Order selection
To repeat the ingredients of the statistical problem, let {t(i), x(i)}_{i=1}^{n} be a time series, where the time values, t(i), increase strictly monotonically and the x-values have a zero mean (via mean subtraction). Let x′(i) denote the x-values sorted according to descending size. Let there be K ≥ 2 positive x′-values.
Algorithm 1 gives the solution of the problem of order (k) selection for the Hill estimator of the heavy tail parameter (α).
Algorithm 1 is an illustration of the concept of optimal estimation (Mudelsee, 2014). This concept roughly states that, when confronted with a complex estimation problem on given data, the first task is to explore the various estimation techniques and find the optimal one, where optimality is meant in a certain sense (e.g. RMSE). The second task is then to apply the optimal technique to the given data. Optimal estimation becomes feasible with increasing computing power.
Algorithm 1 Optimal order selection for the Hill estimator.
3: calculate τ̂ using a least-squares criterion (Mudelsee, 2014)
4: for j = 1 to N_inner do
5: generate n random values from a stable distribution with prescribed α = α̂_k using the algorithm by Nolan (1997)

In the context of heavy tail index estimation, Algorithm 1 attacks the "Achilles' heel" problem of order selection via brute force. The extension to other estimators is straightforward.
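A simplified, persistence-free sketch of the brute force selection idea: for each candidate order k, take the Hill estimate α̂_k at face value, simulate N_inner stable samples at that α, re-estimate at order k, and keep the k with the smallest simulation RMSE. All names below are ours, the symmetric Chambers–Mallows–Stuck generator stands in for Nolan's (1997) algorithm, and the AR(1)/persistence step of Algorithm 1 is omitted:

```python
import numpy as np

def stable_sym(alpha, n, rng):
    # symmetric stable variates (Chambers-Mallows-Stuck, beta = 0)
    u = rng.uniform(-np.pi / 2, np.pi / 2, n)
    w = rng.exponential(1.0, n)
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))

def hill(xs, k):
    # Hill estimate (Eq. 2) on descending-sorted positive values xs
    return 1.0 / np.mean(np.log(xs[:k] / xs[k]))

def sorted_positive(x):
    s = x - x.mean()                    # zero sample mean
    return np.sort(s[s > 0])[::-1]      # positive values, descending

def optimal_order(x, n_inner=30, seed=1):
    """Brute force order selection sketch: minimize the internal
    simulation RMSE of the Hill estimate over all candidate orders k."""
    rng = np.random.default_rng(seed)
    xs = sorted_positive(np.asarray(x, dtype=float))
    best_k, best_rmse = None, np.inf
    for k in range(2, xs.size - 1):
        a_k = hill(xs, k)
        if not 0.0 < a_k <= 2.0:        # restrict to the stable range
            continue
        err = []
        for _ in range(n_inner):        # inner simulation loop
            ss = sorted_positive(stable_sym(a_k, len(x), rng))
            if ss.size > k:
                err.append(hill(ss, k) - a_k)
        if err:
            rmse = np.sqrt(np.mean(np.square(err)))
            if rmse < best_rmse:
                best_k, best_rmse = k, rmse
    return best_k, best_rmse
```

The full-grid loop over k makes the computational cost visible; Sect. 6 describes a two-step search that cuts this cost for large n.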

Monte Carlo experiment
We compare the optimal order selector for the Hill estimator (Algorithm 1) with two other selectors in a Monte Carlo simulation experiment (Algorithm 2). This involves many time series generated in a computer by means of a random number generator (Fishman, 1996). The prescribed uneven spacing is drawn from a gamma distribution, which may be a realistic model for many paleoclimate time series (Mudelsee, 2014).
Algorithm 2 Monte Carlo experiment on order selection for the Hill estimator, n_sim = 10 000.
1: prescribe n, τ, α and order selector (asymptotic, bootstrap or optimal)
2: draw spacing, {d(i)}_{i=1}^{n−1}, from a gamma distribution with order parameter 3
3: scale {d(i)}_{i=1}^{n−1} such that the average, d̄, equals unity
4: set t(1) = 1 and t(i) = t(i − 1) + d(i − 1) for i = 2, …, n
5: for j = 1 to n_sim do
6: generate {x(i)}_{i=1}^{n} from an AR(1) process (Eq. 3) on the time grid, {t(i)}_{i=1}^{n}, with stable distributed (Nolan, 1997) innovations and prescribed values for τ and α
7: select order, k
8: calculate α̂_j = α̂_k (Eq. 2) on the sorted and mean-subtracted data, {x′(i)}_{i=1}^{n}
9: end for
10: calculate RMSE_α̂ = [ (1/n_sim) Σ_{j=1}^{n_sim} (α̂_j − α)² ]^(1/2)

The first competitor as order selector is based on the asymptotic normality of α̂_k (Eq. 2). Hall (1982) showed that for n → ∞ and under further conditions, the expression k^(1/2)(α̂_k − α) approaches a normal distribution with mean zero and variance α². This allows the construction of an order selector based on the theoretical minimal asymptotic mean squared error (AMSE). A caveat against any asymptotic normality argument is that it is difficult to check in practice whether the underlying conditions are fulfilled: in particular, how far n is away from infinity.
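Lines 2–4 of Algorithm 2 (gamma-distributed spacings, scaled to unit mean, cumulated to a time grid) can be sketched as follows (function name is ours):

```python
import numpy as np

def gamma_time_grid(n, rng, order=3):
    """Uneven time grid per Algorithm 2, lines 2-4: gamma-distributed
    spacings d(i) with order (shape) parameter 3, scaled so that the
    average spacing equals unity, then cumulated from t(1) = 1."""
    d = rng.gamma(order, 1.0, n - 1)
    d /= d.mean()                       # average spacing = 1
    t = np.empty(n)
    t[0] = 1.0
    t[1:] = 1.0 + np.cumsum(d)          # t(i) = t(i-1) + d(i-1)
    return t
```

The resulting grid is strictly monotonic with unit mean spacing, so results for different n remain comparable on the same time scale.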
The second competitor aims to improve the selector based on asymptotic normality by estimating the AMSE via a computing-intensive bootstrap resampling procedure (Danielsson et al., 2001). This data-adaptive order selector possesses stronger robustness than the one based on theoretical AMSE because it makes less restrictive assumptions. The adaptation to the data at hand makes the bootstrap-based selector relevant for practical applications.
There exists also the suggestion to look for a plateau in the sequence of α̂_k against k as an indication of a suitable order (Resnick, 2007). Evidently, it is not straightforward to objectively define a plateau and implement that definition in a Monte Carlo experiment. Of higher relevance is the finding (Sect. 5) that the optimal order can be located in a region that does not at all resemble a plateau.
The results (Fig. 1) show that the new optimal order selector outperforms (i.e. has a smaller RMSE_α̂ than) the two competitors. This is true over a considerable range of design parameters (n, τ, α). The success of the optimal order selector may at least partly be due to the fact that the normality of α̂, on which the two competing selectors are based, is not approached in the simulated settings. On the other hand, the prescribed stable distributional shape of the data-generating process fits particularly well to the optimal selector (Algorithm 1). This point will be investigated in a future analysis of the optimal selector under varied Monte Carlo designs.

Error bars
To adapt the preface of our book on climate time series analysis (Mudelsee, 2014): we wish to know the truth about a geophysical system, but have only a limited sample, {t(i), x(i)}_{i=1}^{n}, influenced by various sources of noise. Therefore we cannot expect our estimate, α̂, which is based on data, to equal the truth. However, we can determine the typical size of that deviation: an error bar. Error bars help to critically assess estimation results, they prevent us from making overstatements, and they guide us on our way to enhancing geophysical knowledge. Estimates without error bars are useless.
The Monte Carlo experiment (Sect. 3) contains a method to construct error bars. For this purpose, Algorithm 2 may be adapted as follows. Line 1: take over n from the sample, and set τ = τ̂ and α = α̂. Lines 2 to 4: take over {t(i)}_{i=1}^{n} from the sample. Line 5: set n_sim = 100. Report RMSE_α̂ as an error bar. This uncertainty measure has the advantage that it includes not only estimation variance, but also bias. This makes it more reliable than, for example, the standard error. This error bar construction is also used for the persistence time (RMSE_τ̂).
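In the persistence-free case, this error bar construction reduces to a short parametric simulation. The sketch below (names are ours; the symmetric Chambers–Mallows–Stuck generator stands in for Nolan's 1997 algorithm, and the AR(1) step is omitted) treats the estimates as truth, re-estimates on n_sim simulated samples, and reports the RMSE:

```python
import numpy as np

def stable_sym(alpha, n, rng):
    # symmetric stable variates (Chambers-Mallows-Stuck, beta = 0)
    u = rng.uniform(-np.pi / 2, np.pi / 2, n)
    w = rng.exponential(1.0, n)
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))

def rmse_error_bar(alpha_hat, k_opt, n, n_sim=100, seed=4):
    """RMSE error bar for a Hill estimate (Sect. 4, persistence-free):
    simulate n_sim samples of size n at the estimated alpha, re-estimate
    at the chosen order, and report the root mean squared deviation,
    which captures estimation variance as well as bias."""
    rng = np.random.default_rng(seed)
    dev = []
    for _ in range(n_sim):
        x = stable_sym(alpha_hat, n, rng)
        s = x - x.mean()                       # zero sample mean
        xs = np.sort(s[s > 0])[::-1]           # positive, descending
        if xs.size > k_opt:
            dev.append(1.0 / np.mean(np.log(xs[:k_opt] / xs[k_opt]))
                       - alpha_hat)
    return np.sqrt(np.mean(np.square(dev)))
```

Because deviations are measured against the prescribed α, any bias of the Hill estimate at the chosen order enters the bar, which is the stated advantage over a pure standard error.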
A note on the selection of n_sim = 100 for error bar determination: we explored the influence of n_sim on the accuracy of RMSE_α̂ in another Monte Carlo experiment. We varied n_sim and analysed the coefficient of variation (CV) of RMSE_α̂. The CV is given by the standard deviation of RMSE_α̂, which is calculated over a number of external (i.e. outside of the algorithm) runs, divided by the mean calculated over the runs. One run consists in generating a series and estimating the tail index with RMSE_α̂. The number of runs was 10 000. We found that for n_sim ≈ 100, a saturation behaviour of the CV sets in, while for smaller n_sim values, the CV decreases with n_sim. Further increasing n_sim had no measurable effect on the accuracy of RMSE_α̂. The value of 100 also agrees roughly with the Monte Carlo findings on the minimum number of simulations required for obtaining reliable results for the bootstrap standard error (Efron and Tibshirani, 1993).

Application to artificial time series
The application of heavy tail estimation to artificial data (Fig. 2) offers a test of the new analysis method because the properties of the data-generating process (τ, α) are prescribed and can be compared with the estimates (τ̂, α̂). Employing a data size (n = 5000) not untypical for ambitious non-linear geophysical analyses and using the new order selector yields a clearly expressed optimal order of k_opt = 1071 (Fig. 2c). That means about 20 % of the data are utilized for heavy tail index estimation.
It is remarkable that the sequence α̂_k does not at all display a plateau at around k_opt (Fig. 2b). Rather, the sequence shows a decreasing trend with k.
The good agreement between data and fit is also reflected in the close correspondence between the data histograms and the fitted densities (Fig. 2d–f).
One caveat to consider is the fact that the prescribed density of the process that generated the data (Fig. 2a) is a stable distribution, which is also employed by optimal order selection (Algorithm 1). This may have, at least partly, produced the good fit on artificial data. On the other hand, (1) the full distribution does not need to follow Eq. (1), just the extremal part, and (2) stable distributions form a fairly wide class of distributions (Nolan, 2003). Still, we plan to study the relevance of this point by means of an analysis under varied Monte Carlo designs.

Application to observed, hydrological time series
The application of heavy tail estimation to observed data (Fig. 3) serves to illustrate the practical work. The runoff time series of the River Elbe at station Dresden (Fig. 3a) belongs to the longest observed hydrological records available. The data quality is assessed as excellent owing to the relatively constant observation situation at this station and the frequently updated runoff–water stage calibration curves (Mudelsee et al., 2003). We analyse the hydrological summer separately (Fig. 3) because the conditions for generating extreme floods (right tail) vary from summer to winter (Mudelsee et al., 2003). The resulting data size is n = 38 272. The clearly expressed optimal order for Hill estimation is k_opt = 3732 (Fig. 3c). This means about 10 % of the data are utilized for heavy tail index estimation. This decrease in the ratio k_opt/n with n, which is found when the observed series is compared with the artificial series of size n = 5000 (Sect. 5), is compatible with theoretical recommendations (Hall, 1982).
Due to excessive computing costs associated with a brute force search for n = 38 272, the optimal order is found via a quasi-brute force, two-step search method. In the first step, we calculate the measure RMSE(k) (Algorithm 1) at k-increments of L_k = 50. From the resulting 765 measure values, we select the P_mink = 5 % with minimal measure, for which we perform in the second step a fine search with increment 1. The quasi-brute force search, Monte Carlo experiments and further hints on the selection of L_k and P_mink are described in the manual (Supplement).
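The two-step search itself is generic; a sketch (function name is ours, with a cheap stand-in objective in place of the expensive RMSE(k) measure):

```python
import numpy as np

def two_step_argmin(measure, K, step=50, frac=0.05):
    """Quasi-brute-force order search (Sect. 6): evaluate measure(k) on a
    coarse grid with k-increment `step` (L_k), keep the fraction `frac`
    (P_mink) of grid points with the smallest values, then fine-search
    their neighbourhoods with increment 1."""
    coarse = np.arange(2, K, step)
    vals = np.array([measure(k) for k in coarse])
    m = max(1, int(frac * coarse.size))
    keep = coarse[np.argsort(vals)[:m]]        # most promising grid points
    best_k, best_v = None, np.inf
    for k0 in keep:                            # fine search, increment 1
        for k in range(max(2, k0 - step + 1), min(K, k0 + step)):
            v = measure(k)
            if v < best_v:
                best_k, best_v = int(k), v
    return best_k
```

With a stand-in objective minimized at k = 3732 and K = 38 272, the search recovers k = 3732 while evaluating only a few thousand of the roughly 38 000 candidate orders.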
The resulting estimates with RMSE bars from n_sim = 100 simulations are τ̂ = 0.060 ± 0.002 a and α̂ = 1.48 ± 0.13. Although the observed time series clearly has more points (n = 38 272) than the artificial one (n = 5000), the error bar for the heavy tail index estimate is larger (RMSE_α̂ = 0.13) than for the artificial one (RMSE_α̂ = 0.06). The reason is that the estimated "equivalent autocorrelation coefficient" (Mudelsee, 2014), given by ā = exp(−d̄/τ̂), is larger for the observed time series (ā = 0.91) than for the artificial one (ā = 0.51). Stronger persistence means fewer independent data points and a larger estimation uncertainty. An additional Monte Carlo experiment revealed that for absent persistence (τ = 0), the observed, hydrological values yield a clearly smaller error bar (RMSE_α̂ = 0.03).
For the hydrological interpretation of the statistical results, not only the error bars (RMSE_τ̂ and RMSE_α̂) have to be considered, but also model mis-specification.
In the case of persistence estimation of runoff series, an alternative to the AR(1) model may be a long-memory model (Mudelsee, 2007). We think that the large estimated autocorrelation (ā = 0.91) already captures a large amount of the serial dependence structure (Sect. 1) of the hydrological series. Therefore, an associated persistence model mis-specification would likely have only minor consequences (widened error bars). Still, it is worth studying long-memory models with heavy tail distributed innovations more systematically.
In the case of heavy tail index estimation, we think that the employed stable distribution model class already captures the true distribution (Fig. 3d–f) quite well owing to the wide range of the stable class (Sect. 1). Therefore, an associated distribution model mis-specification should not widen the error bars strongly, and the true value should not be far away from the estimate, α̂ = 1.48.
However, the possibility of model mis-specification prevents us at this stage of the analysis from concluding unambiguously that, with α < 2, the runoff-generating process has infinite variance (Nolan, 2003). This would have serious consequences for practical work since many types of statistical estimation problems (e.g. trend, spectrum) would be affected. We mention the study of the runoff series from the River Salt (Anderson and Meerschaert, 1998), which found α ≈ 3 (i.e. finite variance), in contrast to our finding. Further stages of the analysis, to be pursued in a future paper, will therefore include (1) a comparison between summer and winter for the runoff series from Dresden; (2) a quantification of the sensitivity to the removal of an annual cycle; (3) a comparison among various other stations on the River Elbe; and (4) a comparison with other rivers, for which long, high-quality runoff records are available. Evidently, an accurate heavy tail estimation technique with optimal order selection is helpful for this purpose.

Conclusions
The tail probability, P(X > x), is crucially important for practical risk analysis, for example, the calculation of the expected losses in the reinsurance business. Instead of a Gaussian exponential behaviour (light tail), many observed variables from complex networks show a power law (heavy tail). This law (Eq. 1), which is parameterized by means of the heavy tail index, α, allows us to extrapolate the probability into unobserved, extreme data ranges.
The accurate estimation of α on the basis of observed data is therefore also crucially important. The "Achilles' heel" of tail index estimation is order selection, that is, setting how many of the largest values to utilize for the estimation. This paper focuses on a new, optimal order selector (Algorithm 1). The superiority of the new selector is demonstrated in a Monte Carlo simulation experiment (Fig. 1).
The new selector is claimed to utilize the data in an optimum way for performing an estimation. The resulting error bars (RMSE_α̂), which are calculated from computing-intensive simulations (Sect. 4), are comparably small. Hence, the new method allows us to study, more accurately than previously possible, various extremal behaviours, such as the spatial dependence of α in geostatistical applications or the time dependence of α in long time series. The time dependence may shed light on tipping points in complex systems. In particular, changes in α over time may possibly be used to predict the approach of a sudden change in a geophysical variable (e.g. climate).
The data-generating process (AR(1) with stable distributed innovations) achieves "distributional robustness" because the full distribution does not need to follow Eq. (1), just the extremal part. It also achieves "persistence robustness" because Eq. (3) ensures that for many time series (also unevenly spaced ones), the persistence dynamics is captured at least to first order. As a result, the presented method is accurate and widely applicable, and it delivers robust results.
However, at this stage of method development, it is still useful to perform more Monte Carlo simulation studies on heavy tail index estimation. These simulations should include varied designs, in particular, prescribed shapes other than a stable distribution. Furthermore, it is interesting to study estimators other than Hill (on which this paper focuses). The computer program associated with optimal heavy tail index estimation (ht) also implements the estimation routine following Pickands III (1975).
The application to an observed, hydrological time series (Fig. 3) delivered the intriguing result of infinite variance (but finite mean) of the data-generating process (α̂ = 1.48). Infinite variance would have serious consequences for many types of statistical estimation to be carried out on hydrological data. We recommend analysing more, independent hydrological data to corroborate or refute this finding.
The wider impact of optimal heavy tail estimation may lie not only in the application to instrumental environmental measurements, but also to reconstructed variables from the areas of paleoclimatology (Cronin, 2010), paleohydrology (Gasse, 2009) and dendrochronology (D'Arrigo et al., 2011; Gholami et al., 2015). Furthermore, since extreme events in hydrology and related fields may also show a duration aspect (e.g. droughts, heatwaves), the estimation should not be restricted to measured or reconstructed variables. Rather, heavy tail index estimation should be a useful tool also for the analysis of index variables (Kürbis et al., 2009).

Figure 2 .
Figure 2. Application of heavy tail estimation with optimal order selection to artificial data: (a) time series (dark line), drawn from an AR(1) process with even spacing unity and stable distributed innovations (n = 5000, τ = 1.5, α = 1.75); (b) sequence α̂_k (solid dark line) ± RMSE(k) (shaded) for the Hill estimator (right tail); (c) measure RMSE(k); (d)–(f) frequency plots showing densities and histograms at various axis scalings. The optimal order is k_opt = 1071. The fitted heavy tail density function, f(x), has been scaled such that for x > x′(k_opt), the tail probability (Eq. 1) times n equals the number of extreme events (right tail). (Note that f(x) is the density corresponding to the distribution function F(x).)

Figure 3 .
Figure 3. Application of heavy tail estimation with optimal order selection to observed data: (a) time series (dark line) of average daily runoff of the River Elbe at Dresden (Germany) during summer (May to October) from 1 May 1806 to 31 October 2013 (n = 38 272); units: m³ s⁻¹; (b) sequence α̂_k (solid dark line) ± RMSE(k) (shaded) for the Hill estimator (right tail); (c) measure RMSE(k); (d)–(f) frequency plots showing densities and histograms at various axis scalings (cf. Fig. 2). The optimal order is k_opt = 3732; it has been detected using a quasi-brute force search (see text). Data courtesy Wasser- und Schifffahrtsverwaltung des Bundes, provided by the Bundesanstalt für Gewässerkunde (BfG), Koblenz, Germany.