Efficient Bayesian inference for ARFIMA processes

T. Graves, R. B. Gramacy, C. L. E. Franzke, and N. W. Watkins

URS Corporation, London, UK; The University of Chicago, Booth School of Business, Chicago, USA; Meteorological Institute and Center for Earth System Research and Sustainability (CEN), University of Hamburg, Hamburg, Germany; Centre for the Analysis of Time Series, London School of Economics and Political Science, London, UK; Centre for Fusion Space and Astrophysics, University of Warwick, Coventry, UK; Max Planck Institute for the Physics of Complex Systems, Dresden, Germany; Centre for Complexity and Design, Open University, Milton Keynes, UK


Introduction
In this paper we are concerned with Bayesian analysis of specific types of stochastic processes capable of possessing 'long memory', or 'long-range dependence' (LRD) (Beran, 1994b; Palma, 2007; Beran et al., 2013). Long memory is the notion of there being correlation between the present and all points in the past. A standard definition is that a (finite variance, stationary) process has long memory if its autocorrelation function (ACF) ρ(·) has power-law decay: ρ(k) ∼ c_ρ k^{2d−1} as k → ∞, for some non-zero constant c_ρ, and where 0 < d < 1/2. The parameter d is the memory parameter; if d = 0 the process does not exhibit long memory, while if −1/2 < d < 0 the process is said to have negative memory. The study of long memory originated in the 1950s in the field of hydrology, where studies of the levels of the river Nile (Hurst, 1951) demonstrated anomalously fast growth of the rescaled range of the time series. After protracted debates about whether this was a transient (finite time) effect, the mathematical pioneer Benoît B. Mandelbrot showed that if one retained the assumption of stationarity, novel mathematics would be essential to explain the Hurst effect. In doing so he rigorously defined the concept of long memory (Mandelbrot and Van Ness, 1968; Mandelbrot and Wallis, 1968).
Most research into long memory and its properties has been based on classical statistical methods, spanning parametric, semi-parametric and non-parametric modeling (see Beran et al., 2013, for a review). Very few Bayesian methods have been studied, most probably due to computational difficulties. The earliest works are parametric and include Koop et al. (1997), Pai and Ravishanker (1998), and Hsu and Breidt (2003). If computational challenges could be mitigated, the Bayesian paradigm would offer advantages over classical methods, including flexibility in the specification of priors (i.e., physical expertise could be used to elicit an informative prior). It would also offer the ability to marginalise out aspects of the model apparatus and data, such as short memory or seasonal effects and missing observations, so that statements about long memory effects can be made unconditionally.
Towards easing the computational burden, we focus on the ARFIMA class of processes (Granger and Joyeux, 1980;Hosking, 1981) as the basis of developing a systematic and unifying Bayesian framework for modeling a variety of common time series phenomena, with particular emphasis on detecting potential long memory effects. ARFIMA has become very popular in statistics and econometrics because it is generalisable and its connection to the ARMA family (and to fractional Gaussian noise) is relatively transparent. A key property of ARFIMA is its ability to simultaneously yet separately model long and short memory. Both Liseo et al. (2001) and Holan et al. (2009) argued, echoing a sentiment in the classical literature, that full parametric long memory models (like ARFIMA) are 'too hard' to work with. Furthermore, often d is the only object of real interest, and consideration of a single class of models, such as ARFIMA, is too restrictive. They therefore developed methods which have similarities to classical periodograms.
We think ARFIMA deserves another look, and that many of the above drawbacks, to ARFIMA in particular and Bayesian computation more generally, can be addressed with a careful treatment. We provide a new approximate likelihood for ARFIMA processes that can be computed quickly for repeated evaluation on large time series, and which underpins an efficient MCMC scheme for Bayesian inference. Our sampling scheme can best be described as a modernisation of a blocked MCMC scheme proposed by Pai and Ravishanker (1998), adapting it to the approximate likelihood and extending it to handle a richer form of (known) short memory effects. We then further extend the analysis to the case where the short memory form is unknown, which requires transdimensional MCMC. This aspect is similar to the work of Ehlers and Brooks (2008), who considered the simpler ARIMA model class, and to Holan et al. (2009), who worked with a nonparametric long memory process. Our contribution has aspects in common with Egrioglu and Günay (2010), who presented a more limited method focused on model selection rather than averaging. The advantage of averaging is that the unknown form of short memory effects can be integrated out, focusing on long memory without conditioning on nuisance parameters.
The aim of this paper is to introduce an efficient Bayesian algorithm for inference of the parameters of the ARFIMA(p, d, q) model, with particular emphasis on the LRD parameter d. Our Bayesian inference algorithm has been designed in a flexible fashion so that, for instance, the innovations can come from a wide class of distributions, e.g., α-stable or t-distributions. The remainder of the paper is organised as follows. Section 2 summarises the aspects of ARFIMA required for our purposes. Section 3 discusses the important numerical calculation of likelihoods, representing a hybrid between earlier classical statistical methods and our new contributions towards a full Bayesian approach. Section 4 describes our proposed Bayesian framework and methodology in detail, focusing on long memory only. Then, in Section 5, we consider extensions for additional short memory. Empirical illustration and comparison of all methods is provided in Section 6. The paper concludes with a discussion in Section 7 focused on potential for further extension.

Time series definitions and the ARFIMA model
Following Brockwell and Davis (1991), a time series will mean a set of univariate real-valued observations {x_t}, each recorded at a specified time t ∈ Z, and sampled at discrete, regular intervals. A process will refer to a corresponding set of random variables {X_t}. The process {X_t} is strictly stationary if the joint distributions of (X_{t_1}, . . . , X_{t_k}) and (X_{t_1+h}, . . . , X_{t_k+h}) are the same for all positive integers k, and for all t_1, . . . , t_k, h ∈ Z. It is weakly stationary if: (1) E X_t = µ < ∞ for all t ∈ Z; (2) E|X_t|^2 < ∞ for all t ∈ Z; and (3) Cov(X_r, X_s) = Cov(X_{r+t}, X_{s+t}) for all r, s, t ∈ Z. A process {X_t} is Gaussian if the distribution of (X_{t_1}, . . . , X_{t_k}) is multivariate normal (MVN) for all positive integers k, and for all t_1, . . . , t_k ∈ Z. Throughout, stationary Gaussian processes will be assumed for convenience, where 'strong' and 'weak' are equivalent, and consequently those qualifiers will be dropped.
From the above, we see that the covariance depends only on the temporal difference, which motivates defining the autocovariance (ACV) γ(·) of a weakly stationary process as γ(k) = Cov(X_t, X_{t+k}), where k is referred to as the (time) 'lag'. The (normalised) autocorrelation function (ACF) ρ(·) is defined as ρ(k) = γ(k)/γ(0). Another useful time domain tool is the 'backshift' operator B, where B X_t = X_{t−1}, and powers of B are defined iteratively: B^k X_t = B^{k−1}(B X_t) = B^{k−1} X_{t−1} = · · · = X_{t−k}. A stationary process {X_t} is said to be causal if there exists a sequence of coefficients {ψ_k}, with finite total mean square Σ_{k=0}^∞ ψ_k^2 < ∞, such that, for all t, a given member of the process can be expanded as a power series in the backshift operator acting on the 'innovations' {ε_t}:

X_t = Σ_{k=0}^∞ ψ_k ε_{t−k} = Ψ(B) ε_t.    (1)

The innovations are a white (i.e. stationary, zero mean, iid) noise process with variance σ^2. Causality specifies that, for every t, X_t can only depend on the past and present values of the innovations {ε_t}. Furthermore, Wold's theorem shows that any purely non-deterministic stationary process has a unique causal representation (referred to as the Wold expansion). A stationary process {X_t} is said to be invertible if there exists a sequence of coefficients {π_k} with Σ_{k=0}^∞ π_k^2 < ∞, allowing the innovations to be written as a power series in the backshift operator acting on the process:

ε_t = Σ_{k=0}^∞ π_k X_{t−k} = Π(B) X_t.    (2)

The expansion in (2) has many uses, but an additional reason for assuming invertibility is that it is closely related to identifiability: it is possible for two different processes to have the same ACF, but this cannot happen for two invertible ones. Therefore in what follows we restrict ourselves to models that are causal (and hence stationary) and in addition invertible. A process {X_t} is said to be an auto-regressive process of order p, AR(p), if for all t:

Φ(B) X_t = ε_t,  where  Φ(z) = 1 − φ_1 z − · · · − φ_p z^p.    (3)

AR(p) processes are always invertible, and are stationary and causal if and only if Φ(z) ≠ 0 for all z ∈ C such that |z| ≤ 1.
From (2), invertibility is equivalent to the process having an AR(∞) representation. Similarly, {X_t} is said to be a moving average process of order q, MA(q), if for all t:

X_t = Θ(B) ε_t,  where  Θ(z) = 1 + θ_1 z + · · · + θ_q z^q.    (4)

MA(q) processes are stationary and causal, and are invertible if and only if Θ(z) ≠ 0 for all z ∈ C such that |z| ≤ 1. A natural extension of the AR and MA classes arises by combining them (Box and Jenkins, 1970). The process {X_t} is said to be an auto-regressive moving average (ARMA) process of orders p and q, ARMA(p, q), if for all t:

Φ(B) X_t = Θ(B) ε_t.    (5)

Although there is no simple closed form for the ACV of an ARMA process with arbitrary p and q, so long as the process is causal and invertible we have |ρ(k)| ≤ C r^k for k > 0, some C > 0 and 0 < r < 1; i.e., the ACF decays exponentially fast. In other words, although correlation between nearby points may be high, dependence between distant points is negligible. Before turning to 'long memory', we require one further result. Under some extra conditions, stationary processes with ACV γ(·) possess a spectral density function (SDF) f(·) defined such that:

γ(k) = ∫_{−π}^{π} e^{ikλ} f(λ) dλ,  for all k ∈ Z.

This can be inverted to obtain an explicit expression for the SDF (e.g. Brockwell and Davis, 1991, §4.3):

f(λ) = (1/2π) Σ_{k=−∞}^{∞} γ(k) e^{−ikλ},  where −π ≤ λ ≤ π.

Finally, the SDF of an ARMA process is

f(λ) = (σ^2/2π) |Θ(e^{−iλ})|^2 / |Φ(e^{−iλ})|^2.    (6)

Looking ahead to the fractionally integrated processes defined below, the restriction |d| < 1/2 will be necessary to ensure stationarity; clearly if |d| ≥ 1/2 the ACF would not decay. The continuity between stationary and non-stationary processes around |d| = 1/2 is similar to that which occurs for an AR(1) process as |φ_1| → 1 (such processes are stationary for |φ_1| < 1, but the case |φ_1| = 1 is the non-stationary random walk).
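The exponential ACF decay of a causal, invertible ARMA process can be checked numerically. The sketch below (assuming hypothetical example values φ = 0.6, θ = 0.3, not taken from the paper) computes the MA(∞) weights of an ARMA(1,1) process and its truncated-sum ACV; beyond lag 1 the ACF decays geometrically at rate φ:

```python
# Sketch: MA(infinity) weights and autocovariance of a causal ARMA(1,1)
# process, illustrating the exponential decay |rho(k)| <= C r^k.
# phi = 0.6, theta = 0.3 are hypothetical example values.

def arma11_psi(phi, theta, m):
    """First m MA(infinity) weights psi_0..psi_{m-1} of an ARMA(1,1)."""
    psi = [1.0, phi + theta]
    for _ in range(m - 2):
        psi.append(phi * psi[-1])      # psi_j = phi * psi_{j-1} for j >= 2
    return psi[:m]

def arma11_acf(phi, theta, max_lag, m=2000):
    """ACF rho(0..max_lag) via gamma(k) = sigma^2 sum_j psi_j psi_{j+k}."""
    psi = arma11_psi(phi, theta, m)
    gamma = [sum(psi[j] * psi[j + k] for j in range(m - k))
             for k in range(max_lag + 1)]
    return [g / gamma[0] for g in gamma]

rho = arma11_acf(0.6, 0.3, 12)
# For k >= 1, gamma(k+1) = phi * gamma(k), so the decay rate r is phi:
print(rho[10] / rho[9])   # ~ 0.6
```

The geometric tail is exact here because, for an ARMA(1,1), ψ_{j+1} = φψ_j for all j ≥ 1.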
There are a number of alternative definitions of LRD, one of which is particularly useful because it considers the frequency domain: a stationary process has long memory when its SDF follows f(λ) ∼ c_f λ^{−2d} as λ → 0+, for some positive constant c_f, and where 0 < d < 1/2. Similarly, it is said to have negative memory if that relationship holds for −1/2 < d < 0. The simplest way of creating a process which exhibits long memory is through the SDF:

f(λ) = (σ^2/2π) |1 − e^{−iλ}|^{−2d} = (σ^2/2π) (2 sin(λ/2))^{−2d},  0 < λ ≤ π.    (7)
Therefore, assuming stationarity, the process which has this SDF (or any scalar multiple of it) is a long memory process. More generally, a process having spectral density proportional to |1 − e^{−iλ}|^{−2d} is called fractionally integrated with memory parameter d, FI(d) (Barnes and Allan, 1966; Adenstedt, 1974). The full trichotomy of negative, short, and long memory is determined solely by d. When d = 0, the SDF is flat, yielding white noise. In practice this model is of limited appeal to time series analysts because the entire memory structure is determined by just one parameter, d. One often therefore generalises by taking any short memory SDF f*(·), and defining a new SDF:

f(λ) = f*(λ) |1 − e^{−iλ}|^{−2d},  0 < λ ≤ π.

An obvious class of short memory processes to use in this way is ARMA. Taking f* from (6) yields the so-called auto-regressive fractionally integrated moving average process with memory parameter d and orders p and q, ARFIMA(p, d, q), having SDF:

f(λ) = (σ^2/2π) [|Θ(e^{−iλ})|^2 / |Φ(e^{−iλ})|^2] |1 − e^{−iλ}|^{−2d}.    (8)

Choosing p = q = 0 recovers FI(d) ≡ ARFIMA(0, d, 0).
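The low-frequency divergence that defines long memory in the frequency domain can be verified directly from the FI(d) spectral density. A minimal sketch (d = 0.3 and σ = 1 are illustrative choices, not values from the paper):

```python
import math

# Sketch of the FI(d) spectral density
#   f(lambda) = (sigma^2 / 2 pi) |1 - e^{-i lambda}|^{-2d}
#             = (sigma^2 / 2 pi) (2 sin(lambda / 2))^{-2d},
# checking numerically that f(lambda) ~ c_f * lambda^{-2d} as lambda -> 0+.

def fi_sdf(lam, d, sigma=1.0):
    return sigma ** 2 / (2 * math.pi) * (2 * math.sin(lam / 2)) ** (-2 * d)

d = 0.3
c_f = 1 / (2 * math.pi)            # limiting constant for sigma = 1
for lam in (1e-2, 1e-3, 1e-4):
    # The product f(lambda) * lambda^{2d} should approach c_f:
    print(fi_sdf(lam, d) * lam ** (2 * d))
```

The identity |1 − e^{−iλ}|^2 = 2 − 2cos λ = 4 sin^2(λ/2) is what links the two forms of the density.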
Practical utility from the perspective of (Bayesian) inference demands finding a representation in the temporal domain. To obtain this, consider the operator (1 − B)^d for real d > −1, which is formally defined using the generalised form of the binomial expansion (Brockwell and Davis, 1991, Eq. 13.2.2):

(1 − B)^d = Σ_{k=0}^∞ b_k B^k,  where  b_0 = 1  and  b_k = b_{k−1} (k − 1 − d)/k.

From this, one can show that the process {X_t} satisfying (1 − B)^d X_t = Z_t, where {Z_t} is an ARMA process, has SDF (8). The operator (1 − B)^d is called the 'fractional differencing' operator, since it allows a degree of differencing between zeroth and first order. The process {X_t} is fractionally 'inverse-differenced', i.e. it is an 'integrated' process. The operator is used to define both the ARFIMA(0, d, 0) and more general ARFIMA(p, d, q) processes in the time domain. A process {X_t} is an ARFIMA(0, d, 0) process if for all t:

(1 − B)^d X_t = ε_t,

and an ARFIMA(p, d, q) process if for all t:

Φ(B) (1 − B)^d X_t = Θ(B) ε_t,

where Φ and Θ are given in (3) and (4) respectively. Finally, to connect back to our first definition of long memory, consider the ACV of the ARFIMA(0, d, 0) process. By using the definition of spectral density to directly integrate (7), one can obtain the following representation of the ACV of the ARFIMA(0, d, 0) process:

γ(k) = σ^2 Γ(1 − 2d) Γ(k + d) / [Γ(d) Γ(1 − d) Γ(k + 1 − d)],

whence ρ(k) ∼ [Γ(1 − d)/Γ(d)] k^{2d−1} as k → ∞, recovering the power-law ACF decay with c_ρ = Γ(1 − d)/Γ(d).
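Both the generalised binomial coefficients of (1 − B)^d and the closed-form FI(d) ACF are easy to compute stably; working with log-gamma values avoids overflow at large lags. A minimal sketch (d = 0.3 is an illustrative value):

```python
import math

# Sketch: coefficients of the fractional differencing operator (1 - B)^d
# via the generalised binomial recursion b_k = b_{k-1} (k - 1 - d) / k,
# and the closed-form FI(d) ACF
#   rho(k) = Gamma(1-d) Gamma(k+d) / [Gamma(d) Gamma(k+1-d)],
# evaluated via lgamma for numerical stability.

def frac_diff_coeffs(d, m):
    """b_0..b_{m-1} with (1 - B)^d = sum_k b_k B^k."""
    b = [1.0]
    for k in range(1, m):
        b.append(b[-1] * (k - 1 - d) / k)
    return b

def fi_acf(d, k):
    return math.exp(math.lgamma(1 - d) + math.lgamma(k + d)
                    - math.lgamma(d) - math.lgamma(k + 1 - d))

print(frac_diff_coeffs(1.0, 4))   # ordinary differencing: 1, -1, 0, 0
print(fi_acf(0.3, 1))             # rho(1) = d / (1 - d) = 0.42857...
```

Setting d = 1 recovers ordinary first differencing, a useful sanity check; and ρ(1) = d/(1 − d) follows from the gamma-function recurrence.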

Likelihood evaluation for Bayesian inference
For now we restrict our attention to Bayesian analysis of an ARFIMA(0, d, 0) process, having no short-range ARMA components, placing emphasis squarely on the memory parameter d. We present two alternative likelihoods, 'exact' and 'approximate'. The exact one is not original, but is presented here to highlight some important (particularly computational) issues that prevent its effective use in a Bayesian context, where MCMC inference requires thousands of evaluations. The approximate one represents a novel contribution.

Exact likelihood calculation
For Gaussian processes, all information is contained in the mean and covariance structure, so inference about memory behaviour can proceed through the covariance matrix given σ and d:

Σ(σ, d)_{i,j} = γ(|i − j|),  i, j = 1, . . . , n.

Therefore, the vector X = (X_1, . . . , X_n) is MVN with mean µ1_n and covariance Σ(σ, d), so the likelihood is:

L(x | µ, σ, d) = (2π)^{−n/2} det[Σ(σ, d)]^{−1/2} exp{−(1/2)(x − µ1_n)' Σ(σ, d)^{−1} (x − µ1_n)}.

To simplify the development below, write σ^2 Σ_d as a shorthand for Σ(σ, d), whereby we have det[Σ(σ, d)] = σ^{2n} det(Σ_d). Also, denote the quadratic term as Q_d(µ) = (x − µ1_n)' Σ_d^{−1} (x − µ1_n), so the log-likelihood can be re-written as

ℓ(x | µ, σ, d) = −(n/2) log 2π − n log σ − (1/2) log det(Σ_d) − Q_d(µ)/(2σ^2).    (13)

Numerical evaluation requires computing the determinant and inverse of a dense, symmetric positive-definite n × n matrix, an O(n^3) operation: too slow for the large n typically encountered in long memory contexts. Simplifications arise upon recognising that Σ_d is symmetric Toeplitz, being expressible by just n scalars c_0, . . . , c_{n−1}, i.e., (Σ_d)_{i,j} = c_{i−j} for i ≥ j. The Durbin-Levinson algorithm (Palma, 2007, §4.1.2) exploits this form, yielding an O(n^2) cost. However, even that remains too expensive in practice for most applications.
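The Durbin-Levinson evaluation of the exact Gaussian log-likelihood can be sketched as below. This follows the standard innovations form of the recursion (e.g. Brockwell and Davis, 1991), not code from the paper; `gamma` holds the autocovariances γ(0), . . . , γ(n−1) and `x` the mean-centred data:

```python
import math

# Sketch: exact Gaussian log-likelihood via the Durbin-Levinson recursion,
# exploiting the Toeplitz structure of Sigma to avoid the O(n^3)
# determinant/inverse; total cost is O(n^2).

def durbin_levinson_loglik(x, gamma):
    n = len(x)
    phi_prev = []                      # phi_{t-1, 1..t-1}
    v = gamma[0]                       # one-step prediction variance
    ll = -0.5 * (math.log(2 * math.pi * v) + x[0] ** 2 / v)
    for t in range(1, n):
        # Reflection coefficient phi_{t,t}
        num = gamma[t] - sum(phi_prev[j] * gamma[t - 1 - j]
                             for j in range(t - 1))
        k = num / v
        # phi_{t,j} = phi_{t-1,j} - phi_{t,t} phi_{t-1,t-j}
        phi = [phi_prev[j] - k * phi_prev[t - 2 - j] for j in range(t - 1)]
        phi.append(k)
        v *= (1 - k * k)               # v_t = v_{t-1} (1 - phi_{t,t}^2)
        pred = sum(phi[j] * x[t - 1 - j] for j in range(t))
        ll += -0.5 * (math.log(2 * math.pi * v) + (x[t] - pred) ** 2 / v)
        phi_prev = phi
    return ll
```

The prediction errors and their variances factorise the MVN density, so no matrix is ever formed, let alone inverted.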
Our proposed likelihood approximation uses a truncated AR(∞) representation (cf. Haslett and Raftery, 1989). We first re-write the AR(∞) representation of ARFIMA(0, d, 0) to incorporate the unknown mean µ, dropping the (d) superscript on the coefficients for convenience:

Σ_{k=0}^∞ π_k (X_{t−k} − µ) = ε_t.

Then we truncate this AR(∞) representation to obtain an AR(P) one, with P large enough to retain low frequency effects, e.g., P = n. Denoting Π_P = Σ_{k=0}^P π_k and, with π_0 = 1, rearranging terms gives the following modified model:

X_t = µ Π_P − Σ_{k=1}^P π_k X_{t−k} + ε_t.

It is now possible to write down a conditional likelihood. For convenience, the notation x_k = (x_1, . . . , x_k) for k = 1, . . . , n will be used (and x_0 is interpreted as appropriate where necessary). Denote the unobserved P-vector of random variables (x_{1−P}, . . . , x_{−1}, x_0) by x_A (in the Bayesian context these will be 'auxiliary', hence 'A'). Consider the likelihood L(x|ψ) as a joint density which can be factorised as a product of conditionals. Writing g_t(x_t | x_{t−1}, ψ) for the density of X_t conditional on x_{t−1}, we obtain L(x|ψ) = Π_{t=1}^n g_t(x_t | x_{t−1}, ψ). This is still of little use because the g_t may have a complicated form. However, by further conditioning on x_A, and writing h_t(x_t | x_A, x_{t−1}, ψ) for the density of X_t conditional on x_{t−1} and x_A, we obtain:

L(x | x_A, ψ) = Π_{t=1}^n h_t(x_t | x_A, x_{t−1}, ψ).    (14)

Observe that, conditional on both the observed and unobserved past values, X_t is simply distributed according to the innovations' density f with a suitable change in location:

c_t = µ Π_P − Σ_{k=1}^P π_k x_{t−k},    (15)

so that, using the location-scale representation, h_t(x_t | x_A, x_{t−1}, ψ) = (1/σ) f((x_t − c_t)/σ; λ), where λ collects any additional shape parameters of the innovation density. Evaluating this expression efficiently depends upon efficient calculation of c = (c_1, . . . , c_n)' and log f(·). From (15), c is a convolution of the augmented data (x_A, x) with coefficients depending on d, which can be evaluated quickly, e.g., via R's FFT-based convolve function. Consequently, evaluation of the conditional likelihood in the Gaussian case costs only O(n log n), a clear improvement over the 'exact' method.
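The heart of the approximation, one convolution of the augmented data with AR(P) coefficients followed by a sum of innovation log-densities, can be sketched as below. This is a simplified illustration, not the paper's exact expression: the mean is set to zero (so the µΠ_P term vanishes), the innovations are Gaussian, and `np.convolve` is used for clarity (an FFT-based convolution such as `scipy.signal.fftconvolve` gives the stated O(n log n) scaling):

```python
import numpy as np

# Sketch (Gaussian case, mu = 0): the conditional likelihood reduces to
# i.i.d. innovation densities evaluated at the AR(P)-filtered series.
# The filter is a single convolution of the augmented data (x_A, x) with
# the fractional-differencing coefficients.

def frac_diff_coeffs(d, m):
    b = np.empty(m)
    b[0] = 1.0
    for k in range(1, m):
        b[k] = b[k - 1] * (k - 1 - d) / k
    return b

def approx_loglik(x, x_A, d, sigma):
    P = len(x_A)
    aug = np.concatenate([x_A, x])   # (x_{1-P}, ..., x_0, x_1, ..., x_n)
    b = frac_diff_coeffs(d, P + 1)
    # e_t = sum_{k=0}^{P} b_k x_{t-k}: one pass of the AR(P) filter
    resid = np.convolve(aug, b)[P:P + len(x)]
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - resid ** 2 / (2 * sigma ** 2))
```

With d = 0 the filter is the identity and the expression collapses to the i.i.d. Gaussian log-likelihood, a convenient check.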
Obtaining the unconditional likelihood requires marginalisation over x A , which is analytically infeasible. However this conditional form will suffice in the context of our Bayesian inferential scheme, presented below.

A Bayesian approach to long memory inference
We are now ready to consider Bayesian inference for ARFIMA(0, d, 0) processes. Our method can be succinctly described as a modernisation of the blocked MCMC method of Pai and Ravishanker (1998). Isolating parameters by blocking provides significant scope for modularisation, which helps accommodate our extensions for short memory. Pairing with efficient likelihood evaluations allows much longer time series to be entertained than ever before. Our description begins with appropriate specification of priors, which are more general than previous choices yet still encourage tractable inference. We then provide the relevant updating calculations for all parameters, including those for the auxiliary parameters x_A.
We follow earlier work (Koop et al., 1997; Pai and Ravishanker, 1998) and assume a priori independence for the components of ψ. Each component will leverage familiar prior forms with diffuse versions as limiting cases. Specifically, we use a diffuse Gaussian prior on µ: µ ∼ N(µ_0, σ_0^2), with σ_0 large. The improper flat prior is obtained as the limiting distribution when σ_0 → ∞: p_µ(µ) ∝ 1. We place a gamma prior on the precision σ^{−2}.

Updating µ: Following Pai and Ravishanker (1998), we use a symmetric random walk (RW) MH update with proposals ξ_µ ∼ N(µ, σ_µ^2), for some σ_µ^2. Because the proposal is symmetric, the acceptance ratio under the approximate likelihood is simply

A_µ(µ, ξ_µ) = min{1, [p_µ(ξ_µ) L(x | x_A, ξ_µ, σ, d)] / [p_µ(µ) L(x | x_A, µ, σ, d)]}.

With the exact likelihood, recall (13) to obtain the analogous expression with L(x | µ, σ, d) in place of the conditional likelihood.

Updating σ: We diverge from Pai and Ravishanker (1998) here, who suggest independence MH with moment-matched inverse gamma proposals, finding poor performance under poor moment estimates. We instead prefer an RW MH approach, which we conduct in log space since the domain is R+. Specifically, set log ξ_σ = log σ + υ, where υ ∼ N(0, σ_σ^2) for some σ_σ^2. Then ξ_σ | σ is log-normal, and the proposal ratio contributes a factor q(σ; ξ_σ)/q(ξ_σ; σ) = ξ_σ/σ. The MH acceptance ratio under the approximate likelihood is therefore

A_σ(σ, ξ_σ) = min{1, (ξ_σ/σ) [p_σ(ξ_σ) L(x | x_A, µ, ξ_σ, d)] / [p_σ(σ) L(x | x_A, µ, σ, d)]}.

When using the exact likelihood, (13) gives the analogous expression.

The MH algorithm, applied alternately in a Metropolis-within-Gibbs fashion to the parameters µ and σ, works well. However, actual Gibbs sampling is an efficient alternative in this two-parameter case (i.e., for known d). Since inference for d is a primary goal, we have relegated a derivation of the resulting updates to Appendix A.
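The two Metropolis-within-Gibbs updates can be sketched as below. For a runnable demonstration the likelihood is an i.i.d. Gaussian one (i.e. d = 0) under flat priors; in the paper's setting `loglik` would be the exact or approximate ARFIMA likelihood, and the step sizes here are illustrative:

```python
import numpy as np

# Sketch: Metropolis-within-Gibbs for mu (symmetric Gaussian RW) and
# sigma (Gaussian RW on log sigma, contributing a Jacobian factor
# prop/sigma to the acceptance ratio). Flat priors for simplicity.

rng = np.random.default_rng(1)

def loglik(x, mu, sigma):
    return np.sum(-np.log(sigma) - 0.5 * np.log(2 * np.pi)
                  - (x - mu) ** 2 / (2 * sigma ** 2))

def mwg_chain(x, n_iter, s_mu=0.3, s_sig=0.15):
    mu, sigma = 0.0, 1.0
    out = np.empty((n_iter, 2))
    for i in range(n_iter):
        # mu update: symmetric proposal, so the ratio is likelihood only
        prop = mu + s_mu * rng.standard_normal()
        if np.log(rng.random()) < loglik(x, prop, sigma) - loglik(x, mu, sigma):
            mu = prop
        # sigma update: RW on the log scale; q-ratio adds log(prop/sigma)
        prop = sigma * np.exp(s_sig * rng.standard_normal())
        if np.log(rng.random()) < (loglik(x, mu, prop) - loglik(x, mu, sigma)
                                   + np.log(prop / sigma)):
            sigma = prop
        out[i] = mu, sigma
    return out
```

The `log(prop/sigma)` term is exactly the log-normal proposal-ratio correction ξ_σ/σ described above; omitting it biases the sampler.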
Update of d: Updating the memory parameter d is far less straightforward than either µ or σ. Regardless of the innovations' distribution, the conditional posterior is not amenable to Gibbs sampling. We use RW proposals from a Gaussian truncated to the stationary and invertible range:

ξ_d | d ∼ N_{(−1/2, 1/2)}(d, σ_d^2),    (18)

sampled by rejection from the unconstrained N(d, σ_d^2). Although this may seem inefficient, it is perfectly acceptable: for example, if σ_d = 0.5 the expected number of required variates is still less than 2, regardless of d. More refined methods of directly sampling from truncated normal distributions exist, see for example Robert (1995), but we find little added benefit in our context.
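The rejection sampler and the proposal-ratio correction can be sketched as follows; the truncation masses do not cancel in the MH ratio because they depend on the proposal's centre:

```python
import math, random

# Sketch: RW proposal for d from a Gaussian truncated to (-1/2, 1/2),
# sampled by simple rejection, plus the proposal-ratio term
# q(d; xi_d) / q(xi_d; d) needed in the MH acceptance ratio.

def norm_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def trunc_mass(centre, s, lo=-0.5, hi=0.5):
    """Normalising constant of N(centre, s^2) truncated to (lo, hi)."""
    return norm_cdf((hi - centre) / s) - norm_cdf((lo - centre) / s)

def propose_d(d, s, rng=random):
    while True:            # rejection: expected < 2 draws when s = 0.5
        xi = rng.gauss(d, s)
        if -0.5 < xi < 0.5:
            return xi

def log_q_ratio(d, xi, s):
    """log[q(d; xi) / q(xi; d)]: Gaussian kernels cancel, masses do not."""
    return math.log(trunc_mass(d, s)) - math.log(trunc_mass(xi, s))
```

Since q(ξ_d; d) = φ((ξ_d − d)/σ_d) / [σ_d Z(d)] with Z(·) the truncation mass, the symmetric kernels cancel in q(d; ξ_d)/q(ξ_d; d), leaving Z(d)/Z(ξ_d) as computed by `log_q_ratio`.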
A useful cancellation in q(d; ξ_d)/q(ξ_d; d), obtained from (18), yields

q(d; ξ_d)/q(ξ_d; d) = [Φ((1/2 − d)/σ_d) − Φ((−1/2 − d)/σ_d)] / [Φ((1/2 − ξ_d)/σ_d) − Φ((−1/2 − ξ_d)/σ_d)],

the ratio of the truncation masses, since the symmetric Gaussian kernels themselves cancel. Then, in the approximate case, the acceptance ratio is

A_d(d, ξ_d) = min{1, [q(d; ξ_d) p_d(ξ_d) L(x | x_A, µ, σ, ξ_d)] / [q(ξ_d; d) p_d(d) L(x | x_A, µ, σ, d)]};

in the exact likelihood case, (13) gives the analogous expression.

Optional update of x_A: When using the approximate likelihood method, one must account for the auxiliary variables x_A, a P-vector (where P = n is sensible). We find that, in practice, it is not necessary to update all the auxiliary parameters at each iteration. In fact the method can be shown to work perfectly well, empirically, if we never update them, provided they are given a sensible initial value (such as the sample mean x̄ of the observed data). This is not an uncommon tactic in the long memory (big-n) context (e.g., Beran, 1994a); for further discussion refer to Graves (2013, Appendix C).
For a full MH approach, we recommend an independence sampler which 'backward projects' the observed time series. Specifically, first relabel the observed data in reverse order as y_1, . . . , y_n, and then take the proposed x_A, denoted ξ_{x_A}, as the reverse sequence: ξ_{x_{−i}} = y_{i+1}, i = 0, . . . , n − 1, with the coefficients {π_k} determined by the current value of the memory parameter(s). Since this is an independence sampler, calculation of the acceptance probability is straightforward: it is only necessary to evaluate the proposal density q(ξ_{x_A} | x, ψ), which is easy using the results from Section 3.2. For simplicity, we prefer a uniform prior for x_A.
Besides simplicity, justification for this approach lies primarily in its preservation of the autocorrelation structure; this is clear since the ACF is symmetric in time. The proposed vector has a low acceptance rate, but potential remedies (e.g., multiple-try methods) seem unnecessarily complicated given the success of the simpler method.

Extensions to accommodate short memory
Simple ARFIMA(0, d, 0) processes are mathematically convenient but have limited practical applicability because the entire memory structure is determined by just one parameter, d. Although d is often of primary interest, it may be unrealistic to assume no short memory effects. This issue is often implicitly acknowledged: semi-parametric estimation methods, such as those used as comparators in Section 6.1, are motivated by a desire to circumvent the problem of precisely specifying (and inferring) the form of short memory (i.e., the values of p and q). Full parametric Bayesian modelling of ARFIMA(p, d, q) processes represents an essentially untried alternative, primarily due to computational challenges. Related, more discrete, alternatives show potential: Pai and Ravishanker (1998) considered all four models with p, q ≤ 1, whereas Koop et al. (1997) considered all sixteen with p, q ≤ 3.
Such approaches, especially ones allowing larger p, q, can be computationally burdensome, as much effort is spent modelling unsuitable processes towards a goal (inferring p, q) which is not of primary interest (d is). To develop an efficient, fully parametric, Bayesian method of inference that properly accounts for varying models, and to marginalise out these nuisance quantities, we use reversible-jump (RJ) MCMC (Green, 1995). We extend the parameter space to include the set of models (p and q), with chains moving between and within models, and focus on the marginal posterior distribution of d obtained by (Monte Carlo) integration over all models and the parameters therein. RJ methods have previously been applied to both auto-regressive models (Vermaak et al., 2004) and full ARMA models (Ehlers and Brooks, 2006, 2008). In the long memory context, Holan et al. (2009) applied RJ to FEXP processes. However, for ARFIMA the only related work we are aware of is by Egrioglu and Günay (2010), who demonstrated a promising if limited alternative.
Below we show how the likelihood may be calculated with extra short-memory components when p and q are known, and subsequently how Bayesian inference can be applied in this case. Then, the more general case of unknown p and q via RJ is described.

Likelihood derivation and inference for known short memory
Recall that short memory components of an ARFIMA process are defined by the AR and MA polynomials, Φ and Θ respectively (see Section 2). Here, we distinguish between the polynomial, Φ, and the vector of its coefficients, φ = (φ_1, . . . , φ_p). When the polynomial degree is required explicitly, bracketed superscripts will be used: Φ^{(p)}, φ^{(p)}, Θ^{(q)}, θ^{(q)}, respectively.
To focus the exposition, consider the simple yet useful ARFIMA(1, d, 0) model, where the full memory parameter is ω = (d, φ_1). Because the parameter spaces of d and φ_1 are independent, it is simplest to update each of these parameters separately: d with the methods of Section 4, and φ_1 similarly via ξ_{φ_1} | φ_1 ∼ N_{(−1,1)}(φ_1, σ_{φ_1}^2), for some σ_{φ_1}^2. In practice, however, the posteriors of d and φ_1 typically exhibit significant correlation, so independent proposals are inefficient. One solution would be to reparametrise to some d* and an orthogonal φ*, but the interpretation of d* would not be clear. An alternative to explicit reparametrisation is to update the parameters jointly, but in such a way that proposals are aligned with the correlation structure. This will ensure a reasonable acceptance rate and mixing.
To propose parameters in the manner described above, a two-dimensional, suitably truncated Gaussian random walk, with covariance matrix aligned with the posterior covariance, is required. Making proposals of this sort, and indeed for arbitrary ω in larger p and q cases, requires sampling from a hypercuboid-truncated MVN, N_r^{(a,b)}(ω, Σ_ω), where (a, b) describe the coordinates of the hypercuboid. We find that rejection sampling, based on unconstrained but similarly parameterised MVN samples [e.g., using mvtnorm (Genz et al., 2012)], works well, because in the RW setup the mode of the distribution always lies inside the hypercuboid. Returning to the specific ARFIMA(1, d, 0) case, clearly r = 2, b = (0.5, 1) and a = −b is appropriate. Calculation of the MH acceptance ratio A_ω(ω, ξ_ω) is straightforward; it simply requires numerical evaluation of Φ_r(·; ·, Σ_ω), e.g., via mvtnorm, since the hypercuboid normalisation terms do not cancel. We find that initial values φ_1^{[0]} chosen uniformly in C_1, i.e. the interval (−1, 1), and d^{[0]} chosen systematically from {−0.4, −0.2, 0, 0.2, 0.4}, work well. Any choice of prior for ω can be made, although we prefer flat (proper) priors.
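The rejection step for the hypercuboid-truncated MVN can be sketched as below; the covariance and centre values are illustrative, not from the paper. Because the RW proposal is centred at the current state, which always lies inside the box, the rejection rate stays modest:

```python
import numpy as np

# Sketch: rejection sampling from the hypercuboid-truncated MVN
# N_r^{(a,b)}(omega, Sigma_omega) by drawing from the unconstrained MVN
# until the draw lands inside the box.

rng = np.random.default_rng(2)

def rtmvn(omega, Sigma, a, b, max_tries=10000):
    for _ in range(max_tries):
        z = rng.multivariate_normal(omega, Sigma)
        if np.all(z > a) and np.all(z < b):
            return z
    raise RuntimeError("rejection rate unexpectedly high")

# ARFIMA(1, d, 0) example: omega = (d, phi_1), box (-1/2, 1/2) x (-1, 1)
a, b = np.array([-0.5, -1.0]), np.array([0.5, 1.0])
Sigma = np.array([[0.01, -0.008],
                  [-0.008, 0.01]])   # negatively correlated proposal
draw = rtmvn(np.array([0.3, 0.6]), Sigma, a, b)
```

The negative off-diagonal term mimics aligning the proposal with a negatively correlated (d, φ_1) posterior, the motivation given above for joint updates.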
The only technical difficulty is the choice of proposal covariance matrix Σ_ω. Ideally, it would be aligned with the posterior covariance; however, this is not known a priori. We find that running a 'pilot' chain with independent proposals via N_r^{(a,b)}(ω, σ_ω^2 I_r) can help choose a Σ_ω. A rescaled version of the sample covariance matrix from the pilot posterior chain, following Roberts and Rosenthal (2001), works well [see Section 6.2].

Unknown short memory form
We now expand the parameter space to include models M ∈ M, the set of ARFIMA models with p and q short memory parameters, indexing the size of the parameter space Ψ^{(M)}. For our 'transdimensional moves' we only consider adjacent models, about which we will be more specific later. For now, note that the choice of bijective function mapping between model spaces (whose Jacobian term appears in the acceptance ratio) is crucial to the success of the sampler. To illustrate, consider transforming from Φ^{(p+1)} ∈ C_{p+1} down to Φ^{(p)} ∈ C_p. This turns out to be a non-trivial problem because, for p > 1, C_p has a very complicated shape. The most natural map would be (φ_1, . . . , φ_p, φ_{p+1}) → (φ_1, . . . , φ_p); however, there is no guarantee that the image will lie in C_p. Even if the model dimension is fixed, difficulties are still encountered: a natural proposal method would be to update each component of φ separately but, because of the awkward shape of C_p, the 'allowable' values for each component are a complicated function of the others. Nontrivial proposals are required.
A potential approach is to reparametrise in terms of the inverse roots (poles) of Φ, as advocated by Ehlers and Brooks (2006, 2008): writing Φ(z) = Π_{i=1}^p (1 − α_i z), we have that φ^{(p)} ∈ C_p ⇔ |α_i| < 1 for all i. This looks attractive because it transforms C_p into D^p = D × · · · × D (p times), where D is the open unit disc, which is easy to sample from. But this method has serious drawbacks when we consider the RJ step. To decrease dimension, the natural map would be to remove one of the roots from the polynomial. But because it is assumed that Φ has real coefficients (otherwise the model has no realistic interpretation), any complex α_i must appear in conjugate pairs. There is then no obvious way to remove a root; a contrived method might be to remove a conjugate pair and replace it with a real root of the same modulus, but it is unclear how this new polynomial is related to the original, and to other aspects of the process, like the ACV.
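The map between the coefficients φ and the inverse roots α_i can be sketched numerically; the example AR(2) values below are illustrative. Since Φ(z) = Π_i (1 − α_i z), the α_i are the roots of the reversed polynomial z^p − φ_1 z^{p−1} − · · · − φ_p:

```python
import numpy as np

# Sketch: moving between AR coefficients phi and the inverse roots
# alpha_i of Phi(z) = prod_i (1 - alpha_i z). Stationarity/causality of
# the AR part is then simply |alpha_i| < 1 for all i, i.e. alpha in D^p.

def phi_to_alpha(phi):
    # alpha_i are the roots of z^p - phi_1 z^{p-1} - ... - phi_p
    return np.roots(np.concatenate([[1.0], -np.asarray(phi, dtype=float)]))

def alpha_to_phi(alpha):
    # np.poly rebuilds monic coefficients [1, c_1, ..., c_p]; phi_k = -c_k
    return -np.poly(alpha)[1:].real

phi = [0.2, 0.15]                  # Phi(z) = (1 - 0.5 z)(1 + 0.3 z)
alpha = phi_to_alpha(phi)          # inverse roots {0.5, -0.3}
print(np.all(np.abs(alpha) < 1))   # True: phi lies in C_2
print(alpha_to_phi(alpha))         # recovers [0.2, 0.15]
```

The round trip also illustrates the conjugate-pair constraint noted above: `np.poly` only returns real coefficients when complex roots are supplied in conjugate pairs.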
Besides mathematical convenience, this bijection has a very useful property (cf. Kay and Marple, 1981) which helps motivate its use in defining RJ maps. In Appendix B we show that the ACFs of the processes defined by φ^{(p)} and φ^{(p−1)} are identical. In other words, if d = q = 0, using this parametrisation for ϕ when moving between different values of p allows one to automatically propose processes that have very closely matching ACFs at low lags. In the MCMC context this is useful because it allows the chain to propose models that have a similar correlation structure to the current one. Although this property is nice, it may be of limited value for full ARFIMA models, since the proof of the main result [Theorem 1] does not easily lend itself to the inclusion of either an MA or a long memory component. Nevertheless, our empirical results indicate a similar 'near-match' for full ARFIMA(p, d, q) models.

Application of RJ MCMC to ARFIMA(p, d, q) processes
We now use this reparametrisation to efficiently propose new parameter values. First, it is necessary to propose a new memory parameter whilst keeping the model fixed. Attempts at updating each component individually suffer from the same problems of excessive posterior correlation that were encountered in Section 5.1. Therefore the simultaneous update of the entire r = (p + q + 1)-dimensional memory parameter ϖ = (d, ϕ, ϑ) is performed using the hypercuboid-truncated Gaussian distribution of Section 5.1: ξ_ϖ | ϖ ∼ N_r^{H_r}(ϖ, Σ_ϖ), where H_r defines the r-dimensional hypercuboid. The covariance matrix Σ_ϖ is discussed in some detail below. The choice of prior p_ϖ(·) is arbitrary. Pai and Ravishanker (1998) used a uniform prior for ω, which has an explicit expression in this parameterisation (Monahan, 1984). However, their expression is unnecessarily complicated, since a uniform prior over Ω holds no special interpretation. We therefore prefer a uniform prior over the reparametrised space Ω̄: p_ϖ(ϖ) ∝ 1, ϖ ∈ Ω̄.

Now consider the 'between-models' transition. We must first choose a model prior p_M(·). A variety of priors are possible; the simplest option would be a uniform prior over M, but this would of course be improper. In practice we may want to restrict the possible values of p, q to 0 ≤ p ≤ P and 0 ≤ q ≤ Q for some P, Q (say 5), which would render the uniform prior proper. However, even in this formulation, a lot of prior weight is put onto complicated models which, in the interests of parsimony, might be undesired. We prefer a truncated joint Poisson distribution with parameter λ:

p_M(p, q) ∝ [λ^{p+q} / (p! q!)] I(p ≤ P, q ≤ Q).

Now denote the probability of jumping from model M_{p,q} to model M_{p′,q′} by U_{(p,q),(p′,q′)}. U could allocate non-zero probability to every model pair, but for convenience we severely restrict the possible jumps (whilst retaining irreducibility) using a two-dimensional bounded birth-and-death process.
Consider the subgraph of Z²: G = {(p, q) : 0 ≤ p ≤ P, 0 ≤ q ≤ Q}, and allocate uniform non-zero probability only to neighbouring values, i.e. U_{(p,q),(p′,q′)} > 0 if and only if |p − p′| + |q − q′| = 1. Each point in the 'body' of G has four neighbours; each point on the 'line boundaries' has three; and each of the four 'corner points' has only two. The model transition probabilities U_{(p,q),(p′,q′)} are therefore either 1/4, 1/3, 1/2, or 0. Now suppose the current (p + q + 3)-dimensional parameter is ψ^{(p,q)} = (µ, σ, d, ϕ^{(p)}, ϑ^{(q)}), using a slight abuse of notation. Because the mathematical details of the AR and MA components are almost identical, we consider only the case of increasing or decreasing p by 1; all of the following remains valid if p is replaced by q and ϕ by ϑ. We therefore seek to propose a parameter ξ^{(p+1,q)} = (ξ_µ, ξ_σ, ξ_d, ξ_ϕ, ξ_ϑ) that is somehow based on ψ^{(p,q)}. We simplify by regarding the other three parameters (µ, σ, and d) as having the same interpretation in every model, choosing ξ_µ = µ, ξ_σ = σ, and ξ_d = d. For simplicity we also set ξ_ϑ = ϑ^{(q)}. To specify a bijection we 'dimension-match' by adding a random scalar u. The most obvious map specifies u with support (−1, 1) and sets ξ_ϕ = (ϕ^{(p)}, u). In other words, we either add or remove the final parameter whilst keeping all others fixed via the identity map, so the Jacobian is unity. The proposal q(u | ψ^{(p,q)}) can be made in many ways; we prefer the simple U(−1, 1). With these choices the RJ acceptance ratio takes a simple form that applies to both the increasing and decreasing dimensional moves. Construction of Σ: much of the efficiency of the above scheme, for both within- and between-model moves, depends on the choice of Σ ≡ Σ^{(p,q)}, the covariance matrix of the within-model random walk proposal. We first seek an appropriate Σ^{(1,1)}, as in section 5.1, via a pilot tuning scheme.
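The bounded birth-and-death proposal on the grid G can be sketched directly, assuming the uniform-over-neighbours rule just described (identifiers are ours):

```python
def neighbours(p, q, P=5, Q=5):
    """Models reachable from M_{p,q}: unit steps in p or q, staying in G."""
    cand = [(p - 1, q), (p + 1, q), (p, q - 1), (p, q + 1)]
    return [(a, b) for a, b in cand if 0 <= a <= P and 0 <= b <= Q]

def jump_prob(p, q, p2, q2, P=5, Q=5):
    """U_{(p,q),(p',q')}: uniform over the neighbours of (p, q), zero otherwise."""
    nb = neighbours(p, q, P, Q)
    return 1.0 / len(nb) if (p2, q2) in nb else 0.0
```

Corner points give probability 1/2, boundary points 1/3, interior points 1/4, exactly as in the text.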
That matrix is 'blocked out' (with each block a scalar) so that the idea extends to the (p, q) case in the obvious way, where Σ_{ϕ(p),ϕ(p)} is a p × p matrix, Σ_{ϑ(q),ϑ(q)} is a q × q matrix, and so on. If either (or both) of p, q = 0 then the relevant blocks are simply omitted. To specify the various sub-matrices, we propose ϕ_2, …, ϕ_p with equal variances, independently of d, ϕ_1 and ϑ_1 (and similarly for ϑ_2, …, ϑ_q). In the context of (22), the dotted lines indicate further blocking, 0 is a row-vector of zeros, and O is a zero matrix. This choice of Σ is conceptually simple, computationally easy, and preserves positive-definiteness as required; the latter follows from a simple relabelling of the rows/columns and repeated application of Theorems 2 and 3 in Appendix B.
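The blocked construction of Σ^{(p,q)} can be sketched as follows, assuming p, q ≥ 1 and the ordering (d, ϕ_1, …, ϕ_p, ϑ_1, …, ϑ_q); the 3 × 3 pilot-tuned block and the equal variances below are illustrative placeholders, not values from the paper:

```python
import numpy as np

def extend_cov(core3, p, q, v_phi, v_theta):
    """Blocked proposal covariance Sigma^(p,q) (assumes p, q >= 1).
    `core3` is the pilot-tuned 3x3 covariance of (d, phi_1, theta_1);
    phi_2..phi_p and theta_2..theta_q get independent equal variances
    v_phi and v_theta, with zero cross-covariances."""
    r = 1 + p + q                       # ordering: (d, phi_1..phi_p, theta_1..theta_q)
    S = np.zeros((r, r))
    i = [0, 1, 1 + p]                   # positions of d, phi_1, theta_1
    S[np.ix_(i, i)] = np.asarray(core3)
    for j in range(2, p + 1):           # phi_2 .. phi_p
        S[j, j] = v_phi
    for j in range(p + 2, p + q + 1):   # theta_2 .. theta_q
        S[j, j] = v_theta
    return S

# illustrative numbers only
S = extend_cov(np.diag([0.04, 0.09, 0.09]), p=3, q=2, v_phi=0.05, v_theta=0.05)
np.linalg.cholesky(S)                   # succeeds: S is positive definite
```

The final Cholesky factorisation mirrors the positive-definiteness argument of Theorems 2 and 3.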

Empirical illustration and comparison
Here we provide empirical illustrations for the methods above: for classical and Bayesian analysis of long memory models, and extensions for short memory. To ensure consistency throughout, the location and scale parameters will always be chosen as µ_I = 0 and σ_I = 1. Furthermore, unless stated otherwise, the simulated series will be of length n = 2^10 = 1024. This is a reasonable size for many applications; it is equivalent to just over 85 years of monthly observations. When using the approximate likelihood method we set P = n. Unless otherwise stated, the priors used will be the simple defaults suggested in the previous sections.

Long memory
We begin by demonstrating that the approximate likelihood of section 3.2 is accurate. We then conduct a Monte Carlo study varying the length of the input, n. Finally, we compare the Bayesian point estimates with common non- and semi-parametric alternatives. Standard MCMC diagnostics were used throughout to ensure, and tune for, good mixing. Because d is the parameter of primary interest, the initial values d^[0] are chosen to systematically cover its parameter space, usually starting five chains at the regularly-spaced points {−0.4, −0.2, 0, 0.2, 0.4}. Initial values for the other parameters are not varied: µ starts at the sample mean x̄, and σ at the sample standard deviation of the observed series x. When using the approximate likelihood method, setting each element of x_A to x̄ turns out to be a sufficiently good strategy. For other MCMC particulars, see Graves (2013, §4.3.3).

Efficacy of approximate likelihood method
Start with the 'null case': how does the algorithm perform when the data are not from a long memory process? One hundred independent ARFIMA(0, 0, 0), i.e. Gaussian white noise, processes are simulated, from which marginal posterior means, standard deviations, and credibility interval endpoints are extracted. Table 1 shows averages over the runs. The average estimate for each of the three parameters is less than a quarter of a standard deviation away from the truth. Credibility intervals are nearly symmetric about the estimate, and the marginal posteriors are, to a good approximation, locally Gaussian (not shown). Upon applying a proxy 'credible-interval-based hypothesis test', one would conclude in ninety-eight of the cases that d = 0 could not be ruled out. A similar analysis for µ and σ shows that the hypotheses µ = 0 and σ = 1 would each have been accepted ninety-six times. These results indicate that the 95% credibility intervals are approximately correctly sized.
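The proxy 'credible-interval-based hypothesis test' amounts to checking whether a central credibility interval excludes the hypothesised value; a minimal sketch operating on posterior samples (the helper is ours, not from the paper):

```python
import numpy as np

def ci_excludes(samples, value, level=0.95):
    """Proxy 'credible-interval-based hypothesis test': True if the central
    `level` credibility interval of the posterior samples excludes `value`."""
    lo, hi = np.quantile(samples, [(1.0 - level) / 2.0, (1.0 + level) / 2.0])
    return not (lo <= value <= hi)
```

Applied to the MCMC draws of d, this reproduces the "d = 0 could not be ruled out" decision above.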
Next, consider the more interesting case of d_I ≠ 0. We repeat the above experiment, except that ten processes are generated with d_I set to each of {−0.45, −0.35, …, 0.45}, giving 100 series in total. Figure 1 shows a graphical analogue of the results from this experiment.
The plot axes involve a Bayesian residual estimate of d, d_R = d^(B) − d_I. We observe from plot (a) that the estimate of σ also appears to be unaffected by the input value d_I. The situation is different, however, in plot (b) for the location parameter µ: although the bias appears to be roughly zero for all d_I, the posterior variance is clearly affected by d_I.
To ascertain the precise functional dependence, consider plot (c), which shows, on a semi-log scale, the marginal posterior standard deviation of µ, σ_µ^(B), against d_I.
It appears that σ_µ^(B) is a function of d_I; specifically, σ_µ^(B) ∝ A^{d_I} for some A. The constant A could be estimated via least-squares regression. Instead, however, inspired by asymptotic results in the literature concerning classical estimation of long memory processes (Beran, 1994b), we set A = n and plotted the best-fitting such line (shown in plot (c)). Observe that, although the fit is not exact, the relation σ_µ^(B) ∝ n^{d_I} holds reasonably well for d_I ∈ (−1/2, 1/2). Indeed, Beran motivated long memory in this way, and derived asymptotic consistency results for optimal (likelihood-based) estimators: the standard error of µ is proportional to n^{d−1/2} (theorem 8.2), whereas the standard errors of all other parameters are proportional to n^{−1/2} (theorem 5.1).
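The scaling σ_µ^(B) ∝ n^{d−1/2} can be explored empirically by simulating ARFIMA(0, d, 0) series directly. A hedged sketch uses the truncated MA(∞) representation of (1 − B)^{−d}, whose coefficients satisfy ψ_0 = 1 and ψ_j = ψ_{j−1}(j − 1 + d)/j; the simulator below is illustrative, not the paper's implementation:

```python
import numpy as np

def psi_weights(d, m):
    """First m MA(infinity) coefficients of (1 - B)^(-d):
    psi_0 = 1, psi_j = psi_{j-1} * (j - 1 + d) / j."""
    psi = np.ones(m)
    for j in range(1, m):
        psi[j] = psi[j - 1] * (j - 1 + d) / j
    return psi

def sim_fd(n, d, rng):
    """Truncated-MA simulation of an ARFIMA(0, d, 0) series of length n."""
    psi = psi_weights(d, n)
    eps = rng.standard_normal(2 * n)    # extra innovations act as burn-in
    return np.convolve(eps, psi, mode="valid")[:n]

x = sim_fd(1024, 0.3, np.random.default_rng(0))
```

Regressing the log standard deviation of the sample mean on log n across replications would then reveal the n^{d−1/2} slope.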
To study the impact of the improper priors for µ and σ, we twice repeated the analysis of the 100 ARFIMA(0, 0, 0) series above, changing p_µ(·) in the first instance and p_σ(·) in the second. When using p_µ(·) = N(0, 100²), the maximum difference between estimates was 0.0012 for d, 0.0017 for σ, and 0.0019 for µ. When using p_σ(·) = R(0.01, 0.01), the corresponding values were 0.0020 for d, 0.0017 for σ, and 0.0027 for µ. Since these maximum differences are well below the Monte Carlo error, we conclude that there is practically no difference between using an improper prior and a vague proper one.

Comparison of likelihood methods
To compare the methods based on the approximate and exact likelihoods we consider fifty ARFIMA(0, 0, 0) series generated as described above. The output summary statistics are presented in a table within figure 3, which is accompanied by a useful visualisation.

Figure 3: Table: comparison of posterior summary statistics for ARFIMA(0, 0, 0) processes obtained via the approximate and exact likelihood methods, averaged over 50 runs. Plot: comparison of 95% credibility intervals for d from ARFIMA(0, 0, 0) processes obtained via the approximate and exact likelihood methods; black for approximate, red for exact, with crosses marking the respective point estimates.

Observe that both methods produce highly correlated point estimates and credibility interval endpoints. The plot shows that the posteriors of d are similar; there is only a very slight 'shift' between the approximate and exact results, implying that the approximate method consistently underestimates d (by less than 0.01). The credibility intervals have excellent frequentist coverage properties: 48 (96%) of the 95%-level intervals contain the input d_I. We also found (not shown) that µ and σ exhibit the same pattern of behaviour.
To check the similarity between the approximate and exact likelihood methods across the entire parameter space, we repeated the ARFIMA(0, d, 0) simulations with varying d (analysed in figure 1). The results for the residuals d_R are presented in figure 4. We observe the same pattern here as in the d_I = 0 case. Note that the estimates of d obtained using the two methods appear to be slightly closer for negative d_I than for positive d_I, with very small discrepancies (0.01) even for the largest d_I = 0.45. This suggests that the approximate method generally performs better for smaller d_I.

Effect of varying time series length
We now analyse the effect of changing the time series length. For this we conduct a similar experiment but fix d_I = 0 and vary n. Observe that all three marginal posterior standard deviations are proportional to 1/√n, although the posterior of µ is less 'reliable'. Combining these observations with our earlier deduction that σ_µ^(B) ∝ n^{d_I}, we conclude that for an ARFIMA(0, d_I, 0) process of length n, the marginal posterior standard deviations follow those of Beran given previously.

Influence of prior
Throughout, uninformative priors have been used by default. However, one of the principal advantages of the Bayesian approach is the ability to include genuine prior information. To explore the effect of prior variations we performed a series of tests in which p_d(d) ∝ N(0, 0.15²) in an analysis of ARFIMA(0, 0.25, 0) processes. Note that the true value d = 0.25 lies outside the central 90% of the prior distribution, with the density there being less than a quarter of that at the maximum (i.e. at d = 0). For each length n = 2^7, 2^8, 2^9, 2^10, ten series were analysed and compared with the equivalent analyses under the flat prior.
As expected, in each case the Bayesian estimate d^(B) is always lower when the Gaussian prior is used than with the flat prior. However, the average difference between the two estimates shows a clear inverse relationship with n; see the accompanying table. For small n the average difference is fairly significant, but for n = 1024 the difference is below the Monte Carlo error, as the likelihood has overwhelmed the prior. For a graphical demonstration of the convergence of the posteriors, refer to the plots in figure 6. Plot (a) shows that for n = 128 the two posteriors are quite clearly distinct, whereas plot (b) shows that for n = 1024 they have essentially coincided.

Comparison with common estimators
In many practical applications the long memory parameter is estimated using non- or semi-parametric methods. These may be appropriate in situations where the exact form of the underlying process is unknown. However, when a specific model form is known (or at least assumed) they tend to perform poorly compared with fully parametric alternatives (Franzke et al., 2012). Our aim here is to demonstrate, via a short Monte Carlo study involving ARFIMA(0, d, 0) data, that our Bayesian likelihood-based method significantly outperforms other common methods in that case. We consider the following comparators: (i) the rescaled adjusted range, or R/S (Hurst, 1951; Graves, 2013), using the R implementation in the FGN package (McLeod et al., 2007); (ii) the semi-parametric Geweke–Porter-Hudak (GPH) method (Geweke and Porter-Hudak, 1983), implemented in the R package fracdiff (Fraley et al., 2012); (iii) detrended fluctuation analysis (DFA), originally devised by Peng et al. (1994), in the R package PowerSpectrum (Vyushin et al., 2009); and (iv) wavelet-based semi-parametric estimators (Abry et al., 2003), available in the R package fARMA (Wuertz, 2012). Each of these four methods is applied to the same 100 time series with varying d_I used in the earlier experiments. We extend the idea of a residual d_R to each of these methods; the R/S estimator, moreover, is positively biased. Indeed, the bias in some cases appears to depend on d_I: R/S is significantly (i.e. > 0.25) biased for d_I < −0.3 but slightly negatively biased for d > 0.3 (not shown); DFA is only unbiased for d_I > 0; and both the GPH and wavelet methods are unbiased for all d ∈ (−1/2, 1/2).
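For concreteness, the GPH comparator is a log-periodogram regression over the first m Fourier frequencies; a rough sketch follows (the bandwidth choice and other details are ours, not those of the fracdiff implementation):

```python
import numpy as np

def gph_estimate(x, m=None):
    """Log-periodogram (GPH) regression: regress log I(lambda_j) on
    -log(4 sin^2(lambda_j / 2)) over the first m Fourier frequencies;
    the slope then estimates d."""
    n = len(x)
    m = int(np.sqrt(n)) if m is None else m   # a common bandwidth choice
    j = np.arange(1, m + 1)
    lam = 2.0 * np.pi * j / n
    # periodogram at the first m non-zero Fourier frequencies
    I = np.abs(np.fft.fft(x - x.mean())[1:m + 1]) ** 2 / (2.0 * np.pi * n)
    X = -np.log(4.0 * np.sin(lam / 2.0) ** 2)
    return np.polyfit(X, np.log(I), 1)[0]     # slope = estimate of d
```

Being semi-parametric, this uses only the lowest frequencies, which is precisely why it is outperformed by a fully parametric likelihood when the ARFIMA form is known.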

Extensions for short memory
Known form: We first consider the MCMC algorithm from section 5.1 for sampling under an ARFIMA(1, d, 0) model, where the full memory parameter is ω = (d, φ_1). Recall that this method involved proposals from a hypercuboid-truncated MVN using a pilot-tuned covariance matrix, and that it is a special case of the re-parametrised method from section 5.2. In general this method works very well; two example outputs are presented in figure 8, under two similar data generating mechanisms. Plot (a) shows relatively mild correlation (ρ = 0.21) compared with (b), which shows strong correlation (ρ = 0.91). This differential behaviour can be explained heuristically by considering the differing data-generating values. For the process in plot (a), the short memory and long memory components exhibit their effects at opposite ends of the spectrum; see figure 9(a). The resulting ARFIMA spectrum, with peaks at either end, makes it easy to distinguish between short and long memory effects, and consequently the posteriors of d and φ are largely uncorrelated. In contrast, the parameters of the process in plot (b) express their behaviour at the same end of the spectrum. With negative d these effects partially cancel each other out, except very near the origin where the negative memory effect dominates; see figure 9(b). Distinguishing between the effects of φ and d is much more difficult in this case, and consequently the posteriors are much more dependent. In cases where there is significant correlation between d and φ, it arguably makes little sense to consider only the marginal posterior distribution of d. For example, the 95% credibility interval for d from plot (b) is (−0.473, −0.247), and the corresponding interval for φ is (−0.910, −0.753), yet these clearly give a rather pessimistic view of our joint knowledge about d and φ; see figure 9(c). In theory an ellipsoidal credibility set could be constructed, although this is clearly less practical when dim ω > 2.
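The spectral reasoning above can be made concrete via the ARFIMA(1, d, 0) spectral density, f(λ) = (σ²/2π)(2 sin(λ/2))^{−2d} / |1 − φ e^{−iλ}|²; a small sketch for evaluating it (function name ours):

```python
import numpy as np

def arfima_spectrum(lam, d, phi=0.0, sigma2=1.0):
    """Spectral density of an ARFIMA(1, d, 0) process:
    f(lam) = sigma2/(2 pi) * (2 sin(lam/2))**(-2 d) / |1 - phi exp(-i lam)|**2."""
    lam = np.asarray(lam, dtype=float)
    long_mem = (2.0 * np.sin(lam / 2.0)) ** (-2.0 * d)       # fractional factor
    short_mem = np.abs(1.0 - phi * np.exp(-1j * lam)) ** 2   # AR(1) factor
    return sigma2 / (2.0 * np.pi) * long_mem / short_mem
```

Evaluating f near 0 and near π shows whether the short and long memory effects act at opposite ends of the spectrum (easy identification, as in plot (a)) or at the same end (confounding, as in plot (b)).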
Figure 9(c) shows posterior samples of (d, φ) from the series considered in pane (b), with credibility sets: red is the 95% credibility set for (d, φ), green the 95% credibility interval for d, and blue the 95% credibility interval for φ.
Unknown form: The RJ scheme outlined in section 5.2 works well for data simulated with p and q up to 3. The marginal posteriors for d are generally roughly centred around d_I (the data generating value), and the modal posterior model is usually the 'correct' one. To illustrate, consider again the two example data-generating contexts used above. Figure 10 shows how the densities obtained via the RJ method are very close to those obtained assuming p = 1 and q = 0. The former are slightly more heavy-tailed, reflecting a greater level of uncertainty about d. Interestingly, the corresponding plots for the posteriors of µ and σ do not appear to exhibit this effect; see figure 10(c)-(d). The posterior model probabilities are presented in table 2, showing that the 'correct' modes are picked up consistently. As a test of the robustness of the method, consider a complicated short memory input combined with a heavy-tailed α-stable innovations distribution. Specifically, we use the following ARFIMA(2, d, 1) process, where ε_t ∼ S_{1.75,0}.
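Symmetric α-stable innovations such as ε_t ∼ S_{1.75,0} can be generated with the Chambers–Mallows–Stuck method; a sketch for the symmetric, unit-scale case (β = 0; not the paper's code):

```python
import numpy as np

def stable_symmetric(alpha, size, rng):
    """Chambers-Mallows-Stuck sampler for symmetric alpha-stable draws
    S_{alpha, 0} with unit scale (0 < alpha <= 2)."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)   # uniform angle
    w = rng.exponential(1.0, size)                 # unit exponential
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))

eps = stable_symmetric(1.75, 1024, np.random.default_rng(1))
```

Feeding such draws through an ARFIMA recursion yields heavy-tailed series of the kind used in this robustness test.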
For more details, see Graves (2013, §7.1). The marginal posterior densities of d and α are presented in figure 11. Performance looks good despite the complicated structure. The posterior estimate for d is d^(B) = 0.22, with 95% CI (0.04, 0.41). Although this interval is admittedly rather wide, it is reasonably clear that long memory is present in the signal. The corresponding interval for α is (1.71, 1.88), with estimate α^(B) = 1.79. Finally, we see from table 3 that the algorithm is very rarely in the 'wrong' model.

The Nile Data: We conclude with an application of our methods to the famous annual Nile minima data. Because of the fundamental importance of the river to the civilisations it has supported, local rulers kept measurements of the annual maximal and minimal heights attained by the river at certain points (called gauges). The longest uninterrupted sequence of recordings is from the Roda gauge (near Cairo), between 622 and 1284 AD (n = 663). The posterior model probabilities are:

p\q          0      1      2      3      4      5   marginal
0        0.638  0.101  0.010  0.000  0.000  0.000      0.750
1        0.097  0.124  0.011  0.000  0.000  0.000      0.232
2        0.007  0.010  0.000  0.000  0.000  0.000      0.018
3        0.000  0.000  0.000  0.000  0.000  0.000      0.000
4        0.000  0.000  0.000  0.000  0.000  0.000      0.000
5        0.000  0.000  0.000  0.000  0.000  0.000      0.000
marginal 0.742  0.236  0.022  0.000  0.000  0.000

The model with the highest posterior probability is ARFIMA(0, d, 0), with d ≈ 0.4. This suggests a strong, 'pure', long memory feature. Our results compare favourably with other studies (Liseo et al., 2001; Hsu and Breidt, 2003; Ko and Vannucci, 2006a).

Discussion
We have provided a systematic treatment of efficient Bayesian inference for ARFIMA models, the most popular parametric class combining long and short memory effects. Through a mixture of theoretical and empirical work we have demonstrated that the methods can handle the sorts of time series data that are typically analysed with possible long memory in mind.
Many of the choices made throughout, in particular those leading to our likelihood approximation, stem from a need to accommodate further extension. For example, in future work we intend to extend the methods to cope with heavy-tailed innovations distributions. For more evidence of potential in this context, see Graves (2013, §7). Along similar lines, there is scope for further generalisation to incorporate seasonal (long memory) effects. Finally, an advantage of the Bayesian approach is that it provides a natural mechanism for dealing with missing data via data augmentation. This is particularly relevant for long historical time series which may, for a myriad of reasons, have recording gaps. For example, some of the data recorded at other gauges along the river Nile have missing observations, although they otherwise span a similarly long time frame. For a demonstration of how this might fit within our framework, see Graves (2013, §5.6).
A Gibbs sampling of µ and σ

Assuming Gaussianity, and the independent Gaussian and root-inverse-gamma priors (deliberately chosen to ensure prior-posterior conjugacy), Gibbs updates are possible for the parameters µ and σ. Demonstrating this requires the following result: if g(x) ∝ exp{−(1/2)(αx² − 2βx)} and ∫_R g(x) dx = 1, then g is the density of N(β/α, 1/α). Now let c̄ := (1/n) Σ_{t=1}^n c_t. Updating µ by combining its prior with the approximate (log-)likelihood (17) yields a conditional posterior of exactly this form, and our result then identifies it as Gaussian.
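The identity behind this result is simply completing the square:

```latex
\alpha x^2 - 2\beta x
  = \alpha\left(x - \frac{\beta}{\alpha}\right)^2 - \frac{\beta^2}{\alpha},
\qquad\text{so}\qquad
g(x) \propto \exp\left\{-\frac{\alpha}{2}\left(x - \frac{\beta}{\alpha}\right)^2\right\},
```

which, once normalised, is the N(β/α, 1/α) density with mean β/α and variance 1/α.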

B Preservation of ACF under bijection
Lemma 1. Let A be an M × M symmetric matrix, blocked as

A = [ B_{m,m}  B_{m,m′} ; B_{m′,m}  B_{m′,m′} ],

where M = m + m′ and B_{m′,m} = B_{m,m′}^t. Furthermore, let x ∈ R^M be block-divided as x^t = (y^t, z^t), where y ∈ R^m and z ∈ R^{m′}. Then:

x^t A x = y^t B_{m,m} y + 2 y^t B_{m,m′} z + z^t B_{m′,m′} z.    (28)
Theorem 2. Let Σ_M be an M × M positive-definite matrix, and let Σ_m be its leading m × m submatrix. Then Σ_m is also positive definite.

Proof. Σ_M is positive definite, so x^t Σ_M x > 0 for all x ∈ R^M. In particular this holds for all block-divided x^t = (y^t, 0^t), where y ∈ R^m. Then from (28):

0 < x^t Σ_M x = y^t Σ_m y + 2 y^t B_{m,m′} 0 + 0^t B_{m′,m′} 0 = y^t Σ_m y.

Clearly, therefore, y^t Σ_m y > 0 for all y ∈ R^m, i.e. Σ_m is positive definite.

Theorem 3. Suppose the m × m matrix Σ_m and the n × n matrix Σ_n are positive definite. Then the following (m + n) × (m + n) matrix is also positive definite:

Σ_{m,n} = [ Σ_m  O ; O  Σ_n ].

Proof. Let x ∈ R^{m+n} be partitioned as x^t = (y^t, z^t), where y ∈ R^m and z ∈ R^n. Then from (28), x^t Σ_{m,n} x = y^t Σ_m y + 2 y^t O z + z^t Σ_n z = y^t Σ_m y + z^t Σ_n z. Because both terms are non-negative, and at least one of y, z is non-zero whenever x ≠ 0, we have x^t Σ_{m,n} x > 0, i.e. Σ_{m,n} is positive definite.
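Theorems 2 and 3 are easy to sanity-check numerically; a sketch using a Cholesky attempt as the positive-definiteness test (helper names are ours):

```python
import numpy as np

def is_pd(a):
    """Positive-definiteness check via a Cholesky attempt."""
    try:
        np.linalg.cholesky(a)
        return True
    except np.linalg.LinAlgError:
        return False

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
S5 = A @ A.T + 5.0 * np.eye(5)           # a positive-definite 5x5 matrix
S3 = S5[:3, :3]                          # Theorem 2: leading submatrix
D = np.block([[S5, np.zeros((5, 3))],
              [np.zeros((3, 5)), S3]])   # Theorem 3: block-diagonal matrix
```

Both `S3` and `D` inherit positive definiteness from `S5`, exactly as the theorems assert.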