### abstract ###
Risk maps estimating the spatial distribution of infectious diseases are required to guide public health policy from local to global scales.
The advent of model-based geostatistics (MBG) has allowed these maps to be generated in a formal statistical framework, providing robust metrics of map uncertainty that enhance their utility for decision-makers.
In many settings, decision-makers require spatially aggregated measures over large regions such as the mean prevalence within a country or administrative region, or national populations living under different levels of risk.
Existing MBG mapping approaches provide suitable metrics of local uncertainty (the fidelity of predictions at each mapped pixel) but have not been adapted for measuring uncertainty over large areas, largely because of a series of fundamental computational constraints.
Here the authors present a new efficient approximating algorithm that can generate for the first time the necessary joint simulation of prevalence values across the very large prediction spaces needed for global scale mapping.
This new approach is implemented in conjunction with an established model for P. falciparum, allowing robust estimates of mean prevalence at any specified level of spatial aggregation.
The model is used to provide estimates of national populations at risk under three policy-relevant prevalence thresholds, along with accompanying model-based measures of uncertainty.
By overcoming previously unchallenged computational barriers, this study illustrates how MBG approaches, already at the forefront of infectious disease mapping, can be extended to provide large-scale aggregate measures appropriate for decision-makers.
### introduction ###
Risk maps estimating the spatial distribution of infectious diseases in relation to underlying populations are required to support public health decision-making at local to global scales CITATION CITATION.
The advancement of theory, increasing availability of computation and growing recognition of the importance of robust handling of uncertainty have all contributed to the emergence in recent years of a new paradigm in the mapping of disease: the use of a special family of generalised linear models known as model-based geostatistics (MBG), generally implemented in a Bayesian framework CITATION, CITATION.
MBG models take point observations of disease prevalence from dispersed survey locations and generate continuous maps by interpolating prevalence at unsampled locations across raster grid surfaces.
The most striking advantage of MBG in disease mapping is its handling of uncertainty.
Interpolating sparse, often imperfectly sampled, survey data to predict disease prevalence across wide regions results in inherently uncertain risk maps, with the level of uncertainty varying spatially as a function of the density, quality, and sample size of available survey data, and moderated by the underlying spatial variability of the disease in question.
MBG approaches allow these sources of uncertainty to be propagated to the final mapped output, predicting a probability distribution for the prevalence at each location of interest.
Where predictions are made with small uncertainty, these distributions will be tightly concentrated around a central value; where uncertainty is large they will be more dispersed.
These techniques have been used to generate robust and informative risk maps for malaria CITATION CITATION, as well as a range of other infectious diseases CITATION CITATION, at scales varying from national to global.
Some studies have extended the handling of variation through space to also include the temporal dimension, allowing disease risk to be modelled and quantified over time as well as space CITATION, CITATION .
Implementation of MBG models over even relatively small areas is extremely computationally expensive.
Not only are the matrix algebra operations required to generate predictions at each individual pixel costly compared to simpler interpolation methods CITATION, CITATION, but this cost must be multiplied many times because prediction uncertainty is evaluated by generating many, equally probable, realisations of prevalence at each pixel.
Implementations of MBG disease models over large areas therefore tend to be via per-pixel computation whereby complete maps are built up by generating predictive realisations for each pixel independently.
This allows the computational task to be broken down into many small, more easily manageable, operations.
Such an approach yields appropriate measures of local uncertainty: the set of realisations for each pixel represents a posterior predictive distribution of prevalence from which summary statistics such as the mean, inter-quartile range or 95 percent credible intervals can be readily extracted, providing the user with valid uncertainty information for each individual location considered in isolation.
There is often a need to evaluate disease prevalence aggregated across spatial regions, temporal periods, or combinations of both CITATION, CITATION.
This may be to quantify and compare mean prevalence between countries or administrative units, for example, or to measure a shift in mean prevalence between the start and end of an intervention period or policy change.
In addition, MBG prevalence models can be used to estimate derived quantities such as population totals living in regions at different levels of risk, or the burden of disease cases expected within individual countries or continents as a function of underlying prevalence CITATION, quantities that by definition exist only over aggregated space-time units.
It is not possible, however, to construct posterior distributions for these aggregate quantities using a per-pixel approach.
To estimate the mean of a region made up of multiple pixels, and the uncertainty around this estimate, the correlation between all the pixels in the region must be known.
In a per-pixel approach, each pixel is modelled as independent of its neighbours, ignoring any spatial or temporal correlation.
Failing to account for correlation between pixels leads to gross underestimates of the uncertainty in the aggregated quantity, especially over large regions CITATION .
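The effect of ignoring correlation can be illustrated with a minimal sketch. It assumes a toy setting that is not part of the model described here: `m` equally spaced pixels on a transect, a common variance `sigma2`, and a correlation that decays as `rho**|i - j|`. The function name and all parameter values are hypothetical.

```python
def variance_of_mean(m, sigma2, rho):
    """Variance of the mean of m pixels whose pairwise correlation is
    rho**|i - j| (an assumed toy covariance, not the paper's model)."""
    # Var(mean) = (1 / m^2) * sum over all pixel pairs of Cov(i, j)
    total = sum(sigma2 * rho ** abs(i - j) for i in range(m) for j in range(m))
    return total / m ** 2

m, sigma2 = 100, 1.0
independent = variance_of_mean(m, sigma2, rho=0.0)  # per-pixel assumption
correlated = variance_of_mean(m, sigma2, rho=0.9)   # spatially correlated pixels

print(f"variance of regional mean, independent pixels: {independent:.4f}")
print(f"variance of regional mean, correlated pixels:  {correlated:.4f}")
```

With independent pixels only the diagonal terms contribute, so the variance of the mean falls as `sigma2 / m`. Positive correlation adds the off-diagonal terms and the variance of the regional mean is correspondingly larger, which is the uncertainty a per-pixel approach cannot see.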
The solution to the problem outlined above is to replace per-pixel simulation of prevalence realisations with the simultaneous or joint simulation of all pixels to be aggregated, recreating appropriate spatial and temporal correlation between them CITATION.
Crucially, the set of pixel values can then be aggregated in any way, or used as input in derived aggregated quantities, and realisations of these aggregations will have the appropriate posterior predictive distributions.
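As a minimal sketch of this aggregation step, the snippet below takes a handful of jointly simulated surfaces and reduces each one to a regional mean and a population-at-risk total. The pixel values, population counts, and the 0.25 threshold are made-up illustrative numbers, and `regional_summaries` is a hypothetical helper rather than part of the published model.

```python
def regional_summaries(realisations, population, threshold):
    """Reduce each jointly simulated surface to two aggregate quantities.

    realisations: list of surfaces, each a list of pixel prevalences
    population:   people per pixel, in the same pixel order
    threshold:    prevalence at or above which a pixel counts as 'at risk'
    """
    means, at_risk = [], []
    for surface in realisations:
        # Unweighted mean prevalence over the region for this realisation
        means.append(sum(surface) / len(surface))
        # Population living in pixels at or above the threshold
        at_risk.append(sum(n for p, n in zip(surface, population) if p >= threshold))
    return means, at_risk

# Toy input: three joint realisations over a four-pixel region (illustrative only)
realisations = [
    [0.10, 0.20, 0.30, 0.40],
    [0.05, 0.15, 0.35, 0.45],
    [0.20, 0.25, 0.30, 0.50],
]
population = [1000, 2000, 1500, 500]

means, at_risk = regional_summaries(realisations, population, threshold=0.25)
```

Each entry of `means` and `at_risk` is one draw from the posterior predictive distribution of the corresponding aggregate, so credible intervals follow from the empirical quantiles of these lists once enough realisations are available. This only holds when the surfaces are joint realisations; feeding in per-pixel draws would discard the correlation between pixels.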
Whilst conceptually simple, the extension from local to regional simulation induces a fundamental computational constraint in that the necessary calculations can no longer be disaggregated into separate tasks for each pixel.
This constraint has thus far prevented any use of MBG in disease mapping for the evaluation of aggregate quantities over very large areas, despite the profound public health importance of such measures.
Where examples of joint simulation in MBG disease mapping exist, they either cover very small spatial regions CITATION or are achieved by simply breaking larger regions down manually into smaller, more manageable tiles CITATION.
In this paper we use a new approximate algorithm for joint simulation to quantify, for the first time, aggregated uncertainty over space and time in a global scale MBG disease model for Plasmodium falciparum malaria prevalence CITATION.
We exemplify how this approach allows uncertainty in prevalence predictions to be enumerated at the continental, national, and sub-national scales at which public-health decisions are usually made.
We then extend the model architecture to estimate a second quantity of particular epidemiological interest: national populations at risk (PAR) under different policy-relevant strata of P. falciparum transmission intensity.
PAR estimates form a fundamental metric for malaria decision-makers at national and international levels CITATION, CITATION and have also been used to assess equity in donor funding distributions CITATION, chart the changing exposure of human populations to the disease CITATION and provide baselines for predicted changes in exposure under climate change scenarios CITATION.
A range of techniques have been used to estimate PAR, including the use of MBG and other prevalence models to delineate risk strata in relation to underlying population distributions CITATION, CITATION, CITATION CITATION.
None of these studies have incorporated the inherent uncertainty in prevalence estimates, however, and the resulting PAR estimates are presented as point values with no uncertainty metrics.
Here we use the joint simulation framework to generate posterior predictive distributions of PAR living under conditions of low, medium, and high stable transmission within each malaria endemic country, allowing the uncertainty inherent in these estimates to be quantified in a formal statistical framework.
These PAR estimates are presented in full with this paper, making them available to any interested parties to support theoretical and applied epidemiological and public health applications.
In the remainder of this introductory section we outline the computational challenges of large scale joint simulation and review existing approaches to overcoming them.
In the methods section we present our algorithm for efficient joint simulation over very large grids, detail its implementation and testing with the global P. falciparum model, and its extension to estimating PAR.
The results section provides the outcome of the testing and validation procedures and examples of jointly simulated realisations of continental, national, and locally aggregated estimates of P. falciparum prevalence in 2007.
We present our national level estimates of PAR and exemplify how the accompanying uncertainty metrics can be communicated effectively to enhance their utility to decision-makers.
We conclude by discussing the strengths and weaknesses of our modelling architecture, the implications for the future of disease mapping, and useful directions for further research.
A general form for MBG models can be defined as follows: FORMULA such that in a disease survey of FORMULA individuals at a given location, the number observed to be infected FORMULA is modelled as binomially distributed with probability of infection given by FORMULA, the underlying prevalence of the disease in question, which is modelled as a transformation via an inverse link-function FORMULA of an unknown Gaussian process FORMULA CITATION, CITATION.
A Gaussian process in the context of disease mapping is a convenient probability distribution for 2-d surfaces, describing probabilities associated with different forms of the surface.
Using Bayesian inference, the Gaussian process can be updated to take account of the input data, providing a refined description of these probabilities.
Possible surfaces can then be drawn from this updated Gaussian process which, after passing through the inverse link-function, provide realisations of the target disease surface.
The Gaussian process can take a wide range of forms: the central tendency at any location is governed by the underlying mean function FORMULA, whilst textural properties are governed by the covariance function FORMULA.
The symbol FORMULA denotes a set of FORMULA parameters that define the form of the covariance and mean, which can include covariate coefficients.
In MBG, the aim is to estimate the joint posterior distribution of the model parameters FORMULA and the values of FORMULA evaluated at all locations and times for which a prediction is required, generally across the nodes of a regular raster grid.
Computationally, this task can be split into two distinct phases.
Firstly, Markov chain Monte Carlo can be used to generate realisations from the joint posterior of FORMULA and FORMULA at only the FORMULA space-time locations FORMULA where data exist, denoted FORMULA.
This is intuitive because it is only at these locations that the fit of the Gaussian process is evaluated, and it means the MCMC need only consider a multivariate normal distribution of dimension FORMULA, which is generally several orders of magnitude smaller than if all prediction locations across the raster grid were considered.
A realisation of FORMULA and FORMULA provides a skeleton from which the Gaussian process can be evaluated at all prediction locations across a raster grid in a second computational stage.
Conditional on these skeleton realisations, the value of FORMULA at each prediction location and time FORMULA can be sampled from its posterior predictive distribution: FORMULA where the posterior predictive mean and covariance parameters are given by the standard conditioning formulas for multivariate normal variables CITATION: FORMULA FORMULA By carrying out this two-step procedure over many realisations, samples are built up from the target posterior predictive distribution FORMULA.
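The standard conditioning step for multivariate normal variables can be sketched as follows for a single prediction location. This is an illustration under stated assumptions, not the authors' implementation: it uses one-dimensional locations, a constant mean, and an exponential covariance, and the names `exp_cov`, `solve`, and `condition` are hypothetical. A real implementation would use a linear algebra library and a matrix factorisation rather than the small elimination routine shown here.

```python
import math

def exp_cov(a, b, variance=1.0, scale=1.0):
    """Exponential covariance between 1-d locations (an assumed form)."""
    return variance * math.exp(-abs(a - b) / scale)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def condition(x_data, f_data, x_pred, mean=0.0):
    """Posterior predictive mean and variance of the process at x_pred,
    conditional on process values f_data at the data locations x_data."""
    C_dd = [[exp_cov(a, b) for b in x_data] for a in x_data]  # data-to-data
    c_pd = [exp_cov(x_pred, a) for a in x_data]               # prediction-to-data
    weights = solve(C_dd, c_pd)  # C_dd^{-1} c_dp, using symmetry of C_dd
    post_mean = mean + sum(w * (f - mean) for w, f in zip(weights, f_data))
    post_var = exp_cov(x_pred, x_pred) - sum(w * c for w, c in zip(weights, c_pd))
    return post_mean, post_var

x_data = [0.0, 1.0, 2.5]
f_data = [0.3, -0.1, 0.8]  # illustrative skeleton values at the data locations
mu_star, var_star = condition(x_data, f_data, x_pred=1.7)
```

Two sanity checks follow from the formulas: predicting at a data location returns the observed process value with zero variance, and predicting far from all data returns the prior mean and prior variance.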
In a per-pixel implementation, the predictive distributions FORMULA, FORMULA, FORMULA FORMULA at all FORMULA prediction locations in the output raster are realised independently to generate local models of uncertainty.
In this case, the largest single computational component is the population and factorisation of the data-to-data covariance matrix FORMULA which, in typical disease prevalence data sets where FORMULA is in the hundreds or thousands, is a relatively minor task that could generally be achieved on a standard desktop computer.
The subsequent sampling from the posterior predictive distribution is trivial: the posterior predictive mean and covariance refer to a single prediction location and sampling therefore amounts to drawing from a univariate normal distribution.
Total computation for each pixel is therefore modest, and the cost of generating the maps grows simply in proportion to the number of pixels involved, FORMULA.
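A per-pixel scheme of this kind can be sketched in a few lines. The snippet assumes that a posterior predictive mean and variance are already available for each pixel on the Gaussian process scale, and it assumes an inverse logit link purely for illustration; the input values and the function names are hypothetical.

```python
import math
import random

def inverse_logit(z):
    """Map a value on the Gaussian process scale to a prevalence in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def per_pixel_draws(post_means, post_vars, n_draws, seed=0):
    """Draw prevalence realisations for each pixel independently.

    Each pixel needs only a univariate normal draw, so the total cost
    grows linearly with the number of pixels.
    """
    rng = random.Random(seed)
    draws = []
    for mu, var in zip(post_means, post_vars):
        sd = math.sqrt(var)
        draws.append([inverse_logit(rng.gauss(mu, sd)) for _ in range(n_draws)])
    return draws

# Illustrative posterior predictive means and variances for three pixels
pixel_draws = per_pixel_draws([-1.0, 0.0, 0.5], [0.2, 0.5, 0.1], n_draws=1000)
```

The draws for each pixel form a valid local predictive distribution, but because every pixel uses its own independent random numbers, the draws at different pixels carry no information about how those pixels vary together.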
Switching from a per-pixel implementation to a joint simulation over many prediction locations profoundly increases the computational challenge.
The efficiency of a per-pixel approach arises from the effective reduction of FORMULA to one, as each pixel is considered in isolation.
Joint simulation requires that FORMULA is preserved as the total number of prediction points, which can be many millions if large areas are considered at reasonably fine spatial resolution.
In addition to the FORMULA FORMULA data-to-data covariance matrix, the FORMULA FORMULA prediction-to-prediction and FORMULA FORMULA data-to-prediction covariance matrices must be populated.
More importantly, in the subsequent sampling from the posterior predictive multivariate normal distribution, the prediction-to-prediction covariance matrix must be factorised CITATION.
The computational cost of this operation is proportional to the cube of FORMULA.
To put this non-linear scaling in context, if a direct joint simulation of a 100 × 100 raster grid could be computed in one minute, a 1000 × 1000 grid would take approximately 6 × 10^7 seconds.
In practice these scaling factors along with those of memory and storage requirements mean direct joint simulation using the equations outlined above is generally limited to predictions at a maximum of around 10,000 points CITATION, CITATION, at least two orders of magnitude too few for global scale mapping at sub-10 km resolution, even at a single time period.
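The arithmetic behind the cubic scaling example can be checked with a short sketch. The function name is hypothetical, and the calculation assumes that run time is governed entirely by the cubic factorisation cost, ignoring memory and storage limits.

```python
def scaled_runtime_seconds(base_side, base_seconds, new_side, exponent=3):
    """Extrapolate run time from a base_side x base_side grid to a
    new_side x new_side grid, assuming cost proportional to (points)**exponent."""
    base_points = base_side ** 2
    new_points = new_side ** 2
    return base_seconds * (new_points / base_points) ** exponent

# One minute for a 100 x 100 grid, extrapolated to a 1000 x 1000 grid
seconds = scaled_runtime_seconds(100, 60.0, 1000)
years = seconds / (365.25 * 24 * 3600)
print(f"{seconds:.1e} seconds, about {years:.1f} years")
```

A tenfold increase in grid side length gives a hundredfold increase in prediction points and therefore a millionfold increase in factorisation cost, which turns one minute into roughly 1.9 years.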
