Cover Page

Contents

Preface and Acknowledgements to Second Edition

Preface and Acknowledgements

Part I: The Nature of Spatial Epidemiology

1 Definitions, Terminology and Data Sets

1.1 Map Hypotheses and Modelling Approaches

1.2 Definitions and Data Examples

1.3 Further Definitions

1.4 Some Data Examples

2 Scales of Measurement and Data Availability

2.1 Small Scale

2.2 Large Scale

2.3 Rate Dependence

2.4 Data Quality and the Ecological Fallacy

2.5 Edge Effects

3 Geographical Representation and Mapping

3.1 Introduction and Definitions

3.2 Maps and Mapping

3.3 Statistical Accuracy

3.4 Aggregation

3.5 Mapping Issues Related to Aggregated Data

3.6 Conclusions

4 Basic Models

4.1 Sampling Considerations

4.2 Likelihood-Based and Bayesian Approaches

4.3 Point Event Models

4.4 Count Models

5 Exploratory Approaches, Parametric Estimation and Inference

5.1 Exploratory Methods

5.2 Parameter Estimation

5.3 Residual Diagnostics

5.4 Hypothesis Testing

5.5 Edge Effects

Part II: Important Problems in Spatial Epidemiology

6 Small Scale: Disease Clustering

6.1 Definition of Clusters and Clustering

6.2 Modelling Issues

6.3 Hypothesis Tests for Clustering

6.4 Space-Time Clustering

6.5 Clustering Examples

6.6 Other Methods Related to Clustering

7 Small Scale: Putative Sources of Hazard

7.1 Introduction

7.2 Study Design

7.3 Problems of Inference

7.4 Modelling the Hazard Exposure Risk

7.5 Models for Case Event Data

7.6 A Case Event Example

7.7 Models for Count Data

7.8 A Count Data Example

7.9 Other Directions

8 Large Scale: Disease Mapping

8.1 Introduction

8.2 Simple Statistical Representation

8.3 Basic Models

8.4 Advanced Methods

8.5 Model Variants and Extensions

8.6 Approximate Methods

8.7 Multivariate Methods

8.8 Evaluation of Model Performance

8.9 Hypothesis Testing in Disease Mapping

8.10 Space-Time Disease Mapping

8.11 Spatial Survival and Longitudinal Data

8.12 Disease Mapping: Case Studies

9 Ecological Analysis and Scale Change

9.1 Ecological Analysis: Introduction

9.2 Small-Scale Modelling Issues

9.3 Changes of Scale and MAUP

9.4 A Simple Example: Sudden Infant Death in North Carolina

9.5 A Case Study: Malaria and IDDM

10 Infectious Disease Modelling

10.1 Introduction

10.2 General Model Development

10.3 Spatial Model Development

10.4 Modelling Special Cases for Individual-Level Data

10.5 Survival Analysis with Spatial Dependence

10.6 Individual-Level Data Example

10.7 Underascertainment and Censoring

10.8 Conclusions

11 Large Scale: Surveillance

11.1 Process Control Methodology

11.2 Spatio-Temporal Modelling

11.3 S-T Monitoring

11.4 Syndromic Surveillance

11.5 Multivariate–Multifocus Surveillance

11.6 Bayesian Approaches

11.7 Computational Considerations

11.8 Infectious Diseases

11.9 Conclusions

A Monte Carlo Testing, Parametric Bootstrap and Simulation Envelopes

A.1 Nuisance Parameters and Test Statistics

A.2 Monte Carlo Tests

A.3 Null Hypothesis Simulation

A.4 Parametric Bootstrap

A.5 Simulation Envelopes

B Markov Chain Monte Carlo Methods

B.1 Definitions

B.2 Metropolis and Metropolis–Hastings Algorithms

C Algorithms and Code

C.1 Data Exploration

C.2 Likelihood and Bayesian Models

C.3 Likelihood Models

C.4 Bayesian Hierarchical Models

C.5 Space-Time Analysis

D Glossary of Estimators

D.1 Case Event Estimators

D.2 Tract Count Estimators

E Software

E.1 Software

Bibliography

Index

WILEY SERIES IN PROBABILITY AND STATISTICS

Established by WALTER A. SHEWHART and SAMUEL S. WILKS

Editors: David J. Balding, Peter Bloomfield, Noel A. C. Cressie, Nicholas I. Fisher, Iain M. Johnstone, J. B. Kadane, Geert Molenberghs, Louise M. Ryan, David W. Scott, Adrian F. M. Smith, Jozef L. Teugels

Editors Emeriti: Vic Barnett, J. Stuart Hunter, David G. Kendall

A complete list of the titles in this series appears at the end of this volume.

Title Page

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester,

West Sussex PO19 8SQ, England

Telephone (+44) 1243 779777

Email (for orders and customer service enquiries): cs-books@wiley.co.uk.

Visit our Home Page on www.wiley.com

All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except under the terms of the Copyright, Designs and Patents Act 1988 or under the terms of a licence issued by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1T 4LP, UK, without the permission in writing of the Publisher. Requests to the Publisher should be addressed to the Permissions Department, John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England, or emailed to permreq@wiley.co.uk, or faxed to (+44) 1243 770620.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The Publisher is not associated with any product or vendor mentioned in this book.

This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the Publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Other Wiley Editorial Offices

John Wiley & Sons Inc., 111 River Street, Hoboken, NJ 07030, USA

Jossey-Bass, 989 Market Street, San Francisco, CA 94103-1741, USA

Wiley-VCH Verlag GmbH, Boschstr. 12, D-69469 Weinheim, Germany

John Wiley & Sons Australia Ltd, 42 McDougall Street, Milton, Queensland 4064, Australia

John Wiley & Sons (Asia) Pte Ltd, 2 Clementi Loop #02-01, Jin Xing Distripark, Singapore 129809

John Wiley & Sons Canada Ltd, 22 Worcester Road, Etobicoke, Ontario, Canada M9W 1L1

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

British Library Cataloguing in Publication Data

A catalogue record for this book is available from the British Library

ISBN-13: 978-0-470-01484-4

ISBN-10: 0-470-01484-9

“. . . a story is a letter the author writes to themself, to tell themself things that they would be unable to discover otherwise.”

after Carlos Ruiz Zafón

‘to Keir, Fraser, and Hugh and all my family’

Preface and Acknowledgements to Second Edition

Since the appearance of the first edition of this book there has been considerable development of interest in statistical methodology in the area of spatial epidemiology. This development has seen an increased output of research papers and books, marking the maturity of certain areas of concern. For example, the edited volume by Elliott et al. (2000) appeared close to that date, and since then special issues of the Journal of the Royal Statistical Society, Series A (2001), Environmental and Ecological Statistics (2005), Statistical Methods in Medical Research (2005, 2006) and Statistics in Medicine (2006) have all contributed to the appearance of novel methodology. The development of software has also facilitated the wider use of the more advanced methods. In particular, the availability of free packages such as R, WinBUGS and SaTScan has led to wide dissemination of the available methods.

In particular, the area of disease map modelling has seen much development with Bayesian modelling as a particular feature. The use of mixture models and variants of likelihoods has seen development, while the routine application of sophisticated random-effect models is now relatively straightforward. The areas of disease clustering, ecological analysis and infectious disease modelling have all seen advances. In addition, the area of surveillance has re-emerged due to interest in early detection of potential bioterrorism attacks and in particular syndromic surveillance has become a major focus.

I would like to take this opportunity to acknowledge the influence and support of the following: Linda Pickle (NIH), Ram Tiwari (NIH), Martin Kulldorff, Dan Wartenburg, Peter Rogerson, Andrew Moore, Sudipto Banerjee, Ken Kleinman, William Browne, Carmen Vidal Rodeiro, Monir Hossain, Allan Clark, Yang Wang, Yuan Liu, Bo Ma, Huafeng Zhou. Finally I should also like to acknowledge the helpful interactions with staff at Wiley Europe over the years: Kathryn Sharples, Sian Jones, Helen Ramsey, Sharon Clutton and Lucy Bryan.

Andrew Lawson, Columbia, South Carolina
December 2005

Preface and Acknowledgements

The development of statistical methods in spatial epidemiology has had a chequered career. One of the earliest examples of the analysis of geographical locations of disease in relation to a putative health hazard was John Snow’s analysis of cholera cases in relation to the location of the Broad Street water pump in London (Snow, 1854). However, until recently, developments in statistical methods in this area have been sporadic. While medical geography developed in the 1960s (Howe, 1963), only a number of papers on space-time clustering (Mantel, 1967; Knox, 1964) appeared in the statistical literature. More recently, developments of methods in spatial statistics, image processing, and in particular Bayesian methods and computation, have seen parallel developments in methods for spatial epidemiology (see Marshall (1991b) for a review). It is notable that methods for the analysis of case locations around a source of hazard (such as Snow’s cholera map) have only recently been developed (Diggle, 1989; Lawson, 1989). The current increased level of interest in statistical methods in spatial epidemiology is a reflection, in part, of the increased concern in society for environmental issues and their relation to the health of individuals. Hence, the ‘detection’ of pollution sources or sources of health hazard can be seen as the backdrop to many studies in environmental epidemiology (Diggle, 1993). The correct allocation of resources for health care in different areas by health services is also greatly enhanced by the development of statistical methods which allow more accurate depiction of ‘true’ disease incidence and its relation to explanatory variables. Previous work in this area has been reviewed by Lawson and Cressie (2000), and Marshall (1991b) and Elliott et al. (1992a) discuss the general epidemiological issues surrounding spatial epidemiological problems.

It is the purpose of this book to provide an overview of the main statistical methods currently available in the field of spatial epidemiology. Inevitably, some selectivity in choice of methods reviewed will be apparent, but it is hoped that our coverage will encompass the most important areas of development. One area which we do not examine in detail is that of space-time analysis of epidemiological data, although the modelling of infectious disease data is considered in Chapter 11.

As this book is mainly a review of recent research work, its target audience is largely confined to those with some statistical knowledge and is appropriate for third level degree and postgraduate students in statistics, or epidemiology with a strong statistical background.

A considerable number of people have directly or indirectly contributed to the production of this book. First, I thank Sharon Clutton and Helen Ramsey at Wiley and Tony Johnson of Statistics in Medicine for their support from Budapest onwards. Fundamental influences in the development of my ideas in spatial epidemiology have been Richard Cormack and Peter Diggle. I also acknowledge the encouragement of Noel Cressie, who has supported my work through visits to Iowa State and Ohio State Universities, and important collaborations with Martin Kulldorff, Annibale Biggeri, Dankmar Boehning, Peter Schlattmann, Emmanuel Lesaffre, Jean-Francois Viel, Adrian Baddeley, Niels Becker and Andrew Cliff.

Andrew B. Lawson, Aberdeen,
March 2000

Part I

The Nature of Spatial Epidemiology

1 Definitions, Terminology and Data Sets

Spatial epidemiology concerns the analysis of the spatial/geographical distribution of the incidence of disease. In its simplest form the subject concerns the use and interpretation of maps of the locations of disease cases, and the associated issues relating to map production and the statistical analysis of mapped data must apply within this subject. In addition, the nature of disease maps ensures that many epidemiological concepts also play an important role in the analysis. In essence, these two different aspects of the subject have their own impact on the methodology which has developed to deal with the many issues which arise in this area.

First, since mapped data are spatial in nature, the application of spatial statistical methods forms a core part of the subject area. The reason for this lies in the fact that the study of any data which are georeferenced (i.e. have a spatial/geographical location associated with them) may have properties which relate to the location of individual data items and also the surrounding data. For example, Figure 1.1 shows the total number of deaths from respiratory cancer found in 26 small areas (census tracts) in central Scotland over the period 1976–1983. This map displays a number of features which commonly arise when the geographical distribution of disease is examined. On this map the numbers (counts) of cases within each area are displayed. In some areas of the map the counts are similar to those found in the immediately surrounding areas (e.g. in the south and southeast of the map counts of 4 and 6 are recorded, while in the northwest of the map, lower counts are found in many areas). This similarity in the count data in groups of tracts is unlikely to have arisen from the allocation of a random sample of counts from a common statistical distribution. The counts may display some form of correlation in their levels based on their location, i.e. counts close to each other in space are similar. This form of correlation does not arise from the usual statistical models assumed to apply to independent observations found in, for example, clinical medical studies or other conventional statistical application areas. Hence, methods which apply to the analysis of these data must be able to address the possibility of such correlation existing in the mapped data under study. Another feature of this example, which commonly arises in the study of spatial epidemiology, is the irregular nature of the regions within which the counts are observed, i.e. the census tracts have irregular geographical boundaries. 
This may arise as a feature of the whole study region (study window) or may be found associated with tracts themselves. In some countries, notably in North America, small areas are often regular in shape and size and this feature simplifies the resulting analysis. However, in many other areas irregular region geometries are common. Finally, in some studies, the spatial distribution of cases or counts of disease is to be related to other locations on the map. For example, in Figure 1.1 the location of a potential (putative) environmental health hazard is also mapped (a metal-processing plant), and the focus of the study may be to assess the relationship of the disease incidence on the map to that location, perhaps to make inferences about the environmental risk in its vicinity.

Figure 1.1 Falkirk: central Scotland respiratory cancer counts in 26 census enumeration districts over a fixed time period. * Putative health hazard.

The second feature which uniquely defines the study of spatial epidemiology is that the mapped data are often discrete. Unlike other areas of spatial statistical analysis, which are often focused on continuous data, e.g. geostatistical methods, the data found in spatial epidemiology often take the form of point locations (the address locations of cases of disease) or counts of disease within regions such as census tracts or, at larger scale, counties or municipalities. Hence, the mapped data often consist of Cartesian coordinates in the form of a grid reference or longitude/latitude of an address of a case, or a count of cases within a region with the associated location of that region (either as a point location of a centroid or as a set of boundary line segments defining the region). Given this data format, it is not surprising that models which have been developed for applications within this area are derived from stochastic point process theory (for case locations) and associated discrete probability distributions (for counts within arbitrary regions).

Finally, the epidemiological nature of these discrete spatial data leads to the derivation of models and methods which are related to conventional epidemiological studies. For example, the case–control study, where individual cases are matched to control individuals based on specific criteria, has parallels in spatial epidemiology where spatial control distributions are used to provide a locational control for cases. This is akin to the estimation of background hazard in survival studies. One fundamental epidemiological issue which arises in these studies is the incorporation of the local population which is at risk of contracting the disease in question. As we must control for the spatial variation in the underlying population, we must be able to obtain good estimates of the population from which the cases or counts arise. This estimation often leads to the derivation of expected rates in the region count case and further to the estimation of the ratio of count to expected count/rate, or the relative risk, in each area. Relative risk is a fundamental epidemiological concept (Clayton and Hills, 1993) in non-spatial epidemiological studies.

1.1 Map Hypotheses and Modelling Approaches

In any spatial epidemiological analysis, there will usually be a study focus which specifies the nature and style of the methods to be used. This focus will usually consist of a hypothesis or hypotheses about the nature of the spatial distribution of the disease which is to be examined, and it is convenient to categorise these hypotheses into three broad classes: disease mapping, ecological analysis and disease clustering. Usually, the distribution of cases of disease, whether in the form of counts or case address locations, can be thought to follow an underlying model, and the observed data may contain extra noise in the form of random variation around the model of interest. Often, the model will include aspects of the null (hypothesis) spatial distribution of the cases, which captures the ‘normal’ variation which is expected, and also aspects of the alternative spatial distribution. In much of spatial epidemiology, the focus of attention is on identifying features of the spatial distribution which are not captured by the null hypothesis distribution. This is mainly related to excess spatial aggregation of cases in areas of the map. That is, once the normal variation is allowed for, the residual spatial incidence above the normal incidence is the focus. Seldom is there any need to examine areas of lower aggregation than would be normally expected. Note that ‘normal’ variation is usually assumed to be defined by the underlying population distribution of the study region/window and cases are thought to arise in relation to the local variation in that distribution.

The first class, that of disease mapping, concerns the use of models to describe the overall disease distribution on the map. In disease mapping, often the object is simply to ‘clean’ the map of disease of the extra noise to uncover the underlying structure. In that situation, the null hypothesis could be that the case distribution arises from an unspecified or partly specified null spatial distribution (which includes the population spatial distribution) and the object is to remove the extra noise/variation. In this sense disease mapping is close in spirit to image processing where segmentation usually describes the process of allocating pixels or groups of pixels to classes.

The second class, that of ecological analysis, concerns the analysis of the relation between the spatial distribution of disease incidence and measured explanatory factors. This is usually carried out at an aggregated spatial level, and usually concerns regional incidence compared to explanatory factors measured at regional or other levels of aggregation (Greenberg et al., 1996). This contrasts with studies which use measurements made on individual subjects. However, many of the issues concerning interpretation of ecological studies are concerned with change in aggregation level and not aggregated data per se. For example, the ecological fallacy concerns making inference about individuals from analyses carried out at a higher scale, e.g. regional or country-wide level. Equally, the atomistic fallacy concerns making inferences about average characteristics from individual measurements. In what follows we assume a relatively wide definition of ecological, more in the sense of ecology itself, as any study which seeks to describe/explain the spatial distribution of disease based on the inclusion of explanatory variables. Two classic studies of this kind are presented by Cook and Pocock (1983), who examined the relation of cardiovascular incidence in the UK to a variety of variables (including water hardness, climate, location, socioeconomic and genetic factors and air pollution), and Donnelly (1995), who examined the respiratory health of school children and volatile organic compounds in the outdoor atmosphere. Note that this general definition can include the situation where case address locations are related to a pollution hazard via explanatory variables such as distance and direction from the hazard. In that case individual data are related to explanatory variables.
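As a minimal sketch of how distance and direction from a hazard might be derived as explanatory variables for case address locations, the following computes both for each case. The function name and all coordinates are hypothetical and chosen only for illustration; planar coordinates are assumed.

```python
import math

def distance_and_direction(case_xy, source_xy):
    """Return (distance, angle) of a case location relative to a source.

    The angle is in radians, measured anticlockwise from the positive
    x-axis (east), and lies between -pi and pi.
    """
    dx = case_xy[0] - source_xy[0]
    dy = case_xy[1] - source_xy[1]
    return math.hypot(dx, dy), math.atan2(dy, dx)

# Hypothetical source and case coordinates (illustration only).
source = (0.0, 0.0)
cases = [(3.0, 4.0), (0.0, 2.0), (-1.0, 0.0)]

# One (distance, angle) pair per case, usable as two columns of covariates.
covariates = [distance_and_direction(c, source) for c in cases]
```

Each case then carries two individual-level covariates, which is the sense in which individual data are related to explanatory variables in this setting.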

The final class, that of disease clustering, concerns the analysis of ‘unusual’ aggregations of disease, i.e. assessing whether there are any areas of elevated incidence of disease within a map. This type of analysis could take a variety of forms. First, the analysis could include the assessment of a complete map to ascertain whether the map is clustered. This is often termed general clustering. In this case, the null hypothesis would be that the disease map represents normal variation in incidence given the population distribution. The alternative hypothesis would include some specified clustering mechanism for the disease cases. This mechanism could be descriptive or include some notion of how the clusters form (e.g. clusters can form if infectious diseases are examined, and the contact rate of individuals can be modelled). General clustering is often treated as a form of autocorrelation and models for such effects are often employed. This form of clustering can be termed non-specific as it does not seek to determine where clusters are found but instead simply seeks to determine whether the pattern is clustered.

Second, specific cluster studies attempt to ascertain the locations of any clusters if they exist on the map. These clusters could have known (fixed) locations and the incidence of disease around these locations may be assessed for its relation to the location(s). Studies of putative pollution hazards fall within this category. This is often termed focused clustering. If the locations of clusters are unknown a priori, then the locations must also be estimated from the data; this is termed non-focused clustering. Often, ecological regression methods can be used in focused clustering studies, whereas, for non-focused studies, special methods must be constructed which allow the estimation of cluster locations and their form.

Three considerations are fundamental to the methods employed in all the above areas of study. First, spatial location must be included in the analysis, and so spatial statistical methods are often employed to model the observed data. Second, epidemiological considerations should be employed in any study of the distribution of disease incidence, in that the concept of normal variation of disease (i.e. that generated from the population at risk from the disease) must be catered for in any model of incidence. Third, the methods used should be appropriate to the analysis of georeferenced discrete data.

1.2 Definitions and Data Examples

In this section, some basic definitions and concepts are introduced which are used throughout this book. In addition, a number of data examples make their first appearance and these will be referred to at various stages throughout the work.

In what follows we will mainly be concerned with data which are available within a single period of time. Hence, we do not provide notation for space-time problems here. Where such notation is appropriate, we provide it locally.

We define ‘epidemiology’ as the study of the occurrence of disease in relation to explanatory factors. A strict dictionary definition of the term implies the study of ‘epidemic diseases’. However, in this work we mainly restrict attention to fixed time period studies and do not directly examine the dynamic behaviour of disease incidence. This area has recently been reviewed in Mollison (1995), Daley and Gani (1999) and Andersson and Britton (2000). Some discussion of epidemic models appears in Chapter 10. Here the term ‘spatial epidemiology’ is defined to mean the study of the occurrence of disease in spatial locations and its explanatory factors. Usually, the disease to be examined occurs within a map and the data are expressed as a point location (case event) or are aggregated as a count of disease within a subregion of the map. Two examples of such data are provided in Figures 1.2 and 1.3. These two data types lead to different modelling approaches, and we make specific the following definitions as a basis for further discussion.

1.2.1 Case event data

We define the study window (W), within which m disease case events occur at locations x_i, i = 1, …, m. The area of W is denoted by |W|, the Lebesgue measure of W on ℝ². Figure 1.4 displays these definitions.
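To make the notation concrete, the sketch below represents a study window as a simple polygon and computes its area |W| with the shoelace formula, together with the number of case events m. The rectangular window, the event coordinates and the crude events-per-unit-area summary are assumptions made for illustration only.

```python
def polygon_area(vertices):
    """Area of a simple polygon given as a list of (x, y) vertices.

    Uses the shoelace formula; the vertices may be listed in either
    clockwise or anticlockwise order.
    """
    total = 0.0
    n_vertices = len(vertices)
    for k in range(n_vertices):
        x0, y0 = vertices[k]
        x1, y1 = vertices[(k + 1) % n_vertices]
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0

# Hypothetical rectangular study window W and case event locations x_i.
window = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
events = [(1.0, 1.0), (2.5, 0.5), (3.0, 2.0)]

area_W = polygon_area(window)      # |W|
m = len(events)                    # number of case events
events_per_unit_area = m / area_W  # crude overall density of events
```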

Figure 1.2 The locations of larynx cancer cases in an area of central Lancashire, UK, for the period 1974–1983.

Figure 1.3 Respiratory cancer counts within census tracts (enumeration districts) of Falkirk, central Scotland, for the period 1978–1983.

1.2.2 Count data

We define the study window (W) as above, within which m arbitrarily bounded subregions lie, wholly or in part. The count in the ith subregion (tract) is denoted n_i, i = 1, …, m. In Figure 1.5, only regions 4, 5 and 6 are wholly within the window. Regions 1, 2, 3 and 7 are cut by the window boundary. The effect of this region truncation will be discussed in detail later. However, it should be noted that, usually, the count available (n_i) is from the complete region and not from the truncated region which appears in the study window.

Figure 1.4 A notional study area (W) and a guard area (T).

Figure 1.5 A study region within which counts are observed in subregions (tracts).

Usually, the m subregions are politically defined administrative regions and are often tracts defined for the purposes of population censuses. We adopt the term ‘census tract’ to denote an arbitrarily defined region. In addition, the counts in census tracts are just an aggregation of case event data counted within the bounding tract boundaries. Hence, the data in Figure 1.5 could be derived from the data in Figure 1.4 by counting case events in census tract subregions of the window.
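The aggregation of case events into tract counts can be sketched as follows. For simplicity the tracts are assumed here to be non-overlapping axis-aligned rectangles, whereas real census tracts have irregular boundaries and would need a general point-in-polygon test; the function name and all coordinates are hypothetical.

```python
def count_events_in_tracts(events, tracts):
    """Count case events falling in each tract.

    Each tract is a rectangle (xmin, ymin, xmax, ymax). An event is
    assigned to the first tract containing it; events outside every
    tract are ignored.
    """
    counts = [0] * len(tracts)
    for x, y in events:
        for i, (xmin, ymin, xmax, ymax) in enumerate(tracts):
            if xmin <= x < xmax and ymin <= y < ymax:
                counts[i] += 1
                break
    return counts

# Hypothetical tracts and case event locations (illustration only).
tracts = [(0, 0, 2, 2), (2, 0, 4, 2), (0, 2, 4, 4)]
events = [(0.5, 0.5), (1.5, 1.0), (2.5, 1.5), (3.0, 3.0), (1.0, 2.5), (5.0, 5.0)]

n = count_events_in_tracts(events, tracts)  # tract counts n_i
```

The final event lies outside all three tracts and so contributes to no count, which mirrors the loss of events that fall beyond the study window.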

The object of analysis of case event or count data can define the type of summary measures used to describe the data. Usually, as a basic summary measure it is common to compute a local measure of relative risk, or to use a local measure of relative risk as the dependent variable in a more substantial analysis. Here, relative risk is taken to mean the measure of excess risk found in relation to that supported purely by the local population, which is ‘at risk’. This population is sometimes called the ‘at-risk’ population or background. Relative risk is derived or computed from the relation of observed incidence to that which would be expected based on the ‘at-risk’ background. It is common practice within epidemiology to derive such risk estimates. In the case of spatial epidemiology it is common, when tract count data are available, to compute a standardised mortality (or morbidity) ratio (SMR), which is simply the ratio of the observed count within a tract to the expected count based on the ‘at-risk’ background. A ratio greater than 1.0 would suggest an excess of risk within the tract. These SMRs are often the basis for atlases of disease risk (see, for example, Pickle et al., 1999).
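A minimal sketch of the SMR calculation described above is given below. The observed counts follow the n_i notation; the expected counts (written `e` in the code) and all numerical values are hypothetical. Note that some sources report the SMR multiplied by 100, so that 1.0 on the scale used here corresponds to 100.

```python
def smr(observed, expected):
    """Standardised mortality (or morbidity) ratio for each tract.

    Returns observed count divided by expected count, tract by tract.
    Expected counts are assumed to be strictly positive.
    """
    return [n_i / e_i for n_i, e_i in zip(observed, expected)]

# Hypothetical observed and expected counts for three tracts.
n = [4, 6, 1]
e = [2.0, 6.0, 2.5]

ratios = smr(n, e)

# Indices of tracts whose ratio exceeds 1.0, suggesting excess risk.
excess = [i for i, r in enumerate(ratios) if r > 1.0]
```

With small expected counts such ratios can be very unstable, so a ratio above 1.0 in a single tract should be read as suggestive rather than conclusive.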

1.3 Further Definitions

Some further definitions are required in relation to data which arise in such studies.

1.3.1 Control events and processes

Often, an additional process or realisation of disease events is used to provide an estimate of the ‘background’ incidence of disease in an area. Define x_{c_j}, j = 1,…, m_c, to be these m_c control event locations. The use of such data will be detailed in a later section.

1.3.2 Census tract information

The census tract count of a control disease is defined to be n_c.

Instead of using a control disease to represent ‘background’, the ‘expected’ incidence of disease can be used. This is usually based on known rates of disease in the population (Inskip et al., 1983). Denote this expected incidence as e_i, i = 1,…, m. The total population of a tract is p_i, while the extent of the tract is defined as a_i. The tract centroid, however defined, is denoted by x_{n_i}.
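One common way to obtain an expected incidence from known rates is indirect standardisation, in which stratum-specific reference rates are applied to the tract population in each stratum and summed. The sketch below assumes two hypothetical age strata with made-up rates and populations; the function and variable names are illustrative and are not taken from the text.

```python
def expected_counts(stratum_populations, reference_rates):
    """Expected count e_i for each tract by indirect standardisation.

    stratum_populations[i][k] is the population of tract i in stratum k,
    and reference_rates[k] is the known disease rate in stratum k.
    """
    return [
        sum(p * r for p, r in zip(pops, reference_rates))
        for pops in stratum_populations
    ]

# Hypothetical reference rates for two age strata (cases per person).
reference_rates = [0.001, 0.004]

# Hypothetical tract populations by stratum: two tracts, two strata.
stratum_populations = [[1000, 500], [2000, 250]]

e = expected_counts(stratum_populations, reference_rates)  # e_i
p = [sum(pops) for pops in stratum_populations]            # total population p_i
```

The two tracts have different total populations but the same expected count, because the second tract has fewer people in the higher-rate stratum.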

For models involving explanatory variables measured at tract level, we define F as an m × p matrix whose columns represent p explanatory variables, and α as a p × 1 vector of parameters. (For case event models the row dimension of F will usually be m also.)
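As a sketch of how these two objects combine, the product of F and α gives one value per tract. The symbol η is introduced here purely for illustration and is not part of the notation defined above; how η enters a particular model depends on the model chosen.

```latex
\eta = F\alpha, \qquad
\eta_i = \sum_{j=1}^{p} F_{ij}\,\alpha_j, \quad i = 1, \ldots, m,
```

where F is m × p, α is p × 1 and η is therefore m × 1.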

1.3.3 Clustering definitions

In cases where clustering is studied, a number of additional definitions are required. First, cluster centre locations are defined as y_j, j = 1,…, k, where k is the number of centres in a suitably defined window. The term ‘parent’ is used here synonymously with cluster centre. This does not imply any genetic linkage with the observed data. The observed data belonging to a cluster are sometimes referred to as offspring. Again, there is no genetic linkage implied by this term. In addition the offspring (or tract count) associated with a particular parent, y_i say, have an integer label, z_i, denoting their associated parent. These definitions are displayed in Figure 1.6.
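The labelling of offspring by parent can be illustrated with a small sketch. Here each offspring is simply assigned to its nearest cluster centre; this nearest-centre rule is an assumption made for illustration and is not a clustering model proposed in the text. The coordinates are hypothetical and the labels are 1-based, so that label j refers to centre y_j.

```python
import math

def nearest_parent_labels(offspring, parents):
    """Return an integer label for each offspring location.

    The label is the 1-based index of the closest parent (cluster
    centre) in Euclidean distance; ties go to the lower index.
    """
    labels = []
    for ox, oy in offspring:
        dists = [math.hypot(ox - px, oy - py) for px, py in parents]
        labels.append(dists.index(min(dists)) + 1)
    return labels

# Hypothetical cluster centres y_j and offspring locations x_i.
parents = [(0.0, 0.0), (10.0, 10.0)]
offspring = [(1.0, 1.0), (9.0, 8.0), (0.0, 2.0), (11.0, 12.0)]

z = nearest_parent_labels(offspring, parents)  # integer labels z_i
```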

Figure 1.6 Pictorial representation of clustering definitions: ·, offspring {x}; +, centre {yc}.

1.4 Some Data Examples

In the following discussion a number of data examples will be examined. These are used to motivate discussion of certain modelling issues and to provide insight into the nature of the data which arise in this area. The examples are chosen to represent different approaches to the study of the spatial distribution of disease. The data sets are available as a link from a website: www.sph.sc.edu/alawson/default.htm. In Chapters 9 and 10 additional data sets are introduced which are only referenced in those chapters.

1.4.1 Case event examples

The following examples have been analysed previously and represent different aspects of analysis.

Arbroath: multiple disease study

Arbroath is a small town on the east coast of Scotland. A retrospective study of the health status in that town was initiated following concerns over airborne emissions from a centrally located steel foundry. For the period 1966–1976, the address locations of death certificates for a range of diseases were recorded. The diseases chosen were thought to be related to air pollution risk. These included respiratory cancer, gastric and oesophageal cancer, and bronchitis. To provide a representation of the background ‘at-risk’ population at case event locations, a realisation of a ‘control’ disease was also recorded. The control disease was a composite of lower-body cancers (prostate, penis, breast, testes, cervix, uterus, colon and rectum). These diseases are thought to be largely unaffected by air pollution.

Figure 1.7 displays the location map and Figures 1.8, 1.9, 1.10 and 1.11 display the case event maps of the three case diseases and the control disease.

Figure 1.7 European location map.

Figure 1.8 Arbroath: respiratory cancer case event map.

Figure 1.9 Arbroath: gastric and oesophageal cancer case event map.

Figure 1.10 Arbroath: bronchitis case event map.

Armadale: respiratory cancer data

This data set was first analysed by Lloyd (1982) and consists of 49 respiratory cancer death certificate addresses for the period 1968–1974 for the small town of Armadale, central Scotland. This town is located in an industrial area close to Falkirk (see location map Figure 1.7). A standardised mortality ratio of 150 for each of the years of the period was recorded and this unusual excess of deaths was dubbed the Armadale Epidemic. Accompanying the case locations is a realisation of coronary heart disease (CHD) death certificate locations which have been used as a control disease realisation (Lawson and Williams, 1994). A circular study window was used so that directional sampling bias would be minimised. The case and control realisations are displayed in Figures 1.12 and 1.13.

Figure 1.11 Arbroath: control disease (lower-body cancers) case event map.

Humberside leukaemia and lymphoma data

This data set was first analysed by Cuzick and Edwards (1990) and consists of a realisation of case events of childhood leukaemia and lymphoma in the north Humberside region of England for the period 1974–1986. As a ‘control’ for the population ‘at risk’ in the area the authors obtained a large sample of births from the birth register for the region and period. This provides a spatial ‘childhood’ control but not a disease-specific control. Figures 1.14 and 1.15 display the case event and control maps for this example. The original purpose of the example was to examine the clustering tendency of the case events.

Lancashire: larynx cancer

The incidence of cancer of the larynx in a part of Lancashire, England, has been studied by Diggle (1990). This example consists of a realisation of 58 larynx cancer case events in the period 1974–1983. A control event realisation of 978 cases of respiratory cancer in the same period was also available. Figures 1.16 and 1.17 display the case and control maps. The object of the original analysis was to assess evidence for the existence of an environmental air pollution source in the area of the map (an incinerator; location: (35450, 41400)). While respiratory cancer may represent the ‘at-risk’ population for larynx cancer, its distribution is also affected by air pollution and hence the comparison of these two diseases is a relative risk comparison only. Discussion of the choice of control disease or other sources of standardisation is postponed to a later section.

Figure 1.12 Armadale: 49 respiratory cancer death certificate addresses, within circular window.

Figure 1.13 Armadale: realisation of CHD death certificate addresses, within circular window.

Figure 1.14 Humberside: leukaemia and lymphoma case event map (1974–1986). Reproduced from Lawson and Cressie (2000) with permission from Elsevier Science.

Figure 1.15 Humberside: leukaemia and lymphoma control event map (1974–1986).

Figure 1.16 Lancashire: larynx cancer case event map (1974–1983).

Figure 1.17 Lancashire: respiratory cancer control event map (1974–1983).

Burkitt’s lymphoma in Uganda

This spatio-temporal data set consists of the locations of cases of Burkitt’s lymphoma in the Western Nile district of Uganda for the period of 1960–1975. The time variable is recorded as the number of days starting from an origin of 1 Jan 1960. The data set has been used widely and is available in the Splancs R/S-Plus package. The data consist of the spatial coordinates of the case locations (easting, northing), with an accompanying time (daynumber). The age of the patient (child) is also recorded, and an exact date is also available as a factor in the original data set. There is no control disease available. Figure 1.18 displays one year of monthly case maps for this example.

Figure 1.18 Monthly case event maps for Burkitt’s lymphoma in Uganda in 1970. Spatial coordinates are eastings and northings.
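Since the time variable is stored as a day number, a first step in producing monthly maps such as those in Figure 1.18 is to convert day numbers to calendar dates. The sketch below uses only the Python standard library. It assumes that day number 0 corresponds to 1 Jan 1960 (the source data should be checked for whether counting starts at 0 or 1), and the three records are hypothetical values invented for illustration, not entries from the real data set.

```python
from collections import Counter
from datetime import date, timedelta

ORIGIN = date(1960, 1, 1)  # assumption: day number 0 corresponds to 1 Jan 1960

def day_to_date(day_number):
    """Convert a day number counted from the origin into a calendar date."""
    return ORIGIN + timedelta(days=day_number)

def monthly_counts(records, year):
    """Count case events per calendar month within the given year."""
    counts = Counter()
    for _easting, _northing, day_number in records:
        d = day_to_date(day_number)
        if d.year == year:
            counts[d.month] += 1
    return counts

# Hypothetical (easting, northing, daynumber) records for illustration only.
records = [(255.0, 350.0, 3660), (270.0, 340.0, 3700), (290.0, 310.0, 3990)]
counts_1970 = monthly_counts(records, 1970)
```

Splitting the records by the month keys of `counts_1970` would then give the subsets needed for a sequence of monthly case event maps.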

1.4.2 Count data examples

In this work we also examine a number of examples of count data maps where the disease of interest has been collected within small areas. These small areas vary from census enumeration districts (Falkirk) to districts (North Carolina), municipalities (Tuscany), Landkreise (Germany) and counties (Ohio, South Carolina).

Falkirk: respiratory cancer mortality

In this example, counts of respiratory cancer in 26 census enumeration districts for the period of 1978–1983 in central Falkirk, a large town in central Scotland, are given. These data form a small part of a larger study of respiratory cancer incidence in this urban area. The enumeration district map with associated counts is displayed in Figure 1.19. Total expected rates based on Scottish national rates for 18 age × sex groups are also available.
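An expected count of this kind is conventionally obtained by indirect standardisation: the reference rate for each age × sex group is multiplied by the tract population in that group and the products are summed. The sketch below illustrates the arithmetic with two groups rather than 18, and all numbers are hypothetical. The function names are introduced here for illustration, and the ratio is reported on the 100 scale, as in the Armadale example.

```python
def expected_count(population_by_group, reference_rates):
    """Indirectly standardised expected count: sum over groups of population x rate."""
    if len(population_by_group) != len(reference_rates):
        raise ValueError("one reference rate is needed per population group")
    return sum(n * r for n, r in zip(population_by_group, reference_rates))

def smr(observed, expected):
    """Standardised mortality ratio on the 100 scale (100 = as expected)."""
    return 100.0 * observed / expected

# Hypothetical tract with two age x sex groups (illustration only).
population = [1000, 2000]          # persons at risk in each group
national_rates = [0.001, 0.002]    # reference rate per person over the study period
e = expected_count(population, national_rates)  # 1000*0.001 + 2000*0.002
ratio = smr(observed=6, expected=e)
```

With these illustrative figures the expected count is about 5 and the ratio about 120, i.e. a 20% excess over the reference rates.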

North Carolina: sudden infant mortality

The incidence of sudden infant death (SID) in North Carolina, USA, has been studied by Cressie and Chan (1989) and Lawson (1997), amongst others. The counts of infant death in the 100 counties for the period of 1974–1978 have been collected and total births for the counties are also available. Figure 1.20 displays the county map and death counts.

Tuscany: gastric cancer morbidity

The incidence of gastric cancer in the Tuscany region of Italy is of particular interest due to large variations in incidence between the northeastern areas and the south and west. Figure 1.21 displays the standardised mortality ratios as a choropleth map for the period 1980–1989.

Figure 1.19 Falkirk: map of respiratory cancer enumeration district counts (1978–1983).

Figure 1.20 Thematic map of counts of sudden infant deaths (SIDs) in North Carolina for the period 1974–1978.

Figure 1.21 Tuscany: gastric cancer morbidity 1980–1989. Standardised mortality ratios.

Lip cancer in Eastern Germany

This data set consists of age–sex standardised counts for lip cancer in administrative regions in Eastern Germany for the period 1980–1989. A set of counts for the regions (Landkreise) is provided, and the standardised mortality ratio map is displayed (Figure 1.22).

Ohio respiratory cancer mortality

This data set has been widely used (see e.g. Carlin and Louis, 2000; Knorr-Held and Besag, 1998; Lawson et al., 2003) and is available (amongst other places) from the University of Munich data archive: www.stat.uni-muenchen.de/service/datenarchiv/ohio/ohio_e.html. This spatio-temporal data set consists of counts of deaths from respiratory cancer broken down by county and by year over the period 1968–1988. The 21 years of counts are also broken down into age, sex and race groups. Simpler subsets of these data have been examined where only county total counts have been used. Figure 1.23 displays a selection of four years of total counts by county.

Figure 1.22 Thematic map of SMRs for lip cancer in Eastern Germany for the period 1980–1989.

Figure 1.23 Ohio respiratory cancer mortality (1968–1988): total counts by county for a selection of four years (1968, 1977, 1983, 1988).

Figure 1.24 South Carolina influenza confirmed positive notifications: count profiles for the period 18 December 2004–16 April 2005 for four counties.

Figure 1.25 South Carolina influenza confirmed positive notifications: count thematic maps for a selection of three time periods in 2004–2005 season.

South Carolina influenza confirmation

This data set consists of counts of laboratory confirmed positive (+ve) influenza cases within the 46 counties of South Carolina, USA, by one- and two-week periods, over the winter flu season of 2004/2005, beginning on 18 December 2004. These data are publicly available from the SC Department of Health and Environmental Control (DHEC) flu surveillance website: www.scdhec.net/health/disease/acute/flu.htm. Figure 1.24 displays the counts for four of the counties in the state which usually have a higher density of case notifications: Beaufort, Charleston, Horry and Richland. Charleston and Richland are the main urban areas (Richland includes the state capital Columbia) and Beaufort and Horry include coastal resort communities (Myrtle Beach, Hilton Head and Beaufort). Figure 1.25 displays three examples of count thematic maps for the periods 1–15 January, 15–22 January, and 29 January – 12 February 2005.

2 Scales of Measurement and Data Availability

It has long been recognised that analysis of spatial data should be carried out at appropriate scales. Examples of such discussion extend back to the 1960s in geography (Schumm and Lichty, 1965). It is clear that not only are certain scales appropriate for examination of particular spatial structures, but also changes of scale will change the structural features of the data themselves. For example, the occurrence of four cases of a rare disease in a suburban street (street level), in isolation, could be regarded as a ‘cluster’ of disease, by some definition. However, when the incidence is aggregated with that from a large number of streets (which could have negligible incidence), then the total incidence for the area may not be detectable as representing a ‘cluster’. Essentially, the effect of arbitrary aggregation is to change the scale of analysis and, effectively, to smooth the incidence surface. The duality of smoothing and scale change occurs wherever case events are aggregated into tract counts. In that case, the locational information held in the case events is ‘blurred’ by the scale change to tract level. This loss of information was noted by Diggle (1993) and Lawson (1993c), and both authors have stressed the importance of using methods appropriate to the observation scale when this is available.
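The blurring effect of aggregation can be illustrated with a small numerical sketch. The code below bins case event coordinates into square grid cells at two different cell sizes. The regular grid stands in for tracts purely for convenience (real tracts are irregular), the function name `tract_counts` is introduced here, and the coordinates are hypothetical.

```python
from collections import Counter

def tract_counts(events, cell_size):
    """Aggregate (x, y) case event locations into counts on a square grid."""
    return Counter((int(x // cell_size), int(y // cell_size)) for x, y in events)

# Hypothetical locations in km: four cases in one street plus three scattered cases.
street = [(2.10, 3.10), (2.12, 3.11), (2.15, 3.12), (2.18, 3.14)]
scattered = [(0.5, 0.7), (3.6, 1.2), (1.4, 3.8)]
events = street + scattered

fine = tract_counts(events, cell_size=0.25)   # street-level cells
coarse = tract_counts(events, cell_size=4.0)  # a single large cell

largest_fine_count = max(fine.values())       # the street stands out at this scale
total_in_coarse_cell = coarse[(0, 0)]         # all events merged into one count
```

At the fine scale the four street cases fall in one cell and are distinguishable from the scattered cases; at the coarse scale only a single total remains, and the positions of the cases within the cell cannot be recovered from it.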

A related aspect of appropriate scales of analysis concerns ‘a scale at which phenomena occur’, and it is this sense of scale which geographers have addressed. This is also of great relevance to any statistical analysis as (1) within any window, phenomena of interest may occur at different scales, (2) the scales of spatial variation may be required to be estimated, and (3) there may be regions within the window which are associated with certain spatial scales. Examples of all three situations are plentiful. For (1), a localised pollution hazard may increase incidence of disease around a source, but not elsewhere. For (2), the size of disease clusters can vary, due to spatial variation in aetiological factors. For (3), the boundaries between urbanised and rural areas can occur in study regions and the effect may yield considerably different spatial variation in these regions. The appropriate spatial scale of analysis can be defined, on an increasing measure, for a variety of types of study and these scales are detailed in the following sections.

2.1 Small Scale

The analysis of an individual disease cluster or group of clusters, whether related to a known source of hazard or not, usually requires the examination of areas of size 0.5–10 km². Often the exact scale of operation of the process or processes affecting the clustering is unknown, and hence a reasonably large study window is used, large enough to encompass the scale of the clustering effect. At this small scale it is also possible to analyse spatial ‘ecological’ problems, i.e. the study of the relation of disease to explanatory variables (Cuzick and Elliott, 1992). Indeed, the analysis of the relation between a cluster of data events and a putative hazard could be regarded as a special case of this type of analysis. Usually, the object of spatial ecological analysis is to assess the general relations between data and covariates (see, for example, Donnelly et al. (1994), Cressie and Chan (1989) and Marshall (1991a) for a review).

2.2 Large Scale

At larger spatial scales, the aggregative effects of scale lead to different analysis objectives. The analysis of variation in incidence of disease within regions of a country could be to provide a disease map of the country or to carry out large-scale ecological analysis. Disease mapping has as its objective the provision of a ‘clean’ map of disease incidence, with all random effects removed, so that an accurate estimate of the underlying rate in different areas is provided. In this sense, the objective is a type of smoothing, and methods related to smoothing are typically employed.

The screening of large areas of a country for ‘anomalies’ in incidence (or ‘clusters’) (Besag et al., 1991a) has as its objective the isolation of ‘areas of raised incidence’. These studies are related to disease mapping, in that a ‘clean’ disease map can be used to assess such ‘areas of raised incidence’. Such cluster detection can be based on case event data (Openshaw et al., 1987). Disease mapping on the other hand is usually based on regional count data. Large-scale ecological analysis can also be based on either data type. Cook and Pocock (1983) give an example of the analysis of regional variation in heart disease, within the UK, based on regional explanatory variables.

2.3 Rate Dependence

While the scale change criteria in Sections 2.1 and 2.2 apply to a given disease, a suitable scale for analysis will depend on the normal rate of occurrence of that disease. For example, with very rare diseases it may require a continental scale to analyse even a case event map. Indeed cluster patterns may even have scale cycles. This should be considered when making choices of scale for analysis.

2.4 Data Quality and the Ecological Fallacy

A number of issues arise in the use of case event and count data and their interpretations. As mentioned above, count data are formed, usually, as an aggregation of case event data. In that sense, count data are an approximation to case event data. However, there are significant advantages and disadvantages associated with both data types.

Case event data are usually available as the street address of a case of disease recorded as having occurred within a fixed time period. While this is an exact location, its relation to the disease aetiology may be uncertain. For example, if the event is a morbidity event (e.g. the address at diagnosis), it may be that (1) the disease was not contracted at that address, or (2) the case has subsequently moved. If the event is a mortality event (e.g. the address on a death certificate), it may be that the disease was contracted while at another address. In both cases, the exact location may not be appropriate. For example, the case may be someone who has a work-related disease. Therefore, the home address may be of little importance. Alternatively, a pollution source may have been influential in causing cases of bronchitis amongst people travelling daily to work. Hence, home address could be, at least, an approximation to the ‘at-risk’ environment.

However, case event locations, when properly validated (Lawson and Williams, 1994), can provide detailed spatial information, which would be lost when counts are used. This information could be very important in detailed assessment of environmental gradients. The conflicting results which can occur when data are aggregated to counts are evidenced by Diggle (1990) and Elliott et al. (1992b). Two disadvantages of case event data are that (1) the exact addresses are not always readily available due to possible confidentiality problems, and (2) often inferences about individuals at locations are functions of a smoothed regional ‘at-risk’ population surface which is interpolated to the case address. Hence, regional average characteristics are being ascribed to a particular location and a particular individual. This is an example of the ecological fallacy, which affects many studies in this area (Rothman, 1986).

On the other hand, count data, by the fact that they are aggregated, can avoid some of the problems of case events but introduce some new problems. Aggregation increases the ‘local’ sample size and avoids the need to use ‘exact’ addresses which may not be truly exact for the reasons specified above. Census tract data are more closely matched to the underlying population in tracts, and are usually readily available from central government agencies, with few associated confidentiality problems. However, the smoothing involved in counts does yield an invariance at regional level, and a disjunction between individual risk and location. There is no relation which maps the number of cases in a region to locations within the region and, usually, only region-wide explanatory factors are available. Hence, there is another level of ecological fallacy in operation with count data: that is, the problem of ascribing each item in the count to a location and of ascribing a relevant value of an explanatory factor at that location. This additional problem is somewhat balanced by the gain in sample size.

In general, if case event data are available, then this level of resolution should be analysed. If only aggregated data are available, then methods suitable for counts should be employed, although case event models should be used, given that these underlie the counting process. It is not usually recommended that spatial information be lost by aggregation of case events into counts.

2.5 Edge Effects

within