Preferential sampling for presence/absence data and for fusion of presence/absence data with presence-only data
Abstract
Presence/absence data and presence-only data are the two customary sources for learning about species distributions over a region. We present an ambitious agenda with regard to the analysis of such data. We illuminate the fundamental modeling differences between the two types of data. Most simply, locations are considered to be fixed under presence/absence data, whereas locations are random under presence-only data. The definition of "probability of presence" is incompatible between the two. We are not comfortable with modeling strategies in the literature that ignore this incompatibility and that assume that presence/absence modeling can be induced from presence-only specifications and, therefore, that fusion of presence-only and presence/absence data sources is routine. Although data collection may not support this in some cases, we propose that presence/absence data should be modeled at point level, since in nature presence/absence is seen at point locations. If so, we need to specify two surfaces. The first provides the probability of presence at any location in the region. The second provides a realization from this surface in the form of a binary map yielding the results of Bernoulli trials across all locations; this surface is only partially observed. On the other hand, presence-only data should be modeled as a (partially observed) point pattern, arising from a random number of individuals seen at random locations, driven by specification of an intensity function. There is no notion of Bernoulli trials; events are associated with areas. We further suggest that, with just presence/absence data, preferential sampling of locations may arise. Accounting for this, using a shared process perspective, can improve our estimated presence/absence surface as well as prediction of presence. We further propose that preferential sampling can enable a probabilistically coherent fusion of the two data types.
We illustrate with two real data sets, one presence/absence and one presence-only, for invasive species presence in New England in the United States. We demonstrate that potential bias in sampling locations can affect inference with regard to presence/absence and show that inference can be improved with preferential sampling ideas. We also provide a probabilistically coherent fusion of the two data sets, again with the goal of improving inference for presence/absence. The importance of our work lies in encouraging more careful modeling when studying species distributions. Ignoring incompatibility between data types and adopting nongenerative modeling specifications results in invalid inference; the quantitative ecological community should benefit from this recognition.
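The contrast between the two data types can be made concrete with a small simulation. The following is only an illustrative sketch, not the authors' model: the logistic form of `presence_probability`, the Gaussian-bump form of `intensity`, the bound `LAMBDA_MAX`, and the unit-square region are all assumptions chosen for the example. Presence/absence is generated as one Bernoulli trial at each fixed site, while presence-only data are generated as an inhomogeneous Poisson point pattern (by thinning), so both the number of points and their locations are random.

```python
import math
import random


def presence_probability(x, y):
    """Hypothetical probability-of-presence surface p(s) on the unit square."""
    return 1.0 / (1.0 + math.exp(-(4.0 * x - 2.0)))


def simulate_presence_absence(sites, rng):
    """Fixed locations: one Bernoulli trial per site with success probability p(s)."""
    return [1 if rng.random() < presence_probability(x, y) else 0 for x, y in sites]


# Assumed upper bound on the intensity; the bump below peaks at exactly this value.
LAMBDA_MAX = 200.0


def intensity(x, y):
    """Hypothetical intensity lambda(s): expected individuals per unit area."""
    return LAMBDA_MAX * math.exp(-3.0 * ((x - 0.7) ** 2 + (y - 0.5) ** 2))


def poisson_draw(mean, rng):
    """Poisson variate: count unit-rate exponential arrivals falling in [0, mean]."""
    count, total = 0, rng.expovariate(1.0)
    while total < mean:
        count += 1
        total += rng.expovariate(1.0)
    return count


def simulate_presence_only(rng):
    """Random locations: inhomogeneous Poisson process on the unit square via thinning."""
    n_candidates = poisson_draw(LAMBDA_MAX, rng)  # homogeneous process, area = 1
    points = []
    for _ in range(n_candidates):
        x, y = rng.random(), rng.random()
        # Keep each candidate with probability lambda(s) / LAMBDA_MAX.
        if rng.random() < intensity(x, y) / LAMBDA_MAX:
            points.append((x, y))
    return points


rng = random.Random(0)
sites = [(i / 10, j / 10) for i in range(10) for j in range(10)]  # fixed design
pa = simulate_presence_absence(sites, rng)  # binary outcome at every fixed site
po = simulate_presence_only(rng)            # random count of random locations
```

In this sketch `pa` always has exactly one entry per site, whereas the length of `po` changes from run to run. That difference is the fixed-versus-random-location distinction the abstract draws. A preferential-sampling variant would additionally let the choice of `sites` depend on the presence surface, which is not attempted here.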
- Publication: Ecological Monographs
- Pub Date: August 2019
- DOI: 10.1002/ecm.1372
- arXiv: arXiv:1809.01322
- Bibcode: 2019EcoM...89E1372G
- Keywords: Statistics - Methodology
- E-Print: Ecological Monographs, in press