Scott Rinnan

Quantitative Ecology and Resource Management, University of Washington

How to construct a bias file with R for use in MaxEnt modeling

4 Comments

Maximum entropy classification is a popular tool for different types of ecological modeling. Using the MaxEnt software, species distribution modeling has never been easier! Input your climate data, some species occurrence records, and boom! You’ve got yourself a nice probability map of where your little critters can be found. This software is user-friendly by design, but distances the user from the complicated multinomial logistic regression that takes place under the hood. It is not necessary to understand exactly how the model selection works in order to benefit from it, but it is important that you have all the right pieces, in order to ensure a good model fit. (If you do want to understand the software better, the references at the end of the post are an excellent place to begin.)

One of the common issues modelers face is sampling bias. Let’s say you have records of 50 observations of a species of marmot, all occurring on the Olympic Peninsula. What can we infer from this data? Can the species be found only on the Peninsula, or have other areas simply not been surveyed to the same extent? Is the classic conundrum with presence-only data; presence/absence data, by contrast, gives us much more information, since it explicitly contains information where the species does not occur. Unfortunately, people are (understandably) much more likely to record that they saw a creature than they are to record that they didn’t. Most of the data in the public domain are presence-only observations. If we want to infer meaningful information from these data, we’re going to need a way to account for the sampling bias inherent within it.

Lots of the literature out there stresses the importance of accounting for sampling bias [1,4,5], and some papers tell how you might go about doing it. Unfortunately, there doesn’t seem to be much information about how to actually do it. When I first confronted this issue, I was incredibly confused, and the usual online coding resources seemed to be equally full of confusion. Having finally figured it out, I thought I would share the process. I make no claims as to the most efficient or the sleekest code, but it gets the job done.

We will account for the bias in the data by constructing a bias layer. Basically, this layer leverages occurrence data from species in a similar taxon as the one of interest. The Christmas Bird Count is an excellent example: every Christmas, thousands of citizen scientists go out and record every bird they can find, which results in millions of observations. Let’s say that Ms. Birdbrain spent her Christmas hiking around in the mountains, and saw a northern spotted owl and two barred owls, each in a different location. Essentially, we can treat the presence of the barred owl as a “pseudo-absence” of the spotted owl, because if there had been a spotted owl there, presumably Ms. Birdbrain would have recorded it as well. Thus, if we look at the collective records of all birds, we’d have a good idea of the geographical sampling bias, which we can then use to account for the bias in a single species.

This technique assumes that you have the MaxEnt software already installed on your computer. We will be creating a bias file in R, and then interfacing with MaxEnt by use of the dismo package.

library(dismo) # interface with MaxEnt
library(raster) # spatial data manipulation
library(MASS) # for 2D kernel density function
library(magrittr) # for piping functionality, i.e., %>%
library(maptools) # reading shapefiles

The GBIF database provides huge amounts of occurrence data. I downloaded a list of all native bird species in the US from the IUCN database, and then extracted occurrence records for these species using the gbif function from the dismo package. (Note: this takes a long time!)

for(i in 1:nrow(birds)){
     temp<-na.omit(temp[ ,c("lon", "lat")])
     write.csv(temp, paste0("H:/Species occurrences/birds/", birds$Genus[i], " ",
     birds$Species[i]," occurrence.csv"), row.names = F)
}

We now have hundreds of csv files, each containing occurrence records for a different species of bird. Let’s merge these into one giant data frame:

occurdat <- list.files("H:/Species occurrences/Birds", pattern = ".csv$", full = T)
locations <- read.csv(occurdat[1], colClasses = "numeric")

for(i in 2:length(occurdat)){
     temp <- read.csv(occurdat[i], colClasses = "numeric")
     locations <- rbind(locations, temp)
}

We’re going to need these occurrences on a spatial grid, so let’s turn them into a raster. Ultimately, if we’re going to be constructing some species distribution models, we’ll need some climate data. Lots is available in the WorldClim database. We want the occurrence raster to have the same resolution as the climate data, so we’ll use the climate data as an argument in the rasterize function.

climdat <- brick("H:/Bioclim_brick.tif")
occur.ras <- rasterize(locations, climdat, 1)
plot(occur.ras)

Bird occurrences

Now that we have our occurrences on a grid, we can finally make our bias layer. Let’s mask our occurrence data to the United States, so we capture the occurrences just within this country.

states <- readShapePoly("H:/Shapefiles/US_States/states.shp", proj4string = occur.ras@crs)
occur.states <- mask(occur.ras, states) %>% crop(states)

We will use the kde2d function from the MASS package, which gives us a two-dimensional kernel density estimate, based on the coordinates of the occurrence points. (Note: the output of this function is sensitive to the bandwidth selection; if in doubt, use the default.)

presences <- which(values(occur.states) == 1)
pres.locs <- coordinates(occur.states)[presences, ]

dens <- kde2d(pres.locs[,1], pres.locs[,2], n = c(nrow(occur.states), ncol(occur.states)))
dens.ras <- raster(dens)
plot(dens.ras)

Bird bias

And that, dear readers, is our bias layer! Let’s save it for future use in MaxEnt:

writeRaster(dens.ras, "H:/Species Occurrences/Bird bias file.tif")

Now that that’s taken care of, we are finally free to start modeling individual species, which is comparatively simple. Let’s do it for the first species of bird we downloaded. The bias file is included in the MaxEnt model as one of the options you can specify with the args argument. (A complete list of model options can be found in the MaxEnt help file, or here.)

occurrences <- read.csv(occurdat[1])
mod1 <- maxent(climdat, occurrences, args = "biasfile=dens.ras")

Tah-dah! You just unbiased your occurrence data. Ms. Birdbrain would be proud.

Here are some examples of visualization of model output. These are for Ochotona princeps, the American pika.

This slideshow requires JavaScript.

Good luck, and happy modeling!


References:

  1. Beck, Jan, et al. “Spatial bias in the GBIF database and its effect on modeling species’ geographic distributions.” Ecological Informatics 19 (2014): 10-15.
  2. Elith, Jane, et al. “A statistical explanation of MaxEnt for ecologists.” Diversity and Distributions 17.1 (2011): 43-57.
  3. Phillips, Steven J., Robert P. Anderson, and Robert E. Schapire. “Maximum entropy modeling of species geographic distributions.” Ecological modelling 190.3 (2006): 231-259.
  4. Phillips, Steven J., et al. “Sample selection bias and presence-only distribution models: implications for background and pseudo-absence data.” Ecological Applications 19.1 (2009): 181-197.
  5. Syfert, Mindy M., Matthew J. Smith, and David A. Coomes. “The effects of sampling bias and model complexity on the predictive performance of MaxEnt species distribution models.” PloS one 8.2 (2013): e55158.
Advertisements

4 thoughts on “How to construct a bias file with R for use in MaxEnt modeling

  1. Thanks for this valuable resource Dr. Rinnan

  2. Hello Scott,

    Thanks for the useful code! Nicely presented as well.

    For people that are less comfortable with R although, using the Gaussian Kernel Density (GKD) Tool in the SDMToolbox of ArcGIS is also an option.

    All that needs to be done is to import all the records that one wishes to use, convert them a shapefile and then use the GKD tool with the produced shapefile.

    Thanks again for this useful information.

    Regards,
    Lorenzo

  3. Am I missing something? Dismo seems to completely ignore the argument: biasfile=dens.ras.

    • Hi Sam,

      Including it as an argument of the maxent function works for me, but it’s probably more proper to call it with the ‘args’ argument, like so:

      maxent(…, args = “biasfile=dens.ras”)

      I have updated my code on the blog post to reflect this.

      Regards,
      Scott

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s