---
title: "Using customized gazetteers"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using customized gazetteers}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteSuggests{rgbif}
  \usepackage[utf8]{inputenc}
---

```{r options, echo = FALSE}
knitr::opts_chunk$set(eval = FALSE)
```

*CoordinateCleaner* identifies potentially erroneous geographic records with coordinates assigned to the sea, country centroids, country capitals, urban areas, institutions, the GBIF headquarters and countries, based on comparison with geographic gazetteers (i.e. reference databases). All of these functions include default reference databases compiled from various sources. These default references were selected to be suitable for regional to global analyses. They will also work for smaller scale analyses, but in some cases different references might be desirable and available. This could be, for instance, centroids of small scale political units, a different set of urban areas, or a different coastline when working with coastal species. To account for this, each *CoordinateCleaner* function using a gazetteer has a `ref` argument to specify custom gazetteers.

We will use the case of coastlines and a coastal species to demonstrate the application of custom gazetteers. The purpose of `cc_sea` is to flag records in the sea, since these often represent erroneous and undesired records for terrestrial organisms. The standard gazetteer for this function is fetched from naturalearthdata.com at a 1:50m scale. However, coordinates available from public databases are often only precise at the scale of kilometres, which can lead to overly critical flagging of coordinates close to the coastline. This is a problem especially for coastal or intertidal species. We illustrate the issue for the mangrove tree genus *Avicennia*.

```{r}
library(CoordinateCleaner)
library(dplyr)
library(ggplot2)
library(rgbif)
library(viridis)
library(terra)

# download data from GBIF
dat <- rgbif::occ_search(scientificName = "Avicennia", limit = 1000,
                         hasCoordinate = TRUE)

dat <- dat$data

dat <- dat %>%
  dplyr::select(species = name, decimalLongitude = decimalLongitude,
                decimalLatitude = decimalLatitude, countryCode)

# run with default gazetteer
outl <- cc_sea(dat, value = "flagged")
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\az64mycy\AppData\Local\Temp\Rtmp4SRhHV", layer: "ne_110m_land"
## with 127 features
## It has 3 fields

# cc_sea returns TRUE for clean records, so invert to mark the flagged ones
plo <- data.frame(dat, outlier = as.factor(!outl))

# plot results
ggplot() +
  borders(fill = "grey60") +
  geom_point(data = plo,
             aes(x = decimalLongitude, y = decimalLatitude, col = outlier)) +
  scale_color_viridis(discrete = TRUE, name = "Flagged outlier") +
  coord_fixed() +
  theme_bw() +
  theme(legend.position = "bottom")
```

![plot of chunk cusgaz1](cusgaz-cusgaz1-1.png)

A large number of the coastal records get flagged. In this case that is undesirable, because the flags do not reflect wrong records but rather the limited precision of the coordinates and the resolution of the reference. To avoid this problem you can use a buffered reference, which does not flag records close to the coastline and only flags records from the open ocean. *CoordinateCleaner* comes with a one degree buffered reference (`buffland`). In case a narrower or distance-true buffer is necessary, you can provide any SpatVector similar in structure to `buffland` via the `ref` argument.

```{r}
# The buffered custom gazetteer
data("buffland")
buffland <- terra::vect(buffland)
plot(buffland)
```

![plot of chunk cusgaz2](cusgaz-cusgaz2-1.png)

```{r}
# run with custom gazetteer
outl <- cc_sea(dat, value = "flagged", ref = buffland)

plo <- data.frame(dat, outlier = as.factor(!outl))

# plot results
ggplot() +
  borders(fill = "grey60") +
  geom_point(data = plo,
             aes(x = decimalLongitude, y = decimalLatitude, col = outlier)) +
  scale_color_viridis(discrete = TRUE, name = "Flagged outlier") +
  coord_fixed() +
  theme_bw() +
  theme(legend.position = "bottom")
```

![plot of chunk cusgaz2](cusgaz-cusgaz2-2.png)
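
The narrower, distance-true buffer mentioned above could be built with `terra`. The chunk below is only a minimal sketch of the idea, not part of *CoordinateCleaner*: `buffer_land()` is a hypothetical helper, the square "island" and the three test points are invented for illustration, and the 10 km width is an arbitrary assumption. It relies on `terra::buffer()` interpreting `width` in metres when the data are in longitude/latitude.

```{r}
library(terra)

# Hypothetical helper (not part of CoordinateCleaner):
# buffer land polygons by a distance in metres and dissolve overlaps
buffer_land <- function(land, width_m) {
  buffered <- terra::buffer(land, width = width_m)
  terra::aggregate(buffered)
}

# Toy land polygon (an invented square island) in lon/lat
land <- terra::vect("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))")
terra::crs(land) <- "EPSG:4326"

# 10 km buffer; the width is an assumption, adjust to the coordinate precision
land_buff <- buffer_land(land, width_m = 10000)

# Three invented records: inland, a few km offshore, and open ocean
pts <- data.frame(decimalLongitude = c(0.5, 1.05, 2),
                  decimalLatitude = c(0.5, 0.5, 0.5))
pts <- terra::vect(pts, geom = c("decimalLongitude", "decimalLatitude"),
                   crs = "EPSG:4326")

# Which records fall on the raw and on the buffered land?
on_land <- terra::is.related(pts, land, "intersects")
on_buffered <- terra::is.related(pts, land_buff, "intersects")
data.frame(on_land, on_buffered)
```

With a real land polygon, for example one obtained via `rnaturalearth::ne_download(scale = 50, type = "land", category = "physical")` and converted with `terra::vect()`, the same helper should yield a SpatVector that can be tried as `ref` in `cc_sea`. Whether it works unchanged depends on it matching the structure of `buffland`, which is worth checking (for instance by comparing both objects with `print()`) before relying on the flags.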