Package 'CoordinateCleaner' reference manual

Title:	Automated Cleaning of Occurrence Records from Biological Collections
Description:	Automated flagging of common spatial and temporal errors in biological and paleontological collection data, for the use in conservation, ecology and paleontology. Includes automated tests to easily flag (and exclude) records assigned to country or province centroid, the open ocean, the headquarters of the Global Biodiversity Information Facility, urban areas or the location of biodiversity institutions (museums, zoos, botanical gardens, universities). Furthermore identifies per species outlier coordinates, zero coordinates, identical latitude/longitude and invalid coordinates. Also implements an algorithm to identify data sets with a significant proportion of rounded coordinates. Especially suited for large data sets. The reference for the methodology is: Zizka et al. (2019) <doi:10.1111/2041-210X.13152>.
Authors:	Alexander Zizka [aut, cre], Daniele Silvestro [ctb], Tobias Andermann [ctb], Josue Azevedo [ctb], Camila Duarte Ritter [ctb], Daniel Edler [ctb], Harith Farooq [ctb], Andrei Herdean [ctb], Maria Ariza [ctb], Ruud Scharn [ctb], Sten Svanteson [ctb], Niklas Wengstrom [ctb], Vera Zizka [ctb], Alexandre Antonelli [ctb], Bruno Vilela [ctb] (Bruno updated the package to remove dependencies on sp, raster, rgdal, maptools, and rgeos packages), Irene Steves [rev] (Irene reviewed the package for ropensci, see <https://github.com/ropensci/onboarding/issues/210>), Francisco Rodriguez-Sanchez [rev] (Francisco reviewed the package for ropensci, see <https://github.com/ropensci/onboarding/issues/210>)
Maintainer:	Alexander Zizka <zizka.alexander@gmail.com>
License:	GPL-3
Version:	3.0.1
Built:	2025-03-14 07:02:02 UTC
Source:	https://github.com/ropensci/CoordinateCleaner

Artificial Hotspot Occurrence Inventory

Description

A data frame with information on the Artificial Hotspot Occurrence Inventory (AHOI), as published in Park et al. (2023). For more details, see the reference.

Source

https://onlinelibrary.wiley.com/doi/10.1111/jbi.14543

References

Park, D. S., Xie, Y., Thammavong, H. T., Tulaiha, R., & Feng, X. (2023). Artificial Hotspot Occurrence Inventory (AHOI). Journal of Biogeography, 50, 441–449. doi:10.1111/jbi.14543

Examples


data("aohi")

Global Coastlines buffered by 1 degree

Description

A SpatVector with global coastlines, with a 1 degree buffer to extend coastlines, as an alternative reference for cc_sea. Can be useful to identify species in the sea without flagging records in mangroves, marshes, etc.

Source

https://www.naturalearthdata.com/downloads/10m-physical-vectors/

Examples


data("buffland")

Global Coastlines buffered by -1 degree

Description

A SpatVector with global coastlines, with a -1 degree buffer applied to coastlines, as an alternative reference for cc_sea. Can be useful to identify marine species on land without flagging records in estuaries, etc.

Source

https://www.naturalearthdata.com/downloads/10m-physical-vectors/

Examples


data("buffsea")

Identify Coordinates in Artificial Hotspot Occurrence Inventory

Description

Removes or flags records within the Artificial Hotspot Occurrence Inventory (AHOI). Poorly geo-referenced occurrence records in biological databases are often erroneously geo-referenced to highly recurring coordinates, which were assessed by Park et al. (2023). See the reference for more details.

Usage

cc_aohi(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  species = "species",
  taxa = c("Aves", "Insecta", "Mammalia", "Plantae"),
  buffer = 10000,
  geod = TRUE,
  value = "clean",
  verbose = TRUE
)

Arguments

`x`	data.frame. Containing geographical coordinates and species names.
`lon`	character string. The column with the longitude coordinates. Default = “decimalLongitude”.
`lat`	character string. The column with the latitude coordinates. Default = “decimalLatitude”.
`species`	character string. The column with the species identity. Only required if verify = TRUE.
`taxa`	character vector. The Artificial Hotspot Occurrence Inventory (AHOI) was created based on four taxa: birds (Aves), insects (Insecta), mammals (Mammalia), and plants (Plantae). Users can keep all of them or choose any subset to define the AHOI locations. Default is to keep all: c("Aves", "Insecta", "Mammalia", "Plantae").
`buffer`	The buffer around each AHOI coordinate, where records should be flagged as problematic. Units depend on geod. Default = 10 kilometres.
`geod`	logical. If TRUE the radius around each AHOI coordinate is calculated based on a sphere, buffer is in meters and independent of latitude. If FALSE the radius is calculated assuming planar coordinates and varies slightly with latitude. Default = TRUE. See https://seethedatablog.wordpress.com/ for detail and credits.
`value`	character string. Defining the output value. See value.
`verbose`	logical. If TRUE reports the name of the test and the number of records flagged.

Value

Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
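
As a rough illustration of the two output types, the following base R sketch mimics the TRUE = test passed convention with two hand-made checks. The data and both checks are hypothetical stand-ins and do not call the package; they only show how a “flagged” vector relates to a “clean” data.frame.

```r
# Hypothetical records; the logical vectors stand in for "flagged" outputs.
records <- data.frame(
  species = c("a", "b", "c", "d"),
  decimalLongitude = c(10.5, 0, -47.92, 120.1),
  decimalLatitude = c(45.2, 0, -15.78, -33.4)
)

# TRUE = test passed, FALSE = potentially problematic.
flag_zero <- !(records$decimalLongitude == 0 & records$decimalLatitude == 0)
flag_hotspot <- !(records$decimalLongitude == -47.92 &
                    records$decimalLatitude == -15.78)

# A record is kept only if it passes every test.
keep <- flag_zero & flag_hotspot

# Counterpart of value = "clean": the subset of passing rows.
clean <- records[keep, ]
```

Keeping the logical vectors separate makes it possible to inspect which check rejected a record before dropping it.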

Note

See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.

References

Park, D. S., Xie, Y., Thammavong, H. T., Tulaiha, R., & Feng, X. (2023). Artificial Hotspot Occurrence Inventory (AHOI). Journal of Biogeography, 50, 441–449. doi:10.1111/jbi.14543

Examples


x <- data.frame(species = letters[1:10], 
                decimalLongitude = c(runif(99, -180, 180), -47.92), 
                decimalLatitude = c(runif(99, -90,90), -15.78))
cc_aohi(x)

Identify Coordinates in Vicinity of Country Capitals

Description

Removes or flags records within a certain radius around country capitals. Poorly geo-referenced occurrence records in biological databases are often erroneously geo-referenced to capitals.

Usage

cc_cap(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  species = "species",
  buffer = 10000,
  geod = TRUE,
  ref = NULL,
  verify = FALSE,
  value = "clean",
  verbose = TRUE
)

Arguments

`x`	data.frame. Containing geographical coordinates and species names.
`lon`	character string. The column with the longitude coordinates. Default = “decimalLongitude”.
`lat`	character string. The column with the latitude coordinates. Default = “decimalLatitude”.
`species`	character string. The column with the species identity. Only required if verify = TRUE.
`buffer`	The buffer around each capital coordinate (the centre of the city), where records should be flagged as problematic. Units depend on geod. Default = 10 kilometres.
`geod`	logical. If TRUE the radius around each capital is calculated based on a sphere, buffer is in meters and independent of latitude. If FALSE the radius is calculated assuming planar coordinates and varies slightly with latitude. Default = TRUE. See https://seethedatablog.wordpress.com/ for detail and credits.
`ref`	SpatVector (geometry: polygons). Providing the geographic gazetteer. Can be any SpatVector (geometry: polygons), but the structure must be identical to `countryref`. Default = `countryref`.
`verify`	logical. If TRUE, records are only flagged if they are the only record in a given species flagged close to a given reference. If FALSE, the distance is the only criterion.
`value`	character string. Defining the output value. See value.
`verbose`	logical. If TRUE reports the name of the test and the number of records flagged.

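Value

Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
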
Note

See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.

Examples

## Not run: 
x <- data.frame(species = letters[1:10],
                decimalLongitude = c(runif(99, -180, 180), -47.882778),
                decimalLatitude = c(runif(99, -90, 90), -15.793889))

cc_cap(x)
cc_cap(x, value = "flagged")

## End(Not run)

Identify Coordinates in Vicinity of Country and Province Centroids

Description

Removes or flags records within a radius around the geographic centroids of political countries and provinces. Poorly geo-referenced occurrence records in biological databases are often erroneously geo-referenced to centroids.

Usage

cc_cen(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  species = "species",
  buffer = 1000,
  geod = TRUE,
  test = "both",
  ref = NULL,
  verify = FALSE,
  value = "clean",
  verbose = TRUE
)

Arguments

`x`	data.frame. Containing geographical coordinates and species names.
`lon`	character string. The column with the longitude coordinates. Default = “decimalLongitude”.
`lat`	character string. The column with the latitude coordinates. Default = “decimalLatitude”.
`species`	character string. The column with the species identity. Only required if verify = TRUE.
`buffer`	numerical. The buffer around each province or country centroid, where records should be flagged as problematic. Units depend on geod. Default = 1 kilometre.
`geod`	logical. If TRUE the radius around each centroid is calculated based on a sphere, buffer is in meters and independent of latitude. If FALSE the radius is calculated assuming planar coordinates and varies slightly with latitude. Default = TRUE. See https://seethedatablog.wordpress.com/ for detail and credits.
`test`	a character string. Specifying the details of the test. One of c(“both”, “country”, “provinces”). If “both”, tests for both country and province centroids.
`ref`	SpatVector (geometry: polygons). Providing the geographic gazetteer. Can be any SpatVector (geometry: polygons), but the structure must be identical to `countryref`. Default = `countryref`.
`verify`	logical. If TRUE, records are only flagged if they are the only record in a given species flagged close to a given reference. If FALSE, the distance is the only criterion.
`value`	character string. Defining the output value. See value.
`verbose`	logical. If TRUE reports the name of the test and the number of records flagged.

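Value

Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
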
Note

See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.

Examples


x <- data.frame(species = letters[1:10], 
                decimalLongitude = c(runif(99, -180, 180), -47.92), 
                decimalLatitude = c(runif(99, -90,90), -15.78))
cc_cen(x, geod = FALSE)

## Not run: 
cc_cen(x, value = "flagged", buffer = 50000) # geod = TRUE (default), buffer in meters

## End(Not run)

Identify Coordinates Outside their Reported Country

Description

Removes or flags mismatches between geographic coordinates and additional country information (usually this information is reliably reported with specimens). Such a mismatch can occur for example, if latitude and longitude are switched.

Usage

cc_coun(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  iso3 = "countrycode",
  value = "clean",
  ref = NULL,
  ref_col = "iso_a3",
  verbose = TRUE,
  buffer = NULL
)

Arguments

`x`	data.frame. Containing geographical coordinates and species names.
`lon`	character string. The column with the longitude coordinates. Default = “decimalLongitude”.
`lat`	character string. The column with the latitude coordinates. Default = “decimalLatitude”.
`iso3`	a character string. The column with the country assignment of each record in three letter ISO code. Default = “countrycode”.
`value`	character string. Defining the output value. See value.
`ref`	SpatVector (geometry: polygons). Providing the geographic gazetteer. Can be any SpatVector (geometry: polygons), but the structure must be identical to `rnaturalearth::ne_countries(scale = "medium", returnclass = "sf")`. Default = `rnaturalearth::ne_countries(scale = "medium", returnclass = "sf")`
`ref_col`	the column name in the reference dataset containing the relevant ISO codes for matching. Default = "iso_a3" (as shown under Usage), which refers to the ISO-3 codes in the reference dataset. See Note.
`verbose`	logical. If TRUE reports the name of the test and the number of records flagged.
`buffer`	numeric. Units are in meters. If provided, a buffer is created around each country polygon.

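Value

Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
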
Note

The ref_col argument allows adapting the function to the structure of alternative reference datasets. For instance, with rnaturalearth::ne_countries(scale = "small") as reference, ref_col = "iso_a3" will work.

With the default reference, records are flagged if they fall outside the terrestrial territory of countries, hence records in territorial waters might be flagged. See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.

Examples


## Not run: 
x <- data.frame(species = letters[1:10],
                decimalLongitude = runif(100, -20, 30),
                decimalLatitude = runif(100, 35,60),
                countrycode = "RUS")

cc_coun(x, value = "flagged")#non-terrestrial records are flagged as wrong.

## End(Not run)

Identify Duplicated Records

Description

Removes or flags duplicated records based on species name and coordinates, as well as user-defined additional columns. True (specimen) duplicates or duplicates from the same species can make up the bulk of records in a biological collection database, but are undesirable for many analyses. Both can be flagged with this function, the former given enough additional information.

Usage

cc_dupl(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  species = "species",
  additions = NULL,
  value = "clean",
  verbose = TRUE
)

Arguments

`x`	data.frame. Containing geographical coordinates and species names.
`lon`	character string. The column with the longitude coordinates. Default = “decimalLongitude”.
`lat`	character string. The column with the latitude coordinates. Default = “decimalLatitude”.
`species`	a character string. The column with the species name. Default = “species”.
`additions`	a vector of character strings. Additional columns to be included in the test for duplication. For example as below, collector name and collector number.
`value`	character string. Defining the output value. See value.
`verbose`	logical. If TRUE reports the name of the test and the number of records flagged.
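
The effect of additions can be pictured with base R's duplicated(). This is only a sketch of the idea, under the assumption that the first occurrence of each combination is kept; the data are made up and the package's own implementation may differ.

```r
# Hypothetical records: rows 1-3 share species and coordinates.
x <- data.frame(
  species = c("a", "a", "a", "b"),
  decimalLongitude = c(1, 1, 1, 1),
  decimalLatitude = c(2, 2, 2, 2),
  collector = c("Bonpl", "Bonpl", "Humb", "Bonpl"),
  stringsAsFactors = FALSE
)

base_cols <- c("species", "decimalLongitude", "decimalLatitude")

# Species + coordinates only: rows 2 and 3 repeat row 1.
passed_base <- !duplicated(x[, base_cols])

# Adding the collector column: row 3 now counts as a distinct record.
passed_add <- !duplicated(x[, c(base_cols, "collector")])
```

Every extra column narrows the definition of a duplicate, so fewer records are removed.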

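Value

Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
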
Examples


x <- data.frame(species = letters[1:10], 
                decimalLongitude = sample(x = 0:10, size = 100, replace = TRUE), 
                decimalLatitude = sample(x = 0:10, size = 100, replace = TRUE),
                collector = "Bonpl",
                collector.number = c(1001, 354),
                collection = rep(c("K", "WAG","FR", "P", "S"), 20))

cc_dupl(x, value = "flagged")
cc_dupl(x, additions = c("collector", "collector.number"))

Identify Records with Identical lat/lon

Description

Removes or flags records with equal latitude and longitude coordinates, either exact or absolute. Equal coordinates can often indicate data entry errors.

Usage

cc_equ(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  test = "absolute",
  value = "clean",
  verbose = TRUE
)

Arguments

`x`	data.frame. Containing geographical coordinates and species names.
`lon`	character string. The column with the longitude coordinates. Default = “decimalLongitude”.
`lat`	character string. The column with the latitude coordinates. Default = “decimalLatitude”.
`test`	character string. Defines if coordinates are compared exactly (“identical”) or on the absolute scale (i.e. -1 = 1, “absolute”). Default = “absolute”.
`value`	character string. Defining the output value. See value.
`verbose`	logical. If TRUE reports the name of the test and the number of records flagged.
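
The difference between the two test modes can be sketched in a few lines of base R. The coordinates are invented for illustration and the code is not taken from the package.

```r
# Hypothetical coordinates.
lon <- c(10, -25.5, 25.5, 3)
lat <- c(10, -25.5, -25.5, 4)

# "identical": values must match exactly, sign included.
equal_identical <- lon == lat

# "absolute": values match once the sign is dropped (i.e. -1 = 1).
equal_absolute <- abs(lon) == abs(lat)

# Follow the TRUE = test passed convention.
passed_identical <- !equal_identical
passed_absolute <- !equal_absolute
```

The third record (25.5, -25.5) passes the “identical” comparison but fails the “absolute” one.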

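Value

Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
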
Examples


x <- data.frame(species = letters[1:10], 
                decimalLongitude = runif(100, -180, 180), 
                decimalLatitude = runif(100, -90,90))

cc_equ(x)
cc_equ(x, value = "flagged")

Identify Records Assigned to GBIF Headquarters

Description

Removes or flags records within a buffer (see the buffer argument) around the GBIF headquarters in Copenhagen, DK.

Usage

cc_gbif(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  species = "species",
  buffer = 1000,
  geod = TRUE,
  verify = FALSE,
  value = "clean",
  verbose = TRUE
)

Arguments

`x`	data.frame. Containing geographical coordinates and species names.
`lon`	character string. The column with the longitude coordinates. Default = “decimalLongitude”.
`lat`	character string. The column with the latitude coordinates. Default = “decimalLatitude”.
`species`	character string. The column with the species identity. Only required if verify = TRUE.
`buffer`	numerical. The buffer around the GBIF headquarters, where records should be flagged as problematic. Units depend on geod. Default = 1000 (1 kilometre if geod = TRUE).
`geod`	logical. If TRUE the radius is calculated based on a sphere, buffer is in meters. If FALSE the radius is calculated in degrees. Default = TRUE.
`verify`	logical. If TRUE, records are only flagged if they are the only record in a given species flagged close to a given reference. If FALSE, the distance is the only criterion.
`value`	character string. Defining the output value. See value.
`verbose`	logical. If TRUE reports the name of the test and the number of records flagged.

Details

Not recommended if working with records from Denmark or the Copenhagen area.
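
To make the meter-based buffer concrete, the sketch below computes great-circle distances to a single reference point with the haversine formula and flags records inside the buffer. The reference point is borrowed from the example coordinates in this section, and the spherical radius of 6371 km is an assumption; neither is taken from the package's internals.

```r
# Great-circle distance in meters between points and one reference point.
haversine_m <- function(lon1, lat1, lon2, lat2, radius_m = 6371000) {
  to_rad <- pi / 180
  dlon <- (lon2 - lon1) * to_rad
  dlat <- (lat2 - lat1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_m * asin(sqrt(pmin(a, 1)))
}

# Assumed reference point (lon, lat), taken from the example data.
ref_lon <- 12.58
ref_lat <- 55.67
buffer <- 1000  # meters

# Hypothetical records: on the point, a few hundred meters away, far away.
lon <- c(12.58, 12.585, 12.58)
lat <- c(55.67, 55.672, 30.00)

dist_m <- haversine_m(lon, lat, ref_lon, ref_lat)

# TRUE = test passed (record lies outside the buffer).
passed <- dist_m > buffer
```

Because the distance is measured in meters on a sphere, the same buffer covers the same ground distance at any latitude.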

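Value

Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
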
Examples


x <- data.frame(species = "A", 
                decimalLongitude = c(12.58, 12.58), 
                decimalLatitude = c(55.67, 30.00))
                
cc_gbif(x)
cc_gbif(x, value = "flagged")

Identify Records in the Vicinity of Biodiversity Institutions

Description

Removes or flags records assigned to the location of zoos, botanical gardens, herbaria, universities and museums, based on a global database of ~10,000 such biodiversity institutions. Coordinates from these locations can be related to data-entry errors, false automated geo-reference or individuals in captivity/horticulture.

Usage

cc_inst(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  species = "species",
  buffer = 100,
  geod = FALSE,
  ref = NULL,
  verify = FALSE,
  verify_mltpl = 10,
  value = "clean",
  verbose = TRUE
)

Arguments

`x`	data.frame. Containing geographical coordinates and species names.
`lon`	character string. The column with the longitude coordinates. Default = “decimalLongitude”.
`lat`	character string. The column with the latitude coordinates. Default = “decimalLatitude”.
`species`	character string. The column with the species identity. Only required if verify = TRUE.
`buffer`	numerical. The buffer around each institution, where records should be flagged as problematic. Units depend on geod: meters if geod = TRUE, decimal degrees otherwise (see Details). Default = 100.
`geod`	logical. If TRUE the radius around each institution is calculated based on a sphere, buffer is in meters and independent of latitude. If FALSE the radius is calculated assuming planar coordinates and varies slightly with latitude. Default = FALSE. See https://seethedatablog.wordpress.com/ for detail and credits.
`ref`	SpatVector (geometry: polygons). Providing the geographic gazetteer. Can be any SpatVector (geometry: polygons), but the structure must be identical to `institutions`. Default = `institutions`.
`verify`	logical. If TRUE, records close to institutions are only flagged, if there are no other records of the same species in the greater vicinity (a radius of buffer * verify_mltpl).
`verify_mltpl`	numerical. indicates the factor by which the radius for verify exceeds the radius of the initial test. Default = 10, which might be suitable if geod is TRUE, but might be too large otherwise.
`value`	character string. Defining the output value. See value.
`verbose`	logical. If TRUE reports the name of the test and the number of records flagged.

Details

Note: the buffer radius is in degrees, thus will differ slightly between different latitudes.
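
The interplay of buffer, verify and verify_mltpl can be sketched with planar distances in degrees. Everything here is hypothetical: the single institution reuses the example coordinates from this section, the records are invented, and reading "other records in the greater vicinity" as records beyond buffer but within buffer * verify_mltpl is an assumption rather than the package's confirmed logic.

```r
# Planar distance in decimal degrees to one reference point.
planar_dist <- function(lon, lat, ref_lon, ref_lat) {
  sqrt((lon - ref_lon)^2 + (lat - ref_lat)^2)
}

inst_lon <- 37.5778   # hypothetical institution location
inst_lat <- 55.7108
buffer <- 0.1         # decimal degrees
verify_mltpl <- 10

x <- data.frame(
  species = c("a", "a", "b", "b"),
  decimalLongitude = c(37.58, 37.90, 37.57, 80.00),
  decimalLatitude = c(55.71, 55.90, 55.72, 10.00),
  stringsAsFactors = FALSE
)

d <- planar_dist(x$decimalLongitude, x$decimalLatitude, inst_lon, inst_lat)

near <- d <= buffer                                  # inside the test radius
vicinity <- d > buffer & d <= buffer * verify_mltpl  # inside the wider radius

# verify = FALSE: distance is the only criterion.
passed_plain <- !near

# verify = TRUE: a near record is only flagged if its species has no
# other record in the wider vicinity.
support <- tapply(vicinity, x$species, any)
has_support <- as.logical(support[as.character(x$species)])
passed_verify <- !(near & !has_support)
```

Species "a" has a second record in the wider vicinity, so its record next to the institution is retained under verify = TRUE, while the isolated record of species "b" is still flagged.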

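Value

Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
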
Examples


x <- data.frame(species = letters[1:10],
                decimalLongitude = c(runif(99, -180, 180), 37.577800),
                decimalLatitude = c(runif(99, -90,90), 55.710800))

#large buffer for demonstration, using geod = FALSE for shorter runtime
cc_inst(x, value = "flagged", buffer = 10, geod = FALSE)

## Not run: 
cc_inst(x, value = "flagged", buffer = 50000, geod = TRUE) # buffer in meters

## End(Not run)

Identify Records Outside Natural Ranges

Description

Removes or flags records outside of the provided natural range polygon, on a per species basis. Expects one entry per species. See the example or https://www.iucnredlist.org/resources/spatial-data-download for the required polygon structure.

Usage

cc_iucn(
  x,
  range,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  species = "species",
  buffer = 0,
  value = "clean",
  verbose = TRUE
)

Arguments

`x`	data.frame. Containing geographical coordinates and species names.
`range`	a SpatVector of natural ranges for species in x. Must contain a column named as indicated by `species`. See details.
`lon`	character string. The column with the longitude coordinates. Default = “decimalLongitude”.
`lat`	character string. The column with the latitude coordinates. Default = “decimalLatitude”.
`species`	a character string. The column with the species name. Default = “species”.
`buffer`	numerical. The buffer around each species' range, from where records should be flagged as problematic, in meters. Default = 0.
`value`	character string. Defining the output value. See value.
`verbose`	logical. If TRUE reports the name of the test and the number of records flagged.

Details

Download natural range maps in suitable format for amphibians, birds, mammals and reptiles from https://www.iucnredlist.org/resources/spatial-data-download. Note: the buffer radius is in degrees, thus will differ slightly between different latitudes.

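Value

Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
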
Note

See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.

Examples

library(terra)

x <- data.frame(species = c("A", "B"),
                decimalLongitude = runif(100, -170, 170),
                decimalLatitude = runif(100, -80, 80))

range_species_A <- cbind(c(-45,-45,-60,-60,-45), c(-10,-25,-25,-10,-10))
rangeA <- terra::vect(range_species_A, "polygons")
range_species_B <- cbind(c(15,15,32,32,15), c(10,-10,-10,10,10))
rangeB <- terra::vect(range_species_B, "polygons")
range <- terra::vect(list(rangeA, rangeB))
range$binomial <- c("A", "B")

cc_iucn(x = x, range = range, buffer = 0)

Identify Geographic Outliers in Species Distributions

Description

Removes or flags records that are outliers in geographic space according to the method defined via the method argument. Geographic outliers often represent erroneous coordinates, for example due to data entry errors, imprecise geo-references, or individuals in horticulture/captivity.

Usage

cc_outl(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  species = "species",
  method = "quantile",
  mltpl = 5,
  tdi = 1000,
  value = "clean",
  sampling_thresh = 0,
  verbose = TRUE,
  min_occs = 7,
  thinning = FALSE,
  thinning_res = 0.5
)

Arguments

`x`	data.frame. Containing geographical coordinates and species names.
`lon`	character string. The column with the longitude coordinates. Default = “decimalLongitude”.
`lat`	character string. The column with the latitude coordinates. Default = “decimalLatitude”.
`species`	character string. The column with the species name. Default = “species”.
`method`	character string. Defining the method for outlier selection. See details. One of “distance”, “quantile”, “mad”. Default = “quantile”.
`mltpl`	numeric. The multiplier of the interquartile range (`method == 'quantile'`) or median absolute deviation (`method == 'mad'`) to identify outliers. See details. Default = 5.
`tdi`	numeric. The minimum absolute distance (`method == 'distance'`) of a record to all other records of a species to be identified as outlier, in km. See details. Default = 1000.
`value`	character string. Defining the output value. See value.
`sampling_thresh`	numeric. Cut off threshold for the sampling correction. Indicates the quantile of sampling in which outliers should be ignored. For instance, if `sampling_thresh` == 0.25, records in the 25% worst sampled countries will not be flagged as outliers. Default = 0 (no sampling correction).
`verbose`	logical. If TRUE reports the name of the test and the number of records flagged.
`min_occs`	Minimum number of geographically unique datapoints needed for a species to be tested. This is necessary for reliable outlier estimation. Species with fewer than min_occs records will not be tested and the output value will be 'TRUE'. Default = 7. If `method == 'distance'`, consider a lower threshold.
`thinning`	forces a raster approximation for the distance calculation. This is routinely used for species with more than 10,000 records for computational reasons, but can be enforced for smaller datasets, which is recommended when sampling is very uneven.
`thinning_res`	The resolution for the spatial thinning in decimal degrees. Default = 0.5.

Details

The method for outlier identification depends on the method argument. If “quantile”: a boxplot method is used and records are flagged as outliers if their mean distance to all other records of the same species is larger than mltpl * the interquartile range of the mean distance of all records of this species. If “mad”: the median absolute deviation is used. In this case a record is flagged as an outlier if the mean distance to all other records of the same species is larger than the median of the mean distance of all points plus/minus the mad of the mean distances of all records of the species * mltpl. If “distance”: records are flagged as outliers if the minimum distance to the next record of the species is > tdi. For species with records from > 10000 unique locations, a random sample of 1000 records is used for the distance matrix calculation. The test skips species with fewer than min_occs geographically unique records.

The likelihood of occurrence records being erroneous outliers is linked to the sampling effort in any given location. To account for this, the sampling correction (see sampling_thresh) fetches the number of occurrence records available from www.gbif.org per country as a proxy of sampling effort. The outlier test (the mean distance) for each record is then weighted by the log-transformed number of records per square kilometre in this country. See https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13152 for an example and further explanation of the outlier test.
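
The three outlier rules can be sketched in base R for the records of a single species. This is an illustration under stated assumptions, not the package's code: distances use the haversine formula on a sphere of radius 6371 km, the “quantile” threshold is read as the upper boxplot fence (third quartile + mltpl * interquartile range), the “mad” threshold as median + mltpl * mad, and the coordinates are invented. No sampling correction or thinning is applied.

```r
# All pairwise great-circle distances in km (haversine formula on a sphere).
pairwise_km <- function(lon, lat, radius_km = 6371) {
  lon <- lon * pi / 180
  lat <- lat * pi / 180
  a <- sin(outer(lat, lat, "-") / 2)^2 +
    outer(cos(lat), cos(lat)) * sin(outer(lon, lon, "-") / 2)^2
  a[a > 1] <- 1  # guard against rounding slightly above 1
  2 * radius_km * asin(sqrt(a))
}

# Hypothetical records of ONE species: a tight cluster plus one distant point.
lon <- c(10, 10.5, 11, 10.2, 10.8, 11.3, 10.6, 100)
lat <- c(50, 50.5, 51, 50.2, 50.8, 50.1, 51.2, -20)

d <- pairwise_km(lon, lat)
diag(d) <- NA                               # ignore the distance to self
mean_dist <- rowMeans(d, na.rm = TRUE)      # mean distance to all other records
min_dist <- apply(d, 1, min, na.rm = TRUE)  # distance to the nearest record

mltpl <- 5
tdi <- 1000  # km

# "quantile": assumed upper boxplot fence on the mean distances.
out_quantile <- mean_dist > quantile(mean_dist, 0.75) + mltpl * IQR(mean_dist)

# "mad": median plus mltpl times the median absolute deviation.
out_mad <- mean_dist > median(mean_dist) + mltpl * mad(mean_dist)

# "distance": the nearest other record is farther away than tdi.
out_distance <- min_dist > tdi
```

In this toy data set the eighth record sits thousands of kilometres from the cluster, so all three rules mark it as an outlier, while the “distance” rule leaves every clustered record untouched.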

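Value

Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.
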
Note

See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.

Examples


x <- data.frame(species = letters[1:10],
                decimalLongitude = runif(100, -180, 180),
                decimalLatitude = runif(100, -90,90))

cc_outl(x)
cc_outl(x, method = "quantile", value = "flagged")
cc_outl(x, method = "distance", value = "flagged", tdi = 10000)
cc_outl(x, method = "distance", value = "flagged", tdi = 1000)

Identify Non-terrestrial Coordinates

Description

Removes or flags coordinates outside the reference landmass. Can be used to restrict datasets to terrestrial taxa, or to exclude records from the open ocean, depending on the reference (see details). Records of terrestrial taxa are often found in the open ocean, mostly due to switched latitude and longitude.

Usage

cc_sea(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  ref = NULL,
  scale = 110,
  value = "clean",
  speedup = TRUE,
  verbose = TRUE,
  buffer = NULL
)

Arguments

`x`	data.frame. Containing geographical coordinates and species names.
`lon`	character string. The column with the longitude coordinates. Default = “decimalLongitude”.
`lat`	character string. The column with the latitude coordinates. Default = “decimalLatitude”.
`ref`	SpatVector (geometry: polygons). Providing the geographic gazetteer. Can be any SpatVector (geometry: polygons), but the structure must be identical to rnaturalearth::ne_download(scale = 110, type = 'land', category = 'physical', returnclass = 'sf'). Default = rnaturalearth::ne_download(scale = 110, type = 'land', category = 'physical', returnclass = 'sf').
`scale`	the scale of the default reference, as downloaded from natural earth. Must be one of 10, 50, 110. Lower numbers equal higher detail (10 corresponds to the 1:10m dataset). Default = 110.
`value`	character string. Defining the output value. See value.
`speedup`	logical. If TRUE, uses a heuristic to speed up the analysis for large data sets with many records per location.
`verbose`	logical. If TRUE reports the name of the test and the number of records flagged.
`buffer`	numeric. Units are in meters. If provided, a buffer is created around the sea polygon, or ref provided.

Details

In some cases flagging records close to the coastline is not advisable, because of the low precision of the reference dataset, minor GPS imprecision, or because a dataset might include coast or marshland species. If you only want to flag records in the open ocean, consider using a buffered landmass reference, e.g.: buffland.

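Value

Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.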

Note

See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.

Examples

x <- data.frame(species = letters[1:10], 
                decimalLongitude = runif(10, -30, 30), 
                decimalLatitude = runif(10, -30, 30))
                
cc_sea(x, value = "flagged")

Identify Records Inside Urban Areas

Description

Removes or flags records from inside urban areas, based on a geographic gazetteer. Often records from large databases span substantial time periods (centuries) and old records might represent habitats which today are replaced by city area.

Usage

cc_urb(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  ref = NULL,
  value = "clean",
  verbose = TRUE
)

Arguments

`x`	data.frame. Containing geographical coordinates and species names.
`lon`	character string. The column with the longitude coordinates. Default = “decimalLongitude”.
`lat`	character string. The column with the latitude coordinates. Default = “decimalLatitude”.
`ref`	a SpatVector. Providing the geographic gazetteer with the urban areas. See details. By default rnaturalearth::ne_download(scale = 'medium', type = 'urban_areas', returnclass = "sf"). Can be any `SpatVector`, but the structure must be identical to `rnaturalearth::ne_download()`.
`value`	character string. Defining the output value. See value.
`verbose`	logical. If TRUE reports the name of the test and the number of records flagged.

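Value

Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.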

Note

See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.

Examples


## Not run: 
x <- data.frame(species = letters[1:10],
                decimalLongitude = runif(100, -180, 180),
                decimalLatitude = runif(100, -90,90))

cc_urb(x)
cc_urb(x, value = "flagged")

## End(Not run)

Identify Invalid lat/lon Coordinates

Description

Removes or flags non-numeric and not available coordinates, as well as coordinates outside the valid range (lat > 90, lat < -90, lon > 180 or lon < -180).

Usage

cc_val(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  value = "clean",
  verbose = TRUE
)

Arguments

`x`	data.frame. Containing geographical coordinates and species names.
`lon`	character string. The column with the longitude coordinates. Default = “decimalLongitude”.
`lat`	character string. The column with the latitude coordinates. Default = “decimalLatitude”.
`value`	character string. Defining the output value. See value.
`verbose`	logical. If TRUE reports the name of the test and the number of records flagged.

Details

This test is obligatory before running any further tests of CoordinateCleaner, as additional tests only run with valid coordinates.
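
The checks listed in the description can be sketched in base R. This is an illustration only, not the code used by cc_val; the helper name is_valid_coord is hypothetical.

```r
# Hypothetical helper: TRUE = coordinate pair is usable, FALSE = invalid.
is_valid_coord <- function(lon, lat) {
  # Non-numeric entries such as "13W33'" or "67,09" become NA here
  lon_num <- suppressWarnings(as.numeric(as.character(lon)))
  lat_num <- suppressWarnings(as.numeric(as.character(lat)))
  !is.na(lon_num) & !is.na(lat_num) &
    lon_num >= -180 & lon_num <= 180 &
    lat_num >= -90 & lat_num <= 90
}

lon <- c("12.5", NA, "13W33'", "67,09", "305", "-75.2")
lat <- c(45, 10, 10, 10, 10, 95)

valid <- is_valid_coord(lon, lat)
```

Only the first pair passes: the others contain a missing value, a non-numeric string, a longitude above 180 or a latitude above 90.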

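Value

Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.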

Note

See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.

Examples


x <- data.frame(species = letters[1:10], 
                decimalLongitude = c(runif(106, -180, 180), NA, "13W33'", "67,09", 305), 
                decimalLatitude = runif(110, -90,90))
                
cc_val(x)
cc_val(x, value = "flagged")

Identify Zero Coordinates

Description

Removes or flags records with zero longitude or zero latitude, as well as records within a radius around the point at zero longitude and zero latitude. These problems are often due to erroneous data-entry or geo-referencing and can lead to typical patterns of high diversity around the equator.

Usage

cc_zero(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  buffer = 0.5,
  value = "clean",
  verbose = TRUE
)

Arguments

`x`	data.frame. Containing geographical coordinates and species names.
`lon`	character string. The column with the longitude coordinates. Default = “decimalLongitude”.
`lat`	character string. The column with the latitude coordinates. Default = “decimalLatitude”.
`buffer`	numerical. The buffer around the 0/0 point, where records should be flagged as problematic, in decimal degrees. Default = 0.5.
`value`	character string. Defining the output value. See value.
`verbose`	logical. If TRUE reports the name of the test and the number of records flagged.

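Value

Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.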

Note

See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.

Examples


x <- data.frame(species = "A", 
                decimalLongitude = c(0,34.84, 0, 33.98), 
                decimalLatitude = c(23.08, 0, 0, 15.98))
                
cc_zero(x)
cc_zero(x, value = "flagged")
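
The rule from the description can also be sketched in base R without the package. This is a simplified illustration, not the implementation of cc_zero: the helper name flag_zero is hypothetical, and treating the buffer as a circle measured in plain decimal degrees is an assumption.

```r
# Hypothetical helper: TRUE = record passes, FALSE = zero coordinate or
# within `buffer` decimal degrees of the point 0/0.
flag_zero <- function(lon, lat, buffer = 0.5) {
  on_zero_line <- lon == 0 | lat == 0
  # assumed: circular buffer measured in plain decimal degrees
  near_origin <- sqrt(lon^2 + lat^2) <= buffer
  !(on_zero_line | near_origin)
}

lon <- c(0, 34.84, 0, 33.98, 0.2)
lat <- c(23.08, 0, 0, 15.98, -0.3)

flags <- flag_zero(lon, lat)
```

Here only the fourth record passes. The last record has no zero coordinate but lies within the 0.5 degree buffer around 0/0.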

Identify Datasets with a Degree Conversion Error

Description

This test flags datasets where a significant fraction of records has been subject to a common degree minute to decimal degree conversion error, where the degree sign is recognized as decimal delimiter.

Usage

cd_ddmm(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  ds = "dataset",
  pvalue = 0.025,
  diff = 1,
  mat_size = 1000,
  min_span = 2,
  value = "clean",
  verbose = TRUE,
  diagnostic = FALSE
)

Arguments

`x`	data.frame. Containing geographical coordinates and species names.
`lon`	character string. The column with the longitude coordinates. Default = “decimalLongitude”.
`lat`	character string. The column with the latitude coordinates. Default = “decimalLatitude”.
`ds`	a character string. The column with the dataset of each record. In case `x` should be treated as a single dataset, identical for all records. Default = “dataset”.
`pvalue`	numeric. The p-value for the one-sided t-test to flag the test as passed or not. Both the pvalue and diff thresholds must be met. Default = 0.025.
`diff`	numeric. The threshold difference for the ddmm test. Indicates by which fraction the records with decimals below 0.6 must outnumber the records with decimals above 0.6. Default = 1.
`mat_size`	numeric. The size of the matrix for the binomial test. Must be changed in powers of ten (e.g. 100, 1000, 10000). Adapt to dataset size: generally 100 is better for datasets < 10000 records, 1000 is better for datasets with 10000 - 1M records. Higher values also work reasonably well for smaller datasets, therefore default = 1000. For large datasets try 10000.
`min_span`	numeric. The minimum geographic extent of datasets to be tested. Default = 2.
`value`	character string. Defining the output value. See value.
`verbose`	logical. If TRUE reports the name of the test and the number of records flagged.
`diagnostic`	logical. If TRUE plots the analysis matrix for each dataset.

Details

If the degree sign is recognized as decimal delimiter during coordinate conversion, no coordinate decimals above 0.59 (59') are possible. The test here uses a binomial test to test if a significant proportion of records in a dataset have been subject to this problem. The test is best adjusted via the diff argument: the lower diff, the stricter the test. The appropriate value also scales with dataset size. Empirically, for datasets with < 5,000 unique coordinate records diff = 0.1 has proven reasonable, flagging most datasets with >25% problematic records and all datasets with >50% problematic records. For datasets with between 5,000 and 100,000 geographically unique records diff = 0.01 is recommended, for datasets between 100,000 and 1 M records diff = 0.001, and so on.
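
The conversion error and the idea behind the test can be illustrated in base R. This is only a sketch: it does not reproduce the matrix-based procedure of cd_ddmm, the helper name decimal_part is hypothetical, and using 0.6 as the null probability (the share of decimals below 0.6 expected for uniformly distributed decimals) is an assumption.

```r
# 12 degrees 45 minutes: correct conversion versus the misread value
deg <- 12
minutes <- 45
correct <- deg + minutes / 60   # 12.75
misread <- deg + minutes / 100  # 12.45, degree sign read as decimal delimiter

# Hypothetical helper: decimal part of a coordinate, ignoring the sign
decimal_part <- function(x) abs(x) %% 1

# Simulated dataset in which every record carries the conversion error,
# so no decimal part can exceed 0.59
set.seed(1)
bad <- sample(0:89, 200, replace = TRUE) +
  sample(0:59, 200, replace = TRUE) / 100
n_low <- sum(decimal_part(bad) < 0.6)

# One-sided binomial test: are decimals below 0.6 over-represented?
# p = 0.6 is the assumed share under uniformly distributed decimals.
res <- binom.test(n_low, length(bad), p = 0.6, alternative = "greater")
```

All 200 simulated decimals fall below 0.6, so res$p.value is far below the default pvalue of 0.025. In a dataset without the error roughly 60% of the decimals would fall below 0.6 and the test would usually not be significant.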

Value

Depending on the ‘value’ argument, either a data.frame with summary statistics and flags for each dataset (“dataset”) or a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.

Note

See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.

Examples


clean <- data.frame(species = letters[1:10], 
                decimalLongitude = runif(100, -180, 180), 
                decimalLatitude = runif(100, -90,90),
                dataset = "FR")
                
cd_ddmm(x = clean, value = "flagged")

#problematic dataset
lon <- sample(0:180, size = 100, replace = TRUE) + runif(100, 0,0.59)
lat <- sample(0:90, size = 100, replace = TRUE) + runif(100, 0,0.59)

prob <-  data.frame(species = letters[1:10], 
                decimalLongitude = lon, 
                decimalLatitude = lat,
                dataset = "FR")
                
cd_ddmm(x = prob, value = "flagged")

Identify Datasets with Rasterized Coordinates

Description

Flags datasets with periodicity patterns indicative of a rasterized (lattice) collection scheme, as often obtained from e.g. atlas data. Uses a combination of autocorrelation and sliding-window outlier detection to identify periodicity patterns in the data. See https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13152 for further details and a description of the algorithm.

Usage

cd_round(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  ds = "dataset",
  T1 = 7,
  reg_out_thresh = 2,
  reg_dist_min = 0.1,
  reg_dist_max = 2,
  min_unique_ds_size = 4,
  graphs = TRUE,
  test = "both",
  value = "clean",
  verbose = TRUE
)

Arguments

`x`	data.frame. Containing geographical coordinates and species names.
`lon`	character string. The column with the longitude coordinates. Default = “decimalLongitude”.
`lat`	character string. The column with the latitude coordinates. Default = “decimalLatitude”.
`ds`	a character string. The column with the dataset of each record. In case `x` should be treated as a single dataset, identical for all records. Default = “dataset”.
`T1`	numeric. The threshold for outlier detection in an interquantile range based test. This is the major parameter to specify the sensitivity of the test: lower values equal a higher detection rate. Values between 7-11 are recommended. Default = 7.
`reg_out_thresh`	numeric. Threshold on the number of equal distances between outlier points. See details. Default = 2.
`reg_dist_min`	numeric. The minimum detection distance between outliers in degrees (the minimum resolution of grids that will be flagged). Default = 0.1.
`reg_dist_max`	numeric. The maximum detection distance between outliers in degrees (the maximum resolution of grids that will be flagged). Default = 2.
`min_unique_ds_size`	numeric. The minimum number of unique locations (values in the tested column) for datasets to be included in the test. Default = 4.
`graphs`	logical. If TRUE, diagnostic plots are produced. Default = TRUE.
`test`	character string. Indicates which column to test. Either “lat” for latitude, “lon” for longitude, or “both” for both. In the latter case datasets are only flagged if both tests are failed. Default = “both”.
`value`	character string. Defining the output value. See value.
`verbose`	logical. If TRUE reports the name of the test and the number of records flagged.

Value

Depending on the ‘value’ argument, either a data.frame with summary statistics and flags for each dataset (“dataset”) or a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.

Note

See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.

Examples


# simulate a bias grid of one degree resolution:
# 10% gridded records in a dataset of 1000 records

# longitude: 100 gridded (biased) values plus 900 non-biased values
bi <- sample(3 + 0:5, size = 100, replace = TRUE)
mu <- runif(3, 0, 15)
sig <- runif(3, 0.1, 5)
cl <- rnorm(n = 900, mean = mu, sd = sig)
lon <- c(cl, bi)

# latitude: 100 gridded (biased) values plus 900 non-biased values
bi <- sample(9:13, size = 100, replace = TRUE)
mu <- runif(3, 0, 15)
sig <- runif(3, 0.1, 5)
cl <- rnorm(n = 900, mean = mu, sd = sig)
lat <- c(cl, bi)

# combine biased and non-biased data into one dataset
inp <- data.frame(decimalLongitude = lon,
                  decimalLatitude = lat,
                  dataset = "test")

# run test
## Not run: 
cd_round(inp, value = "dataset")

## End(Not run)

Identify Fossils with Outlier Age

Description

Removes or flags records that are temporal outliers based on interquantile ranges.

Usage

cf_age(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  min_age = "min_ma",
  max_age = "max_ma",
  taxon = "accepted_name",
  method = "quantile",
  size_thresh = 7,
  mltpl = 5,
  replicates = 5,
  flag_thresh = 0.5,
  uniq_loc = FALSE,
  value = "clean",
  verbose = TRUE
)

Arguments

`x`	data.frame. Containing fossil records with taxon names, ages, and geographic coordinates.
`lon`	character string. The column with the longitude coordinates. To identify unique records if `uniq_loc = TRUE`. Default = “decimalLongitude”.
`lat`	character string. The column with the latitude coordinates. To identify unique records if `uniq_loc = TRUE`. Default = “decimalLatitude”.
`min_age`	character string. The column with the minimum age. Default = “min_ma”.
`max_age`	character string. The column with the maximum age. Default = “max_ma”.
`taxon`	character string. The column with the taxon name. If “”, searches for outliers over the entire dataset, otherwise per specified taxon. Default = “accepted_name”.
`method`	character string. Defining the method for outlier selection. See details. Either “quantile” or “mad”. Default = “quantile”.
`size_thresh`	numeric. The minimum number of records needed for a dataset to be tested. Default = 7.
`mltpl`	numeric. The multiplier of the interquartile range (`method == 'quantile'`) or median absolute deviation (`method == 'mad'`) to identify outliers. See details. Default = 5.
`replicates`	numeric. The number of replications for the distance matrix calculation. See details. Default = 5.
`flag_thresh`	numeric. The fraction of passed replicates necessary to pass the test. See details. Default = 0.5.
`uniq_loc`	logical. If TRUE only single records per location and time point (and taxon if `taxon` != "") are used for the outlier testing. Default = FALSE.
`value`	character string. Defining the output value. See value.
`verbose`	logical. If TRUE reports the name of the test and the number of records flagged.

Details

The outlier detection is based on an interquantile range test. A temporal distance matrix among all records is calculated based on a single point selected at random between the minimum and maximum age for each record. The mean distance for each point to all neighbours is calculated and then tested against the interquantile range; a record is flagged as an outlier if $x > q_{75} + IQR(x) \cdot mltpl$. The test is replicated ‘replicates’ times, to account for dating uncertainty. Records are flagged as outliers if they are flagged by a fraction of more than ‘flag_thresh’ replicates. Only datasets/taxa comprising more than ‘size_thresh’ records are tested. Distances are calculated as Euclidean distances.
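
The replicated test can be sketched in base R as follows. This is a minimal illustration rather than the code used by cf_age: the helper name age_outlier_once is hypothetical, and the bound is assumed to be the third quartile plus mltpl times the interquartile range. Note that the sketch marks outliers as TRUE, whereas the flags returned by the package use TRUE for records that pass.

```r
# Hypothetical helper: one replicate of the temporal outlier test.
# Returns TRUE where a record is an outlier in this replicate.
age_outlier_once <- function(min_age, max_age, mltpl = 5) {
  # draw one age per record at random between its minimum and maximum age
  age <- runif(length(min_age), min = min_age, max = max_age)
  # mean temporal distance of each record to all other records
  d <- as.matrix(dist(age))
  md <- unname(rowSums(d)) / (length(age) - 1)
  # assumed bound: third quartile plus mltpl times the interquartile range
  md > quantile(md, 0.75, names = FALSE) + IQR(md) * mltpl
}

set.seed(42)
# eleven records between 10 and 61 Ma and one record at 500 to 501 Ma
min_ma <- c(seq(10, 60, by = 5), 500)
max_ma <- min_ma + 1

replicates <- 5
flag_thresh <- 0.5
hits <- replicate(replicates, age_outlier_once(min_ma, max_ma))

# a record counts as an outlier if it is flagged in more than
# flag_thresh of the replicates
outlier <- rowMeans(hits) > flag_thresh
```

Only the last record is an outlier here. Because a new age is drawn in every replicate, records with wide age ranges can be flagged in some replicates and pass in others, which is what flag_thresh arbitrates.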

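Value

Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.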

Note

See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.

Examples


minages <- c(runif(n = 11, min = 10, max = 25), 62.5)
x <- data.frame(species = c(letters[1:10], rep("z", 2)),
                min_ma = minages,
                max_ma = c(minages[1:11] + runif(n = 11, min = 0, max = 5), 65))

cf_age(x, value = "flagged", taxon = "")

# unique locations only
x <- data.frame(species = c(letters[1:10], rep("z", 2)),
                decimalLongitude = c(runif(n = 10, min = 4, max = 16), 75, 7),
                decimalLatitude = c(runif(n = 12, min = -5, max = 5)),
                min_ma = minages, 
                max_ma = c(minages[1:11] + runif(n = 11, min = 0, max = 5), 65))

cf_age(x, value = "flagged", taxon = "", uniq_loc = TRUE)

Identify Fossils with equal min and max age

Description

Removes or flags records with equal minimum and maximum age.

Usage

cf_equal(
  x,
  min_age = "min_ma",
  max_age = "max_ma",
  value = "clean",
  verbose = TRUE
)

Arguments

`x`	data.frame. Containing fossil records with taxon names, ages, and geographic coordinates.
`min_age`	character string. The column with the minimum age. Default = “min_ma”.
`max_age`	character string. The column with the maximum age. Default = “max_ma”.
`value`	character string. Defining the output value. See value.
`verbose`	logical. If TRUE reports the name of the test and the number of records flagged.

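Value

Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.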

Note

See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.

Examples


minages <- runif(n = 10, min = 0.1, max = 25)
x <- data.frame(species = letters[1:10], 
                min_ma = minages, 
                max_ma = minages + runif(n = 10, min = 0, max = 10))
x <- rbind(x, data.frame(species = "z", 
                min_ma = 5, 
                max_ma = 5))
                
cf_equal(x, value = "flagged")

Identify Outlier Records in Space and Time

Description

Removes or flags records of fossils that are spatio-temporal outliers based on interquantile ranges. Records are flagged if they are either extreme in time or space, or both.

Usage

cf_outl(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  min_age = "min_ma",
  max_age = "max_ma",
  taxon = "accepted_name",
  method = "quantile",
  size_thresh = 7,
  mltpl = 5,
  replicates = 5,
  flag_thresh = 0.5,
  uniq_loc = FALSE,
  value = "clean",
  verbose = TRUE
)

Arguments

`x`	data.frame. Containing fossil records with taxon names, ages, and geographic coordinates.
`lon`	character string. The column with the longitude coordinates. To identify unique records if `uniq_loc = TRUE`. Default = “decimalLongitude”.
`lat`	character string. The column with the latitude coordinates. To identify unique records if `uniq_loc = TRUE`. Default = “decimalLatitude”.
`min_age`	character string. The column with the minimum age. Default = “min_ma”.
`max_age`	character string. The column with the maximum age. Default = “max_ma”.
`taxon`	character string. The column with the taxon name. If “”, searches for outliers over the entire dataset, otherwise per specified taxon. Default = “accepted_name”.
`method`	character string. Defining the method for outlier selection. See details. Either “quantile” or “mad”. Default = “quantile”.
`size_thresh`	numeric. The minimum number of records needed for a dataset to be tested. Default = 7.
`mltpl`	numeric. The multiplier of the interquartile range (`method == 'quantile'`) or median absolute deviation (`method == 'mad'`) to identify outliers. See details. Default = 5.
`replicates`	numeric. The number of replications for the distance matrix calculation. See details. Default = 5.
`flag_thresh`	numeric. The fraction of passed replicates necessary to pass the test. See details. Default = 0.5.
`uniq_loc`	logical. If TRUE only single records per location and time point (and taxon if `taxon` != "") are used for the outlier testing. Default = FALSE.
`value`	character string. Defining the output value. See value.
`verbose`	logical. If TRUE reports the name of the test and the number of records flagged.

Details

The outlier detection is based on an interquantile range test. In a first step a distance matrix of geographic distances among all records is calculated. Subsequently a similar distance matrix of temporal distances among all records is calculated based on a single point selected at random between the minimum and maximum age for each record. The mean distance for each point to all neighbours is calculated for both matrices, and spatial and temporal distances are scaled to the same range. The sum of these distances is then tested against the interquantile range and flagged as an outlier if $x > q_{75} + IQR(x) \cdot mltpl$. The test is replicated ‘replicates’ times, to account for temporal uncertainty. Records are flagged as outliers if they are flagged by a fraction of more than ‘flag_thresh’ replicates. Only datasets/taxa comprising more than ‘size_thresh’ records are tested. Note that geographic distances are calculated as geospheric distances for datasets (or taxa) with fewer than 10,000 records and approximated as Euclidean distances for datasets/taxa with 10,000 to 25,000 records. Datasets/taxa comprising more than 25,000 records are skipped.
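
The scaling and summing step can be sketched in base R. This is an illustration under stated assumptions, not the implementation of cf_outl: the helper names rescale01 and mean_dist are hypothetical, min-max scaling to the range 0 to 1 is assumed, Euclidean distance in degrees stands in for geographic distance, and fixed ages are used instead of ages drawn at random within each record's range.

```r
# Hypothetical helper: rescale a numeric vector to the range 0 to 1
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))

# Hypothetical helper: mean distance of each record to all other records
mean_dist <- function(m) {
  d <- as.matrix(dist(m))
  unname(rowSums(d)) / (nrow(d) - 1)
}

# Eleven records: record 10 is distant in space, record 11 distant in time
lon <- c(5, 6, 7, 8, 9, 10, 11, 12, 13, 75, 7)
lat <- c(0, 1, -1, 2, -2, 0, 1, -1, 2, 0, 0)
age <- c(10, 12, 14, 16, 18, 20, 22, 24, 11, 15, 65)

space <- rescale01(mean_dist(cbind(lon, lat)))
time <- rescale01(mean_dist(age))

# combined spatio-temporal score, to be tested against the
# interquantile range rule
combined <- space + time
```

Records 10 and 11 receive combined scores of at least 1, while all other records stay well below 0.5. Scaling both components to the same range means a record that is extreme in either space or time stands out in the sum.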

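Value

Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.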

Note

See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.

Examples


minages <- c(runif(n = 11, min = 10, max = 25), 62.5)
x <- data.frame(species = c(letters[1:10], rep("z", 2)),
                decimalLongitude = c(runif(n = 10, min = 4, max = 16), 75, 7),
                decimalLatitude = c(runif(n = 12, min = -5, max = 5)),
                min_ma = minages, 
                max_ma = c(minages[1:11] + runif(n = 11, min = 0, max = 5), 65))

cf_outl(x, value = "flagged", taxon = "")

Identify Fossils with Extreme Age Ranges

Description

Removes or flags records with an unexpectedly large temporal range, based on a quantile outlier test.

Usage

cf_range(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  min_age = "min_ma",
  max_age = "max_ma",
  taxon = "accepted_name",
  method = "quantile",
  mltpl = 5,
  size_thresh = 7,
  max_range = 500,
  uniq_loc = FALSE,
  value = "clean",
  verbose = TRUE
)

Arguments

`x`	data.frame. Containing fossil records with taxon names, ages, and geographic coordinates.
`lon`	character string. The column with the longitude coordinates. To identify unique records if `uniq_loc = TRUE`. Default = “decimalLongitude”.
`lat`	character string. The column with the latitude coordinates. To identify unique records if `uniq_loc = TRUE`. Default = “decimalLatitude”.
`min_age`	character string. The column with the minimum age. Default = “min_ma”.
`max_age`	character string. The column with the maximum age. Default = “max_ma”.
`taxon`	character string. The column with the taxon name. If “”, searches for outliers over the entire dataset, otherwise per specified taxon. Default = “accepted_name”.
`method`	character string. Defining the method for outlier selection. See details. Either “quantile” or “mad”. Default = “quantile”.
`mltpl`	numeric. The multiplier of the interquartile range (`method == 'quantile'`) or median absolute deviation (`method == 'mad'`) to identify outliers. See details. Default = 5.
`size_thresh`	numeric. The minimum number of records needed for a dataset to be tested. Default = 7.
`max_range`	numeric. An absolute maximum time interval between min age and max age. Only relevant for `method` = “time”. Default = 500.
`uniq_loc`	logical. If TRUE only single records per location and time point (and taxon if `taxon` != "") are used for the outlier testing. Default = FALSE.
`value`	character string. Defining the output value. See value.
`verbose`	logical. If TRUE reports the name of the test and the number of records flagged.

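Value

Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector (“flagged”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.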

Note

See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.

Examples


minages <- runif(n = 11, min = 0.1, max = 25)
x <- data.frame(species = c(letters[1:10], "z"),
                decimalLongitude = c(runif(n = 9, min = 4, max = 16), 75, 7),
                decimalLatitude = c(runif(n = 11, min = -5, max = 5)),
                min_ma = minages, 
                max_ma = minages + c(runif(n = 10, min = 0, max = 5), 25))

cf_range(x, value = "flagged", taxon = "")

Geographic Cleaning of Coordinates from Biologic Collections

Description

Cleaning geographic coordinates by multiple empirical tests to flag potentially erroneous coordinates, addressing issues common in biological collection databases.

Usage

clean_coordinates(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  species = "species",
  countries = NULL,
  tests = c("capitals", "centroids", "equal", "gbif", "institutions", "outliers", "seas",
    "zeros"),
  capitals_rad = 10000,
  centroids_rad = 1000,
  centroids_detail = "both",
  inst_rad = 100,
  outliers_method = "quantile",
  outliers_mtp = 5,
  outliers_td = 1000,
  outliers_size = 7,
  range_rad = 0,
  zeros_rad = 0.5,
  capitals_ref = NULL,
  centroids_ref = NULL,
  country_ref = NULL,
  country_refcol = "iso_a3",
  country_buffer = NULL,
  inst_ref = NULL,
  range_ref = NULL,
  seas_ref = NULL,
  seas_scale = 50,
  seas_buffer = NULL,
  urban_ref = NULL,
  aohi_rad = NULL,
  value = "spatialvalid",
  verbose = TRUE,
  report = FALSE
)

Arguments

`x`	data.frame. Containing geographical coordinates and species names.
`lon`	character string. The column with the longitude coordinates. Default = “decimalLongitude”.
`lat`	character string. The column with the latitude coordinates. Default = “decimalLatitude”.
`species`	a character string. The column with the species identity of each record. Default = “species”. If NULL, `tests` must not include the "outliers" or "duplicates" tests.
`countries`	a character string. The column with the country assignment of each record in three letter ISO code. Default = NULL. If NULL, the countries test is skipped.
`tests`	a vector of character strings, indicating which tests to run. See details for all tests available. Default = c("capitals", "centroids", "equal", "gbif", "institutions", "outliers", "seas", "zeros")
`capitals_rad`	numeric. The radius around capital coordinates in meters. Default = 10000.
`centroids_rad`	numeric. The radius around centroid coordinates in meters. Default = 1000.
`centroids_detail`	a `character string`. If set to ‘country’ only country (adm-0) centroids are tested, if set to ‘provinces’ only province (adm-1) centroids are tested. Default = ‘both’.
`inst_rad`	numeric. The radius around biodiversity institutions coordinates in metres. Default = 100.
`outliers_method`	The method used for outlier testing. See details.
`outliers_mtp`	numeric. The multiplier for the interquartile range of the outlier test. If NULL, `outliers_td` is used. Default = 5.
`outliers_td`	numeric. The minimum distance of a record to all other records of a species to be identified as outlier, in km. Default = 1000.
`outliers_size`	numerical. The minimum number of records in a dataset to run the taxon-specific outlier test. Default = 7.
`range_rad`	buffer around natural ranges. Default = 0.
`zeros_rad`	numeric. The radius around 0/0 in degrees. Default = 0.5.
`capitals_ref`	a `data.frame` with alternative reference data for the country capitals test. If missing, the `countryref` dataset is used. Alternatives must be identical in structure.
`centroids_ref`	a `data.frame` with alternative reference data for the centroid test. If NULL, the `countryref` dataset is used. Alternatives must be identical in structure.
`country_ref`	a `SpatVector` as alternative reference for the countries test. If NULL, the `rnaturalearth::ne_countries('medium', returnclass = "sf")` dataset is used.
`country_refcol`	the column name in the reference dataset containing the relevant ISO codes for matching. Default is "iso_a3_eh", which refers to the ISO-3 codes in the reference dataset. See notes.
`country_buffer`	numeric. Units are in meters. If provided, a buffer is created around each country polygon.
`inst_ref`	a `data.frame` with alternative reference data for the biodiversity institution test. If NULL, the `institutions` dataset is used. Alternatives must be identical in structure.
`range_ref`	a `SpatVector` of species natural ranges. Required to include the 'ranges' test. See `cc_iucn` for details.
`seas_ref`	a `SpatVector` as alternative reference for the seas test. If NULL, the rnaturalearth::ne_download(scale = 110, type = 'land', category = 'physical', returnclass = "sf") dataset is used.
`seas_scale`	The scale of the default landmass reference. Must be one of 10, 50, 110. Lower numbers give higher detail (10 is the most detailed, 110 the coarsest). Default = 50.
`seas_buffer`	numeric. Units are in meters. If provided, a buffer is created around sea polygon.
`urban_ref`	a `SpatVector` as alternative reference for the urban test. If NULL, the test is skipped. See details for reference gazetteers.
`aohi_rad`	numeric. The radius around aohi coordinates in meters. Default = 1000.
`value`	a character string defining the output value. See the value section for details. One of ‘spatialvalid’, ‘flagged’, ‘clean’. Default = ‘spatialvalid’.
`verbose`	logical. If TRUE reports the name of the test and the number of records flagged.
`report`	logical or character. If TRUE a report file is written to the working directory, summarizing the cleaning results. If a character, the path to which the file should be written. Default = FALSE.

Details

The function needs all coordinates to be formally valid according to WGS84. If the data contains invalid coordinates, the function will stop and return a vector flagging the invalid records. TRUE = non-problematic coordinate, FALSE = potentially problematic coordinates.
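
As a rough illustration of the validity requirement described above, the following base R sketch checks that coordinates are numeric, non-missing and within the global lon/lat extent. The helper name `is_valid_lonlat` is hypothetical and not part of CoordinateCleaner; the package's own validity test may differ in detail.

```r
# Hypothetical helper (not a CoordinateCleaner function).
# TRUE = formally valid coordinate pair,
# FALSE = non-numeric, missing or outside the lon/lat extent.
is_valid_lonlat <- function(lon, lat) {
  # Coerce to numeric; entries that cannot be parsed become NA
  lon <- suppressWarnings(as.numeric(lon))
  lat <- suppressWarnings(as.numeric(lat))
  !is.na(lon) & !is.na(lat) &
    lon >= -180 & lon <= 180 &
    lat >= -90 & lat <= 90
}

valid <- is_valid_lonlat(lon = c("12.5", "200", "abc", NA, "-70.1"),
                         lat = c("45.0", "10", "5", "3", "-95"))
print(valid)
```

Only the first record passes here: the others have an out-of-range longitude, a non-numeric longitude, a missing longitude and an out-of-range latitude, respectively.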

capitals tests a radius around adm-0 capitals. The radius is capitals_rad.
centroids tests a radius around country centroids. The radius is centroids_rad.
countries tests if coordinates are from the country indicated in the country column. Switched off by default.
duplicates tests for duplicate records. This checks for identical coordinates or if a species vector is provided for identical coordinates within a species. All but the first records are flagged as duplicates. Switched off by default.
equal tests for equal absolute longitude and latitude.
gbif tests a one-degree radius around the GBIF headquarters in Copenhagen, Denmark.
institutions tests a radius around known biodiversity institutions from the institutions dataset. The radius is inst_rad.
outliers tests each species for outlier records. Depending on the outliers_mtp and outliers_td arguments, either flags records that are a minimum distance away from all other records of this species (outliers_td) or records that are outside a multiple of the interquartile range of minimum distances to the next neighbour of this species (outliers_mtp). Three different methods are available for the outlier test, selected via outliers_method. If “quantile”, a boxplot method is used and records are flagged as outliers if their mean distance to all other records of the same species is larger than outliers_mtp * the interquartile range of the mean distance of all records of this species. If “mad”, the median absolute deviation is used; a record is flagged as outlier if the mean distance to all other records of the same species is larger than the median of the mean distance of all points plus/minus the mad of the mean distances of all records of the species * outliers_mtp. If “distance”, records are flagged as outliers if the minimum distance to the next record of the species is > outliers_td.
ranges tests if records fall within provided natural range polygons on a per species basis. See cc_iucn for details.
seas tests if coordinates fall into the ocean.
urban tests if coordinates are from urban areas. Switched off by default.
validity checks if coordinates correspond to a lat/lon coordinate reference system. This test is always on, since all records need to pass for any other test to run.
zeros tests for plain zeros, equal latitude and longitude and a radius around the point 0/0. The radius is zeros_rad.
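
To make the distance-based outlier rule from the list above concrete, here is a minimal base R sketch for a single species. It flags a record when its nearest neighbour is farther away than a threshold in km. The function names `haversine_km` and `flag_distance_outliers` are hypothetical, the haversine formula on a sphere of radius 6371 km is an assumption, and the package's own implementation may compute distances differently.

```r
# Great-circle distance in km from one point to a vector of points
# (haversine formula; assumes a spherical Earth with radius r).
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6371) {
  to_rad <- pi / 180
  dlon <- (lon2 - lon1) * to_rad
  dlat <- (lat2 - lat1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Hypothetical helper: TRUE = clean, FALSE = nearest neighbour farther
# away than td_km (needs at least two records).
flag_distance_outliers <- function(lon, lat, td_km = 1000) {
  nearest <- vapply(seq_along(lon), function(i) {
    min(haversine_km(lon[i], lat[i], lon[-i], lat[-i]))
  }, numeric(1))
  nearest <= td_km
}

lon <- c(10, 10.5, 11, 11.5, 75)
lat <- c(0, 0.5, 1, 0.2, 0)
flags <- flag_distance_outliers(lon, lat, td_km = 1000)
print(flags)
```

The first four records lie within roughly 100 km of each other and stay unflagged; the record at 75 degrees longitude is several thousand km from its nearest neighbour and is flagged.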

Value

Depending on the value argument:

“spatialvalid”: an object of class spatialvalid similar to x with one column added for each test. TRUE = clean coordinate entry, FALSE = potentially problematic coordinate entries. The .summary column is FALSE if any test flagged the respective coordinate.
“flagged”: a logical vector with the same order as the input data summarizing the results of all tests. TRUE = clean coordinate, FALSE = potentially problematic (= at least one test failed).
“clean”: a data.frame similar to x with potentially problematic records removed
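
The relationship between the three output types can be sketched in base R without running the package. The two per-test columns below are simplified stand-ins computed by hand, and the column names `.zer` and `.equ` are illustrative assumptions; only the logic (TRUE = clean, summary is FALSE if any test flags the record) follows the description above.

```r
x <- data.frame(species = c("a", "b", "c", "d"),
                decimalLongitude = c(12.3, 0, 45.1, 7.7),
                decimalLatitude = c(-4.2, 0, 45.1, 50.9))

# Two simplified per-test flag columns (TRUE = clean, FALSE = flagged)
flags <- data.frame(
  .zer = !(x$decimalLongitude == 0 & x$decimalLatitude == 0),
  .equ = abs(x$decimalLongitude) != abs(x$decimalLatitude)
)

# The summary is FALSE as soon as any single test flags the record
flags$.summary <- flags$.zer & flags$.equ

spatialvalid_like <- cbind(x, flags)  # analogue of value = "spatialvalid"
flagged <- flags$.summary             # analogue of value = "flagged"
clean <- x[flagged, ]                 # analogue of value = "clean"

print(spatialvalid_like)
```

Records "b" (0/0) and "c" (equal longitude and latitude) are flagged, so the cleaned data frame keeps only records "a" and "d".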

Note

Always tests for coordinate validity: non-numeric or missing coordinates and coordinates exceeding the global extent (lon/lat, WGS84). See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.

The country_refcol argument allows adapting the function to the structure of alternative reference datasets. For instance, for rnaturalearth::ne_countries(scale = "small", returnclass = "sf"), the default will fail, but country_refcol = "iso_a3" will work.

Examples



exmpl <- data.frame(species = sample(letters, size = 250, replace = TRUE),
                    decimalLongitude = runif(250, min = 42, max = 51),
                    decimalLatitude = runif(250, min = -26, max = -11))

test <- clean_coordinates(x = exmpl, 
                          tests = c("equal"))
                                    
## Not run: 
#run more tests
test <- clean_coordinates(x = exmpl, 
                          tests = c("capitals", 
                          "centroids","equal", 
                          "gbif", "institutions", 
                          "outliers", "seas", 
                          "zeros"))

## End(Not run)
                                 
                                    
summary(test)

Coordinate Cleaning using Dataset Properties

Description

Tests for problems associated with coordinate conversions and rounding, based on dataset properties. Includes tests to identify contributing datasets with potential errors in converting ddmm to dd.dd, and periodicity in the data decimals indicating rounding or a raster basis linked to low coordinate precision. Specifically:

ddmm tests for erroneous conversion from a degree minute format (ddmm) to a decimal degree (dd.dd) format.
periodicity tests for periodicity in the data, which can indicate imprecise coordinates due to rounding or rasterization.

Usage

clean_dataset(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  ds = "dataset",
  tests = c("ddmm", "periodicity"),
  value = "dataset",
  verbose = TRUE,
  ...
)

Arguments

`x`	data.frame. Containing geographical coordinates and species names.
`lon`	character string. The column with the longitude coordinates. Default = “decimalLongitude”.
`lat`	character string. The column with the latitude coordinates. Default = “decimalLatitude”.
`ds`	a character string. The column with the dataset of each record. In case `x` should be treated as a single dataset, identical for all records. Default = “dataset”.
`tests`	a vector of character strings, indicating which tests to run. See details for all tests available. Default = c("ddmm", "periodicity")
`value`	a character string. Defining the output value. See value. Default = “dataset”.
`verbose`	logical. If TRUE reports the name of the test and the number of records flagged.
`...`	additional arguments to be passed to `cd_ddmm` and `cd_round` to customize test sensitivity.

Details

These tests are based on the statistical distribution of coordinates and their decimals within datasets of geographic distribution records to identify datasets with potential errors/biases. Three potential error sources can be identified.

The ddmm flag tests for the particular pattern that emerges if geographical coordinates in a degree minute annotation are transferred into decimal degrees by simply replacing the degree symbol with the decimal point. This kind of problem has been observed in older datasets first recorded on paper using typewriters, where e.g. a floating point was used as symbol for degrees. The function uses a binomial test to check if more records than expected have decimals below 0.6 (which is the maximum that can be obtained in minutes, as one degree has 60 minutes) and if the number of these records is higher than those above 0.59 by a certain proportion.

The periodicity test uses rate estimation in a Poisson process to estimate if there is periodicity in the decimals of a dataset (as would be expected from, for example, rounding or data that was collected in a raster format) and if there is a disproportionate number of records with the decimal 0 (full degrees), which indicates rounding and thus low precision. The default values are empirically optimized with GBIF data, but should probably be adapted.
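
The binomial part of the ddmm idea described above can be sketched in base R. This is a simplified illustration, not the package's `cd_ddmm` implementation: the helper `ddmm_pvalue` is hypothetical, it only tests for an excess of decimals below 0.6 against the 60% expected by chance, and it leaves out the second criterion (the proportion relative to records above 0.59).

```r
# Hypothetical helper: one-sided binomial test for an excess of
# coordinate decimals below 0.6 (expected share under uniformity: 0.6).
ddmm_pvalue <- function(coords) {
  decimals <- abs(coords) %% 1
  n_below <- sum(decimals < 0.6)
  stats::binom.test(n_below, length(decimals), p = 0.6,
                    alternative = "greater")$p.value
}

# Evenly spread decimals: about 60% fall below 0.6, as expected by chance
even <- 10 + seq(0.005, 0.995, by = 0.01)
# Degrees and minutes written as dd.mm: decimals never exceed 0.59
ddmm <- 10 + (0:59) / 100

p_even <- ddmm_pvalue(even)
p_ddmm <- ddmm_pvalue(ddmm)
print(c(even = p_even, ddmm = p_ddmm))
```

A small p-value for a dataset, as for `ddmm` here, indicates more decimals below 0.6 than a uniform distribution of decimals would produce, which is the signature of a degree-minute value read as decimal degrees.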

Value

Depending on the ‘value’ argument:

“dataset”: a data.frame with the test summary statistics for each dataset in x
“clean”: a data.frame containing only records from datasets in x that passed the tests
“flagged”: a logical vector of the same length as rows in x, with TRUE = test passed and FALSE = test failed/potentially problematic.

Note

See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.

Examples

#Create test dataset
clean <- data.frame(dataset = rep("clean", 1000),
                    decimalLongitude = runif(min = -43, max = -40, n = 1000),
                    decimalLatitude = runif(min = -13, max = -10, n = 1000))
                    
bias.long <- c(round(runif(min = -42, max = -40, n = 500), 1),
               round(runif(min = -42, max = -40, n = 300), 0),
               runif(min = -42, max = -40, n = 200))
bias.lat <- c(round(runif(min = -12, max = -10, n = 500), 1),
              round(runif(min = -12, max = -10, n = 300), 0),
              runif(min = -12, max = -10, n = 200))
bias <- data.frame(dataset = rep("biased", 1000),
                   decimalLongitude = bias.long,
                   decimalLatitude = bias.lat)
test <- rbind(clean, bias)

## Not run:                   
#run clean_dataset
flags <- clean_dataset(test)

#check problems
#clean
hist(test[test$dataset == rownames(flags[flags$summary,]), "decimalLongitude"])
#biased
hist(test[test$dataset == rownames(flags[!flags$summary,]), "decimalLongitude"])


## End(Not run)

Geographic and Temporal Cleaning of Records from Fossil Collections

Description

Cleaning records by multiple empirical tests to flag potentially erroneous coordinates and time-spans, addressing issues common in fossil collection databases. Individual tests can be activated via the tests argument (see Details).

Usage

clean_fossils(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  min_age = "min_ma",
  max_age = "max_ma",
  taxon = "accepted_name",
  tests = c("agesequal", "centroids", "equal", "gbif", "institutions", "spatiotemp",
    "temprange", "validity", "zeros"),
  countries = NULL,
  centroids_rad = 0.05,
  centroids_detail = "both",
  inst_rad = 0.001,
  outliers_method = "quantile",
  outliers_threshold = 5,
  outliers_size = 7,
  outliers_replicates = 5,
  zeros_rad = 0.5,
  centroids_ref = NULL,
  country_ref = NULL,
  inst_ref = NULL,
  value = "spatialvalid",
  verbose = TRUE,
  report = FALSE
)

Arguments

`x`	data.frame. Containing fossil records with taxon names, ages, and geographic coordinates.
`lon`	character string. The column with the longitude coordinates. Default = “decimalLongitude”.
`lat`	character string. The column with the latitude coordinates. Default = “decimalLatitude”.
`min_age`	character string. The column with the minimum age. Default = “min_ma”.
`max_age`	character string. The column with the maximum age. Default = “max_ma”.
`taxon`	character string. The column with the taxon name. If “”, searches for outliers over the entire dataset, otherwise per specified taxon. Default = “accepted_name”.
`tests`	vector of character strings, indicating which tests to run. See details for all tests available. Default = c("agesequal", "centroids", "equal", "gbif", "institutions", "spatiotemp", "temprange", "validity", "zeros")
`countries`	a character string. The column with the country assignment of each record in three letter ISO code. Default = NULL. If NULL, the countries test is skipped.
`centroids_rad`	numeric. The radius around centroid coordinates in meters. Default = 1000.
`centroids_detail`	a `character string`. If set to ‘country’ only country (adm-0) centroids are tested, if set to ‘provinces’ only province (adm-1) centroids are tested. Default = ‘both’.
`inst_rad`	numeric. The radius around biodiversity institutions coordinates in metres. Default = 100.
`outliers_method`	The method used for outlier testing. See details.
`outliers_threshold`	numerical. The multiplier for the interquartile range for outlier detection. The higher the number, the more conservative the outlier tests. See `cf_outl` for details. Default = 5.
`outliers_size`	numerical. The minimum number of records in a dataset to run the taxon-specific outlier test. Default = 7.
`outliers_replicates`	numeric. The number of replications for the distance matrix calculation. See details. Default = 5.
`zeros_rad`	numeric. The radius around 0/0 in degrees. Default = 0.5.
`centroids_ref`	a `data.frame` with alternative reference data for the centroid test. If NULL, the `countryref` dataset is used. Alternatives must be identical in structure.
`country_ref`	a `SpatVector` as alternative reference for the countries test. If NULL, the `rnaturalearth::ne_countries('medium', returnclass = "sf")` dataset is used.
`inst_ref`	a `data.frame` with alternative reference data for the biodiversity institution test. If NULL, the `institutions` dataset is used. Alternatives must be identical in structure.
`value`	a character string defining the output value. See the value section for details. One of ‘spatialvalid’, ‘flagged’, ‘clean’. Default = ‘spatialvalid’.
`verbose`	logical. If TRUE reports the name of the test and the number of records flagged.
`report`	logical or character. If TRUE a report file is written to the working directory, summarizing the cleaning results. If a character, the path to which the file should be written. Default = FALSE.

Details

agesequal tests for equal minimum and maximum age.
centroids tests a radius around country centroids. The radius is centroids_rad.
countries tests if coordinates are from the country indicated in the country column. Switched off by default.
equal tests for equal absolute longitude and latitude.
gbif tests a one-degree radius around the GBIF headquarters in Copenhagen, Denmark.
institutions tests a radius around known biodiversity institutions from the institutions dataset. The radius is inst_rad.
spatiotemp tests for records that are outliers in time and space. See below for details.
temprange tests for records with unexpectedly large temporal ranges, using a quantile-based outlier test.
validity checks if coordinates correspond to a lat/lon coordinate reference system. This test is always on, since all records need to pass for any other test to run.
zeros tests for plain zeros, equal latitude and longitude and a radius around the point 0/0. The radius is zeros_rad.

The outlier detection in ‘spatiotemp’ is based on an interquartile range test. In a first step, a distance matrix of geographic distances among all records is calculated. Subsequently, a similar distance matrix of temporal distances among all records is calculated, based on a single point selected at random between the minimum and maximum age for each record. The mean distance of each point to all neighbours is calculated for both matrices, and spatial and temporal distances are scaled to the same range. The sum of these distances is then tested against the interquartile range and flagged as an outlier if $x > IQR(x) + q_{75} \cdot mltpl$. The test is replicated ‘replicates’ times to account for temporal uncertainty. Records are flagged as outliers if they are flagged by a fraction of more than ‘flag_thresh’ replicates. Only datasets/taxa comprising more than ‘size.thresh’ records are tested. Note that geographic distances are calculated as geospheric distances for datasets (or taxa) with fewer than 10,000 records and approximated as Euclidean distances for datasets/taxa with 10,000 to 25,000 records. Datasets/taxa comprising more than 25,000 records are skipped.
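
The steps of the spatio-temporal outlier test can be sketched in base R as follows. This is a simplified illustration under several assumptions, not the package's implementation: the function name `spatiotemp_sketch` is hypothetical, geographic distances are plain Euclidean distances in degree space, scaling is min-max to the range 0 to 1, the flagging threshold is taken literally from the formula in the text, and a `flag_thresh` of 0.5 is assumed.

```r
# Hypothetical sketch: TRUE = clean, FALSE = outlier in space and time.
spatiotemp_sketch <- function(lon, lat, min_age, max_age,
                              mltpl = 5, replicates = 5, flag_thresh = 0.5) {
  n <- length(lon)
  scale01 <- function(v) (v - min(v)) / (max(v) - min(v))
  # Mean distance of each record to all other records
  mean_dist <- function(m) rowSums(as.matrix(dist(m))) / (n - 1)

  geo <- scale01(mean_dist(cbind(lon, lat)))
  flagged <- replicate(replicates, {
    # One random age per record between its minimum and maximum age
    age <- runif(n, min = min_age, max = max_age)
    score <- geo + scale01(mean_dist(age))
    unname(score > IQR(score) + quantile(score, 0.75, names = FALSE) * mltpl)
  })
  # Flag records that are outliers in more than flag_thresh of the replicates
  rowMeans(flagged) <= flag_thresh
}

set.seed(42)
min_ma <- c(1:10, 100)
flags <- spatiotemp_sketch(lon = c(1:10, 100), lat = rep(0, 11),
                           min_age = min_ma, max_age = min_ma + 0.5)
print(flags)
```

In this toy data set, ten records form a tight cluster in space and time, while the last record is far away in both dimensions and is the only one flagged.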

Value

Depending on the value argument:

“spatialvalid”: an object of class spatialvalid similar to x with one column added for each test. TRUE = clean coordinate entry, FALSE = potentially problematic coordinate entries. The .summary column is FALSE if any test flagged the respective coordinate.
“flagged”: a logical vector with the same order as the input data summarizing the results of all tests. TRUE = clean coordinate, FALSE = potentially problematic (= at least one test failed).
“clean”: a data.frame similar to x with potentially problematic records removed

Note

Always tests for coordinate validity: non-numeric or missing coordinates and coordinates exceeding the global extent (lon/lat, WGS84).

See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.

Examples


minages <- runif(250, 0, 65)
exmpl <- data.frame(accepted_name = sample(letters, size = 250, replace = TRUE),
                    decimalLongitude = runif(250, min = 42, max = 51),
                    decimalLatitude = runif(250, min = -26, max = -11),
                    min_ma = minages,
                    max_ma = minages + runif(250, 0.1, 65))

test <- clean_fossils(x = exmpl)

summary(test)

Country Centroids and Country Capitals

Description

A data.frame with coordinates of country and province centroids and country capitals as reference for the clean_coordinates, cc_cen and cc_cap functions. Coordinates are based on the Central Intelligence Agency World Factbook https://www.cia.gov/the-world-factbook/, https://thematicmapping.org/downloads/world_borders.php and geolocate https://geo-locate.org.

Format

A data frame with 5,305 observations on 13 variables.

iso3: ISO-3 code for each country, in case of provinces also referring to the country.
iso2: ISO-2 code for each country, in case of provinces also referring to the country.
adm1_code: adm code for countries and provinces.
name: a factor; name of the country or province.
type: identifying if the entry refers to a country or province level.
centroid.lon: Longitude of the country centroid.
centroid.lat: Latitude of the country centroid.
capital: Name of the country capital, empty for provinces.
capital.lon: Longitude of the country capital.
capital.lat: Latitude of the country capital.
area_sqkm: The area of the country or province.
uncertaintyRadiusMeters: The uncertainty of the country centroid.
source: The data source. Currently only available for https://geo-locate.org

Source

CENTRAL INTELLIGENCE AGENCY (2014) The World Factbook, Washington, DC.

https://www.cia.gov/the-world-factbook/ https://thematicmapping.org/downloads/world_borders.php https://geo-locate.org

Examples


data(countryref)
head(countryref)

Global Locations of Biodiversity Institutions

Description

A global gazetteer for biodiversity institutions from various sources, including zoos, museums, botanical gardens, GBIF contributors, herbaria, university collections.

Format

A data frame with 12170 observations on 12 variables.

Source

Compiled from various sources:

Global Biodiversity Information Facility https://www.gbif.org/
Wikipedia https://www.wikipedia.org/
Geonames https://www.geonames.org/
The Global Registry of Biodiversity Repositories
Index Herbariorum https://sweetgum.nybg.org/science/ih/
Botanic Gardens Conservation International https://www.bgci.org/

Examples


data(institutions)
str(institutions)

Check spatialvalid object

Description

Tests if its argument is a spatialvalid object.

Usage

is.spatialvalid(x)

Arguments

`x`	the object to be tested

Value

Returns TRUE if its argument is a spatialvalid object.

Example data from the Paleobiology Database

Description

A dataset of 5000 flowering plant fossil occurrences as an example of data from the Paleobiology Database, downloaded using the paleobioDB package as specified in the vignette “Cleaning_PBDB_fossils_with_CoordinateCleaner”.

Format

A data frame with 5000 observations on 36 variables.

Source

The Paleobiology Database https://paleobiodb.org/
Sara Varela, Javier Gonzalez Hernandez and Luciano Fabris Sgarbi (2016). paleobioDB: Download and Process Data from the Paleobiology Database. R package version 0.5.0. https://CRAN.R-project.org/package=paleobioDB.

Examples


data(pbdb_example)
str(pbdb_example)

Plot Method for Class Spatialvalid

Description

A set of plots to explore objects of the class spatialvalid. A plot to visualize the flags from clean_coordinates.

Usage

## S3 method for class 'spatialvalid'
plot(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  bgmap = NULL,
  clean = TRUE,
  details = FALSE,
  pts_size = 1,
  font_size = 10,
  zoom_f = 0.1,
  ...
)

Arguments

`x`	an object of the class `spatialvalid` as from `clean_coordinates`.
`lon`	character string. The column with the longitude coordinates. Default = “decimalLongitude”.
`lat`	character string. The column with the latitude coordinates. Default = “decimalLatitude”.
`bgmap`	an object of the class `SpatVector` or `sf` used as background map. Default = ggplot2::borders()
`clean`	logical. If TRUE, non-flagged coordinates are included in the map.
`details`	logical. If TRUE, occurrences are color-coded by the type of flag.
`pts_size`	numeric. The point size for the plot.
`font_size`	numeric. The font size for the legend and axes
`zoom_f`	numeric. the fraction by which to expand the plotting area from the occurrence records. Increase, if countries do not show up on the background map.
`...`	arguments to be passed to methods.

Value

A plot of the records flagged as potentially erroneous by clean_coordinates.

Examples



exmpl <- data.frame(species = sample(letters, size = 250, replace = TRUE),
                    decimalLongitude = runif(250, min = 42, max = 51),
                    decimalLatitude = runif(250, min = -26, max = -11))

test <- clean_coordinates(exmpl, species = "species", 
                          tests = c("seas", "gbif", "zeros"),
                          verbose = FALSE)

summary(test)
plot(test)

Create Input Files for PyRate

Description

Creates the input necessary to run PyRate, based on a data.frame with fossil ages (as derived e.g. from clean_fossils) and a vector of the extinction status for each sample. Creates files in the working directory!

Usage

write_pyrate(
  x,
  status,
  fname,
  taxon = "accepted_name",
  min_age = "min_ma",
  max_age = "max_ma",
  trait = NULL,
  path = getwd(),
  replicates = 1,
  cutoff = NULL,
  random = TRUE
)

Arguments

`x`	data.frame. Containing fossil records with taxon names, ages, and geographic coordinates.
`status`	a vector of character strings of length `nrow(x)`. Indicating for each record “extinct” or “extant”.
`fname`	a character string. The prefix to use for the output files.
`taxon`	character string. The column with the taxon name. Default = “accepted_name”.
`min_age`	character string. The column with the minimum age. Default = “min_ma”.
`max_age`	character string. The column with the maximum age. Default = “max_ma”.
`trait`	a numeric vector of length `nrow(x)`. Indicating trait values for each record. Optional. Default = NULL.
`path`	a character string. The absolute path to write the output files to. Default is the working directory.
`replicates`	a numerical. The number of replicates for the randomized age generation. See details. Default = 1.
`cutoff`	a numerical. Specify a threshold to exclude fossil occurrences with a high temporal uncertainty, i.e. with a wide temporal range between min_age and max_age. Examples: cutoff = NULL (default; all occurrences are kept in the data set); cutoff = 5 (all occurrences with a temporal range of 5 Myr or higher are excluded from the data set).
`random`	logical. Specify whether to take a random age (between MinT and MaxT) for each occurrence or the midpoint age. Note that this option defaults to TRUE if several replicates are generated (i.e. replicates > 1). Examples: random = TRUE (default); random = FALSE (use midpoint ages).

Details

The replicate option allows the user to generate several replicates of the data set in a single input file, each time re-drawing the ages of the occurrences at random from uniform distributions with boundaries MinT and MaxT. The replicates can be analysed in different runs (see PyRate command -j) and combining the results of these replicates is a way to account for the uncertainty of the true ages of the fossil occurrences. Examples: replicates=1 (default, generates 1 data set), replicates=10 (generates 10 random replicates of the data set).
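
The age sampling behind the replicates, cutoff and random options can be sketched in base R. This is an illustration of the logic described above, not the code used by write_pyrate: the helper `draw_ages` is hypothetical and returns a matrix with one row per retained occurrence and one column per replicate instead of writing PyRate input files.

```r
# Hypothetical helper: draw occurrence ages for PyRate-style replicates.
draw_ages <- function(min_age, max_age, replicates = 1,
                      cutoff = NULL, random = TRUE) {
  # Drop occurrences whose temporal range is cutoff or larger
  if (!is.null(cutoff)) {
    keep <- (max_age - min_age) < cutoff
    min_age <- min_age[keep]
    max_age <- max_age[keep]
  }
  n <- length(min_age)
  # Several replicates only make sense with random ages
  if (replicates > 1) random <- TRUE
  if (random) {
    # Uniform draw between min and max age, one column per replicate
    matrix(runif(n * replicates, min = min_age, max = max_age), nrow = n)
  } else {
    matrix((min_age + max_age) / 2, nrow = n)
  }
}

min_ma <- c(1, 10, 20)
max_ma <- c(2, 18, 21)

mid <- draw_ages(min_ma, max_ma, random = FALSE)
set.seed(1)
reps <- draw_ages(min_ma, max_ma, replicates = 10, cutoff = 5)
print(mid)
print(dim(reps))
```

With cutoff = 5, the second occurrence (range of 8 Myr) is excluded, so `reps` has two rows and ten columns, and every drawn age lies within the age range of its occurrence.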

Value

PyRate input files in the working directory.

Note

See https://github.com/dsilvestro/PyRate/wiki for more details and tutorials on PyRate and PyRate input.

Examples


minages <- runif(250, 0, 65)
exmpl <- data.frame(accepted_name = sample(letters, size = 250, replace = TRUE),
                    lng = runif(250, min = 42, max = 51),
                    lat = runif(250, min = -26, max = -11),
                    min_ma = minages,
                    max_ma = minages + runif(250, 0.1, 65))

# a vector with the status for each record;
# make sure species are only classified as either extinct or extant,
# otherwise the function will throw an error

status <- sample(c("extinct", "extant"), size = nrow(exmpl), replace = TRUE)

#or from a list of species
status <- sample(c("extinct", "extant"), size = length(letters), replace = TRUE)
names(status) <- letters
status <- status[exmpl$accepted_name]

## Not run: 
write_pyrate(x = exmpl, fname = "test", status = status)

## End(Not run)
