Title: | A High-Performance Local Taxonomic Database Interface |
---|---|
Description: | Creates a local database of many commonly used taxonomic authorities and provides functions that can quickly query this data. |
Authors: | Carl Boettiger [aut, cre] , Kari Norman [aut] , Jorrit Poelen [aut] , Scott Chamberlain [aut] , Noam Ross [ctb] , Mattia Ghilardi [ctb] |
Maintainer: | Carl Boettiger <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.2.1.99 |
Built: | 2024-10-28 06:07:14 UTC |
Source: | https://github.com/ropensci/taxadb |
A utility to sanitize taxonomic names to increase probability of resolving names.
clean_names( names, fix_delim = TRUE, binomial_only = TRUE, remove_sp = TRUE, ascii_only = TRUE, lowercase = TRUE, remove_punc = FALSE )
Argument | Description |
---|---|
names | a character vector of taxonomic names (usually species names) |
fix_delim | Should we replace separators |
binomial_only | Attempt to prune name to a binomial name, e.g. Genus and species (specific epithet), e.g. |
remove_sp | Should we drop unspecified species epithet designations? e.g. |
ascii_only | should we coerce strings to ascii characters? (see |
lowercase | should names be coerced to lower-case to provide case-insensitive matching? |
remove_punc | replace all punctuation but apostrophes with a space, remove apostrophes |
Current implementation is limited to handling a few
common cases. Additional extensions may be added later.
A goal of the `clean_names` function is that any
modification rule of the name strings be precise, atomic, and
toggle-able, rather than relying on clever but more opaque rules and
arbitrary scores. This utility should always be used with care, as
indiscriminate modification of names may result in successful but inaccurate
name matching. A good pattern is to only apply this function to the subset
of names that cannot be directly matched.
clean_names(c("Homo sapiens sapiens", "Homo.sapiens", "Homo sp."))
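The idea of precise, atomic, toggle-able rules, applied only to names that fail a direct match, can be sketched in base R. This is an illustration only and not the `clean_names` implementation: `sketch_clean`, its regular expressions, and the `reference` vector are hypothetical.

```r
# Hypothetical helper, NOT the taxadb implementation: each rule is a
# separate, switchable step so its effect can be audited in isolation.
sketch_clean <- function(x,
                         fix_delim = TRUE,
                         remove_sp = TRUE,
                         binomial_only = TRUE,
                         lowercase = TRUE) {
  if (fix_delim) x <- gsub("[._]+", " ", x)          # separators become spaces
  x <- trimws(gsub("\\s+", " ", x))                  # collapse repeated whitespace
  if (remove_sp) x <- sub("\\s+(sp|spp)$", "", x, ignore.case = TRUE)
  if (binomial_only) x <- sub("^(\\S+\\s+\\S+).*$", "\\1", x)  # keep first two words
  if (lowercase) x <- tolower(x)
  x
}

cleaned <- sketch_clean(c("Homo sapiens sapiens", "Homo.sapiens", "Homo sp."))

# Recommended pattern: clean only the names that do not match directly.
reference <- c("homo sapiens", "pan troglodytes")   # toy reference names
input <- c("homo sapiens", "Pan.troglodytes", "Gorilla sp.")
unmatched <- !(input %in% reference)
input[unmatched] <- sketch_clean(input[unmatched])
```

Because every rule is a separate argument, a surprising match can be traced back by switching rules off one at a time.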
Match common names that contain the specified text
common_contains( name, provider = getOption("taxadb_default_provider", "itis"), version = latest_version(), db = td_connect(), ignore_case = TRUE )
Argument | Description |
---|---|
name | vector of names (scientific or common, see |
provider | from which provider should the hierarchy be returned? Default is 'itis', which can also be configured using |
version | Which version of the taxadb provider database should we use? defaults to latest. See tl_import for details. |
db | a connection to the taxadb database. See details. |
ignore_case | should we ignore case (capitalization) in matching names? Can be significantly slower to run. |
common_contains("monkey")
Match common names that start with the specified text
common_starts_with( name, provider = getOption("taxadb_default_provider", "itis"), version = latest_version(), db = td_connect(), ignore_case = TRUE )
Argument | Description |
---|---|
name | vector of names (scientific or common, see |
provider | from which provider should the hierarchy be returned? Default is 'itis', which can also be configured using |
version | Which version of the taxadb provider database should we use? defaults to latest. See tl_import for details. |
db | a connection to the taxadb database. See details. |
ignore_case | should we ignore case (capitalization) in matching names? Can be significantly slower to run. |
common_starts_with("monkey")
Creates a data frame with column name given by `by`, and values given by the vector `x`, and then uses this table to do a filtering join, joining on the `by` column to return all rows matching the `x` values (scientificNames, taxonIDs, etc).
filter_by( x, by, provider = getOption("taxadb_default_provider", "itis"), schema = c("dwc", "common"), version = latest_version(), collect = TRUE, db = td_connect(), ignore_case = FALSE )
Argument | Description |
---|---|
x | a vector of values to filter on |
by | a column name in the taxa_tbl (following Darwin Core Schema terms). The filtering join is executed with this column as the joining variable. |
provider | from which provider should the hierarchy be returned? Default is 'itis', which can also be configured using |
schema | One of "dwc" (for Darwin Core data) or "common" (for the Common names table.) |
version | Which version of the taxadb provider database should we use? defaults to latest. See tl_import for details. |
collect | logical, default |
db | a connection to the taxadb database. See details. |
ignore_case | should we ignore case (capitalization) in matching names? Can be significantly slower to run. |
a data.frame in the Darwin Core tabular format containing the matching taxonomic entities.
Other filter_by: filter_common(), filter_id(), filter_name(), filter_rank()
sp <- c("Trochalopteron henrici gucenense", "Trochalopteron elliotii")
filter_by(sp, "scientificName")
filter_by(c("ITIS:1077358", "ITIS:175089"), "taxonID")
filter_by("Aves", "class")
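The join described above can be illustrated on a toy table in base R. This is a sketch under stated assumptions, not the package's code: `sketch_filter_by` is a hypothetical helper, the `taxa` records are invented, and the package's own join may treat unmatched or duplicated inputs differently.

```r
# Toy stand-in for a provider table (values are illustrative, not real records).
taxa <- data.frame(
  taxonID = c("X:1", "X:2", "X:3"),
  scientificName = c("Aus bus", "Aus cus", "Dus eus"),
  class = c("Aves", "Aves", "Mammalia"),
  stringsAsFactors = FALSE
)

# Build a one-column table named after `by`, then join on that column so
# that only rows of `tbl` matching a value in `x` are returned.
sketch_filter_by <- function(x, by, tbl) {
  lookup <- data.frame(unique(x), stringsAsFactors = FALSE)
  names(lookup) <- by
  merge(lookup, tbl, by = by)
}

birds <- sketch_filter_by("Aves", "class", taxa)
by_id <- sketch_filter_by(c("X:3", "X:9"), "taxonID", taxa)  # "X:9" has no match
```

The same helper serves names, IDs, and ranks because only the joining column changes.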
Look up taxonomic information by common name
filter_common( name, provider = getOption("taxadb_default_provider", "itis"), version = latest_version(), collect = TRUE, ignore_case = TRUE, db = td_connect() )
Argument | Description |
---|---|
name | a character vector of common (vernacular English) names, e.g. "Humans" |
provider | from which provider should the hierarchy be returned? Default is 'itis', which can also be configured using |
version | Which version of the taxadb provider database should we use? defaults to latest. See tl_import for details. |
collect | logical, default |
ignore_case | should we ignore case (capitalization) in matching names? Can be significantly slower to run. |
db | a connection to the taxadb database. See details. |
a data.frame in the Darwin Core tabular format containing the matching taxonomic entities.
Other filter_by: filter_by(), filter_id(), filter_name(), filter_rank()
filter_common("Pied Tamarin")
Return a taxonomic table matching the requested ids
filter_id( id, provider = getOption("taxadb_default_provider", "itis"), type = c("taxonID", "acceptedNameUsageID"), version = latest_version(), collect = TRUE, db = td_connect() )
Argument | Description |
---|---|
id | taxonomic id, in prefix format |
provider | from which provider should the hierarchy be returned? Default is 'itis', which can also be configured using |
type | id type. Can be |
version | Which version of the taxadb provider database should we use? defaults to latest. See tl_import for details. |
collect | logical, default |
db | a connection to the taxadb database. See details. |
Use `type = "acceptedNameUsageID"` to return all rows for which this ID is the accepted ID, including both synonyms and accepted names (since all synonyms of a name share the same `acceptedNameUsageID`). Use `taxonID` (default) to return only those rows for which the scientific name corresponds to the taxonID.
Some providers (e.g. ITIS) assign taxonIDs to synonyms; most others only assign IDs to accepted names. In the latter case, requesting `taxonID` will only match accepted names, while requesting matches to the `acceptedNameUsageID` will also return any known synonyms.
See examples.
a data.frame with id and name of all matching species
Other filter_by: filter_by(), filter_common(), filter_name(), filter_rank()
filter_id(c("ITIS:1077358", "ITIS:175089"))
filter_id("ITIS:1077358", type="acceptedNameUsageID")
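The difference between the two ID types is easiest to see on a toy table. The sketch below uses base R only; the records are invented and mimic a provider that, like ITIS, gives synonyms their own taxonID.

```r
# Invented records: "Aus oldus" is a synonym of the accepted name "Aus bus".
taxa <- data.frame(
  taxonID = c("X:1", "X:2", "X:3"),
  scientificName = c("Aus bus", "Aus oldus", "Dus eus"),
  taxonomicStatus = c("accepted", "synonym", "accepted"),
  acceptedNameUsageID = c("X:1", "X:1", "X:3"),
  stringsAsFactors = FALSE
)

# Matching on taxonID returns only the row that owns the ID.
by_taxon <- taxa[taxa$taxonID == "X:1", ]

# Matching on acceptedNameUsageID also returns every synonym pointing at it.
by_accepted <- taxa[taxa$acceptedNameUsageID == "X:1", ]
```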
Look up taxonomic information by scientific name
filter_name( name, provider = getOption("taxadb_default_provider", "itis"), version = latest_version(), collect = TRUE, ignore_case = FALSE, db = td_connect() )
Argument | Description |
---|---|
name | a character vector of scientific names, e.g. "Homo sapiens" |
provider | from which provider should the hierarchy be returned? Default is 'itis', which can also be configured using |
version | Which version of the taxadb provider database should we use? defaults to latest. See tl_import for details. |
collect | logical, default |
ignore_case | should we ignore case (capitalization) in matching names? Can be significantly slower to run. |
db | a connection to the taxadb database. See details. |
Most but not all authorities can match against both species-level and higher-level (or lower, e.g. subspecies or variety) taxonomic names. The rank level is indicated by the `taxonRank` column.
Most authorities include both known synonyms and accepted names in the `scientificName` column (with the status indicated by `taxonomicStatus`). This is convenient, as users will typically not know if the names they have are synonyms or accepted names, but will want to get the match to the accepted name and accepted ID in either case.
a data.frame in the Darwin Core tabular format containing the matching taxonomic entities.
Other filter_by: filter_by(), filter_common(), filter_id(), filter_rank()
sp <- c("Trochalopteron henrici gucenense", "Trochalopteron elliotii")
filter_name(sp)
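A minimal base R sketch of why carrying both synonyms and accepted names in one column is convenient: whichever kind of name is supplied, the accepted ID on the matching row leads to the accepted name. The table and the helper `resolve_accepted` are hypothetical and only mirror the Darwin Core column names mentioned above.

```r
# Invented records: "Aus oldus" is a synonym of the accepted name "Aus bus".
taxa <- data.frame(
  taxonID = c("X:1", "X:2"),
  scientificName = c("Aus bus", "Aus oldus"),
  taxonomicStatus = c("accepted", "synonym"),
  acceptedNameUsageID = c("X:1", "X:1"),
  stringsAsFactors = FALSE
)

# Look up the input name, then follow acceptedNameUsageID to the accepted row.
resolve_accepted <- function(name, tbl) {
  hit <- tbl[tbl$scientificName == name, ]
  accepted <- tbl[match(hit$acceptedNameUsageID, tbl$taxonID), ]
  data.frame(
    input = rep(name, nrow(hit)),
    acceptedNameUsageID = hit$acceptedNameUsageID,
    acceptedName = accepted$scientificName,
    stringsAsFactors = FALSE
  )
}

from_synonym <- resolve_accepted("Aus oldus", taxa)
from_accepted <- resolve_accepted("Aus bus", taxa)
```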
Get all members (descendants) of a given rank level
filter_rank( name, rank, provider = getOption("taxadb_default_provider", "itis"), version = latest_version(), collect = TRUE, ignore_case = TRUE, db = td_connect() )
Argument | Description |
---|---|
name | taxonomic scientific name (e.g. "Aves") |
rank | taxonomic rank name. (e.g. "class") |
provider | from which provider should the hierarchy be returned? Default is 'itis', which can also be configured using |
version | Which version of the taxadb provider database should we use? defaults to latest. See tl_import for details. |
collect | logical, default |
ignore_case | should we ignore case (capitalization) in matching names? Can be significantly slower to run. |
db | a connection to the taxadb database. See details. |
a data.frame in the Darwin Core tabular format containing the matching taxonomic entities.
Other filter_by: filter_by(), filter_common(), filter_id(), filter_name()
filter_rank("Aves", "class")
Match names that start with or contain a specified text string
fuzzy_filter( name, by = c("scientificName", "vernacularName"), provider = getOption("taxadb_default_provider", "itis"), match = c("contains", "starts_with"), version = latest_version(), db = td_connect(), ignore_case = TRUE, collect = TRUE )
Argument | Description |
---|---|
name | vector of names (scientific or common, see |
by | a column name in the taxa_tbl (following Darwin Core Schema terms). The filtering join is executed with this column as the joining variable. |
provider | from which provider should the hierarchy be returned? Default is 'itis', which can also be configured using |
match | should we match by names starting with the term or containing the term anywhere in the name? |
version | Which version of the taxadb provider database should we use? defaults to latest. See tl_import for details. |
db | a connection to the taxadb database. See details. |
ignore_case | should we ignore case (capitalization) in matching names? Can be significantly slower to run. |
collect | logical, default |
Note that fuzzy filter will be fast with a single name or a small number of names, but will be slower if given a very large vector of names to match because, unlike other `filter_` commands, fuzzy matching requires separate SQL calls for each name. Fuzzy matches should all be confirmed manually in any event, e.g. not every common name containing "monkey" belongs to a primate species.
This method utilizes the database operation `%like%` to filter tables without loading into memory. Note that this does not support the use of regular expressions at this time.
## match any common name containing:
name <- c("woodpecker", "monkey")
fuzzy_filter(name, "vernacularName")
## match scientific name
fuzzy_filter("Chera", "scientificName", match = "starts_with")
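The two matching modes, and the one-lookup-per-term behaviour, can be mimicked in memory with base R. This sketch does not touch a database and is not how `fuzzy_filter` is implemented: `match_one` and the `common` vector are hypothetical, and literal (non-regex) matching is obtained here with `fixed = TRUE`.

```r
# Illustrative common names only.
common <- c("Spider Monkey", "Monkey Puzzle Tree", "Downy Woodpecker")
terms <- c("woodpecker", "monkey")

# Case-insensitive literal match of a single term against a vector of names.
match_one <- function(term, values, match = c("contains", "starts_with")) {
  match <- match.arg(match)
  v <- tolower(values)
  t <- tolower(term)
  hit <- if (match == "contains") grepl(t, v, fixed = TRUE) else startsWith(v, t)
  values[hit]
}

# One separate lookup per term, analogous to one query per name.
contains <- lapply(terms, match_one, values = common)
starts <- match_one("monkey", common, match = "starts_with")
```

In this toy vector "Monkey Puzzle Tree" matches "monkey" although it names a conifer, which is the kind of hit that manual confirmation is meant to catch.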
A drop-in replacement for [taxize::get_ids()]
get_ids( names, provider = getOption("taxadb_default_provider", "itis"), format = c("prefix", "bare", "uri"), version = latest_version(), taxadb_db = td_connect(), ignore_case = FALSE, warn = TRUE, db = NULL, ... )
Argument | Description |
---|---|
names | a list of scientific names (which may include higher-order ranks in most authorities). |
provider | abbreviation code for the provider. See details. |
format | Format for the returned identifier, one of |
version | Which version of the taxadb provider database should we use? defaults to latest. see |
taxadb_db | Connection to from |
ignore_case | should we ignore case (capitalization) in matching names? default is |
warn | should we display warnings on NAs resulting from multiply-resolved matches? (Unlike unmatched names, these NAs can usually be resolved manually via |
db | previous name for |
... | additional arguments (currently ignored) |
Note that some taxize authorities (`nbn`, `tropicos`, and `eol`) are not recognized by taxadb and will throw an error here. Meanwhile, taxadb recognizes several authorities not known to [taxize::get_ids()]. Both include `itis`, `ncbi`, `col`, and `gbif`.
Like all taxadb functions, this function will run fastest if a local copy of the provider is installed in advance using [td_create()].
a vector of IDs, of the same length as the input names. Any unmatched names or multiply-matched names will return as NAs. To resolve multi-matched names, use [filter_name()] instead to return a table with a separate row for each separate match of the input name.
filter_name
Other get: get_names()
get_ids("Midas bicolor")
get_ids(c("Midas bicolor", "Homo sapiens"), format = "prefix")
get_ids("Midas bicolor", format = "uri")
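The return contract described above (one ID per input name, with `NA` for both unmatched and multiply-matched names) can be sketched in base R. The helper `sketch_get_ids` and the records are hypothetical; this only illustrates the shape of the result, not the package's matching logic.

```r
# Invented records: "Dus eus" appears under two different IDs.
taxa <- data.frame(
  taxonID = c("X:1", "X:2", "X:3"),
  scientificName = c("Aus bus", "Dus eus", "Dus eus"),
  stringsAsFactors = FALSE
)

# Return exactly one ID per input name; anything ambiguous or missing is NA.
sketch_get_ids <- function(names, tbl) {
  vapply(names, function(n) {
    hits <- unique(tbl$taxonID[tbl$scientificName == n])
    if (length(hits) == 1) hits else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

ids <- sketch_get_ids(c("Aus bus", "Dus eus", "Zus nus"), taxa)

# A table view keeps one row per match, so the ambiguous name can be inspected.
ambiguous <- taxa[taxa$scientificName == "Dus eus", ]
```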
Translate identifiers into scientific names
get_names( id, provider = getOption("taxadb_default_provider", "itis"), version = latest_version(), format = c("guess", "prefix", "bare", "uri"), taxadb_db = td_connect(), db = NULL )
Argument | Description |
---|---|
id | a list of taxonomic identifiers. |
provider | abbreviation code for the provider. See details. |
version | Which version of the taxadb provider database should we use? defaults to latest. see |
format | Format for the returned identifier, one of |
taxadb_db | Connection to from |
db | previous name for |
Like all taxadb functions, this function will run fastest if a local copy of the provider is installed in advance using [td_create()].
a vector of names, of the same length as the input ids. Any unmatched IDs will return as NAs.
Other get: get_ids()
get_names(c("ITIS:1025094", "ITIS:1025103"), format = "prefix")
Return all taxa in which the scientific name contains the text provided
name_contains( name, provider = getOption("taxadb_default_provider", "itis"), version = latest_version(), db = td_connect(), ignore_case = TRUE )
Argument | Description |
---|---|
name | vector of names (scientific or common, see |
provider | from which provider should the hierarchy be returned? Default is 'itis', which can also be configured using |
version | Which version of the taxadb provider database should we use? defaults to latest. See tl_import for details. |
db | a connection to the taxadb database. See details. |
ignore_case | should we ignore case (capitalization) in matching names? Can be significantly slower to run. |
name_contains("Chera")
Return all taxa in which the scientific name starts with the text provided
name_starts_with( name, provider = getOption("taxadb_default_provider", "itis"), version = latest_version(), db = td_connect(), ignore_case = TRUE )
Argument | Description |
---|---|
name | vector of names (scientific or common, see |
provider | from which provider should the hierarchy be returned? Default is 'itis', which can also be configured using |
version | Which version of the taxadb provider database should we use? defaults to latest. See tl_import for details. |
db | a connection to the taxadb database. See details. |
ignore_case | should we ignore case (capitalization) in matching names? Can be significantly slower to run. |
name_starts_with("Chera")
Return a reference to a given table in the taxadb database
taxa_tbl( provider = getOption("taxadb_default_provider", "itis"), schema = c("dwc", "common"), version = latest_version(), db = td_connect() )
Argument | Description |
---|---|
provider | from which provider should the hierarchy be returned? Default is 'itis', which can also be configured using |
schema | One of "dwc" (for Darwin Core data) or "common" (for the Common names table.) |
version | Which version of the taxadb provider database should we use? defaults to latest. See tl_import for details. |
db | a connection to the taxadb database. See details. |
## default schema is the dwc table
taxa_tbl()
## common names table
taxa_tbl(schema = "common")
Show the taxadb directory
taxadb_dir()
NOTE: after upgrading `duckdb`, a user may need to delete any existing databases created with the previous version. An efficient way to do so is `unlink(taxadb::taxadb_dir(), TRUE)`.
## show the directory
taxadb_dir()
## Purge the local db
unlink(taxadb::taxadb_dir(), TRUE)
Connect to the taxadb database
td_connect(dbdir = NULL, driver = NULL, read_only = NULL)
Argument | Description |
---|---|
dbdir | Path to the database. no longer needed |
driver | deprecated, ignored. driver will always be duckdb. |
read_only | deprecated, driver is always read-only. |
This function provides a default database connection for `taxadb`. Note that you can use `taxadb` with any DBI-compatible database connection by passing the connection object directly to `taxadb` functions using the `db` argument. `td_connect()` exists only to provide reasonable automatic defaults based on what is available on your system.
For performance reasons, this function will also cache and restore the existing database connection, making repeated calls to `td_connect()` much faster and more failsafe than repeated calls to `DBI::dbConnect`.
Returns a DBI connection to the default duckdb database
## OPTIONAL: you can first set an alternative home location,
## such as a temporary directory:
Sys.setenv(TAXADB_HOME=file.path(tempdir(), "taxadb"))
## Connect to the database:
db <- td_connect()
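The cache-and-restore behaviour described above follows a common pattern that can be sketched in base R without a real database. Everything here is a stand-in: `expensive_open` plays the role of opening a connection, and `cached_connect` is a hypothetical wrapper, not the code behind `td_connect()`.

```r
# An environment acts as the cache that survives between calls.
cache <- new.env(parent = emptyenv())
open_count <- 0

# Stand-in for an expensive connection call; counts how often it runs.
expensive_open <- function() {
  open_count <<- open_count + 1
  list(id = open_count)   # placeholder for a real DBI connection object
}

# Reuse the cached object when present, otherwise open and remember it.
cached_connect <- function() {
  if (is.null(cache$con)) cache$con <- expensive_open()
  cache$con
}

a <- cached_connect()
b <- cached_connect()
```

A real implementation would presumably also need to check that the cached connection is still valid before returning it, which this sketch leaves out.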
Create a local taxonomic database
td_create( provider = getOption("taxadb_default_provider", "itis"), schema = c("dwc", "common"), version = latest_version(), overwrite = NULL, lines = NULL, dbdir = NULL, db = td_connect() )
Argument | Description |
---|---|
provider | a list (character vector) of provider(s) to be included in the database. By default, will install |
schema | One of "dwc" (for Darwin Core data) or "common" (for the Common names table.) |
version | Which version of the taxadb provider database should we use? defaults to latest. See tl_import for details. |
overwrite | Should we overwrite existing tables? Default is |
lines | number of lines that can be safely read in to memory at once. Leave at default or increase for faster importing if you have plenty of spare RAM. |
dbdir | a location on your computer where the database should be installed. Defaults to user data directory given by |
db | connection to a database. By default, taxadb will set up its own fast database connection. |
Authorities currently recognized by taxadb are:
itis: Integrated Taxonomic Information System, https://www.itis.gov
ncbi: National Center for Biotechnology Information, https://www.ncbi.nlm.nih.gov/taxonomy
col: Catalogue of Life, http://www.catalogueoflife.org/
gbif: Global Biodiversity Information Facility, https://www.gbif.org/
ott: OpenTree Taxonomy, https://github.com/OpenTreeOfLife/reference-taxonomy
iucn: IUCN Red List, https://iucnredlist.org
itis_test: a small subset of ITIS, cached locally with the package for testing purposes only
path where database has been installed (invisibly)
## Install the ITIS database
td_create()
## force re-install:
td_create(overwrite = TRUE)
Disconnect from the taxadb database.
td_disconnect(db = td_connect())
Argument | Description |
---|---|
db | database connection |
This function manually closes a connection to the `taxadb` database.
## Disconnect from the database:
td_disconnect()
Downloads the requested taxonomic data tables and returns a local path to the data in `tsv.gz` format. Downloads are cached and identified by content hash so that `tl_import` will not attempt to download the same file multiple times.
tl_import( provider = getOption("tl_default_provider", "itis"), schema = c("dwc", "common"), version = latest_version(), prov = prov_cache() )
Argument | Description |
---|---|
provider | from which provider should the hierarchy be returned? Default is 'itis', which can also be configured using |
schema | One of "dwc" (for Darwin Core data) or "common" (for the Common names table.) |
version | Which version of the taxadb provider database should we use? defaults to latest. See tl_import for details. |
prov | Address (URL) to provenance record |
`tl_import` parses a schema.org record to determine the correct version to download. If offline, `tl_import` will attempt to resolve against its own provenance cache. Users can also examine / parse the prov JSON-LD file directly to determine the provenance of the data products used.
path(s) to the downloaded files in the cache
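The way a content identifier prevents repeated downloads can be sketched in base R. This is an assumption-laden illustration, not the `tl_import` implementation: `fake_download` writes a small local file instead of fetching anything, `fetch_cached` is a hypothetical helper, and `"hash-abc123"` is a made-up identifier standing in for a hash taken from a provenance record.

```r
# A fresh temporary directory plays the role of the download cache.
cache_dir <- tempfile("sketch_cache_")
dir.create(cache_dir)
downloads <- 0

# Stand-in for a network download; counts how often it is invoked.
fake_download <- function(dest) {
  downloads <<- downloads + 1
  writeLines("taxonID\tscientificName", dest)
}

# Files are stored under their content identifier, so a second request for
# the same identifier finds the file and skips the download.
fetch_cached <- function(content_id) {
  dest <- file.path(cache_dir, content_id)
  if (!file.exists(dest)) fake_download(dest)
  dest
}

p1 <- fetch_cached("hash-abc123")
p2 <- fetch_cached("hash-abc123")
```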