dwctaxon has two major purposes: (1) editing and (2) validating taxonomic data in Darwin Core (DwC) format. This vignette is about the former. Although you could use dwctaxon to build a taxonomic database from scratch, you are more likely to use it to modify an existing database, so we will focus on that use case.
We start by loading packages needed for this vignette:
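A minimal setup, assuming dwctaxon is already installed and R is version 4.1 or later (needed for the native pipe `|>` used in the examples); whether any other packages are required is an assumption:

```r
# Load dwctaxon, which provides the dct_filmies dataset and the dct_*() functions
library(dwctaxon)
```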
dwctaxon comes with an example dataset, dct_filmies, which contains taxonomic data of filmy ferns (family Hymenophyllaceae). Let’s take a quick look at the data (you may need to scroll to the right of the frame with the code to see all the text):
dct_filmies
#> # A tibble: 2,451 × 5
#> taxonID acceptedNameUsageID taxonomicStatus taxonRank scientificName
#> <chr> <chr> <chr> <chr> <chr>
#> 1 54115096 <NA> accepted species Cephalomanes atrovirens Presl
#> 2 54133783 54115097 synonym species Trichomanes crassum Copel.
#> 3 54115097 <NA> accepted species Cephalomanes crassum (Copel.) M. G. Price
#> 4 54133784 54115098 synonym species Trichomanes densinervium Copel.
#> 5 54115098 <NA> accepted species Cephalomanes densinervium (Copel.) Copel.
#> 6 54133786 54115100 synonym species Cephalomanes curvatum (J. Sm.) V. D. Bosch
#> 7 54133787 54115100 synonym species Cephalomanes javanica (Bl.) V. D. Bosch
#> 8 54133788 54115100 synonym species Cephalomanes oblongifolium Presl
#> 9 54133789 54115100 synonym species Cephalomanes zollingeri V. D. Bosch
#> 10 54133790 54115100 synonym species Lacostea javanica (Bl.) Prantl
#> # ℹ 2,441 more rows
For demonstration purposes, we will just use the first five rows:
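One way to build this subset is with head(). This is a sketch that assumes the object is called filmies_small, the name used in the examples that follow:

```r
library(dwctaxon)

# Keep only the first five rows of the example dataset
filmies_small <- head(dct_filmies, 5)
filmies_small
```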
Although DwC taxon format includes a large number of terms (columns) [1], a typical database does not use all of them. dct_filmies only includes 5 columns. Their usage should be clear to most biologists, but two columns need more explanation. taxonID is a unique ID for each row (name), and acceptedNameUsageID is only provided for synonyms; it indicates the taxonID of the accepted name. For more information on DwC taxon format, see vignette("what-is-dwc").
The rest of the vignette will consist of modifying this dataset.
dct_add_row() is used to add rows. The simplest way to do this is by specifying the new values as vectors (vectors of length 1 are recycled):
filmies_small |>
  dct_add_row(
    scientificName = c("Homo sapiens", "Drosophila melanogaster"),
    taxonomicStatus = "accepted",
    taxonRank = "species"
  )
#> # A tibble: 7 × 6
#> taxonID acceptedNameUsageID taxonomicStatus taxonRank scientificName modified
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 54115096 <NA> accepted species Cephalomanes atrovirens Presl <NA>
#> 2 54133783 54115097 synonym species Trichomanes crassum Copel. <NA>
#> 3 54115097 <NA> accepted species Cephalomanes crassum (Copel.) M. G. Price <NA>
#> 4 54133784 54115098 synonym species Trichomanes densinervium Copel. <NA>
#> 5 54115098 <NA> accepted species Cephalomanes densinervium (Copel.) Copel. <NA>
#> 6 be111923a2780f4931ca7fa4ed09a1b9 <NA> accepted species Homo sapiens 2024-10-17 05:46…
#> 7 5942a5b04c33f577946e4ccfbe490800 <NA> accepted species Drosophila melanogaster 2024-10-17 05:46…
Notice that although we did not specify taxonID or modified, these columns are automatically filled by default [2]; they can be turned off by setting the fill_taxon_id and stamp_modified arguments to FALSE.
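As a sketch of switching both behaviors off (assuming filmies_small holds the first five rows of dct_filmies; the object name no_autofill is only illustrative), the call below should leave taxonID unfilled for the new row and add no modified column:

```r
library(dwctaxon)
filmies_small <- head(dct_filmies, 5)

# Add one row without generating a taxonID or a modification timestamp
no_autofill <- filmies_small |>
  dct_add_row(
    scientificName = "Homo sapiens",
    taxonomicStatus = "accepted",
    taxonRank = "species",
    fill_taxon_id = FALSE,
    stamp_modified = FALSE
  )

# Inspect the last row and the column names
tail(no_autofill, 1)
colnames(no_autofill)
```

Leaving taxonID empty is rarely what you want in a finished database, so this is mainly useful when IDs are assigned in a separate step.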
The names of the new values should be valid DwC terms. You can see the terms available with dct_terms:
dct_terms
#> # A tibble: 47 × 2
#> group term
#> * <chr> <chr>
#> 1 taxon taxonID
#> 2 taxon scientificNameID
#> 3 taxon acceptedNameUsageID
#> 4 taxon parentNameUsageID
#> 5 taxon originalNameUsageID
#> 6 taxon nameAccordingToID
#> 7 taxon namePublishedInID
#> 8 taxon taxonConceptID
#> 9 taxon scientificName
#> 10 taxon acceptedNameUsage
#> # ℹ 37 more rows
Adding rows with vectors as shown above works well if you only need to add a small number of rows. However, this could get unwieldy if you have a large number to add. In this case, you can instead add them via a dataframe.
The dataframe should have column names matching valid DwC taxon terms:
# Let's add some rows from the original dct_filmies
to_add <- tail(dct_filmies)
filmies_small |>
  dct_add_row(new_dat = to_add)
#> # A tibble: 11 × 6
#> taxonID acceptedNameUsageID taxonomicStatus taxonRank scientificName modified
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 54115096 <NA> accepted species Cephalomanes atrovirens Presl <NA>
#> 2 54133783 54115097 synonym species Trichomanes crassum Copel. <NA>
#> 3 54115097 <NA> accepted species Cephalomanes crassum (Copel.) M. G. Price <NA>
#> 4 54133784 54115098 synonym species Trichomanes densinervium Copel. <NA>
#> 5 54115098 <NA> accepted species Cephalomanes densinervium (Copel.) Copel. <NA>
#> 6 54126747 <NA> accepted infraspecies Hymenophyllum myriocarpum var. endiviifolium (Desv.) Stolze 2024-10-17 05:46:5…
#> 7 54135528 54126748 synonym species Hymenophyllum nigrescens Liebm. 2024-10-17 05:46:5…
#> 8 54135530 54126748 synonym species Mecodium nigricans (Presl ex Kl.) Copel. 2024-10-17 05:46:5…
#> 9 54135531 54126748 synonym species Sphaerocionium nigricans Presl ex Kl. 2024-10-17 05:46:5…
#> 10 54126748 <NA> accepted infraspecies Hymenophyllum myriocarpum var. nigrescens (Liebm.) Stolze 2024-10-17 05:46:5…
#> 11 54126749 <NA> accepted infraspecies Hymenophyllum trichophyllum var. buesii C. V. Morton 2024-10-17 05:46:5…
Note that in this case the taxonID already existed in the data to add, so it is not generated automatically.
dct_drop_row() drops one or more rows by taxonID or scientificName.
For example, we can exclude the row for Cephalomanes atrovirens Presl by using either its scientificName (Cephalomanes atrovirens Presl) or its taxonID (54115096):
filmies_small |>
  dct_drop_row(scientificName = "Cephalomanes atrovirens Presl")
#> # A tibble: 4 × 5
#> taxonID acceptedNameUsageID taxonomicStatus taxonRank scientificName
#> <chr> <chr> <chr> <chr> <chr>
#> 1 54133783 54115097 synonym species Trichomanes crassum Copel.
#> 2 54115097 <NA> accepted species Cephalomanes crassum (Copel.) M. G. Price
#> 3 54133784 54115098 synonym species Trichomanes densinervium Copel.
#> 4 54115098 <NA> accepted species Cephalomanes densinervium (Copel.) Copel.
filmies_small |>
  dct_drop_row(taxonID = "54115096")
#> # A tibble: 4 × 5
#> taxonID acceptedNameUsageID taxonomicStatus taxonRank scientificName
#> <chr> <chr> <chr> <chr> <chr>
#> 1 54133783 54115097 synonym species Trichomanes crassum Copel.
#> 2 54115097 <NA> accepted species Cephalomanes crassum (Copel.) M. G. Price
#> 3 54133784 54115098 synonym species Trichomanes densinervium Copel.
#> 4 54115098 <NA> accepted species Cephalomanes densinervium (Copel.) Copel.
Since it looks up values by taxonID or scientificName, dct_drop_row() requires these to be unique and non-missing in the taxonomic database.
Of course, since the taxonomic database is a dataframe, you could also use other subsetting techniques like brackets in base R or dplyr::filter() from the tidyverse to delete rows.
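Both alternatives can be sketched as follows, assuming filmies_small holds the first five rows of dct_filmies and that dplyr is installed (the object names dropped_base and dropped_dplyr are illustrative):

```r
library(dwctaxon)
filmies_small <- head(dct_filmies, 5)

# Base R: keep every row whose taxonID is not the one to delete
dropped_base <- filmies_small[filmies_small$taxonID != "54115096", ]

# dplyr: the same condition expressed with filter()
dropped_dplyr <- dplyr::filter(filmies_small, taxonID != "54115096")

# Both approaches should keep the same rows
identical(dropped_base$taxonID, dropped_dplyr$taxonID)
```

Unlike dct_drop_row(), these approaches do not check that taxonID is unique and non-missing, so that remains your responsibility.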
dct_modify_row() changes the values in an existing row.
Here, it is helpful to reiterate the purpose of the taxonID column: it is a unique identifier for each row (taxonomic name) in the data. So we will use taxonID to identify the row to change, then apply new values using other DwC terms.
# Change the status of Trichomanes crassum Copel. to "accepted"
filmies_small |>
  dct_modify_row(
    taxonID = "54133783", # taxonID of Trichomanes crassum Copel.
    taxonomicStatus = "accepted"
  )
#> # A tibble: 5 × 6
#> taxonID acceptedNameUsageID taxonomicStatus taxonRank scientificName modified
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 54115096 <NA> accepted species Cephalomanes atrovirens Presl <NA>
#> 2 54133783 <NA> accepted species Trichomanes crassum Copel. 2024-10-17 05:46:55.559106
#> 3 54115097 <NA> accepted species Cephalomanes crassum (Copel.) M. G. Price <NA>
#> 4 54133784 54115098 synonym species Trichomanes densinervium Copel. <NA>
#> 5 54115098 <NA> accepted species Cephalomanes densinervium (Copel.) Copel. <NA>
Notice there were some additional automatic changes besides just taxonomicStatus. Since the new status is "accepted", dwctaxon automatically set acceptedNameUsageID (which indicates the taxonID of the accepted name for synonyms) to NA. This behavior can be disabled by setting the clear_usage_id argument to FALSE. We see the modified field has been updated as well.
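For comparison, here is a sketch of the same change with clear_usage_id set to FALSE, assuming filmies_small holds the first five rows of dct_filmies (the object name kept_id is illustrative). The acceptedNameUsageID of the modified row should then be left as it was:

```r
library(dwctaxon)
filmies_small <- head(dct_filmies, 5)

# Change the status but keep the existing acceptedNameUsageID
kept_id <- filmies_small |>
  dct_modify_row(
    taxonID = "54133783", # taxonID of Trichomanes crassum Copel.
    taxonomicStatus = "accepted",
    clear_usage_id = FALSE
  )

# Show only the columns relevant to the comparison
kept_id[, c("taxonID", "acceptedNameUsageID", "taxonomicStatus")]
```

The modified row then combines an accepted status with an acceptedNameUsageID, which may need to be resolved in a later step.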
However, it can be difficult for humans to keep track of which taxonID matches which name; typically, we think in terms of species names, not ID numbers. For that reason, you can also use scientificName instead of taxonID to specify a row to modify [3].
# Change the status of Trichomanes crassum Copel. to "accepted"
filmies_small |>
  dct_modify_row(
    scientificName = "Trichomanes crassum Copel.",
    taxonomicStatus = "accepted"
  )
#> # A tibble: 5 × 6
#> taxonID acceptedNameUsageID taxonomicStatus taxonRank scientificName modified
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 54115096 <NA> accepted species Cephalomanes atrovirens Presl <NA>
#> 2 54133783 <NA> accepted species Trichomanes crassum Copel. 2024-10-17 05:46:55.600739
#> 3 54115097 <NA> accepted species Cephalomanes crassum (Copel.) M. G. Price <NA>
#> 4 54133784 54115098 synonym species Trichomanes densinervium Copel. <NA>
#> 5 54115098 <NA> accepted species Cephalomanes densinervium (Copel.) Copel. <NA>
If you provide both taxonID and scientificName, dwctaxon will identify the row with taxonID and apply scientificName as the new scientific name:
# Change the name of Trichomanes crassum Copel.
filmies_small |>
  dct_modify_row(
    taxonID = "54133783", # taxonID of Trichomanes crassum Copel.
    scientificName = "Bogus name"
  )
#> # A tibble: 5 × 6
#> taxonID acceptedNameUsageID taxonomicStatus taxonRank scientificName modified
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 54115096 <NA> accepted species Cephalomanes atrovirens Presl <NA>
#> 2 54133783 54115097 synonym species Bogus name 2024-10-17 05:46:55.633918
#> 3 54115097 <NA> accepted species Cephalomanes crassum (Copel.) M. G. Price <NA>
#> 4 54133784 54115098 synonym species Trichomanes densinervium Copel. <NA>
#> 5 54115098 <NA> accepted species Cephalomanes densinervium (Copel.) Copel. <NA>
Another convenient automated behavior of dwctaxon is the ability to “re-map” synonyms. That is, if a previously accepted name (say, “A”) is changed to be the synonym of another name (say, “B”), all synonyms of “A” are also changed to be synonyms of “B”. Let’s see how this works with the example data:
# Change C. densinervium to a synonym of C. crassum
filmies_small |>
  dct_modify_row(
    scientificName = "Cephalomanes densinervium (Copel.) Copel.",
    taxonomicStatus = "synonym",
    acceptedNameUsage = "Cephalomanes crassum (Copel.) M. G. Price"
  )
#> # A tibble: 5 × 6
#> taxonID acceptedNameUsageID taxonomicStatus taxonRank scientificName modified
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 54115096 <NA> accepted species Cephalomanes atrovirens Presl <NA>
#> 2 54133783 54115097 synonym species Trichomanes crassum Copel. <NA>
#> 3 54115097 <NA> accepted species Cephalomanes crassum (Copel.) M. G. Price <NA>
#> 4 54133784 54115097 synonym species Trichomanes densinervium Copel. 2024-10-17 05:46:55.683449
#> 5 54115098 54115097 synonym species Cephalomanes densinervium (Copel.) Copel. 2024-10-17 05:46:55.667922
Notice that two names were modified even though we only specified one: since Trichomanes densinervium Copel. was a synonym of Cephalomanes densinervium (Copel.) Copel., it also gets re-mapped to the accepted name Cephalomanes crassum (Copel.) M. G. Price.
As described in vignette("what-is-dwc"), there are several terms in DwC that I call “term - termID” pairs, e.g., acceptedNameUsage and acceptedNameUsageID, parentNameUsage and parentNameUsageID, etc. Typically, one is an actual scientific name (e.g., for acceptedNameUsage, the accepted name of a synonym), and one is the taxonID of that name (e.g., for acceptedNameUsageID, the taxonID of the accepted name of a synonym). It is up to the manager of the database to choose whether to use either or both of the terms in the pair.
This sort of data is redundant and could be prone to error if entered manually, so dwctaxon can do it for us with dct_fill_col(). The easiest way to see how this works is with an example (you may need to scroll to the right to see the new column):
# Fill-in the acceptedNameUsage column with scientific names
filmies_small |>
  dct_fill_col(
    fill_to = "acceptedNameUsage",
    fill_from = "scientificName",
    match_to = "taxonID",
    match_from = "acceptedNameUsageID"
  )
#> # A tibble: 5 × 7
#> taxonID acceptedNameUsageID taxonomicStatus taxonRank scientificName acceptedNameUsage modified
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 54115096 <NA> accepted species Cephalomanes atrovirens Presl <NA> 2024-10…
#> 2 54133783 54115097 synonym species Trichomanes crassum Copel. Cephalomanes crassum (Copel.) M… 2024-10…
#> 3 54115097 <NA> accepted species Cephalomanes crassum (Copel.) M. G. Price <NA> 2024-10…
#> 4 54133784 54115098 synonym species Trichomanes densinervium Copel. Cephalomanes densinervium (Cope… 2024-10…
#> 5 54115098 <NA> accepted species Cephalomanes densinervium (Copel.) Copel. <NA> 2024-10…
I think the meaning of the arguments fill_to and fill_from is fairly clear: we are filling the acceptedNameUsage column with values from scientificName.
match_to and match_from are a bit trickier; they describe how to find the data for filling. Here, we are looking up acceptedNameUsage by matching acceptedNameUsageID (match_from) to taxonID (match_to).
Like I said, it’s easiest to figure out dct_fill_col() by trying it yourself.
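If it helps, the lookup can also be pictured with base R's match(). This is only a conceptual sketch of the matching logic, not dwctaxon's actual implementation; it assumes filmies_small holds the first five rows of dct_filmies, and the names idx and accepted_name are illustrative:

```r
library(dwctaxon)
filmies_small <- head(dct_filmies, 5)

# For each row, find the position of the row whose taxonID (match_to)
# equals this row's acceptedNameUsageID (match_from); NA where there is none
idx <- match(filmies_small$acceptedNameUsageID, filmies_small$taxonID)

# Pull scientificName (fill_from) from the matched rows; these are the
# values that would go into acceptedNameUsage (fill_to)
accepted_name <- filmies_small$scientificName[idx]
accepted_name
```

Rows for accepted names have no acceptedNameUsageID, so they find no match and stay NA, which mirrors the output of dct_fill_col() shown above.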
[1] See dct_terms for a list.
[2] taxonID is filled with the md5 hash of the scientific name. By default, the hash is 32 characters long, so automatically generated values of taxonID should be unique if the scientific names are unique. This can be checked by running dct_validate().
[3] This only works if the scientific name is unique within the dataset.