---
title: "Validating DwC taxon data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Validating DwC taxon data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
# Increase width for printing tibbles
old <- options(width = 220)
```

dwctaxon has two major purposes: (1) editing and (2) validation of taxonomic data in [Darwin Core (DwC)](https://dwc.tdwg.org/terms/#taxon) format.

This vignette is about the latter.

## Setup

Start by loading packages and setting the random number generator seed, since this vignette involves some random samples.

```{r library, message = FALSE}
library(dwctaxon)
library(dplyr)

set.seed(12345)
```

## The data

As [before](https://docs.ropensci.org/dwctaxon/articles/editing.html#the-data), we will use the example dataset that comes with dwctaxon, `dct_filmies`:

```{r filmy-data}
dct_filmies
```

However, `dct_filmies` is already well-formatted and would pass all validation checks!

So let's introduce some noise to make things more interesting.

```{r filmy-data-mess}
filmies_dirty <-
  dct_filmies |>
  # Change taxonomic status of one row to 'good'
  dct_modify_row(taxonID = "54115096", taxonomicStatus = "good") |>
  # Duplicate some rows at the end
  bind_rows(tail(dct_filmies)) |>
  # Insert bad values for `acceptedNameUsageID` of 5 random rows
  rows_update(
    tibble(
      taxonID = sample(dct_filmies$taxonID, 5),
      acceptedNameUsageID = sample(letters, 5)
    ),
    by = "taxonID"
  )

filmies_dirty
```

The first few rows may look the same, but we know that these data now have some problems.

## Error on failure

`dct_validate()` is the workhorse function for validating DwC data.
In default mode, `dct_validate()` will issue an error the first time it finds something wrong with the data (in other words, on the first check that fails):

```{r validate-error, error = TRUE}
dct_validate(filmies_dirty)
```

```{r get-dups, echo = FALSE, warning = FALSE}
dup_taxid <-
  dct_validate(filmies_dirty, on_fail = "summary") |>
  filter(stringr::str_detect(error, "taxonID .* duplicated value")) |>
  pull(taxonID) |>
  knitr::combine_words()
```

dwctaxon tries to provide useful error messages that help you determine what in the data is causing the problem.

Here, we see that rows with `taxonID` `r dup_taxid` are duplicated.

Of course, here we know that is because we duplicated them on purpose; in a real dataset, you could use this information to search out the duplicated values and fix them.

## Summary on failure

If you are troubleshooting a DwC taxon dataset, it may be more useful to know about all of the problems at once instead of fixing them one at a time.

In that case, set the `on_fail` argument to `"summary"` (`on_fail` can be either its default value `"error"` or `"summary"`):

```{r validate-error-summary, error = TRUE}
dct_validate(filmies_dirty, on_fail = "summary")
```

(You may need to scroll to the right in the output above to see all the text.)

In this case, `dct_validate()` still issues a warning to let us know validation did not pass.

The `error` and `check` columns describe what went wrong; the other columns tell us where in the data to find the errors.

With this detailed summary, we should be able to hunt down the bugs in this dataset!

## Checks

You may be wondering why the summary output has separate "error" and "check" columns.

That is because `dct_validate()` conducts many smaller checks, each of which can be turned on or off.
For a complete description, run `?dct_validate()`.
In turn, each check can identify several different problems; the most granular description is given in the "error" column.
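
Because the summary is an ordinary data frame, it can be tallied like any other table. The following is a minimal sketch rather than part of the package workflow: it assumes the summary contains the `check` and `error` columns described above, and it builds its own small dirty dataset (`filmies_dup`, a name introduced here only for illustration) so that it runs on its own.

```r
library(dwctaxon)
library(dplyr)

# Illustrative data: duplicate the last few rows of the example dataset
filmies_dup <- bind_rows(dct_filmies, tail(dct_filmies))

# Collect all problems; the warning about failed validation is expected,
# so it is silenced here
problems <- suppressWarnings(
  dct_validate(filmies_dup, on_fail = "summary")
)

# Tally the problems by check and by the more granular error message
count(problems, check, error)
```

Grouping by both columns shows how a single check can report more than one kind of error.
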
Furthermore, each of the checks run by `dct_validate()` can also be run as an individual function.

For example, let's just check that all values of `acceptedNameUsageID` have a corresponding `taxonID` (in other words, that all synonyms map properly):

```{r check-tax-id, error = TRUE}
filmies_dirty |>
  dct_check_mapping()
```

It is important to note that not all checks are compatible with each other.

For example, `check_sci_name` checks that all scientific names (DwC term `scientificName`) are non-missing and unique; `check_status_diff` checks that in cases of *identical* scientific names, the taxonomic status of each name is different.
The default settings for `dct_validate()` are to use the former but not the latter.
Whether you expect all scientific names to be unique or not depends on how you set up your data^[According to the rules of taxonomic nomenclature, of course each full scientific name *should* be unique, but there [have been errors in the past](https://www.iapt-taxon.org/nomen/pages/main/art_31.html?zoom_highlight=identical) where the same author published the same name more than once!].

## Controlled vocabularies

Some DwC taxon terms are expected to take only a small number of values from a controlled vocabulary.

For example, `taxonomicStatus` (taxonomic status of a scientific name) may only be expected to include the values "accepted", "synonym", etc.
This is unlike, e.g., `scientificName`, where we would not try to control the range of possible values.

However, although DwC recommends using a controlled vocabulary for such terms, it does not specify the actual values!
So dwctaxon lets you set those yourself (and tries to employ reasonable defaults), as shown in the [next section](#changing-the-defaults).

## Changing the defaults

Say you want to use a different set of allowed values for `taxonomicStatus`.
Here, let's include "good" so that the data will pass the check for taxonomic status (remember [we modified the data](#the-data) so the `taxonomicStatus` of one of the rows was `"good"`).

One way would be to use the `valid_tax_status` argument of `dct_validate()` or `dct_check_tax_status()`:

```{r set-tax-status-manual}
filmies_dirty |>
  dct_check_tax_status(
    valid_tax_status = "good, accepted, synonym",
    on_success = "logical" # Issue "TRUE" if the check passes
  )
```

But specifying this argument every time you want to check something gets tedious.

So we can change the default setting for `valid_tax_status` with `dct_options()` like so:

```{r set-tax-status-default}
# First save the current settings before making any changes
old_settings <- dct_options()

# Change valid_tax_status setting
dct_options(valid_tax_status = "good, accepted, synonym")
```

Now we can run `dct_check_tax_status()` and it will use the new default value:

```{r set-tax-status-manual-2}
filmies_dirty |>
  dct_check_tax_status(on_success = "logical")
```

You can change back to the original default values with `reset = TRUE`:

```{r reset-defaults}
dct_options(reset = TRUE)
```

Now running the same code as above throws an error:

```{r set-tax-status-manual-3, error = TRUE}
filmies_dirty |>
  dct_check_tax_status(on_success = "logical")
```

A large number of settings can be modified.
See `?dct_options()` for a description of each.

You can view the current value of every option by running `dct_options()` with no arguments:

```{r dct-options-show}
dct_options()
```

Or check the value of one particular setting by extracting it by name with the `$` operator:

```{r dct-options-show-single}
dct_options()$valid_tax_status
```

We can restore the settings as they were before any of these changes were applied by running `do.call()` on the settings we saved above:

```{r dct-options-restore}
do.call(dct_options, old_settings)
```

```{r, include = FALSE}
# Reset options
options(old)
```
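
The save-and-restore pattern above can also be wrapped in a small helper so that a changed setting never outlives the code that needs it. This is only a sketch: `with_tax_status()` is a hypothetical function written for this example and is not part of dwctaxon. It relies on `dct_options()` returning the current settings as a list that `do.call()` can pass back, as shown above.

```r
library(dwctaxon)

# Hypothetical helper (not part of dwctaxon): run `code` with a temporary
# value of valid_tax_status, then restore the previous settings
with_tax_status <- function(valid_tax_status, code) {
  old <- dct_options()
  # Restore on exit, even if `code` throws an error
  on.exit(do.call(dct_options, old), add = TRUE)
  dct_options(valid_tax_status = valid_tax_status)
  # `code` is evaluated lazily, so it only runs here, after the change
  force(code)
}

# Illustrative data: one row with a non-default taxonomic status
one_good <- dct_modify_row(
  dct_filmies,
  taxonID = "54115096", taxonomicStatus = "good"
)

defaults_before <- dct_options()$valid_tax_status

result <- with_tax_status(
  "good, accepted, synonym",
  dct_check_tax_status(one_good, on_success = "logical")
)

# The setting is back to what it was before the call
identical(dct_options()$valid_tax_status, defaults_before)
```

Using `on.exit()` means the settings are restored even when a check stops with an error, which a manual `do.call()` at the end of a script would miss.
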