---
title: "Validating DwC taxon data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Validating DwC taxon data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

# Increase width for printing tibbles
old <- options(width = 220)
```

dwctaxon has two major purposes: (1) editing and (2) validation of taxonomic data in [Darwin Core (DwC)](https://dwc.tdwg.org/terms/#taxon) format. This vignette is about the latter.

## Setup

Start by loading packages and setting the random number generator seed since this vignette involves some random samples.

```{r library, message = FALSE}
library(dwctaxon)
library(dplyr)

set.seed(12345)
```

## The data

As [before](https://docs.ropensci.org/dwctaxon/articles/editing.html#the-data), we will use the example dataset that comes with dwctaxon, `dct_filmies`:

```{r filmy-data}
dct_filmies
```

However, `dct_filmies` is already well-formatted and would pass all validation checks! So let's introduce some noise to make things more interesting.

```{r filmy-data-mess}
filmies_dirty <-
  dct_filmies |>
  # Change taxonomic status of one row to 'good'
  dct_modify_row(taxonID = "54115096", taxonomicStatus = "good") |>
  # Duplicate some rows at the end
  bind_rows(tail(dct_filmies)) |>
  # Insert bad values for `acceptedNameUsageID` of 5 random rows
  rows_update(
    tibble(
      taxonID = sample(dct_filmies$taxonID, 5),
      acceptedNameUsageID = sample(letters, 5)
    ),
    by = "taxonID"
  )

filmies_dirty
```

The first few rows may look the same, but we know that these data now have some problems.

## Error on failure

`dct_validate()` is the workhorse function for validating DwC data.

In default mode, `dct_validate()` will issue an error the first time it finds something wrong with the data (in other words, on the first check that fails):

```{r validate-error, error = TRUE}
dct_validate(filmies_dirty)
```

```{r get-dups, echo = FALSE, warning = FALSE}
dup_taxid <- dct_validate(filmies_dirty, on_fail = "summary") |>
  filter(stringr::str_detect(error, "taxonID .* duplicated value")) |>
  pull(taxonID) |>
  knitr::combine_words()
```

dwctaxon tries to provide useful error messages that help you determine what in the data is causing the problem. Here, we see that rows with `taxonID` `r dup_taxid` are duplicated. Of course, we know that is because we duplicated them on purpose; in a real dataset, you could use this information to track down the duplicated values and fix them.
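
To follow up on a duplicate error like this one, the offending rows can be pulled out directly. The chunk below is a minimal sketch using base R on a small made-up table; `find_dup_ids()` and `toy_dups` are hypothetical names used for illustration and are not part of dwctaxon.

```{r sketch-find-dups}
# Hypothetical helper (not part of dwctaxon):
# return every row whose taxonID occurs more than once
find_dup_ids <- function(df) {
  # Flag both the first and the later occurrences of each repeated ID
  is_dup <- duplicated(df$taxonID) | duplicated(df$taxonID, fromLast = TRUE)
  df[is_dup, , drop = FALSE]
}

# Made-up example data: taxonID "3" appears twice
toy_dups <- data.frame(
  taxonID = c("1", "2", "3", "3"),
  scientificName = c("Aus bus", "Aus cus", "Aus dus", "Aus dus"),
  stringsAsFactors = FALSE
)

find_dup_ids(toy_dups)
```

The same helper should work on any data frame with a `taxonID` column, such as `filmies_dirty`.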

## Summary on failure

If you are troubleshooting a DwC taxon dataset, it may be more useful to know about all of the problems at once instead of fixing them one at a time. In that case, set the `on_fail` argument to `"summary"` (`on_fail` can be either its default value `"error"` or `"summary"`):

```{r validate-error-summary, error = TRUE}
dct_validate(filmies_dirty, on_fail = "summary")
```

(You may need to scroll to the right in the output above to see all the text).

In this case, `dct_validate()` still issues a warning to let us know validation did not pass. The `error` and `check` columns describe what went wrong; the other columns tell us where in the data to find the errors.

With this detailed summary, we should definitely be able to hunt down the bugs in this dataset!
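
Since the summary is returned as a data frame, it can be stored and explored with dplyr. As a sketch (relying on `filmies_dirty` and the packages loaded above; `validation_summary` is just a name chosen here), the following counts how many problems each check reported:

```{r summary-count-by-check}
# Store the summary; the warning is expected, so suppress it here
validation_summary <- suppressWarnings(
  dct_validate(filmies_dirty, on_fail = "summary")
)

# Number of rows in the summary per check, most frequent first
validation_summary |>
  count(check, sort = TRUE)
```

A tally like this can help decide which kind of problem to tackle first.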

## Checks

You may be wondering why the summary output has separate "error" and "check" columns.

That is because `dct_validate()` conducts many smaller checks, each of which can be turned on or off. For a complete description, run `?dct_validate()`. In turn, the checks can each identify different particular problems; the most granular description is given in the "error" column.

Furthermore, each of the checks run by `dct_validate()` can also be run as an individual function. For example, let's just check that all values of `acceptedNameUsageID` have a corresponding `taxonID` (in other words, that all synonyms map properly):

```{r check-tax-id, error = TRUE}
filmies_dirty |>
  dct_check_mapping()
```
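
The idea behind this kind of mapping check can be illustrated with a few lines of base R. This is only a sketch of the logic on made-up data (`toy_map` is a hypothetical table), not a description of how `dct_check_mapping()` is implemented:

```{r sketch-mapping-logic}
# Made-up example data: taxon "3" points to an ID that does not exist
toy_map <- data.frame(
  taxonID = c("1", "2", "3"),
  acceptedNameUsageID = c(NA, "1", "x"),
  taxonomicStatus = c("accepted", "synonym", "synonym"),
  stringsAsFactors = FALSE
)

# A mapping is broken when acceptedNameUsageID is non-missing
# but does not match any taxonID in the same table
ids <- toy_map$acceptedNameUsageID
unmapped <- !is.na(ids) & !(ids %in% toy_map$taxonID)

toy_map[unmapped, ]
```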

It is important to note that not all checks are compatible with each other. For example, `check_sci_name` checks that all scientific names (DwC term `scientificName`) are non-missing and unique; `check_status_diff` checks that in cases of *identical* scientific names, the taxonomic status of each name is different. The default settings for `dct_validate()` are to use the former but not the latter. Whether you expect all scientific names to be unique or not depends on how you set up your data^[According to the rules of taxonomic nomenclature, of course each full scientific name *should* be unique, but there [have been errors in the past](https://www.iapt-taxon.org/nomen/pages/main/art_31.html?zoom_highlight=identical) where the same author published the same name more than once!].

## Controlled vocabularies

Some DwC taxon terms are expected to take only a small number of values from a controlled vocabulary. For example, `taxonomicStatus` (taxonomic status of a scientific name) may only be expected to include the values "accepted", "synonym", etc. This is unlike, e.g., `scientificName`, where we would not try to control the range of possible values.

However, although DwC recommends using a controlled vocabulary for such terms, it does not specify the actual values! So dwctaxon lets you set those yourself (and tries to employ reasonable defaults), as shown in the 
[next section](#changing-the-defaults).
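
To make the idea concrete, here is a rough sketch of what checking values against a controlled vocabulary involves. It assumes the allowed values are supplied as a single comma-separated string and uses made-up status values; it illustrates the logic only and is not how dwctaxon implements the check.

```{r sketch-controlled-vocab}
# Allowed values given as one comma-separated string (an assumption for this sketch)
allowed_status <- trimws(
  strsplit("good, accepted, synonym", ",", fixed = TRUE)[[1]]
)

# Made-up observed values
observed_status <- c("accepted", "synonym", "good", "valid")

# Values that fall outside the controlled vocabulary
observed_status[!observed_status %in% allowed_status]
```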

## Changing the defaults

Say you want to use a different set of allowed values for `taxonomicStatus`. Here, let's include "good" so that the data will pass the check for taxonomic status (remember [we modified the data](#the-data) so the `taxonomicStatus` of one of the rows was `"good"`).

One way would be to use the `valid_tax_status` argument of `dct_validate()` or `dct_check_tax_status()`:

```{r set-tax-status-manual}
filmies_dirty |>
  dct_check_tax_status(
    valid_tax_status = "good, accepted, synonym",
    on_success = "logical" # Return TRUE if the check passes
  )
```

But specifying this argument every time you want to check something gets tedious.

So we can change the default setting for `valid_tax_status` with `dct_options()` like so:

```{r set-tax-status-default}
# First save the current settings before making any changes
old_settings <- dct_options()

# Change valid_tax_status setting
dct_options(valid_tax_status = "good, accepted, synonym")
```

Now we can run `dct_check_tax_status()` and it will use the new default value:

```{r set-tax-status-manual-2}
filmies_dirty |>
  dct_check_tax_status(on_success = "logical")
```

You can change back to the original default values with `reset = TRUE`:

```{r reset-defaults}
dct_options(reset = TRUE)
```

Now running the same code as above throws an error:

```{r set-tax-status-manual-3, error = TRUE}
filmies_dirty |>
  dct_check_tax_status(on_success = "logical")
```

Many settings can be modified. See `?dct_options()` for a description of each.

You can view the current values of all options (here, the defaults, since we just reset them) by running `dct_options()` with no arguments:

```{r dct-options-show}
dct_options()
```

Or check the value of one particular setting by extracting it by name with the `$` operator:

```{r dct-options-show-single}
dct_options()$valid_tax_status
```

We can restore the settings as they were before any of these changes were applied by running `do.call()` on the settings we saved above:

```{r dct-options-restore}
do.call(dct_options, old_settings)
```
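
If you only need different settings for a single call, the save-and-restore pattern above could be wrapped in a small helper. The following is a sketch under that assumption; `with_tax_status()` is a hypothetical function written for this vignette, not part of dwctaxon, and it relies on `filmies_dirty` and the packages loaded earlier.

```{r sketch-with-tax-status}
# Hypothetical helper (not part of dwctaxon):
# evaluate `code` with a temporary valid_tax_status, then restore the old settings
with_tax_status <- function(status, code) {
  saved <- dct_options()
  # Restore the saved settings when the function exits, even on error
  on.exit(do.call(dct_options, saved), add = TRUE)
  dct_options(valid_tax_status = status)
  # `code` is evaluated lazily, so it only runs here, after the setting has changed
  force(code)
}

with_tax_status(
  "good, accepted, synonym",
  dct_check_tax_status(filmies_dirty, on_success = "logical")
)

# The setting is back to its previous value afterwards
dct_options()$valid_tax_status
```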

```{r, include = FALSE}
# Reset options
options(old)
```