--- title: "Backends for taxadb" author: "Carl Boettiger, Kari Norman" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{intro} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` `taxadb` is designed to work with a variety of different "backends" -- software that works under the hood to store and retrieve the requested data. `taxadb` has an intelligent default method selector which will attempt to use the best method available on your system, which means you can use `taxadb` without having to worry about these details. However, to improve performance of `taxadb`, becoming familiar with these backends can yield significant improvements in performance. # RSQLite `RSQLite` is the default database backend if no suggested backend is detected. `RSQLite` has no external software dependencies and will be automatically installed with `taxadb` (it is a hard dependency as an imported rather than suggested package). The term `Lite` indicates that SQLite does not require the separate "server" and "client" software model found on traditional databases such as MySQL, and SQLite is widely used in consumer software everywhere. RSQLite packages SQLite for R. It enables persistent local storage for R applications but will be slower than the alternatives. For certain operations it can be significantly slower. # MonetDBLite & duckdb `MonetDBLite` is a modern alternative to `RSQLite`. `MonetDBLite` is both more powerful than SQLite (in supporting a greater array of operations), and can run much faster. Filtering joins in particular can be much faster even than the in-memory operations of `dplyr`. Because filtering joins lie at the heart of many `taxadb` functions this can yield substantial improvements in performance. Unfortunately, the R interface, `MonetDBLite` was removed from CRAN in April 2019. The package can still be installed from GitHub by running `devtools::install_github("hannesmuehleisen/MonetDBLite-R")`, though this requires the appropriate compilers. The developer plans to replace MonetDBLite with `duckdb`, (see ), but this is not yet feature complete and thus not yet fully compatible for `taxadb` use. Because installation is more difficult, `MonetDBLite` is not a required dependency, but will be used by default if `taxadb` detects an existing installation. `duckdb` support will be switched on as the first priority in the method waterfall. # in-memory `taxadb` can also be set to use in-memory only, without a backend. (Note that this is distinct from using `RSQlite` or `MonetDBLite` with over `in-memory` mode, because it uses only native R `data.frame`s to store data). This will tend to be faster that `RSQLite` but slower than `MonetDBLite` or `duckdb`. In this mode, data will persist over a single session but not between sessions (since memory is cleared when the user quits out of R). Note that many taxonomic tables are quite large when uncompressed, and users with less than 8-16GB of free RAM may find their machine becomes slow or unresponsive in this mode. # Manual control of the backend engine Users can override the automatic preferences of `taxadb` by setting the environmental variable `TAXADB_DRIVER`. For example, running `Sys.setenv(TAXADB_DRIVER="RSQLite")` will make `RSQLite` the default driver, even if `MonetDBLite` is installed. # Local storage The first time `taxadb` accesses a data source, it will download and store the full dataset from that provider. Users can trigger a download ahead of time by running `td_create()`, e.g. `td_create("fb")` will create a local copy of the FishBase taxonomy. If a user does not call `td_create()` first, `taxadb` simply downloads the data the first time that provider is queried -- e.g. `filter_name("Homo sapiens", "gibf")` will first download and install GBIF if that has not been done already. These download and install operations may be slow depending on your internet connection, but need be performed only once. Downloaded data is stored on your local harddisk and will persist between R sessions. The default location depends on the default set by your operating system (see the `rappdirs` package). Users can configure this location by setting the environmental variable `TAXADB_HOME`. For example, all unit tests in the package use temporary storage by setting `Sys.setenv(TAXADB_HOME=tempdir())`, which is cleared out after the R session ends. A user can install all available name providers up front with `td_create("all")`. An overview of the available scientific name providers is found in the providers vignette. # Other backends `taxadb` will work just as well with any `DBI`-compatible database backend (Postgres, MariaDB, etc). All `taxadb` functions take an argument `taxadb_db`, which is just a `DBI` connection used by `dplyr`. For example, we can create an in-memory RSQLite connection and use that to store data for a single session: ```r con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") taxadb::get_ids("Homo sapiens", taxadb_db = con) ``` Users can also call the `td_connect()` function to connect to `taxadb`'s default databases. Running `td_connect()` with no arguments will return the current default connection. This is a convenient way to confirm that your system is using the database engine you intended it to use. You can also use that connection to interact directly with the `taxadb` databases (e.g. using `dplyr` or `DBI` functions).