--- title: "Getting Started with `dittodb`" author: "Jonathan Keane" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting Started with `dittodb`} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} library(testthat) library(dittodb) knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) db_mock_paths("dittodb") ``` `dittodb` is designed to make it easy and fun to test functions that interact with a database. It works by looking for mock responses for each query you send while you run your tests and will seamlessly pretend that those mocks were provided by the database connection without needing a connection at all. ```{r setup db, include = FALSE, eval = FALSE} # setup the DB used in the rest of the vignette con <- dbConnect( RMariaDB::MariaDB(), dbname = "nycflights", host = "127.0.0.1", username = "travis", password = "" ) nycflights13_create_sql(con, schema = "nycflights13") # record interactions start_db_capturing() out <- mean_delays("day") out <- mean_delays("month") stop_db_capturing() ``` ## Our function `mean_delays()` To get started, imagine that we are working on a package that queries a database that consists of the [nycflights13 data](nycflights.html)[^1]. We have the following function which takes a column to aggregate by and returns a dataframe with that column and the mean delay for groups based on the values in the column name given. ## {.unlisted .unnumbered .tabset} ### RMariaDB ```{r mean_delays} library(DBI) mean_delays <- function(group_col) { con <- dbConnect( RMariaDB::MariaDB(), dbname = "nycflights" ) on.exit(dbDisconnect(con)) query <- glue::glue( "SELECT {group_col}, AVG(arr_delay) as mean_delay from nycflights13.flights ", "WHERE arr_delay > 0 GROUP BY {group_col}" ) return(dbGetQuery(con, query)) } ``` ### RPostgres ```{r mean_delays_rpostgres, eval = FALSE} library(DBI) mean_delays <- function(group_col) { con <- dbConnect( RPostgres::Postgres(), dbname = "nycflights" ) on.exit(dbDisconnect(con)) query <- glue::glue( "SELECT {group_col}, AVG(arr_delay) as mean_delay from nycflights13.flights ", "WHERE arr_delay > 0 GROUP BY {group_col}" ) return(dbGetQuery(con, query)) } ``` ### RSQLite ```{r mean_delays_rsqlite, eval = FALSE} library(DBI) mean_delays <- function(group_col) { con <- dbConnect( RSQLite::SQLite(), dbname = "nycflights" ) on.exit(dbDisconnect(con)) query <- glue::glue( "SELECT {group_col}, AVG(arr_delay) as mean_delay from nycflights13.flights ", "WHERE arr_delay > 0 GROUP BY {group_col}" ) return(dbGetQuery(con, query)) } ``` ## {.unlisted .unnumbered} If we give it the column `"month"`, we get the following dataframe: ```{r month, eval = FALSE} mean_delays("month") ``` ```{r cooking_show, echo = FALSE} with_mock_db(mean_delays("month")) ``` Great, now that we have our function we want to test it to make sure it is operating as expected. Normally, we could write something like: ```{r tests_1, eval = FALSE} library(testthat) test_that("mean_delays()", { out <- mean_delays("month") expect_named(out, c("month", "mean_delay")) expect_equal(dim(out), c(12, 2)) }) ``` And this works just fine if we only ever run your tests locally, but if we want to [run our tests with a Continuous Integration system](http://r-pkgs.had.co.nz/check.html#check) (and yes, we want to do that!), this won't work without first setting up our production database of flights. For our tests, we don't actually need to connect to the database and get new data (and, in fact, that would make some tests fail erroneously suddenly if the underlying changed). Instead, what we want is to take a snapshot of what happens when running the test code, and then be able to use that snapshot when we run tests later. These snapshots are frequently called fixtures (though you might hear people use other names like stubs or mocks). ## Recording fixtures We can record fixtures of the database interactions with the commands `start_db_capturing()`, run the functions we want to record, and then stop recording with `stop_db_capturing()`. ```{r recording, eval = FALSE} start_db_capturing() out <- mean_delays("month") stop_db_capturing() ``` This will write a new folder (by default in `./tests/testthat/`) with the name of the database (here: `nycflights`) and then write one file with the name `SELECT-e53189.R` which is the fixture for this example. This `SELECT-*` file contains the data that was received from the database for use in tests. ## `with_mock_db()` Now that we have a fixture, we can use that fixture by wrapping our call that includes a database interaction with the function `with_mock_db()`. This will look for fixtures and use those. ```{r with_mock_1} with_mock_db( mean_delays("month") ) ``` So, now we can write our tests like: ```{r tests_2} library(testthat) library(dittodb) with_mock_db( test_that("mean_delays()", { out <- mean_delays("month") expect_named(out, c("month", "mean_delay")) expect_equal(dim(out), c(12, 2)) }) ) ``` When placed inside of `with_mock_db(...)` a call to `mean_delays("month")` will return what we saved as our fixture _as if it had actually connected to the database_ without needing the database to be installed, reachable, operational, or to exist at all anywhere. If we wanted to test that a day-based aggregation works, we can! Although we will have to make a new fixture. First we would run the following interactively: ```{r recording_days, eval = FALSE} start_db_capturing() out <- mean_delays("day") stop_db_capturing() ``` This will create a new file (`SELECT-16d120.R`) which contains the response when aggregating by day. dittodb saves each database interaction with a hash of the query that is sent, so that a number of different responses from a database can be saved and the correct one will be used when called inside of `with_mock_db(...)`. So now, we could write our new test with: ```{r tests_day_2} with_mock_db( test_that("mean_delays()", { out <- mean_delays("day") expect_named(out, c("day", "mean_delay")) expect_equal(dim(out), c(31, 2)) }) ) ``` ## Getting setup to use `dittodb` Use the function `dittodb::use_dittodb()` to easily get started using `dittodb`. It will add `dittodb` to `Suggests` in the `DESCRIPTION` file and add `library(dittodb)` to `tests/testthat/helper.R`. ## Things to be careful about There are a few things to be careful about when using dittodb. ### When to call `dbConnect()` Always call `dbConnect()` inside of `with_mock_db(...)`. You can make as many calls as you want to the mock database inside of a `with_mock_db(...)`, but you should always make sure that you connect to the database inside of and not outside of `with_mock_db(...)`. This is because when you "connect" to the mock database, a few variables are set that tell dittodb where to look for mocks. It's less important (though still a good idea) to call `dbDisconnect()` inside of `with_mock_db()`. This is also true when recording fixtures with `start_db_recording()`, you should start the recording and then call `dbConnect()`. ### Query size Recording fixtures saves the whole query to disk in a relatively inefficient way (from a data storage perspective), so be careful with what you save. And you'll want to not save extremely large results if at all possible. This is also a best-practice for writing tests anyway: you should have mocks that are as minimal as possible to test the functionality you need to. Minimal mocks make it easier to change things that aren't relevant to the test (you don't have to change the way data is represented if it's not important to what you're testing) and it makes your tests run faster. ## Advanced uses There are a number of advanced features that might be useful. However they take a bit of configuration to use. ### Specify a new path You can control where mocks are read from (when you're using `with_mock_db(...)`) as well as where they are written to (when using `start_db_capturing()`). To do this, use the function `db_mock_paths()`. You can see what paths are being used by calling `db_mock_paths()` with no arguments. dittodb will look for mocks in each path starting with the first one. When recording mocks, dittodb always uses the first path that is returned by `db_mock_paths()`. You can add a new path by calling `db_mock_paths("new/path/here")` which will add the path provided to the top of the list of paths to use. ### Redacting Sometimes (much? most? of the time!) there is sensitive data in your database that you don't actually want to put into your test fixtures. `dittodb` allows you to specify columns that should always be redacted by specifying them like so: ``` start_db_capturing(redact_columns = c("sensitive_column", "other_sensitive_column")) ``` This will always redact the columns "sensitive_column" and "other_sensitive_column" every time a query is recorded that includes either. The redactor replaces every value in the column with a standard value (for example "[redacted]" for characters, `9` for numerics, `1988-10-11T17:00:00` for date times) see `redact_columns()` for more information. ### You, too, can write a fixture! When we use `start_db_recording()` to record fixtures, we are creating what some people call fixtures (though other terms for these abound). These are files that are used during testing to represent and provide some data or state necessary to execute the test. In the case of dittodb, these files contain the data that dittodb uses when it pretends to be a live database. During recording, each query that is sent to the database gets a unique identifier (the first 6 digits of the hash of the query) and when the response is received, that response is saved to a file with the first SQL (Structured Query Language) verb (e.g. `SELECT`), a dash, and the hash using the `dput()` function. This lets you craft a fixture that tests exactly what you need to without having extraneous rows or columns that might not be relevant. You can save our own responses for queries by getting figuring out the hash (the easiest way to do this now is to write the test that you want to create a fixture for, run it and see the error message that looks something like "Couldn't find the file nycflights/SELECT-16d120.R in any of the mock directories." and use the filename from there.) and then saving the dataframe that you want the test to use with the command `dput(df, file = "nycflights/SELECT-16d120.R", control = c("all", "hexNumeric"))` (if the dataframe you want to save is `df` and we are using the path we saw in the example error message). And you've created your own fixture! You can also take the approach of recording fixtures and then editing them manually to pare them down. The workflow for that would be something like: ```{r editing, eval = FALSE} # read in the recorded fixture df_fixt <- source("nycflights/SELECT-16d120.R", keep.source = FALSE)$value # filter out anything after february and all days after the 9th of the month df_fixt <- dplyr::filter(df_fixt, month <= 2 & day < 10) # save the fixture for use in tests dput(df_fixt, file = "nycflights/SELECT-16d120.R", control = c("all", "hexNumeric")) ``` [^1]: These code snippets won't work right out of the box — you will need to make sure the database you're connecting to has the [nycflights13 data](nycflights.html) data in it. You could do this with SQLite by running `nycflights13_create_sqlite(location = "nycflights")` before the rest of the commands here.