Provenance & Reproducibility Using the rdataretriever

Datasets that are regularly updated are increasingly common (Yenni et al. 2019). This presents two challenges for reproducibility. First, if the underlying structure of the dataset changes, previously written code for processing the data will often cease to run properly. Second, if the version of the data used in a particular analysis isn’t archived, the original analysis will be difficult to reproduce once the data changes.

The retriever and rdataretriever address both of these challenges. The centrally maintained scripts for processing datasets are updated when a dataset changes structure, so as long as rdataretriever::get_updates() is run before installing the dataset, the code for downloading, cleaning, and installing the data will continue to work. These regularly updated data processing recipes keep analysis code running against the current data, but reproducibility also requires being able to rerun the exact data processing steps on the exact data used for the original analysis. The rdataretriever has built-in provenance functionality to support this.
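The order of operations described above can be sketched as follows. This is a minimal sketch, assuming the Python retriever that rdataretriever wraps is installed and a network connection is available; the SQLite backend and the portal-dev dataset are chosen here only for illustration.

```r
# Refresh the centrally maintained dataset scripts first, so that any
# change in the structure of a dataset is reflected in its processing recipe.
rdataretriever::get_updates()

# Then download, clean, and install the dataset with the refreshed recipe.
# Called with only the dataset name, so backend defaults are used.
rdataretriever::install_sqlite('portal-dev')
```

Running get_updates() after the install instead would leave the installed data processed by the older recipe.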

To store the data and processing script in their current state, we use the commit() function, which saves both components in a zip file for future reuse. This is logically similar to a git commit: the state of the data and the processing script at a moment in time is stored and identified by a hash.
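To see why a hash is a useful label for a state, consider the base R sketch below. It is an illustration only and makes no claim about how rdataretriever computes its own commit hash: the temporary CSV file, its contents, and the use of MD5 via tools::md5sum() are all assumptions made for the example.

```r
# Illustration only: NOT the hashing scheme used by rdataretriever.
# A hypothetical small data file stands in for a downloaded dataset.
f <- tempfile(fileext = ".csv")
writeLines(c("id,weight", "1,42"), f)

# Fingerprint the file contents in their current state.
hash_before <- unname(tools::md5sum(f))

# Simulate an upstream update that appends a record.
writeLines(c("id,weight", "1,42", "2,37"), f)
hash_after <- unname(tools::md5sum(f))

# The two states have different fingerprints, so a stored hash
# identifies which state of the data an analysis was run against.
states_match <- identical(hash_before, hash_after)
print(states_match)
```

Because the fingerprint depends only on the content, two archives with the same hash can be treated as the same state regardless of when they were created.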

For example, the portal-dev dataset is updated weekly. If we want to be able to rerun our original analysis after the reviews for a paper come back, we’ll need to store that version of the data.

rdataretriever::commit('portal-dev', commit_message = 'Archive Portal data processing for initial submission on 2020-02-26', path = '.')

When we want to reanalyze this exact state of the dataset, we can load it back into SQLite (or any of the other backends) by passing the committed zip file, whose name includes the hash associated with the commit, to the install function.

rdataretriever::install_sqlite("portal-dev-326d87.zip")

References

Yenni, Glenda M, Erica M Christensen, Ellen K Bledsoe, Sarah R Supp, Renata M Diaz, Ethan P White, and SK Morgan Ernest. 2019. “Developing a Modern Data Workflow for Regularly Updated Data.” PLoS Biology 17 (1): e3000125.