---
title: "Automating File Import"
author: "Thomas Klebel"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Automating File Import}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r}
library(jstor)
library(purrr)
library(stringr)
```


# Intro
The `find_*` functions from `jstor` all work on a single file. Data from DfR
however contains many single files, from up to 25,000 when using the self-service
functions, up to several hundreds of thousands of files when requesting a large
dataset via an agreement.

Currently `jstor` offers two implementations to import many files: 
`jst_import_zip()` and `jst_import()`. The first one lets you import data directly
from zip archives, the second works for file paths, so you need to unzip
the archive first. I will first introduce `jst_import_zip()` and discuss the 
approach of with `jst_import()` afterwards.

# Importing data directly from zip-archives
Unpacking and working with many files directly is unpractical for at least three
reasons: 

1. If you unzip the archive, the single files will occupy a lot more disk space
than the single archive.
2. Before you can import the files, you need to list them  via `list.files` or
`system("find...")` on UNIX. Depending on the size of your data, this can take
some time.
3. There might be different types of data in your sample (journal articles, book
chapters, etc.). You need to manage matching the paths to the appropriate 
functions, which is extra work.

Importing directly from the zip archive simplifies all those tasks with a
single function: `jst_import_zip()`. For the following demonstration, we will
use a small sample archive that comes with the package.

As a first step, we should take a look at the archive and its content. This
is made easy with `jst_preview_zip()`:

```{r}
jst_preview_zip(jst_example("pseudo_dfr.zip")) %>% knitr::kable()
```

We can see that we have a simple archive with three metadata files and one 
ngram file. Before we can use `jst_import_zip()`, we first need to think about,
what we actually want to import: all of the data, or just parts? What kind of
data do we want to extract from articles, books and pamphlets? We can specify
this via `jst_define_import()`:

```{r}
import_spec <- jst_define_import(
  article = c(jst_get_article, jst_get_authors),
  book = jst_get_book,
  ngram1 = jst_get_ngram
)
```

In this case, we want to import data on articles (standard metadata plus
information on the authors), general data on books and unigrams (ngram1). This
specification can then be used with `jst_import_zip()`:

```{r}
# set up a temporary folder for output files
tmp <- tempdir()

# extract the content and write output to disk
jst_import_zip(jst_example("pseudo_dfr.zip"),
               import_spec = import_spec,
               out_file = "my_test",
               out_path = tmp)

```

We can take a look at the files within `tmp` with `list.files()`:
```{r}
list.files(tmp, pattern = "csv")
```

As you can see, `jst_import_zip()` automatically creates file names based on
the string you supplied to `out_file` to delineate the different types of
output.

If we want to re-import the data for further analysis, we can either use 
functions like `readr::read_csv()`, or a small helper function from the package
which determines and sets the column types correctly:

```{r}
jst_re_import(
  file.path(tmp, "my_test_journal_article_jst_get_article-1.csv")
) %>% 
  knitr::kable()
```

A side note on ngrams: For larger archives, importing all ngrams can take a very
long time. It is thus advisable to only import ngrams for articles which you
want to analyse, i.e. most likely a subset of the initial request. The 
function `jst_subset_ngrams()` helps you with this (see also the section on
importing bigrams in the 
[case study](https://docs.ropensci.org/jstor/articles/analysing-n-grams.html#importing-bigrams)).

```{r, echo=FALSE}
unlink(tmp)
```


## Parallel processing
Since the above process might take a while for larger archives (files have to
be unpacked, read and parsed), there might be a benefit of executing the 
function in parallel. `jst_import_zip()` and `jst_import()` use `furrr` at their
core, therefore adding parallelism is very easy. Just add the following lines
at the beginning of your script, and the import will use all available cores:

```{r, eval=FALSE}
library(future)

plan(multisession)
```

You can find out more about futures by reading the package vignette:

```{r, eval=FALSE}
vignette("future-1-overview", package = "future")
```


# Working with file paths
The above approach of importing directly from zip archives is very convenient,
but in some cases you might want to have more control over how data is imported.
For example, if you run into problems because the output from any of the 
functions provided with the package looks corrupted, you could
want to look at the original files. Alongside this, you could unzip the archive
and work with the files directly, which I will demonstrate in the following
sections.


## Unzip containers
For simple purposes it might be sensible
to unzip to a temporary directory (with `temp()` and `unzip()`) but for my 
research I simply extracted files to an external SSD, since I a) lacked disk space,
b) needed to read them fast, and c) wanted to be able to look at specific files for
debugging.

## List files
There are many ways to generate a list of all files: 
`list.files()` or using `system()` in conjunction with `find` on unix-like systems
are common options. 

For demonstration purposes I use files contained in `jstor` which can be accessed
via `system.file`:
```{r, echo=TRUE}
example_dir <- system.file("extdata", package = "jstor")
```

## `list.files`

```{r}
file_names_listed <- list.files(path = example_dir, full.names = TRUE,
                                pattern = "*.xml")
file_names_listed
```

### `system` and `find`
```{r, eval=FALSE}
file_names <- system(paste0("cd ", example_dir, "; find . -name '*.xml' -type f"), intern = TRUE)
```

```{r, eval=FALSE}
library(stringr)

file_names_system <- file_names %>%
  str_replace("^\\.\\/", "") %>%
  str_c(example_dir, "/", .)

file_names_system
#> [1] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library/jstor/extdata/sample_with_footnotes.xml" 
#> [2] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library/jstor/extdata/sample_book.xml"
#> [3] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library/jstor/extdata/sample_with_references.xml"
```

In this case the two approaches give the same result.
The main difference seems to be though, that `list.files` sorts the output, 
whereas `find` does not. For a large amount of files (200,000) this makes
`list.files` slower, for smaller datasets the difference shouldn't make an
impact.

## Batch import
Once the file list is generated, we can apply any of the `jst_get_*`-functions 
to the list. A good and simple way for small to moderate amounts of files is to
use `purrr::map_df()`:

```{r}
# only work with journal articles
article_paths <- file_names_listed %>% 
  keep(str_detect, "with")

article_paths %>% 
  map_df(jst_get_article) %>% 
  knitr::kable()

```

This works well if 1) there are no errors and 2) if there is only a moderate
size of files. For larger numbers of files, `jst_import()` can streamline the 
process for you. This function works very similar to `jst_import_zip()`, the 
main difference being that it needs file paths as input and can only handle
one type of output.

```{r}
jst_import(article_paths, out_file = "my_second_test", .f = jst_get_article, 
           out_path = tmp)
```

The result is again written to disk, as can be seen below:

```{r}
list.files(tmp, pattern = "csv")
```