---
title: "gutenbergr: Search and download public domain texts from Project Gutenberg"
author: "David Robinson, Myfanwy Johnston"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{gutenbergr: Search and download public domain texts from Project Gutenberg}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE
)
```

```{r packages-used}
library(gutenbergr)
library(dplyr)
library(stringr)
library(tidytext)
```

The gutenbergr package helps you download and process public domain works from the [Project Gutenberg](http://www.gutenberg.org/) collection. It provides both tools for downloading books (and stripping header/footer information) and a complete dataset of Project Gutenberg metadata that can be used to find works of interest. It includes:

* A function `gutenberg_download()` that downloads one or more works from Project Gutenberg by ID: e.g., `gutenberg_download(84)` downloads the text of Frankenstein.
* Metadata for all Project Gutenberg works as R datasets, so that they can be searched and filtered:
  * `gutenberg_metadata` contains information about each work, pairing Gutenberg ID with title, author, language, etc.
  * `gutenberg_authors` contains information about each author, such as aliases and birth/death year
  * `gutenberg_subjects` contains pairings of works with Library of Congress subjects and topics
  
### Project Gutenberg Metadata

This package contains metadata for all Project Gutenberg works as R datasets, so that you can search and filter for particular works before downloading.

The dataset `gutenberg_metadata` contains information about each work, pairing Gutenberg ID with title, author, language, etc:

```{r basics}
gutenberg_metadata
```

For example, you could find the Gutenberg ID(s) of Jane Austen's _Persuasion_ by doing:

```{r filter}
gutenberg_metadata |>
  filter(title == "Persuasion")
```

In many analyses, you may want to filter just for English works, avoid duplicates, and include only books that have text that can be downloaded. The `gutenberg_works()` function does this pre-filtering:

```{r works}
gutenberg_works()
```
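
To make the pre-filtering more concrete, here is a rough manual approximation built directly on `gutenberg_metadata`. It is only a sketch: it assumes the `language`, `has_text`, `title`, and `gutenberg_author_id` columns of the dataset, and `gutenberg_works()` may apply further rules (for example on rights), so the two results need not match row for row.

```{r works-manual}
library(gutenbergr)
library(dplyr)

# Approximation only: keep English works that have downloadable text,
# then keep one row per title/author combination.
manual_works <- gutenberg_metadata |>
  filter(language == "en", has_text) |>
  distinct(title, gutenberg_author_id, .keep_all = TRUE)

manual_works
```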

It also allows you to perform filtering as an argument:

```{r Austen}
gutenberg_works(author == "Austen, Jane")

# or with a regular expression
gutenberg_works(str_detect(author, "Austen"))
```
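
The English-only default can be changed as well. The sketch below assumes that the `languages` argument of `gutenberg_works()` accepts the same language codes that appear in the `language` column of `gutenberg_metadata`, such as `"fr"` for French.

```{r works-french}
library(gutenbergr)

# Assumption: "fr" is the code used for French in the `language` column.
french_works <- gutenberg_works(languages = "fr")

french_works
```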

The metadata currently in the package was last updated on **`r format(attr(gutenberg_metadata, "date_updated"), '%d %B %Y')`**.

### Downloading books by ID

The function `gutenberg_download()` downloads one or more works from Project Gutenberg based on their ID. For example, we saw earlier that one version of _Persuasion_ has ID 105 (see [the URL here](https://www.gutenberg.org/ebooks/105)), so `gutenberg_download(105)` downloads this text.

```{r load 1 file, echo=FALSE}
# Read the bundled copy of book 105 so the vignette builds without network access
f105 <- system.file("extdata", "105.zip", package = "gutenbergr")
persuasion <- gutenberg_download(105,
  files = f105,
  mirror = "http://aleph.gutenberg.org"
)
```


```{r load 1 from web, eval = FALSE}
persuasion <- gutenberg_download(105)
```

```{r display persuasion}
persuasion
```

Notice that it is returned as a tbl_df (a type of data frame) with two variables: `gutenberg_id` (useful if multiple books are returned) and `text`, a character vector with one row per line of the book.
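
Because each row is one line of text, ordinary dplyr verbs are enough for basic clean-up. The sketch below numbers the lines within each book and then drops blank ones. It uses the small `sample_books` dataset bundled with gutenbergr as a stand-in for a freshly downloaded text, and the `line` column name is our own choice.

```{r line-numbers}
library(gutenbergr)
library(dplyr)

# sample_books stands in for the result of a gutenberg_download() call.
numbered <- gutenbergr::sample_books |>
  group_by(gutenberg_id) |>
  mutate(line = row_number()) |> # position of each line within its book
  ungroup() |>
  filter(text != "") # drop blank lines, keeping the original numbering

numbered
```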

You can also provide `gutenberg_download()` a vector of IDs to download multiple books. For example, to download _Renascence, and Other Poems_ (book [109](https://www.gutenberg.org/ebooks/109)) along with _Persuasion_, do:

```{r load 2 from file, echo=FALSE}
books <- gutenbergr::sample_books
```


```{r load 2 from web, eval = FALSE}
books <- gutenberg_download(c(109, 105), meta_fields = c("title", "author"))
```

```{r display books}
books
```

Notice that the `meta_fields` argument allows us to add one or more fields from `gutenberg_metadata`, such as title or author, to the downloaded text.

```{r count books}
books |>
  count(title)
```
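
If a text was downloaded without `meta_fields`, the same columns can be attached afterwards with a join on `gutenberg_id`. This is a sketch of that idea rather than a description of what `gutenberg_download()` does internally; it starts from the bundled `sample_books` dataset with its metadata columns removed.

```{r manual-meta-join}
library(gutenbergr)
library(dplyr)

# Simulate a download without meta_fields: only the ID and the text
text_only <- gutenbergr::sample_books |>
  select(gutenberg_id, text)

# Attach title and author from the metadata table by Gutenberg ID
with_meta <- text_only |>
  left_join(
    gutenberg_metadata |> select(gutenberg_id, title, author),
    by = "gutenberg_id"
  )

with_meta |>
  count(title, author)
```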

### Other meta-datasets

You may want to select books based on information other than their title or author, such as their genre or topic. `gutenberg_subjects` contains pairings of works with Library of Congress subjects and topics. "lcc" means [Library of Congress Classification](https://www.loc.gov/catdir/cpso/lcco/), while "lcsh" means [Library of Congress subject headings](https://id.loc.gov/authorities/subjects.html):

```{r subjects}
gutenberg_subjects
```

This is useful for extracting texts on a particular topic or in a particular genre, such as detective stories, or texts featuring a particular character, such as Sherlock Holmes. The `gutenberg_id` column can then be used to download these texts or to link with other metadata.

```{r filter subjects}
gutenberg_subjects |>
  filter(subject == "Detective and mystery stories")

gutenberg_subjects |>
  filter(grepl("Holmes, Sherlock", subject))
```
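
One way to link subjects back to the works themselves is a semi-join on `gutenberg_id`. The following is a minimal sketch that keeps only the pre-filtered works tagged with the detective subject heading used above.

```{r subject-works}
library(gutenbergr)
library(dplyr)

# IDs of all works tagged with this subject heading
detective_ids <- gutenberg_subjects |>
  filter(subject == "Detective and mystery stories") |>
  distinct(gutenberg_id)

# Keep only the pre-filtered works whose ID appears in that list
detective_works <- gutenberg_works() |>
  semi_join(detective_ids, by = "gutenberg_id")

detective_works |>
  select(gutenberg_id, title, author)
```

The resulting `detective_works$gutenberg_id` vector could then be passed to `gutenberg_download()`. That step is not run here because it requires network access.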

`gutenberg_authors` contains information about each author, such as aliases and birth/death year:

```{r authors}
gutenberg_authors
```
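
Author information can be combined with the works in the same way. The sketch below assumes that `gutenberg_authors` has a numeric `birthdate` column holding the birth year and that both tables share a `gutenberg_author_id` column. It selects works by authors born in the nineteenth century.

```{r authors-19c}
library(gutenbergr)
library(dplyr)

# Authors with a recorded birth year from 1800 to 1899
# (rows with a missing birthdate are dropped by filter())
nineteenth_century <- gutenberg_authors |>
  filter(birthdate >= 1800, birthdate < 1900)

# Works written by those authors
gutenberg_works() |>
  semi_join(nineteenth_century, by = "gutenberg_author_id") |>
  select(gutenberg_id, title, author)
```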

### Analysis

What's next after retrieving a book's text? Having the book as a data frame is especially useful for text analysis with the [tidytext](https://github.com/juliasilge/tidytext) package.

```{r tidytext}
words <- books |>
  unnest_tokens(word, text)

words

word_counts <- words |>
  anti_join(stop_words, by = "word") |>
  count(title, word, sort = TRUE)

word_counts
```
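
As one possible next step, the counts can be reduced to the most frequent words in each book. This sketch recomputes the counts from the bundled `sample_books` dataset so that it runs on its own; the cutoff of five words per title is arbitrary.

```{r top-words}
library(gutenbergr)
library(dplyr)
library(tidytext)

top_words <- gutenbergr::sample_books |>
  unnest_tokens(word, text) |> # one row per word
  anti_join(stop_words, by = "word") |> # remove common stop words
  count(title, word, sort = TRUE) |>
  group_by(title) |>
  slice_max(n, n = 5, with_ties = FALSE) |> # five most frequent words per title
  ungroup()

top_words
```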

You may also find these resources useful:

* The [Natural Language Processing CRAN View](https://CRAN.R-project.org/view=NaturalLanguageProcessing) suggests many R packages related to text mining, especially around the [tm package](https://cran.r-project.org/package=tm)
* You could match the `wikipedia` column in `gutenberg_authors` to Wikipedia content with the [WikipediR](https://cran.r-project.org/package=WikipediR) package or to pageview statistics with the [wikipediatrend](https://cran.r-project.org/package=wikipediatrend) package
* If you're considering an analysis based on author name, you may find the [humaniformat](https://cran.r-project.org/package=humaniformat) (for extraction of first names) and [gender](https://cran.r-project.org/package=gender) (prediction of gender from first names) packages useful. (Note that humaniformat has a `format_reverse` function for reversing "Last, First" names.)