--- title: "gutenbergr: Search and download public domain texts from Project Gutenberg" author: "David Robinson, Myfanwy Johnston" data: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{gutenbergr: Search and download public domain texts from Project Gutenberg} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", message = FALSE, warning = FALSE ) ``` ```{r packages-used} library(gutenbergr) library(dplyr) library(stringr) library(tidytext) ``` The gutenbergr package helps you download and process public domain works from the [Project Gutenberg](http://www.gutenberg.org/) collection. This includes both tools for downloading books (and stripping header/footer information), and a complete dataset of Project Gutenberg metadata that can be used to find words of interest. Includes: * A function `gutenberg_download()` that downloads one or more works from Project Gutenberg by ID: e.g., `gutenberg_download(84)` downloads the text of Frankenstein. * Metadata for all Project Gutenberg works as R datasets, so that they can be searched and filtered: * `gutenberg_metadata` contains information about each work, pairing Gutenberg ID with title, author, language, etc * `gutenberg_authors` contains information about each author, such as aliases and birth/death year * `gutenberg_subjects` contains pairings of works with Library of Congress subjects and topics ### Project Gutenberg Metadata This package contains metadata for all Project Gutenberg works as R datasets, so that you can search and filter for particular works before downloading. The dataset `gutenberg_metadata` contains information about each work, pairing Gutenberg ID with title, author, language, etc: ```{r basics} gutenberg_metadata ``` For example, you could find the Gutenberg ID(s) of Jane Austen's _Persuasion_ by doing: ```{r filter} gutenberg_metadata |> filter(title == "Persuasion") ``` In many analyses, you may want to filter just for English works, avoid duplicates, and include only books that have text that can be downloaded. The `gutenberg_works()` function does this pre-filtering: ```{r works} gutenberg_works() ``` It also allows you to perform filtering as an argument: ```{r Austen} gutenberg_works(author == "Austen, Jane") # or with a regular expression gutenberg_works(str_detect(author, "Austen")) ``` The meta-data currently in the package was last updated on **`r format(attr(gutenberg_metadata, "date_updated"), '%d %B %Y')`**. ### Downloading books by ID The function `gutenberg_download()` downloads one or more works from Project Gutenberg based on their ID. For example, we earlier saw that one version of _Persuasion_ has ID 105 (see [the URL here](https://www.gutenberg.org/ebooks/105)), so `gutenberg_download(105)` downloads this text. ```{r load 1 file, echo=FALSE} f105 <- system.file("extdata", "105.zip", package = "gutenbergr") persuasion <- gutenberg_download(105, mirror = "http://aleph.gutenberg.org" ) ``` ```{r load 1 from web, eval = FALSE} persuasion <- gutenberg_download(105) ``` ```{r display persuasion} persuasion ``` Notice it is returned as a tbl_df (a type of data frame) including two variables: `gutenberg_id` (useful if multiple books are returned), and a character vector of the text, one row per line. You can also provide `gutenberg_download()` a vector of IDs to download multiple books. For example, to download _Renascence, and Other Poems_ (book [109](https://www.gutenberg.org/ebooks/109)) along with _Persuasion_, do: ```{r load 2 from file, echo=FALSE} books <- gutenbergr::sample_books ``` ```{r load 2 from web, eval = FALSE} books <- gutenberg_download(c(109, 105), meta_fields = c("title", "author")) ``` ```{r display books} books ``` Notice that the `meta_fields` argument allows us to add one or more additional fields from the `gutenberg_metadata` to the downloaded text, such as title or author. ```{r count books} books |> count(title) ``` ### Other meta-datasets You may want to select books based on information other than their title or author, such as their genre or topic. `gutenberg_subjects` contains pairings of works with Library of Congress subjects and topics. "lcc" means [Library of Congress Classification](https://www.loc.gov/catdir/cpso/lcco/), while "lcsh" means [Library of Congress subject headings](https://id.loc.gov/authorities/subjects.html): ```{r subjects} gutenberg_subjects ``` This is useful for extracting texts from a particular topic or genre, such as detective stories, or a particular character, such as Sherlock Holmes. The `gutenberg_id` column can then be used to download these texts or to link with other metadata. ```{r filter subjects} gutenberg_subjects |> filter(subject == "Detective and mystery stories") gutenberg_subjects |> filter(grepl("Holmes, Sherlock", subject)) ``` `gutenberg_authors` contains information about each author, such as aliases and birth/death year: ```{r authors} gutenberg_authors ``` ### Analysis What's next after retrieving a book's text? Well, having the book as a data frame is especially useful for working with the [tidytext](https://github.com/juliasilge/tidytext) package for text analysis. ```{r tidytext} words <- books |> unnest_tokens(word, text) words word_counts <- words |> anti_join(stop_words, by = "word") |> count(title, word, sort = TRUE) word_counts ``` You may also find these resources useful: * The [Natural Language Processing CRAN View](https://CRAN.R-project.org/view=NaturalLanguageProcessing) suggests many R packages related to text mining, especially around the [tm package](https://cran.r-project.org/package=tm) * You could match the `wikipedia` column in `gutenberg_author` to Wikipedia content with the [WikipediR](https://cran.r-project.org/package=WikipediR) package or to pageview statistics with the [wikipediatrend](https://cran.r-project.org/package=wikipediatrend) package * If you're considering an analysis based on author name, you may find the [humaniformat](https://cran.r-project.org/package=humaniformat) (for extraction of first names) and [gender](https://cran.r-project.org/package=gender) (prediction of gender from first names) packages useful. (Note that humaniformat has a `format_reverse` function for reversing "Last, First" names).