---
title: Introduction to jstor
author: "Thomas Klebel"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
      toc: true
vignette: >
  %\VignetteIndexEntry{Introduction to jstor}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

The tool [Data for Research (DfR)](https://www.jstor.org/dfr/) by JSTOR is a
valuable source for citation analysis and text mining. `jstor`
provides functions and suggests workflows for importing
datasets from DfR.

When using DfR, requests for datasets can
be made for small excerpts (max. 25,000 records) or large ones, which require an
agreement between the researcher and JSTOR. `jstor` was developed to deal with
very large datasets which require an agreement, but can be used with smaller
ones as well.

The most important set of functions is a group of `jst_get_*` functions:

- `jst_get_article` (for journal documents and pamphlets)
- `jst_get_authors` (for all sources)
- `jst_get_references` (for journal documents)
- `jst_get_footnotes` (for journal documents)
- `jst_get_book` (for books and research reports)
- `jst_get_chapters` (for books and possibly research reports)

I will demonstrate their usage using the
[sample dataset](https://www.jstor.org/dfr/about/sample-datasets)
which is provided by JSTOR on their website.

# General Concept
All functions from the `jst_get_*` family which are concerned with meta 
data operate along the same lines:

1. The file is read with `xml2::read_xml()`.
2. Content of the file is extracted via XPATH or CSS-expressions.
3. The resulting data is returned in a tidy `tibble`.

The functions are similar in that all operate on single files
(article, book, research report or pamphlet). Depending on the content of the
file, the output of the functions might have one or multiple rows.
`jst_get_article` always returns a `tibble` with one row: the core meta data
(like title, id, or first page of the article) are single items,
and only one article is processed at a time. 
Running `jst_get_authors` for the same article might give you a tibble with one
or multiple rows, depending on the number of authors the article has. The same
is true for `jst_get_references` and `jst_get_footnotes`. If a file has no data on
references (they might still exist, but JSTOR might not have parsed them), the
output is only one row, with missing references. If there is data on references,
each entry gets its own row. Note however, that the number of rows does not
equal the number of references. References usually start with a title like 
"References", which is obviously not a reference to another article.
Be sure to think carefully about your assumptions and to check the content of 
your data before you make inferences.

Books work a bit differently. Searching for data on
https://www.jstor.org/dfr/results lets you filter for books, which are actually
book chapters. If you receive data from DfR on a book chapter, you always get 
one xml-file with the whole book, including data on all chapters. Ngram or 
full-text data for the same entry however is processed only from single
chapters^[See the
[technical specifications](https://www.jstor.org/dfr/about/technical-specifications)
for more detail.]. Thus, the output of `jst_get_book` for a single file is
similar to the one from `jst_get_article`: it is one row with general data about
the book. `jst_get_chapters` gives you data on all chapters, and the resulting 
tibble therefore might have multiple rows.


The following sections showcase the different functions separately.

# Application
Apart from `jstor` we only need to load `dplyr` for matching records and `knitr`
for printing nice tables.

```{r, message=FALSE, warning=FALSE}
library(jstor)
library(dplyr)
library(knitr)
```


## jst_get_article
The basic usage of the `jst_get_*` functions is very simple. They take only one 
argument, the path to the file to import:
```{r}
meta_data <- jst_get_article(file_path = jst_example("article_with_references.xml"))
```

The resulting object is a `tibble` with one row and 17 columns. The columns
correspond to most of the elements documented here: https://www.jstor.org/dfr/about/technical-specifications.

The columns are:

- file_name *(chr)*: The file name of the original .xml-file. Can be used 
 for joining with other parts (authors, references, footnotes, full-texts).
- journal_doi *(chr)*: A registered identifier for the journal.
- journal_jcode *(chr)*: A identifier for the journal like "amerjsoci" for
 the "American Journal of Sociology".
- journal_pub_id *(chr)*: Similar to journal_jcode. Most of the time either
 one is present.
- article_doi *(chr)*: A registered unique identifier for the article.
- article_jcode *(chr)*: A unique identifier for the article (not a DOI).
- article_pub_id *(chr)*: Infrequent, either part of the DOI or the 
 article_jcode.
- article_type *(chr)*: The type of article (research-article, book-review,
 etc.).
- article_title *(chr)*: The title of the article.
- volume *(chr)*: The volume the article was published in.
- issue *(chr)*: The issue the article was published in.
- language *(chr)*: The language of the article.
- pub_day *(chr)*: Publication day, if specified.
- pub_month *(chr)*: Publication month, if specified.
- pub_year *(int)*: Year of publication.
- first_page *(int)*: Page number for the first page of the article.
- last_page *(int)*: Page number for the last page of the article.

Since the output from all functions are tibbles, the result is nicely formatted:

```{r, results='asis'}
meta_data %>% kable()
```

## jst_get_authors
Extracting the authors works in similar fashion:

```{r, results='asis'}
authors <- jst_get_authors(jst_example("article_with_references.xml"))
kable(authors)
```

Here we have the following columns:

- *file_name*: The same as above, used for matching articles.
- *prefix*: A prefix to the name.
- *given_name*: The given name of the author (i.e. `Albert` or `A.`).
- *surname*: The surname of the author (i.e. `Einstein`).
- *string_name*: Sometimes instead of given_name and surname, only a full string is
supplied, i.e.: `Albert Einstein`, or `Einstein, Albert`.
- *suffix*: A suffix to the name, as in `Albert Einstein, II.`.
- *author_number*: An integer representing the order of how the authors appeared in the data.

The number of rows matches the number of authors -- each author get its' own row.

## jst_get_references
```{r}
references <- jst_get_references(jst_example("article_with_references.xml"))

# # we need to remove line breaks for knitr::kable() to work properly for printing
references <- references %>%
  mutate(ref_unparsed = stringr::str_remove_all(ref_unparsed, "\\\n"))
```

We have two columns:

- *file_name*: Identifier, can be used for matching.
- *ref_title*: The title of the references sections.
- *ref_authors*: A string of authors. Several authors are separated with `;`.
- *ref_editors*: A string of editors, if present.
- *ref_collab*: A field that may contain information on the authors, if authors
            are not available.
- *ref_item_title*: The title of the cited entry.
- *ref_year*: A year, often the article's publication year, but not always. 
- *ref_source*: The source of the cited entry. For books often the title of the book,
            for articles the publisher of the journal.
- *ref_volume*: The volume of the journal article.
- *ref_first_page*: The first page of the article/chapter.
- *ref_last_page*: The last page of the article/chapter.
- *ref_publisher*: For books the publisher, for articles often missing.
- *ref_publication_type*: Known types: book, journal, web, other.
- *ref_unparsed*: The full references entry in unparsed form.


Here I display 5 random entries:

```{r, echo=FALSE}
set.seed(1234)
```

```{r}
references %>% 
  sample_n(5) %>% 
  kable()
```

This example shows several things: `file_name` is identical among rows, since
it identifies the article and all references came from one article. The the
sample file doesn't follow a typical convention (it was published in 1922), 
therefore there are several different headings (`ref_title`). Usually, this is
only "Bibliography" or "References".

Since the references were not parsed by JSTOR, we only get an unparsed version.
In general, the content of references (`unparsed_refs`) is in quite a raw state, 
quite often the result of digitising scans via OCR. For example, the last entry 
reads like this: 
`MACHADO, A.1911 Zytologische Untersuchungen fiber Trypanosoma rotatorium ...`.
There is an error here: `fiber` should be `über`. The language of the source
is German, but the OCR-software assumed English. Therefore, it didn't
recognize the *Umlaut*. Similar errors are common for text read via OCR.

For other files, we can set `parse_refs = TRUE`, so references will be imported 
in their parsed form, whenever they are available. 

```{r}
jst_get_references(
  jst_example("parsed_references.xml"),
  parse_refs = TRUE
) %>% 
  kable()
```


Note, that there might be other content present like endnotes, in case the
article used endnotes rather than footnotes.


## jst_get_footnotes
```{r, results='asis'}
jst_get_footnotes(jst_example("article_with_references.xml")) %>% 
  kable()
```

Very commonly, articles either have footnotes or references. The sample file
used here does not have footnotes, therefore a simple `tibble` with missing
footnotes is returned.

I will use another file to demonstrate footnotes.

```{r, results='asis'}
footnotes <- jst_get_footnotes(jst_example("article_with_footnotes.xml"))

footnotes %>% 
  mutate(footnotes = stringr::str_remove_all(footnotes, "\\\n")) %>% 
  kable()

```

In general, you might need to combine `jst_get_footnotes()` with
`jst_get_references()` to get all available information on citation data.

## jst_get_full_text
The function to extract full texts can't be demonstrated with proper data, since
the full texts are only supplied upon special request with DfR. The function
guesses the encoding of the specified file via `readr::guess_encoding()`, reads
the whole file and returns a `tibble` with `file_name`, `full_text` and
`encoding`.

I created a file that looks similar to files supplied by DfR with sample text:

```{r, results='asis'}
full_text <- jst_get_full_text(jst_example("full_text.txt"))
full_text %>% 
  mutate(full_text = stringr::str_remove_all(full_text, "\\\n")) %>% 
  kable()
```


## Combining results
Different parts of meta-data can be combined by using `dplyr::left_join()`.

### Matching with authors

```{r, results='asis'}
meta_data %>% 
  left_join(authors) %>%
  select(file_name, article_title, pub_year, given_name, surname) %>% 
  kable()
```

### Matching with references

```{r, results='asis'}
meta_data %>% 
  left_join(references) %>% 
  select(file_name, article_title, volume, pub_year, ref_unparsed) %>%
  head(5) %>% 
  kable()
```


# Books
Quite recently DfR added book chapters to their stack. To import metadata about
the books and chapters, jstor supplies `jst_get_book` and `jst_get_chapters`.

`jst_get_book` is very similar to `jst_get_article`. We obtain general information
about the complete book:

```{r, results='asis'}
jst_get_book(jst_example("book.xml")) %>% knitr::kable()
```

A single book might contain many chapters. `jst_get_chapters` extracts all of them.
Due to this, the function is a bit slower than most of jstor's other functions.

```{r}
chapters <- jst_get_chapters(jst_example("book.xml"))

str(chapters)
```

Without the abstracts (they are rather long) the first 10 chapters look like
this:

```{r, results='asis'}
chapters %>% 
  select(-abstract) %>% 
  head(10) %>% 
  kable()
```


Since extracting all authors for all chapters needs considerably
more time, by default authors are not extracted. You can import them like so:

```{r}
author_chap <- jst_get_chapters(jst_example("book.xml"), authors = TRUE) 
```

The authors are supplied in a list column:
```{r}
class(author_chap$authors)
```

You can expand this list with `tidyr::unnest`:

```{r}
author_chap %>% 
  tidyr::unnest(authors) %>% 
  select(part_id, given_name, surname) %>% 
  head(10) %>% 
  kable()
```

You can learn more about the concept of list-columns in Hadley Wickham's book
[R for Data Science](https://r4ds.had.co.nz/many-models.html).