--- title: "Introduction to the textreuse package" author: "Lincoln Mullen" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to the textreuse packages} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- The textreuse package provides classes and functions to detect document similarity and text reuse in text corpora. This introductory vignette provides details on the `TextReuseTextDocument` and `TextReuseCorpus` classes, as well as functions for tokenizing, hashing, and measuring similarity. See the pairwise, minhash/LSH, or alignment vignettes for details on solving text similarity problems. ```{r eval=FALSE} vignette("textreuse-pairwise", package = "textreuse") vignette("textreuse-minhash", package = "textreuse") vignette("textreuse-alignment", package = "textreuse") ``` For these vignette we will use a small corpus of eight documents published by the [American Tract Society](https://en.wikipedia.org/wiki/American_Tract_Society) and available from the Internet Archive. The [full corpus](http://lincolnmullen.com/blog/corpus-of-american-tract-society-publications/) is also available to be downloaded if you wish to test the package. ## TextReuse classes ### TextReuseTextDocument The most basic class provided by this package is the `TextReuseTextDocument` class. This class contains the text of a document and its metadata. When the document is loaded, the text is also tokenized. (See the section on tokenizers below.) Those tokens are then hashed using a hash function. By default the hashes are retained and the tokens are discarded, since using only hashes results in a significant memory savings. Here we load a file into a `TextReuseTextDocument` and tokenize it into shingled n-grams, adding an option to retain the tokens. ```{r} library(textreuse) file <- system.file("extdata/ats/remember00palm.txt", package = "textreuse") doc <- TextReuseTextDocument(file = file, meta = list("publisher" = "ATS"), tokenizer = tokenize_ngrams, n = 5, keep_tokens = TRUE) doc ``` We can see details of the document with accessor functions. These are derived from the S3 virtual class `TextDocument ` in the [NLP](https://cran.r-project.org/package=NLP) package. Notice that an ID has been assigned to the document based on the filename (without the extension). The name of the tokenizer and hash functions are also saved in the metadata. ```{r} meta(doc) meta(doc, "id") meta(doc, "date") <- 1865 head(tokens(doc)) head(hashes(doc)) wordcount(doc) ``` The `tokens()` and `hashes()` function return the tokens and hashes associated with the document. The `meta()` function returns a named list of all the metadata fields. If that function is called with a specific ID, as in `meta(doc, "myfield")`, then the value for only that field is returned. You can also assign to the metadata as a whole or a specific field, as in the example above. In addition the `content()` function provides the unprocessed text of the document. The assumption is that is that you want to tokenize and hash the tokens from the start. If, however, you wish to do any of those steps yourself, you can load a document with `tokenizer = NULL`, then use `tokenize()` or `rehash()` to recompute the tokens and hashes. Note that a `TextReuseTextDocument` can actually contain two kinds of hashes. The `hashes()` accessor gives you integer representations of each of the tokens in the document: if there are 100,000 tokens in the document, there will be 100,000 hashes. 
### TextReuseCorpus

The class `TextReuseCorpus` provides a list of `TextReuseTextDocuments`. It derives from the S3 virtual class `Corpus` in the [tm](https://cran.r-project.org/package=tm) package. It can be created from a directory of files (or by providing a vector of paths to files).

```{r}
dir <- system.file("extdata/ats", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, tokenizer = tokenize_ngrams, n = 5,
                          progress = FALSE)
corpus
```

The names of the items in a `TextReuseCorpus` are the IDs of the documents. You can use these IDs to subset the corpus or to retrieve specific documents.

```{r}
names(corpus)
corpus[["remember00palm"]]
corpus[c("calltounconv00baxt", "lifeofrevrichard00baxt")]
```

Accessor functions such as `meta()`, `tokens()`, `hashes()`, and `wordcount()` have methods that work on corpora.

```{r}
wordcount(corpus)
```

Note that when creating a corpus, very short or empty documents will be skipped with a warning. A document must have enough words to create at least two n-grams. For example, if five-grams are desired, then the document must have at least six words.

## Tokenizers

One of the steps that is performed when loading a `TextReuseTextDocument`, either individually or in a corpus, is tokenization. Tokenization breaks up a text into pieces, often overlapping. These pieces are the features which are compared when measuring document similarity.

The textreuse package provides a number of tokenizers.

```{r}
text <- "How many roads must a man walk down\nBefore you'll call him a man?"

tokenize_words(text)
tokenize_sentences(text)
tokenize_ngrams(text, n = 3)
tokenize_skip_ngrams(text, n = 3, k = 2)
```

You can write your own tokenizers or use tokenizers from other packages. They should accept a character vector as their first argument. As an example, we will write a tokenizer function using the [stringr](https://cran.r-project.org/package=stringr) package which splits a text on new lines, perhaps useful for poetry. Notice that the function takes a single string and returns a character vector with one element for each line. (A more robust tokenizer might strip blank lines and punctuation, include an option for lowercasing the text, and check the validity of its arguments.)

```{r}
poem <- "Roses are red\nViolets are blue\nI like using R\nAnd you should too"
cat(poem)

tokenize_lines <- function(string) {
  stringr::str_split(string, "\n+")[[1]]
}

tokenize_lines(poem)
```

## Hash functions

This package provides one function to hash tokens to integers, `hash_string()`.

```{r}
hash_string(tokenize_words(text))
```

You can write your own hash functions, or use those provided by the [digest](https://cran.r-project.org/package=digest) package.
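As an illustration of the kind of function you might write, here is a sketch (not evaluated) of a hash function built on the digest package. The name `hash_digest` is hypothetical, and unlike `hash_string()` it returns character rather than integer hashes.

```{r eval = FALSE}
# A sketch of a custom hash function using the digest package.
# It hashes each token separately and returns one character hash per token.
hash_digest <- function(tokens) {
  vapply(tokens, digest::digest, character(1),
         algo = "xxhash32", serialize = FALSE)
}

hash_digest(tokenize_words(text))
```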
```{r} a <- tokenize_words(paste("How does it feel, how does it feel?", "To be without a home", "Like a complete unknown, like a rolling stone")) b <- tokenize_words(paste("How does it feel, how does it feel?", "To be on your own, with no direction home", "A complete unknown, like a rolling stone")) jaccard_similarity(a, b) jaccard_dissimilarity(a, b) jaccard_bag_similarity(a, b) ratio_of_matches(a, b) ``` See the documentation for `?similarity-functions` for details on what is measured with these functions. You can write your own similarity functions, which should accept two sets or bags, `a` and `b`, should work on both character and numeric vectors, since they are used with either tokens or hashes of tokens, and should return a single numeric score for the comparison. You will need to implement a method for the `TextReuseTextDocument` class. ## Parallelization This package will use multiple cores for a few functions is an option is set. This only benefits the corpus loading and tokenizing functions, which are often the slowest parts of an analysis. This is implemented with the [parallel package](https://cran.r-project.org/view=HighPerformanceComputing), and does not work on Windows machines. (Regardless of the options set, this package will never attempt to parallelize computations on Windows.) To use the parallel option, you must specify the number of CPU cores that you wish to use: ```{R eval = FALSE} options("mc.cores" = 4L) ``` If that option is set, this package will use multiple cores when possible. You can figure out how many cores your computer has with `parallel::detectCores()`. See `help(package = "parallel")` for more details.