Title: | Text Interchange Format |
---|---|
Description: | Provides validation functions for common interchange formats for representing text data in R. Includes formats for corpus objects, document term matrices, and tokens. Other annotations can be stored by overloading the tokens structure. |
Authors: | Taylor Arnold [aut, cre], Ken Benoit [aut], Lincoln Mullen [aut], Adam Obeng [aut], rOpenSci Text Workshop Participants (2017) [aut] |
Maintainer: | Taylor B. Arnold <[email protected]> |
License: | GPL-2 |
Version: | 0.4 |
Built: | 2024-10-28 05:50:58 UTC |
Source: | https://github.com/ropenscilabs/tif |
This package describes and validates formats for storing common objects arising in text analysis as native R objects. Representations of a text corpus, document term matrix, and tokenized text are included. The corpus and tokens objects have multiple valid formats. Packages compliant with the tif proposal should accept all valid formats and should either directly return output in at least one of the formats or provide functions for converting their output into one (when applicable). The tokenized text format is extensible to include other annotations such as part of speech tags and named entities.
Maintainer: Taylor Arnold [email protected]
Authors:
Ken Benoit [email protected]
Lincoln Mullen [email protected]
Adam Obeng [email protected]
rOpenSci Text Workshop Participants (2017)
Useful links:
Report bugs at http://github.com/ropensci/tif/issues
These functions convert between the various valid formats for corpus and tokens objects. By using these in other packages, maintainers need only handle whichever specific format they prefer to work with, while keeping the freedom to output (or convert into) the one best suited to their package's paradigm.
    tif_as_corpus_character(corpus)
    ## Default S3 method:
    tif_as_corpus_character(corpus)
    ## S3 method for class 'character'
    tif_as_corpus_character(corpus)
    ## S3 method for class 'data.frame'
    tif_as_corpus_character(corpus)

    tif_as_corpus_df(corpus)
    ## Default S3 method:
    tif_as_corpus_df(corpus)
    ## S3 method for class 'character'
    tif_as_corpus_df(corpus)
    ## S3 method for class 'data.frame'
    tif_as_corpus_df(corpus)

    tif_as_tokens_df(tokens)
    ## Default S3 method:
    tif_as_tokens_df(tokens)
    ## S3 method for class 'list'
    tif_as_tokens_df(tokens)
    ## S3 method for class 'data.frame'
    tif_as_tokens_df(tokens)

    tif_as_tokens_list(tokens)
    ## Default S3 method:
    tif_as_tokens_list(tokens)
    ## S3 method for class 'list'
    tif_as_tokens_list(tokens)
    ## S3 method for class 'data.frame'
    tif_as_tokens_list(tokens)
corpus | valid tif corpus object to coerce |
tokens | valid tif tokens object to coerce |
No explicit checking is done on the input; the output is guaranteed to be valid only if the input is in a valid format. In fact, because of R's copy-on-modify semantics, we make an effort not to modify an object that already appears to be in the required format.
    # coerce corpus object
    corpus <- c("Aujourd'hui, maman est morte.",
                "It was a pleasure to burn.",
                "All this happened, more or less.")
    names(corpus) <- c("Camus", "Bradbury", "Vonnegut")
    new <- tif_as_corpus_df(corpus)
    new
    tif_as_corpus_character(new)

    # coerce tokens object
    tokens <- list(doc1 = c("aujourd'hui", "maman", "est", "morte"),
                   doc2 = c("it", "was", "a", "pleasure", "to", "burn"),
                   doc3 = c("all", "this", "happened", "more", "or", "less"))
    new <- tif_as_tokens_df(tokens)
    new
    tif_as_tokens_list(new)
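To make the correspondence between the two corpus formats concrete, here is a minimal base R sketch. The helpers `corpus_chr_to_df` and `corpus_df_to_chr` are hypothetical names used only for illustration; they are not the tif implementations. The generated ids (`doc1`, `doc2`, ...) for unnamed input are an assumption rather than documented behavior. Like the functions described above, the sketch does no validation and returns an object unchanged when it already appears to be in the requested format.

```r
# Hypothetical helpers illustrating the correspondence between the two
# corpus formats; these are NOT the tif implementations.
corpus_chr_to_df <- function(corpus) {
  # Already a data frame: return it untouched so no copy is triggered.
  if (is.data.frame(corpus)) return(corpus)
  ids <- names(corpus)
  # Assumption: unnamed input gets generated ids doc1, doc2, ...
  if (is.null(ids)) ids <- sprintf("doc%d", seq_along(corpus))
  data.frame(doc_id = ids, text = unname(corpus), stringsAsFactors = FALSE)
}

corpus_df_to_chr <- function(corpus) {
  # Already a character vector: return it untouched.
  if (is.character(corpus)) return(corpus)
  out <- corpus$text
  names(out) <- corpus$doc_id
  out
}

corpus <- c(Camus = "Aujourd'hui, maman est morte.",
            Bradbury = "It was a pleasure to burn.")
df <- corpus_chr_to_df(corpus)
back <- corpus_df_to_chr(df)

# The round trip recovers the original named character vector.
identical(back, corpus)
```

The names of the character vector carry the document ids, so the round trip loses nothing as long as the data frame has no extra metadata columns.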
A valid character vector corpus object is a character vector with UTF-8 encoding. If it has names, these should be unique character strings, also in UTF-8 encoding. No other attributes should be present.
    tif_is_corpus_character(corpus, warn = FALSE)
corpus | a corpus object to test for validity |
warn | logical. Should the function produce a verbose warning for the condition for which the validation fails? Useful for testing. |
The tests are run sequentially and the function returns, with a warning if the warn flag is set, on the first test that fails. We use this implementation because some tests may fail entirely or be meaningless if the prior ones are not passed.
a logical vector of length one indicating whether the input is a valid corpus
    corpus <- c("Aujourd'hui, maman est morte.",
                "It was a pleasure to burn.",
                "All this happened, more or less.")
    tif_is_corpus_character(corpus)

    names(corpus) <- c("Camus", "Bradbury", "Vonnegut")
    tif_is_corpus_character(corpus)
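The conditions listed for a character vector corpus can be approximated in base R. The function `looks_like_corpus_character` below is a hypothetical illustration, not the tif implementation, and may differ from it in detail. In particular, testing encoding with `validUTF8()` is an assumption about how such a check might be written.

```r
# Hypothetical approximation of the validity conditions; NOT tif code.
looks_like_corpus_character <- function(corpus) {
  # 1. Must be a character vector.
  if (!is.character(corpus)) return(FALSE)
  # 2. No attributes other than (optional) names.
  if (!all(names(attributes(corpus)) %in% "names")) return(FALSE)
  # 3. Text must be valid UTF-8 (assumed way of testing the encoding).
  if (!all(validUTF8(corpus))) return(FALSE)
  # 4. Names, when present, must be unique and valid UTF-8.
  ids <- names(corpus)
  if (!is.null(ids) && (anyDuplicated(ids) > 0 || !all(validUTF8(ids)))) {
    return(FALSE)
  }
  TRUE
}

ok <- c(Camus = "Aujourd'hui, maman est morte.",
        Bradbury = "It was a pleasure to burn.")
dup <- c(a = "first text", a = "second text")   # duplicated names
extra <- structure("some text", lang = "en")    # extra attribute

looks_like_corpus_character(ok)
looks_like_corpus_character(dup)
looks_like_corpus_character(extra)
```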
A valid data frame corpus object is a data frame with at least two columns. One column must be called doc_id and be a character vector with UTF-8 encoding. Document ids must be unique. There must also be a column called text, which must also be a character vector in UTF-8 encoding. Each individual document is represented by a single row in the data frame. Additional document-level metadata columns and corpus-level attributes are allowed but not required.
    tif_is_corpus_df(corpus, warn = FALSE)
corpus | a corpus object to test for validity |
warn | logical. Should the function produce a verbose warning for the condition for which the validation fails? Useful for testing. |
The tests are run sequentially and the function returns, with a warning if the warn flag is set, on the first test that fails. We use this implementation because some tests may fail entirely or be meaningless if the prior ones are not passed. For example, if the corpus object does not have a variable named "text" it does not make sense to check whether this column is a character vector.
a logical vector of length one indicating whether the input is a valid corpus
    corpus <- data.frame(doc_id = c("doc1", "doc2", "doc3"),
                         text = c("Aujourd'hui, maman est morte.",
                                  "It was a pleasure to burn.",
                                  "All this happened, more or less."),
                         stringsAsFactors = FALSE)
    tif_is_corpus_df(corpus)

    corpus$author <- c("Camus", "Bradbury", "Vonnegut")
    tif_is_corpus_df(corpus)
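The sequential, stop-at-first-failure pattern described in the details can be sketched in base R as follows. `check_corpus_df` and its warning messages are hypothetical and only illustrate the control flow; the actual tests and messages used by `tif_is_corpus_df` may differ.

```r
# Hypothetical sketch of sequential validation with an optional warning;
# NOT the tif implementation.
check_corpus_df <- function(corpus, warn = FALSE) {
  # Emit the warning only when requested, then signal failure.
  fail <- function(msg) {
    if (warn) warning(msg, call. = FALSE)
    FALSE
  }
  if (!is.data.frame(corpus)) {
    return(fail("corpus must be a data frame"))
  }
  if (!all(c("doc_id", "text") %in% names(corpus))) {
    return(fail("corpus must have columns 'doc_id' and 'text'"))
  }
  # These checks are meaningful only because the columns are known to exist.
  if (!is.character(corpus$doc_id) || !is.character(corpus$text)) {
    return(fail("'doc_id' and 'text' must be character vectors"))
  }
  if (anyDuplicated(corpus$doc_id) > 0) {
    return(fail("document ids must be unique"))
  }
  TRUE
}

good <- data.frame(doc_id = c("doc1", "doc2"),
                   text = c("It was a pleasure to burn.",
                            "All this happened, more or less."),
                   stringsAsFactors = FALSE)
no_text <- good["doc_id"]   # the 'text' column is missing

check_corpus_df(good)
check_corpus_df(no_text)
```

Calling `check_corpus_df(no_text, warn = TRUE)` in this sketch reports only the missing column, since the later tests are never reached.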
A valid document term matrix is a sparse matrix with the rows representing documents and the columns representing terms. The row names are a character vector giving the document ids with no duplicated entries. The column names are a character vector giving the terms of the matrix with no duplicated entries. The sparse matrix should inherit from the Matrix class dgCMatrix.
    tif_is_dtm(dtm, warn = FALSE)
dtm | a document term matrix object to test the validity of |
warn | logical. Should the function produce a verbose warning for the condition for which the validation fails? Useful for testing. |
The tests are run sequentially and the function returns, with a warning if the warn flag is set, on the first test that fails. We use this implementation because some tests may fail entirely or be meaningless if the prior ones are not passed. For example, if the dtm object is not a matrix it may not contain row or column names.
a logical vector of length one indicating whether the input is a valid document term matrix
    #' @importFrom Matrix Matrix
    dtm <- Matrix::Matrix(0, ncol = 26, nrow = 5, sparse = TRUE)
    colnames(dtm) <- LETTERS
    rownames(dtm) <- sprintf("doc%d", 1:5)
    tif_is_dtm(dtm)
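For a matrix with actual counts, one possible construction uses `Matrix::sparseMatrix()` with (document, term, count) triplets. This is a sketch with made-up counts and assumes the Matrix package is installed. It checks the structural conditions described above directly in base R rather than calling `tif_is_dtm`.

```r
library(Matrix)

# Triplets (document index, term index, count) for the non-zero entries.
# The counts are invented for illustration.
doc_idx  <- c(1, 1, 2, 2, 2)
term_idx <- c(1, 2, 2, 3, 4)
counts   <- c(1, 2, 1, 1, 3)

dtm <- sparseMatrix(
  i = doc_idx, j = term_idx, x = counts,
  dims = c(2, 4),
  dimnames = list(c("doc1", "doc2"), c("a", "burn", "pleasure", "to"))
)

# Structural conditions from the description above.
inherits(dtm, "dgCMatrix")
is.character(rownames(dtm)) && anyDuplicated(rownames(dtm)) == 0
is.character(colnames(dtm)) && anyDuplicated(colnames(dtm)) == 0
```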
A valid tokens data frame object is a data frame or an object that inherits from a data frame. It has no row names and has at least two columns. It must contain a column called doc_id that is a character vector with UTF-8 encoding. Document ids must uniquely identify documents; because each token occupies its own row, an id is repeated for every token of its document. It must also contain a column called token, which must also be a character vector in UTF-8 encoding. Each individual token is represented by a single row in the data frame. Additional token-level metadata columns are allowed but not required.
    tif_is_tokens_df(tokens, warn = FALSE)
tokens | a tokens object to test for validity |
warn | logical. Should the function produce a verbose warning for the condition for which the validation fails? Useful for testing. |
The tests are run sequentially and the function returns, with a warning if the warn flag is set, on the first test that fails. We use this implementation because some tests may fail entirely or be meaningless if the prior ones are not passed. For example, if the tokens object does not have a variable named "doc_id" it does not make sense to check whether this column is a character vector.
a logical vector of length one indicating whether the input is a valid tokens object
    tokens <- data.frame(doc_id = c("doc1", "doc1", "doc1", "doc1",
                                    "doc2", "doc2", "doc2", "doc2", "doc2", "doc2",
                                    "doc3", "doc3", "doc3", "doc3", "doc3", "doc3"),
                         token = c("aujourd'hui", "maman", "est", "morte",
                                   "it", "was", "a", "pleasure", "to", "burn",
                                   "all", "this", "happened", "more", "or", "less"),
                         stringsAsFactors = FALSE)
    tif_is_tokens_df(tokens)

    tokens$pos <- "NOUN"
    tokens$NER <- ""
    tokens$sentiment <- runif(16L)
    tif_is_tokens_df(tokens)
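Typing out repeated document ids by hand is error-prone. The one-row-per-token layout can instead be built from a named list of token vectors in base R, as in the sketch below. This illustrates the layout only and is not the code behind `tif_as_tokens_df`. The `token_index` column is a hypothetical example of a token-level annotation.

```r
tokens_list <- list(doc1 = c("maman", "est", "morte"),
                    doc2 = c("it", "was"))

# Repeat each document id once per token, then flatten the tokens.
tokens_df <- data.frame(
  doc_id = rep(names(tokens_list), times = lengths(tokens_list)),
  token = unlist(tokens_list, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Token-level annotations are extra columns with one value per row;
# here, the position of each token within its document (hypothetical).
tokens_df$token_index <- unlist(lapply(tokens_list, seq_along),
                                use.names = FALSE)

tokens_df
```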
A valid corpus tokens object is a (possibly named) list of character vectors. The character vectors, as well as the names, should be in UTF-8 encoding. No other attributes should be present in either the list or any of its elements.
    tif_is_tokens_list(tokens, warn = FALSE)
tokens | a tokens object to test for validity |
warn | logical. Should the function produce a verbose warning for the condition for which the validation fails? Useful for testing. |
The tests are run sequentially and the function returns, with a warning if the warn flag is set, on the first test that fails. We use this implementation because some tests may fail entirely or be meaningless if the prior ones are not passed.
a logical vector of length one indicating whether the input is a valid tokens object
    tokens <- list(doc1 = c("aujourd'hui", "maman", "est", "morte"),
                   doc2 = c("it", "was", "a", "pleasure", "to", "burn"),
                   doc3 = c("all", "this", "happened", "more", "or", "less"))
    tif_is_tokens_list(tokens)

    names(tokens) <- c("doc1", "doc2", "doc3")
    tif_is_tokens_list(tokens)
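A list in this shape can also be derived from a one-row-per-token data frame with base R's `split()`. The sketch below is an illustration under that assumption, not the code behind `tif_as_tokens_list`. It also checks the structural conditions stated above (a list of character vectors with no attributes besides names).

```r
tokens_df <- data.frame(
  doc_id = c("doc2", "doc2", "doc1", "doc1", "doc1"),
  token = c("it", "was", "maman", "est", "morte"),
  stringsAsFactors = FALSE
)

# split() orders groups by factor level; fixing the levels to order of
# first appearance keeps the documents in their original order.
groups <- factor(tokens_df$doc_id, levels = unique(tokens_df$doc_id))
tokens_list <- split(tokens_df$token, groups)

# Structural checks: a list of character vectors whose only attribute
# is the names.
is.list(tokens_list)
all(vapply(tokens_list, is.character, logical(1)))
identical(names(attributes(tokens_list)), "names")
```

Without the explicit factor levels, `split()` would sort the documents alphabetically by id, which may not match the order of the corpus.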