Package: textreuse 0.1.5

Yaoxiang Li

textreuse: Detect Text Reuse and Document Similarity

Tools for measuring similarity among documents and detecting passages which have been reused. Implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language.

Authors: Yaoxiang Li [aut, cre], Lincoln Mullen [aut]

textreuse_0.1.5.tar.gz
textreuse_0.1.5.zip (r-4.5), textreuse_0.1.5.zip (r-4.4), textreuse_0.1.5.zip (r-4.3)
textreuse_0.1.5.tgz (r-4.4-x86_64), textreuse_0.1.5.tgz (r-4.4-arm64), textreuse_0.1.5.tgz (r-4.3-x86_64), textreuse_0.1.5.tgz (r-4.3-arm64)
textreuse_0.1.5.tar.gz (r-4.5-noble), textreuse_0.1.5.tar.gz (r-4.4-noble)
textreuse_0.1.5.tgz (r-4.4-emscripten), textreuse_0.1.5.tgz (r-4.3-emscripten)
textreuse.pdf | textreuse.html
textreuse/json (API)
NEWS

# Install 'textreuse' in R:

install.packages('textreuse', repos = c('https://ropensci.r-universe.dev', 'https://cloud.r-project.org'))

Bug tracker: https://github.com/ropensci/textreuse/issues

Pkgdown site: https://docs.ropensci.org

Uses libs:

c++: GNU Standard C++ Library v3

On CRAN:

peer-reviewed cpp

9.02 score, 198 stars, 222 scripts, 556 downloads, 1 mention, 43 exports, 27 dependencies

Last updated 6 months ago from: 9cf2568ee8 (on master). Checks: 1 OK, 8 NOTE. Indexed: yes.

Target	Result	Latest binary
Doc / Vignettes	OK	Dec 27 2024
R-4.5-win-x86_64	NOTE	Dec 27 2024
R-4.5-linux-x86_64	NOTE	Dec 27 2024
R-4.4-win-x86_64	NOTE	Dec 27 2024
R-4.4-mac-x86_64	NOTE	Dec 27 2024
R-4.4-mac-aarch64	NOTE	Dec 27 2024
R-4.3-win-x86_64	NOTE	Dec 27 2024
R-4.3-mac-x86_64	NOTE	Dec 27 2024
R-4.3-mac-aarch64	NOTE	Dec 27 2024

Exports: align_local content content<- filenames has_content has_hashes has_minhashes has_tokens hash_string hashes hashes<- is.TextReuseCorpus is.TextReuseTextDocument jaccard_bag_similarity jaccard_dissimilarity jaccard_similarity lsh lsh_candidates lsh_compare lsh_probability lsh_query lsh_subset lsh_threshold meta meta<- minhash_generator minhashes minhashes<- pairwise_candidates pairwise_compare ratio_of_matches rehash skipped TextReuseCorpus TextReuseTextDocument tokenize tokenize_ngrams tokenize_sentences tokenize_skip_ngrams tokenize_words tokens tokens<- wordcount

Dependencies: assertthat BH cli cpp11 digest dplyr fansi generics glue lifecycle magrittr NLP pillar pkgconfig purrr R6 Rcpp RcppProgress rlang stringi stringr tibble tidyr tidyselect utf8 vctrs withr

Introduction to the textreuse package

Lincoln Mullen

Rendered from textreuse-introduction.Rmd using knitr::rmarkdown on Dec 27 2024.

Last update: 2020-05-12
Started: 2015-10-22

Minhash and locality-sensitive hashing

Lincoln Mullen

Rendered from textreuse-minhash.Rmd using knitr::rmarkdown on Dec 27 2024.

Last update: 2015-10-31
Started: 2015-10-22

Pairwise comparisons for document similarity

Lincoln Mullen

Rendered from textreuse-pairwise.Rmd using knitr::rmarkdown on Dec 27 2024.

Last update: 2015-10-31
Started: 2015-10-22

Text Alignment

Lincoln Mullen

Rendered from textreuse-alignment.Rmd using knitr::rmarkdown on Dec 27 2024.

Last update: 2015-10-22
Started: 2015-10-22

Citation

Development and contributors

Readme and manuals

Help Manual

Help page	Topics
textreuse: Detect Text Reuse and Document Similarity	textreuse-package textreuse
Local alignment of natural language texts	align_local
Convert candidates data frames to other formats	as.matrix.textreuse_candidates
Filenames from paths	filenames
Hash a string to an integer	hash_string
Locality sensitive hashing for minhash	lsh
Candidate pairs from LSH comparisons	lsh_candidates
Compare candidates identified by LSH	lsh_compare
Probability that a candidate pair will be detected with LSH	lsh_probability lsh_threshold
Query a LSH cache for matches to a single document	lsh_query
List of all candidates in a corpus	lsh_subset
Generate a minhash function	minhash_generator
Candidate pairs from pairwise comparisons	pairwise_candidates
Pairwise comparisons among documents in a corpus	pairwise_compare
Recompute the hashes for a document or corpus	rehash
Measure similarity/dissimilarity in documents	jaccard_bag_similarity jaccard_dissimilarity jaccard_similarity ratio_of_matches similarity-functions
TextReuseCorpus	is.TextReuseCorpus skipped TextReuseCorpus
TextReuseTextDocument	has_content has_hashes has_minhashes has_tokens is.TextReuseTextDocument TextReuseTextDocument
Accessors for TextReuse objects	hashes hashes<- minhashes minhashes<- TextReuseTextDocument-accessors tokens tokens<-
Recompute the tokens for a document or corpus	tokenize
Split texts into tokens	tokenizers tokenize_ngrams tokenize_sentences tokenize_skip_ngrams tokenize_words
Count words	wordcount
