Package: textreuse 0.1.5

Yaoxiang Li

textreuse: Detect Text Reuse and Document Similarity

Tools for measuring similarity among documents and detecting passages which have been reused. Implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language.
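As a minimal sketch of what the description above refers to, the snippet below tokenizes two short strings into shingled word n-grams and measures their overlap. The two sample sentences and the choice of `n = 3` are assumptions made for illustration only; `tokenize_ngrams()` and `jaccard_similarity()` are taken from the package's export list.

```r
library(textreuse)

# Two invented sentences that differ by a single word
a <- "How many roads must a man walk down before you call him a man"
b <- "How many roads must a man walk down before they call him a man"

# Shingled word n-grams: overlapping runs of n consecutive words
tokens_a <- tokenize_ngrams(a, n = 3)
tokens_b <- tokenize_ngrams(b, n = 3)

# Jaccard similarity: size of the intersection over size of the union,
# a value between 0 (no shared tokens) and 1 (identical token sets)
similarity <- jaccard_similarity(tokens_a, tokens_b)
similarity
```

Swapping in `tokenize_skip_ngrams()` or `tokenize_words()` should change only what counts as a token; the similarity call stays the same.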

Authors: Yaoxiang Li [aut, cre], Lincoln Mullen [aut]

textreuse_0.1.5.tar.gz
textreuse_0.1.5.zip (r-4.5), textreuse_0.1.5.zip (r-4.4), textreuse_0.1.5.zip (r-4.3)
textreuse_0.1.5.tgz (r-4.4-arm64), textreuse_0.1.5.tgz (r-4.4-x86_64), textreuse_0.1.5.tgz (r-4.3-arm64), textreuse_0.1.5.tgz (r-4.3-x86_64)
textreuse_0.1.5.tar.gz (r-4.5-noble), textreuse_0.1.5.tar.gz (r-4.4-noble)
textreuse_0.1.5.tgz (r-4.4-emscripten), textreuse_0.1.5.tgz (r-4.3-emscripten)
textreuse.pdf | textreuse.html
textreuse/json (API)
NEWS

# Install 'textreuse' in R:
install.packages('textreuse', repos = c('https://ropensci.r-universe.dev', 'https://cloud.r-project.org'))

Bug tracker: https://github.com/ropensci/textreuse/issues

Uses libs:
  • c++: GNU Standard C++ Library v3

On CRAN:

peer-reviewed

43 exports, 195 stars, 5.39 score, 27 dependencies, 1 mention, 412 downloads

Last updated 2 months ago from: bb892527fb (on master)

Exports: align_local, content, content<-, filenames, has_content, has_hashes, has_minhashes, has_tokens, hash_string, hashes, hashes<-, is.TextReuseCorpus, is.TextReuseTextDocument, jaccard_bag_similarity, jaccard_dissimilarity, jaccard_similarity, lsh, lsh_candidates, lsh_compare, lsh_probability, lsh_query, lsh_subset, lsh_threshold, meta, meta<-, minhash_generator, minhashes, minhashes<-, pairwise_candidates, pairwise_compare, ratio_of_matches, rehash, skipped, TextReuseCorpus, TextReuseTextDocument, tokenize, tokenize_ngrams, tokenize_sentences, tokenize_skip_ngrams, tokenize_words, tokens, tokens<-, wordcount

Dependencies: assertthat, BH, cli, cpp11, digest, dplyr, fansi, generics, glue, lifecycle, magrittr, NLP, pillar, pkgconfig, purrr, R6, Rcpp, RcppProgress, rlang, stringi, stringr, tibble, tidyr, tidyselect, utf8, vctrs, withr

Introduction to the textreuse package

Rendered from textreuse-introduction.Rmd using knitr::rmarkdown on Jul 23 2024.

Last update: 2020-05-12
Started: 2015-10-22

Minhash and locality-sensitive hashing

Rendered from textreuse-minhash.Rmd using knitr::rmarkdown on Jul 23 2024.

Last update: 2015-10-31
Started: 2015-10-22

Pairwise comparisons for document similarity

Rendered from textreuse-pairwise.Rmd using knitr::rmarkdown on Jul 23 2024.

Last update: 2015-10-31
Started: 2015-10-22

Text Alignment

Rendered from textreuse-alignment.Rmd using knitr::rmarkdown on Jul 23 2024.

Last update: 2015-10-22
Started: 2015-10-22

Readme and manuals

Help Manual

Help page | Topics
textreuse: Detect Text Reuse and Document Similarity | textreuse-package, textreuse
Local alignment of natural language texts | align_local
Convert candidates data frames to other formats | as.matrix.textreuse_candidates
Filenames from paths | filenames
Hash a string to an integer | hash_string
Locality sensitive hashing for minhash | lsh
Candidate pairs from LSH comparisons | lsh_candidates
Compare candidates identified by LSH | lsh_compare
Probability that a candidate pair will be detected with LSH | lsh_probability, lsh_threshold
Query a LSH cache for matches to a single document | lsh_query
List of all candidates in a corpus | lsh_subset
Generate a minhash function | minhash_generator
Candidate pairs from pairwise comparisons | pairwise_candidates
Pairwise comparisons among documents in a corpus | pairwise_compare
Recompute the hashes for a document or corpus | rehash
Measure similarity/dissimilarity in documents | jaccard_bag_similarity, jaccard_dissimilarity, jaccard_similarity, ratio_of_matches, similarity-functions
TextReuseCorpus | is.TextReuseCorpus, skipped, TextReuseCorpus
TextReuseTextDocument | has_content, has_hashes, has_minhashes, has_tokens, is.TextReuseTextDocument, TextReuseTextDocument
Accessors for TextReuse objects | hashes, hashes<-, minhashes, minhashes<-, TextReuseTextDocument-accessors, tokens, tokens<-
Recompute the tokens for a document or corpus | tokenize
Split texts into tokens | tokenizers, tokenize_ngrams, tokenize_sentences, tokenize_skip_ngrams, tokenize_words
Count words | wordcount
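
The help topics above can be combined into a minhash and locality-sensitive hashing workflow. The following is a rough sketch, not the package's canonical example: the three documents are invented, and the values `n = 240`, `bands = 80`, and the seed are arbitrary choices for illustration. It assumes `TextReuseCorpus()` accepts a named character vector through its `text` argument.

```r
library(textreuse)

# Three made-up documents; doc_a and doc_b share most of their wording
texts <- c(
  doc_a = "The quick brown fox jumps over the lazy dog and then runs far away into the quiet forest",
  doc_b = "The quick brown fox jumps over the lazy dog and then walks far away into the quiet forest",
  doc_c = "An entirely unrelated sentence about measuring rainfall in several distant mountain villages"
)

# 240 minhashes split into 80 bands gives 3 rows per band
minhash <- minhash_generator(n = 240, seed = 1234)

# Build the corpus, tokenizing into 3-grams and computing minhash signatures
corpus <- TextReuseCorpus(
  text = texts,
  tokenizer = tokenize_ngrams, n = 3,
  minhash_func = minhash,
  progress = FALSE
)

# Approximate similarity above which a pair is likely to become a candidate
threshold <- lsh_threshold(h = 240, b = 80)
threshold

# Bucket the signatures, then extract pairs that share at least one bucket
buckets    <- lsh(corpus, bands = 80, progress = FALSE)
candidates <- lsh_candidates(buckets)

# Score only the candidate pairs with a full Jaccard comparison
lsh_compare(candidates, corpus, jaccard_similarity, progress = FALSE)
```

More bands for the same number of minhashes lowers the detection threshold, at the cost of more candidate pairs to score; `lsh_probability()` can be used to check how likely a pair with a given similarity is to be detected.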