Package: tokenizers 0.3.1

Thomas Charlon

tokenizers: Fast, Consistent Tokenization of Natural Language Text

Convert natural language text into tokens. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, shingled characters, lines, Penn Treebank, regular expressions, as well as functions for counting characters, words, and sentences, and a function for splitting longer texts into separate documents, each with the same number of words. The tokenizers have a consistent interface, and the package is built on the 'stringi' and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'.
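
A minimal usage sketch of the consistent interface (the sample text is made up; each tokenizer takes a character vector and returns a list of character vectors, one element per input document):

library(tokenizers)
text <- c(doc1 = "The quick brown fox jumps over the lazy dog. It barked.")
tokenize_words(text)          # lowercased word tokens for each document
tokenize_sentences(text)      # same interface, split into sentences
tokenize_ngrams(text, n = 2)  # word bigrams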

Authors: Thomas Charlon [aut, cre], Lincoln Mullen [aut], Os Keyes [ctb], Dmitriy Selivanov [ctb], Jeffrey Arnold [ctb], Kenneth Benoit [ctb]

tokenizers_0.3.1.tar.gz
tokenizers_0.3.1.zip (r-4.5), tokenizers_0.3.1.zip (r-4.4), tokenizers_0.3.1.zip (r-4.3)
tokenizers_0.3.1.tgz (r-4.4-x86_64), tokenizers_0.3.1.tgz (r-4.4-arm64), tokenizers_0.3.1.tgz (r-4.3-x86_64), tokenizers_0.3.1.tgz (r-4.3-arm64)
tokenizers_0.3.1.tar.gz (r-4.5-noble), tokenizers_0.3.1.tar.gz (r-4.4-noble)
tokenizers_0.3.1.tgz (r-4.4-emscripten), tokenizers_0.3.1.tgz (r-4.3-emscripten)
tokenizers.pdf | tokenizers.html
tokenizers/json (API)
NEWS

# Install 'tokenizers' in R:
install.packages('tokenizers', repos = c('https://ropensci.r-universe.dev', 'https://cloud.r-project.org'))

Bug tracker: https://github.com/ropensci/tokenizers/issues

Uses libs:
  • C++ – GNU Standard C++ Library v3
Datasets: mobydick

nlp, peer-reviewed, text-mining, tokenizer

13.26 score · 184 stars · 79 packages · 1.1k scripts · 32k downloads · 1 mention · 15 exports · 3 dependencies

Last updated 8 months ago from: b80863d088 (on master). Checks: OK: 3, NOTE: 6. Indexed: yes.

Target              Result  Date
Doc / Vignettes     OK      Oct 25 2024
R-4.5-win-x86_64    OK      Oct 25 2024
R-4.5-linux-x86_64  OK      Oct 25 2024
R-4.4-win-x86_64    NOTE    Oct 25 2024
R-4.4-mac-x86_64    NOTE    Oct 25 2024
R-4.4-mac-aarch64   NOTE    Oct 25 2024
R-4.3-win-x86_64    NOTE    Oct 25 2024
R-4.3-mac-x86_64    NOTE    Oct 25 2024
R-4.3-mac-aarch64   NOTE    Oct 25 2024

Exports: chunk_text, count_characters, count_sentences, count_words, tokenize_character_shingles, tokenize_characters, tokenize_lines, tokenize_ngrams, tokenize_paragraphs, tokenize_ptb, tokenize_regex, tokenize_sentences, tokenize_skip_ngrams, tokenize_word_stems, tokenize_words
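
A short sketch exercising a few of these exports (the sample sentence is made up; count_words() returns an integer count per document, and chunk_text() splits each document into pieces of roughly chunk_size words):

library(tokenizers)
song <- "How many roads must a man walk down before you call him a man?"
count_words(song)                 # number of words in the document
count_sentences(song)             # number of sentences
chunk_text(song, chunk_size = 5)  # list of roughly five-word chunks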

Dependencies: Rcpp, SnowballC, stringi

Introduction to the tokenizers Package

Rendered from introduction-to-tokenizers.Rmd using knitr::rmarkdown on Oct 25 2024.

Last update: 2022-12-19
Started: 2016-08-11

The Text Interchange Formats and the tokenizers Package

Rendered from tif-and-tokenizers.Rmd using knitr::rmarkdown on Oct 25 2024.

Last update: 2022-09-23
Started: 2018-03-14

Readme and manuals

Help Manual

Help page | Topics
Basic tokenizers | basic-tokenizers, tokenize_characters, tokenize_lines, tokenize_paragraphs, tokenize_regex, tokenize_sentences, tokenize_words
Chunk text into smaller segments | chunk_text
Count words, sentences, characters | count_characters, count_sentences, count_words
The text of Moby Dick | mobydick
N-gram tokenizers | ngram-tokenizers, tokenize_ngrams, tokenize_skip_ngrams
Character shingle tokenizers | tokenize_character_shingles
Penn Treebank Tokenizer | tokenize_ptb
Word stem tokenizer | tokenize_word_stems
Tokenizers | tokenizers-package, tokenizers
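
As a closing sketch, the bundled mobydick dataset (documented above as the text of Moby Dick, and assumed here to be a length-one character vector) works with the same functions, for example to split the novel into uniform documents:

library(tokenizers)
data("mobydick", package = "tokenizers")
count_words(mobydick)                              # total word count of the novel
chunks <- chunk_text(mobydick, chunk_size = 1000)  # ~1000-word documents
length(chunks)                                     # number of chunks produced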