Title: | Fast, Consistent Tokenization of Natural Language Text |
---|---|
Description: | Convert natural language text into tokens. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, shingled characters, lines, Penn Treebank, regular expressions, as well as functions for counting characters, words, and sentences, and a function for splitting longer texts into separate documents, each with the same number of words. The tokenizers have a consistent interface, and the package is built on the 'stringi' and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'. |
Authors: | Thomas Charlon [aut, cre], Lincoln Mullen [aut], Os Keyes [ctb], Dmitriy Selivanov [ctb], Jeffrey Arnold [ctb], Kenneth Benoit [ctb] |
Maintainer: | Thomas Charlon |
License: | MIT + file LICENSE |
Version: | 0.3.1 |
Built: | 2024-12-24 05:14:18 UTC |
Source: | https://github.com/ropensci/tokenizers |
These functions perform basic tokenization into words, sentences, paragraphs, lines, and characters. The functions can be piped into one another to create at most two levels of tokenization. For instance, one might split a text into paragraphs and then word tokens, or into sentences and then word tokens.
tokenize_characters(
  x,
  lowercase = TRUE,
  strip_non_alphanum = TRUE,
  simplify = FALSE
)

tokenize_words(
  x,
  lowercase = TRUE,
  stopwords = NULL,
  strip_punct = TRUE,
  strip_numeric = FALSE,
  simplify = FALSE
)

tokenize_sentences(x, lowercase = FALSE, strip_punct = FALSE, simplify = FALSE)

tokenize_lines(x, simplify = FALSE)

tokenize_paragraphs(x, paragraph_break = "\n\n", simplify = FALSE)

tokenize_regex(x, pattern = "\\s+", simplify = FALSE)
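x | A character vector or a list of character vectors to be tokenized. If x is a character vector, it can be of any length; if x is a list of character vectors, each element of the list should have a length of 1. |
lowercase | Should the tokens be made lower case? The default value varies by tokenizer: it is TRUE for tokenize_characters and tokenize_words, and FALSE for tokenize_sentences. |
strip_non_alphanum | Should punctuation and white space be stripped? |
simplify | FALSE by default, so that a list is returned regardless of the length of the input. If TRUE and only a single element was passed as input, the output is a character vector of tokens instead of a list. |
stopwords | A character vector of stop words to be excluded. |
strip_punct | Should punctuation be stripped? |
strip_numeric | Should numbers be stripped? |
paragraph_break | A string identifying the boundary between two paragraphs. |
pattern | A regular expression that defines the split. |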
A list of character vectors containing the tokens, with one element in the list for each element that was passed as input. If simplify = TRUE and only a single element was passed as input, then the output is a character vector of tokens.
song <- paste0(
  "How many roads must a man walk down\n",
  "Before you call him a man?\n",
  "How many seas must a white dove sail\n",
  "Before she sleeps in the sand?\n",
  "\n",
  "How many times must the cannonballs fly\n",
  "Before they're forever banned?\n",
  "The answer, my friend, is blowin' in the wind.\n",
  "The answer is blowin' in the wind.\n"
)

tokenize_words(song)
tokenize_words(song, strip_punct = FALSE)
tokenize_sentences(song)
tokenize_paragraphs(song)
tokenize_lines(song)
tokenize_characters(song)
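The two-level tokenization described for these functions can be sketched as follows. This is a minimal illustration, assuming the tokenizers package is installed; the sample text and the variable names `sentences` and `words_by_sentence` are made up for the example.

```r
library(tokenizers)

text <- "The quick brown fox jumped. It landed on the lazy dog."

# Level 1: split the single input text into sentences.
# simplify = TRUE returns a plain character vector for a single input.
sentences <- tokenize_sentences(text, simplify = TRUE)

# Level 2: tokenize each sentence into words.
# The result is a list with one character vector of words per sentence.
words_by_sentence <- tokenize_words(sentences)

length(sentences)
words_by_sentence[[1]]
```

Using simplify = TRUE at the first level is a convenience here: it hands the second tokenizer a character vector with one element per sentence, so the final list is indexed by sentence.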
Given a text or vector/list of texts, break the texts into smaller segments each with the same number of words. This allows you to treat a very long document, such as a novel, as a set of smaller documents.
chunk_text(x, chunk_size = 100, doc_id = names(x), ...)
x | A character vector or a list of character vectors to be split into chunks. If x is a list of character vectors, each element of the list should have a length of 1. |
chunk_size | The number of words in each chunk. |
doc_id | The document IDs as a character vector. By default this is taken from the names of x. |
... | Arguments passed on to tokenize_words. |
Chunking the text passes it through tokenize_words, which will strip punctuation and lowercase the text unless you provide arguments to pass along to that function.
## Not run:
chunked <- chunk_text(mobydick, chunk_size = 100)
length(chunked)
chunked[1:3]
## End(Not run)
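A smaller, self-contained sketch of chunking may help, assuming the tokenizers package is installed. The 12-word text and the name `doc` are made up for the example; with chunk_size = 5 the final chunk is expected to hold only the leftover words. The second call shows an argument being forwarded through `...` to tokenize_words.

```r
library(tokenizers)

text <- c(doc = "One two three four five six seven eight nine ten eleven twelve")

# Split the 12-word text into chunks of at most 5 words each.
chunks <- chunk_text(text, chunk_size = 5)

# lowercase = FALSE is forwarded through `...` to tokenize_words(),
# so the original capitalization is kept in the chunks.
chunks_cased <- chunk_text(text, chunk_size = 5, lowercase = FALSE)

length(chunks)
count_words(chunks)
```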
Count words, sentences, and characters in input texts. These functions use the stringi package, so they handle the counting of Unicode strings (e.g., characters with diacritical marks) in a way that makes sense to people counting characters.
count_words(x)
count_characters(x)
count_sentences(x)
x | A character vector or a list of character vectors. If x is a list of character vectors, each element of the list should have a length of 1. |
An integer vector containing the counted elements. If the input vector or list has names, they will be preserved.
count_words(mobydick)
count_sentences(mobydick)
count_characters(mobydick)
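For a quicker check than the full novel, the counting functions can be tried on a small named vector. This is a sketch assuming the tokenizers package is installed; the two texts and their names are made up for the example.

```r
library(tokenizers)

texts <- c(
  first  = "One sentence here. And a second one.",
  second = "Caf\u00e9 au lait"
)

# Each function returns an integer vector with one count per input;
# the names "first" and "second" carry over to the result.
word_counts <- count_words(texts)
sentence_counts <- count_sentences(texts)

# The accented "e" in the second text counts as a single character.
char_counts <- count_characters(texts)

word_counts
sentence_counts
char_counts
```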
The text of Moby Dick, by Herman Melville, taken from Project Gutenberg.
mobydick
A named character vector with length 1.
These functions tokenize their inputs into different kinds of n-grams. The input can be a character vector of any length, or a list of character vectors where each character vector in the list has a length of 1. See details for an explanation of what each function does.
tokenize_ngrams(
  x,
  lowercase = TRUE,
  n = 3L,
  n_min = n,
  stopwords = character(),
  ngram_delim = " ",
  simplify = FALSE
)

tokenize_skip_ngrams(
  x,
  lowercase = TRUE,
  n_min = 1,
  n = 3,
  k = 1,
  stopwords = character(),
  simplify = FALSE
)
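x | A character vector or a list of character vectors to be tokenized into n-grams. If x is a character vector, it can be of any length; if x is a list of character vectors, each element of the list should have a length of 1. |
lowercase | Should the tokens be made lower case? |
n | The number of words in the n-gram. This must be an integer greater than or equal to 1. |
n_min | The minimum number of words in the n-gram. This must be an integer greater than or equal to 1, and less than or equal to n. |
stopwords | A character vector of stop words to be excluded from the n-grams. |
ngram_delim | The separator between words in an n-gram. |
simplify | FALSE by default, so that a list is returned regardless of the length of the input. If TRUE and only a single element was passed as input, the output is a character vector of tokens instead of a list. |
k | For the skip n-gram tokenizer, the maximum skip distance between words. The function will compute all skip n-grams between 0 and k. |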
tokenize_ngrams: Basic shingled n-grams. A contiguous subsequence of n words. This will compute shingled n-grams for every value between n_min (which must be at least 1) and n.
tokenize_skip_ngrams: Skip n-grams. A subsequence of n words with a gap of at most k words between them. The skip n-grams will be calculated for all skip distances from 0 to k.
These functions will strip all punctuation and normalize all whitespace to a single space character.
A list of character vectors containing the tokens, with one element in the list for each element that was passed as input. If simplify = TRUE and only a single element was passed as input, then the output is a character vector of tokens.
song <- paste0(
  "How many roads must a man walk down\n",
  "Before you call him a man?\n",
  "How many seas must a white dove sail\n",
  "Before she sleeps in the sand?\n",
  "\n",
  "How many times must the cannonballs fly\n",
  "Before they're forever banned?\n",
  "The answer, my friend, is blowin' in the wind.\n",
  "The answer is blowin' in the wind.\n"
)

tokenize_ngrams(song, n = 4)
tokenize_ngrams(song, n = 4, n_min = 1)
tokenize_skip_ngrams(song, n = 4, k = 2)
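Because the song produces long output, the effect of n_min and k is easier to see on a four-word input. The following is a small sketch, assuming the tokenizers package is installed; the results are sorted because the order in which the package returns n-grams is not something this example relies on.

```r
library(tokenizers)

text <- "one two three four"

# Shingled n-grams for every length from n_min = 2 up to n = 3:
# three bigrams plus two trigrams.
ngrams <- tokenize_ngrams(text, n = 3, n_min = 2, simplify = TRUE)

# Skip bigrams with skip distances 0 and 1: the contiguous bigrams
# plus pairs that jump over one word, such as "one three".
skips <- tokenize_skip_ngrams(text, n = 2, n_min = 2, k = 1, simplify = TRUE)

sort(ngrams)
sort(skips)
```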
The character shingle tokenizer functions like an n-gram tokenizer, except the units that are shingled are characters instead of words. Options to the function let you determine whether non-alphanumeric characters like punctuation should be retained or discarded.
tokenize_character_shingles(
  x,
  n = 3L,
  n_min = n,
  lowercase = TRUE,
  strip_non_alphanum = TRUE,
  simplify = FALSE
)
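x | A character vector or a list of character vectors to be tokenized into character shingles. If x is a character vector, it can be of any length; if x is a list of character vectors, each element of the list should have a length of 1. |
n | The number of characters in each shingle. This must be an integer greater than or equal to 1. |
n_min | This must be an integer greater than or equal to 1, and less than or equal to n. |
lowercase | Should the characters be made lower case? |
strip_non_alphanum | Should punctuation and white space be stripped? |
simplify | FALSE by default, so that a list is returned regardless of the length of the input. If TRUE and only a single element was passed as input, the output is a character vector of tokens instead of a list. |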
A list of character vectors containing the tokens, with one element in the list for each element that was passed as input. If simplify = TRUE and only a single element was passed as input, then the output is a character vector of tokens.
x <- c("Now is the hour of our discontent")

tokenize_character_shingles(x)
tokenize_character_shingles(x, n = 5)
tokenize_character_shingles(x, n = 5, strip_non_alphanum = FALSE)
tokenize_character_shingles(x, n = 5, n_min = 3, strip_non_alphanum = FALSE)
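The effect of strip_non_alphanum is easiest to see on a very short string. This is a sketch assuming the tokenizers package is installed; the input "ab cd" and the variable names are made up for the example.

```r
library(tokenizers)

x <- "ab cd"

# Default: white space is stripped before shingling, so the shingles
# run across the word boundary ("bc" joins the two words).
stripped <- tokenize_character_shingles(x, n = 2, simplify = TRUE)

# Keeping non-alphanumeric characters leaves the space in place,
# so some shingles contain it.
kept <- tokenize_character_shingles(x, n = 2, strip_non_alphanum = FALSE,
                                    simplify = TRUE)

stripped
kept
```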
This function implements the Penn Treebank word tokenizer.
tokenize_ptb(x, lowercase = FALSE, simplify = FALSE)
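x | A character vector or a list of character vectors to be tokenized. If x is a character vector, it can be of any length; if x is a list of character vectors, each element of the list should have a length of 1. |
lowercase | Should the tokens be made lower case? |
simplify | FALSE by default, so that a list is returned regardless of the length of the input. If TRUE and only a single element was passed as input, the output is a character vector of tokens instead of a list. |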
This tokenizer uses regular expressions to tokenize text similar to the tokenization used in the Penn Treebank. It assumes that text has already been split into sentences. The tokenizer does the following:
splits common English contractions, e.g. don't is tokenized into do n't and they'll is tokenized into they 'll,
handles punctuation characters as separate tokens,
splits commas and single quotes off from words, when they are followed by whitespace,
splits off periods that occur at the end of the sentence.
This function is a port of the Python NLTK version of the Penn Treebank Tokenizer.
A list of character vectors containing the tokens, with one element in the list for each element that was passed as input. If simplify = TRUE and only a single element was passed as input, then the output is a character vector of tokens.
song <- list(
  paste0("How many roads must a man walk down\n",
         "Before you call him a man?"),
  paste0("How many seas must a white dove sail\n",
         "Before she sleeps in the sand?\n"),
  paste0("How many times must the cannonballs fly\n",
         "Before they're forever banned?\n"),
  "The answer, my friend, is blowin' in the wind.",
  "The answer is blowin' in the wind."
)

tokenize_ptb(song)
tokenize_ptb(c(
  "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.",
  "They'll save and invest more.",
  "Hi, I can't say hello."
))
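The contraction and punctuation rules listed above can be checked on single sentences. This is a sketch assuming the tokenizers package is installed; the variable names are made up, and the sentences are taken from the examples.

```r
library(tokenizers)

# "They'll" is split into "They" and "'ll", and the sentence-final
# period becomes its own token.
theyll <- tokenize_ptb("They'll save and invest more.", simplify = TRUE)

# "can't" is split so that "n't" is a separate token, and the comma
# after "Hi" is split off from the word.
cant <- tokenize_ptb("Hi, I can't say hello.", simplify = TRUE)

theyll
cant
```

Unlike tokenize_words, lowercase defaults to FALSE here, so the capitalization of "They" and "Hi" is kept.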
This function turns its input into a character vector of word stems. This is just a wrapper around the wordStem function from the SnowballC package which does the heavy lifting, but this function provides a consistent interface with the rest of the tokenizers in this package. The input can be a character vector of any length, or a list of character vectors where each character vector in the list has a length of 1.
tokenize_word_stems(
  x,
  language = "english",
  stopwords = NULL,
  simplify = FALSE
)
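x | A character vector or a list of character vectors to be tokenized. If x is a character vector, it can be of any length; if x is a list of character vectors, each element of the list should have a length of 1. |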
language |
The language to use for word stemming. This must be one of
the languages available in the SnowballC package. A list is provided by
|
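stopwords | A character vector of stop words to be excluded. |
simplify | FALSE by default, so that a list is returned regardless of the length of the input. If TRUE and only a single element was passed as input, the output is a character vector of tokens instead of a list. |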
This function will strip all white space and punctuation and make all word stems lowercase.
A list of character vectors containing the tokens, with one element in the list for each element that was passed as input. If simplify = TRUE and only a single element was passed as input, then the output is a character vector of tokens.
song <- paste0(
  "How many roads must a man walk down\n",
  "Before you call him a man?\n",
  "How many seas must a white dove sail\n",
  "Before she sleeps in the sand?\n",
  "\n",
  "How many times must the cannonballs fly\n",
  "Before they're forever banned?\n",
  "The answer, my friend, is blowin' in the wind.\n",
  "The answer is blowin' in the wind.\n"
)

tokenize_word_stems(song)
A collection of functions with a consistent interface to convert natural language text into tokens.
The tokenizers in this package have a consistent interface. They all take either a character vector of any length, or a list where each element is a character vector of length one. The idea is that each element comprises a text. Each function then returns a list with the same length as the input, where each element in the list contains the tokens generated by the function for the corresponding text. If the input character vector or list is named, then the names are preserved.
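The shared interface can be illustrated with a short sketch, assuming the tokenizers package is installed. The two texts and their names are made up for the example, and tokenize_words stands in for any of the tokenizers.

```r
library(tokenizers)

docs <- c(
  first  = "A short text.",
  second = "Another short text, slightly longer."
)

# One list element per input text; the names carry over from the input.
tokens <- tokenize_words(docs)
names(tokens)
lengths(tokens)

# With a single input and simplify = TRUE, a character vector of tokens
# is returned instead of a list.
single <- tokenize_words(docs[["first"]], simplify = TRUE)
single
```

Keeping simplify = FALSE in scripts means the return type does not depend on how many texts happen to be passed in, which is presumably why it is the default.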
Maintainer: Thomas Charlon
Authors:
Lincoln Mullen
Other contributors:
Os Keyes [contributor]
Dmitriy Selivanov [contributor]
Jeffrey Arnold [contributor]
Kenneth Benoit [contributor]
Useful links:
Report bugs at https://github.com/ropensci/tokenizers/issues