Returns the configuration of a causal model.
causal_config( model = getOption("pangoling.causal.default"), checkpoint = NULL, config_model = NULL )
model |
Name of a pre-trained model or a folder containing one. Models based on "gpt2" should work. See the Hugging Face website. |
checkpoint |
Folder of a checkpoint. |
config_model |
List with other arguments that control how the model from Hugging Face is accessed. |
A list with the configuration of the model.
A causal language model (also known as a GPT-like, auto-regressive, or decoder model) is a type of large language model usually used for text generation; it predicts the next word (more precisely, the next token) based on the preceding context.
If not specified, the causal model used will be the one set in the global option pangoling.causal.default, which can be accessed via getOption("pangoling.causal.default") ("gpt2" by default). To change the default, use options(pangoling.causal.default = "newcausalmodel").
A list of possible causal models can be found on the Hugging Face website.
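A minimal sketch of inspecting and setting the default (the call below simply sets it to "gpt2", which is already the default; any causal model name from Hugging Face could be used instead):
# Inspect the current default causal model ("gpt2" unless changed)
getOption("pangoling.causal.default")
# Set the default explicitly
options(pangoling.causal.default = "gpt2")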
Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details.
In case of errors when a new model is run, check the status of https://status.huggingface.co/
Other causal model helper functions:
causal_preload()
causal_config(model = "gpt2")
This function predicts the possible next tokens and their predictability (log-probabilities by default), sorting the tokens in descending order of predictability.
causal_next_tokens_pred_tbl(
  context,
  log.p = getOption("pangoling.log.p"),
  decode = FALSE,
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL
)
context |
A single string representing the context for which the next tokens and their predictabilities are predicted. |
log.p |
Base of the logarithm used for the output predictability values. If TRUE (default), predictabilities are natural log-probabilities (base e); if FALSE, raw probabilities are returned; alternatively, log.p can be the base of another logarithmic transformation (e.g., 1/2 for surprisal in bits). |
decode |
Logical. If TRUE, tokens are decoded into human-readable strings, handling special characters such as accents and diacritics. Default is FALSE. |
model |
Name of a pre-trained model or a folder containing one. Models based on "gpt2" should work. See the Hugging Face website. |
checkpoint |
Folder of a checkpoint. |
add_special_tokens |
Whether to include special tokens. It has the same default as the AutoTokenizer method in Python. |
config_model |
List with other arguments that control how the model from Hugging Face is accessed. |
config_tokenizer |
List with other arguments that control how the tokenizer from Hugging Face is accessed. |
The function uses a causal transformer model to compute the predictability
of all tokens in the model's vocabulary, given a single input context. It
returns a table where each row represents a token, along with its
predictability score. By default, the function returns log-probabilities in
natural logarithm (base e), but you can specify a different logarithm base
(e.g., log.p = 1/2
for surprisal in bits).
If decode = TRUE
, the tokens are converted into human-readable strings,
handling special characters like accents and diacritics. This ensures that
tokens are more interpretable, especially for languages with complex
tokenization.
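A short sketch of these two arguments together, assuming the "gpt2" model used elsewhere in these examples is available:
causal_next_tokens_pred_tbl(
  context = "The apple doesn't fall far from the",
  log.p = 1/2,   # base-1/2 logarithm, i.e., surprisal in bits
  decode = TRUE, # return human-readable tokens
  model = "gpt2"
)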
A table with possible next tokens and their log-probabilities.
A causal language model (also known as a GPT-like, auto-regressive, or decoder model) is a type of large language model usually used for text generation; it predicts the next word (more precisely, the next token) based on the preceding context.
If not specified, the causal model used will be the one set in the global option pangoling.causal.default, which can be accessed via getOption("pangoling.causal.default") ("gpt2" by default). To change the default, use options(pangoling.causal.default = "newcausalmodel").
A list of possible causal models can be found on the Hugging Face website.
Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details.
In case of errors when a new model is run, check the status of https://status.huggingface.co/
Other causal model functions:
causal_pred_mats()
,
causal_words_pred()
causal_next_tokens_pred_tbl( context = "The apple doesn't fall far from the", model = "gpt2" )
This function computes a list of matrices, where each matrix corresponds to a
unique group specified by the by
argument. Each matrix represents the
predictability of every token in the input text (x
) based on preceding
context, as evaluated by a causal transformer model.
causal_pred_mats(
  x,
  by = rep(1, length(x)),
  sep = " ",
  log.p = getOption("pangoling.log.p"),
  sorted = FALSE,
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  decode = FALSE,
  config_model = NULL,
  config_tokenizer = NULL,
  batch_size = 1,
  ...
)
x |
A character vector of words, phrases, or texts to evaluate. |
by |
A grouping variable indicating how texts are split into groups. |
sep |
A string specifying how words are separated within contexts or groups. Default is " " (a single space). |
log.p |
Base of the logarithm used for the output predictability values. If TRUE (default), predictabilities are natural log-probabilities (base e); if FALSE, raw probabilities are returned; alternatively, log.p can be the base of another logarithmic transformation (e.g., 1/2 for surprisal in bits). |
sorted |
When FALSE (the default), the order of the groups defined by by is retained; when TRUE, the returned list is sorted according to by. |
model |
Name of a pre-trained model or a folder containing one. Models based on "gpt2" should work. See the Hugging Face website. |
checkpoint |
Folder of a checkpoint. |
add_special_tokens |
Whether to include special tokens. It has the same default as the AutoTokenizer method in Python. |
decode |
Logical. If TRUE, tokens are decoded into human-readable strings, handling special characters such as accents and diacritics. Default is FALSE. |
config_model |
List with other arguments that control how the model from Hugging Face is accessed. |
config_tokenizer |
List with other arguments that control how the tokenizer from Hugging Face is accessed. |
batch_size |
Maximum number of sentences/texts processed in parallel. Larger batches increase speed but use more memory. Since all texts in a batch must have the same length, shorter ones are padded with placeholder tokens. |
... |
Currently not in use. |
The function splits the input x
into groups specified by the by
argument
and processes each group independently. For each group, the model computes
the predictability of each token in its vocabulary based on preceding
context.
Each matrix contains:
Rows representing the model's vocabulary.
Columns corresponding to tokens in the group (e.g., a sentence or paragraph).
By default, values in the matrices are the natural logarithm of word probabilities.
A list of matrices with the tokens in their columns and the vocabulary of the model in their rows.
A causal language model (also known as a GPT-like, auto-regressive, or decoder model) is a type of large language model usually used for text generation; it predicts the next word (more precisely, the next token) based on the preceding context.
If not specified, the causal model used will be the one set in the global option pangoling.causal.default, which can be accessed via getOption("pangoling.causal.default") ("gpt2" by default). To change the default, use options(pangoling.causal.default = "newcausalmodel").
A list of possible causal models can be found on the Hugging Face website.
Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details.
In case of errors when a new model is run, check the status of https://status.huggingface.co/
Other causal model functions:
causal_next_tokens_pred_tbl()
,
causal_words_pred()
data("df_sent") df_sent list_of_mats <- causal_pred_mats( x = df_sent$word, by = df_sent$sent_n, model = "gpt2" ) # View the structure of the resulting list list_of_mats |> str() # Inspect the last rows of the first matrix list_of_mats[[1]] |> tail() # Inspect the last rows of the second matrix list_of_mats[[2]] |> tail()
data("df_sent") df_sent list_of_mats <- causal_pred_mats( x = df_sent$word, by = df_sent$sent_n, model = "gpt2" ) # View the structure of the resulting list list_of_mats |> str() # Inspect the last rows of the first matrix list_of_mats[[1]] |> tail() # Inspect the last rows of the second matrix list_of_mats[[2]] |> tail()
Preloads a causal language model to speed up subsequent runs.
causal_preload( model = getOption("pangoling.causal.default"), checkpoint = NULL, add_special_tokens = NULL, config_model = NULL, config_tokenizer = NULL )
model |
Name of a pre-trained model or a folder containing one. Models based on "gpt2" should work. See the Hugging Face website. |
checkpoint |
Folder of a checkpoint. |
add_special_tokens |
Whether to include special tokens. It has the same default as the AutoTokenizer method in Python. |
config_model |
List with other arguments that control how the model from Hugging Face is accessed. |
config_tokenizer |
List with other arguments that control how the tokenizer from Hugging Face is accessed. |
Nothing.
A causal language model (also known as a GPT-like, auto-regressive, or decoder model) is a type of large language model usually used for text generation; it predicts the next word (more precisely, the next token) based on the preceding context.
If not specified, the causal model used will be the one set in the global option pangoling.causal.default, which can be accessed via getOption("pangoling.causal.default") ("gpt2" by default). To change the default, use options(pangoling.causal.default = "newcausalmodel").
A list of possible causal models can be found on the Hugging Face website.
Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details.
In case of errors when a new model is run, check the status of https://status.huggingface.co/
Other causal model helper functions:
causal_config()
causal_preload(model = "gpt2")
causal_preload(model = "gpt2")
These functions calculate the predictability of words, phrases, or tokens using a causal transformer model.
causal_words_pred(
  x,
  by = rep(1, length(x)),
  word_n = NULL,
  sep = " ",
  log.p = getOption("pangoling.log.p"),
  ignore_regex = "",
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL,
  batch_size = 1,
  ...
)

causal_tokens_pred_lst(
  texts,
  log.p = getOption("pangoling.log.p"),
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL,
  batch_size = 1
)

causal_targets_pred(
  contexts,
  targets,
  sep = " ",
  log.p = getOption("pangoling.log.p"),
  ignore_regex = "",
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL,
  batch_size = 1,
  ...
)
x |
A character vector of words, phrases, or texts to evaluate. |
by |
A grouping variable indicating how texts are split into groups. |
word_n |
Word order; by default, the order of the elements of the vector x. |
sep |
A string specifying how words are separated within contexts or groups. Default is " " (a single space). |
log.p |
Base of the logarithm used for the output predictability values. If TRUE (default), predictabilities are natural log-probabilities (base e); if FALSE, raw probabilities are returned; alternatively, log.p can be the base of another logarithmic transformation (e.g., 1/2 for surprisal in bits). |
ignore_regex |
A regular expression specifying characters to ignore when calculating the log-probabilities; for example, a pattern matching punctuation can exclude it from the calculation. |
model |
Name of a pre-trained model or a folder containing one. Models based on "gpt2" should work. See the Hugging Face website. |
checkpoint |
Folder of a checkpoint. |
add_special_tokens |
Whether to include special tokens. It has the same default as the AutoTokenizer method in Python. |
config_model |
List with other arguments that control how the model from Hugging Face is accessed. |
config_tokenizer |
List with other arguments that control how the tokenizer from Hugging Face is accessed. |
batch_size |
Maximum number of sentences/texts processed in parallel. Larger batches increase speed but use more memory. Since all texts in a batch must have the same length, shorter ones are padded with placeholder tokens. |
... |
Currently not in use. |
texts |
A vector or list of sentences or paragraphs. |
contexts |
A character vector of contexts corresponding to each target. |
targets |
A character vector of target words or phrases. |
These functions calculate the predictability (by default, the natural logarithm of the word probability) of words, phrases, or tokens using a causal transformer model:
causal_targets_pred()
: Evaluates specific target words or phrases
based on their given contexts. Use when you have explicit
context-target pairs to evaluate, with each target word or phrase paired
with a single preceding context.
causal_words_pred()
: Computes predictability for all elements of a
vector grouped by a specified variable. Use when working with words or
phrases split into groups, such as sentences or paragraphs, where
predictability is computed for every word or phrase in each group.
causal_tokens_pred_lst()
: Computes the predictability of each token
in a sentence (or group of sentences) and returns a list of results for
each sentence. Use when you want to calculate the predictability of
every token in one or more sentences.
See the online article on the pangoling website for more examples.
For causal_targets_pred()
and causal_words_pred()
,
a named numeric vector of predictability scores. For
causal_tokens_pred_lst()
, a list of named numeric vectors, one for
each sentence or group.
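Because causal_words_pred() returns one predictability value per element of x, the per-word log-probabilities can be summed within groups to obtain a log-probability for each sentence. A hedged sketch, assuming the df_sent data and "gpt2" model used in the examples below (na.rm = TRUE guards against words with no predictability value, e.g., the first word of a sentence):
word_lp <- causal_words_pred(
  x = df_sent$word,
  by = df_sent$sent_n,
  model = "gpt2"
)
# Sum the word log-probabilities within each sentence
tapply(word_lp, df_sent$sent_n, sum, na.rm = TRUE)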
A causal language model (also known as a GPT-like, auto-regressive, or decoder model) is a type of large language model usually used for text generation; it predicts the next word (more precisely, the next token) based on the preceding context.
If not specified, the causal model used will be the one set in the global option pangoling.causal.default, which can be accessed via getOption("pangoling.causal.default") ("gpt2" by default). To change the default, use options(pangoling.causal.default = "newcausalmodel").
A list of possible causal models can be found on the Hugging Face website.
Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details.
In case of errors when a new model is run, check the status of https://status.huggingface.co/
Other causal model functions:
causal_next_tokens_pred_tbl()
,
causal_pred_mats()
# Using causal_targets_pred
causal_targets_pred(
  contexts = c(
    "The apple doesn't fall far from the",
    "Don't judge a book by its"
  ),
  targets = c("tree.", "cover."),
  model = "gpt2"
)
# Using causal_words_pred
causal_words_pred(x = df_sent$word, by = df_sent$sent_n, model = "gpt2")
# Using causal_tokens_pred_lst
preds <- causal_tokens_pred_lst(
  texts = c(
    "The apple doesn't fall far from the tree.",
    "Don't judge a book by its cover."
  ),
  model = "gpt2"
)
preds
# Convert the output to a tidy table
suppressPackageStartupMessages(library(tidytable))
map2_dfr(
  preds, seq_along(preds),
  ~ data.frame(tokens = names(.x), pred = .x, id = .y)
)
This dataset contains data from a self-paced reading experiment on Chinese relative clause comprehension. It is structured to support analysis of reaction times, comprehension accuracy, and surprisal values across various experimental conditions in a 2x2 fully crossed factorial design.
data(df_jaeger14)
A tibble with 8,624 rows and 15 variables:
Participant identifier, a character vector.
Trial item number, an integer.
Experimental condition, a character vector indicating variations in sentence structure (e.g., "a", "b", "c", "d").
Chinese word presented in each trial, a character vector.
Position of the word within the sentence, an integer.
Reaction time in milliseconds for reading each word, an integer.
Sentence region or phrase type (e.g., "hd1", "Det+CL"), a character vector.
Comprehension question associated with the trial, a character vector.
Binary accuracy score for the comprehension question (1 = correct, 0 = incorrect).
Expected correct answer for the comprehension question, a character vector ("Y" or "N").
Type of comprehension question, a character vector.
Name of the experiment, indicating self-paced reading, a character vector.
Experimental list number, for counterbalancing item presentation, an integer.
Full sentence used in the trial with words marked for analysis, a character vector.
Model-derived surprisal values for each word, a numeric vector.
Region codes in the dataset (column region
):
N: Main clause subject (in object-modifications only)
V: Main clause verb (in object-modifications only)
Det+CL: Determiner+classifier
Adv: Adverb
VN: RC-verb+RC-object (subject relatives) or RC-subject+RC-verb (object relatives)
Note: These two words were merged into one region after the experiment; they were presented as separate regions during the experiment.
FreqP: Frequency phrase/durational phrase
DE: Relativizer "de"
head: Relative clause head noun
hd1: First word after the head noun
hd2: Second word after the head noun
hd3: Third word after the head noun
hd4: Fourth word after the head noun (only in subject-modifications)
hd5: Fifth word after the head noun (only in subject-modifications)
Notes on reading times (column rt
):
The reading time of the relative clause region (e.g., "V-N" or "N-V") was computed by summing up the reading times of the relative clause verb and noun.
The verb and noun were presented as two separate regions during the experiment.
Factor I: Modification type (subject modification; object modification)
Factor II: Relative clause type (subject relative; object relative)
Condition labels:
a) subject modification; subject relative
b) subject modification; object relative
c) object modification; subject relative
d) object modification; object relative
Jäger, L., Chen, Z., Li, Q., Lin, C.-J. C., & Vasishth, S. (2015). The subject-relative advantage in Chinese: Evidence for expectation-based processing. Journal of Memory and Language, 79–80, 97-120. doi:10.1016/j.jml.2014.10.005
Other datasets:
df_sent
# Basic exploration
head(df_jaeger14)
# Summarize reaction times by region
library(tidytable)
df_jaeger14 |>
  group_by(region) |>
  summarize(mean_rt = mean(rt, na.rm = TRUE))
This dataset contains two example sentences, split
word-by-word. It is structured to demonstrate the use of the pangoling
package for processing text data.
df_sent
A data frame with 15 rows and 2 columns:
(integer) Sentence number, indicating which sentence each word belongs to.
(character) Words from the sentences.
Other datasets:
df_jaeger14
# Load the dataset
data("df_sent")
df_sent
The install_py_pangoling function facilitates the installation of the Python packages needed to use pangoling within an R environment, using the reticulate package to manage Python environments. It supports various installation methods, environment settings, and Python versions.
install_py_pangoling(
  method = c("auto", "virtualenv", "conda"),
  conda = "auto",
  version = "default",
  envname = "r-pangoling",
  restart_session = TRUE,
  conda_python_version = NULL,
  ...,
  pip_ignore_installed = FALSE,
  new_env = identical(envname, "r-pangoling"),
  python_version = NULL
)
method |
A character vector specifying the environment management method. Options are 'auto', 'virtualenv', and 'conda'. Default is 'auto'. |
conda |
Specifies the conda binary to use. Default is 'auto'. |
version |
The Python version to use. Default is 'default', automatically selected. |
envname |
Name of the virtual environment. Default is 'r-pangoling'. |
restart_session |
Logical, whether to restart the R session after installation. Default is TRUE. |
conda_python_version |
Python version for conda environments. |
... |
Additional arguments passed on to the underlying installation function. |
pip_ignore_installed |
Logical, whether to ignore already installed packages. Default is FALSE. |
new_env |
Logical, whether to create a new environment. Defaults to TRUE when envname is "r-pangoling". |
python_version |
Specifies the Python version for the environment. |
This function automatically selects the appropriate method for environment management and Python installation, with a focus on virtual and conda environments. It ensures flexibility in dependency management and Python version control. If a new environment is created, existing environments with the same name are removed.
The function returns NULL
invisibly, but outputs a message on successful
installation.
Other helper functions:
installed_py_pangoling()
,
set_cache_folder()
# Install with default settings:
if (FALSE) {
  install_py_pangoling()
}
Checks whether the required Python dependencies for pangoling are installed. This function verifies whether the necessary Python modules (transformers and torch) are available in the current Python environment.
installed_py_pangoling()
A logical value: TRUE
if both transformers
and torch
are
installed and accessible, otherwise FALSE
.
Other helper functions:
install_py_pangoling()
,
set_cache_folder()
## Not run:
if (installed_py_pangoling()) {
  message("Python dependencies are installed.")
} else {
  warning("Python dependencies are missing. Please install `torch` and `transformers`.")
}
## End(Not run)
Returns the configuration of a masked model.
masked_config( model = getOption("pangoling.masked.default"), config_model = NULL )
model |
Name of a pre-trained model or a folder containing one. Models based on "bert" should work. See the Hugging Face website. |
config_model |
List with other arguments that control how the model from Hugging Face is accessed. |
A masked language model (also known as a BERT-like or encoder model) is a type of large language model that can be used to predict the content of a mask in a sentence.
If not specified, the masked model used will be the one set in the global option pangoling.masked.default, which can be accessed via getOption("pangoling.masked.default") ("bert-base-uncased" by default). To change the default, use options(pangoling.masked.default = "newmaskedmodel").
A list of possible masked models can be found on the Hugging Face website.
Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details. In case of errors, check the status of https://status.huggingface.co/
A list with the configuration of the model.
Other masked model helper functions:
masked_preload()
masked_config(model = "bert-base-uncased")
Preloads a masked language model to speed up subsequent runs.
masked_preload( model = getOption("pangoling.masked.default"), add_special_tokens = NULL, config_model = NULL, config_tokenizer = NULL )
model |
Name of a pre-trained model or a folder containing one. Models based on "bert" should work. See the Hugging Face website. |
add_special_tokens |
Whether to include special tokens. It has the same default as the AutoTokenizer method in Python. |
config_model |
List with other arguments that control how the model from Hugging Face is accessed. |
config_tokenizer |
List with other arguments that control how the tokenizer from Hugging Face is accessed. |
A masked language model (also known as a BERT-like or encoder model) is a type of large language model that can be used to predict the content of a mask in a sentence.
If not specified, the masked model used will be the one set in the global option pangoling.masked.default, which can be accessed via getOption("pangoling.masked.default") ("bert-base-uncased" by default). To change the default, use options(pangoling.masked.default = "newmaskedmodel").
A list of possible masked models can be found on the Hugging Face website.
Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details. In case of errors, check the status of https://status.huggingface.co/
Nothing.
Other masked model helper functions:
masked_config()
causal_preload(model = "bert-base-uncased")
causal_preload(model = "bert-base-uncased")
Get the predictability (by default, the natural logarithm of the word probability) of a vector of target words (or phrases) given vectors of left and right contexts, using a masked transformer.
masked_targets_pred(
  prev_contexts,
  targets,
  after_contexts,
  log.p = getOption("pangoling.log.p"),
  ignore_regex = "",
  model = getOption("pangoling.masked.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL
)
prev_contexts |
Left context of the target word in left-to-right written languages. |
targets |
Target words. |
after_contexts |
Right context of the target in left-to-right written languages. |
log.p |
Base of the logarithm used for the output predictability values. If TRUE (default), predictabilities are natural log-probabilities (base e); if FALSE, raw probabilities are returned; alternatively, log.p can be the base of another logarithmic transformation (e.g., 1/2 for surprisal in bits). |
ignore_regex |
A regular expression specifying characters to ignore when calculating the log-probabilities; for example, a pattern matching punctuation can exclude it from the calculation. |
model |
Name of a pre-trained model or a folder containing one. Models based on "bert" should work. See the Hugging Face website. |
checkpoint |
Folder of a checkpoint. |
add_special_tokens |
Whether to include special tokens. It has the same default as the AutoTokenizer method in Python. |
config_model |
List with other arguments that control how the model from Hugging Face is accessed. |
config_tokenizer |
List with other arguments that control how the tokenizer from Hugging Face is accessed. |
A masked language model (also known as a BERT-like or encoder model) is a type of large language model that can be used to predict the content of a mask in a sentence.
If not specified, the masked model used will be the one set in the global option pangoling.masked.default, which can be accessed via getOption("pangoling.masked.default") ("bert-base-uncased" by default). To change the default, use options(pangoling.masked.default = "newmaskedmodel").
A list of possible masked models can be found on the Hugging Face website.
Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details. In case of errors, check the status of https://status.huggingface.co/
A named vector of predictability values (by default the natural logarithm of the word probability).
See the online article on the pangoling website for more examples.
Other masked model functions:
masked_tokens_pred_tbl()
masked_targets_pred( prev_contexts = c("The", "The"), targets = c("apple", "pear"), after_contexts = c( "doesn't fall far from the tree.", "doesn't fall far from the tree." ), model = "bert-base-uncased" )
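Since the returned values are natural log-probabilities by default, raw probabilities can be recovered with exp(); a short follow-up to the example above:
lp <- masked_targets_pred(
  prev_contexts = c("The", "The"),
  targets = c("apple", "pear"),
  after_contexts = c(
    "doesn't fall far from the tree.",
    "doesn't fall far from the tree."
  ),
  model = "bert-base-uncased"
)
exp(lp) # convert log-probabilities back to probabilities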
For each mask (indicated with [MASK]) in a sentence, get the possible tokens and their predictability (by default, the natural logarithm of the word probability) using a masked transformer.
masked_tokens_pred_tbl(
  masked_sentences,
  log.p = getOption("pangoling.log.p"),
  model = getOption("pangoling.masked.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL
)
masked_sentences |
Masked sentences. |
log.p |
Base of the logarithm used for the output predictability values. If TRUE (default), predictabilities are natural log-probabilities (base e); if FALSE, raw probabilities are returned; alternatively, log.p can be the base of another logarithmic transformation (e.g., 1/2 for surprisal in bits). |
model |
Name of a pre-trained model or a folder containing one. Models based on "bert" should work. See the Hugging Face website. |
checkpoint |
Folder of a checkpoint. |
add_special_tokens |
Whether to include special tokens. It has the same default as the AutoTokenizer method in Python. |
config_model |
List with other arguments that control how the model from Hugging Face is accessed. |
config_tokenizer |
List with other arguments that control how the tokenizer from Hugging Face is accessed. |
A masked language model (also known as a BERT-like or encoder model) is a type of large language model that can be used to predict the content of a mask in a sentence.
If not specified, the masked model used will be the one set in the global option pangoling.masked.default, which can be accessed via getOption("pangoling.masked.default") ("bert-base-uncased" by default). To change the default, use options(pangoling.masked.default = "newmaskedmodel").
A list of possible masked models can be found on the Hugging Face website.
Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details. In case of errors, check the status of https://status.huggingface.co/
A table with the masked sentences, the tokens (token
),
predictability (pred
), and the respective mask number (mask_n
).
See the online article on the pangoling website for more examples.
Other masked model functions:
masked_targets_pred()
masked_tokens_pred_tbl("The [MASK] doesn't fall far from the tree.", model = "bert-base-uncased" )
masked_tokens_pred_tbl("The [MASK] doesn't fall far from the tree.", model = "bert-base-uncased" )
The number of tokens in a string or vector of strings.
ntokens( x, model = getOption("pangoling.causal.default"), add_special_tokens = NULL, config_tokenizer = NULL )
x |
A character vector. |
model |
Name of a pre-trained model or a folder containing one. Models based on "gpt2" should work. See the Hugging Face website. |
add_special_tokens |
Whether to include special tokens. It has the same default as the AutoTokenizer method in Python. |
config_tokenizer |
List with other arguments that control how the tokenizer from Hugging Face is accessed. |
The number of tokens in a string or vector of strings.
Other token-related functions:
tokenize_lst()
,
transformer_vocab()
ntokens(x = c("The apple doesn't fall far from the tree."), model = "gpt2")
Calculates the perplexity of a vector of (log-)probabilities.
perplexity_calc(x, na.rm = FALSE, log.p = TRUE)
x |
A vector of log-probabilities. |
na.rm |
Should missing values (including NaN) be removed? |
log.p |
If TRUE (default), x is assumed to contain log-transformed probabilities with base e; if FALSE, x is assumed to contain raw probabilities. Alternatively, log.p can be the base of another logarithmic transformation. |
If x contains raw probabilities (NOT the default), perplexity is calculated as the geometric mean of the inverse probabilities, i.e., perplexity = (prod(x))^(-1/n) = exp(-mean(log(x))), where n is the number of probabilities.
The perplexity.
probs <- c(.3, .5, .6) perplexity_calc(probs, log.p = FALSE) lprobs <- log(probs) perplexity_calc(lprobs, log.p = TRUE)
probs <- c(.3, .5, .6) perplexity_calc(probs, log.p = FALSE) lprobs <- log(probs) perplexity_calc(lprobs, log.p = TRUE)
This function sets the cache directory for HuggingFace transformers. If a
path is given, the function checks if the directory exists and then sets the
TRANSFORMERS_CACHE
environment variable to this path.
If no path is provided, the function checks for the existing cache directory
in a number of environment variables.
If none of these environment variables are set, it provides the user with
information on the default cache directory.
set_cache_folder(path = NULL)
path |
Character string, the path to set as the cache directory. If NULL, the function will look for the cache directory in a number of environment variables. Default is NULL. |
Nothing is returned, this function is called for its side effect of
setting the TRANSFORMERS_CACHE
environment variable, or providing
information to the user.
Other helper functions:
install_py_pangoling()
,
installed_py_pangoling()
## Not run:
set_cache_folder("~/new_cache_dir")
## End(Not run)
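A follow-up check, assuming the cache folder was set as above: the environment variable written by set_cache_folder() can be inspected with base R.
Sys.getenv("TRANSFORMERS_CACHE")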
Tokenize a string or token ids.
tokenize_lst( x, decode = FALSE, model = getOption("pangoling.causal.default"), add_special_tokens = NULL, config_tokenizer = NULL )
x |
Strings or token ids. |
decode |
Logical. If TRUE, tokens are decoded into human-readable strings, handling special characters such as accents and diacritics. Default is FALSE. |
model |
Name of a pre-trained model or a folder containing one. Models based on "gpt2" should work. See the Hugging Face website. |
add_special_tokens |
Whether to include special tokens. It has the same default as the AutoTokenizer method in Python. |
config_tokenizer |
List with other arguments that control how the tokenizer from Hugging Face is accessed. |
A list of tokens.
Other token-related functions:
ntokens()
,
transformer_vocab()
tokenize_lst(x = c("The apple doesn't fall far from the tree."), model = "gpt2")
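A small sketch comparing raw and decoded output, using the same sentence and model as the example above:
# Raw tokens as stored in the model's vocabulary
tokenize_lst(x = "The apple doesn't fall far from the tree.", model = "gpt2")
# Human-readable tokens, with special characters handled
tokenize_lst(
  x = "The apple doesn't fall far from the tree.",
  model = "gpt2",
  decode = TRUE
)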
Returns the (decoded) vocabulary of a model.
transformer_vocab( model = getOption("pangoling.causal.default"), add_special_tokens = NULL, decode = FALSE, config_tokenizer = NULL )
model |
Name of a pre-trained model or a folder containing one. Models based on "gpt2" should work. See the Hugging Face website. |
add_special_tokens |
Whether to include special tokens. It has the same default as the AutoTokenizer method in Python. |
decode |
Logical. If TRUE, tokens are decoded into human-readable strings, handling special characters such as accents and diacritics. Default is FALSE. |
config_tokenizer |
List with other arguments that control how the tokenizer from Hugging Face is accessed. |
A vector with the vocabulary of a model.
Other token-related functions:
ntokens()
,
tokenize_lst()
transformer_vocab(model = "gpt2") |> head()
transformer_vocab(model = "gpt2") |> head()