Package 'pangoling'

Title: Access to Large Language Model Predictions
Description: Provides access to word predictability estimates using large language models (LLMs) based on 'transformer' architectures via integration with the 'Hugging Face' ecosystem. The package interfaces with pre-trained neural networks and supports both causal/auto-regressive LLMs (e.g., 'GPT-2'; Radford et al., 2019) and masked/bidirectional LLMs (e.g., 'BERT'; Devlin et al., 2019, <doi:10.48550/arXiv.1810.04805>) to compute the probability of words, phrases, or tokens given their linguistic context. By enabling a straightforward estimation of word predictability, the package facilitates research in psycholinguistics, computational linguistics, and natural language processing (NLP).
Authors: Bruno Nicenboim [aut, cre], Chris Emmerly [ctb], Giovanni Cassani [ctb], Lisa Levinson [rev], Utku Turk [rev]
Maintainer: Bruno Nicenboim <b.nicenboim@tilburguniversity.edu>
License: MIT + file LICENSE
Version: 1.0.1
Built: 2025-03-11 21:18:49 UTC
Source: https://github.com/ropensci/pangoling

Help Index


Returns the configuration of a causal model

Description

Returns the configuration of a causal model

Usage

causal_config(
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  config_model = NULL
)

Arguments

model

Name of a pre-trained model or folder. One should be able to use models based on "gpt2"; see the Hugging Face website.

checkpoint

Folder of a checkpoint.

config_model

List with other arguments that control how the model from Hugging Face is accessed.

Value

A list with the configuration of the model.

More details about causal models

A causal language model (also called a GPT-like, auto-regressive, or decoder model) is a type of large language model usually used for text generation; it predicts the next word (or, more accurately, the next token) based on the preceding context.

If not specified, the causal model used will be the one set in the global option pangoling.causal.default, which can be accessed via getOption("pangoling.causal.default") (by default "gpt2"). To change the default, use options(pangoling.causal.default = "newcausalmodel").
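
For example, a minimal sketch of switching the session default to another causal model ("distilgpt2" is used purely as an illustrative Hugging Face model name):

options(pangoling.causal.default = "distilgpt2")
getOption("pangoling.causal.default")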

A list of possible causal models can be found on the Hugging Face website.

Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details.

In case of errors when a new model is run, check the status of https://status.huggingface.co/

See Also

Other causal model helper functions: causal_preload()

Examples

causal_config(model = "gpt2")

Generate next tokens after a context and their predictability using a causal transformer model

Description

This function predicts the possible next tokens and their predictability (log-probabilities by default), sorting the tokens in descending order of predictability.

Usage

causal_next_tokens_pred_tbl(
  context,
  log.p = getOption("pangoling.log.p"),
  decode = FALSE,
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL
)

Arguments

context

A single string representing the context for which the next tokens and their predictabilities are predicted.

log.p

Base of the logarithm used for the output predictability values. If TRUE (default), the natural logarithm (base e) is used. If FALSE, the raw probabilities are returned. Alternatively, log.p can be set to a numeric value specifying the base of the logarithm (e.g., 2 for base-2 logarithms). To get surprisal in bits (rather than predictability), set log.p = 1/2.

decode

Logical. If TRUE, decodes the tokens into human-readable strings, handling special characters and diacritics. Default is FALSE.

model

Name of a pre-trained model or folder. One should be able to use models based on "gpt2"; see the Hugging Face website.

checkpoint

Folder of a checkpoint.

add_special_tokens

Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.

config_model

List with other arguments that control how the model from Hugging Face is accessed.

config_tokenizer

List with other arguments that control how the tokenizer from Hugging Face is accessed.

Details

The function uses a causal transformer model to compute the predictability of all tokens in the model's vocabulary, given a single input context. It returns a table where each row represents a token, along with its predictability score. By default, the function returns log-probabilities in natural logarithm (base e), but you can specify a different logarithm base (e.g., log.p = 1/2 for surprisal in bits).

If decode = TRUE, the tokens are converted into human-readable strings, handling special characters like accents and diacritics. This ensures that tokens are more interpretable, especially for languages with complex tokenization.

Value

A table with possible next tokens and their log-probabilities.

More details about causal models

A causal language model (also called a GPT-like, auto-regressive, or decoder model) is a type of large language model usually used for text generation; it predicts the next word (or, more accurately, the next token) based on the preceding context.

If not specified, the causal model used will be the one set in the global option pangoling.causal.default, which can be accessed via getOption("pangoling.causal.default") (by default "gpt2"). To change the default, use options(pangoling.causal.default = "newcausalmodel").

A list of possible causal models can be found on the Hugging Face website.

Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details.

In case of errors when a new model is run, check the status of https://status.huggingface.co/

See Also

Other causal model functions: causal_pred_mats(), causal_words_pred()

Examples

causal_next_tokens_pred_tbl(
  context = "The apple doesn't fall far from the",
  model = "gpt2"
)
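
# A sketch of the documented log.p argument: the same call on other
# predictability scales.
# Raw probabilities:
causal_next_tokens_pred_tbl(
  context = "The apple doesn't fall far from the",
  model = "gpt2",
  log.p = FALSE
)
# Surprisal in bits:
causal_next_tokens_pred_tbl(
  context = "The apple doesn't fall far from the",
  model = "gpt2",
  log.p = 1/2
)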

Generate a list of predictability matrices using a causal transformer model

Description

This function computes a list of matrices, where each matrix corresponds to a unique group specified by the by argument. Each matrix represents the predictability of every token in the input text (x) based on preceding context, as evaluated by a causal transformer model.

Usage

causal_pred_mats(
  x,
  by = rep(1, length(x)),
  sep = " ",
  log.p = getOption("pangoling.log.p"),
  sorted = FALSE,
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  decode = FALSE,
  config_model = NULL,
  config_tokenizer = NULL,
  batch_size = 1,
  ...
)

Arguments

x

A character vector of words, phrases, or texts to evaluate.

by

A grouping variable indicating how texts are split into groups.

sep

A string specifying how words are separated within contexts or groups. Default is " ". For languages that don't have spaces between words (e.g., Chinese), set sep = "".

log.p

Base of the logarithm used for the output predictability values. If TRUE (default), the natural logarithm (base e) is used. If FALSE, the raw probabilities are returned. Alternatively, log.p can be set to a numeric value specifying the base of the logarithm (e.g., 2 for base-2 logarithms). To get surprisal in bits (rather than predictability), set log.p = 1/2.

sorted

If FALSE (the default), the list retains the original order of the groups given in by. If TRUE, the returned list is sorted according to by.

model

Name of a pre-trained model or folder. One should be able to use models based on "gpt2"; see the Hugging Face website.

checkpoint

Folder of a checkpoint.

add_special_tokens

Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.

decode

Logical. If TRUE, decodes the tokens into human-readable strings, handling special characters and diacritics. Default is FALSE.

config_model

List with other arguments that control how the model from Hugging Face is accessed.

config_tokenizer

List with other arguments that control how the tokenizer from Hugging Face is accessed.

batch_size

Maximum number of sentences/texts processed in parallel. Larger batches increase speed but use more memory. Since all texts in a batch must have the same length, shorter ones are padded with placeholder tokens.

...

Currently not in use.

Details

The function splits the input x into groups specified by the by argument and processes each group independently. For each group, the model computes the predictability of each token in its vocabulary based on preceding context.

Each matrix contains:

  • Rows representing the model's vocabulary.

  • Columns corresponding to tokens in the group (e.g., a sentence or paragraph).

  • By default, values in the matrices are the natural logarithm of word probabilities.

Value

A list of matrices, each with the tokens of a group in its columns and the model's vocabulary in its rows.

More details about causal models

A causal language model (also called a GPT-like, auto-regressive, or decoder model) is a type of large language model usually used for text generation; it predicts the next word (or, more accurately, the next token) based on the preceding context.

If not specified, the causal model used will be the one set in the global option pangoling.causal.default, which can be accessed via getOption("pangoling.causal.default") (by default "gpt2"). To change the default, use options(pangoling.causal.default = "newcausalmodel").

A list of possible causal models can be found on the Hugging Face website.

Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details.

In case of errors when a new model is run, check the status of https://status.huggingface.co/

See Also

Other causal model functions: causal_next_tokens_pred_tbl(), causal_words_pred()

Examples

data("df_sent")
df_sent
list_of_mats <- causal_pred_mats(
                       x = df_sent$word,
                       by = df_sent$sent_n,  
                       model = "gpt2"
                )

# View the structure of the resulting list
list_of_mats |> str()

# Inspect the last rows of the first matrix
list_of_mats[[1]] |> tail()

# Inspect the last rows of the second matrix
list_of_mats[[2]] |> tail()
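
# A further sketch, relying only on the documented structure
# (vocabulary in rows, tokens in columns):
# Vocabulary size (rows) by number of tokens (columns)
dim(list_of_mats[[1]])

# Column names are the tokens of the first sentence
colnames(list_of_mats[[1]])

# Row names are the model's vocabulary
rownames(list_of_mats[[1]]) |> head()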

Preloads a causal language model

Description

Preloads a causal language model to speed up subsequent runs.

Usage

causal_preload(
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL
)

Arguments

model

Name of a pre-trained model or folder. One should be able to use models based on "gpt2"; see the Hugging Face website.

checkpoint

Folder of a checkpoint.

add_special_tokens

Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.

config_model

List with other arguments that control how the model from Hugging Face is accessed.

config_tokenizer

List with other arguments that control how the tokenizer from Hugging Face is accessed.

Value

Nothing.

More details about causal models

A causal language model (also called a GPT-like, auto-regressive, or decoder model) is a type of large language model usually used for text generation; it predicts the next word (or, more accurately, the next token) based on the preceding context.

If not specified, the causal model used will be the one set in the global option pangoling.causal.default, which can be accessed via getOption("pangoling.causal.default") (by default "gpt2"). To change the default, use options(pangoling.causal.default = "newcausalmodel").

A list of possible causal models can be found on the Hugging Face website.

Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details.

In case of errors when a new model is run, check the status of https://status.huggingface.co/

See Also

Other causal model helper functions: causal_config()

Examples

causal_preload(model = "gpt2")

Compute predictability using a causal transformer model

Description

These functions calculate the predictability of words, phrases, or tokens using a causal transformer model.

Usage

causal_words_pred(
  x,
  by = rep(1, length(x)),
  word_n = NULL,
  sep = " ",
  log.p = getOption("pangoling.log.p"),
  ignore_regex = "",
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL,
  batch_size = 1,
  ...
)

causal_tokens_pred_lst(
  texts,
  log.p = getOption("pangoling.log.p"),
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL,
  batch_size = 1
)

causal_targets_pred(
  contexts,
  targets,
  sep = " ",
  log.p = getOption("pangoling.log.p"),
  ignore_regex = "",
  model = getOption("pangoling.causal.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL,
  batch_size = 1,
  ...
)

Arguments

x

A character vector of words, phrases, or texts to evaluate.

by

A grouping variable indicating how texts are split into groups.

word_n

Word order; by default, the order in which the words appear in the vector x.

sep

A string specifying how words are separated within contexts or groups. Default is " ". For languages that don't have spaces between words (e.g., Chinese), set sep = "".

log.p

Base of the logarithm used for the output predictability values. If TRUE (default), the natural logarithm (base e) is used. If FALSE, the raw probabilities are returned. Alternatively, log.p can be set to a numeric value specifying the base of the logarithm (e.g., 2 for base-2 logarithms). To get surprisal in bits (rather than predictability), set log.p = 1/2.

ignore_regex

A regular expression specifying characters to ignore when calculating the log-probabilities. For example, "^[[:punct:]]$" ignores any punctuation mark that stands alone as a token.

model

Name of a pre-trained model or folder. One should be able to use models based on "gpt2"; see the Hugging Face website.

checkpoint

Folder of a checkpoint.

add_special_tokens

Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.

config_model

List with other arguments that control how the model from Hugging Face is accessed.

config_tokenizer

List with other arguments that control how the tokenizer from Hugging Face is accessed.

batch_size

Maximum number of sentences/texts processed in parallel. Larger batches increase speed but use more memory. Since all texts in a batch must have the same length, shorter ones are padded with placeholder tokens.

...

Currently not in use.

texts

A vector or list of sentences or paragraphs.

contexts

A character vector of contexts corresponding to each target.

targets

A character vector of target words or phrases.

Details

These functions calculate the predictability (by default the natural logarithm of the word probability) of words, phrases or tokens using a causal transformer model:

  • causal_targets_pred(): Evaluates specific target words or phrases based on their given contexts. Use when you have explicit context-target pairs to evaluate, with each target word or phrase paired with a single preceding context.

  • causal_words_pred(): Computes predictability for all elements of a vector grouped by a specified variable. Use when working with words or phrases split into groups, such as sentences or paragraphs, where predictability is computed for every word or phrase in each group.

  • causal_tokens_pred_lst(): Computes the predictability of each token in a sentence (or group of sentences) and returns a list of results for each sentence. Use when you want to calculate the predictability of every token in one or more sentences.

See the online article on the pangoling website for more examples.

Value

For causal_targets_pred() and causal_words_pred(), a named numeric vector of predictability scores. For causal_tokens_pred_lst(), a list of named numeric vectors, one for each sentence or group.

More details about causal models

A causal language model (also called a GPT-like, auto-regressive, or decoder model) is a type of large language model usually used for text generation; it predicts the next word (or, more accurately, the next token) based on the preceding context.

If not specified, the causal model used will be the one set in the global option pangoling.causal.default, which can be accessed via getOption("pangoling.causal.default") (by default "gpt2"). To change the default, use options(pangoling.causal.default = "newcausalmodel").

A list of possible causal models can be found on the Hugging Face website.

Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details.

In case of errors when a new model is run, check the status of https://status.huggingface.co/

See Also

Other causal model functions: causal_next_tokens_pred_tbl(), causal_pred_mats()

Examples

# Using causal_targets_pred
causal_targets_pred(
  contexts = c("The apple doesn't fall far from the",
               "Don't judge a book by its"),
  targets = c("tree.", "cover."),
  model = "gpt2"
)

# Using causal_words_pred
causal_words_pred(
  x = df_sent$word,
  by = df_sent$sent_n,
  model = "gpt2"
)

# Using causal_tokens_pred_lst
preds <- causal_tokens_pred_lst(
  texts = c("The apple doesn't fall far from the tree.",
            "Don't judge a book by its cover."),
  model = "gpt2"
)
preds

# Convert the output to a tidy table
suppressPackageStartupMessages(library(tidytable))
map2_dfr(preds, seq_along(preds),
         ~ data.frame(tokens = names(.x), pred = .x, id = .y))
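
# A sketch using the shared log.p argument to obtain word-by-word
# surprisal in bits instead of log-probabilities:
causal_words_pred(
  x = df_sent$word,
  by = df_sent$sent_n,
  log.p = 1/2,
  model = "gpt2"
)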

Self-Paced Reading Dataset on Chinese Relative Clauses

Description

This dataset contains data from a self-paced reading experiment on Chinese relative clause comprehension. It is structured to support analysis of reaction times, comprehension accuracy, and surprisal values across various experimental conditions in a 2x2 fully crossed factorial design (see Details).

Usage

data(df_jaeger14)

Format

A tibble with 8,624 rows and 15 variables:

subject

Participant identifier, a character vector.

item

Trial item number, an integer.

cond

Experimental condition, a character vector indicating variations in sentence structure (e.g., "a", "b", "c", "d").

word

Chinese word presented in each trial, a character vector.

wordn

Position of the word within the sentence, an integer.

rt

Reaction time in milliseconds for reading each word, an integer.

region

Sentence region or phrase type (e.g., "hd1", "Det+CL"), a character vector.

question

Comprehension question associated with the trial, a character vector.

accuracy

Binary accuracy score for the comprehension question (1 = correct, 0 = incorrect).

correct_answer

Expected correct answer for the comprehension question, a character vector ("Y" or "N").

question_type

Type of comprehension question, a character vector.

experiment

Name of the experiment, indicating self-paced reading, a character vector.

list

Experimental list number, for counterbalancing item presentation, an integer.

sentence

Full sentence used in the trial with words marked for analysis, a character vector.

surprisal

Model-derived surprisal values for each word, a numeric vector.

Region codes in the dataset (column region):

  • N: Main clause subject (in object-modifications only)

  • V: Main clause verb (in object-modifications only)

  • Det+CL: Determiner+classifier

  • Adv: Adverb

  • VN: RC-verb+RC-object (subject relatives) or RC-subject+RC-verb (object relatives)

    • Note: These two words were merged into one region after the experiment; they were presented as separate regions during the experiment.

  • FreqP: Frequency phrase/durational phrase

  • DE: Relativizer "de"

  • head: Relative clause head noun

  • hd1: First word after the head noun

  • hd2: Second word after the head noun

  • hd3: Third word after the head noun

  • hd4: Fourth word after the head noun (only in subject-modifications)

  • hd5: Fifth word after the head noun (only in subject-modifications)

Notes on reading times (column rt):

  • The reading time of the relative clause region (e.g., "V-N" or "N-V") was computed by summing up the reading times of the relative clause verb and noun.

  • The verb and noun were presented as two separate regions during the experiment.

Details

  • Factor I: Modification type (subject modification; object modification)

  • Factor II: Relative clause type (subject relative; object relative)

Condition labels:

  • a) subject modification; subject relative

  • b) subject modification; object relative

  • c) object modification; subject relative

  • d) object modification; object relative

Source

Jäger, L., Chen, Z., Li, Q., Lin, C.-J. C., & Vasishth, S. (2015). The subject-relative advantage in Chinese: Evidence for expectation-based processing. Journal of Memory and Language, 79–80, 97–120. doi:10.1016/j.jml.2014.10.005

See Also

Other datasets: df_sent

Examples

# Basic exploration
head(df_jaeger14)

# Summarize reaction times by region
library(tidytable)
df_jaeger14 |>
  group_by(region) |>
  summarize(mean_rt = mean(rt, na.rm = TRUE))
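
# A similar sketch for comprehension accuracy, using only the
# documented cond and accuracy columns:
df_jaeger14 |>
  group_by(cond) |>
  summarize(mean_accuracy = mean(accuracy, na.rm = TRUE))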

Example dataset: Two word-by-word sentences

Description

This dataset contains two example sentences, split word-by-word. It is structured to demonstrate the use of the pangoling package for processing text data.

Usage

df_sent

Format

A data frame with 15 rows and 2 columns:

sent_n

(integer) Sentence number, indicating which sentence each word belongs to.

word

(character) Words from the sentences.

See Also

Other datasets: df_jaeger14

Examples

# Load the dataset
data("df_sent")
df_sent

Install the Python packages needed for pangoling

Description

The install_py_pangoling() function facilitates the installation of the Python packages needed to use pangoling within an R environment, relying on the reticulate package to manage Python environments. It supports various installation methods, environment settings, and Python versions.

Usage

install_py_pangoling(method = c("auto", "virtualenv", "conda"),
                     conda = "auto",
                     version = "default",
                     envname = "r-pangoling",
                     restart_session = TRUE,
                     conda_python_version = NULL,
                     ...,
                     pip_ignore_installed = FALSE,
                     new_env = identical(envname, "r-pangoling"),
                     python_version = NULL)

Arguments

method

A character vector specifying the environment management method. Options are 'auto', 'virtualenv', and 'conda'. Default is 'auto'.

conda

Specifies the conda binary to use. Default is 'auto'.

version

The Python version to use. Default is 'default', which selects the version automatically.

envname

Name of the virtual environment. Default is 'r-pangoling'.

restart_session

Logical, whether to restart the R session after installation. Default is TRUE.

conda_python_version

Python version for conda environments.

...

Additional arguments passed to reticulate::py_install.

pip_ignore_installed

Logical, whether to ignore already installed packages. Default is FALSE.

new_env

Logical, whether to create a new environment. Default is identical(envname, "r-pangoling"), i.e., TRUE when the default environment name is used.

python_version

Specifies the Python version for the environment.

Details

This function automatically selects the appropriate method for environment management and Python installation, with a focus on virtual and conda environments. It ensures flexibility in dependency management and Python version control. If a new environment is created, existing environments with the same name are removed.

Value

The function returns NULL invisibly, but outputs a message on successful installation.

See Also

Other helper functions: installed_py_pangoling(), set_cache_folder()

Examples

# Install with default settings:
if (FALSE) {
 install_py_pangoling()
}

Check if the required Python dependencies for pangoling are installed

Description

This function verifies whether the necessary Python modules (transformers and torch) are available in the current Python environment.

Usage

installed_py_pangoling()

Value

A logical value: TRUE if both transformers and torch are installed and accessible, otherwise FALSE.

See Also

Other helper functions: install_py_pangoling(), set_cache_folder()

Examples

## Not run: 
if (installed_py_pangoling()) {
 message("Python dependencies are installed.")
} else {
 warning("Python dependencies are missing. Please install `torch` and `transformers`.")
}

## End(Not run)

Returns the configuration of a masked model

Description

Returns the configuration of a masked model.

Usage

masked_config(
  model = getOption("pangoling.masked.default"),
  config_model = NULL
)

Arguments

model

Name of a pre-trained model or folder. One should be able to use models based on "bert"; see the Hugging Face website.

config_model

List with other arguments that control how the model from Hugging Face is accessed.

Details

A masked language model (also called a BERT-like or encoder model) is a type of large language model that can be used to predict the content of a mask in a sentence.

If not specified, the masked model used will be the one set in the global option pangoling.masked.default, which can be accessed via getOption("pangoling.masked.default") (by default "bert-base-uncased"). To change the default, use options(pangoling.masked.default = "newmaskedmodel").

A list of possible masked models can be found on the Hugging Face website.

Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details. In case of errors, check the status of https://status.huggingface.co/

Value

A list with the configuration of the model.

See Also

Other masked model helper functions: masked_preload()

Examples

masked_config(model = "bert-base-uncased")

Preloads a masked language model

Description

Preloads a masked language model to speed up subsequent runs.

Usage

masked_preload(
  model = getOption("pangoling.masked.default"),
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL
)

Arguments

model

Name of a pre-trained model or folder. One should be able to use models based on "bert"; see the Hugging Face website.

add_special_tokens

Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.

config_model

List with other arguments that control how the model from Hugging Face is accessed.

config_tokenizer

List with other arguments that control how the tokenizer from Hugging Face is accessed.

Details

A masked language model (also called a BERT-like or encoder model) is a type of large language model that can be used to predict the content of a mask in a sentence.

If not specified, the masked model used will be the one set in the global option pangoling.masked.default, which can be accessed via getOption("pangoling.masked.default") (by default "bert-base-uncased"). To change the default, use options(pangoling.masked.default = "newmaskedmodel").

A list of possible masked models can be found on the Hugging Face website.

Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details. In case of errors, check the status of https://status.huggingface.co/

Value

Nothing.

See Also

Other masked model helper functions: masked_config()

Examples

masked_preload(model = "bert-base-uncased")

Get the predictability of a target word (or phrase) given a left and right context

Description

Get the predictability (by default, the natural logarithm of the word probability) of a vector of target words (or phrases) given vectors of left and right contexts, using a masked transformer.

Usage

masked_targets_pred(
  prev_contexts,
  targets,
  after_contexts,
  log.p = getOption("pangoling.log.p"),
  ignore_regex = "",
  model = getOption("pangoling.masked.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL
)

Arguments

prev_contexts

Left context of the target word in left-to-right written languages.

targets

Target words.

after_contexts

Right context of the target in left-to-right written languages.

log.p

Base of the logarithm used for the output predictability values. If TRUE (default), the natural logarithm (base e) is used. If FALSE, the raw probabilities are returned. Alternatively, log.p can be set to a numeric value specifying the base of the logarithm (e.g., 2 for base-2 logarithms). To get surprisal in bits (rather than predictability), set log.p = 1/2.

ignore_regex

A regular expression specifying characters to ignore when calculating the log-probabilities. For example, "^[[:punct:]]$" ignores any punctuation mark that stands alone as a token.

model

Name of a pre-trained model or folder. One should be able to use models based on "bert"; see the Hugging Face website.

checkpoint

Folder of a checkpoint.

add_special_tokens

Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.

config_model

List with other arguments that control how the model from Hugging Face is accessed.

config_tokenizer

List with other arguments that control how the tokenizer from Hugging Face is accessed.

Details

A masked language model (also called a BERT-like or encoder model) is a type of large language model that can be used to predict the content of a mask in a sentence.

If not specified, the masked model used will be the one set in the global option pangoling.masked.default, which can be accessed via getOption("pangoling.masked.default") (by default "bert-base-uncased"). To change the default, use options(pangoling.masked.default = "newmaskedmodel").

A list of possible masked models can be found on the Hugging Face website.

Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details. In case of errors, check the status of https://status.huggingface.co/

Value

A named vector of predictability values (by default the natural logarithm of the word probability).

More examples

See the online article on the pangoling website for more examples.

See Also

Other masked model functions: masked_tokens_pred_tbl()

Examples

masked_targets_pred(
  prev_contexts = c("The", "The"),
  targets = c("apple", "pear"),
  after_contexts = c(
    "doesn't fall far from the tree.",
    "doesn't fall far from the tree."
  ),
  model = "bert-base-uncased"
)
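
# A sketch of the same call on another predictability scale, via the
# documented log.p argument (raw probabilities here):
masked_targets_pred(
  prev_contexts = c("The", "The"),
  targets = c("apple", "pear"),
  after_contexts = c(
    "doesn't fall far from the tree.",
    "doesn't fall far from the tree."
  ),
  log.p = FALSE,
  model = "bert-base-uncased"
)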

Get the possible tokens and their log probabilities for each mask in a sentence

Description

For each mask, indicated with [MASK] in a sentence, get the possible tokens and their predictability (by default, the natural logarithm of the word probability) using a masked transformer.

Usage

masked_tokens_pred_tbl(
  masked_sentences,
  log.p = getOption("pangoling.log.p"),
  model = getOption("pangoling.masked.default"),
  checkpoint = NULL,
  add_special_tokens = NULL,
  config_model = NULL,
  config_tokenizer = NULL
)

Arguments

masked_sentences

Masked sentences.

log.p

Base of the logarithm used for the output predictability values. If TRUE (default), the natural logarithm (base e) is used. If FALSE, the raw probabilities are returned. Alternatively, log.p can be set to a numeric value specifying the base of the logarithm (e.g., 2 for base-2 logarithms). To get surprisal in bits (rather than predictability), set log.p = 1/2.

model

Name of a pre-trained model or folder. One should be able to use models based on "bert"; see the Hugging Face website.

checkpoint

Folder of a checkpoint.

add_special_tokens

Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.

config_model

List with other arguments that control how the model from Hugging Face is accessed.

config_tokenizer

List with other arguments that control how the tokenizer from Hugging Face is accessed.

Details

A masked language model (also called a BERT-like or encoder model) is a type of large language model that can be used to predict the content of a mask in a sentence.

If not specified, the masked model used will be the one set in the global option pangoling.masked.default, which can be accessed via getOption("pangoling.masked.default") (by default "bert-base-uncased"). To change the default, use options(pangoling.masked.default = "newmaskedmodel").

A list of possible masked models can be found on the Hugging Face website.

Using the config_model and config_tokenizer arguments, it's possible to control how the model and tokenizer from Hugging Face are accessed; see the Python method from_pretrained for details. In case of errors, check the status of https://status.huggingface.co/

Value

A table with the masked sentences, the tokens (token), predictability (pred), and the respective mask number (mask_n).

More examples

See the online article on the pangoling website for more examples.

See Also

Other masked model functions: masked_targets_pred()

Examples

masked_tokens_pred_tbl("The [MASK] doesn't fall far from the tree.",
  model = "bert-base-uncased"
)
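
# A sketch with more than one mask; the mask_n column in the output
# distinguishes the masks:
masked_tokens_pred_tbl(
  "The [MASK] doesn't fall far from the [MASK].",
  model = "bert-base-uncased"
)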

The number of tokens in a string or vector of strings

Description

The number of tokens in a string or vector of strings

Usage

ntokens(
  x,
  model = getOption("pangoling.causal.default"),
  add_special_tokens = NULL,
  config_tokenizer = NULL
)

Arguments

x

Character input: a string or a vector of strings.

model

Name of a pre-trained model or folder. One should be able to use models based on "gpt2"; see the Hugging Face website.

add_special_tokens

Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.

config_tokenizer

List with other arguments that control how the tokenizer from Hugging Face is accessed.

Value

The number of tokens in a string or vector of strings.

See Also

Other token-related functions: tokenize_lst(), transformer_vocab()

Examples

ntokens(x = c("The apple doesn't fall far from the tree."), model = "gpt2")
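
# A sketch with a vector of strings; one count per string is the
# expected output:
ntokens(
  x = c("The apple doesn't fall far from the tree.",
        "Don't judge a book by its cover."),
  model = "gpt2"
)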

Calculates perplexity

Description

Calculates the perplexity of a vector of (log-)probabilities.

Usage

perplexity_calc(x, na.rm = FALSE, log.p = TRUE)

Arguments

x

A vector of (log-)probabilities.

na.rm

Should missing values (including NaN) be removed?

log.p

If TRUE (default), x are assumed to be log-transformed probabilities with base e, if FALSE x are assumed to be raw probabilities, alternatively log.p can be the base of other logarithmic transformations.

Details

If x are raw probabilities (NOT the default), then perplexity is calculated as follows:

\left(\prod_{n=1}^{N} x_n \right)^{-\frac{1}{N}}

Value

The perplexity.

Examples

probs <- c(.3, .5, .6)
perplexity_calc(probs, log.p = FALSE)
lprobs <- log(probs)
perplexity_calc(lprobs, log.p = TRUE)
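
# A sketch combining perplexity_calc() with a causal model's token
# log-probabilities. The first token of a sentence has no preceding
# context and may yield NA, hence na.rm = TRUE:
lp <- causal_tokens_pred_lst(
  texts = "The apple doesn't fall far from the tree.",
  model = "gpt2"
)[[1]]
perplexity_calc(lp, na.rm = TRUE, log.p = TRUE)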

Set cache folder for HuggingFace transformers

Description

This function sets the cache directory for HuggingFace transformers. If a path is given, the function checks if the directory exists and then sets the TRANSFORMERS_CACHE environment variable to this path. If no path is provided, the function checks for the existing cache directory in a number of environment variables. If none of these environment variables are set, it provides the user with information on the default cache directory.

Usage

set_cache_folder(path = NULL)

Arguments

path

Character string, the path to set as the cache directory. If NULL, the function will look for the cache directory in a number of environment variables. Default is NULL.

Value

Nothing is returned, this function is called for its side effect of setting the TRANSFORMERS_CACHE environment variable, or providing information to the user.

See Also

Installation docs

Other helper functions: install_py_pangoling(), installed_py_pangoling()

Examples

## Not run: 
set_cache_folder("~/new_cache_dir")

## End(Not run)
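
# A side-effect-free sketch: inspect the environment variable this
# function sets (empty if no cache folder was set explicitly):
Sys.getenv("TRANSFORMERS_CACHE")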

Tokenize an input

Description

Tokenize a string or token ids.

Usage

tokenize_lst(
  x,
  decode = FALSE,
  model = getOption("pangoling.causal.default"),
  add_special_tokens = NULL,
  config_tokenizer = NULL
)

Arguments

x

Strings or token ids.

decode

Logical. If TRUE, decodes the tokens into human-readable strings, handling special characters and diacritics. Default is FALSE.

model

Name of a pre-trained model or folder. One should be able to use models based on "gpt2"; see the Hugging Face website.

add_special_tokens

Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.

config_tokenizer

List with other arguments that control how the tokenizer from Hugging Face is accessed.

Value

A list with the tokens.

See Also

Other token-related functions: ntokens(), transformer_vocab()

Examples

tokenize_lst(x = c("The apple doesn't fall far from the tree."), 
             model = "gpt2")

Returns the vocabulary of a model

Description

Returns the (decoded) vocabulary of a model.

Usage

transformer_vocab(
  model = getOption("pangoling.causal.default"),
  add_special_tokens = NULL,
  decode = FALSE,
  config_tokenizer = NULL
)

Arguments

model

Name of a pre-trained model or folder. One should be able to use models based on "gpt2"; see the Hugging Face website.

add_special_tokens

Whether to include special tokens. It has the same default as the AutoTokenizer method in Python.

decode

Logical. If TRUE, decodes the tokens into human-readable strings, handling special characters and diacritics. Default is FALSE.

config_tokenizer

List with other arguments that control how the tokenizer from Hugging Face is accessed.

Value

A vector with the vocabulary of a model.

See Also

Other token-related functions: ntokens(), tokenize_lst()

Examples

transformer_vocab(model = "gpt2") |>
 head()
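
# A sketch with the documented decode argument, returning
# human-readable vocabulary entries:
transformer_vocab(model = "gpt2", decode = TRUE) |>
 head()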