The “pkgmatch” package finds packages, as well as individual functions, which best match a given input. Inputs can be text descriptions, sections of code, or even entire R packages. “pkgmatch” finds the most closely matching packages using a combination of Language Models (LMs, or equivalently “LLMs” for Large Language Models) and traditional token-frequency algorithms. This document describes how the “pkgmatch” package works.
The LMs used here accept input lengths, or “contexts”, of up to 8,096 tokens. That is long enough to accommodate the entire code base of all but the very largest R packages, and so no chunking is implemented for code inputs. Code inputs may therefore be truncated for very large packages, although that should be very unlikely in practice.
Text input for packages includes text from the package description (in the “DESCRIPTION” file), followed by the “README” file (where present), the text of all vignettes, and the descriptions and parameter documentation of every function. All code is removed from these text sources, including all code chunks in `.Rmd` documents.
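The following is a minimal sketch, not pkgmatch's internal code, of one way to strip fenced code chunks from an `.Rmd` file so that only prose remains; the fence-detection regex is an assumption for illustration only.

``` r
# Remove fenced code chunks from an .Rmd document, keeping only prose lines.
strip_rmd_code <- function(path) {
    txt <- readLines(path, warn = FALSE)
    fence <- grepl("^\\s*```", txt)       # opening "```{r ...}" or closing "```"
    inside <- cumsum(fence) %% 2L == 1L   # lines lying within a chunk
    txt[!(fence | inside)]
}
```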
A reduced version of the input text is also submitted to both the LM and token-frequency algorithms, formed by removing the text of function descriptions and leaving only the package description, README, and vignette text. Text inputs and outputs are thus considered in two forms: the full form, generally appended with `_with_fns`, and the reduced form appended with `_wo_fns`.
All functions accept an `input` parameter, with an accompanying `input_is_code` logical flag. This flag is important because text inputs are passed to a different model than code inputs. If not specifically set, the value of the `input_is_code` flag is determined by first passing the `input` to an internal function, `text_is_code()`.
That function classifies an input as text when the proportion of recognisable words in the input meets or exceeds a default threshold of 98%, and as code otherwise.
The code file in which that function is defined contains a short script at the bottom which was used to define this threshold. This automated distinction between text and code inputs can always be overridden by specifying a value for the `input_is_code` parameter.
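The actual heuristic is defined in the package source; the following is only a rough, illustrative stand-in for that kind of test, in which the recognisable-word check is a simplified assumption rather than pkgmatch's real implementation.

``` r
# A rough stand-in for a text-versus-code heuristic; the real text_is_code()
# function in the pkgmatch source differs in detail.
looks_like_text <- function(input, threshold = 0.98) {
    tokens <- unlist(strsplit(input, "[[:space:]]+"))
    tokens <- tokens[nzchar(tokens)]
    # Tokens containing typical code characters are treated as non-words:
    is_word <- !grepl("[(){}<>$=]|::|<-|\\[", tokens)
    mean(is_word) >= threshold # TRUE for text, FALSE for code
}
```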
The algorithms described here were applied to all R packages from rOpenSci’s suite, as well as to all packages from CRAN. Most functions accept a “corpus” parameter, with values of either “ropensci” or “cran”, to compare inputs against one of these corpora. The following sections describe the algorithms this package uses to assess similarities between inputs and all packages within these corpora.
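For orientation, a typical call might look like the following, assuming the package’s exported matching function, `pkgmatch_similar_pkgs()`; parameters other than `input` and `corpus` are omitted here.

``` r
library(pkgmatch)
# Compare a plain-text description against the CRAN corpus:
input <- "Packages for analysing and visualising spatial raster data"
pkgmatch_similar_pkgs(input, corpus = "cran")
```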
“pkgmatch” accesses LMs through a local ollama server, as described in a separate vignette. LMs are used here exclusively to generate “embeddings”: numeric vectors of fixed length which represent a (reduced) form of how the underlying text is represented within the multi-dimensional tensor space of the model. In short, and to avoid any need to actually understand that sentence: embeddings convert a text input into a numeric vector, and so enable different texts to be numerically compared.
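As a hedged illustration of what “generating an embedding” means in practice, an embedding can be requested from a local ollama server through its REST API. This is not pkgmatch's internal code, and the model name below is an arbitrary example.

``` r
library(httr2)

# Request an embedding vector for a single text input from a local ollama
# server; the model name is an arbitrary example.
get_embedding <- function(txt, model = "nomic-embed-text") {
    resp <- request("http://localhost:11434/api/embeddings") |>
        req_body_json(list(model = model, prompt = txt)) |>
        req_perform()
    unlist(resp_body_json(resp)$embedding)
}

emb <- get_embedding("Tools to download and clean public transport timetables")
length(emb) # fixed length, determined by the model
```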
The comparisons are implemented here, as in most applications of LM embeddings, as cosine similarities. Cosine similarities quantify how similar the orientations of embedding vectors are within the multi-dimensional space of the LM. They are generally preferred because similarities in the orientation of two (or more) embedding vectors are generally more informative than similarities in their magnitudes.
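The cosine similarity itself is the standard formula, shown here as a small R helper rather than pkgmatch's exact code.

``` r
# Cosine similarity of two embedding vectors: the dot product divided by the
# product of their magnitudes, so that only orientation matters.
cosine_similarity <- function(a, b) {
    sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}
```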
For a given input and corpus, the output of this LM component is a cosine similarity for each package, along with a corresponding rank, where the first-ranked package is the most similar.
Similarities from LMs alone are often not reliable, and better results are generally obtained by combining LM results with equivalent outputs from alternative algorithms. The “pkgmatch” package therefore also passes all inputs to an internal version of the “Best Match 25” (BM25) algorithm, which is commonly used in conjunction with LM outputs. Like LM cosine similarities, the BM25 algorithm generates vectors of numeric values quantifying the similarity between a given input and all packages (or functions) within the specified corpus.
Importantly, the version developed here separates the two steps of the algorithm: first calculating inverse document frequencies (IDFs) of all terms in the specified corpora, and then using those IDFs to calculate final scores for input documents. The package is distributed along with pre-calculated IDF values for all terms (both text and code) in both corpora. These are automatically downloaded the first time any functions are called, and thereafter remain available for immediate use in all subsequent calculations.
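Schematically, the first of those two steps looks like the following sketch, which uses the standard BM25 form of the IDF; the variant used internally may differ in detail.

``` r
# Inverse document frequencies across a corpus, where 'token_lists' is a list
# with one character vector of tokens per document (i.e. per package).
bm25_idf <- function(token_lists) {
    n_docs <- length(token_lists)
    doc_freq <- table(unlist(lapply(token_lists, unique)))
    log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
}
```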
For text input, the BM25 algorithm requires inputs to be converted to “tokens”, for which this package relies on rOpenSci’s “tokenizers” package. BM25 values then assess similarity by summing contributions from all tokens shared between two documents, weighted by the inverse of token frequencies across all packages within the specified corpus. This inverse weighting ensures that similarities are influenced more by shared usage of infrequent tokens, and less by usage of common tokens.
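The second step then scores an input against each corpus document using those pre-computed IDFs. The sketch below uses the conventional BM25 parameter values of k1 = 1.2 and b = 0.75, which are not necessarily the values used internally.

``` r
# Score one corpus document ('doc_tokens') against an input ('input_tokens'),
# given pre-computed IDFs (a named numeric vector) and the average document
# length across the corpus.
bm25_score <- function(input_tokens, doc_tokens, idfs, avg_doc_len,
                       k1 = 1.2, b = 0.75) {
    tf <- table(doc_tokens)
    shared <- intersect(unique(input_tokens), intersect(names(tf), names(idfs)))
    len_norm <- 1 - b + b * length(doc_tokens) / avg_doc_len
    sum(vapply(shared, function(tok) {
        f <- as.numeric(tf[[tok]])
        idfs[[tok]] * f * (k1 + 1) / (f + k1 * len_norm)
    }, numeric(1)))
}
```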
For code input, the tokens for the BM25 algorithm are the names of all functions called, pre-pended with their namespaces, such as `base::print()`. Function calls are identified with the “treesitter” code parser, using functionality of the “treesitter” R package. This functionality is currently restricted to the R language only, and thus code input in this package considers code from `/R` sub-directories only, excluding any source code in other languages.
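pkgmatch itself identifies calls with the “treesitter” R package; purely to illustrate the kind of tokens that result, the stand-in sketch below uses base R's parse data instead, and returns bare function names (re-attaching the `pkg::` namespace prefix is omitted for brevity).

``` r
# Stand-in illustration using base R's parse data rather than treesitter:
# extract the names of all functions called within a piece of R code.
extract_call_tokens <- function(code) {
    pd <- utils::getParseData(parse(text = code, keep.source = TRUE))
    unique(pd$text[pd$token == "SYMBOL_FUNCTION_CALL"])
}
extract_call_tokens("x <- stats::var(rnorm(10)); print(x)")
# returns the bare function names, e.g. "var", "rnorm", "print"
```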
All function calls in all packages within each corpus are first converted to IDFs. Calculations of BM25 values are weighted by these so that, as for word-token frequencies, the resultant similarities are most strongly affected by similarities in usage of less common functions, and relatively unaffected by similarities in usage of common functions.
Finally, the best-matching outputs from the various combinations of input chunks and algorithms are combined using the Reciprocal Rank Fusion (RRF) algorithm of Cormack, Clarke, & Büttcher (2009). Scores for each combination of input chunk and algorithm are converted to ranks, and all ranks are submitted to the RRF algorithm. A default value of `k = 60` is used at all times, as recommended by the original paper, which also suggests very low sensitivity of results to variations in this value. Where inputs are entire packages, final rankings are generated separately for code and text.
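The fusion itself reduces to a very small calculation, sketched below with `k = 60` as described above; each separate ranking contributes 1 / (k + rank) to a package's fused score.

``` r
# Reciprocal Rank Fusion: 'rank_lists' is a list of integer rank vectors, one
# per combination of input chunk and algorithm, all named by package and in
# the same order. Higher fused scores indicate better overall matches.
rrf_fuse <- function(rank_lists, k = 60) {
    fused <- Reduce(`+`, lapply(rank_lists, function(r) 1 / (k + r)))
    sort(fused, decreasing = TRUE)
}
```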