--- title: "How does pkgmatch work?" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{How does pkgmatch work?} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set ( collapse = TRUE, comment = "#>" ) ``` The "pkgmatch" package package finds packages, as well as individual functions, which best match a given input. Inputs can be text descriptions, sections of code, or even entire R packages. "pkgmatch" finds the most closely matching packages using a combination of Language Models (LMs, or equivalently, "LLMs" for large language models), and traditional [token-frequency algorithms](https://en.wikipedia.org/wiki/Okapi_BM25). This document describes how the "pkgmatch" package works. ## Summary - Input chunking: - Packages as either text or code - Text chunks: - Entire package text, including README, vignettes, and all function documentation entries - Package text without function documentation entries - Code chunks: None; entire package code as sole input chunk - LM sizes: Input = 8,096 tokens, embedding length = 768 - Similarity algorithms: - Cosine similarities from LM embeddings - BM25 values - Final ranking: - "Reciprocal Rank Fusion" algorithm of Cormack, Clarke, & Büttcher (2009). - Reproducibility: Entirely deterministic and reproducible ## Input chunking The LMs used here accept input lengths or contexts of up to 8,096 tokens. That is long enough to accommodate the entire code base of very large R packages, and no chunking is implemented for code inputs. It is possible that this may lead to truncation of code inputs for very large packages, but that should generally be very unlikely. Text input of packages includes text from the package description (in the "DESCRIPTION" file), followed by the "README" file (where present), text of all vignettes, and text of all function descriptions, and parameters of each function. All code is removed from these text sources, including removal of all code chunks in `.Rmd` documents. A reduced version of the input text is also submitted both to LM and token-frequency algorithms, through removing the text of function descriptions, leaving only package description, README and vignette text. Text inputs and outputs are thus considered in two forms: the full form, generally appended with `_with_fns`, and the reduced form appended with `_wo_fns`. ## Inputs as text or code All functions accept an `input` parameter, with an accompanying `input_is_code` logical flag. This flag is important because text inputs are passed to [a different model](https://huggingface.co/jinaai/jina-embeddings-v2-base-en) than [code inputs](https://huggingface.co/jinaai/jina-embeddings-v2-base-code). If not specifically set, the value of `input_is_code` flag is determined by first passing the `input` to an internal function, `text_is_code()`. That function defines inputs as code if the number of recognisable words in the input is equal to or greater than a default threshold of 98%. The [code file in which that function is defined](https://github.com/ropensci-review-tools/pkgmatch/blob/main/R/utils.R) contains a short script at the bottom which was used to define this threshold. This automated distinction between text and code inputs can always be over-ridden by specifying a value for the `input_is_code` parameter. 
## Similarities between inputs and corpora of R packages

The algorithms described here were applied to all R packages from [rOpenSci's suite](https://ropensci.org/packages), as well as from [CRAN](https://cran.r-project.org). Most functions accept a "corpus" parameter, with values of either "ropensci" or "cran", to compare inputs against these specified corpora. The following sections describe the algorithms this package uses to assess similarities between inputs and all packages within these corpora.

### LM similarities

"pkgmatch" accesses LMs through a local [ollama server](https://ollama.com), as described in a [separate vignette](https://docs.ropensci.org/pkgmatch/articles/ollama.html). LMs are used here exclusively to generate "embeddings": numeric vectors of fixed length which represent, in reduced form, the underlying text data within the multi-dimensional tensor space of the model. In short, and to avoid any need to actually understand that sentence: embeddings convert a text input into a numeric vector, and enable different texts to be numerically compared.

The comparisons are implemented here, as in most LM applications, as cosine similarities, which quantify similarity within the multi-dimensional space of the LM. Cosine similarities are generally preferred for LM embeddings because similarities in the *orientation* of two (or more) embedding vectors are generally more informative than similarities in their magnitude (a minimal sketch of this calculation follows below). For a given input and corpus, the output of this LM component is a cosine similarity for each package, along with a corresponding rank, where the first-ranked package is the most similar.

### Token-frequency similarities

Similarities from LMs alone are often not reliable, and better results are often generated by combining LM results with equivalent outputs from alternative algorithms. The "pkgmatch" package therefore also passes all input to an internal version of the ["Best Match 25" (BM25) algorithm](https://en.wikipedia.org/wiki/Okapi_BM25), an algorithm commonly used in conjunction with LM outputs. Like LM cosine similarities, the BM25 algorithm generates vectors of numeric values quantifying the similarity between a given input and all packages (or functions) within the specified corpus. Importantly, the version developed here separates the two steps of the algorithm: calculating inverse document frequencies (IDFs) of all terms in the specified corpora, and then using those IDFs to calculate final scores for input documents. The package is distributed with pre-calculated IDF values for all terms (both text and code) in both corpora. These are automatically downloaded the first time any function is called, and are thereafter immediately available for all subsequent calculations.

#### Tokens for text input

For text input, the BM25 algorithm requires inputs to be converted to "tokens", for which this package relies on [rOpenSci's "tokenizers" package](https://docs.ropensci.org/tokenizers). BM25 values then assess similarity by summing contributions from all tokens shared between two documents, weighted by the inverse of token frequencies across all packages within the specified corpus. This inverse weighting ensures that similarities are influenced more by shared usage of infrequent tokens, and less by usage of common tokens.
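The cosine-similarity comparison described in the LM section above can be made concrete with a minimal sketch. This is an illustration rather than pkgmatch's internal code, and the random vectors merely stand in for the length-768 embeddings returned by the actual models:

```{r, eval = FALSE}
# Cosine similarity quantifies the angle between two embedding vectors,
# ignoring differences in their magnitudes:
cosine_similarity <- function (a, b) {
    sum (a * b) / (sqrt (sum (a^2)) * sqrt (sum (b^2)))
}

# Random stand-ins for the length-768 embeddings returned by the models:
set.seed (1)
emb_input <- rnorm (768)
emb_pkg <- rnorm (768)
cosine_similarity (emb_input, emb_pkg)
# Values near one indicate closely aligned embeddings; packages are then
# ranked in decreasing order of similarity to the input.
```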
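The inverse token weighting of the BM25 step can be illustrated in a similar way. This toy sketch uses a standard BM25-style IDF form over a tiny hand-made corpus; pkgmatch's pre-calculated IDF values for real corpora differ in detail:

```{r, eval = FALSE}
# A toy corpus of three tokenized "documents":
docs <- list (
    c ("read", "csv", "data"),
    c ("plot", "data"),
    c ("fit", "model", "data")
)

# Inverse document frequencies: tokens appearing in fewer documents
# receive higher weights, and so contribute more to similarity scores.
n_docs <- length (docs)
tokens <- unique (unlist (docs))
doc_freq <- vapply (tokens, function (tok) {
    sum (vapply (docs, function (d) tok %in% d, logical (1L)))
}, numeric (1L))
idf <- log ((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
sort (idf, decreasing = TRUE)
# "data" appears in every document, and so receives the lowest weight.
```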
#### Tokens for code input

For code input, tokens for the BM25 algorithm are taken as the names of all functions called, prepended with their namespaces, such as `base::print()`. Function calls are identified with the "tree-sitter" parser, using functionality of the ["treesitter" R package](https://davisvaughan.github.io/r-tree-sitter/). That package is currently restricted to [the R language only](https://github.com/r-lib/tree-sitter-r), and thus code input in this package considers code from `/R` sub-directories only, excluding any source code in other languages.

All function calls in all packages within each corpus are first converted to IDFs. Calculations of BM25 values are weighted by these IDFs so that, as for word-token frequencies, resultant similarities are most strongly affected by similarities in usage of less common functions, and relatively unaffected by similarities in usage of common functions.

## Combining similarities with "rank fusion"

Finally, the best-matching outputs from the various combinations of input chunks and algorithms are combined using the [*Reciprocal Rank Fusion* (RRF) algorithm of Cormack, Clarke, & Büttcher (2009)](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf). Scores for each combination of input chunk and algorithm are converted to ranks, and all ranks are submitted to the RRF algorithm. A default value of `k = 60` is used at all times, as recommended by the original paper, which also suggests that results are largely insensitive to variations in this value. Where inputs are entire packages, final rankings are generated separately for code and text.
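As an illustration of this fusion step (again, not pkgmatch's internal code), the following sketch fuses toy rank vectors using the RRF score, which sums `1 / (k + rank)` across all rankings, with `k = 60`:

```{r, eval = FALSE}
# Reciprocal Rank Fusion: each input ranking contributes 1 / (k + rank),
# with k = 60 as recommended by Cormack, Clarke, & Büttcher (2009):
rrf_rank <- function (ranks_list, k = 60) {
    scores <- Reduce (`+`, lapply (ranks_list, function (r) 1 / (k + r)))
    rank (-scores) # highest fused score becomes rank 1
}

# Toy ranks of five packages under three separate ranking schemes:
ranks <- list (
    lm_with_fns = c (1, 2, 3, 4, 5),
    lm_wo_fns = c (2, 1, 4, 3, 5),
    bm25 = c (1, 3, 2, 5, 4)
)
rrf_rank (ranks)
```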