---
title: "The pkgmatch package"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{pkgmatch}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set (
    collapse = TRUE,
    comment = "#>"
)
```

The "pkgmatch" package is a search and matching engine for R packages. It finds the best-matching R packages to an input of either a text description, or a local path to an R package. `pkgmatch` was developed to enable rOpenSci to identify similar packages to each new package submitted for [our software peer-review scheme](https://ropensci.org/software-review/). By default, matches are found from [rOpenSci's own package suite](https://ropensci.org/packages/), but it is also possible to find matches from all [packages currently on CRAN](https://cran.r-project.org).

## What does the package do?

What the package does is best understood by example, starting with loading the package.

```{r library}
library (pkgmatch)
```

Then match packages to an input string:

```{r match-text-1-fakey, eval = FALSE}
input <- "genomics and transcriptomics sequence data"
pkgmatch_similar_pkgs (input)
```

```{r redef-sim-pkgs1, eval = TRUE, echo = FALSE}
c ("onekp", "UCSCXenaTools", "biomartr", "restez", "DataPackageR")
```

By default, the top five matching packages are printed to the screen. The function actually returns information on all packages, along with a `head` method to display the first few rows:

```{r match-text-1-fakey-return, eval = FALSE}
p <- pkgmatch_similar_pkgs (input)
head (p)
```

```{r match-text-1-return, eval = TRUE, echo = FALSE}
data.frame (
    package = c ("onekp", "UCSCXenaTools", "biomartr", "restez", "DataPackageR"),
    rank = 1:5
)
```

The `head` method also accepts an `n` parameter to control how many rows are displayed, or `as.data.frame` can be used to see the entire `data.frame` of results.
The following lines find equivalent matches against all packages currently on CRAN:

```{r match-text-2-cran-fakey, eval = FALSE}
pkgmatch_similar_pkgs (input, corpus = "cran")
```

```{r redef-sim-pkgs2, eval = TRUE, echo = FALSE}
c ("microseq", "read.gb", "seq2R", "tidysq", "rnaCrosslinkOO")
```

### Using an R package as input

The package also accepts as input a path to a local R package. The following code downloads a "tarball" (`.tar.gz` file) from CRAN and finds matching packages from that corpus. We of course expect the best matches against CRAN packages to include that package itself:

```{r odbc-cran-match-fakey, eval = FALSE}
u <- "https://cran.r-project.org/src/contrib/odbc_1.5.0.tar.gz"
destfile <- file.path (tempdir (), basename (u))
download.file (u, destfile = destfile, quiet = TRUE)
pkgmatch_similar_pkgs (destfile, corpus = "cran")
```

```{r odbc-cran-match, echo = FALSE, eval = TRUE}
list (
    text = c ("odbc", "rocker", "connections", "DatabaseConnector", "DBI"),
    code = c ("odbc", "RMariaDB", "RSQLite", "noctua", "RAthena")
)
```

which they indeed do. As explained in the documentation, the `pkgmatch_similar_pkgs()` function ranks final results by combining several distinct components, primarily from Large Language Model (LLM) embeddings, as well as from [more conventional document token-frequency analyses](https://en.wikipedia.org/wiki/Okapi_BM25).
The rankings from each of these components can be seen as above with the `head` method:

```{r odbc-match-head-fakey, eval = FALSE}
p <- pkgmatch_similar_pkgs (destfile, corpus = "cran")
head (p)
```

```{r odbc-cran-match-head, echo = FALSE, eval = TRUE}
data.frame (
    package = c ("odbc", "rocker", "connections", "DatabaseConnector", "DBI"),
    version = c ("1.5.0", "0.3.1", "0.2.0", "6.3.2", "1.2.3"),
    text_rank = 1:5,
    code_rank = c (1, 1105, 287, 69, 9)
)
```

### Controlling how ranks are combined

As explained in the documentation for the main `pkgmatch_similar_pkgs()` function, ranks for the different components are combined to form a single final ranking using the [Reciprocal Rank Fusion (RRF) algorithm](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf). That function also includes an additional `llm_proportion` parameter which can be used to weight the relative contributions of these different components. Results from the LLM component alone are:

```{r odbc-cran-match-llm-fakey, eval = FALSE}
pkgmatch_similar_pkgs (destfile, corpus = "cran", llm_proportion = 1)
```

```{r odbc-cran-match-llm, echo = FALSE, eval = TRUE}
list (
    text = c ("odbc", "rocker", "FormShare", "CDMConnector", "ODB"),
    code = c ("odbc", "RODBCDBI", "sjdbc", "RODBC", "stacomirtools")
)
```

Results from the other component, which compares relative token frequencies with all CRAN packages, including frequencies of code tokens, are:

```{r odbc-cran-match-bm25-fakey, eval = FALSE}
pkgmatch_similar_pkgs (destfile, corpus = "cran", llm_proportion = 0)
```

```{r odbc-cran-match-bm25, echo = FALSE, eval = TRUE}
list (
    text = c ("odbc", "implyr", "DatabaseConnector", "sparklyr", "gbifdb"),
    code = c ("odbc", "pkgload", "httr2", "gganimate", "gargle")
)
```

The two sets of results differ notably.
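
The Reciprocal Rank Fusion step described above can be illustrated with a minimal base-R sketch. This is not the package's internal code: the helper name `rrf_combine`, the constant `k = 60` (the value suggested in the RRF paper), and the `weights` argument are all assumptions made for illustration, and how `llm_proportion` is actually applied inside `pkgmatch_similar_pkgs()` may well differ. The ranks are the `text_rank` and `code_rank` values shown for the `odbc` example above.

```r
# Reciprocal Rank Fusion: each item scores sum (w / (k + rank)) over all
# rankings, and items are then ordered by decreasing score.
# `rrf_combine`, `k = 60`, and `weights` are illustrative assumptions, not
# part of the pkgmatch API.
rrf_combine <- function (ranks, k = 60, weights = rep (1, ncol (ranks))) {
    scores <- as.matrix (1 / (k + ranks)) %*% weights
    order (scores [, 1], decreasing = TRUE)
}

# Ranks taken from the `head (p)` output for the odbc example
ranks <- data.frame (
    text_rank = 1:5,
    code_rank = c (1, 1105, 287, 69, 9)
)
packages <- c ("odbc", "rocker", "connections", "DatabaseConnector", "DBI")

# Package names re-ordered by their fused rank
fused <- packages [rrf_combine (ranks)]
fused
```

Passing, for example, `weights = c (1, 0)` to this hypothetical helper reproduces the first ranking on its own, which is loosely analogous to pushing `llm_proportion` to one of its extremes.
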
As also explained in the documentation for `pkgmatch_similar_pkgs()`, all internal function calls are locally cached, so that this function can be easily and quickly re-run with different values of `llm_proportion`.