The pkgmatch package

The “pkgmatch” package is a search and matching engine for R packages. It finds the R packages that best match an input, which can be either a text description or a local path to an R package. pkgmatch was developed to enable rOpenSci to identify packages similar to each new package submitted to our software peer-review scheme. By default, matches are found from rOpenSci’s own package suite, but it is also possible to find matches from all packages currently on CRAN.

What does the package do?

What the package does is best understood by example, starting with loading the package.

library (pkgmatch)

Then match packages to an input string:

input <- "genomics and transcriptomics sequence data"
pkgmatch_similar_pkgs (input)
#> [1] "onekp"         "UCSCXenaTools" "biomartr"      "restez"       
#> [5] "DataPackageR"

By default, the names of the top five matching packages are printed to the screen. The function actually returns information on all packages, and the returned object has a head method to display the first few rows:

p <- pkgmatch_similar_pkgs (input)
head (p)
#>         package rank
#> 1         onekp    1
#> 2 UCSCXenaTools    2
#> 3      biomartr    3
#> 4        restez    4
#> 5  DataPackageR    5

The head method also accepts an n parameter to control how many rows are displayed, or as.data.frame can be used to see the entire data.frame of results.
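The behaviour described above (a compact printout of the top matches, with the full results still available through head and as.data.frame) is typical of an S3 class wrapped around a data.frame. The following is a minimal, self-contained sketch of that pattern. It is not pkgmatch's actual implementation; the class name toy_matches and the constructor new_matches are hypothetical and used only for illustration.

```r
# Hypothetical constructor: wraps a data.frame of ranked packages in an S3
# class. Neither the class name nor this function comes from pkgmatch.
new_matches <- function (packages) {
    res <- data.frame (package = packages, rank = seq_along (packages))
    class (res) <- c ("toy_matches", class (res))
    res
}

# Printing shows only the names of the top five matches.
print.toy_matches <- function (x, ...) {
    print (utils::head (x$package, 5L))
    invisible (x)
}

# head() returns the first n rows as a plain data.frame.
# as.data.frame() drops the extra class, leaving an ordinary data.frame.
head.toy_matches <- function (x, n = 5L, ...) {
    utils::head (as.data.frame (x), n = n)
}

m <- new_matches (paste0 ("pkg_", letters [1:8]))
print (m)                     # compact view: top five names only
head (m, n = 3)               # first three rows, with ranks
nrow (as.data.frame (m))      # all eight rows remain available
```

The point of the pattern is that printing stays short while no information is discarded from the returned object.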

The following lines find equivalent matches against all packages currently on CRAN:

pkgmatch_similar_pkgs (input, corpus = "cran")
#> [1] "microseq"       "read.gb"        "seq2R"          "tidysq"        
#> [5] "rnaCrosslinkOO"

Using an R package as input

The package also accepts as input a path to a local R package. The following code downloads a “tarball” (.tar.gz file) from CRAN and finds matching packages from that corpus. We of course expect the best matches against CRAN packages to include that package itself:

u <- "https://cran.r-project.org/src/contrib/odbc_1.5.0.tar.gz"
destfile <- file.path (tempdir (), basename (u))
download.file (u, destfile = destfile, quiet = TRUE)
pkgmatch_similar_pkgs (destfile, corpus = "cran")
#> $text
#> [1] "odbc"              "rocker"            "connections"      
#> [4] "DatabaseConnector" "DBI"              
#> 
#> $code
#> [1] "odbc"     "RMariaDB" "RSQLite"  "noctua"   "RAthena"

And they do: odbc is the top match for both text and code. As explained in the documentation, the pkgmatch_similar_pkgs() function ranks final results by combining several distinct components, primarily Large Language Model (LLM) embeddings, together with more conventional document token-frequency analyses. As above, the head method shows the resulting ranks, here separately for text and code:

p <- pkgmatch_similar_pkgs (destfile, corpus = "cran")
head (p)
#>             package version text_rank code_rank
#> 1              odbc   1.5.0         1         1
#> 2            rocker   0.3.1         2      1105
#> 3       connections   0.2.0         3       287
#> 4 DatabaseConnector   6.3.2         4        69
#> 5               DBI   1.2.3         5         9

Controlling how ranks are combined

As explained in the documentation for the main pkgmatch_similar_pkgs() function, ranks for the different components are combined to form a single final ranking using the Reciprocal Rank Fusion (RRF) algorithm. That function also includes an additional llm_proportion parameter which can be used to weight the relative contributions of these different components. Setting llm_proportion = 1 gives results from the LLM component alone:

pkgmatch_similar_pkgs (destfile, corpus = "cran", llm_proportion = 1)
#> $text
#> [1] "odbc"         "rocker"       "FormShare"    "CDMConnector" "ODB"         
#> 
#> $code
#> [1] "odbc"          "RODBCDBI"      "sjdbc"         "RODBC"        
#> [5] "stacomirtools"

Results from the other component, which compares relative token frequencies against all CRAN packages, including frequencies of code tokens, are obtained with llm_proportion = 0:

pkgmatch_similar_pkgs (destfile, corpus = "cran", llm_proportion = 0)
#> $text
#> [1] "odbc"              "implyr"            "DatabaseConnector"
#> [4] "sparklyr"          "gbifdb"           
#> 
#> $code
#> [1] "odbc"      "pkgload"   "httr2"     "gganimate" "gargle"

There are notable differences between the two sets of results. As also explained in the documentation for pkgmatch_similar_pkgs(), all internal function calls are locally cached, so that this function can be quickly re-run with different values of llm_proportion.
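To make the rank combination more concrete, the following is a minimal sketch of weighted Reciprocal Rank Fusion in base R. In RRF, each component contributes a score of 1 / (k + rank) for each item, and the scores are summed to give the final ordering. Several things here are assumptions rather than details taken from pkgmatch: the function name rrf_combine, the constant k = 60 (a commonly used value), the way llm_proportion is applied as a simple linear weight, and the example ranks, which are invented for illustration.

```r
# Weighted Reciprocal Rank Fusion for two rank vectors.
# Assumptions (not taken from pkgmatch): the function name, k = 60, and the
# use of `llm_proportion` as a linear weight between the two components.
rrf_combine <- function (llm_rank, token_rank, llm_proportion = 0.5, k = 60) {
    stopifnot (length (llm_rank) == length (token_rank))
    stopifnot (llm_proportion >= 0, llm_proportion <= 1)
    # Each component contributes its weight divided by (k + rank)
    score <- llm_proportion / (k + llm_rank) +
        (1 - llm_proportion) / (k + token_rank)
    # Higher score means better match, so rank on the negated score
    rank (-score, ties.method = "first")
}

# Invented example ranks for four hypothetical packages
pkgs <- c ("pkg_a", "pkg_b", "pkg_c", "pkg_d")
llm_rank <- c (1, 2, 3, 4)
token_rank <- c (1, 4, 2, 3)

final <- rrf_combine (llm_rank, token_rank, llm_proportion = 0.5)
pkgs [order (final)]
#> [1] "pkg_a" "pkg_c" "pkg_b" "pkg_d"
```

In this sketch, llm_proportion = 1 reproduces the LLM ordering and llm_proportion = 0 reproduces the token-frequency ordering, which mirrors the two extremes shown above.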