---
title: "The pkgmatch package"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{pkgmatch}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set (
    collapse = TRUE,
    comment = "#>"
)
```

The `pkgmatch` package is a search and matching engine for R packages. It finds
the R packages that best match an input, which may be either a text description
or a local path to an R package. `pkgmatch` was developed to enable rOpenSci to
identify packages similar to each new package submitted for [our software
peer-review scheme](https://ropensci.org/software-review/). By default, matches
are found from [rOpenSci's own package suite](https://ropensci.org/packages/),
but it is also possible to find matches from all [packages currently on
CRAN](https://cran.r-project.org).

## What does the package do?

What the package does is best understood by example, starting with loading the package.

```{r library}
library (pkgmatch)
```

Then match packages to an input string:

```{r match-text-1-fakey, eval = FALSE}
input <- "genomics and transcriptomics sequence data"
pkgmatch_similar_pkgs (input)
```

```{r redef-sim-pkgs1, eval = TRUE, echo = FALSE}
c ("onekp", "UCSCXenaTools", "biomartr", "restez", "DataPackageR")
```

By default, the top five matching packages are printed to the screen. The
function itself returns information on all packages, and the returned object
has a `head` method to display the first few rows:

```{r match-text-1-fakey-return, eval = FALSE}
p <- pkgmatch_similar_pkgs (input)
head (p)
```
```{r match-text-1-return, eval = TRUE, echo = FALSE}
data.frame (
    package = c ("onekp", "UCSCXenaTools", "biomartr", "restez", "DataPackageR"),
    rank = 1:5
)
```

The `head` method also accepts an `n` parameter to control how many rows are
displayed, or `as.data.frame` can be used to see the entire `data.frame` of
results.
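
For example, reusing the object `p` from above, the two options just described
look like this. The chunk is not evaluated here, and the value `n = 10` is an
arbitrary choice for illustration:

```{r match-text-1-head-n-fakey, eval = FALSE}
head (p, n = 10) # display the first ten rows
as.data.frame (p) # full data.frame of results for all packages
```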

The following call finds equivalent matches against all packages currently on
CRAN:

```{r match-text-2-cran-fakey, eval = FALSE}
pkgmatch_similar_pkgs (input, corpus = "cran")
```

```{r redef-sim-pkgs2, eval = TRUE, echo = FALSE}
c ("microseq", "read.gb", "seq2R", "tidysq", "rnaCrosslinkOO")
```

### Using an R package as input

The package also accepts as input a path to a local R package. The following
code downloads a "tarball" (`.tar.gz` file) from CRAN and finds matching
packages from that corpus. We of course expect the best matches against CRAN
packages to include that package itself:

```{r odbc-cran-match-fakey, eval = FALSE}
u <- "https://cran.r-project.org/src/contrib/odbc_1.5.0.tar.gz"
destfile <- file.path (tempdir (), basename (u))
download.file (u, destfile = destfile, quiet = TRUE)
pkgmatch_similar_pkgs (destfile, corpus = "cran")
```

```{r odbc-cran-match, echo = FALSE, eval = TRUE}
list (
    text = c ("odbc", "rocker", "connections", "DatabaseConnector", "DBI"),
    code = c ("odbc", "sparklyr", "noctua", "RAthena", "pkgcache")
)
```

which they indeed do. As explained in the documentation, the
`pkgmatch_similar_pkgs()` function ranks final results by combining several
distinct components, primarily from Language Model (LM) embeddings, as
well as from [more conventional document token-frequency
analyses](https://en.wikipedia.org/wiki/Okapi_BM25). The rankings from each of
these components can be seen as above with the `head` method:

```{r odbc-match-head-fakey, eval = FALSE}
p <- pkgmatch_similar_pkgs (destfile, corpus = "cran")
head (p)
```
```{r odbc-cran-match-head, echo = FALSE, eval = TRUE}
data.frame (
    package = c ("odbc", "rocker", "connections", "DatabaseConnector", "DBI"),
    version = c ("1.5.0", "0.3.1", "0.2.0", "6.3.2", "1.2.3"),
    text_rank = 1:5,
    code_rank = c (1, 1183, 256, 64, 102)
)
```
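
To give a rough idea of what the token-frequency component measures, the
following is a minimal sketch of Okapi BM25 scoring in base R. It is not the
implementation used in `pkgmatch`. The function name `bm25_sketch`, the toy
token lists, and the parameter values `k1 = 1.2` and `b = 0.75` (conventional
defaults) are all assumptions made for illustration only.

```{r bm25-sketch}
# Minimal BM25 sketch (illustrative only; not the pkgmatch implementation).
# `docs` is a named list of character vectors of tokens, and `query` is a
# character vector of tokens. `k1` and `b` are conventional default values.
bm25_sketch <- function (query, docs, k1 = 1.2, b = 0.75) {
    n_docs <- length (docs)
    doc_len <- vapply (docs, length, integer (1))
    avg_len <- mean (doc_len)
    query <- unique (query)

    # Number of documents containing each query token
    doc_freq <- vapply (query, function (q) {
        as.numeric (sum (vapply (docs, function (d) q %in% d, logical (1))))
    }, numeric (1))
    # Inverse document frequency, in the non-negative "+ 1" form
    idf <- log ((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)

    scores <- vapply (seq_along (docs), function (i) {
        # Frequency of each query token in document i
        tf <- vapply (query, function (q) {
            as.numeric (sum (docs [[i]] == q))
        }, numeric (1))
        # Longer-than-average documents are penalised via `b`
        len_norm <- 1 - b + b * doc_len [i] / avg_len
        sum (idf * tf * (k1 + 1) / (tf + k1 * len_norm))
    }, numeric (1))
    names (scores) <- names (docs)
    sort (scores, decreasing = TRUE)
}

# Hypothetical toy corpus of three "packages"
docs <- list (
    pkg_a = c ("database", "connection", "driver", "odbc"),
    pkg_b = c ("spatial", "raster", "plot"),
    pkg_c = c ("database", "query", "table", "database")
)
bm25_scores <- bm25_sketch (c ("database", "odbc"), docs)
bm25_scores
```

In this toy corpus, `pkg_a` scores highest because it contains both query
tokens, while `pkg_b` contains neither and so scores zero. Sorting the scores
gives a ranking that could then be combined with rankings from other
components.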

### Controlling how ranks are combined

As explained in the documentation for the main `pkgmatch_similar_pkgs()`
function, ranks for the different components are combined to form a single
final ranking using the [Reciprocal Rank Fusion (RRF)
algorithm](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf). That
function also includes an additional `lm_proportion` parameter which can be
used to weight the relative contributions of these different components.
Results from the LM component alone are:

```{r odbc-cran-match-lm-fakey, eval = FALSE}
pkgmatch_similar_pkgs (destfile, corpus = "cran", lm_proportion = 1)
```
```{r odbc-cran-match-lm, echo = FALSE, eval = TRUE}
list (
    text = c ("odbc", "rocker", "FormShare", "CDMConnector", "ODB"),
    code = c ("odbc", "RODBCDBI", "sjdbc", "RODBC", "stacomirtools")
)
```

Results from the other component alone, which compares relative token
frequencies with all CRAN packages, including frequencies of code tokens, are:

```{r odbc-cran-match-bm25-fakey, eval = FALSE}
pkgmatch_similar_pkgs (destfile, corpus = "cran", lm_proportion = 0)
```
```{r odbc-cran-match-bm25, echo = FALSE, eval = TRUE}
list (
    text = c ("odbc", "implyr", "DatabaseConnector", "sparklyr", "gbifdb"),
    code = c ("odbc", "rsconnect", "pkgload", "xfun", "pak")
)
```

There are notable differences between the two sets of results. As also
explained in the documentation for `pkgmatch_similar_pkgs()`, all internal
function calls are locally cached, so that this function can be easily and
quickly re-run with different values of `lm_proportion`.
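
The effect of weighting two rankings before combining them can be illustrated
with a minimal sketch of Reciprocal Rank Fusion in base R. This is not the code
used inside `pkgmatch`. The function name `rrf_sketch`, the toy ranks, the
column names `lm_rank` and `bm25_rank`, and the way the weights enter the sum
are all assumptions for illustration. The constant `k = 60` is the value used in
the RRF paper linked above.

```{r rrf-sketch}
# Reciprocal Rank Fusion sketch (illustrative only; not the pkgmatch code).
# `ranks` is a data.frame with one column of ranks per component, and
# `weights` gives the assumed relative contribution of each column.
rrf_sketch <- function (ranks, weights = rep (1, ncol (ranks)), k = 60) {
    stopifnot (length (weights) == ncol (ranks))
    # Each component contributes 1 / (k + rank) for each package
    recip <- 1 / (k + as.matrix (ranks))
    scores <- as.numeric (recip %*% weights)
    names (scores) <- rownames (ranks)
    # Highest fused score receives rank 1
    rank (-scores, ties.method = "min")
}

# Hypothetical ranks for four packages from two components
toy <- data.frame (
    lm_rank = c (1, 2, 3, 4),
    bm25_rank = c (1, 4, 2, 3),
    row.names = c ("pkg_a", "pkg_b", "pkg_c", "pkg_d")
)
w <- 0.5 # hypothetical weight, playing the role of `lm_proportion`
rrf_sketch (toy, weights = c (w, 1 - w))
```

Under these assumptions, a weight of 1 reproduces the `lm_rank` ordering and a
weight of 0 reproduces the `bm25_rank` ordering, while intermediate values
blend the two.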