Title: | Find R Packages Matching Either Descriptions or Other R Packages |
---|---|
Description: | Find R packages matching either descriptions or other R packages. |
Authors: | Mark Padgham [aut, cre] , Davis Vaughan [ctb] |
Maintainer: | Mark Padgham <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.4.2.007 |
Built: | 2024-11-11 11:14:04 UTC |
Source: | https://github.com/ropensci-review-tools/pkgmatch |
Head method for 'pkgmatch' objects
## S3 method for class 'pkgmatch'
head(x, n = 5L, ...)
x |
Object for which head is to be printed |
n |
Number of rows of full |
... |
Not used |
A (usually) smaller version of x, with all columns displayed.
Other utils: pkgmatch_browse(), pkgmatch_load_data(), print.pkgmatch(), text_is_code()
## Not run:
input <- "Download open spatial data from NASA"
p <- pkgmatch_similar_pkgs (input)
p # Default print method, lists 5 best matching packages
head (p) # Shows first 5 rows of full `data.frame` object
## End(Not run)
Check that ollama is installed with required models, and download if not.
ollama_check(sudo = is_docker_sudo())
sudo |
Set to |
TRUE if everything works okay, otherwise the function will error before returning.
## Not run:
chk <- ollama_check ()
## End(Not run)
See https://en.wikipedia.org/wiki/Okapi_BM25.
pkgmatch_bm25(input, txt = NULL, idfs = NULL, corpus = "ropensci")
input |
A single character string to match against the second parameter of all input documents. |
txt |
An optional list of input documents. If not specified, data will
be loaded as specified by the |
idfs |
Optional list of Inverse Document Frequency weightings generated
by the internal |
corpus |
If |
A data.frame of package names and 'BM25' measures against text from whole packages both with and without function descriptions.
Other bm25:
pkgmatch_bm25_fn_calls()
## Not run:
input <- "Download open spatial data from NASA"
bm25 <- pkgmatch_bm25 (input)
# Or pre-load document-frequency weightings:
idfs <- pkgmatch_load_data ("idfs", fns = FALSE)
bm25 <- pkgmatch_bm25 (input, idfs = idfs)
## End(Not run)
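The general shape of a BM25 calculation can be sketched in base R. This is only an illustration of the Okapi BM25 formula linked above, not the implementation used by pkgmatch: the helper name `bm25_scores`, the simple tokenizer, and the parameter defaults `k1 = 1.2` and `b = 0.75` are all assumptions.

```r
# Minimal BM25 sketch in base R. `bm25_scores`, `k1`, and `b` are
# illustrative names and defaults, not part of the pkgmatch API.
bm25_scores <- function (query, docs, k1 = 1.2, b = 0.75) {
    # Lower-case and split on anything that is not a letter or digit
    tokenize <- function (x) {
        tok <- strsplit (tolower (x), "[^a-z0-9]+") [[1]]
        tok [nzchar (tok)]
    }
    doc_tokens <- lapply (docs, tokenize)
    q_tokens <- unique (tokenize (query))
    n_docs <- length (docs)
    doc_len <- vapply (doc_tokens, length, integer (1))
    avg_len <- mean (doc_len)

    # Inverse document frequency for each query token
    n_containing <- vapply (q_tokens, function (tok) {
        sum (vapply (doc_tokens, function (d) tok %in% d, logical (1)))
    }, numeric (1))
    idf <- log ((n_docs - n_containing + 0.5) / (n_containing + 0.5) + 1)

    # Sum the length-normalised term-frequency contributions per document
    scores <- vapply (seq_along (docs), function (i) {
        tf <- vapply (q_tokens, function (tok) {
            sum (doc_tokens [[i]] == tok)
        }, numeric (1))
        denom <- tf + k1 * (1 - b + b * doc_len [i] / avg_len)
        sum (idf * tf * (k1 + 1) / denom)
    }, numeric (1))

    data.frame (document = names (docs), bm25 = scores)
}

# Toy corpus of three made-up package descriptions
docs <- c (
    pkg_a = "Download open spatial data from NASA satellites",
    pkg_b = "Fit Bayesian regression models",
    pkg_c = "Access spatial data from open government portals"
)
res <- bm25_scores ("Download open spatial data from NASA", docs)
res [order (-res$bm25), ]
```

Documents sharing no tokens with the query score zero, and documents sharing more (and rarer) tokens score higher.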
Calculate a "BM25" index from function-call frequencies between a local R package and all packages in the specified corpus.
pkgmatch_bm25_fn_calls(path, corpus = "ropensci")
path |
Local path to source code of an R package. |
corpus |
One of "ropensci" or "cran" |
A data.frame of two columns:
"package": Naming the package from the specified corpus;
"bm25": The "BM25" index value for the nominated packages, where high values indicate greater overlap in term frequencies.
Other bm25:
pkgmatch_bm25()
## Not run:
u <- "https://cran.r-project.org/src/contrib/odbc_1.5.0.tar.gz"
path <- file.path (tempdir (), basename (u))
download.file (u, destfile = path)
bm25 <- pkgmatch_bm25_fn_calls (path)
## End(Not run)
Open web pages for pkgmatch results
pkgmatch_browse(p, n = NULL)
p |
A |
n |
Number of top-matching entries which should be opened. Defaults to the value passed to the main functions. |
(Invisibly) A named vector of integers, with 0 for all pages able to be successfully opened, and 1 otherwise.
Other utils: head.pkgmatch(), pkgmatch_load_data(), print.pkgmatch(), text_is_code()
## Not run:
input <- "genomics and transcriptomics sequence data"
p <- pkgmatch_similar_pkgs (input)
pkgmatch_browse (p) # Open main package pages on rOpenSci
p <- pkgmatch_similar_pkgs (input, corpus = "cran")
pkgmatch_browse (p) # Open main package pages on CRAN
p <- pkgmatch_similar_fns (input)
pkgmatch_browse (p) # Open pages for best-matching rOpenSci functions
## End(Not run)
The embeddings are currently retrieved from a local 'ollama' server running Jina AI embeddings.
pkgmatch_embeddings_from_pkgs(packages = NULL, functions_only = FALSE)
packages |
A vector of local paths to directories containing R packages. |
functions_only |
If |
If !functions_only, a list of two matrices of embeddings: one for the text descriptions of the specified packages, including individual descriptions of all package functions, and one for the entire code base. For functions_only, a single matrix of embeddings for all function descriptions.
Other embeddings:
pkgmatch_embeddings_from_text()
## Not run:
packages <- c ("cli", "fs")
emb_fns <- pkgmatch_embeddings_from_pkgs (packages, functions_only = TRUE)
colnames (emb_fns) # All functions of the two packages
emb_pkg <- pkgmatch_embeddings_from_pkgs (packages, functions_only = FALSE)
names (emb_pkg) # text_with_fns, text_wo_fns, code
colnames (emb_pkg$text_with_fns) # cli, fs
## End(Not run)
The embeddings are currently retrieved from a local 'ollama' server running Jina AI embeddings.
pkgmatch_embeddings_from_text(input = NULL)
input |
A vector of one or more text strings for which embeddings are to be extracted. |
A matrix of embeddings, one column for each input item, and a fixed number of rows defined by the embedding length of the language models.
Other embeddings:
pkgmatch_embeddings_from_pkgs()
## Not run:
input <- "Download open spatial data from NASA"
emb <- pkgmatch_embeddings_from_text (input = input)
## End(Not run)
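Embedding matrices of this shape (one column per item) are typically compared using cosine similarity. The following base-R sketch shows one way to do that; `cosine_sim` is a hypothetical helper, not a pkgmatch function, and the toy three-row matrix stands in for real embeddings, which have many more rows.

```r
# Cosine similarity between one query embedding and a matrix of embeddings
# stored column-wise (one column per item). `cosine_sim` is an illustrative
# helper, not a pkgmatch function.
cosine_sim <- function (query, emb) {
    stopifnot (length (query) == nrow (emb))
    dots <- as.vector (crossprod (emb, query)) # t(emb) %*% query
    norms <- sqrt (colSums (emb^2)) * sqrt (sum (query^2))
    stats::setNames (dots / norms, colnames (emb))
}

# Toy 3-dimensional "embeddings" for three made-up packages
emb <- cbind (
    pkg_a = c (1, 0, 0),
    pkg_b = c (0, 1, 0),
    pkg_c = c (1, 1, 0)
)
query <- c (2, 0, 0)
sims <- cosine_sim (query, emb)
sort (sims, decreasing = TRUE)
```

Because cosine similarity ignores vector length, the query `c (2, 0, 0)` matches `pkg_a` perfectly even though their magnitudes differ.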
Load embeddings generated by the pkgmatch_embeddings_from_pkgs function, either for all rOpenSci packages or, if fns = TRUE, all individual functions within those packages.
pkgmatch_load_data(
  what = "embeddings",
  corpus = "ropensci",
  fns = FALSE,
  raw = FALSE
)
what |
One of:
|
corpus |
If |
fns |
If |
raw |
Only has effect of |
The loaded data.frame.
Other utils: head.pkgmatch(), pkgmatch_browse(), print.pkgmatch(), text_is_code()
## Not run:
embeddings <- pkgmatch_load_data ("embeddings")
embeddings_fns <- pkgmatch_load_data ("embeddings", fns = TRUE)
idfs <- pkgmatch_load_data ("idfs")
idfs_fns <- pkgmatch_load_data ("idfs", fns = TRUE)
## End(Not run)
Function matching is only available for functions from the corpus of rOpenSci packages.
pkgmatch_similar_fns(input, embeddings = NULL, n = 5L, browse = FALSE)
input |
A text string. |
embeddings |
Large Language Model embeddings for all rOpenSci packages, generated from pkgmatch_embeddings_from_pkgs. If not provided, pre-generated embeddings will be downloaded and stored in a local cache directory. |
n |
When the result of this function is printed to screen, the top |
browse |
If |
A character vector of function names in the form
"
Other main:
pkgmatch_similar_pkgs()
## Not run:
input <- "Process raster satellite images"
p <- pkgmatch_similar_fns (input)
p # Default print method, lists 5 best matching functions
head (p) # Shows first 5 rows of full `data.frame` object
## End(Not run)
This function accepts as input
either a text description, or
a path to a local R package, and returns information on R packages which
best match that input. Matches are found from within a specified "corpus",
currently all packages from either rOpenSci's package suite, or from
CRAN.
The returned object has a default print method which prints the best 5 matches directly to the screen, yet returns information on all packages within the specified corpus. This information is in the form of a data.frame, with one column for the package name, and one or more additional columns of integer ranks for each package. There is also a head method to print the first few entries of these full data (default n = 5). To see all data, use as.data.frame().
Ranks are obtained from scores derived from:
Cosine similarities between Language Model (LM) embeddings for the input, and corresponding embeddings for the specified corpus.
"Best Match 25" (BM25) scores based on document token frequencies.
Ranks for text matches are generally obtained from packages both including and excluding function descriptions as part of the package text. This results in up to four scores for each input. These scores are then combined to a final ranking using the Reciprocal Rank Fusion (RRF) algorithm. The additional parameter of lm_proportion determines the extent to which the final ranking weights the LM versus BM25 components.
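To make the role of lm_proportion concrete, here is a minimal base-R sketch of a weighted Reciprocal Rank Fusion. It is not taken from the pkgmatch source: the helper name `rrf_combine`, the constant `k = 60` (a conventional choice for RRF), and the way the weight is applied are all assumptions for illustration.

```r
# Weighted Reciprocal Rank Fusion. The weighting scheme and k = 60 are
# assumptions for illustration; pkgmatch's internal details may differ.
rrf_combine <- function (ranks_lm, ranks_bm25, lm_proportion = 0.5, k = 60) {
    stopifnot (lm_proportion >= 0, lm_proportion <= 1)
    stopifnot (identical (names (ranks_lm), names (ranks_bm25)))
    # Each ranking contributes 1 / (k + rank), scaled by its weight
    score <- lm_proportion / (k + ranks_lm) +
        (1 - lm_proportion) / (k + ranks_bm25)
    # Highest fused score gets final rank 1
    sort (rank (-score, ties.method = "min"))
}

# Made-up ranks for three packages from the two kinds of model
ranks_lm <- c (pkg_a = 1, pkg_b = 2, pkg_c = 3)
ranks_bm25 <- c (pkg_a = 3, pkg_b = 1, pkg_c = 2)
r_half <- rrf_combine (ranks_lm, ranks_bm25, lm_proportion = 0.5)
r_lm <- rrf_combine (ranks_lm, ranks_bm25, lm_proportion = 1) # LM ranks only
r_half
r_lm
```

In this sketch, setting the weight to 1 reproduces the LM ordering alone, while intermediate values blend the two rankings.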
Finally, all components of this function are locally cached for each call (by the memoise package), so additional calls to this function with the same input and corpus should be much faster than initial calls. This means the effect of changing lm_proportion can easily be examined by simply repeating calls to this function.
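The caching idea can be illustrated with a small base-R memoisation sketch. This is not how the memoise package is implemented; `memoise_simple`, `slow_score`, and the call counter are hypothetical names used only to show that a repeated call with identical arguments skips the expensive computation.

```r
# Base-R memoisation sketch; pkgmatch itself relies on the memoise package.
# `memoise_simple` and `slow_score` are hypothetical names.
memoise_simple <- function (f) {
    cache <- new.env (parent = emptyenv ())
    function (...) {
        # Build a cache key from the deparsed arguments
        key <- paste (deparse (list (...)), collapse = "")
        if (!exists (key, envir = cache, inherits = FALSE)) {
            assign (key, f (...), envir = cache)
        }
        get (key, envir = cache, inherits = FALSE)
    }
}

n_calls <- 0
slow_score <- function (input) {
    n_calls <<- n_calls + 1 # count real evaluations
    nchar (input)
}
fast_score <- memoise_simple (slow_score)
first <- fast_score ("Download open spatial data from NASA")
second <- fast_score ("Download open spatial data from NASA") # from cache
n_calls
```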
pkgmatch_similar_pkgs(
  input,
  corpus = "ropensci",
  embeddings = NULL,
  idfs = NULL,
  input_is_code = text_is_code(input),
  lm_proportion = 0.5,
  n = 5L,
  browse = FALSE
)
input |
Either a path to local source code of an R package, or a text string. |
corpus |
If |
embeddings |
Large Language Model embeddings for all rOpenSci packages, generated from pkgmatch_embeddings_from_pkgs. If not provided, pre-generated embeddings will be downloaded and stored in a local cache directory. |
idfs |
Inverse Document Frequency tables for all rOpenSci packages, generated from pkgmatch_bm25. If not provided, pre-generated IDF tables will be downloaded and stored in a local cache directory. |
input_is_code |
A binary flag indicating whether |
lm_proportion |
A value between 0 and 1 to control the relative
contributions of results from Language Models ("LMs") versus results from
traditional token-frequency models. Final rankings are generated by
combining these two kinds of results, so that |
n |
When the result of this function is printed to screen, the top |
browse |
If |
A data.frame with a "package" column naming packages, and one or more columns of package ranks in terms of text similarity and, if input is a local path to an entire R package, of similarity in code structure. As described above, the default print method prints package names only. To see full result, use as.data.frame().
The first time this function is run without passing either embeddings or idfs, required values will be automatically downloaded and stored in a locally persistent cache directory. Especially for the "cran" corpus, this downloading may take quite some time.
Other main:
pkgmatch_similar_fns()
## Not run:
input <- "Download open spatial data from NASA"
p <- pkgmatch_similar_pkgs (input)
p # Default print method, lists 5 best matching packages
head (p) # Shows first 5 rows of full `data.frame` object
# This second call will be much faster than first call:
p2 <- pkgmatch_similar_pkgs (input, lm_proportion = 0.25)
## End(Not run)
Use "treesitter" to tag all function calls made within a local package, and to associate those calls with package namespaces. This is used as input to the pkgmatch_bm25_fn_calls function.
pkgmatch_treesitter_fn_tags(path)
path |
Path to local package, or |
A data.frame of all function calls made within the package, with the following columns:
"fn": Name of the package function within which call is made, including namespace identifiers of "::" for exported functions and ":::" for non-exported functions.
"name": Name of function being called, including namespace.
"start": Byte number within file corresponding to start of definition.
"end": Byte number within file corresponding to end of definition.
"file": Name of file in which fn call is defined.
## Not run:
u <- "https://cran.r-project.org/src/contrib/odbc_1.5.0.tar.gz"
path <- file.path (tempdir (), basename (u))
download.file (u, destfile = path)
tags <- pkgmatch_treesitter_fn_tags (path)
## End(Not run)
This function is intended for internal rOpenSci use only. Usage by any unauthorized users will error and have no effect unless run with upload = FALSE, in which case updated data will be created in the sub-directory "pkgmatch-results" of R's current temporary directory. This updating may take a very long time!
pkgmatch_update_data(upload = TRUE)
upload |
If |
Local path to directory containing updated results.
## Not run:
pkgmatch_update_data (upload = FALSE)
## End(Not run)
Print method for 'pkgmatch' objects
## S3 method for class 'pkgmatch'
print(x, ...)
x |
Object to be printed |
... |
Additional parameters passed to default 'print' method. |
The result of printing x, in form of either a single character vector, or a named list of character vectors.
Other utils: head.pkgmatch(), pkgmatch_browse(), pkgmatch_load_data(), text_is_code()
## Not run:
input <- "Download open spatial data from NASA"
p <- pkgmatch_similar_pkgs (input)
p # Default print method, lists 5 best matching packages
head (p) # Shows first 5 rows of full `data.frame` object
## End(Not run)
This check is only approximate: some software packages yield false negatives and are identified as prose (such as rOpenSci's "geonames" package), and some prose may be wrongly identified as code.
text_is_code(txt)
txt |
Single input text string |
Logical value indicating whether or not txt was identified as code.
Other utils: head.pkgmatch(), pkgmatch_browse(), pkgmatch_load_data(), print.pkgmatch()
txt <- "Some text without any code"
text_is_code (txt)
txt <- "this_is_code <- function (x) { x }"
text_is_code (txt)