Title: | Managing Larger Data on a GitHub Repository |
---|---|
Description: | Helps store files as GitHub release assets, which is a convenient way for large/binary data files to piggyback onto public and private GitHub repositories. Includes functions for file downloads, uploads, and managing releases via the GitHub API. |
Authors: | Carl Boettiger [aut, cre, cph], Tan Ho [aut], Mark Padgham [ctb], Jeffrey O Hanson [ctb], Kevin Kuo [ctb] |
Maintainer: | Carl Boettiger <[email protected]> |
License: | GPL-3 |
Version: | 0.1.5.9005 |
Built: | 2025-01-19 06:17:34 UTC |
Source: | https://github.com/ropensci/piggyback |
Because larger (> 50 MB) data files cannot easily be committed to git, a different approach is required to manage data associated with an analysis in a GitHub repository. This package provides a simple work-around by allowing larger (up to 2 GB) data files to piggyback on a repository as assets attached to individual GitHub releases. These files are not handled by git in any way, but instead are uploaded, downloaded, or edited directly by calls through the GitHub API. These data files can be versioned manually by creating different releases. This approach works equally well with public or private repositories. Data can be uploaded and downloaded programmatically from scripts. No authentication is required to download data from public repositories.
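The workflow described above can be sketched as a pair of small helper functions. This is a minimal sketch, not part of the package: publish_data() and fetch_data() are hypothetical names, and the repository and tag are placeholders you would supply. It relies only on pb_release_create(), pb_upload(), and pb_download() as documented below, and uploading assumes a GitHub token with write access is available.

```r
# Hypothetical helpers illustrating the release-asset workflow.
# Nothing here contacts GitHub until the functions are called.

# Optionally create a release, then attach a local file to it.
publish_data <- function(path, repo, tag, create_release = FALSE) {
  stopifnot(file.exists(path))
  if (create_release) {
    # A release must exist before assets can be uploaded to it.
    piggyback::pb_release_create(repo = repo, tag = tag)
  }
  piggyback::pb_upload(path, repo = repo, tag = tag)
}

# Download one asset from a release into a local directory and
# return the path where it is assumed to have been written.
fetch_data <- function(file, repo, tag, dest = tempdir()) {
  piggyback::pb_download(file, dest = dest, repo = repo, tag = tag)
  file.path(dest, file)
}
```

A call such as publish_data("mtcars.tsv.gz", repo = "owner/repo", tag = "v0.0.1", create_release = TRUE) would then be paired with fetch_data("mtcars.tsv.gz", repo = "owner/repo", tag = "v0.0.1") in the analysis script, where "owner/repo" stands in for a real repository.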
Maintainer: Carl Boettiger [email protected] (ORCID) [copyright holder]
Authors:
Tan Ho (ORCID)
Other contributors:
Mark Padgham (ORCID) [contributor]
Jeffrey O Hanson (ORCID) [contributor]
Kevin Kuo (ORCID) [contributor]
Useful links:
Report bugs at https://github.com/ropensci/piggyback/issues
Reads the environment variable GITHUB_API_URL to determine the base URL of the API, the same as the gh package. Defaults to https://api.github.com.
.gh_api_url()
string: API base url
https://gh.r-lib.org/#environment-variables
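The lookup described above can be illustrated with base R alone. This is a sketch under stated assumptions: api_base_url() is a hypothetical stand-in (the package's internal implementation may differ), and the enterprise URL is a placeholder.

```r
# Hypothetical stand-in for the environment-variable lookup.
api_base_url <- function() {
  # `unset` is returned only when the variable is not set at all.
  Sys.getenv("GITHUB_API_URL", unset = "https://api.github.com")
}

# Remember any existing value so it can be restored afterwards.
old <- Sys.getenv("GITHUB_API_URL", unset = NA)

# Point at a GitHub Enterprise host (placeholder URL) for this session.
Sys.setenv(GITHUB_API_URL = "https://github.example.com/api/v3")
enterprise_url <- api_base_url()

# With the variable removed, the public API is used.
Sys.unsetenv("GITHUB_API_URL")
default_url <- api_base_url()

# Restore the caller's original setting, if there was one.
if (!is.na(old)) Sys.setenv(GITHUB_API_URL = old)
```

Setting the variable in .Renviron rather than per session is the more common choice when every call should target the same host.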
Delete an asset attached to a release
pb_delete(
  file = NULL,
  repo = guess_repo(),
  tag = "latest",
  .token = gh::gh_token()
)
file | file(s) to be deleted from the release. If |
repo | string: GH repository name in format "owner/repo". Default |
tag | string: tag for the GH release, defaults to "latest" |
.token | GitHub authentication token, see |
TRUE (invisibly) if a file is found and deleted. Otherwise, returns NULL (invisibly) if no file matching the name was found.
## Not run:
readr::write_tsv(mtcars, "mtcars.tsv.gz")
## Upload
pb_upload("mtcars.tsv.gz", repo = "cboettig/piggyback-tests", overwrite = TRUE)
pb_delete("mtcars.tsv.gz", repo = "cboettig/piggyback-tests", tag = "v0.0.1")
## End(Not run)
Download data from an existing release
pb_download(
  file = NULL,
  dest = ".",
  repo = guess_repo(),
  tag = "latest",
  overwrite = TRUE,
  ignore = "manifest.json",
  use_timestamps = TRUE,
  show_progress = getOption("piggyback.verbose", default = interactive()),
  .token = gh::gh_token()
)
file | character: vector of names of files to be downloaded. If |
dest | character: path to destination directory (if length one) or vector of destination filepaths the same length as |
repo | string: GH repository name in format "owner/repo". Default |
tag | string: tag for the GH release, defaults to "latest" |
overwrite | boolean: should any local files of the same name be overwritten? default |
ignore | character: vector of files to ignore (used if downloading "all" via |
use_timestamps | DEPRECATED. |
show_progress | logical: should a progress bar be shown for downloading? Defaults to |
.token | GitHub authentication token, see |
## Download a specific file.
## (if dest is omitted, will write to current directory)
dest <- tempdir()
piggyback::pb_download(
  "iris.tsv.gz",
  repo = "cboettig/piggyback-tests",
  tag = "v0.0.1",
  dest = dest
)
list.files(dest)

## Download all files
piggyback::pb_download(
  repo = "cboettig/piggyback-tests",
  tag = "v0.0.1",
  dest = dest
)
list.files(dest)
Returns the download URL for a given file. This can be useful when using functions that are able to accept URLs.
pb_download_url(
  file = NULL,
  repo = guess_repo(),
  tag = "latest",
  url_type = c("browser", "api"),
  .token = gh::gh_token()
)
file | character: vector of names of files to be downloaded. If |
repo | string: GH repository name in format "owner/repo". Default |
tag | string: tag for the GH release, defaults to "latest" |
url_type | choice: one of "browser" or "api". The default, "browser", is a web-facing URL that is not subject to API rate limits but does not work for private repositories. "api" URLs work for private repositories, but require a GitHub token passed in an Authorization header (see examples) |
.token | GitHub authentication token, see |
the URL to download a file
# returns browser url by default (and all files if none are specified)
browser_url <- pb_download_url(
  repo = "tanho63/piggyback-tests",
  tag = "v0.0.2"
)
print(browser_url)
utils::read.csv(browser_url[[1]])

# can return api url if desired
api_url <- pb_download_url(
  "mtcars.csv",
  repo = "tanho63/piggyback-tests",
  tag = "v0.0.2"
)
print(api_url)

# for public repositories, this will still work
utils::read.csv(api_url)

# for private repos, can use httr or curl to fetch and then pass into read function
gh_pat <- Sys.getenv("GITHUB_PAT")
if (!identical(gh_pat, "")) {
  resp <- httr::GET(api_url, httr::add_headers(Authorization = paste("Bearer", gh_pat)))
  utils::read.csv(text = httr::content(resp, as = "text"))
}

# or use pb_read which bundles some of this for you
List all assets attached to a release
pb_list(repo = guess_repo(), tag = NULL, .token = gh::gh_token())
repo | string: GH repository name in format "owner/repo". Default |
tag | which release tag(s) do we want information for? If |
.token | GitHub authentication token, see |
a data.frame of release asset names, release tag, timestamp, owner, and repo.
pb_releases() for a list of all releases in the repository
## Not run:
pb_list("cboettig/piggyback-tests")
## End(Not run)
A convenience wrapper around downloading a file from a specified repo/release and then reading it into memory. This convenience comes at a cost to performance, since the data is first written to disk and then read from disk into memory. See vignette("cloud_native") for alternative ways to bypass this flow and work with the data directly.
pb_read(
  file,
  ...,
  repo = guess_repo(),
  tag = "latest",
  read_function = guess_read_function(file),
  .token = gh::gh_token()
)
file | string: file name |
... | additional arguments passed to |
repo | string: GH repository name in format "owner/repo". Default |
tag | string: tag for the GH release, defaults to "latest" |
read_function | function: used to read in the data, where the file is passed as the first argument and any additional arguments are subsequently passed in via |
.token | GitHub authentication token, see |
Result of reading in the file in question.
Other pb_rw: guess_read_function(), guess_write_function(), pb_write()
try({ # try block is to avoid CRAN issues and is not required in ordinary usage
  piggyback::pb_read("mtcars.tsv.gz", repo = "cboettig/piggyback-tests")
})
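The download-then-read flow that pb_read() wraps can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: read_via_tempfile() and fake_download() are hypothetical names, the use of a temporary directory is an assumption, and the downloader is passed in as an argument so the flow can be demonstrated without contacting GitHub.

```r
# Hypothetical helper mirroring the download-then-read flow.
# `download` defaults to piggyback::pb_download but can be swapped for
# any function accepting the same file/dest/repo/tag arguments.
read_via_tempfile <- function(file, repo, tag = "latest",
                              read_function = utils::read.csv, ...,
                              download = piggyback::pb_download) {
  dest <- tempfile("pb_read_")
  dir.create(dest)
  on.exit(unlink(dest, recursive = TRUE), add = TRUE)

  # Step 1: write the asset to disk.
  download(file = file, dest = dest, repo = repo, tag = tag)

  # Step 2: read it from disk into memory; the path is the first argument.
  read_function(file.path(dest, file), ...)
}

# Offline demonstration: a stand-in downloader that writes a small CSV
# into `dest` instead of fetching a release asset.
fake_download <- function(file, dest, repo, tag) {
  utils::write.csv(head(mtcars, 3), file.path(dest, file), row.names = FALSE)
}

demo <- read_via_tempfile("mtcars.csv", repo = "owner/repo",
                          download = fake_download)
```

The two explicit steps make the cost visible: every call pays for a full write to disk before any data reaches memory.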
Create a new release on GitHub repo
pb_release_create(
  repo = guess_repo(),
  tag,
  commit = NULL,
  name = tag,
  body = "Data release",
  draft = FALSE,
  prerelease = FALSE,
  .token = gh::gh_token()
)
repo | Repository name in format "owner/repo". Will guess the current repo if not specified. |
tag | tag to create for this release |
commit | Specifies the commit-ish value that determines where the Git tag is created from. Can be any branch or full commit SHA (not the short hash). Unused if the git tag already exists. Default: the repository's default branch (usually |
name | The name of the release. Defaults to tag. |
body | Text describing the contents of the tag. Default text is "Data release". |
draft | default |
prerelease | default |
.token | GitHub authentication token, see |
Other release_management: pb_release_delete()
## Not run:
pb_release_create("cboettig/piggyback-tests", "v0.0.5")
## End(Not run)
Delete release from GitHub repo
pb_release_delete(repo = guess_repo(), tag, .token = gh::gh_token())
repo | Repository name in format "owner/repo". Defaults to |
tag | tag name to delete. Must be one of those found in |
.token | GitHub authentication token, see |
Other release_management: pb_release_create()
## Not run:
pb_release_delete("cboettig/piggyback-tests", "v0.0.5")
## End(Not run)
This function retrieves information about all releases attached to a given repository.
pb_releases(
  repo = guess_repo(),
  .token = gh::gh_token(),
  verbose = getOption("piggyback.verbose", default = TRUE)
)
repo | GitHub repository specification in the form of |
.token | a GitHub API token, defaults to |
verbose | defaults to TRUE, use FALSE to silence messages |
a dataframe of all releases available within a repository.
try({ # wrapped in try block to prevent CRAN errors
  pb_releases("nflverse/nflverse-data")
})
NOTE: you must first create a release if one does not already exist.
pb_upload(
  file,
  repo = guess_repo(),
  tag = "latest",
  name = NULL,
  overwrite = "use_timestamps",
  use_timestamps = NULL,
  show_progress = getOption("piggyback.verbose", default = interactive()),
  .token = gh::gh_token(),
  dir = NULL
)
file | string: path to file to be uploaded |
repo | string: GH repository name in format "owner/repo". Default |
tag | string: tag for the GH release, defaults to "latest" |
name | string: name for uploaded file. If not provided will use the basename of |
overwrite | choice: overwrite any existing file with the same name already attached to the release? Options are "use_timestamps", TRUE, or FALSE. The default, "use_timestamps", will only overwrite files where the local file is newer than the release timestamp. |
use_timestamps | DEPRECATED. |
show_progress | logical: should a progress bar be shown for uploading? Defaults to |
.token | GitHub authentication token, see |
dir | directory relative to which file names should be based, defaults to NULL for current working directory. |
## Not run:
# Needs your real token to run
readr::write_tsv(mtcars, "mtcars.tsv.xz")
pb_upload("mtcars.tsv.xz", "cboettig/piggyback-tests")
## End(Not run)
A convenience wrapper around writing an object to a temporary file and then uploading to a specified repo/release.
pb_write(
  x,
  file,
  ...,
  repo = guess_repo(),
  tag = "latest",
  write_function = guess_write_function(file),
  .token = gh::gh_token()
)
x | object: memory object to save to piggyback |
file | string: file name |
... | additional arguments passed to |
repo | string: GH repository name in format "owner/repo". Default |
tag | string: tag for the GH release, defaults to "latest" |
write_function | function: used to write an R object to file, where the object is passed as the first argument, the filename as the second argument, and any additional arguments are subsequently passed in via |
.token | GitHub authentication token, see |
Writes file to release and returns github API response
Other pb_rw: guess_read_function(), guess_write_function(), pb_read()
pb_write(mtcars, "mtcars.rds", repo = "tanho63/piggyback-tests")
#> ℹ Uploading to latest release: "v0.0.2".
#> ℹ Uploading mtcars.rds ...
#> |===============================================================| 100%
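The write-to-temporary-file-then-upload flow that pb_write() wraps can be sketched in the same spirit. This is a minimal sketch under stated assumptions, not the package's implementation: write_via_tempfile() and fake_upload() are hypothetical names, and the uploader is passed in as an argument so the flow can be shown without a token or network access.

```r
# Hypothetical helper mirroring the write-then-upload flow.
# `upload` defaults to piggyback::pb_upload but can be swapped for any
# function accepting a file path plus repo and tag arguments.
write_via_tempfile <- function(x, file, repo, tag = "latest",
                               write_function = saveRDS, ...,
                               upload = piggyback::pb_upload) {
  dest <- tempfile("pb_write_")
  dir.create(dest)
  on.exit(unlink(dest, recursive = TRUE), add = TRUE)
  path <- file.path(dest, file)

  # Step 1: object first, file path second, extra arguments after.
  write_function(x, path, ...)

  # Step 2: hand the temporary file to the uploader.
  upload(path, repo = repo, tag = tag)
}

# Offline demonstration: a stand-in uploader that only reports what it
# was asked to send.
fake_upload <- function(file, repo, tag) {
  list(name = basename(file), repo = repo, tag = tag, bytes = file.size(file))
}

result <- write_via_tempfile(mtcars, "mtcars.rds", repo = "owner/repo",
                             upload = fake_upload)
```

Because the writer receives the object and then the path, any function with that argument order (for example utils::write.csv) can be supplied as write_function.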