Title: | R Interface to the Data Retriever |
---|---|
Description: | Provides an R interface to the Data Retriever <https://retriever.readthedocs.io/en/latest/> via the Data Retriever's command line interface. The Data Retriever automates the tasks of finding, downloading, and cleaning public datasets, and then stores them in a local database. |
Authors: | Henry Senyondo [aut, cre], Daniel McGlinn [aut], Pranita Sharma [aut], David J Harris [aut], Hao Ye [aut], Shawn Taylor [aut], Jeroen Ooms [aut], Francisco Rodríguez-Sánchez [aut], Karthik Ram [aut], Apoorva Pandey [aut], Harshit Bansal [aut], Max Pohlman [aut], Ethan White [aut] |
Maintainer: | Henry Senyondo <[email protected]> |
License: | MIT + file LICENSE |
Version: | 3.1.1 |
Built: | 2025-01-21 05:12:34 UTC |
Source: | https://github.com/ropensci/rdataretriever |
Check for updates
check_for_updates(repo = "")
repo |
path to the repository |
No return value, checks the repository for updates
## Not run: rdataretriever::check_for_updates() ## End(Not run)
Check to see if minimum version of retriever Python package is installed
check_retriever_availability()
boolean
## Not run: rdataretriever::check_retriever_availability() ## End(Not run)
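A minimal sketch of using this check as a guard before other calls. The helper name `retriever_ready` is hypothetical, and treating any error from the check as "not available" is an assumption rather than documented behaviour.

```r
# Hypothetical helper: TRUE only when the rdataretriever R package is
# installed and check_retriever_availability() reports TRUE.
retriever_ready <- function() {
  if (!requireNamespace("rdataretriever", quietly = TRUE)) {
    return(FALSE)
  }
  # Assumption: any error raised by the check is treated as "not available".
  isTRUE(tryCatch(
    rdataretriever::check_retriever_availability(),
    error = function(e) FALSE
  ))
}

if (retriever_ready()) {
  message("retriever is available")
} else {
  message("retriever is not available; see install_retriever()")
}
```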
Commit a dataset
commit(dataset, commit_message = "", path = NULL, quiet = FALSE)
dataset |
name of the dataset |
commit_message |
commit message for the commit |
path |
path to save the committed dataset; if no path is given, it is saved in the provenance directory |
quiet |
logical, if true retriever runs in quiet mode |
No return value, provides confirmation for commit
## Not run: rdataretriever::commit("iris") ## End(Not run)
See the log of committed dataset stored in provenance directory
commit_log(dataset)
dataset |
name of the dataset stored in provenance directory |
No return value, prints message after commit
## Not run: rdataretriever::commit_log("iris") ## End(Not run)
Get Data Retriever version
data_retriever_version(clean = TRUE)
clean |
boolean, if TRUE returns a cleaned version string appropriate for semver |
returns a string with the version information
## Not run: rdataretriever::data_retriever_version() ## End(Not run)
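A cleaned version string can be compared against a minimum with base R's `package_version()`. The sketch below uses stand-in strings; the helper name `meets_minimum` is hypothetical, and it assumes the cleaned string is in a form that `package_version()` can parse.

```r
# Hypothetical helper: does a version string meet a minimum version?
# Uses base R's package_version(), which compares component by component.
meets_minimum <- function(version, minimum) {
  package_version(version) >= package_version(minimum)
}

# Stand-in strings; in practice `version` could come from
# rdataretriever::data_retriever_version(clean = TRUE).
meets_minimum("3.1.0", "3.0.0")
meets_minimum("2.4.0", "3.0.0")
```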
Additional information on the available datasets can be found at url https://retriever.readthedocs.io/en/latest/datasets.html
dataset_names()
returns a character vector with the available datasets for download
## Not run: rdataretriever::dataset_names() ## End(Not run)
Additional information on the available datasets can be found at url https://retriever.readthedocs.io/en/latest/datasets.html
datasets(keywords = "", licenses = "")
keywords |
search all datasets by keywords |
licenses |
search all datasets by licenses |
returns a character vector with the available datasets for download
Returns the names of all available dataset scripts
## Not run: rdataretriever::datasets() ## End(Not run)
Can take a list of packages, or NULL or a string 'all' for all rdataset packages and datasets
display_all_rdataset_names(package_name = NULL)
package_name |
print the datasets in the given package; by default prints rdataset, and 'all' prints all |
No return value, displays the list of rdataset names present
## Not run: rdataretriever::display_all_rdataset_names() ## End(Not run)
Directly downloads data files with no processing, allowing downloading of non-tabular data.
download(dataset, path = "./", quiet = FALSE, sub_dir = "", debug = FALSE, use_cache = TRUE)
dataset |
the name of the dataset that you wish to download |
path |
the path where the data should be downloaded to |
quiet |
logical, if true retriever runs in quiet mode |
sub_dir |
downloaded dataset is stored into a custom subdirectory. |
debug |
setting TRUE helps in debugging in case of errors |
use_cache |
Setting FALSE reinstalls scripts even if they are already installed |
No return value, downloads the raw dataset
## Not run:
rdataretriever::download("plant-comp-ok")
# downloaded files will be copied to your working directory
# when no path is specified
dir()
## End(Not run)
Each datafile in a given dataset is downloaded to a temporary directory and then imported as a data.frame as a member of a named list.
fetch(dataset, quiet = TRUE, data_names = NULL)
dataset |
the names of the datasets that you wish to download |
quiet |
logical, if true retriever runs in quiet mode |
data_names |
the names you wish to assign to cells of the list which stores the fetched dataframes. This is only relevant if you are downloading more than one dataset. |
Returns a named list of data frames, one for each datafile in the dataset
## Not run:
## fetch the portal Database
portal <- rdataretriever::fetch("portal")
class(portal)
names(portal)
## preview the data in the portal species datafile
head(portal$species)
vegdata <- rdataretriever::fetch(c("plant-comp-ok", "plant-occur-oosting"))
names(vegdata)
names(vegdata$plant_comp_ok)
## End(Not run)
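Because each datafile is imported as a member of a named list, the usual list tools apply to the result. The sketch below uses a small stand-in list instead of a real fetch; the table names and columns in `mock_fetch` are invented for illustration and do not reflect any actual dataset.

```r
# Stand-in for a fetched dataset: a named list of data frames.
# The table names and columns here are invented for illustration.
mock_fetch <- list(
  species = data.frame(id = c("DM", "PP"), count = c(10L, 4L)),
  plots   = data.frame(plot = 1:3)
)

# One row per table: its name, number of rows, and number of columns.
table_summary <- data.frame(
  table = names(mock_fetch),
  rows  = unname(vapply(mock_fetch, nrow, integer(1))),
  cols  = unname(vapply(mock_fetch, ncol, integer(1)))
)
print(table_summary)
```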
Returns metadata for the following dataset id
find_socrata_dataset_by_id(dataset_id)
dataset_id |
id of the dataset |
No return value, shows metadata for the following dataset id
## Not run: rdataretriever::find_socrata_dataset_by_id("<dataset_id>") ## End(Not run)
Get dataset names from upstream
get_dataset_names_upstream(keywords = "", licenses = "", repo = "")
keywords |
filter datasets based on keywords |
licenses |
filter datasets based on license |
repo |
path to the repository |
No return value, gets dataset names from upstream
## Not run: rdataretriever::get_dataset_names_upstream(keywords = "", licenses = "", repo = "") ## End(Not run)
Returns a list of all the available RDataset names present
get_rdataset_names()
No return value, lists all the available RDataset names
## Not run: rdataretriever::get_rdataset_names() ## End(Not run)
Get retriever citation
get_retriever_citation()
No return value, outputs citation of the Data Retriever Python package
## Not run: rdataretriever::get_retriever_citation() ## End(Not run)
Get citation of a script
get_script_citation(dataset = NULL)
dataset |
dataset for which to obtain the citation |
No return value, gets citation of a script
## Not run: rdataretriever::get_script_citation(dataset = "") ## End(Not run)
Get scripts upstream
get_script_upstream(dataset, repo = "")
dataset |
name of the dataset |
repo |
path to the repository |
No return value, gets upstream scripts
## Not run: rdataretriever::get_script_upstream("iris") ## End(Not run)
This function checks whether the version of the retriever's scripts in your local directory ‘~/.retriever/scripts/’ is up to date with the most recent official retriever release. Note that even newer scripts may exist in the retriever repository https://github.com/weecology/retriever/tree/main/scripts that have not yet been incorporated into an official release; consider checking that page if you have any concerns.
get_updates()
No return value, updates the retriever's dataset scripts to the most recent versions
## Not run: rdataretriever::get_updates() ## End(Not run)
Data is stored in either CSV files or one of the following database management systems: MySQL, PostgreSQL, SQLite, or Microsoft Access.
install(dataset, connection, db_file = NULL, conn_file = NULL, data_dir = ".", log_dir = NULL)
dataset |
the name of the dataset that you wish to download |
connection |
what type of database connection should be used. The options include: mysql, postgres, sqlite, msaccess, or csv |
db_file |
the name of the database file the dataset should be loaded into |
conn_file |
the path to the .conn file that contains the connection configuration options for mysql and postgres databases. This defaults to mysql.conn or postgres.conn respectively. The connection file is a file that is formatted in the following way:
|
data_dir |
the location where the dataset should be installed. Only relevant for csv connection types. Defaults to current working directory |
log_dir |
the location where the retriever log should be stored if the progress is not printed to the console |
No return value, main install function
## Not run: rdataretriever::install("iris", "csv") ## End(Not run)
Data is stored in CSV files
install_csv(dataset, table_name = "{db}_{table}.csv", data_dir = getwd(), debug = FALSE, use_cache = TRUE, force = FALSE, hash_value = NULL)
dataset |
the name of the dataset that you wish to install or path to a committed dataset zip file |
table_name |
the name of the database file to store data |
data_dir |
the dir path to store data, defaults to working dir |
debug |
setting TRUE helps in debugging in case of errors |
use_cache |
Setting FALSE reinstalls scripts even if they are already installed |
force |
setting TRUE doesn't prompt for confirmation while installing committed datasets when changes are discovered in environment |
hash_value |
the hash value of committed dataset when installing from provenance directory |
No return value, installs datasets into CSV
## Not run: rdataretriever::install_csv("iris") ## End(Not run)
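The default `table_name` contains `{db}` and `{table}` placeholders. The sketch below illustrates how such a template could expand, assuming `{db}` stands for the dataset name and `{table}` for the table name. The helper `expand_table_name` is hypothetical and is not the Data Retriever's own code.

```r
# Illustration only: expand "{db}" and "{table}" placeholders in a
# table_name template. This is not the Data Retriever's implementation.
expand_table_name <- function(template, db, table) {
  out <- gsub("{db}", db, template, fixed = TRUE)
  gsub("{table}", table, out, fixed = TRUE)
}

# "iris" and "Iris" are example values for the dataset and table names.
expand_table_name("{db}_{table}.csv", db = "iris", table = "Iris")
```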
Data is stored in JSON files
install_json(dataset, table_name = "{db}_{table}.json", data_dir = getwd(), debug = FALSE, use_cache = TRUE, force = FALSE, hash_value = NULL)
dataset |
the name of the dataset that you wish to install or path to a committed dataset zip file |
table_name |
the name of the database file to store data |
data_dir |
the dir path to store data, defaults to working dir |
debug |
setting TRUE helps in debugging in case of errors |
use_cache |
setting FALSE reinstalls scripts even if they are already installed |
force |
setting TRUE doesn't prompt for confirmation while installing committed datasets when changes are discovered in environment |
hash_value |
the hash value of committed dataset when installing from provenance directory |
No return value, installs datasets into JSON
## Not run: rdataretriever::install_json("iris") ## End(Not run)
Data is stored in MSAccess database
install_msaccess(dataset, file = "access.mdb", table_name = "[{db} {table}]", debug = FALSE, use_cache = TRUE, force = FALSE, hash_value = NULL)
dataset |
the name of the dataset that you wish to install or path to a committed dataset zip file |
file |
file name for database |
table_name |
table name for installing of dataset |
debug |
setting TRUE helps in debugging in case of errors |
use_cache |
Setting FALSE reinstalls scripts even if they are already installed |
force |
setting TRUE doesn't prompt for confirmation while installing committed datasets when changes are discovered in environment |
hash_value |
the hash value of committed dataset when installing from provenance directory |
No return value, installs datasets into MSAccess database
## Not run: rdataretriever::install_msaccess(dataset = "iris", file = "access.mdb") ## End(Not run)
Data is stored in MySQL database
install_mysql(dataset, user = "root", password = "", host = "localhost", port = 3306, database_name = "{db}", table_name = "{db}.{table}", debug = FALSE, use_cache = TRUE, force = FALSE, hash_value = NULL)
dataset |
the name of the dataset that you wish to install or path to a committed dataset zip file |
user |
username for database connection |
password |
password for database connection |
host |
hostname for connection |
port |
port number for connection |
database_name |
database name in which dataset will be installed |
table_name |
table name specified especially for datasets containing one file |
debug |
setting TRUE helps in debugging in case of errors |
use_cache |
setting FALSE reinstalls scripts even if they are already installed |
force |
setting TRUE doesn't prompt for confirmation while installing committed datasets when changes are discovered in environment |
hash_value |
the hash value of committed dataset when installing from provenance directory |
No return value, installs datasets into MySQL database
## Not run: rdataretriever::install_mysql(dataset = "portal", user = "postgres", password = "abcdef") ## End(Not run)
Data is stored in PostgreSQL database
install_postgres(dataset, user = "postgres", password = "", host = "localhost", port = 5432, database = "postgres", database_name = "{db}", table_name = "{db}.{table}", bbox = list(), debug = FALSE, use_cache = TRUE, force = FALSE, hash_value = NULL)
dataset |
the name of the dataset that you wish to install or path to a committed dataset zip file |
user |
username for database connection |
password |
password for database connection |
host |
hostname for connection |
port |
port number for connection |
database |
the database name; the default is postgres |
database_name |
database schema name in which dataset will be installed |
table_name |
table name specified especially for datasets containing one file |
bbox |
optional extent values used to fetch data from the spatial dataset |
debug |
setting TRUE helps in debugging in case of errors |
use_cache |
setting FALSE reinstalls scripts even if they are already installed |
force |
setting TRUE doesn't prompt for confirmation while installing committed datasets when changes are discovered in environment |
hash_value |
the hash value of committed dataset when installing from provenance directory |
No return value, installs datasets into PostgreSQL database
## Not run: rdataretriever::install_postgres(dataset = "portal", user = "postgres", password = "abcdef") ## End(Not run)
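A sketch of passing the connection arguments without hard-coding the password in a script. The environment variable name `RETRIEVER_PG_PASSWORD` is hypothetical, and the guard that skips the install when the package is missing or no password is set is a design choice of this example, not package behaviour.

```r
# Build the connection arguments once, reading the password from an
# environment variable instead of writing it into the script.
# RETRIEVER_PG_PASSWORD is a hypothetical variable name.
pg_args <- list(
  dataset  = "portal",
  user     = "postgres",
  password = Sys.getenv("RETRIEVER_PG_PASSWORD", unset = ""),
  host     = "localhost",
  port     = 5432
)

# Only attempt the install when the package is present and a password is set.
if (requireNamespace("rdataretriever", quietly = TRUE) &&
    nzchar(pg_args$password)) {
  do.call(rdataretriever::install_postgres, pg_args)
}
```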
Install the Python module 'retriever'
install_retriever(method = "auto", conda = "auto")
method |
Installation method. By default, "auto" automatically finds a method that will work in the local environment. Change the default to force a specific installation method. Note that the "virtualenv" method is not available on Windows. |
conda |
The path to a |
No return value, install the python module 'retriever'
Data is stored in SQLite database
install_sqlite(dataset, file = "sqlite.db", table_name = "{db}_{table}", data_dir = getwd(), debug = FALSE, use_cache = TRUE, force = FALSE, hash_value = NULL)
dataset |
the name of the dataset that you wish to install or path to a committed dataset zip file |
file |
SQLite database file name or path |
table_name |
table name for installing of dataset |
data_dir |
the dir path to store the db, defaults to working dir |
debug |
setting TRUE helps in debugging in case of errors |
use_cache |
setting FALSE reinstalls scripts even if they are already installed |
force |
setting TRUE doesn't prompt for confirmation while installing committed datasets when changes are discovered in environment |
hash_value |
the hash value of committed dataset when installing from provenance directory |
No return value, installs datasets into SQLite database
## Not run: rdataretriever::install_sqlite(dataset = "iris", file = "sqlite.db") ## End(Not run)
Data is stored in XML files
install_xml(dataset, table_name = "{db}_{table}.xml", data_dir = getwd(), debug = FALSE, use_cache = TRUE, force = FALSE, hash_value = NULL)
dataset |
the name of the dataset that you wish to install or path to a committed dataset zip file |
table_name |
the name of the database file to store data |
data_dir |
the dir path to store data, defaults to working dir |
debug |
setting TRUE helps in debugging in case of errors |
use_cache |
Setting FALSE reinstalls scripts even if they are already installed |
force |
setting TRUE doesn't prompt for confirmation while installing committed datasets when changes are discovered in environment |
hash_value |
the hash value of committed dataset when installing from provenance directory |
No return value, installs datasets into XML
## Not run: rdataretriever::install_xml("iris") ## End(Not run)
Update the retriever's global_script_list with the scripts present in the ~/.retriever directory.
reload_scripts()
No return value, fetches the most recent scripts
## Not run: rdataretriever::reload_scripts() ## End(Not run)
Reset the scripts or data(raw_data) directory or both
reset(scope = "all")
scope |
"all" resets both the scripts and data directories |
No return value, resets the scripts and the data directory
## Not run: rdataretriever::reset("iris") ## End(Not run)
Returns the list of dataset names after autocompletion
socrata_autocomplete_search(dataset)
dataset |
the name of the dataset |
No return value, show dataset names after autocompletion
## Not run: rdataretriever::socrata_autocomplete_search() ## End(Not run)
Get socrata dataset info
socrata_dataset_info(dataset_name)
dataset_name |
dataset name to obtain info |
No return value, shows socrata dataset info
## Not run: rdataretriever::socrata_dataset_info() ## End(Not run)
Updates the datasets_url.json from the GitHub repo
update_rdataset_catalog(test = FALSE)
test |
flag set when testing |
No return value, updates the datasets_url.json
## Not run: rdataretriever::update_rdataset_catalog() ## End(Not run)
Setting path of retriever
use_RetrieverPath(path)
path |
location of retriever in the system |
No return value, setting path of retriever
## Not run: rdataretriever::use_RetrieverPath("/home/<system_name>/anaconda2/envs/py27/bin/") ## End(Not run)