Package 'restez' reference manual

Title:	Create and Query a Local Copy of 'GenBank' in R
Description:	Download large sections of 'GenBank' <https://www.ncbi.nlm.nih.gov/genbank/> and generate a local SQL-based database. A user can then query this database using 'restez' functions or through 'rentrez' <https://CRAN.R-project.org/package=rentrez> wrappers.
Authors:	Joel H. Nitta [aut, cre], Dom Bennett [aut]
Maintainer:	Joel H. Nitta <joelnitta@gmail.com>
License:	MIT + file LICENSE
Version:	2.1.5.9000
Built:	2025-03-07 01:21:36 UTC
Source:	https://github.com/ropensci/restez

Log files added to the SQL database in the restez path

Description

This function is called whenever sequence files have been successfully added to the nucleotide SQL database. A row is added to 'add_log.tsv' in the user's restez path for each file, containing the filename, the GenBank release number and the time the file was added. The log helps users keep track of when sequences were added.

Usage

add_rcrd_log(fl)

Arguments

`fl`	filename, character

Cat lines

Description

Helper function for printing lines to console. Automatically formats lines by adding newlines.

Usage

cat_line(...)

Arguments

`...`	Text to print, character

Print green

Description

Print green text to the console to indicate a name, file path or other text.

Usage

char(x)

Arguments

`x`	Text to print, character

Value

coloured character encoding, character

Helper function to test if a stable internet connection can be established.

Description

All retrieval functions need a stable internet connection to work properly. This internal function pings the Google homepage and throws an error if it cannot be reached.

Usage

check_connection()

Author(s)

Hajk-Georg Drost

Clean up test data

Description

Removes all temporary test data created.

Usage

cleanup()

Is restez connected?

Description

Returns TRUE if a restez SQL database has been connected.

Usage

connected()

Value

Logical

Retrieve restez connection

Description

Safely acquire the restez connection. Raises an error if no connection is set.

Usage

connection_get()

Value

connection

Return the number of ids

Description

Return the number of ids in a user's restez database.

Usage

count_db_ids(db = "nucleotide")

Arguments

`db`	character, database name

Details

Requires an open connection. If there is no connection or no database, 0 is returned.

Value

integer

Examples

library(restez)
restez_path_set(filepath = tempdir())
demo_db_create(n = 5)
(count_db_ids())

# delete demo after example
db_delete(everything = TRUE)

Create new NCBI database

Description

Create a new local SQL database from downloaded files. Currently, only the GenBank nucleotide (nuccore) database is supported.

Usage

db_create(
  db_type = "nucleotide",
  min_length = 0,
  max_length = NULL,
  acc_filter = NULL,
  invert = FALSE,
  alt_restez_path = NULL,
  scan = FALSE
)

Arguments

`db_type`	character, database type
`min_length`	Minimum sequence length, default 0.
`max_length`	Maximum sequence length, default NULL.
`acc_filter`	Character vector; accessions to include or exclude from the database as specified by `invert`.
`invert`	Logical vector of length 1; if TRUE, accessions in `acc_filter` will be excluded from the database; if FALSE, only accessions in `acc_filter` will be included in the database. Default FALSE.
`alt_restez_path`	Alternative restez path if you would like to use the downloads from a different restez path.
`scan`	Logical vector of length 1; should the sequence file be scanned for accessions in `acc_filter` prior to processing? Requires zgrep to be installed (so does not work on Windows). Only used if `acc_filter` is not NULL and `invert` is FALSE. Default FALSE.

Details

All .seq.gz files are added to the database by default. A user can specify minimum/maximum sequence lengths or accession numbers to limit the sequences to be added to the database – smaller databases are faster to search. The final selection of sequences is the result of applying all filters (acc_filter, min_length, max_length) in combination.

The scan option can decrease the time needed to build a database if only a small number of sequences should be written to the database compared to the number of the sequences downloaded from GenBank; i.e., if many of the files downloaded from GenBank do not contain any sequences that should be written to the database. When set to TRUE, if a file does not contain any of the accessions in acc_filter, further processing of that file will be skipped and none of the sequences it contains will be added to the database.

Alternatively, a user can use the alt_restez_path to add the files from an alternative restez file path. For example, you may wish to have a database of all environmental sequences but then an additional smaller one of just the sequences with lengths below 100 bp. Instead of having to download all environmental sequences twice, you can generate multiple restez databases using the same downloaded files from a single restez path.

This function will not overwrite a pre-existing database. Old databases must be deleted before a new one can be created. Use db_delete() with everything=FALSE to delete an SQL database.

Connections/disconnections to the database are made automatically.
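
The idea behind the scan option can be sketched in base R. As noted in the arguments, restez relies on zgrep for this step; the helper below, file_has_accession(), is a hypothetical stand-in that reads the gzipped file with gzfile() instead, purely to illustrate how a file containing none of the wanted accessions could be skipped. The mock file and its contents are assumptions made for the example.

```r
# Hypothetical helper (not part of restez): does a gzipped flat file mention
# any of the accessions in `acc_filter`?
file_has_accession <- function(path, acc_filter) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  lines <- readLines(con, warn = FALSE)
  # Only ACCESSION lines are checked; fixed = TRUE avoids regex surprises
  acc_lines <- lines[startsWith(lines, "ACCESSION")]
  any(vapply(
    acc_filter,
    function(acc) any(grepl(acc, acc_lines, fixed = TRUE)),
    logical(1)
  ))
}

# Write a tiny mock file so the sketch can be run end to end
mock_file <- tempfile(fileext = ".seq.gz")
out_con <- gzfile(mock_file, open = "wt")
writeLines(c("LOCUS       AF000122", "ACCESSION   AF000122", "//"), out_con)
close(out_con)

scan_hit <- file_has_accession(mock_file, c("AF000122", "AF000123"))
scan_miss <- file_has_accession(mock_file, "ZZ999999")
```

A file for which the check returns FALSE would be skipped entirely, which is where the time saving comes from when only a few accessions are wanted.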

Examples

## Not run: 
# Example of general usage
library(restez)
restez_path_set(filepath = 'path/for/downloads/and/database')
db_download()
db_create()

# Example of using `acc_filter`
#
# Download files to temporary directory
temp_dir <- paste0(tempdir(), "/restez", collapse = "")
dir.create(temp_dir)
restez_path_set(filepath = temp_dir)
# Choose GenBank domain 20 ('unannotated'), the smallest
db_download(preselection = 20)
# Only include three accessions in database
db_create(
  acc_filter = c("AF000122", "AF000123", "AF000124")
)
list_db_ids()
db_delete()
unlink(temp_dir, recursive = TRUE)

## End(Not run)

Delete database

Description

Delete the local SQL database and/or restez folder.

Usage

db_delete(everything = FALSE)

Arguments

`everything`	T/F, delete the whole restez folder as well?

Details

Any connected database will be automatically disconnected.

Examples

library(restez)
fp <- tempdir()
restez_path_set(filepath = fp)
demo_db_create(n = 10)
db_delete(everything = FALSE)
# Will not run: gb_sequence_get(id = 'demo_1')
# only the SQL database is deleted
db_delete(everything = TRUE)
# Now returns NULL
(restez_path_get())

Download database

Description

Download .seq.gz files from the latest GenBank release.

Usage

db_download(
  db = "nucleotide",
  overwrite = FALSE,
  preselection = NULL,
  max_tries = 1
)

Arguments

`db`	Database type, only 'nucleotide' currently available.
`overwrite`	T/F, overwrite pre-existing downloaded files?
`preselection`	Character vector of length 1; GenBank domains to download. If not specified (default), a menu will be provided for selection. To specify, provide either a single number or a single character string of numbers separated by spaces, e.g. "19 20" for 'Phage' (19) and 'Unannotated' (20).
`max_tries`	Numeric vector of length 1; maximum number of times to attempt downloading database (default 1).

Details

In default mode, the user interactively selects the parts (i.e., "domains") of GenBank to download (e.g. primates, plants, bacteria ...). Alternatively, the selected domains can be provided as a character string to preselection.
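
As a rough illustration of the preselection format, the sketch below splits a string such as "19 20" into menu numbers. The helper parse_preselection() is hypothetical and only mirrors the documented input format; it is not taken from the restez source.

```r
# Assumption: a preselection such as "19 20" is split on whitespace into
# menu numbers. A single number (e.g. 20) is also accepted.
parse_preselection <- function(preselection) {
  parts <- strsplit(trimws(as.character(preselection)), "\\s+")[[1]]
  as.integer(parts)
}

domains <- parse_preselection("19 20")
single <- parse_preselection(20)
```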

The max_tries argument is useful for large databases that may otherwise fail due to periodic lapses in internet connectivity. This value can be set to Inf to continuously try until the database download succeeds (not recommended if you do not have an internet connection!).
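
The retry behaviour described for max_tries can be sketched as a simple loop. Everything in the block is hypothetical: download_with_retries() and the mock flaky_download() are illustrative names, and the real function may handle failures differently.

```r
# Hypothetical retry wrapper; `download_fun` stands in for a single download
# attempt that returns TRUE on success and FALSE on failure.
download_with_retries <- function(download_fun, max_tries = 1) {
  tries <- 0
  success <- FALSE
  while (!success && tries < max_tries) {
    tries <- tries + 1
    # Treat an error during the attempt as a failed try
    success <- isTRUE(tryCatch(download_fun(), error = function(e) FALSE))
  }
  list(success = success, tries = tries)
}

# Mock download that fails twice before succeeding
attempt <- 0
flaky_download <- function() {
  attempt <<- attempt + 1
  attempt >= 3
}

res_retry <- download_with_retries(flaky_download, max_tries = 5)
```

With max_tries = Inf the loop condition never stops on the try count, which is why that setting only makes sense when a connection is eventually available.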

Value

Logical; TRUE if all files downloaded correctly, otherwise FALSE.

Examples

## Not run: 
library(restez)
restez_path_set(filepath = 'path/for/downloads')
db_download()

## End(Not run)

Download database (internal version)

Description

Download .seq.gz files from the latest GenBank release. The user interactively selects the parts of GenBank to download (e.g. primates, plants, bacteria ...). This is an internal function so that the download can be wrapped in while() to enable persistent downloading.

Usage

db_download_intern(db = "nucleotide", overwrite = FALSE, preselection = NULL)

Arguments

`db`	Database type, only 'nucleotide' currently available.
`overwrite`	T/F, overwrite pre-existing downloaded files?
`preselection`	Character vector of length 1; GenBank domains to download. If not specified (default), a menu will be provided for selection. To specify, provide either a single number or a single character string of numbers separated by spaces, e.g. "19 20" for 'Phage' (19) and 'Unannotated' (20).

Details

The downloaded files will appear in the restez filepath under downloads.

Value

Logical; TRUE if all files downloaded correctly, otherwise FALSE.

Return the minimum and maximum sequence lengths in db

Description

Returns the maximum and minimum sequence lengths as set by the user upon db creation.

Usage

db_sqlngths_get()

Details

If no file found, returns empty character vector.

Value

vector of integers

Log the min and max sequence lengths

Description

Log the minimum and maximum sequence lengths used in the created db.

Usage

db_sqlngths_log(min_lngth, max_lngth)

Arguments

`min_lngth`	Minimum length
`max_lngth`	Maximum length

Create demo database

Description

Creates a local mock SQL database from package test data for demonstration purposes. No internet connection required.

Usage

demo_db_create(db_type = "nucleotide", n = 100)

Arguments

`db_type`	character, database type
`n`	integer, number of mock sequences

Examples

library(restez)
# set the restez path to a temporary dir
restez_path_set(filepath = tempdir())
# create demo database
demo_db_create(n = 5)
# in the demo, IDs are 'demo_1', 'demo_2' ...
(gb_sequence_get(id = 'demo_1'))

# Delete a demo database after an example
db_delete(everything = TRUE)

Calculate the size of a directory

Description

Returns the size of a directory in GB.

Usage

dir_size(fp)

Arguments

`fp`	File path, character

Value

numeric
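
A minimal sketch of such a calculation in base R is given here. The name dir_size_sketch() is hypothetical, and the conversion factor (1 GB = 1e9 bytes) is an assumption; the package may use a different definition.

```r
# Hypothetical re-implementation for illustration; assumes 1 GB = 1e9 bytes
dir_size_sketch <- function(fp) {
  files <- list.files(fp, recursive = TRUE, full.names = TRUE)
  sum(file.size(files), na.rm = TRUE) / 1e9
}

# Create a small temporary directory with two files of known size
demo_dir <- tempfile("dir_size_demo")
dir.create(demo_dir)
writeBin(raw(1000), file.path(demo_dir, "a.bin"))
writeBin(raw(500), file.path(demo_dir, "b.bin"))
size_gb <- dir_size_sketch(demo_dir)
unlink(demo_dir, recursive = TRUE)
```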

Get dwnld path

Description

Return path to folder where raw .seq files are stored.

Usage

dwnld_path_get()

Value

character

Log a downloaded file in the restez path

Description

This function is called whenever a file is successfully downloaded. A row is added to 'download_log.tsv' in the user's restez path containing the file name, the GenBank release number and the time of the successful download. The log helps users keep track of when they downloaded files and determine whether the downloaded files are out of date.

Usage

dwnld_rcrd_log(fl)

Arguments

`fl`	file name, character

Get Entrez fasta

Description

Return fasta format as expected from an Entrez call. If not all IDs are returned, will run rentrez::entrez_fetch.

Usage

entrez_fasta_get(id, ...)

Arguments

`id`	vector, unique ID(s) for record(s)
`...`	arguments passed on to rentrez

Value

character string containing the file created

Entrez fetch

Description

Wrapper for rentrez::entrez_fetch.

Usage

entrez_fetch(db, id = NULL, rettype, retmode = "", ...)

Arguments

`db`	character, name of the database
`id`	vector, unique ID(s) for record(s)
`rettype`	character, data format
`retmode`	character, data mode
`...`	Arguments to be passed on to rentrez

Details

First attempts to find the records in the local database using the user-specified parameters. If a record is missing from the database, the function then calls rentrez::entrez_fetch to search GenBank remotely.

rettype='fasta' and rettype='gb' are respectively equivalent to gb_fasta_get() and gb_record_get().
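
The local-first, remote-fallback behaviour can be sketched without a database or internet access. In the block below, fetch_local_first(), mock_local() and mock_remote() are all hypothetical stand-ins; the real function queries the restez database and rentrez respectively, and may order or combine results differently.

```r
# Hypothetical stand-ins: `local_lookup` returns a named character vector of
# the records found locally; `remote_fetch` fetches the remaining IDs.
fetch_local_first <- function(ids, local_lookup, remote_fetch) {
  local_res <- local_lookup(ids)
  missing_ids <- setdiff(ids, names(local_res))
  remote_res <- if (length(missing_ids) > 0) {
    remote_fetch(missing_ids)
  } else {
    character(0)
  }
  paste(c(local_res, remote_res), collapse = "\n")
}

# Mock data sources so the sketch runs offline
local_store <- c(demo_1 = ">demo_1\nACGT", demo_2 = ">demo_2\nTTGA")
mock_local <- function(ids) local_store[intersect(ids, names(local_store))]
mock_remote <- function(ids) paste0(">", ids, "\nNNNN")

fetched <- fetch_local_first(c("demo_1", "S71333"), mock_local, mock_remote)
```

Only the IDs missing from the local store reach the remote function, so a fully local request never triggers a network call in this sketch.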

Value

character string containing the file created

Supported return types and modes

XML retmode is not supported. Rettypes 'seqid', 'ft', 'acc' and 'uilist' are also not supported.

Note

It is advisable to call restez and rentrez functions with '::' notation rather than library() calls to avoid namespace issues. e.g. restez::entrez_fetch().

Examples

library(restez)
restez_path_set(tempdir())
demo_db_create(n = 5)
# return fasta record
fasta_res <- entrez_fetch(db = 'nucleotide',
                          id = c('demo_1', 'demo_2'),
                          rettype = 'fasta')
cat(fasta_res)
# return whole GB record in text format
gb_res <- entrez_fetch(db = 'nucleotide',
                       id = c('demo_1', 'demo_2'),
                       rettype = 'gb')
cat(gb_res)
# NOT RUN
# whereas these requests would go through rentrez
# fasta_res <- entrez_fetch(db = 'nucleotide',
#                           id = c('S71333', 'S71334'),
#                           rettype = 'fasta')
# gb_res <- entrez_fetch(db = 'nucleotide',
#                        id = c('S71333', 'S71334'),
#                        rettype = 'gb')

# delete demo after example
db_delete(everything = TRUE)

Get Entrez GenBank record

Description

Return gb and gbwithparts format as expected from an Entrez call. If not all IDs are returned, will run rentrez::entrez_fetch.

Usage

entrez_gb_get(id, ...)

Arguments

`id`	vector, unique ID(s) for record(s)
`...`	arguments passed on to rentrez

Value

character string containing the file created

Extract accession

Description

Return accession ID from GenBank record

Usage

extract_accession(record)

Arguments

`record`	GenBank record in text format, character

Details

If the element is not found, "" is returned.

Value

character

Extract by keyword

Description

Search through GenBank record for a keyword and return text up to the end_pattern.

Usage

extract_by_patterns(record, start_pattern, end_pattern = "\n")

Arguments

`record`	GenBank record in text format, character
`start_pattern`	REGEX pattern indicating the point to start extraction, character
`end_pattern`	REGEX pattern indicating the point to stop extraction, character

Details

The start_pattern should be any of the capitalized elements in a GenBank record (e.g. LOCUS, DEFINITION, ACCESSION). The end_pattern depends on how much of the selected element a user wants returned. By default, the extraction stops at the next newline. If the keyword or end pattern is not found, NULL is returned.
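
The extraction described above can be approximated with base R regular expressions. The function extract_between() and the mock record are hypothetical and written only for illustration; the package's own implementation may differ in how it trims or handles edge cases.

```r
# Hypothetical sketch of pattern-based extraction; not the restez source
extract_between <- function(record, start_pattern, end_pattern = "\n") {
  start <- regexpr(start_pattern, record)
  start_pos <- as.integer(start)
  if (start_pos == -1) {
    return(NULL)
  }
  # Text that follows the matched start pattern
  rest <- substring(record, start_pos + attr(start, "match.length"))
  end_pos <- as.integer(regexpr(end_pattern, rest))
  if (end_pos == -1) {
    return(NULL)
  }
  trimws(substring(rest, 1, end_pos - 1))
}

mock_record <- paste(
  "LOCUS       AB000001   12 bp   DNA   linear   01-JAN-2000",
  "DEFINITION  Mock sequence for illustration.",
  "ACCESSION   AB000001",
  "",
  sep = "\n"
)

acc <- extract_between(mock_record, "ACCESSION")
none <- extract_between(mock_record, "KEYWORDS")
```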

Value

character or NULL

Extract clean sequence from sequence part

Description

Return clean sequence from seqrecpart of a GenBank record

Usage

extract_clean_sequence(seqrecpart, max_len = 1e+08)

Arguments

`seqrecpart`	Sequence part of a GenBank record, character
`max_len`	Number: maximum number of characters allowed in a single record before splitting the record into parts. Does not affect output, but only internal calculations, so generally should not be changed. Default = 1e8.

Details

If the element is not found, "" is returned.

Value

character

Extract definition

Description

Return definition from GenBank record.

Usage

extract_definition(record)

Arguments

`record`	GenBank record in text format, character

Details

If the element is not found, "" is returned.

Value

character

Extract features

Description

Return feature table as list from GenBank record

Usage

extract_features(record)

Arguments

`record`	GenBank record in text format, character

Details

If the element is not found, an empty list is returned.

Value

list of lists

Extract the information record part

Description

Return information part from GenBank record

Usage

extract_inforecpart(record)

Arguments

`record`	GenBank record in text format, character

Details

If the element is not found, "" is returned.

Value

character

Extract keywords

Description

Return keywords as list from GenBank record

Usage

extract_keywords(record)

Arguments

`record`	GenBank record in text format, character

Details

If the element is not found, "" is returned.

Value

character vector

Extract locus

Description

Return locus information from GenBank record

Usage

extract_locus(record)

Arguments

`record`	GenBank record in text format, character

Details

If the element is not found, "" is returned.

Value

named character vector

Extract organism

Description

Return organism name from GenBank record

Usage

extract_organism(record)

Arguments

`record`	GenBank record in text format, character

Details

If the element is not found, "" is returned.

Value

character

Extract the sequence record part

Description

Return sequence part from GenBank record

Usage

extract_seqrecpart(record)

Arguments

`record`	GenBank record in text format, character

Details

If the element is not found, "" is returned.

Value

character

Extract sequence

Description

Return sequence from GenBank record

Usage

extract_sequence(record)

Arguments

`record`	GenBank record in text format, character

Details

If the element is not found, "" is returned.

Value

character

Extract version

Description

Return accession + version ID from GenBank record

Usage

extract_version(record)

Arguments

`record`	GenBank record in text format, character

Details

If the element is not found, "" is returned.

Value

character

Download a file

Description

Download a GenBank .seq.gz file and check that it has downloaded properly. If not, returns FALSE. If overwrite is TRUE, any previous file will be overwritten.

Usage

file_download(fl, overwrite = FALSE)

Arguments

`fl`	character, base filename (e.g. gbpri9.seq) to be downloaded
`overwrite`	T/F

Value

T/F

Write filenames to log files

Description

Record a filename in a log file along with GB release and time.

Usage

filename_log(fl, fp)

Arguments

`fl`	file name, character
`fp`	filepath to log file, character
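
A minimal sketch of this kind of logging in base R follows. The function log_filename_sketch(), its `release` argument, the column order, the absence of a header row and the release value "255" are all assumptions made for illustration; how restez obtains the release number and lays out the file is not shown here.

```r
# Hypothetical sketch: append one tab-separated row per file to a log.
# `release` is passed in explicitly; no header row is written.
log_filename_sketch <- function(fl, fp, release) {
  row <- data.frame(
    file = fl,
    release = release,
    time = format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
    stringsAsFactors = FALSE
  )
  write.table(row, file = fp, sep = "\t", append = TRUE,
              col.names = FALSE, row.names = FALSE, quote = FALSE)
  invisible(fp)
}

log_fp <- tempfile(fileext = ".tsv")
log_filename_sketch("gbpri1.seq", log_fp, release = "255")
log_filename_sketch("gbpri2.seq", log_fp, release = "255")
log_tbl <- read.delim(log_fp, header = FALSE, stringsAsFactors = FALSE)
```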

Read flatfile sequence records

Description

Read records from a .seq file.

Usage

flatfile_read(flpth)

Arguments

`flpth`	Path to .seq file

Value

list of GenBank records in text format
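
To illustrate what splitting a flat file into records involves, the sketch below assumes that each record starts at a LOCUS line and ends at a line containing only "//", which is the standard GenBank flat-file terminator. The function split_records_sketch() and the mock lines are hypothetical; the package's own reader may work differently.

```r
# Hypothetical sketch; assumes records start at a LOCUS line and end at "//"
split_records_sketch <- function(lines) {
  starts <- which(startsWith(lines, "LOCUS"))
  ends <- which(lines == "//")
  # Pair each LOCUS line with the first terminator that follows it
  records <- lapply(starts, function(s) {
    e <- ends[ends > s][1]
    if (is.na(e)) {
      return(NULL)
    }
    paste(lines[s:e], collapse = "\n")
  })
  # Drop any trailing record that was never terminated
  Filter(Negate(is.null), records)
}

mock_lines <- c(
  "GBMOCK.SEQ   header text that precedes the first record",
  "",
  "LOCUS       AB000001", "ORIGIN", "        1 acgt", "//",
  "LOCUS       AB000002", "ORIGIN", "        1 ttga", "//"
)
recs <- split_records_sketch(mock_lines)
```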

Read and add .seq files to database

Description

Given a list of seq_files, read and add the contents of the files to a SQL-like database. If any errors occur during the process, FALSE is returned.

Usage

gb_build(
  dpth,
  seq_files,
  max_length,
  min_length,
  acc_filter = NULL,
  invert = FALSE,
  scan = FALSE
)

Arguments

`dpth`	Download path (where seq_files are stored)
`seq_files`	Names of the .seq.gz sequence files
`max_length`	Maximum sequence length, default NULL.
`min_length`	Minimum sequence length, default 0.
`acc_filter`	Character vector; accessions to include or exclude from the database as specified by `invert`.
`invert`	Logical vector of length 1; if TRUE, accessions in `acc_filter` will be excluded from the database; if FALSE, only accessions in `acc_filter` will be included in the database. Default FALSE.
`scan`	Logical vector of length 1; should the sequence file be scanned for accessions in `acc_filter` prior to processing? Requires zgrep to be installed (so does not work on Windows). Only used if `acc_filter` is not NULL and `invert` is FALSE. Default FALSE.

Details

This function will automatically connect to the restez database.

Value

Logical

Get definition from GenBank

Description

Return the definition line for an accession ID.

Usage

gb_definition_get(id)

Arguments

`id`	character, sequence accession ID(s)

Value

named vector of definitions, if no results found NULL

Examples

library(restez)
restez_path_set(filepath = tempdir())
demo_db_create(n = 5)
(def <- gb_definition_get(id = 'demo_1'))
(defs <- gb_definition_get(id = c('demo_1', 'demo_2')))


# delete demo after example
db_delete(everything = TRUE)

Create GenBank data.frame

Description

Make a data.frame from column vectors for nucleotide entries. Used as part of gb_df_generate().

Usage

gb_df_create(accessions, versions, organisms, definitions, sequences, records)

Arguments

`accessions`	character, vector of accessions
`versions`	character, vector of accessions + versions
`organisms`	character, vector of organism names
`definitions`	character, vector of sequence definitions
`sequences`	character, vector of sequences
`records`	character, vector of GenBank records in text format

Value

data.frame

Generate GenBank records data.frame

Description

For a list of records, construct a data.frame for insertion into SQL database.

Usage

gb_df_generate(
  records,
  min_length = 0,
  max_length = NULL,
  acc_filter = NULL,
  invert = FALSE
)

Arguments

`records`	character, vector of GenBank records in text format
`min_length`	Minimum sequence length, default 0.
`max_length`	Maximum sequence length, default NULL.
`acc_filter`	Character vector; accessions to include or exclude from the database as specified by `invert`.
`invert`	Logical vector of length 1; if TRUE, accessions in `acc_filter` will be excluded from the database; if FALSE, only accessions in `acc_filter` will be included in the database. Default FALSE.

Details

The resulting data.frame has five columns: accession, organism, raw_definition, raw_sequence, raw_record. The prefix 'raw_' indicates the data has been converted to the raw format, see ?charToRaw, in order to save on RAM. The raw_record contains the entire GenBank record in text format.

Use acc_filter and the minimum and maximum sequence lengths to minimize the size of the database. Every sequence must be at least as long as min_length and no longer than max_length, unless max_length is NULL, in which case there is no maximum length. The final selection of sequences is the result of applying all filters (acc_filter, min_length, max_length) in combination.
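
How the filters combine can be sketched as a single logical vector. The function passes_filters() is hypothetical and operates on plain vectors of accessions and sequence lengths rather than on GenBank records; it only illustrates the rules stated above.

```r
# Hypothetical sketch of the combined filters; not the restez source
passes_filters <- function(accessions, seq_lengths, min_length = 0,
                           max_length = NULL, acc_filter = NULL,
                           invert = FALSE) {
  keep <- seq_lengths >= min_length
  if (!is.null(max_length)) {
    keep <- keep & seq_lengths <= max_length
  }
  if (!is.null(acc_filter)) {
    in_filter <- accessions %in% acc_filter
    # invert = TRUE excludes the listed accessions instead of keeping them
    if (invert) {
      keep <- keep & !in_filter
    } else {
      keep <- keep & in_filter
    }
  }
  keep
}

accs <- c("AF000122", "AF000123", "AF000124")
lens <- c(50, 500, 5000)

kept_len <- accs[passes_filters(accs, lens, min_length = 100, max_length = 1000)]
kept_inv <- accs[passes_filters(accs, lens, acc_filter = "AF000123", invert = TRUE)]
```

Because the conditions are joined with a logical AND, a record must satisfy every active filter to be kept.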

Value

data.frame, or NULL if no records pass filters

Extract elements of a GenBank record

Description

Return elements of GenBank record e.g. sequence, definition ...

Usage

gb_extract(
  record,
  what = c("accession", "version", "organism", "sequence", "definition", "locus",
    "features", "keywords")
)

Arguments

`record`	GenBank record in text format, character
`what`	Which element to extract

Details

This function uses a REGEX to extract particular elements of a GenBank record. All of the what options return a single character with the exception of 'locus' or 'keywords' that return character vectors and 'features' that returns a list of lists for all features.

Because GenBank is so large and varied, the accuracy of the extraction cannot be guaranteed for every record, although the function is regularly tested on a range of GenBank records.

Note: all non-latin1 characters are converted to '-'.

Value

character or list of lists (what='features') or named character vector (what='locus')

Examples

library(restez)
data('record')
(gb_extract(record = record, what = 'locus'))

Get fasta from GenBank

Description

Get sequence and definition data in FASTA format. Equivalent to rettype='fasta' in rentrez::entrez_fetch().

Usage

gb_fasta_get(id, width = 70)

Arguments

`id`	character, sequence accession ID(s)
`width`	integer, maximum number of characters in a line

Value

named vector of fasta sequences, if no results found NULL
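
The effect of the width argument can be illustrated by wrapping a sequence into fixed-width lines. The formatter below, format_fasta_sketch(), is hypothetical, and the ">accession definition" header layout is an assumption based on common FASTA practice rather than on the package source.

```r
# Hypothetical formatter; assumes a non-empty sequence string
format_fasta_sketch <- function(accession, definition, sequence, width = 70) {
  # Start position of each output line
  starts <- seq(1, nchar(sequence), by = width)
  seq_lines <- substring(sequence, starts, starts + width - 1)
  paste(c(paste0(">", accession, " ", definition), seq_lines), collapse = "\n")
}

mock_seq <- paste(rep("ACGT", 40), collapse = "")  # 160 characters
fasta_txt <- format_fasta_sketch("demo_1", "mock sequence", mock_seq)
fasta_lines <- strsplit(fasta_txt, "\n")[[1]]
```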

Examples

library(restez)
restez_path_set(filepath = tempdir())
demo_db_create(n = 5)
(fasta <- gb_fasta_get(id = 'demo_1'))
(fastas <- gb_fasta_get(id = c('demo_1', 'demo_2')))


# delete demo after example
db_delete(everything = TRUE)

Get organism from GenBank

Description

Return the organism name for an accession ID.

Usage

gb_organism_get(id)

Arguments

`id`	character, sequence accession ID(s)

Value

named vector of organisms, if no results found NULL

Examples

library(restez)
restez_path_set(filepath = tempdir())
demo_db_create(n = 5)
(org <- gb_organism_get(id = 'demo_1'))
(orgs <- gb_organism_get(id = c('demo_1', 'demo_2')))


# delete demo after example
db_delete(everything = TRUE)

Get record from GenBank

Description

Return the entire GenBank record for an accession ID. Equivalent to rettype='gb' in rentrez::entrez_fetch().

Usage

gb_record_get(id)

Arguments

`id`	character, sequence accession ID(s)

Value

named vector of records, if no results found NULL

Examples

library(restez)
restez_path_set(filepath = tempdir())
demo_db_create(n = 5)
(rec <- gb_record_get(id = 'demo_1'))
(recs <- gb_record_get(id = c('demo_1', 'demo_2')))


# delete demo after example
db_delete(everything = TRUE)

Get sequence from GenBank

Description

Return the sequence(s) of one or more records given their accession ID(s).

Usage

gb_sequence_get(id, dnabin = FALSE)

Arguments

`id`	character, sequence accession ID(s)
`dnabin`	Logical vector of length 1; should the sequences be returned using the bit-level coding scheme of the ape package? Default FALSE.

Details

For more information about the dnabin format, see ape::DNAbin().

Value

named vector of sequences, if no results found NULL

Examples

library(restez)
restez_path_set(filepath = tempdir())
demo_db_create(n = 5)
(seq <- gb_sequence_get(id = 'demo_1'))
(seqs <- gb_sequence_get(id = c('demo_1', 'demo_2')))
(fasta_dnabin <- gb_sequence_get(id = 'demo_1', dnabin = TRUE))

# delete demo after example
db_delete(everything = TRUE)

Add to GenBank SQL database

Description

Add records data.frame to SQL-like database.

Usage

gb_sql_add(df)

Arguments

`df`	Records data.frame

Query the GenBank SQL

Description

Generic query function for retrieving data from the SQL database for the get functions.

Usage

gb_sql_query(nm, id)

Arguments

`nm`	character, column name
`id`	character, sequence accession ID(s)

Value

data.frame

Get version from GenBank

Description

Return the accession version for an accession ID.

Usage

gb_version_get(id)

Arguments

`id`	character, sequence accession ID(s)

Value

named vector of versions, if no results found NULL

Examples

library(restez)
restez_path_set(filepath = tempdir())
demo_db_create(n = 5)
(ver <- gb_version_get(id = 'demo_1'))
(vers <- gb_version_get(id = c('demo_1', 'demo_2')))


# delete demo after example
db_delete(everything = TRUE)

Check if the last GenBank release number is the latest

Description

Returns TRUE if the GenBank release number is the most recent GenBank release available.

Usage

gbrelease_check()

Value

logical

Get the GenBank release number in the restez path

Description

Returns the GenBank release number. Returns empty character if none found.

Usage

gbrelease_get()

Details

If no file found, returns empty character vector.

Value

character

Log the GenBank release number in the restez path

Description

This function is called whenever db_download is run. It logs the GenBank release number in 'gb_release.txt' in the user's restez path. The log helps users keep track of whether their database is out of date.

Usage

gbrelease_log(release)

Arguments

`release`	GenBank release number, character

Does the connected database have data?

Description

Returns TRUE if a restez SQL database has data.

Usage

has_data()

Value

Logical

Identify downloadable files

Description

Searches through the release notes for a GenBank release to find all listed .seq files. Returns a data.frame for all .seq files and their description.

Usage

identify_downloadable_files()

Value

data.frame
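
To show the general idea of pulling .seq file names and their descriptions out of release-note text, here is a minimal sketch. The mock lines below only loosely imitate a file listing and are not copied from real release notes; the pattern and the column names seq_file and description are assumptions.

```r
# Mock text, invented for this example.
notes <- c(
  "File listing (mock)",
  "  1. gbexa1.seq - Example sequence entries, part 1.",
  "  2. gbexa2.seq - Example sequence entries, part 2.",
  "A line that mentions no sequence file"
)

# index number, a file name ending in .seq, a dash separator, the description
pattern <- paste0(
  "^[[:space:]]*[0-9]+\\.[[:space:]]+",
  "([^[:space:]]+\\.seq)[[:space:]]+-[[:space:]]+(.*)$"
)

hits <- regmatches(notes, regexec(pattern, notes))
hits <- hits[lengths(hits) > 0]  # drop lines that did not match

seq_files <- data.frame(
  seq_file = vapply(hits, `[`, character(1), 2),     # first capture group
  description = vapply(hits, `[`, character(1), 3),  # second capture group
  stringsAsFactors = FALSE
)
print(seq_files)
```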

Is in db

Description

Determine whether one or more IDs are present in a database.

Usage

is_in_db(id, db = "nucleotide")

Arguments

`id`	character, sequence accession ID(s)
`db`	character, database name

Value

named vector of booleans

Examples

library(restez)
# set the restez path to a temporary dir
restez_path_set(filepath = tempdir())
# create demo database
demo_db_create(n = 5)
# in the demo, IDs are 'demo_1', 'demo_2' ...
ids <- c('thisisnotanid', 'demo_1', 'demo_2')
(is_in_db(id = ids))


# delete demo after example
db_delete(everything = TRUE)

Return date and time of the last added sequence

Description

Return the date and time of the last added sequence as determined using the 'add_log.tsv'.

Usage

last_add_get()

Details

If no file found, returns empty character vector.

Value

character

Return date and time of the last download

Description

Return the date and time of the last download as determined using the 'download_log.tsv'.

Usage

last_dwnld_get()

Details

If no file found, returns empty character vector.

Value

character

Return the last entry

Description

Return the last entry from a tab-delimited log file.

Usage

last_entry_get(fp)

Arguments

`fp`	Filepath, character

Value

vector
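
Reading the last entry of a tab-delimited log can be sketched in base R as follows. The helper last_entry, the column names and the log rows are made up for illustration, and the log is assumed to have a header row; the package's own logs may be laid out differently.

```r
# Write a small mock log to a temporary file.
log_fp <- tempfile(fileext = ".tsv")
log <- data.frame(
  file = c("a.seq", "b.seq"),
  time = c("2025-01-01 10:00:00", "2025-01-02 11:30:00"),
  stringsAsFactors = FALSE
)
write.table(log, log_fp, sep = "\t", row.names = FALSE, quote = FALSE)

# Hypothetical helper: return the final row of the log as a named vector.
last_entry <- function(fp) {
  if (!file.exists(fp)) {
    return(character(0))
  }
  entries <- utils::read.delim(fp, stringsAsFactors = FALSE)
  unlist(entries[nrow(entries), , drop = FALSE])
}

res <- last_entry(log_fp)
print(res)
unlink(log_fp)
```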

Retrieve latest GenBank release number

Description

Downloads the latest GenBank release number and returns it.

Usage

latest_genbank_release()

Value

character

Download the latest GenBank Release Notes

Description

Downloads the latest GenBank release notes to a user's restez download path.

Usage

latest_genbank_release_notes()

List database IDs

Description

Return a vector of all IDs in a database.

Usage

list_db_ids(db = "nucleotide", n = 100)

Arguments

`db`	character, database name
`n`	Maximum number of IDs to return; if NULL, returns all

Details

Warning: can return very large vectors for large databases.

Value

vector of characters

Examples

library(restez)
restez_path_set(filepath = tempdir())
demo_db_create(n = 5)
# Warning: not recommended for real databases
#  with potentially millions of IDs
all_ids <- list_db_ids()


# What shall we do with these IDs?
# ... how about make a mock fasta file
seqs <- gb_sequence_get(id = all_ids)
defs <- gb_definition_get(id = all_ids)
# paste together
fasta_seqs <- paste0('>', defs, '\n', seqs)
fasta_file <- paste0(fasta_seqs, collapse = '\n')
cat(fasta_file)


# delete after example
db_delete(everything = TRUE)

Produce message of missing IDs

Description

Sends message to console stating number of missing IDs.

Usage

message_missing(n)

Arguments

`n`	Number of missing IDs

Mock def

Description

Make a mock sequence definition. Designed to be part of a loop.

Usage

mock_def(i)

Arguments

`i`	integer, iterator

Value

character

Generate mock GenBank records data.frame

Description

Make a mock nucleotide data.frame for entry into a demonstration SQL database.

Usage

mock_gb_df_generate(n)

Arguments

`n`	integer, number of entries

Value

data.frame

Mock org

Description

Make a mock sequence organism. Designed to be part of a loop.

Usage

mock_org(i)

Arguments

`i`	integer, iterator

Value

character

Mock rec

Description

Create a mock GenBank record for demonstration and testing purposes. Designed to be part of a loop. Accession, organism, etc. are optional arguments.

Usage

mock_rec(
  i,
  definition = NULL,
  accession = NULL,
  version = NULL,
  organism = NULL,
  sequence = NULL
)

Arguments

`i`	integer, iterator
`definition`	character
`accession`	character
`version`	character
`organism`	character
`sequence`	character

Value

character

Mock seq

Description

Make a mock sequence. Designed to be part of a loop.

Usage

mock_seq(i, sqlngth = 10)

Arguments

`i`	integer, iterator
`sqlngth`	integer, sequence length

Value

character

Get accession numbers by querying NCBI GenBank

Description

The query string can be formatted using GenBank advanced query terms to obtain accession numbers corresponding to a specific set of criteria.

Usage

ncbi_acc_get(query, strict = TRUE, drop_ver = TRUE)

Arguments

`query`	Character vector of length 1; query string to search GenBank.
`strict`	Logical vector of length 1; should an error be issued if the number of unique accessions retrieved does not match the number of hits from GenBank? Default TRUE.
`drop_ver`	Logical vector of length 1; should the version part of the accession number (e.g., '.1' in 'AB001538.1') be dropped? Default TRUE.

Details

Note this queries NCBI GenBank, not the local database generated with restez.

It can be used either to restrict the accessions used to construct the local database (acc_filter argument of db_create()) or to specify accessions to read from the local database (id argument of gb_fasta_get() and other gb_*_get() functions).

Value

Character vector; accession numbers resulting from query.

Examples

## Not run: 
  # requires an internet connection
  cmin_accs <- ncbi_acc_get("Crepidomanes minutum")
  length(cmin_accs)
  head(cmin_accs)

## End(Not run)
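
The effect of drop_ver can be pictured without an internet connection. The sketch below strips a trailing version suffix from accession strings with a regular expression; the helper drop_version and the pattern are assumptions for illustration, not the code used by ncbi_acc_get(), and the second accession version is invented.

```r
# Hypothetical helper: remove a trailing ".<digits>" version suffix.
drop_version <- function(acc) {
  sub("\\.[0-9]+$", "", acc)
}

# "AB001538.2" is an invented version used only to show de-duplication.
accs <- c("AB001538.1", "AB001538.2", "AY952423.1")
bare <- unique(drop_version(accs))
print(bare)
```

Dropping versions can collapse several versioned accessions into one, which is worth keeping in mind when comparing the number of unique accessions against a hit count.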

Print file size predictions to screen

Description

Predicts the file sizes of the downloads and the database from the GenBank filesize information. Conversion factors are based on previous restez downloads.

Usage

predict_datasizes(uncompressed_filesize)

Arguments

`uncompressed_filesize`	Stated file size in GB, numeric

Print method for status class

Description

Prints to screen the three sections of the status class. Not meant to be used interactively.

Usage

## S3 method for class 'status'
print(x, ...)

Arguments

`x`	Status object
`...`	Other arguments (not used by this function)

Create README in restez_path

Description

Write notes for the curious sorts who peruse the restez_path.

Usage

readme_log()

Example GenBank record

Description

Example GenBank record in text format for demonstration purposes.

Usage

data("record")

Format

A large character object containing record information and DNA sequence.

Source

https://www.ncbi.nlm.nih.gov/nuccore/AY952423.1

References

GenBank

Examples

data(record)
cat(record)

Connect to the restez database

Description

Sets a connection to the local database.

Usage

restez_connect(read_only = FALSE)

Arguments

`read_only`	Logical; should the connection be made in read-only mode? Read-only mode is required for multiple R processes to access the database simultaneously. Default FALSE.

Disconnect from restez database

Description

Safely disconnect from the restez database.

Usage

restez_disconnect()

Check restez filepath

Description

Raises error if restez path does not exist.

Usage

restez_path_check()

Get restez path

Description

Return filepath to where the restez database is stored.

Usage

restez_path_get()

Value

character

Examples

library(restez)
# set a restez path with a tempdir
restez_path_set(filepath = tempdir())
# check what the set path is
(restez_path_get())

Set restez path

Description

Specify the filepath for the local GenBank database.

Usage

restez_path_set(filepath)

Arguments

`filepath`	character, valid filepath to the folder where the database should be stored.

Details

Adds 'restez_path' to options(). In this path the folder 'restez' will be created and all downloaded and database files will be stored there.

Examples

## Not run: 
library(restez)
restez_path_set(filepath = 'path/to/where/you/want/files/to/download')

## End(Not run)

Unset restez path

Description

Set the restez path to NULL.

Usage

restez_path_unset()

Is restez ready?

Description

Returns TRUE if a restez SQL database is available. Use restez_status() for more information.

Usage

restez_ready()

Value

Logical

Examples

library(restez)
fp <- tempdir()
restez_path_set(filepath = fp)
demo_db_create(n = 5)
(restez_ready())
db_delete(everything = TRUE)
(restez_ready())

Restez readline

Description

Wrapper for base readline.

Usage

restez_rl(prompt)

Arguments

`prompt`	character, display text

Value

character

Check restez status

Description

Report to console current setup status of restez.

Usage

restez_status(gb_check = FALSE)

Arguments

`gb_check`	Logical; check whether the last download was from the latest GenBank release? Default FALSE.

Details

Set gb_check=TRUE to see if your downloads are up-to-date.

Value

Status class

Examples

library(restez)
fp <- tempdir()
restez_path_set(filepath = fp)
demo_db_create(n = 5)
restez_status()
db_delete(everything = TRUE)
# Errors:
# restez_status()

Scan a gzipped file for text

Description

Scans a gzipped file for text strings and returns TRUE if any are present.

Usage

search_gz(terms, path)

Arguments

`terms`	Character vector; search terms (most likely GenBank accession numbers)
`path`	Path to the gzipped file to scan

Value

Logical
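
A minimal base R sketch of scanning a gzipped file for any of several terms is given below. The helper search_gz_sketch is hypothetical and reads the whole file into memory, which may not suit large GenBank files; the package's own function may work differently (for example by streaming). The file contents are mock data.

```r
# Create a small gzipped text file with mock content.
gz_fp <- tempfile(fileext = ".gz")
con <- gzfile(gz_fp, open = "wt")
writeLines(c("LOCUS       demo_1", "LOCUS       demo_2"), con)
close(con)

# Hypothetical helper: TRUE if any term occurs anywhere in the file.
search_gz_sketch <- function(terms, path) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  lines <- readLines(con)
  # fixed = TRUE treats each term as a literal string, not a regex
  any(vapply(
    terms,
    function(term) any(grepl(term, lines, fixed = TRUE)),
    logical(1)
  ))
}

found <- search_gz_sketch(c("demo_2", "absent"), gz_fp)
not_found <- search_gz_sketch("absent", gz_fp)
unlink(gz_fp)
```

Literal matching is used here because accession numbers can contain characters such as '.', which would otherwise be interpreted as regular expression syntax.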

Log the system session information in restez path

Description

Records the session and system information to file.

Usage

seshinfo_log()

Set up common test data

Description

Creates temporary test folders.

Usage

setup()

Retrieve GenBank selections made by user

Description

Returns the selections made by the user.

Usage

slctn_get()

Details

If no file found, returns empty character vector.

Value

character vector

Log the GenBank selection made by a user

Description

This function is called whenever a user makes a selection with db_download(). It records the numbers of the user's GenBank selections.

Usage

slctn_log(selection)

Arguments

`selection`	selected GenBank sequences, named vector

Get SQL path

Description

Return path to where SQL database is stored.

Usage

sql_path_get()

Value

character

Print blue

Description

Print to console blue text to indicate a number/statistic.

Usage

stat(...)

Arguments

`...`	Any number of text arguments to print, character

Value

coloured character encoding, character

Generate a list class for storing status information

Description

Creates a three-part list for holding information on the status of the restez file path.

Usage

status_class()

Value

Status class
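
The pairing of a three-part list with an S3 print method can be sketched as follows. The class name status_sketch and the element names path, download and database are invented for this example; the actual sections of the status class are not stated here.

```r
# Hypothetical constructor: a list with three named sections and a class.
status_sketch <- function() {
  structure(
    list(path = list(), download = list(), database = list()),
    class = "status_sketch"
  )
}

# S3 print method: one line per section, reporting how many items it holds.
print.status_sketch <- function(x, ...) {
  for (section in names(x)) {
    cat("[", section, "] ", length(x[[section]]), " item(s)\n", sep = "")
  }
  invisible(x)
}

st <- status_sketch()
st$path$location <- tempdir()  # fill in one piece of information
out <- capture.output(print(st))
print(st)
```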

Get test data directory

Description

Get the folder containing test data.

Usage

testdatadir_get()
