--- title: "Meta-Genome Retrieval" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Meta-Genome Retrieval} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc} --- ```{r, echo = FALSE, message = FALSE} options(width = 750) knitr::opts_chunk$set( comment = "#>", error = FALSE, tidy = FALSE) ``` > **_NOTE:_** To make sure that you have a sufficiently stable (internet) connection between R and the respective databases, please set the default `timeout` setting __on your local machine__ from 60sec to at least 30000sec before running any retrieval functions via: ```r options(timeout = 30000) ``` ## Topics - [1. Perform Meta-Genome Retrieval for Specific Kingdoms of Life](#perform-meta-genome-retrieval) - [1.1 Retrieve Genomic Sequences](#retrieve-genomic-sequences) - [1.1.1 Retrieval from `NCBI RefSeq`](#retrieval-from-ncbi-refseq) - [1.1.2 Retrieval from `NCBI Genbank`](#retrieval-from-ncbi-genbank) - [1.1.3 Retrieval from `ENSEMBL`](#retrieval-from-ensembl) - [1.2 Retrieve Protein Sequences](#retrieve-protein-sequences) - [1.2.1 Retrieval from `NCBI RefSeq`](#retrieval-from-ncbi-refseq-1) - [1.2.2 Retrieval from `NCBI Genbank`](#retrieval-from-ncbi-genbank-1) - [1.2.3 Retrieval from `ENSEMBL`](#retrieval-from-ensembl-1) - [1.3 Retrieve CDS Sequences](#retrieve-cds-sequences) - [1.3.1 Retrieval from `NCBI RefSeq`](#retrieval-from-ncbi-refseq-2) - [1.3.2 Retrieval from `NCBI Genbank`](#retrieval-from-ncbi-genbank-2) - [1.3.3 Retrieval from `ENSEMBL`](#retrieval-from-ensembl-2) - [1.4 Retrieve GFF files](#retrieve-gff-files) - [1.5 Retrieve GTF files](#retrieve-gtf-files) - [1.6 Retrieve RNA sequences](#retrieve-rna-sequences) - [1.6.1 Retrieval from `NCBI RefSeq`](#retrieval-from-ncbi-refseq-3) - [1.6.2 Retrieval from `NCBI Genbank`](#retrieval-from-ncbi-genbank-3) - [1.6.3 Retrieval from `ENSEMBL`](#retrieval-from-ensembl-3) - [1.7 Retrieve Repeat Masker Repeat Annotation File](#retrieve-repeat-masker-sequences) - [1.7.1 Retrieval from `NCBI RefSeq`](#retrieval-from-ncbi-refseq-4) - [1.7.2 Retrieval from `NCBI Genbank`](#retrieval-from-ncbi-genbank-4) - [2. Retrieve groups or subgroups of species ](#retrieve-groups-or-subgroups-of-species) - [2.1 Example retrieval of all `Gammaproteobacteria` genomes from `NCBI RefSeq`](#example-retrieval-of-all-gammaproteobacteria-genomes-from-ncbi-refseq) - [2.2 Example retrieval of all `Adenoviridae` genomes from `NCBI RefSeq`](#example-retrieval-of-all-adenoviridae-genomes-from-ncbi-refseq) - [3. Meta retrieval of genome assembly quality information](#meta-retrieval-of-genome-assembly-quality-information) - [4. Retrieve data from metagenome projects such as `human gut metagenome` project from `NCBI Genbank`](#metagenome-project-retrieval-from-ncbi-genbank) - [5. Retrieve Individual Genomes for all Species in the Tree of Life](#retrieve-individual-genomes-for-all-species-in-the-tree-of-life) - [5.1 Example: Genome Retrieval](#genome-retrieval) - [5.2 Example: Proteome Retrieval](#proteome-retrieval) ## Perform Meta-Genome Retrieval The number of genome assemblies generated and stored in sequence databases is growing exponentially every year. With the availability of this growing amount of genomic data, meta-genomics studies become more and more popular. By using this bulk of genomes for comparing them to thousands of other genomes new structural patterns and evolutionary insights can be obtained. However, the first step in any meta-genomics study is the retrieval of the genomes, proteomes, coding sequences or annotation files that shall be compared and investigated. For this purpose, the `meta.retrieval()` and `meta.retrieval.all()` functions allows users to perform straightforward meta-genome retrieval of hundreds of genomes, proteomes, CDS, etc in R. Finally, in addition to the retrieved sequence information the `meta.retrieval()` and `meta.retrieval.all()` functions will generate a `summary file` which contains information about the genome version, genome status, submitter, etc for each organism to promote computational and scientific reproducibility of the meta-genomics study at hand. This `summary file` can for example be attached as `Supplementary Data` to the respective study. #### Getting Started The `meta.retrieval()` and `meta.retrieval.all()` functions aim to simplify the genome retrieval and computational reproducibility process for meta-genomics studies. Both functions allow users to either download genomes, proteomes, CDS, etc for species within a specific kingdom or subgroup of life (`meta.retrieval()`) or of all species of all kingdoms (`meta.retrieval.all()`). Before `biomartr` users had to write `shell` scripts to download respective genomic data. However, since many meta-genomics packages exist for the R programming language, I implemented this functionality for easy integration into existing R workflows and for easier reproducibility. For example, the pipeline logic of the [magrittr](https://github.com/tidyverse/magrittr) package can be used with `meta.retrieval()` and `meta.retrieval.all()` as follows. ```{r,eval=FALSE} # download all vertebrate genomes, then apply ... meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "genome") %>% ... ``` Here `...` denotes any subsequent meta-genomics analysis. Hence, `meta.retrieval()` enables the pipeline methodology for meta-genomics. ### Retrieve Genomic Sequences To retrieve a list of all available kingdoms stored in the `NCBI RefSeq`, `NCBI Genbank`, and `ENSEMBL` databases users can consult the `getKingdoms()` function which stores a list of all available kingdoms of life for the corresponding database. ### Example `NCBI RefSeq`: ```{r, eval=FALSE} getKingdoms(db = "refseq") ``` ``` [1] "archaea" "bacteria" "fungi" "invertebrate" [5] "plant" "protozoa" "vertebrate_mammalian" "vertebrate_other" [9] "viral" ``` ### Example `NCBI Genbank`: ```{r, eval=FALSE} getKingdoms(db = "genbank") ``` ``` [1] "archaea" "bacteria" "fungi" [4] "invertebrate" "plant" "protozoa" [7] "vertebrate_mammalian" "vertebrate_other" ``` In these examples the difference betwenn `db = "refseq"` and `db = "genbank"` is that `db = "genbank"` does not store `viral` information. ### Example `ENSEMBL` ```{r, eval=FALSE} getKingdoms(db = "ensembl") ``` ``` [1] "EnsemblVertebrates" ``` The `ENSEMBL` database does not differentiate between different kingdoms, but specialized on storing high-quality reference genomes of diverse biological disciplines. #### Retrieval from `NCBI RefSeq` Download all mammalian vertebrate genomes from `NCBI RefSeq`. ```{r,eval=FALSE} # download all vertebrate genomes meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "genome", reference = FALSE) ``` The argument `kingdom` specifies the kingdom selected with `getKingdoms()` from which genomes of organisms shall be retrieved. The `db` argument specifies the database from which respective genomes shall be downloaded. The argument `type` specifies that `genome assembly` files shall be retrieved. The argument `reference` indicates whether or not a genome shall be downloaded if it isn't marked in the database as either a `reference genome` or a `representative genome`. Options are: - `reference = FALSE` (__Default__): all organisms (reference, representative, and non-representative genomes) are downloaded. - `reference = TRUE`: organisms that are downloaded must be either a `reference` or `representative genome`. Thus, most genomes which are usually non-reference genomes will not be downloaded and the user will retrieve much less organisms than are stored in databases. When running this command all geneomes are stored in a folder which is either named according to the kingdom (in this case `vertebrate_mammalian`). Alternatively, users can specify the `out.folder` argument to define a custom output folder path. Internally, in this example `meta.retrieval()` will generate a folder named `vertebrate_mammalian` in which respective genomes will be stored. In addition, the `vertebrate_mammalian` folder contains a folder named `documentation` which stores individual documentation files for each individual organism and a `summary file` which stores documentation for all retrieved organisms. This `summary file` can then be used as `Supplementary Data` in studies to promote computational reproducibility. An example documentation file of an individual organism looks like this: ``` File Name: Mus_musculus_genomic_genbank.gff.gz Organism Name: Mus_musculus Database: NCBI genbank URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/635/GCA_000001635.7_GRCm38.p5/GCA_000001635.7_GRCm38.p5_genomic.gff.gz Download_Date: Mon Nov 14 12:43:45 2016 refseq_category: reference genome assembly_accession: GCA_000001635.7 bioproject: PRJNA20689 biosample: NA taxid: 10090 infraspecific_name: NA version_status: latest release_type: Patch genome_rep: Full seq_rel_date: 2016-06-29 submitter: Genome Reference Consortium ``` An example `summary file` of all organism looks like this (here we use 105 Plant species as an example): ``` # A tibble: 105 x 16 file_name organism url database path refseq_category 1 Aegilops_tau Aegilops_ ftp://ftp.ncbi refseq Prot representative 2 Amborella_tr Amborella ftp://ftp.ncbi refseq Prot representative 3 Ananas_comos Ananas_co ftp://ftp.ncbi refseq Prot representative 4 Arabidopsis_ Arabidops ftp://ftp.ncbi refseq Prot representative 5 Arabidopsis_ Arabidops ftp://ftp.ncbi refseq Prot reference geno 6 Arachis_dura Arachis_d ftp://ftp.ncbi refseq Prot representative 7 Arachis_ipae Arachis_i ftp://ftp.ncbi refseq Prot representative 8 Asparagus_of Asparagus ftp://ftp.ncbi refseq Prot representative 9 Auxenochlore Auxenochl ftp://ftp.ncbi refseq Prot representative 10 Bathycoccus_ Bathycocc ftp://ftp.ncbi refseq Prot representative # ... with 95 more rows, and 10 more variables: assembly_accession , # bioproject , biosample , taxid , # infraspecific_name , version_status , release_type , # genome_rep , seq_rel_date , submitter ``` #### Restarting a corrupted download Unfortunately, when downloading large amounts of genomes the NCBI RefSeq database limits the number of queries from an individual IP address. This causes that the download process might stop or time out at a particular step. To overcome this limitation users can simply __re-run__ the `meta.retrieval()` command they previously executed and specify the argument `restart_at_last` which has the following two options: - If `restart_at_last = TRUE` (__Default__) then `meta.retrieval()` will skip all organisms that are already present in the folder and will start downloading all remaining species (thus will pick up from where the initial download process stopped). However, this way `meta.retrieval()` will not be able to check whether already downloaded organism files are corrupted or not by checking the `md5 checksum` of the respective file. Thus, I recommend to download the last organism before `meta.retrieval()` stopped manually using `getGenome()` to make sure that the respective file is not corrupted. - If `restart_at_last = FALSE` then `meta.retrieval()` will start from the beginning and crawl through already downloaded organism files and check whether already downloaded organism files are corrupted or not by checking the `md5 checksum` (this procedure takes longer than `restart_at_last = TRUE`). After checking existing files the function will start downloading all remaining organisms. #### Un-zipping downloaded files After downloading genomes users can format the output of `meta.retrieval()` by first un-zipping downloaded files and renaming them for more convenient downstream data analysis (e.g. from `Saccharomyces_cerevisiae_cds_from_genomic_refseq.fna.gz` to `Scerevisiae.fa`). The easiest way to use `clean.retrieval()` in combination with `meta.retrieval()` is to use the pipe operator from the `magrittr` package: ```r library(magrittr) meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "genome") %>% clean.retrieval() ``` In the first step, genome assembly files are downloaded with `meta.retrieval` and subsequently (`%>%`) un-zipped and re-named using `clean.retrieval()`. Example `Bacteria` ```{r,eval=FALSE} # download all bacteria genomes meta.retrieval(kingdom = "bacteria", db = "refseq", type = "genome", reference = FALSE) ``` Example `Viruses` ```{r,eval=FALSE} # download all virus genomes meta.retrieval(kingdom = "viral", db = "refseq", type = "genome", reference = FALSE) ``` Example `Archaea` ```{r,eval=FALSE} # download all archaea genomes meta.retrieval(kingdom = "archaea", db = "refseq", type = "genome", reference = FALSE) ``` Example `Fungi` ```{r,eval=FALSE} # download all fungi genomes meta.retrieval(kingdom = "fungi", db = "refseq", type = "genome", reference = FALSE) ``` Example `Plants` ```{r,eval=FALSE} # download all plant genomes meta.retrieval(kingdom = "plant", db = "refseq", type = "genome", reference = FALSE) ``` Example `Invertebrates` ```{r,eval=FALSE} # download all invertebrate genomes meta.retrieval(kingdom = "invertebrate", db = "refseq", type = "genome", reference = FALSE) ``` Example `Protozoa` ```{r,eval=FALSE} # download all invertebrate genomes meta.retrieval(kingdom = "protozoa", db = "refseq", type = "genome", reference = FALSE) ``` #### Retrieval from `NCBI Genbank` Alternatively, download all mammalian vertebrate genomes from `NCBI Genbank`, e.g. ```{r,eval=FALSE} # download all vertebrate genomes meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "genome", reference = FALSE) ``` Example `Bacteria` ```{r,eval=FALSE} # download all bacteria genomes meta.retrieval(kingdom = "bacteria", db = "genbank", type = "genome", reference = FALSE) ``` Example `Archaea` ```{r,eval=FALSE} # download all archaea genomes meta.retrieval(kingdom = "archaea", db = "genbank", type = "genome", reference = FALSE) ``` Example `Fungi` ```{r,eval=FALSE} # download all fungi genomes meta.retrieval(kingdom = "fungi", db = "genbank", type = "genome", reference = FALSE) ``` Example `Plants` ```{r,eval=FALSE} # download all plant genomes meta.retrieval(kingdom = "plant", db = "genbank", type = "genome", reference = FALSE) ``` Example `Invertebrates` ```{r,eval=FALSE} # download all invertebrate genomes meta.retrieval(kingdom = "invertebrate", db = "genbank", type = "genome", reference = FALSE) ``` Example `Protozoa` ```{r,eval=FALSE} # download all invertebrate genomes meta.retrieval(kingdom = "protozoa", db = "genbank", type = "genome", reference = FALSE) ``` #### Retrieval from `ENSEMBL` ```{r,eval=FALSE} # download all genomes from ENSEMBL meta.retrieval(kingdom = "Ensembl", db = "ensembl", type = "genome", reference = FALSE) ``` ### Retrieve groups or subgroups of species In case users do not wish to retrieve genomes from an entire kingdom, but rather from a group or subgoup (e.g. from species belonging to the `Gammaproteobacteria` class, a subgroup of the `bacteria` kingdom), they can use the following workflow. #### Example retrieval of all `Gammaproteobacteria` genomes from `NCBI RefSeq`: First, users can again consult the `getKingdoms()` function to retrieve kingdom information. ```{r, eval=FALSE} getKingdoms(db = "refseq") ``` ``` [1] "archaea" "bacteria" "fungi" "invertebrate" [5] "plant" "protozoa" "vertebrate_mammalian" "vertebrate_other" [9] "viral" ``` In this example, we will choose the `bacteria` kingdom. Now, the `getGroups()` function allows users to obtain available subgroups of the `bacteria` kingdom. ```{r, eval=FALSE} getGroups(db = "refseq", kingdom = "bacteria") ``` ``` [1] "Acidithiobacillia" "Acidobacteriia" [3] "Actinobacteria" "Alphaproteobacteria" [5] "Aquificae" "Armatimonadetes" [7] "Bacteroidetes/Chlorobi group" "Balneolia" [9] "Betaproteobacteria" "Blastocatellia" [11] "Candidatus Kryptonia" "Chlamydiae" [13] "Chloroflexi" "Cyanobacteria/Melainabacteria group" [15] "Deinococcus-Thermus" "delta/epsilon subdivisions" [17] "Endomicrobia" "Fibrobacteres" [19] "Firmicutes" "Fusobacteriia" [21] "Gammaproteobacteria" "Gemmatimonadetes" [23] "Kiritimatiellaeota" "Nitrospira" [25] "Planctomycetes" "Spirochaetia" [27] "Synergistia" "Tenericutes" [29] "Thermodesulfobacteria" "Thermotogae" [31] "unclassified Acidobacteria" "unclassified Bacteria (miscellaneous)" [33] "unclassified Proteobacteria" "Verrucomicrobia" [35] "Zetaproteobacteria" ``` Please note, that the `kingdom` argument specified in `getGroups()` needs to match with an available kingdom retrieved with `getKingdoms()`. It is also important that in both cases: `getKingdoms()` and `getGroups()` the same database should be specified. Now we choose the group `Gammaproteobacteria` and specify the `group` argument in the `meta.retrieval()` function. ```{r, eval=FALSE} meta.retrieval(kingdom = "bacteria", group = "Gammaproteobacteria", db = "refseq", type = "genome", reference = FALSE) ``` Using this command, all bacterial (`kingdom = "bacteria"`) genomes (`type = "genome"`) that belong to the group `Gammaproteobacteria` (`group = "Gammaproteobacteria"`) will be retrieved from NCBI RefSeq (`db = "refseq"`). Alternatively, `Gammaproteobacteria` genomes can be retrieved from NCBI Genbank by exchanging `db = "refseq"` to `db = "genbank"`. If users wish to download proteome, CDS, or GFF files instead of genomes, they can specify the argument: `type = "proteome"`, `type = "cds"`, or `type = "gff"`. #### Example retrieval of all `Adenoviridae` genomes from `NCBI RefSeq`: Retrieve groups for viruses. ```{r, eval=FALSE} getGroups(db = "refseq", kingdom = "viral") ``` ``` [1] "Adenoviridae" "Alloherpesviridae" [3] "Alphaflexiviridae" "Alphatetraviridae" [5] "Alvernaviridae" "Amalgaviridae" [7] "Ampullaviridae" "Anelloviridae" [9] "Apple fruit crinkle viroid" "Apple hammerhead viroid-like circular RNA" [11] "Apscaviroid" "Arenaviridae" [13] "Arteriviridae" "Ascoviridae" [15] "Asfarviridae" "Astroviridae" [17] "Avsunviroid" "Baculoviridae" [19] "Barnaviridae" "Benyviridae" [21] "Betaflexiviridae" "Bicaudaviridae" [23] "Birnaviridae" "Bornaviridae" [25] "Bromoviridae" "Bunyaviridae" [27] "Caliciviridae" "Carmotetraviridae" [29] "Caulimoviridae" "Cherry leaf scorch small circular viroid-like RNA 1" [31] "Cherry small circular viroid-like RNA" "Chrysoviridae" [33] "Circoviridae" "Closteroviridae" [35] "Cocadviroid" "Coleviroid" [37] "Coronaviridae" "Corticoviridae" [39] "Cystoviridae" "Dicistroviridae" [41] "Elaviroid" "Endornaviridae" [43] "Filoviridae" "Flaviviridae" [45] "Fusarividae" "Fuselloviridae" [47] "Gammaflexiviridae" "Geminiviridae" [49] "Genomoviridae" "Globuloviridae" [51] "Grapevine latent viroid" "Guttaviridae" [53] "Hepadnaviridae" "Hepeviridae" [55] "Herpesviridae" "Hostuviroid" [57] "Hypoviridae" "Hytrosaviridae" [59] "Iflaviridae" "Inoviridae" [61] "Iridoviridae" "Lavidaviridae" [63] "Leviviridae" "Lipothrixviridae" [65] "Luteoviridae" "Malacoherpesviridae" [67] "Marnaviridae" "Marseilleviridae" [69] "Megabirnaviridae" "Mesoniviridae" [71] "Microviridae" "Mimiviridae" [73] "Mulberry small circular viroid-like RNA 1" "Mymonaviridae" [75] "Myoviridae" "Nanoviridae" [77] "Narnaviridae" "Nimaviridae" [79] "Nodaviridae" "Nudiviridae" [81] "Nyamiviridae" "Ophioviridae" [83] "Orthomyxoviridae" "Other" [85] "Papillomaviridae" "Paramyxoviridae" [87] "Partitiviridae" "Parvoviridae" [89] "Pelamoviroid" "Permutotetraviridae" [91] "Persimmon viroid" "Phycodnaviridae" [93] "Picobirnaviridae" "Picornaviridae" [95] "Plasmaviridae" "Pneumoviridae" [97] "Podoviridae" "Polydnaviridae" [99] "Polyomaviridae" "Pospiviroid" [101] "Potyviridae" "Poxviridae" [103] "Quadriviridae" "Reoviridae" [105] "Retroviridae" "Rhabdoviridae" [107] "Roniviridae" "Rubber viroid India/2009" [109] "Rudiviridae" "Secoviridae" [111] "Siphoviridae" "Sphaerolipoviridae" [113] "Sunviridae" "Tectiviridae" [115] "Togaviridae" "Tombusviridae" [117] "Totiviridae" "Turriviridae" [119] "Tymoviridae" "unclassified" [121] "unclassified Pospiviroidae" "Virgaviridae" ``` Now we can choose `Adenoviridae` as group argument for the `meta.retrieval()` function. ```{r, eval=FALSE} meta.retrieval(kingdom = "viral", group = "Adenoviridae", db = "refseq", type = "genome", reference = FALSE) ``` Again, by exchanging `type = "genome"` by either `type = "proteome"`, `type = "cds"`, `type = "rna"`, `type = "assemblystats"`, or `type = "gff"`, users can retrieve proteome, CDS, RNA, genome assembly statistics or GFF files instead of genomes. ### Meta retrieval of genome assembly quality information Although much effort is invested to increase the genome assembly quality when new genomes are published or new versions are released, the influence of genome assembly quality on downstream analyses cannot be neglected. A rule of thumb is, that the larger the genome the more prone it is to genome assembly errors and therefore, a reduction of assembly quality. In [Veeckman et al., 2016](https://doi.org/10.1105/tpc.16.00349) the authors conclude: > As yet, no uniform metrics or standards are in place to estimate the completeness of a genome assembly or > the annotated gene space, despite their importance for downstream analyses In most metagenomics studies, however, the influence or bias of genome assembly quality on the outcome of the analysis (e.g. comparative genomics, annotation, etc.) is neglected. To better grasp the genome assembly quality, the NCBI databases store genome assembly statistics of some species for which genome assemblies are available. An example assembly statistics report can be found at: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.36_GRCh38.p10/GCF_000001405.36_GRCh38.p10_assembly_stats.txt. The `biomartr` package allows users to retrieve these genome assembly stats file in an automated way by specifying the argument `type = "assemblystats"` and `combine = TRUE`. Please make sure that `combine = TRUE` when selecting `type = "assemblystats"`. ```{r, eval=FALSE} # show all elements of the data.frame options(tibble.print_max = Inf) # retrieve genome assembly stats for all mammal genome assemblies # and store these stats in a data.frame mammals.gc <- meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "assemblystats", combine = TRUE) mammals.gc ``` ``` species total_length spanned_gaps unspanned_gaps region_count scaffold_count 1 Ornithorhynchus anatinus 1995607322 243698 137 0 200283 2 Sarcophilus harrisii NA 201317 0 0 35974 3 Dasypus novemcinctus NA 268413 0 0 46559 4 Erinaceus europaeus NA 219764 0 0 5803 5 Echinops telfairi NA 269444 0 0 8402 6 Pteropus alecto 1985975446 104566 0 0 65598 7 Rousettus aegyptiacus 1910250568 559 0 0 NA 8 Callithrix jacchus NA 184972 2242 0 16399 9 Cebus capucinus imitator NA 133441 0 0 7156 10 Cercocebus atys NA 65319 0 0 11433 # ... with 89 more rows, and 9 more variables: scaffold_N50 , scaffold_L50 , # scaffold_N75 , scaffold_N90 , contig_count , contig_N50 , total_gap_length , # molecule_count , top_level_count ``` Analogously, this information can be retrieved for each kingdom other than `kingdom = "vertebrate_mammalian"`. Please consult `getKingdoms()` for available kingdoms. ### Metagenome project retrieval from NCBI Genbank NCBI Genbank stores [metagenome projects](ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/metagenomes/) in addition to species specific genome, proteome or CDS sequences. To retrieve these metagenomes users can perform the following combination of commands: First, users can list the project names of available metagenomes by typing ```{r,eval=FALSE} # list available metagenomes at NCBI Genbank listMetaGenomes() ``` ``` [1] "metagenome" "human gut metagenome" "epibiont metagenome" [4] "marine metagenome" "soil metagenome" "mine drainage metagenome" [7] "mouse gut metagenome" "marine sediment metagenome" "termite gut metagenome" [10] "hot springs metagenome" "human lung metagenome" "fossil metagenome" [13] "freshwater metagenome" "saltern metagenome" "stromatolite metagenome" [16] "coral metagenome" "mosquito metagenome" "fish metagenome" [19] "bovine gut metagenome" "chicken gut metagenome" "wastewater metagenome" [22] "microbial mat metagenome" "freshwater sediment metagenome" "human metagenome" [25] "hydrothermal vent metagenome" "compost metagenome" "wallaby gut metagenome" [28] "groundwater metagenome" "gut metagenome" "sediment metagenome" [31] "ant fungus garden metagenome" "food metagenome" "hypersaline lake metagenome" [34] "hydrocarbon metagenome" "activated sludge metagenome" "viral metagenome" [37] "bioreactor metagenome" "wasp metagenome" "permafrost metagenome" [40] "sponge metagenome" "aquatic metagenome" "insect gut metagenome" [43] "activated carbon metagenome" "anaerobic digester metagenome" "rock metagenome" [46] "terrestrial metagenome" "rock porewater metagenome" "seawater metagenome" [49] "scorpion gut metagenome" "soda lake metagenome" "glacier metagenome" ``` Internally the `listMetaGenomes()` function downloads the assembly_summary.txt file from ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/metagenomes/ to retrieve available metagenome information. This procedure might take a few seconds during the first run of `listMetaGenomes()`. Subsequently, the assembly_summary.txt file will be stored in the `tempdir()` directory to achieve a much faster access of this information during following uses of `listMetaGenomes()`. In case users wish to retrieve detailed information about available metagenome projects they can specify the `details = TRUE` argument. ```{r,eval=FALSE} # show all elements of the data.frame options(tibble.print_max = Inf) # detailed information on available metagenomes at NCBI Genbank listMetaGenomes(details = TRUE) ``` ``` # A tibble: 857 x 21 assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid 1 GCA_000206185.1 PRJNA32359 SAMN02954317 AAGA00000000.1 na 256318 256318 2 GCA_000206205.1 PRJNA32355 SAMN02954315 AAFZ00000000.1 na 256318 256318 3 GCA_000206225.1 PRJNA32357 SAMN02954316 AAFY00000000.1 na 256318 256318 4 GCA_000208265.2 PRJNA17779 SAMN02954240 AASZ00000000.1 na 256318 256318 5 GCA_000208285.1 PRJNA17657 SAMN02954268 AATO00000000.1 na 256318 256318 6 GCA_000208305.1 PRJNA17659 SAMN02954269 AATN00000000.1 na 256318 256318 7 GCA_000208325.1 PRJNA16729 SAMN02954263 AAQL00000000.1 na 256318 256318 8 GCA_000208345.1 PRJNA16729 SAMN02954262 AAQK00000000.1 na 256318 256318 9 GCA_000208365.1 PRJNA13699 SAMN02954283 AAFX00000000.1 na 256318 256318 10 GCA_900010595.1 PRJEB11544 SAMEA3639840 CZPY00000000.1 na 256318 256318 # ... with 847 more rows, and 14 more variables: organism_name , infraspecific_name , # isolate , version_status , assembly_level , release_type , genome_rep , # seq_rel_date , asm_name , submitter , gbrs_paired_asm , paired_asm_comp , # ftp_path , excluded_from_refseq ``` Finally, users can retrieve available metagenomes using `getMetaGenomes()`. The `name` argument receives the metagenome project name retrieved with `listMetaGenomes()`. The `path` argument specifies the folder path in which corresponding genomes shall be stored. ```{r,eval=FALSE} # retrieve all genomes belonging to the human gut metagenome project getMetaGenomes(name = "human gut metagenome", path = file.path("_ncbi_downloads","human_gut")) ``` ``` 1] "The metagenome of 'human gut metagenome' has been downloaded to '_ncbi_downloads/human_gut'." [1] "_ncbi_downloads/human_gut/GCA_000205525.2_ASM20552v2_genomic.fna.gz" [2] "_ncbi_downloads/human_gut/GCA_000205765.1_ASM20576v1_genomic.fna.gz" [3] "_ncbi_downloads/human_gut/GCA_000205785.1_ASM20578v1_genomic.fna.gz" [4] "_ncbi_downloads/human_gut/GCA_000207925.1_ASM20792v1_genomic.fna.gz" [5] "_ncbi_downloads/human_gut/GCA_000207945.1_ASM20794v1_genomic.fna.gz" [6] "_ncbi_downloads/human_gut/GCA_000207965.1_ASM20796v1_genomic.fna.gz" [7] "_ncbi_downloads/human_gut/GCA_000207985.1_ASM20798v1_genomic.fna.gz" [8] "_ncbi_downloads/human_gut/GCA_000208005.1_ASM20800v1_genomic.fna.gz" [9] "_ncbi_downloads/human_gut/GCA_000208025.1_ASM20802v1_genomic.fna.gz" [10] "_ncbi_downloads/human_gut/GCA_000208045.1_ASM20804v1_genomic.fna.gz" [11] "_ncbi_downloads/human_gut/GCA_000208065.1_ASM20806v1_genomic.fna.gz" [12] "_ncbi_downloads/human_gut/GCA_000208085.1_ASM20808v1_genomic.fna.gz" [13] "_ncbi_downloads/human_gut/GCA_000208105.1_ASM20810v1_genomic.fna.gz" [14] "_ncbi_downloads/human_gut/GCA_000208125.1_ASM20812v1_genomic.fna.gz" [15] "_ncbi_downloads/human_gut/GCA_000208145.1_ASM20814v1_genomic.fna.gz" [16] "_ncbi_downloads/human_gut/GCA_000208165.1_ASM20816v1_genomic.fna.gz" ... ``` Internally, `getMetaGenomes()` creates a folder specified in the `path` argument. Genomes associated with the metagenomes project specified in the `name` argument will then be downloaded and stored in this folder. As return value `getMetaGenomes()` returns the file paths to the genome files which can then be used as input to the `read*()` functions. Alternatively or subsequent to the metagenome retrieval, users can retrieve annotation files of genomes belonging to a metagenome project selected with `listMetaGenomes()` by using the `getMetaGenomeAnnotations()` function. ```{r,eval=FALSE} # retrieve all genomes belonging to the human gut metagenome project getMetaGenomeAnnotations(name = "human gut metagenome", path = file.path("_ncbi_downloads","human_gut","annotations")) ``` ``` [1] "The annotations of metagenome 'human gut metagenome' have been downloaded and stored at '_ncbi_downloads/human_gut/annotations'." [1] "_ncbi_downloads/human_gut/annotations/GCA_000205525.2_ASM20552v2_genomic.gff.gz" [2] "_ncbi_downloads/human_gut/annotations/GCA_000205765.1_ASM20576v1_genomic.gff.gz" [3] "_ncbi_downloads/human_gut/annotations/GCA_000205785.1_ASM20578v1_genomic.gff.gz" [4] "_ncbi_downloads/human_gut/annotations/GCA_000207925.1_ASM20792v1_genomic.gff.gz" [5] "_ncbi_downloads/human_gut/annotations/GCA_000207945.1_ASM20794v1_genomic.gff.gz" [6] "_ncbi_downloads/human_gut/annotations/GCA_000207965.1_ASM20796v1_genomic.gff.gz" [7] "_ncbi_downloads/human_gut/annotations/GCA_000207985.1_ASM20798v1_genomic.gff.gz" [8] "_ncbi_downloads/human_gut/annotations/GCA_000208005.1_ASM20800v1_genomic.gff.gz" [9] "_ncbi_downloads/human_gut/annotations/GCA_000208025.1_ASM20802v1_genomic.gff.gz" [10] "_ncbi_downloads/human_gut/annotations/GCA_000208045.1_ASM20804v1_genomic.gff.gz" [11] "_ncbi_downloads/human_gut/annotations/GCA_000208065.1_ASM20806v1_genomic.gff.gz" [12] "_ncbi_downloads/human_gut/annotations/GCA_000208085.1_ASM20808v1_genomic.gff.gz" [13] "_ncbi_downloads/human_gut/annotations/GCA_000208105.1_ASM20810v1_genomic.gff.gz" [13] "_ncbi_downloads/human_gut/annotations/GCA_000208105.1_ASM20810v1_genomic.gff.gz" [14] "_ncbi_downloads/human_gut/annotations/GCA_000208125.1_ASM20812v1_genomic.gff.gz" [15] "_ncbi_downloads/human_gut/annotations/GCA_000208145.1_ASM20814v1_genomic.gff.gz" [16] "_ncbi_downloads/human_gut/annotations/GCA_000208165.1_ASM20816v1_genomic.gff.gz" ... ``` The file paths of the downloaded `*.gff` are retured by `getMetaGenomeAnnotations()` and can be used as input for the `read.gff()` function in the [seqreadr](https://github.com/HajkD/seqreadr) package. #### Retrieve Protein Sequences Download all mammalian vertebrate proteomes. #### Retrieval from `NCBI RefSeq`: ```{r,eval=FALSE} # download all vertebrate genomes meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "proteome", reference = FALSE) ``` #### Retrieval from `NCBI Genbank`: ```{r,eval=FALSE} # download all vertebrate genomes meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "proteome", reference = FALSE) ``` #### Retrieval from `ENSEMBL`: ```{r,eval=FALSE} # download all Ensembl proteome sequneces meta.retrieval(kingdom = "Ensembl", db = "ensembl", type = "proteome", reference = FALSE) ``` #### Retrieve CDS Sequences Download all mammalian vertebrate CDS from RefSeq (Genbank does not store CDS data). #### Retrieval from `NCBI RefSeq`: ```{r,eval=FALSE} # download all vertebrate genomes meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "cds", reference = FALSE) ``` #### Retrieval from `NCBI Genbank`: ```{r,eval=FALSE} # download all vertebrate genomes meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "cds", reference = FALSE) ``` #### Retrieval from `ENSEMBL`: ```{r,eval=FALSE} # download all Ensembl CDS sequneces meta.retrieval(kingdom = "Ensembl", db = "ensembl", type = "cds", reference = FALSE) ``` ### Retrieve GFF files Download all mammalian vertebrate gff files. Example `NCBI RefSeq`: ```{r,eval=FALSE} # download all vertebrate gff files meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "gff", reference = FALSE) ``` Example `NCBI Genbank`: ```{r,eval=FALSE} # download all vertebrate gff files meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "gff", reference = FALSE) ``` ### Retrieve GTF files Download all mammalian vertebrate gtf files. Example `ENSEMBL`: ```{r,eval=FALSE} # download all vertebrate gff files meta.retrieval(kingdom = "Ensembl", db = "ensembl", type = "gtf", reference = FALSE) ``` ### Retrieve RNA sequences Download all mammalian vertebrate RNA sequences from `NCBI RefSeq` and `NCBI Genbank`. #### Retrieval from `NCBI RefSeq`: ```{r,eval=FALSE} # download all vertebrate RNA sequneces meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "rna", reference = FALSE) ``` #### Retrieval from `NCBI Genbank`: ```{r,eval=FALSE} # download all vertebrate RNA sequneces meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "rna", reference = FALSE) ``` #### Retrieval from `ENSEMBL`: ```{r,eval=FALSE} # download all Ensembl RNA sequneces meta.retrieval(kingdom = "Ensembl", db = "ensembl", type = "rna", reference = FALSE) ``` ### Retrieve Repeat Masker Sequences Download all mammalian vertebrate Repeat Masker Annotation files from `NCBI RefSeq` and `NCBI Genbank`. #### Retrieval from `NCBI RefSeq`: ```{r,eval=FALSE} # download all vertebrate RNA sequneces meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "rm", reference = FALSE) ``` #### Retrieval from `NCBI Genbank`: ```{r,eval=FALSE} # download all vertebrate RNA sequneces meta.retrieval(kingdom = "vertebrate_mammalian", db = "genbank", type = "rm", reference = FALSE) ``` Users can obtain alternative kingdoms using `getKingdoms()`. ## Retrieve Individual Genomes for all Species in the Tree of Life If users wish to download the all genomes, proteome, CDS, or gff files for all species available in RefSeq or Genbank, they can use the `meta.retrieval.all()` function for this purpose. ### Genome Retrieval Example `RefSeq`: ```{r,eval=FALSE} # download all geneomes stored in RefSeq meta.retrieval.all(db = "refseq", type = "genome", reference = FALSE) ``` Example `Genbank`: ```{r,eval=FALSE} # download all geneomes stored in Genbank meta.retrieval.all(db = "genbank", type = "genome", reference = FALSE) ``` ### Proteome Retrieval Example `RefSeq`: ```{r,eval=FALSE} # download all proteome stored in RefSeq meta.retrieval.all(db = "refseq", type = "proteome", reference = FALSE) ``` Example `Genbank`: ```{r,eval=FALSE} # download all proteome stored in Genbank meta.retrieval.all(db = "genbank", type = "proteome", reference = FALSE) ``` Again, by exchanging `type = "proteome"` by either - `type = "genome"` - `type = "cds"` - `type = "rna"` - `type = "assemblystats"` - `type = "gff"` users can retrieve genome, CDS, RNA, genome assembly statistics or GFF files instead of proteomes.