---
title: "Getting started with phruta"
vignette: >
  %\VignetteIndexEntry{Getting started with phruta}
  \usepackage[utf8]{inputenc}
  %\VignetteEngine{knitr::rmarkdown}
output:
#  pdf_document:
#    highlight: null
#    number_sections: yes
  knitr:::html_vignette:
    toc: true
    fig_caption: yes
---


# Table of Contents

1.  [`phruta` en Español](#esp)?
2.  [What is `phruta`](#intro)?
3.  [Functions in `phruta`](#paragraph0)


## `phruta` en Español <a name="intro"></a>

[Placeholder]

## What is `phruta`? <a name="intro"></a>

The `phruta` package is designed to simplify the basic phylogenetic pipeline in `R`. `phruta` is designed to allow scientists from different backgrounds to assemble their own reproducible phylogenies with as minimal code as possible. All code in `phruta` is run within the same program (`R`) and data from intermediate steps are either stored to the environment or exported locally in independent folders. All code in `phruta` is run within the same environment, an aspect that increases the reproducibility of your analysis. `phruta` looks for potentially (phylogenetically) relevant gene regions for a given set of taxa, retrieves gene sequences, could combine newly downloaded and local gene sequences, performs sequence alignment, phylogenetic inference, and tree dating. `phruta` is largely a wrapper for alternative `R` packages and software.

## Functions in `phruta` <a name="paragraph0"></a>

The current release of `phruta` includes a set of eight major functions. All eight functions form a pipeline within `phruta` to output a time-calibrated phylogeny. However, users interested in using their own files at any stage can run each function independently.

Note that all the functions for which their primary output are sequences (aligned or unaligned) are listed under `sq.*`. All the files that output phylogenies (time-calibrated or not) are listed under `tree.*`.

-   First, the distribution of gene sampled for a given organism or set of taxa can be explored using the `acc.gene.sampling` function. This function will return a table that summarizes either the distribution of genes sampled for the search term in general or specifically across species.

-   Second, given a list of target organisms, users can retrieve a list of accession numbers that are relevant to their search using `acc.table.retrieve()`. Instead of directly downloading sequences from genbank (see `sq.retrieve.direct()` below), retrieving accession numbers allow users to have more control over the sequences that are being used in the analyses. Note that users can also curate the content of the dataset obtained using `sq.retrieve.direct()`.

-   Third, users should download gene sequences. Sequences can be download using the `sq.retrieve.indirect()` from the accession numbers retrieved before using the `acc.table.retrieve()` function. This is the preferred option within `phruta`. Additionally, users can directly download gene sequences using the `sq.retrieve.direct()` function. Both `sq.retrieve.indirect()` and `sq.retrieve.direct()` functions save gene sequences in `fasta` files that will be located in a new directory named `0.Sequences`.

-   Fourth, `sq.add()` allows users to include local sequences to those retrieved from genbank in the previous step. This function saves all the resulting `fasta` files in two directories, combined sequences in `0.Sequences` and local sequences in `0.AdditionalSequences` (originally downloaded sequences are moved to `0.0.OriginalDownloaded` at this step). Note that `sq.add()` is optional.

-   Fifth, the `sq.curate()` function filters out unreliable sequences based on information listed in genbank (e.g. PREDICTED) and on taxonomic information provided by the user. Specifically, this function retrieves taxonomic information from the Global Biodiversity Information Facility (GBIF) database's taxonomic backbone (see alternatives in the advanced vignette to `phruta`). If a given species belongs to a non-target group, this species is dropped from the analyses. This function automatically corrects taxonomy and renames sequences.

-   Sixth, `sq.aln()` performs multiple sequence alignment on `fasta` files. Currently, `phruta` uses the [`DECIPHER` R package](http://www2.decipher.codes/), here. This package allows for adjusting sequence orientation and masking (removing ambiguous sites).

The final two functions in `phruta` focus on tree inference and dating. These two functions depend on external software that needs to be installed (**and tested**) before running. Please make sure both `RAxML` and `PATHd-8` or `treePL` are installed and can be called within `R` using the `system()` function. Note that you can choose between `PATHd-8` and `treePL`. More details on how to install `RAxML` are provided in the phylogenetic vignette of `phruta`. Similarly, we provide details on how to install `PATHd-8` and `treePL` in the same vignette.

-   Seventh, the `tree.raxml()` function allows users to perform tree inference under `RAxML` for sequences in a given folder. This is a wrapper to `ips::raxml()` and each of the arguments can be customized. The current release of `phruta` can manage both partitioned and unpartitioned analyses. Starting and constrained trees are allowed.

-   Eight, `tree.dating()` enables users to perform time-calibrations of a given phylogeny using `geiger::congruify.phylo()`. `phruta` includes a basic set of comprehensively sampled, time-calibrated phylogenies that are used to extract secondary calibrations for the target phylogeny. Note that sampling in those phylogenies can be examined using `data(SW.phruta)`. Please make sure you have at least **two** groups in common with each of the phylogenies. Similarly, users can choose to run either `PATHd-8` or `treePL`.