---
title: "Stepping into the HYDAT Database"
author: "Dewey Dunnington"
date: "2024-01-09"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Stepping into the HYDAT Database}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


The HYDAT database is a massive hydrologic data resource. The functions in this package are designed to get the most out of the HYDAT database as quickly as possible, however in the process of reformatting the data to be more useful, the package modifies the original tables. This vignette is intended to demonstrate how to access tables within the database for your own custom HYDAT analysis, should additional information be needed. This should not be necessary for the vast majority of users, and is intended only for advanced R users.

## Downloading HYDAT

Before loading the HYDAT database, the latest version of the database must be downloaded using `hydat_download()`. This is a fairly lengthy operation (the download is around 1 GB) and may require several cups of coffee worth of your time.


```r
hydat_download()
```


## Working with HYDAT tables

The HYDAT database is a SQLite database, which can be accessed in R using the [dplyr](https://dplyr.tidyverse.org/) and [dbplyr](https://dbplyr.tidyverse.org/) packages. This package has simplified the connection process, so all you have to do to connect to the database is use `hy_src()`.


```r
src <- hy_src()
```

To list the tables, use `src_tbls()` from the **dplyr** package.


```r
src_tbls(src)
#>  [1] "AGENCY_LIST"            "ANNUAL_INSTANT_PEAKS"  
#>  [3] "ANNUAL_STATISTICS"      "CONCENTRATION_SYMBOLS" 
#>  [5] "DATA_SYMBOLS"           "DATA_TYPES"            
#>  [7] "DATUM_LIST"             "DLY_FLOWS"             
#>  [9] "DLY_LEVELS"             "MEASUREMENT_CODES"     
#> [11] "OPERATION_CODES"        "PEAK_CODES"            
#> [13] "PRECISION_CODES"        "REGIONAL_OFFICE_LIST"  
#> [15] "SAMPLE_REMARK_CODES"    "SED_DATA_TYPES"        
#> [17] "SED_DLY_LOADS"          "SED_DLY_SUSCON"        
#> [19] "SED_SAMPLES"            "SED_SAMPLES_PSD"       
#> [21] "SED_VERTICAL_LOCATION"  "SED_VERTICAL_SYMBOLS"  
#> [23] "STATIONS"               "STN_DATA_COLLECTION"   
#> [25] "STN_DATA_RANGE"         "STN_DATUM_CONVERSION"  
#> [27] "STN_DATUM_UNRELATED"    "STN_OPERATION_SCHEDULE"
#> [29] "STN_REGULATION"         "STN_REMARKS"           
#> [31] "STN_REMARK_CODES"       "STN_STATUS_CODES"      
#> [33] "VERSION"
```

To inspect any particular table, use the `tbl()` function with the `src` and the table name.


```r
tbl(src, "STN_OPERATION_SCHEDULE")
#> # Source:   table<STN_OPERATION_SCHEDULE> [?? x 5]
#> # Database: sqlite 3.41.2 [/Users/samalbers/_dev/gh_repos/tidyhydat/inst/test_db/tinyhydat.sqlite3]
#>    STATION_NUMBER DATA_TYPE  YEAR MONTH_FROM MONTH_TO
#>    <chr>          <chr>     <int> <chr>      <chr>   
#>  1 05AA008        H          2012 JAN        DEC     
#>  2 05AA008        H          2013 JAN        DEC     
#>  3 05AA008        H          2014 JAN        DEC     
#>  4 05AA008        H          2015 JAN        DEC     
#>  5 05AA008        H          2016 JAN        DEC     
#>  6 05AA008        H          2017 JAN        DEC     
#>  7 05AA008        H          2018 JAN        DEC     
#>  8 05AA008        H          2019 JAN        DEC     
#>  9 05AA008        H          2020 JAN        DEC     
#> 10 05AA008        Q          1910 <NA>       <NA>    
#> # ℹ more rows
```

Working with SQL tables in dplyr is much like working with regular data frames, except no data is actually read from the database until necessary. Because some of these tables are large (particularly those containing the actual data), you will want to `filter()` the tables before you `collect()` them (the `collect()` operation loads them into memory as a `data.frame`).


```r
tbl(src, "STN_OPERATION_SCHEDULE") %>%
  filter(STATION_NUMBER == "05AA008") %>%
  collect()
#> # A tibble: 103 × 5
#>    STATION_NUMBER DATA_TYPE  YEAR MONTH_FROM MONTH_TO
#>    <chr>          <chr>     <int> <chr>      <chr>   
#>  1 05AA008        H          2012 JAN        DEC     
#>  2 05AA008        H          2013 JAN        DEC     
#>  3 05AA008        H          2014 JAN        DEC     
#>  4 05AA008        H          2015 JAN        DEC     
#>  5 05AA008        H          2016 JAN        DEC     
#>  6 05AA008        H          2017 JAN        DEC     
#>  7 05AA008        H          2018 JAN        DEC     
#>  8 05AA008        H          2019 JAN        DEC     
#>  9 05AA008        H          2020 JAN        DEC     
#> 10 05AA008        Q          1910 <NA>       <NA>    
#> # ℹ 93 more rows
```

When you are finished with the database (i.e., the end of the script), it is good practice to close the connection (you may get a loud red warning if you don't!).


```r
hy_src_disconnect(src)
```