--- title: "NetCDF with tidync" author: "Michael D. Sumner" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: fig_width: 10 fig_height: 10 vignette: > %\VignetteIndexEntry{NetCDF with tidync} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) options(pillar.subtle = FALSE, pillar.sigfig = 4) ``` The goal of tidync is to ease exploring the contents of a NetCDF source and to simplify the process of data extraction. Array dimension expressions virtually *slice* (i.e. *index*) into data arrays with immediate feedback, and ease programming for automating data extraction. No data is read until explicitly requested, and can be pulled directly as an array, or in long-form as a data frame. NetCDF is **Network Common Data Form**, a [data model][not-a-format] and an [API](https://en.wikipedia.org/wiki/Application_programming_interface). This is a very common, and *very general* way to store and work with scientific array-based data. NetCDF is defined and provided by [Unidata](https://www.unidata.ucar.edu/software/netcdf/). Here we introduce traditional concepts of NetCDF, and show examples with built-in files and online sources to demonstrate tidync functionality. # NetCDF NetCDF is a very widely used file format for storing array-based data as *variables*. The **space** occupied by a **variable** is defined by its **dimensions** and their metadata. Dimensions are by definition *one-dimensional* (i.e. an atomic vector in R of length 1 or more), an array with coordinate metadata, units, type and interpretation. The **space** of a variable is defined as one or more of the dimensions in the file. A given variable won't necessarily use all the available dimensions and no dimensions are mandatory or particularly special. Some conventions exist to define usage and minimal standards for metadata for particular file schemas, but these are many and varied, and not always adhered to. Tidync does not provide a data interpretation as it is intended for general use, including by tools that create data formats. The R community is not particularly strong with use of NetCDF, though it is common and widely used it pales compared to use in general climate science work, and there the most used tool is the [CDO Climate Data Operators](https://code.mpimet.mpg.de/projects/cdo). In R the most common tools used are ncdf4 and raster (which uses ncdf4). Both the RNetCDF and ncdf4 packages provide a traditional summary format, familiar to many NetCDF users as the output of the command line program [`ncdump`](https://www.unidata.ucar.edu/software/netcdf/workshops/most-recent/utilities/Ncdump.html). ```{r} ice_file <- system.file("extdata", "ifremer", "20171002.nc", package = "tidync", mustWork = TRUE) library(RNetCDF) print.nc(open.nc(ice_file)) ``` Using `ncdump` at the command line on a suitable system would yield very similar output to the print above. ```bash ncdump -h /path/to/extdata/ifremer/20171002.nc ``` With the ncdf4 package it's a slightly different approach, but gives the same result. ```R print(ncdf4::nc_open(ice_file)) ``` Notice how the listing above is organized by *dimension* and then by *variable*. It's not particularly obvious that some variables are defined within the same set of dimensions as others. A NetCDF file is a container for simple array based data structures. There is limited capacity ^[I.e. it's not possible to query a file for an arbitrary sparse set of values, without constructing a degenerate hyperslab query for each point or reading a hyperslab containing cells not in the set. Do you know different? Please let me know!] in the formal API for accessing data randomly within a variable, the primary mechanism is to define offset and stride (start and count) hyperslab indexes. ## tidync Tidync aims to ease exploration of the contents of a NetCDF file and provides methods extract arbitrary hyperslabs. These can be used directly in array contexts, or in "long form" database contexts. On first contact with the file, the available variables are classified by grid and dimension. The "active" grid is the one that queries may be made against, and may be changed with the `activate` function. ```{r} library(tidync) tidync(ice_file) ``` Here we see variables are clearly grouped by the *grid* they exist in, where grid is a specific (and ordered!) set of dimensions. This allows us to see the set of variables that implicitly co-exist, they have the same *shape*. The first grid "D0,D1,D2" has two variables, *concentration* and *quality_flag*, and the second "D2" has only one variable *time*. There are no general rules here, a file might have any number of dimensions and variables, and any variable might be defined by one or more dimensions. In this case the D2 grid has only one variable in its single dimension, and it happens to be a special kind of variable - a "coordinate dimension", as indicated by the `coord_dim` flag. In the traditional `ncdump` summary above it's easy to see there's only really one data grid, in `ni,nj,time` that it holds two variables, and that time is a special coordinate dimension - in contrast neither `ni` or `nj` have an explicit 1-dimension variable. When there are many dimensions and/or many variables those patterns are *not* easy to see. A particular grid was chosen by default, this is the "D0,D1,D2" grid composed of 3 dimensions, generally the largest grid will be chosen as that is *usually* the target we would be after. To choose a different grid we may nominate it by name, or by member variable. By name we choose a grid composed of only one dimension, and the summary print makes a distinction based on which dimensions are *active*. ```{r activate} tidync(ice_file) %>% activate("D2") ``` It is also possible to choose a grid by *variable name*. If we choose a variable in the default grid, then it's no different to the first case. ```{r NSE-activate} tidync(ice_file) %>% activate("time") ## choose grid by variable name, which happens to be the default grid here tidync(ice_file) %>% activate("quality_flag") ## same as the default tidync(ice_file) %>% activate("D0,D1,D2") ``` The data variables available on a grid can be expanded out as a single data frame, which all the coordinates copied out - this is not efficient(!) but if we craft our queries sensibly to read only what we need, it's a very easy way to explore the data in a file. The 'hyper_filter' function allows specification of expressions to subset a variable based on each dimension's coordinate values. If no expressions are included we are presented with a table containing a row for each dimension, its extent in coordinates and its length. For convenience we also assign the activate form to an R variable, though we could just chain the entire operation without this. ```{r} concentration <- tidync(ice_file) concentration %>% hyper_filter() ``` By specifying inequality expressions we see an *implicit* subsetting of the array. Everything so far is implicit to delay any file-based computation required to actually interact with the file and read from it. Notice that these are "name = expr" paired expressions, because the right hand side may be quite general we need the left hand side name to be assured of the name of the dimension referred to. ```{r} concentration %>% hyper_filter(nj = nj < 20) ``` We can also use the special internal variable 'index', which will test against position in the dimension elements '1:length' rather than the values. It's not different in this case because ni and nj are just position dimensions anyway. The special 'dplyr' adverbs like 'between' will work. ```{r} concentration %>% hyper_filter(ni = index < 50, nj = dplyr::between(index, 30, 100)) ``` ## Data extraction How to use these idioms to extract actual data? We can now exercise these variable choice and dimension filters to return actual data, either in by slicing out a "slab" in array-form, or as a data frame. ```{r} hf <- concentration %>% hyper_filter(ni = index > 150, nj = dplyr::between(index, 30, 100)) ## as an array arr <- hf %>% hyper_array() str(arr) ## as a data frame hf %>% hyper_tibble() %>% dplyr::filter(!is.na(concentration)) %>% dplyr::distinct(concentration, quality_flag) ``` ## Interrogating data by dimensions The connection object 'hf' is available for determining what is available and how we might cut into it. 'time' interestingly is of length 1, so perhaps adds nothing to the information about this otherwise 2D data set. If we were to open this file in `ncdf4` or `RNetCDF` and wanted to take a subset of the file out, we would have to specify and `start` of 1 and a `count` of ` even though it's completely redundant. ```{r sea-ice-example} hf ``` The 'start' and 'count' values reported are directly useable by the traditional API tools, and in particular by the functions `ncdf4::ncvar_get()` (`varid`, `start`, `count`), its counterpart in `RNetCDF::var.get.nc()` and command line tools like CDO. But, it's a pain to have to know the dimension of the variable and specify every slot. Code in the traditional API that looks like this ```{r eval= FALSE} ## WARNING, pseudocode var_get(con, variable, start = c(1, 1, 1), count = c(10, 5, 1)) ``` Obviously, it should be reasonable to specify only the count of any dimensions that *we don't want the entirety of*. This problem manifests exactly in R arrays generally, we can't provide information only about the dimensions we want, they have to be specified explicitly - even if we mean *all of it*. It does not matter what we include in the filter query, it can be all, some or none of the available dimensions, in any order. ```{r dimension-index,eval=TRUE} hf %>% hyper_filter(nj = index < 20, ni = ni > 20) hf %>% hyper_filter(nj = index < 20) ``` The other requirement we have is the ability to automate these task, and so far we have only interacted with the dimensionality information from the print out. For programming, there are functions `hyper_vars()`, `hyper_dims()` and `hyper_grids()` to report on elements in the source. The value of `hyper_vars()` and `hyper_dims()` is restricted to the *active* grid. The function `hyper_grids()` reports all available grids, with the currently active one indicated. The name of the current grid is also available via `active()`. ```{r} hyper_vars(hf) hyper_dims(hf) ## change the active grid hf %>% activate("D2") %>% hyper_vars() active(hf) hf %>% activate("D2") %>% active() ``` ## Transforms Under the hood tidync manages the relationship between dimensions and coordinates via *transforms* tables. This is a 1-dimensional mapping between dimension index and its coordinate value (if there's no explicit value it is assumed to be equal to the 1-based index). These tables are available via the `hyper_transforms()` function. Each table has columns ``, `index`, `id`, `name`, `coord_dim` (whether the dimension has explicit coordinates, in the `` column), and `selected`. The `selected` column records which of the dimension elements is currently requested by a `hyper_filter` query, and by default is set to `TRUE`. Expressions in the filter function work by updating this column. ```{r} hyper_transforms(hf) ``` ## Future improvements * support multiple sources at once for lazy read of a virtual grid * support groups * support compound types * allow a group-by function for a polygon layer, against a pair of dimensions to classify cells * allow a truly DBI and dplyr level of lazy read, with more filter, select, mutate and collect idioms * provide converters to raster format, stars format [not-a-format]: https://twitter.com/TedHabermann/status/958034585002041344