--- title: "Generate computational grids" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Generate computational grids} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = TRUE, echo = FALSE} knitr::opts_knit$set( echo = TRUE, message = FALSE, warning = FALSE, out.width = "900px" ) ``` ```{r} library(chopin) library(dplyr) library(sf) library(terra) library(anticlust) library(igraph) library(future) library(future.mirai) options(sf_use_s2 = FALSE) ``` ## Computational grids - Computational grids refer to a set of polygons that cover the entire spatial domain of interest. These grids are used to split the spatial domain into smaller pieces for parallel processing. - `chopin` provides functions to generate computational grids for parallel processing by `par_grid()` function. - The standard sequence of running `par_grid()` includes the following steps: 1. Generate computational grids with `par_pad_grid()` or `par_pad_balanced()`. This step will give you a set of grid polygons, one of which is original splits and the other is padded in consideration of buffer radius in the subsequent spatial operations. 2. Run `par_grid()` with the input dataset and the grid polygons. - Before advancing, we define terms for clarity. - **Input data**: the data where the computed value will be stored. For example, when extracting raster values with vector objects, vector objects are the input data. - **Target data**: the data _from_ which the values come. In the example above, raster object is the target data. - Original and padded grids will be used to split the main input with the original grid and the target dataset for computation with the padded grid, respectively. ## Types of computational grids and their generation - There are two approaches to generate computational grids. One is to use `par_pad_grid()` with one of three `mode`s and the other is to use `par_group_grid()`. Thus, users have **four options** in total to generate computational grids. - `padding` argument is very important to ensure the accurate parallel operations in a case where buffering is involved. Suppose a user has point geometry inputs and apply circular buffer with a certain radius. Since the original grid (without padding = no overlap between adjacent grids) filters the original input in each parallel worker, points near each grid border may miss target data as buffered polygons exceed the grid. ### `par_pad_grid()`: standard interface - `par_pad_grid()` generates regular grid polygons with padding for parallel processing. Padding is the distance of overlapping between grid polygons, essentially from the buffer radius of the points when buffer polygons are concerned. - `par_pad_grid()` supports three internal `mode`s: - `mode = "grid"`: generates regular grid polygons with padding, `nx` and `ny` arguments determine the number of columns and rows in the grid, respectively. - `mode = "grid_quantile"`: generates regular grid polygons with padding based on quantiles of the number of points in each grid. The grids will look irregular and the points per grid will be more balanced than the `grid` mode. - `mode = "grid_advanced"`: generates regular grid polygons with padding based on the number of points in each grid and the number of points in the entire dataset. `nx` and `ny` arguments determine the number of columns and rows in the grid, then `merge_max` argument controls how many adjacent grids are merged into one grid. `grid_min_features` argument determines the minimum number of points in each grid, which means grids with fewer points than this value will be merged with adjacent grids. Adjusting these arguments can balance the computational load among the threads and reduce the overhead of parallelization. ### `par_pad_balanced()`: focusing on getting the balanced clusters - `par_pad_balanced()` groups the inputs into equal size, then generates padded rectangles that cover the same number of points per grid. Users can use the output of this function into `par_grid()` for parallel processing. ### Random points in NC - For demonstration of `par_pad_grid()`, we use moderately clustered point locations generated inside the counties of North Carolina. ```{r nc-gen-points} ncpoly <- system.file("shape/nc.shp", package = "sf") ncsf <- sf::read_sf(ncpoly) ncsf <- sf::st_transform(ncsf, "EPSG:5070") plot(sf::st_geometry(ncsf)) # sampling clustered point library(spatstat.random) set.seed(202404) ncpoints <- sf::st_sample( x = ncsf, type = "Thomas", mu = 20, scale = 1e4, kappa = 1.25e-9 ) ncpoints <- ncpoints[seq_len(2e3L), ] ncpoints <- sf::st_as_sf(ncpoints) ncpoints <- sf::st_set_crs(ncpoints, "EPSG:5070") ncpoints$pid <- sprintf("PID-%05d", seq(1, nrow(ncpoints))) plot(sf::st_geometry(ncpoints)) # convert to terra SpatVector ncpoints_tr <- terra::vect(ncpoints) ``` ## Visualize computational grids - The output of `par_pad_grid()` and `par_pad_balanced()` is length 2 list. A significant difference between the two is the first element of the output. In `par_pad_grid()`, it is always a sf or SpatVector object with polygon geometries. On the other hand, `par_pad_balanced()` will have the first element as a sf or SpatVector object with the same geometry type as the original input. For example, it will have point geometries if the input was point. ```{r grid-plain} compregions <- chopin::par_pad_grid( ncpoints_tr, mode = "grid", nx = 8L, ny = 5L, padding = 1e4L ) # a list object class(compregions) # length of 2 names(compregions) par(mfrow = c(2, 1)) plot(compregions$original, main = "Original grids") plot(compregions$padded, main = "Padded grids") ``` ### Generate regular grid computational regions - `chopin::par_pad_grid()` takes a spatial dataset to generate regular grid polygons with `nx` and `ny` arguments with padding. Users will have both overlapping (by the degree of `radius`) and non-overlapping grids, both of which will be utilized to split locations and target datasets into sub-datasets for efficient processing. ```{r gen-compregions} compregions <- chopin::par_pad_grid( ncpoints_tr, mode = "grid", nx = 8L, ny = 5L, padding = 1e4L ) ``` - The output of `par_pad_grid()` is a list object with two elements named `original` (non-overlapping grid polygons) and `padded` (overlapping by `padding`). The class of each element depends on the input dataset class. The figures below illustrate the grid polygons with and without overlaps. ```{r compare-compregions, fig.width = 8, fig.height = 8} names(compregions) oldpar <- par() par(mfrow = c(2, 1)) terra::plot(compregions$original, main = "Original grids") terra::plot(compregions$padded, main = "Padded grids") par(mfrow = c(1, 1)) terra::plot(compregions$original, main = "Original grids") terra::plot(ncpoints_tr, add = TRUE, col = "red", cex = 0.4) ``` ### Split the points by two 1D quantiles - `mode = "grid_quantile"` generates regular grid polygons with padding based on quantiles of the coordinates in each dimension. - When using this mode, users should define `quantiles` argument, which will be used to get the same number of quantiles in each dimension. A convenient way to define `seq()` function with `length.out` argument. The example below uses `length.out = 5`, which will give quartiles. ```{r split-points-grid} grid_quantiles <- chopin::par_pad_grid( input = ncpoints_tr, mode = "grid_quantile", quantiles = seq(0, 1, length.out = 5), padding = 1e4L ) names(grid_quantiles) par(mfrow = c(2, 1)) terra::plot(grid_quantiles$original, main = "Original grids") terra::plot(grid_quantiles$padded, main = "Padded grids") par(mfrow = c(1, 1)) terra::plot(grid_quantiles$original, main = "Original grids") terra::plot(ncpoints_tr, add = TRUE, col = "red", cex = 0.4) ``` ### Merge the grids based on the number of points - `mode = "grid_advanced"` utilizes finer grids to merge the results from the finer grids into the coarser grids. This behavior can balance the computational load among the threads and reduce the overhead of parallelization. That said, this mode internally generates grids in `mode = "grid"` and merges them based on the number of points in each grid. - To determine the adjacency and merging behavior, minimum spanning tree (MST) is identified. The function utilizes `igraph::mst()` for MST identification and other graph summary functions under the hood. - As a note, users can adjust the merging behavior by changing the arguments `grid_min_features` and `merge_max`. ```{r split-points-grid-adv} grid_advanced1 <- chopin::par_pad_grid( input = ncpoints_tr, mode = "grid_advanced", nx = 15L, ny = 8L, padding = 1e4L, grid_min_features = 25L, merge_max = 5L ) par(mfrow = c(2, 1)) terra::plot(grid_advanced1$original, main = "Original grids") terra::plot(grid_advanced1$padded, main = "Padded grids") par(mfrow = c(1, 1)) terra::plot(grid_advanced1$original, main = "Merged grids (merge_max = 5)") terra::plot(ncpoints_tr, add = TRUE, col = "red", cex = 0.4) ``` ```{r split-points-count} ncpoints_tr$n <- 1 n_points <- terra::zonal( ncpoints_tr, grid_advanced1$original, fun = "sum" )[["n"]] grid_advanced1g <- grid_advanced1$original grid_advanced1g$n_points <- n_points terra::plot(grid_advanced1g, "n_points", main = "Number of points in each grid") ``` #### Different values in `merge_max` - Keeping other arguments the same, we can see the difference in the number of merged grids by changing the `merge_max` argument. ```{r split-points-grid-adv2} grid_advanced2 <- chopin::par_pad_grid( input = ncpoints_tr, mode = "grid_advanced", nx = 15L, ny = 8L, padding = 1e4L, grid_min_features = 30L, merge_max = 4L ) par(mfrow = c(2, 1)) terra::plot(grid_advanced2$original, main = "Original grids") terra::plot(grid_advanced2$padded, main = "Padded grids") par(mfrow = c(1, 1)) terra::plot(grid_advanced2$original, main = "Merged grids (merge_max = 8)") terra::plot(ncpoints_tr, add = TRUE, col = "red", cex = 0.4) ``` ```{r split-points-grid-adv3} grid_advanced3 <- chopin::par_pad_grid( input = ncpoints_tr, mode = "grid_advanced", nx = 15L, ny = 8L, padding = 1e4L, grid_min_features = 25L, merge_max = 3L ) par(mfrow = c(2, 1)) terra::plot(grid_advanced3$original, main = "Original grids") terra::plot(grid_advanced3$padded, main = "Padded grids") par(mfrow = c(1, 1)) terra::plot(grid_advanced3$original, main = "Merged grids (merge_max = 3)") terra::plot(ncpoints_tr, add = TRUE, col = "red", cex = 0.4) ``` #### `par_make_balanced()` - `par_make_balanced()` uses `anticlust` package to split the point set into the balanced clusters. - In the background, `par_pad_balanced()` is run first to generate the equally sized clusters from the input. Then, padding is applied to the extent of each cluster to be compatible with `par_grid()`, where both the original and the padded grids are used. - Please note that `ngroups` argument value must be the **exact** divisor of the number of points. For example, in the example below, when one changes `ngroups` to `10L`, it will fail as the number of points is not divisible by 10. - Consult the [`anticlust`](https://cran.r-project.org/web/packages/anticlust/index.html) package for more details on the algorithm. - `par_pad_balanced()` makes a compatible object to the output of `par_pad_grid()` directly from the input points. - As illustrated in the figure below, the points will be split into `ngroups` clusters with the same number of points then processed in parallel by using the output object with `par_grid()`. ```{r par-group-balanced} # ngroups should be the exact divisor of the number of points! group_bal_grid <- chopin::par_pad_balanced( points_in = ncpoints_tr, ngroups = 10L, padding = 1e4 ) group_bal_grid$original$CGRIDID <- as.factor(group_bal_grid$original$CGRIDID) par(mfrow = c(2, 1)) terra::plot(group_bal_grid$original, "CGRIDID", legend = FALSE, main = "Assigned points (ngroups = 10)") terra::plot(group_bal_grid$padded, main = "Padded grids") ```