Getting started with chopin

Introduction

  • chopin automatically distributes geospatial data computation over multiple threads.
  • A function-centered workflow streamlines parallelizing geospatial computation with minimal effort.

chopin workflow

  • The simplest way to parallelize generic geospatial computation is to generate computational grids with the par_pad_* functions and pass them to par_grid, or to run the par_hierarchy or par_multirasters functions directly.
library(chopin)
library(terra)
library(sf)
library(collapse)
library(dplyr)
library(future)
library(future.mirai)
library(future.apply)

Example data

  • North Carolinian counties and a raster of elevation data are used as example data.
nccnty_path <- system.file("extdata", "nc_hierarchy.gpkg", package = "chopin")
ncelev_path <-
  system.file("extdata/nc_srtm15_otm.tif", package = "chopin")
nccnty <- terra::vect(nccnty_path)

ncelev <- terra::rast(ncelev_path)

Generating random points in North Carolina

  • To demonstrate chopin functions, we generate 10,000 random points in North Carolina.
ncsamp <-
  terra::spatSample(
    nccnty,
    1e4L
  )
ncsamp$pid <- 1:nrow(ncsamp)

Creating grids

  • The example below generates a 4-by-2 regular grid (nx = 4, ny = 2) with 10-kilometer padding from the random point data. The padded grids can also be plotted, as sketched after the plot call below.
ncgrid <- par_pad_grid(ncsamp, mode = "grid", nx = 4L, ny = 2L, padding = 10000)

plot(ncgrid$original)
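  • To inspect the padded grids as well, the second element of the returned list can be plotted; this is a sketch, and the element name padded is an assumption here.
# 'padded' is assumed to be the name of the padded-grid element of the list
plot(ncgrid$padded)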

Extracting values from a raster

  • Since all par_* functions operate on future backends, users should define a future plan before running them. The multicore plan supports terra objects and may lead to faster computation, but it is not supported on Windows. An alternative is future.mirai’s mirai_multisession plan, which is supported on most platforms and is generally faster than the plain future multisession plan. (A sketch of the multicore alternative follows the plan call below.)
  • The workers argument should be given an integer value specifying the number of threads to use.
future::plan(future.mirai::mirai_multisession, workers = 2L)
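  • For reference, a minimal sketch of the multicore alternative on a Unix-alike system; it is not supported on Windows and would replace the plan set above, so it is shown commented out.
# alternative backend on Linux/macOS only; would replace the mirai_multisession plan
# future::plan(future::multicore, workers = 2L)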
  • Then we dispatch multiple extract_at runs over the grid polygons; a quick check of the combined result follows the call.
  • Before we proceed, the terra point object needs to be converted to an sf object; this is done inline below with sf::st_as_sf().
pg <-
  par_grid(
    grids = ncgrid,
    pad_y = FALSE,
    .debug = TRUE,
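    # the function to dispatch to parallel workers; the arguments below are passed to it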
    fun_dist = extract_at,
    x = ncelev_path,
    y = sf::st_as_sf(ncsamp),
    id = "pid",
    radius = 1e4,
    func = "mean"
  )
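  • A quick check of the combined result can be done with base functions (output not shown here):
dim(pg)
head(pg)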

Hierarchical processing

  • Here we demonstrate hierarchical processing of the random points using census tract polygons.
nccnty <- sf::st_read(nccnty_path, layer = "county")
## Reading layer `county' from data source 
##   `/tmp/Rtmp98Tj94/Rinst1adb383ef800/chopin/extdata/nc_hierarchy.gpkg' 
##   using driver `GPKG'
## Simple feature collection with 100 features and 1 field
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 1054155 ymin: 1341756 xmax: 1838923 ymax: 1690176
## Projected CRS: NAD83 / Conus Albers
nctrct <- sf::st_read(nccnty_path, layer = "tracts")
## Reading layer `tracts' from data source 
##   `/tmp/Rtmp98Tj94/Rinst1adb383ef800/chopin/extdata/nc_hierarchy.gpkg' 
##   using driver `GPKG'
## Simple feature collection with 2672 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 1054155 ymin: 1341756 xmax: 1838923 ymax: 1690176
## Projected CRS: NAD83 / Conus Albers
  • The example below parallelizes summarizing mean elevation within 10-kilometer circular buffers around the random sample points, grouping the work by the first five characters of the census tract unique identifiers, which correspond to county codes.
  • This demonstrates that a hierarchy can be defined from any given polygons, as long as their unique identifiers are formatted so that a higher-level unit can be derived from them, as illustrated in the sketch below.
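  • For illustration, census tract GEOIDs are 11 characters long (2-digit state, 3-digit county, and 6-digit tract codes), so the first five characters identify the county; the GEOID below is a hypothetical example.
# the first five characters of a tract GEOID are the state + county FIPS codes
substr("37063001000", 1, 5)
## [1] "37063"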
px <-
  par_hierarchy(
    # par_hierarchy-specific arguments
    regions = nctrct,
    regions_id = "GEOID",
    length_left = 5,
    pad = 10000,
    pad_y = FALSE,
    .debug = TRUE,
    # the function to dispatch to parallel workers
    fun_dist = extract_at,
    # arguments passed to the dispatched function
    x = ncelev,
    y = sf::st_as_sf(ncsamp),
    id = "pid",
    radius = 1e4,
    func = "mean"
  )

dim(px)
## [1] 10000     2
head(px)
##   pid      mean
## 1  41  8.817408
## 2 313 13.226359
## 3 318  8.574569
## 4 381 -1.509414
## 5 469 15.428749
## 6 491  9.126508
tail(px)
##        pid      mean
## 9995  9182  3.564007
## 9996  9317  5.493004
## 9997  9479 -3.363616
## 9998  9531  3.991127
## 9999  9797  7.116028
## 10000 9863  4.943126

Multiraster processing

  • Here we demonstrate processing multiple raster files with a single set of random points. For demonstration, the same elevation raster is written to five temporary GeoTIFF files below.
ncelev <-
  system.file("extdata/nc_srtm15_otm.tif", package = "chopin")
ncelev <- terra::rast(ncelev)
tdir <- tempdir(check = TRUE)
terra::writeRaster(ncelev, file.path(tdir, "test1.tif"), overwrite = TRUE)
terra::writeRaster(ncelev, file.path(tdir, "test2.tif"), overwrite = TRUE)
terra::writeRaster(ncelev, file.path(tdir, "test3.tif"), overwrite = TRUE)
terra::writeRaster(ncelev, file.path(tdir, "test4.tif"), overwrite = TRUE)
terra::writeRaster(ncelev, file.path(tdir, "test5.tif"), overwrite = TRUE)

rasts <- list.files(tdir, pattern = "tif$", full.names = TRUE)

pm <-
  par_multirasters(
    filenames = rasts,
    fun_dist = extract_at,
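    # x = NA is a placeholder; each raster in 'filenames' is dispatched as the raster input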
    x = NA,
    y = sf::st_as_sf(ncsamp)[1:500, ],
    id = "pid",
    radius = 1e4,
    func = "mean",
    .debug = TRUE
  )

dim(pm)
## [1] 2500    2
head(pm)
##         mean               base_raster
## 1  161.93254 /tmp/RtmpPAxxbR/test1.tif
## 2   27.01298 /tmp/RtmpPAxxbR/test1.tif
## 3   52.22775 /tmp/RtmpPAxxbR/test1.tif
## 4 1114.79175 /tmp/RtmpPAxxbR/test1.tif
## 5  871.28333 /tmp/RtmpPAxxbR/test1.tif
## 6   47.10661 /tmp/RtmpPAxxbR/test1.tif
tail(pm)
##           mean               base_raster
## 2495  15.38290 /tmp/RtmpPAxxbR/test5.tif
## 2496  27.74776 /tmp/RtmpPAxxbR/test5.tif
## 2497 168.56837 /tmp/RtmpPAxxbR/test5.tif
## 2498 291.13727 /tmp/RtmpPAxxbR/test5.tif
## 2499 760.81891 /tmp/RtmpPAxxbR/test5.tif
## 2500 439.23389 /tmp/RtmpPAxxbR/test5.tif

Caveats

Why is parallelization slower than an ordinary function run?

  • Parallelization may underperform when the datasets are too small to benefit from a divide-and-compute approach, because the parallelization overhead dominates. Overhead here refers to the computational resources required to transfer objects to multiple worker processes.
  • Since the demonstrations above use quite small datasets, the advantage of parallelization is less noticeable than one might expect. When a large amount of data is processed (for example, high spatial/temporal resolution or many files), users will see the efficiency gains of this package; one way to compare is sketched below. Another vignette in this package demonstrates use cases extracting various climate/weather datasets.
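  • One way to gauge the overhead is to time a plain sequential run against the parallel dispatch above; the following is a minimal sketch using the objects defined earlier, and timings will vary by machine and data size.
# sequential run of the same buffer summary, for comparison with par_grid()
system.time(
  extract_at(
    x = ncelev_path,
    y = sf::st_as_sf(ncsamp),
    id = "pid",
    radius = 1e4,
    func = "mean"
  )
)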

Notes on data restrictions

chopin works best with two-dimensional (planar) geometries. Users should disable s2 spherical geometry mode in sf by setting sf::sf_use_s2(FALSE). Running any chopin function on spherical or three-dimensional geometries (e.g., those including M/Z dimensions) may produce incorrect or unexpected results.
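A minimal sketch of preparing inputs accordingly (the example data above are already in a projected CRS, NAD83 / Conus Albers):
# disable s2 spherical geometry before running chopin functions
sf::sf_use_s2(FALSE)
# confirm the inputs are not in geographic (longitude/latitude) coordinates
sf::st_is_longlat(sf::st_as_sf(ncsamp))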