Introduction to pangaear
========================

`pangaear` is a data retrieval interface for the World Data Center PANGAEA (https://www.pangaea.de/). PANGAEA archives published Earth & Environmental Science data under the following subjects: agriculture, atmosphere, biological classification, biosphere, chemistry, cryosphere, ecology, fisheries, geophysics, human dimensions, lakes & rivers, land surface, lithosphere, oceans, and paleontology.

## Installation

If you've not installed it yet, install from CRAN:

```r
install.packages("pangaear")
```

Or the development version:

```r
devtools::install_github("ropensci/pangaear")
```

## Load pangaear

```r
library("pangaear")
```

## Search for data

`pg_search` is a thin wrapper around the GUI search interface on the PANGAEA website. Everything you can do there, you can do here.

For example, query for the term 'water', with a bounding box, and return only three results.

```r
pg_search(query = 'water', bbox = c(-124.2, 41.8, -116.8, 46.1), count = 3)
```

```
#> # A tibble: 3 x 6
#> score doi size size_measure citation supplement_to
#>
#> 1 13.0 10.1594/PANG… 4 datasets Krylova, EM; Sahling, H; Janssen, R (2010): A new genus… Krylova, EM; Sahling, H; Janssen, R (2010): Abyssogena: a new genus of the family …
#> 2 12.8 10.1594/PANG… 2 datasets Simonyan, AV; Dultz, S; Behrens, H (2012): Diffusion tr… Simonyan, AV; Dultz, S; Behrens, H (2012): Diffusive transport of water in porous …
#> 3 8.78 10.1594/PANG… 1148 data points WOCE Surface Velocity Program, SVP (2006): Water temper…
```

The resulting `data.frame` has details about different studies, and you can use the DOIs (Digital Object Identifiers) to get data and metadata for any studies you're interested in.

### Another search option

There's another search option with the `pg_search_es` function. It wraps the Pangaea Elasticsearch interface.
This provides a very flexible interface for searching Pangaea data, though it is different from what you're used to with the Pangaea website.

```r
(res <- pg_search_es())
```

```
#> # A tibble: 10 x 46
#> `_index` `_type` `_id` `_score` `_source.intern… `_source.parent… `_source.minEle… `_source.sf-aut… `_source.parent… `_source.techKe… `_source.geocod… `_source.sp-log…
#> *
#> 1 pangaea… panmd 9018… 1 2020-01-17T15:1… https://doi.org… -0.4 Anhaus Philipp#… 901247 1
#> 2 pangaea… panmd 9017… 1 2020-01-17T15:1… https://doi.org… 2 Anhaus Philipp#… 901247 1
#> 3 pangaea… panmd 9017… 1 2020-01-17T15:1… https://doi.org… 2 Anhaus Philipp#… 901247 1
#> 4 pangaea… panmd 9017… 1 2020-01-17T15:1… https://doi.org… 2 Anhaus Philipp#… 901247 1
#> 5 pangaea… panmd 9017… 1 2020-01-17T15:1… 2 Vuilleumier Lau… NA 3
#> 6 pangaea… panmd 9017… 1 2020-01-17T15:1… https://doi.pan… 0.3 Peeken Ilka#Mur… 901742 3
#> 7 pangaea… panmd 9017… 1 2020-01-17T15:1… NA Schreuder Laura… NA 1
#> 8 pangaea… panmd 9017… 1 2020-01-17T15:1… https://doi.org… 0 Schreuder Laura… 901739 1
#> 9 pangaea… panmd 9016… 1 2020-01-17T15:1… 10 Augustine John$… NA 3
#> 10 pangaea… panmd 9016… 1 2020-01-17T15:1… 2 Augustine John$… NA 3
#> # … with 34 more variables: `_source.agg-campaign`, `_source.agg-author`, `_source.eastBoundLongitude`, `_source.URI`, `_source.agg-pubYear`,
#> #   `_source.minDateTime`, `_source.agg-geometry`, `_source.xml-thumb`, `_source.xml`, `_source.sf-idDataSet`, `_source.elevationGeocode`,
#> #   `_source.agg-method`, `_source.maxDateTime`, `_source.xml-sitemap`, `_source.westBoundLongitude`, `_source.northBoundLatitude`,
#> #   `_source.sp-dataStatus`, `_source.nDataPoints`, `_source.sp-hidden`, `_source.agg-location`, `_source.internal-source`,
#> #   `_source.agg-basis`, `_source.southBoundLatitude`, `_source.boost`, `_source.oaiSet`, `_source.maxElevation`, `_source.agg-mainTopic`,
#> #   `_source.agg-topic`, `_source.agg-project`, `_source.meanPosition.lat`,
#> #   `_source.meanPosition.lon`, `_source.geoCoverage.type`,
#> #   `_source.geoCoverage.coordinates`, `_source.geoCoverage.geometries`
```

The returned `data.frame` has many columns (46 here). You can limit the columns returned with the `source` parameter.

Attributes on the `data.frame` give the total number of results found and the maximum score.

```r
attributes(res)
```

```
#> $names
#> [1] "_index" "_type" "_id" "_score" "_source.internal-datestamp"
#> [6] "_source.parentURI" "_source.minElevation" "_source.sf-authortitle" "_source.parentIdDataSet" "_source.techKeyword"
#> [11] "_source.geocodes" "_source.sp-loginOption" "_source.agg-campaign" "_source.agg-author" "_source.eastBoundLongitude"
#> [16] "_source.URI" "_source.agg-pubYear" "_source.minDateTime" "_source.agg-geometry" "_source.xml-thumb"
#> [21] "_source.xml" "_source.sf-idDataSet" "_source.elevationGeocode" "_source.agg-method" "_source.maxDateTime"
#> [26] "_source.xml-sitemap" "_source.westBoundLongitude" "_source.northBoundLatitude" "_source.sp-dataStatus" "_source.nDataPoints"
#> [31] "_source.sp-hidden" "_source.agg-location" "_source.internal-source" "_source.agg-basis" "_source.southBoundLatitude"
#> [36] "_source.boost" "_source.oaiSet" "_source.maxElevation" "_source.agg-mainTopic" "_source.agg-topic"
#> [41] "_source.agg-project" "_source.meanPosition.lat" "_source.meanPosition.lon" "_source.geoCoverage.type" "_source.geoCoverage.coordinates"
#> [46] "_source.geoCoverage.geometries"
#>
#> $row.names
#> [1] 1 2 3 4 5 6 7 8 9 10
#>
#> $class
#> [1] "tbl_df" "tbl" "data.frame"
#>
#> $total
#> [1] 390620
#>
#> $max_score
#> [1] 1
```

```r
attr(res, "total")
```

```
#> [1] 390620
```

```r
attr(res, "max_score")
```

```
#> [1] 1
```

To get the DOIs for each study, use:

```r
gsub("https://doi.org/", "", res$`_source.URI`)
```

```
#> [1] "10.1594/PANGAEA.901810" "10.1594/PANGAEA.901736" "10.1594/PANGAEA.901733"
#> [4] "10.1594/PANGAEA.901732"
"https://doi.pangaea.de/10.1594/PANGAEA.901710" "https://doi.pangaea.de/10.1594/PANGAEA.901709"
#> [7] "10.1594/PANGAEA.901739" "10.1594/PANGAEA.901738" "https://doi.pangaea.de/10.1594/PANGAEA.901697"
#> [10] "https://doi.pangaea.de/10.1594/PANGAEA.901695"
```

Note that the pattern above only strips the `https://doi.org/` prefix, so URIs that use `https://doi.pangaea.de/` come back unchanged.

## Get data

The function `pg_data` fetches datasets for studies by their DOIs.

```r
res <- pg_data(doi = '10.1594/PANGAEA.807580')
res[[1]]
```

```
#> 10.1594/PANGAEA.807580
#> parent doi: 10.1594/PANGAEA.807580
#> url: https://doi.org/10.1594/PANGAEA.807580
#> citation: Schiebel, Ralf; Waniek, Joanna J; Bork, Matthias; Hemleben, Christoph (2001): Physical oceanography during METEOR cruise M36/6. PANGAEA, https://doi.org/10.1594/PANGAEA.807580, In supplement to: Schiebel, R et al. (2001): Planktic foraminiferal production stimulated by chlorophyll redistribution and entrainment of nutrients. Deep Sea Research Part I: Oceanographic Research Papers, 48(3), 721-740, https://doi.org/10.1016/S0967-0637(00)00065-0
#> path: /Users/sckott/Library/Caches/R/pangaear/10_1594_PANGAEA_807580.txt
#> data:
#> # A tibble: 32,179 x 13
#> Event `Date/Time` Latitude Longitude `Elevation [m]` `Depth water [m… `Press [dbar]` `Temp [°C]` Sal `Tpot [°C]` `Sigma-theta [kg/m… `Sigma in situ [kg… `Cond [mS/cm]`
#>
#> 1 M36/6-CTD-… 1996-10-14T12… 49.0 -16.5 -4802 0 0 15.7 35.7 15.7 26.4 26.4 44.4
#> 2 M36/6-CTD-… 1996-10-14T12… 49.0 -16.5 -4802 0.99 1 15.7 35.7 15.7 26.4 26.4 44.4
#> 3 M36/6-CTD-… 1996-10-14T12… 49.0 -16.5 -4802 1.98 2 15.7 35.7 15.7 26.4 26.4 44.4
#> 4 M36/6-CTD-… 1996-10-14T12… 49.0 -16.5 -4802 2.97 3 15.7 35.7 15.7 26.4 26.4 44.4
#> 5 M36/6-CTD-… 1996-10-14T12… 49.0 -16.5 -4802 3.96 4 15.7 35.7 15.7 26.4 26.4 44.4
#> 6 M36/6-CTD-… 1996-10-14T12… 49.0 -16.5 -4802 4.96 5 15.7 35.7 15.7 26.4 26.4 44.4
#> 7 M36/6-CTD-… 1996-10-14T12… 49.0 -16.5 -4802 5.95 6 15.7 35.7 15.7 26.4 26.4 44.4
#> 8 M36/6-CTD-… 1996-10-14T12… 49.0 -16.5 -4802 6.94 7 15.7 35.7 15.7 26.4 26.4 44.4
#> 9 M36/6-CTD-… 1996-10-14T12… 49.0 -16.5 -4802
7.93 8 15.7 35.7 15.7 26.4 26.4 44.4
#> 10 M36/6-CTD-… 1996-10-14T12… 49.0 -16.5 -4802 8.92 9 15.7 35.7 15.7 26.4 26.4 44.4
#> # … with 32,169 more rows
```

You can also search for data and then pass one or more of the resulting DOIs to `pg_data`.

```r
res <- pg_search(query = 'water', bbox = c(-124.2, 41.8, -116.8, 46.1), count = 3)
pg_data(res$doi[3])[1:3]
```

```
#> [[1]]
#> 10.1594/PANGAEA.405695
#> parent doi: 10.1594/PANGAEA.405695
#> url: https://doi.org/10.1594/PANGAEA.405695
#> citation: WOCE Surface Velocity Program, SVP (2006): Water temperature and current velocity from surface drifter SVP_9524470. PANGAEA, https://doi.org/10.1594/PANGAEA.405695
#> path: /Users/sckott/Library/Caches/R/pangaear/10_1594_PANGAEA_405695.txt
#> data:
#> # A tibble: 192 x 10
#> `Date/Time` Latitude Longitude `Depth water [m]` `Temp [°C]` `UC [cm/s]` `VC [cm/s]` `Lat e` `Lon e` Code
#>
#> 1 1995-12-21T18:00 42.9 -125. 0 12.4 NA NA 0.0001 0.0001 1
#> 2 1995-12-22T00:00 42.9 -125. 0 12.4 4.24 -35.2 0 0 1
#> 3 1995-12-22T06:00 42.8 -125. 0 12.4 -7.8 -38.7 0.0002 0.0001 1
#> 4 1995-12-22T12:00 42.7 -125. 0 12.4 -12.7 -23.1 0.0001 0.0001 1
#> 5 1995-12-22T18:00 42.7 -125. 0 12.4 -15.1 -13.7 0 0 1
#> 6 1995-12-23T00:00 42.7 -125. 0 12.3 -24.1 -9.15 0.0001 0.0001 1
#> 7 1995-12-23T06:00 42.7 -125. 0 12.3 -38.4 0.65 0.0001 0.0001 1
#> 8 1995-12-23T12:00 42.7 -125. 0 12.3 -37.2 15.8 0 0 1
#> 9 1995-12-23T18:00 42.7 -125. 0 12.3 -25.4 29.6 0.0001 0.0001 1
#> 10 1995-12-24T00:00 42.8 -125. 0 12.4 -18.5 35.1 0.0002 0.0002 1
#> # … with 182 more rows
#>
#> [[2]]
#> NULL
#>
#> [[3]]
#> NULL
```

## OAI-PMH metadata

[OAI-PMH](https://wiki.pangaea.de/wiki/OAI-PMH) is a standard protocol for serving metadata about objects, in this case datasets. If you are already familiar with OAI-PMH, you are in luck: you can use what you know here. If not, it's relatively straightforward. Note that these functions return only metadata about datasets, not the data themselves.

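
For readers new to the protocol: an OAI-PMH request is an HTTP call to a single base URL with a `verb` parameter plus a few arguments. The base R sketch below shows how such request URLs are composed. `oai_url()` is a hypothetical helper written for illustration and is not part of `pangaear`; the base URL is the one PANGAEA's OAI-PMH service reports for itself, and the verbs and the `metadataPrefix` argument come from the OAI-PMH 2.0 specification.

```r
# Hypothetical helper (NOT part of pangaear): compose an OAI-PMH request URL.
# Assumption: PANGAEA's OAI-PMH endpoint is https://ws.pangaea.de/oai/provider
oai_url <- function(verb, ..., base = "https://ws.pangaea.de/oai/provider") {
  args <- c(list(verb = verb), list(...))
  # Percent-encode each value so identifiers containing ':' or '/' are safe
  vals <- vapply(
    args,
    function(x) utils::URLencode(as.character(x), reserved = TRUE),
    character(1)
  )
  paste0(base, "?", paste(names(args), vals, sep = "=", collapse = "&"))
}

# The six OAI-PMH verbs are Identify, ListMetadataFormats, ListSets,
# ListIdentifiers, ListRecords and GetRecord.
oai_url("Identify")
oai_url("ListIdentifiers", metadataPrefix = "oai_dc", from = Sys.Date() - 2)
oai_url("GetRecord", metadataPrefix = "oai_dc",
        identifier = "oai:pangaea.de:doi:10.1594/PANGAEA.788382")
```

The `pg_*` functions in the rest of this section presumably issue requests of this shape and parse the XML responses for you, so building URLs by hand is rarely necessary.
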
### Identify the service

```r
pg_identify()
```

```
#>
#> repositoryName: PANGAEA - Data Publisher for Earth & Environmental Science
#> baseURL: https://ws.pangaea.de/oai/provider
#> protocolVersion: 2.0
#> adminEmail: tech@pangaea.de
#> adminEmail: tech@pangaea.de
#> earliestDatestamp: 2015-01-01T00:00:00Z
#> deletedRecord: transient
#> granularity: YYYY-MM-DDThh:mm:ssZ
#> compression: gzip
#> description: oaipangaea.de:oai:pangaea.de:doi:10.1594/PANGAEA.999999
```

### List metadata formats

```r
pg_list_metadata_formats()
```

```
#> metadataPrefix schema metadataNamespace
#> 1 oai_dc http://www.openarchives.org/OAI/2.0/oai_dc.xsd http://www.openarchives.org/OAI/2.0/oai_dc/
#> 2 pan_md http://ws.pangaea.de/schemas/pangaea/MetaData.xsd http://www.pangaea.de/MetaData
#> 3 dif http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/dif_v9.4.xsd http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/
#> 4 iso19139 http://www.isotc211.org/2005/gmd/gmd.xsd http://www.isotc211.org/2005/gmd
#> 5 iso19139.iodp http://www.isotc211.org/2005/gmd/gmd.xsd http://www.isotc211.org/2005/gmd
#> 6 datacite3 http://schema.datacite.org/meta/kernel-3/metadata.xsd http://datacite.org/schema/kernel-3
#> 7 datacite4 http://schema.datacite.org/meta/kernel-4.1/metadata.xsd http://datacite.org/schema/kernel-4
```

### List identifiers

```r
pg_list_identifiers(from = Sys.Date() - 2, until = Sys.Date())
```

### List sets

```r
pg_list_sets()
```

```
#> # A tibble: 282 x 2
#> setSpec setName
#>
#> 1 ACD PANGAEA set / keyword 'ACD' (2 data sets)
#> 2 ASPS PANGAEA set / keyword 'ASPS' (59 data sets)
#> 3 AWIXRFraw PANGAEA set / keyword 'AWIXRFraw' (1 data sets)
#> 4 BAH1960 PANGAEA set / keyword 'BAH1960' (2 data sets)
#> 5 BAH1961 PANGAEA set / keyword 'BAH1961' (2 data sets)
#> 6 BAH1962 PANGAEA set / keyword 'BAH1962' (7 data sets)
#> 7 BAH1963 PANGAEA set / keyword 'BAH1963' (7 data sets)
#> 8 BAH1964 PANGAEA set / keyword 'BAH1964' (7 data sets)
#> 9 BAH1965 PANGAEA set / keyword 'BAH1965' (7 data sets)
#> 10 BAH1966
PANGAEA set / keyword 'BAH1966' (6 data sets)
#> # … with 272 more rows
```

### List records

```r
pg_list_records(from = Sys.Date() - 1, until = Sys.Date())
```

### Get a record

```r
pg_get_record(identifier = "oai:pangaea.de:doi:10.1594/PANGAEA.788382")
```

```
#> $`oai:pangaea.de:doi:10.1594/PANGAEA.788382`
#> $`oai:pangaea.de:doi:10.1594/PANGAEA.788382`$header
#> # A tibble: 1 x 3
#> identifier datestamp setSpec
#>
#> 1 oai:pangaea.de:doi:10.1594/PANGAEA.788382 2020-01-18T03:11:42Z citable;supplement;topicChemistry;topicLithosphere
#>
#> $`oai:pangaea.de:doi:10.1594/PANGAEA.788382`$metadata
#> # A tibble: 1 x 13
#> title creator source publisher date type format identifier description language rights coverage subject
#>
#> 1 Trace metals in… Demina, Ly… P.P. Shirshov Instit… PANGAEA 2012-… Datas… applic… https://doi.pan… Bioaccumulation of trace… en CC-BY-3.0: … MEDIAN LATITUDE: 29.1… Archive…
```
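
The identifier passed to `pg_get_record` is the dataset DOI prefixed with `oai:pangaea.de:doi:`. The search functions, however, return either bare DOIs or resolver URLs on two different hosts (`https://doi.org/` and `https://doi.pangaea.de/`). A minimal base R sketch for normalising those values and turning them into OAI identifiers could look like this. Both helpers are hypothetical and not part of `pangaear`, and the identifier prefix is an assumption taken from the example call above.

```r
# Hypothetical helpers (NOT part of pangaear).

# Strip a leading resolver URL, leaving the bare DOI.
# Handles both resolver hosts that appear in the search output.
strip_resolver <- function(x) {
  sub("^https?://(doi\\.org|doi\\.pangaea\\.de)/", "", x)
}

# Build an OAI identifier from a DOI or resolver URL.
# Assumption: the prefix matches the one used in the pg_get_record() example.
doi_to_oai <- function(doi) {
  paste0("oai:pangaea.de:doi:", strip_resolver(doi))
}

uris <- c(
  "https://doi.org/10.1594/PANGAEA.901810",
  "https://doi.pangaea.de/10.1594/PANGAEA.901710",
  "10.1594/PANGAEA.788382"
)

strip_resolver(uris)  # bare DOIs, suitable for pg_data()
doi_to_oai(uris)      # OAI identifiers, suitable for pg_get_record()
```

With these in place, a call such as `pg_get_record(identifier = doi_to_oai(res$doi[1]))` would chain a search result into a metadata lookup, assuming `res` comes from `pg_search`.
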