--- title: "rppo vignette" author: "John Deck" date: "2022-01-31" output: html_document: keep_md: yes vignette: | %\VignetteIndexEntry{rppo Vignette} %\VignetteEngine{knitr::knitr} %\VignetteEncoding{UTF-8} --- The rppo package contains just two functions. One to query terms from the Plant Phenology Ontology (PPO) and another to query the data global plant phenology data portal (PPO data portal). Following are three examples which illustrate use of these functions: the first two sections illustrate the `ppo_data` and `ppo_terms` functions and the third section illustrates how to use the functions together. ### ppo_terms function It is frequently useful to look through the list of present and absent terms contained in the PPO. The `ppo_terms` function returns present terms, absent terms, or both, with columns containing a termID, label, definition and full URI for each term. Use the termIDs returned from this function to query terms in the `ppo_data` function. The following example returns the present terms into a "present_terms" data frame and a sample slice from the dataframe. ```r present_terms <- ppo_terms(present = TRUE) # print the first five rows, with just the termIDs and labels print(present_terms[1:5,c("termID","label")]) #> termID label #> 1 obo:PPO_0002359 abscised cones or seeds present #> 2 obo:PPO_0002358 abscised fruits or seeds present #> 3 obo:PPO_0002357 abscised leaves present #> 4 obo:PPO_0002311 breaking leaf buds present #> 5 obo:PPO_0002346 cones present ``` ### ppo_data function The `ppo_data` function queries the PPO Data Portal, passing values to the database and extracting matching results. The results of the `ppo_data` function are returned as a list with five elements: 1) a data frame containing data, 2) a readme string containing usage information and some statistics about the query itself, 3) a citation string containing information about proper citation, 4) a number_possible integer indicating the total number of results if a limit has been specified, and 5) a status code returned from the service. The "df" variable below is populated with results from the data element in the results list, with an example slice of data showing the first record. ```r results <- ppo_data(genus = "Quercus", fromYear = 2013, toYear = 2013, fromDay = 100, toDay = 110, termID = 'obo:PPO_0002313', limit = 10, timeLimit = 2) df <- results$data print(df[1:1,]) #> dayOfYear year genus specificEpithet eventRemarks latitude longitude #> 1 101 2013 Quercus douglasii Leaves 37.98884 -122.1296 #> termID #> 1 obo:PPO_0002313,obo:PPO_0002312,obo:PPO_0002315,obo:PPO_0002000,obo:PPO_0002014,obo:PPO_0002017,obo:PPO_0002015 #> source eventId #> 1 USA-NPN urn:phenologicalObservingProcess/4578221 ``` The readme and citation files returned by the list of results can be accessed by calling the readme and citation elements. Note that the the file "citation_and_data_use_policies.txt" that is referred to in the readme file can be accessed using `cat(results$citation)` ```r cat(results$readme) #> The following contains information about your download from the Global Plant #> Phenology Database. Please refer to the citation_and_data_use_policies.txt #> file for important information about data usage policies, licensing, and #> citation protocols for each dataset. This file contains summary information #> about the query that was run. #> #> data file = data.csv #> date query ran = Mon Jan 31 2022 15:08:45 GMT+0000 (Coordinated Universal Time) #> query = +genus:Quercus AND +plantStructurePresenceTypes:"http://purl.obolibrary.org/obo/PPO_0002313" AND +year:>=2013 AND +year:<=2013 AND +dayOfYear:>=100 AND +dayOfYear:<=110 AND source:USA-NPN #> fields returned = dayOfYear,year,genus,specificEpithet,eventRemarks,latitude,longitude,source,eventId #> user specified limit = 10 #> total results possible = 518 #> total results returned = 10 ``` The results lists also shows the number of possible results in the results set, which is useful if the submitted query had a limit. For example, in the query above, the limit is set to 10 but we want to know how many records were possible if the limit was not set. ```r cat(results$number_possible) #> 518 ``` ### working with terms and data together Here we will generate a data frame showing the frequency of "present" and "absent" terms for a particular query. The query is for genus = "Quercus" and latitude > 47. For each row in the returned data frame `ppo_data` will typically return multiple terms in the termID field, corresponding to phenological stages as defined by the PPO. For our example, we will generate a frequency table of the number of times "present" or "absent" term occur in the entire returned dataset. Note that the termID field returned by `ppo_data` will return "presence" terms in addition to "present" and "absent" terms, while the `ppo_terms` function only returns "present" and "absent" terms. Thus, our frequency distribution only counts the number of "present" and "absent" terms [For an in-depth discussion of the difference between "presence", "present", and "absent", see https://www.frontiersin.org/articles/10.3389/fpls.2018.00517/full]. Finally, since termIDs are returned as URI identifiers and not easily readable text, this example maps termIDs to labels. The resulting data frame shows two columns: 1) a column of term labels, and 2) a frequency of the number of times this label appeared in the result set. ```r ############################################################################### # Generate a frequency data frame showing the number of times each termID # is populated for genus equals "Quercus" above latitude of 47 # Note that all latitude/longitude queries need to be in the format of a # bounding box ############################################################################### df <- ppo_data( genus = "Quercus", limit="10", timeLimit = 4) #> sending request for data ... #> https://www.plantphenology.org/api/v3/download/_search?q=%2Bgenus:Quercus+AND+source:USA-NPN&source=latitude,longitude,year,dayOfYear,termID&limit=10 # return just the termID column t1 <- df$data[,c('termID')] # paste each cell into one string t2<-paste(t1, collapse = ",") # split strings at , t3<-strsplit(t2, ",") # create a frequency table as a data frame freqFrame <- as.data.frame(table(t3)) # create a new data frame that we want to populate resultFrame <- data.frame( label = character(), frequency = integer(), stringsAsFactors = FALSE) ############################################################################### # Replace termIDs with labels in frequency frame ############################################################################### # fetch "present" and "absent" terms using `ppo_terms` termList <- ppo_terms(absent = TRUE, present = TRUE, timeLimit = 2); #> sending request for terms ... #> No encoding supplied: defaulting to UTF-8. # loop all "present"" and "absent" terms if (!is.null(termList)) { for (term in 1:nrow(termList)) { termListTermID<-termList[term,'termID']; termListLabel<-termList[term,'label']; # loop all rows that have a frequency generated for (row in 1:nrow(freqFrame)) { freqFrameTermID = freqFrame[row,'t3'] freqFrameFrequency = freqFrame[row,'Freq'] # Populate resultFrame with matching "present" or "absent" labels. # In this step, we will ignore "presence" terms # found in the frequency frame since the ppo_terms only returns # "present" and "absent" terms. if (freqFrameTermID == termListTermID) { resultFrame[nrow(resultFrame)+1,] <- c(termListLabel,freqFrameFrequency) } } } } else { message("termList is null, likely due to a server response issue. Try increasing the timeLimit or try again later. If the problem persists email the authors.") } # print results, showing term labels and a frequency count print(resultFrame) #> label frequency #> 1 breaking leaf buds absent 2 #> 2 breaking leaf buds present 8 #> 3 expanding true leaves present 8 #> 4 leaf buds present 8 #> 5 new above-ground shoot-borne shoot systems present 8 #> 6 new shoot system present 8 #> 7 non-dormant leaf buds present 8 #> 8 true leaves present 8 #> 9 unfolding true leaves present 8 #> 10 vascular leaves present 8 ```