---
title: "SPARQL with RNeXML"
author:
- Carl Boettiger
- Scott Chamberlain
- Rutger Vos
- Hilmar Lapp
output: html_vignette
---
## SPARQL Queries
Rich, semantically meaningful metadata lies at the heart of the NeXML
standard. R provides a rich environment to unlock this information.
While our previous examples have relied on the user knowing exactly
what metadata they intend to extract (title, publication date, citation
information, and so forth), _semantic_ metadata has meaning that a
computer can make use of, allowing us to make much more conceptually
rich queries than those simple examples. The SPARQL query language
is a powerful way to make use of such semantic information in making
complex queries.

While users should consult a formal introduction to SPARQL for further
background, here we illustrate how SPARQL can be used in combination
with R functions in ways that would be much more tedious to assemble
with only traditional/non-semantic queries. SPARQL is available in R
through packages such as `rrdf` [@Willighagen_2014] and `rdflib`; the code
here uses `rdflib`, so we start by loading that package. We will also make
use of functions from `xml2`, `phytools` and `RNeXML`.
```r
library("rdflib")
library("xml2")
library("phytools")
library("RNeXML")
```
We read in an example file that contains semantic metadata annotations
describing the taxonomic units (OTUs) used in the tree.
```r
nexml <- nexml_read(system.file("examples/primates.xml", package="RNeXML"))
```
In particular, this example declares the taxon rank, NCBI identifier and parent taxon
for each OTU, such as:
```xml
```
In this example, we will construct a cladogram by using this information
to identify the taxonomic rank of each OTU, and its shared parent
taxonomic rank. (If this example looks complex, try writing down the
steps to do this without the aid of the SPARQL queries). These examples
show the manipulation of semantic triples, Uniform Resource Identifiers
(URIs) and use of the SPARQL "Join" operator.

Note that this example can be run using `demo("sparql", "RNeXML")` to see the
code displayed in the R terminal and to avoid character errors that
can occur in having to copy and paste from PDF files.

We begin by extracting the RDF graph from the NeXML:
```r
rdf <- get_rdf(system.file("examples/primates.xml", package="RNeXML"))
tmp <- tempfile() # write the RDF/XML to a temporary file for rdf_parse() to read
xml2::write_xml(rdf, tmp)
graph <- rdf_parse(tmp)
```
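
Before writing targeted queries, it can help to see which predicates the graph actually contains. The following is a minimal sketch, assuming `rdflib`, `xml2` and `RNeXML` are installed; it repeats the loading steps so that it runs on its own, and the names `f` and `predicates` are introduced here only for illustration.

```r
library("rdflib")
library("RNeXML")

# Rebuild the graph from the bundled example file
f <- system.file("examples/primates.xml", package = "RNeXML")
tmp <- tempfile()
xml2::write_xml(get_rdf(f), tmp)
graph <- rdf_parse(tmp)

# List every distinct predicate used in the graph
predicates <- rdf_query(graph, "SELECT DISTINCT ?p WHERE { ?s ?p ?o }")
predicates
```

Each row of `predicates` is a URI that can be placed between a subject and an object in a triple pattern.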
We then fetch the NCBI URI for the taxon that has rank 'Order', i.e. the
root of the primates phylogeny. The dot operator `.` between clauses
implies a join: in this case, both triple patterns must match the same `?id`.
```r
root <- rdf_query(graph,
"SELECT ?uri WHERE {
?id .
?id ?uri
}")
```
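
The effect of the join can be seen on a small hand-built graph. This is only a sketch: every `http://example.org/...` URI below is a made-up placeholder, not a term from the NeXML file, and the names `toy` and `hits` are hypothetical.

```r
library("rdflib")

# Build a toy graph; rdf_add() updates the rdf object in place.
# All example.org URIs are invented for illustration.
toy <- rdf()
rdf_add(toy, "http://example.org/otu1", "http://example.org/rank",
        "http://example.org/Order", objectType = "uri")
rdf_add(toy, "http://example.org/otu1", "http://example.org/toTaxon",
        "http://example.org/taxon/1", objectType = "uri")
rdf_add(toy, "http://example.org/otu2", "http://example.org/rank",
        "http://example.org/Species", objectType = "uri")
rdf_add(toy, "http://example.org/otu2", "http://example.org/toTaxon",
        "http://example.org/taxon/2", objectType = "uri")

# Both patterns share ?id, so only subjects matching both are returned
hits <- rdf_query(toy,
  "SELECT ?uri WHERE {
     ?id <http://example.org/rank> <http://example.org/Order> .
     ?id <http://example.org/toTaxon> ?uri
   }")
hits
```

Only the subject that has rank `Order` satisfies the first pattern, so the second pattern is evaluated for that subject alone.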
This makes use of the SPARQL query support provided by the `rdflib`
package through `rdf_query()`. We will also define some helper functions.
First, a function that looks up the label of an OTU by its id, replacing
spaces with underscores:
```r
get_name <- function(id) {
  otus <- nexml@otus[[1]]@otu
  # scan the OTUs for the one whose id matches
  for (i in seq_len(length(otus))) {
    if (otus[[i]]@id == id) {
      # replace spaces so the label is safe to use in a Newick string
      label <- otus[[i]]@label
      label <- gsub(" ", "_", label)
      return(label)
    }
  }
}
```
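
The same lookup can also be written without an explicit loop. The sketch below uses two hypothetical vectors, `otu_ids` and `otu_labels`, as stand-ins for the ids and labels stored in the `nexml` object; it is not part of the RNeXML API. Unlike `get_name()`, which returns `NULL` for an unknown id, this version returns `NA`.

```r
# Hypothetical ids and labels standing in for the OTUs of a nexml object
otu_ids    <- c("ou1", "ou2", "ou3")
otu_labels <- c("Homo sapiens", "Pan troglodytes", "Cebus albifrons")

get_name_vec <- function(id) {
  # match() gives the position of id in otu_ids, or NA if it is absent
  label <- otu_labels[match(id, otu_ids)]
  gsub(" ", "_", label)
}

get_name_vec("ou2")
```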
Next, we define a recursive function to build a Newick tree from the taxonomic rank information:
```r
recurse <- function(node){

  # fetch the taxonomic rank and id string
  rank_query <- paste0(
    "SELECT ?rank ?id WHERE {
      ?id <",node,"> .
      ?id ?rank
    }")
  result <- rdf_query(graph, rank_query)

  # get the local ID, strip URI part
  id <- result$id
  id <- gsub("^.+#", "", id, perl = TRUE)

  if (result$rank == "http://rs.tdwg.org/ontology/voc/TaxonRank#Species") {
    # if rank is terminal, return the name
    return(get_name(id))
  } else {
    # recurse deeper
    child_query <- paste0(
      "SELECT ?uri WHERE {
        ?id <",node,"> .
        ?id ?uri
      }")
    children <- rdf_query(graph, child_query)

    return(paste("(",
                 paste(sapply(children$uri, recurse),
                       sep = ",", collapse = "," ),
                 ")",
                 get_name(id), # label interior nodes
                 sep = "", collapse = ""))
  }
}
```
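
The shape of this recursion may be easier to follow on a toy example in which the SPARQL lookups are swapped for lookups in a plain data frame. The sketch below is not the method used above: the `taxa` table, its contents and the `build_newick()` function are made up for illustration.

```r
# Toy taxonomy: each row gives a taxon, its parent and its rank.
# All entries are invented for illustration.
taxa <- data.frame(
  name   = c("Primates", "Hominidae", "Homo_sapiens",
             "Pan_troglodytes", "Cebidae", "Cebus_albifrons"),
  parent = c(NA, "Primates", "Hominidae",
             "Hominidae", "Primates", "Cebidae"),
  rank   = c("Order", "Family", "Species",
             "Species", "Family", "Species"),
  stringsAsFactors = FALSE
)

build_newick <- function(node) {
  rank <- taxa$rank[taxa$name == node]
  # terminal rank: the node is a tip, so return its label
  if (rank == "Species") {
    return(node)
  }
  # otherwise recurse into each child and label the interior node
  children <- taxa$name[!is.na(taxa$parent) & taxa$parent == node]
  subtrees <- vapply(children, build_newick, character(1))
  paste0("(", paste(subtrees, collapse = ","), ")", node)
}

toy_newick <- paste0(build_newick("Primates"), ";")
toy_newick
```

As in `recurse()`, a species returns its own label, while any higher rank wraps the comma-separated results of its children in parentheses and appends its own label.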
With these functions in place, it is straightforward to build the tree
from the semantic RDFa data and then visualize it:
```r
newick <- paste(recurse(root), ";", sep = "", collapse = "")
tree <- read.newick(text = newick)
collapsed <- collapse.singles(tree)
plot(collapsed,
     type='cladogram',
     show.tip.label=FALSE,
     show.node.label=TRUE,
     cex=0.75,
     edge.color='grey60',
     label.offset=-9)
```
![plot of chunk unnamed-chunk-8](sparql-unnamed-chunk-8-1.png)
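
A taxonomy often contains ranks with a single child, which show up in the Newick string as interior nodes with only one descendant. The sketch below, which assumes `phytools` (and with it `ape`) is installed, uses a made-up Newick string to show what `collapse.singles()` does to such nodes; the names `toy_tree` and `toy_collapsed` are hypothetical.

```r
library("phytools")  # attaches ape, which provides collapse.singles() and Nnode()

# Made-up tree in which Cebidae has a single child
toy_newick <- "((Homo_sapiens,Pan_troglodytes)Hominidae,(Cebus_albifrons)Cebidae)Primates;"
toy_tree <- read.newick(text = toy_newick)

# Remove interior nodes that have only one descendant
toy_collapsed <- collapse.singles(toy_tree)

# Compare the number of interior nodes before and after
c(before = Nnode(toy_tree), after = Nnode(toy_collapsed))
```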