Title: | General Purpose 'OAI-PMH' Services Client |
---|---|
Description: | A general purpose client to work with any 'OAI-PMH' (Open Archives Initiative Protocol for 'Metadata' Harvesting) service. The 'OAI-PMH' protocol is described at <http://www.openarchives.org/OAI/openarchivesprotocol.html>. Functions are provided to work with the 'OAI-PMH' verbs: 'GetRecord', 'Identify', 'ListIdentifiers', 'ListMetadataFormats', 'ListRecords', and 'ListSets'. |
Authors: | Scott Chamberlain [aut], Michal Bojanowski [aut, cre], National Science Centre [fnd] (Supported MB through grant 2012/07/D/HS6/01971, <https://ncn.gov.pl>) |
Maintainer: | Michal Bojanowski <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.4.1 |
Built: | 2024-12-27 03:13:02 UTC |
Source: | https://github.com/ropensci/oai |
oai is an R client to work with OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) services, a protocol developed by the Open Archives Initiative (https://en.wikipedia.org/wiki/Open_Archives_Initiative). OAI-PMH uses XML data format transported over HTTP.
See the OAI-PMH V2 specification at http://www.openarchives.org/OAI/openarchivesprotocol.html
oai is built on xml2 and httr. In addition, we give back data.frames whenever possible to make data comprehension, manipulation, and visualization easier. We also have functions to fetch a large directory of OAI-PMH services. It isn't exhaustive, but it does contain a lot.
Instead of paging with e.g., page and per_page parameters, OAI-PMH (optionally) uses resumptionTokens, which may carry an expiration date. These tokens can be used to continue on to the next chunk of data if the first request did not get to the end. OAI-PMH services often limit each request to 50 records, but this may vary by provider. The API of this package is such that we loop internally for you until we get all records. We may in the future expose e.g., a limit parameter so you can say how many records you want, but we haven't done this yet.
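The token-driven loop described above can be sketched without any network access. This is only an illustration of the idea, not the package's internal code: fake_request() is a hypothetical stand-in for an OAI-PMH endpoint, and the identifiers it returns are made up.

```r
# Hypothetical stand-in for an OAI-PMH endpoint: returns one chunk of
# identifiers plus the token for the next chunk (NULL when finished).
fake_request <- function(token = NULL, chunk_size = 50, total = 120) {
  start <- if (is.null(token)) 1L else as.integer(token)
  end <- min(start + chunk_size - 1L, total)
  list(
    ids = sprintf("oai:example.org:%04d", start:end),
    token = if (end < total) as.character(end + 1L) else NULL
  )
}

# Keep requesting chunks until the service stops returning a token
harvest_all <- function(request = fake_request) {
  chunks <- list()
  token <- NULL
  repeat {
    res <- request(token)
    chunks[[length(chunks) + 1L]] <- res$ids
    token <- res$token
    if (is.null(token)) break
  }
  unlist(chunks)
}

ids <- harvest_all()
length(ids)
```

A real harvester would send the token back to the service as the resumptionToken request parameter instead of calling a local function.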
Michal Bojanowski's contributions were supported by the (Polish) National Science Centre (NCN) through grant 2012/07/D/HS6/01971.
Scott Chamberlain [email protected]
Michal Bojanowski [email protected]
Useful links:
Report bugs at https://github.com/ropensci/oai/issues
Count OAI-PMH identifiers for a data provider.
count_identifiers(url = "http://export.arxiv.org/oai2", prefix = "oai_dc", ...)
url |
(character) OAI-PMH base url. Defaults to the URL for arXiv's OAI-PMH server (http://export.arxiv.org/oai2) or GBIF's OAI-PMH server (http://api.gbif.org/v1/oai-pmh/registry) |
prefix |
Specifies the metadata format that the records will be returned in |
... |
Curl options passed on to |
Note that some OAI providers do not include the entry completeListSize (http://www.openarchives.org/OAI/openarchivesprotocol.html#FlowControl), in which case we return NA. This does not mean 0, but rather that we don't know.
## Not run:
count_identifiers()

# curl options
# library("httr")
# count_identifiers(config = verbose())

## End(Not run)
Result dumpers are functions that handle the chunks of results from an OAI-PMH service "on the fly". Handling can include processing, writing to files or databases, etc.
dump_raw_to_txt(res, args, as, file_pattern = "oaidump", file_dir = ".", file_ext = ".xml")
dump_to_rds(res, args, as, file_pattern = "oaidump", file_dir = ".", file_ext = ".rds")
dump_raw_to_db(res, args, as, dbcon, table_name, field_name, ...)
res |
results, depends on |
args |
list, query arguments, not to be specified by the user |
as |
character, type of result to return, not to be specified by the user |
file_pattern, file_dir, file_ext |
character respectively: initial part of the file name, directory name, and file extension used to create file names. These arguments are passed to |
dbcon |
DBI-compliant database connection |
table_name |
character, name of the database table to write into |
field_name |
character, name of the field in database table to write into |
... |
arguments passed to/from other functions |
Often the result of a request to an OAI-PMH service is so large that it is split into chunks that need to be requested separately using a resumptionToken. By default, functions like list_identifiers() or list_records() request these chunks under the hood and return them all concatenated in a single R object. This is convenient, but insufficient when dealing with large result sets that might not fit into RAM. A result dumper is a function that is called on each result chunk. Dumper functions can write chunks to files or databases, include initial pre-processing or extraction, and so on.
A result dumper needs to be a function that accepts at least the arguments res, args, and as. Their values are supplied internally by the enclosing function. There may be additional arguments, including "...". Dumpers should return NULL or a value that will be collected and returned by the function calling the dumper (e.g. list_records()).
Currently result dumpers can be used with the functions list_identifiers(), list_records(), and list_sets(). To use a dumper with one of these functions you need to:
Pass it as an additional argument dumper
Pass optional additional arguments to the dumper function in a list as the dumper_args argument
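As a sketch of the contract described above, here is a minimal custom dumper. The name dump_chunk_size and its verbose argument are hypothetical, and the sketch assumes that with as = "raw" each chunk arrives as a single character string of XML.

```r
# Hypothetical dumper: instead of keeping each chunk, return only its size.
# `res`, `args` and `as` are supplied by the calling function.
dump_chunk_size <- function(res, args, as, verbose = FALSE, ...) {
  stopifnot(as == "raw")   # this sketch only handles raw XML text
  n <- nchar(res)
  if (verbose) message("chunk of ", n, " characters")
  n                        # value handed back to the caller
}

# Calling the dumper by hand on a made-up chunk to show the contract
chunk <- "<OAI-PMH><ListIdentifiers/></OAI-PMH>"
dump_chunk_size(chunk, args = list(verb = "ListIdentifiers"), as = "raw")

# With the package loaded, it would be passed like this (not run):
# list_identifiers(as = "raw", dumper = dump_chunk_size,
#                  dumper_args = list(verbose = TRUE))
```

Returning a small summary value rather than the chunk itself keeps memory use flat however many chunks the service sends.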
See Examples. Below we provide more details on the dumpers currently implemented.
dump_raw_to_txt writes raw XML to text files. It requires as == "raw". File names are created using tempfile(). By default they are written in the current working directory and have the format oaidump*.xml, where * is a random string in hex.
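The naming scheme can be tried directly with base R's tempfile(). This sketch only mirrors the defaults named above, except that it writes into tempdir() rather than the working directory; it is not the package's own code.

```r
# Build a file name from a pattern, a directory, and an extension
out_dir <- tempdir()
f <- tempfile(pattern = "oaidump", tmpdir = out_dir, fileext = ".xml")
basename(f)   # "oaidump", then a random hex string, then ".xml"

# Write a placeholder chunk, check it exists, then clean up
writeLines("<OAI-PMH/>", f)
file.exists(f)
unlink(f)
```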
dump_to_rds saves results in an .rds file via saveRDS(). The type of object being saved is determined by the as argument. File names are generated in the same way as by dump_raw_to_txt, but with the default extension .rds.
dump_raw_to_db writes raw XML to a single text column of a table in a database. It requires as == "raw". The database connection dbcon should be a connection object as created by DBI::dbConnect() from package DBI. As such, it can connect to any database supported by DBI. The records are written to a field field_name in a table table_name using DBI::dbWriteTable(). If the table does not exist, it is created. If it does, the records are appended. Any additional arguments are passed to DBI::dbWriteTable().
Dumpers should return NULL or a value that will be collected and returned by the function using the dumper.
dump_raw_to_txt returns the name of the created file.
dump_to_rds returns the name of the created file.
dump_raw_to_db returns NULL.
OAI-PMH specification https://www.openarchives.org/OAI/openarchivesprotocol.html
Functions supporting the dumpers: list_identifiers(), list_sets(), and list_records()
## Not run:
### Dumping raw XML to text files

# This will write a set of XML files to a temporary directory
fnames <- list_identifiers(from="2018-06-01T", until="2018-06-14T",
  as="raw", dumper=dump_raw_to_txt,
  dumper_args=list(file_dir=tempdir()))
# vector of file names created
str(fnames)
all( file.exists(fnames) )
# clean-up
unlink(fnames)

### Dumping raw XML to a database

# Connect to in-memory SQLite database
con <- DBI::dbConnect(RSQLite::SQLite(), dbname=":memory:")
# Harvest and dump the results into field "bar" of table "foo"
list_identifiers(from="2018-06-01T", until="2018-06-14T",
  as="raw", dumper=dump_raw_to_db,
  dumper_args=list(dbcon=con, table_name="foo", field_name="bar") )
# Count records, should be 101
DBI::dbGetQuery(con, "SELECT count(*) as no_records FROM foo")

DBI::dbDisconnect(con)

## End(Not run)
Get records
get_records( ids, prefix = "oai_dc", url = "http://api.gbif.org/v1/oai-pmh/registry", as = "parsed", ... )
ids |
The OAI-PMH identifier for the record. One or more. Required. |
prefix |
Specifies the metadata format that the records will be returned in. Default: |
url |
(character) OAI-PMH base url. Defaults to the URL for arXiv's OAI-PMH server (http://export.arxiv.org/oai2) or GBIF's OAI-PMH server (http://api.gbif.org/v1/oai-pmh/registry) |
as |
(character) What to return. One of "parsed" (default), or "raw" (raw text) |
... |
Curl options passed on to |
There is a finite set of result formats, determined by the OAI prefix. We will provide parsers as we have time, and as users express interest. For prefix types we have parsers for, we return a list of data.frames: for each identifier, one data.frame for the header bits of data, and one data.frame for the metadata bits of data.
For prefixes we don't have parsers for, we fall back to returning raw XML, so you can at least parse the XML yourself.
Because some XML nodes are duplicated, we join the values of duplicated node names together, separated by a semicolon (;) with no spaces. You can separate them yourself easily.
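Splitting the joined values back apart needs only base R. The data.frame below is made up for illustration and merely imitates the shape of a parsed metadata row; real column names depend on the metadata prefix.

```r
# Made-up metadata row: repeated nodes (here, several creators)
# collapsed into one semicolon-separated string with no spaces
meta <- data.frame(
  title = "An example record",
  creator = "Doe, Jane;Roe, Richard;Poe, Edgar",
  stringsAsFactors = FALSE
)

# fixed = TRUE treats ";" as a literal separator, not a regular expression
creators <- strsplit(meta$creator, ";", fixed = TRUE)[[1]]
creators
```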
A named list of data.frames, or lists, or raw text.
## Not run:
get_records("87832186-00ea-44dd-a6bf-c2896c4d09b4")

ids <- c("87832186-00ea-44dd-a6bf-c2896c4d09b4",
  "d981c07d-bc43-40a2-be1f-e786e25106ac")
(res <- get_records(ids))
lapply(res, "[[", "header")
lapply(res, "[[", "metadata")
do.call(rbind, lapply(res, "[[", "header"))
do.call(rbind, lapply(res, "[[", "metadata"))

# Get raw text
get_records("d981c07d-bc43-40a2-be1f-e786e25106ac", as = "raw")

# from arxiv.org
get_records("oai:arXiv.org:0704.0001", url = "http://export.arxiv.org/oai2")

## End(Not run)
Identify the OAI-PMH service for each data provider.
id(url, as = "parsed", ...)
url |
(character) OAI-PMH base url. Defaults to the URL for arXiv's OAI-PMH server (http://export.arxiv.org/oai2) or GBIF's OAI-PMH server (http://api.gbif.org/v1/oai-pmh/registry) |
as |
(character) What to return. One of "parsed" (default), or "raw" (raw text) |
... |
Curl options passed on to |
## Not run:
# arxiv
id("http://export.arxiv.org/oai2")

# GBIF - http://www.gbif.org/
id("http://api.gbif.org/v1/oai-pmh/registry")

# get back text instead of parsed
id("http://export.arxiv.org/oai2", as = "raw")
id("http://api.gbif.org/v1/oai-pmh/registry", as = "raw")

# curl options
library("httr")
id("http://export.arxiv.org/oai2", config = verbose())

## End(Not run)
List OAI-PMH identifiers
list_identifiers( url = "http://api.gbif.org/v1/oai-pmh/registry", prefix = "oai_dc", from = NULL, until = NULL, set = NULL, token = NULL, as = "df", ... )
url |
(character) OAI-PMH base url. Defaults to the URL for arXiv's OAI-PMH server (http://export.arxiv.org/oai2) or GBIF's OAI-PMH server (http://api.gbif.org/v1/oai-pmh/registry) |
prefix |
Specifies the metadata format that the records will be returned in. |
from |
Specifies that records returned must have been created/updated/deleted on or after this date. |
until |
Specifies that records returned must have been created/updated/deleted on or before this date. |
set |
specifies the set that returned records must belong to. |
token |
a token previously provided by the server to resume a request where it last left off. |
as |
(character) What to return. One of "df" (for data.frame; default), "list", or "raw" (raw text) |
... |
Curl options passed on to |
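The from and until arguments take OAI-PMH datestamps. The protocol specification defines these in UTC, as either YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ, and expects both arguments to use the same granularity. Which granularity a given provider accepts is not stated here, so treat the seconds form below as an assumption to check against the provider. A sketch of building both forms with base R:

```r
# Day granularity
from_day <- format(as.Date("2018-06-01"), "%Y-%m-%d")

# Seconds granularity, expressed in UTC
until_sec <- format(
  as.POSIXct("2018-06-14 12:30:00", tz = "UTC"),
  "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"
)

from_day
until_sec

# Not run (needs network access and the package loaded):
# list_identifiers(from = from_day, until = "2018-06-14")
```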
## Not run:
# from
recently <- format(Sys.Date() - 1, "%Y-%m-%d")
list_identifiers(from = recently)

# from and until
list_identifiers(from = '2018-06-01T', until = '2018-06-14T')

# set parameter - here, a set from the default GBIF registry
list_identifiers(from = '2018-09-01T', until = '2018-09-05T',
  set = "dataset_type:CHECKLIST")

## End(Not run)
List available metadata formats from various providers.
list_metadataformats( url = "http://api.gbif.org/v1/oai-pmh/registry", id = NULL, ... )
url |
(character) OAI-PMH base url. Defaults to the URL for arXiv's OAI-PMH server (http://export.arxiv.org/oai2) or GBIF's OAI-PMH server (http://api.gbif.org/v1/oai-pmh/registry) |
id |
The OAI-PMH identifier for the record. Optional. |
... |
Curl options passed on to |
## Not run:
list_metadataformats()

# no metadata formats for an identifier
list_metadataformats(id = "9da8a65a-1b9b-487c-a564-d184a91a2705")

# metadata formats available for an identifier
list_metadataformats(id = "ad7295e0-3261-4028-8308-b2047d51d408")

## End(Not run)
List records
list_records( url = "http://api.gbif.org/v1/oai-pmh/registry", prefix = "oai_dc", from = NULL, until = NULL, set = NULL, token = NULL, as = "df", ... )
url |
(character) OAI-PMH base url. Defaults to the URL for arXiv's OAI-PMH server (http://export.arxiv.org/oai2) or GBIF's OAI-PMH server (http://api.gbif.org/v1/oai-pmh/registry) |
prefix |
Specifies the metadata format that the records will be returned in. Default: |
from |
Specifies that records returned must have been created/updated/deleted on or after this date. |
until |
Specifies that records returned must have been created/updated/deleted on or before this date. |
set |
specifies the set that returned records must belong to. |
token |
(character) a token previously provided by the server to resume a request where it last left off. Providers often return at most 50 records per request. We will loop for you internally to get all the records you asked for. |
as |
(character) What to return. One of "df" (for data.frame; default), "list", or "raw" (raw text) |
... |
Curl options passed on to |
## Not run:
# By default you get back a single data.frame
list_records(from = '2018-05-01T00:00:00Z', until = '2018-05-03T00:00:00Z')
list_records(from = '2018-05-01T', until = '2018-05-04T')

# Get a list
list_records(from = '2018-05-01T', until = '2018-05-04T', as = "list")

# Get raw text
list_records(from = '2018-05-01T', until = '2018-05-04T', as = "raw")

# Use a resumption token
# list_records(token =
#  "1443799900201,2015-09-01T00:00:00Z,2015-10-01T23:59:59Z,50,null,oai_dc")

## End(Not run)
List sets
list_sets( url = "http://api.gbif.org/v1/oai-pmh/registry", token = NULL, as = "df", ... )
url |
(character) OAI-PMH base url. Defaults to the URL for arXiv's OAI-PMH server (http://export.arxiv.org/oai2) or GBIF's OAI-PMH server (http://api.gbif.org/v1/oai-pmh/registry) |
token |
(character) a token previously provided by the server to resume a request where it last left off |
as |
(character) What to return. One of "df" (for data.frame; default), "list", or "raw" (raw text) |
... |
Curl options passed on to |
## Not run:
# Get back a data.frame
list_sets()

# Get back a list
list_sets(as = "list")

# Get back raw text
list_sets(as = "raw")

# curl options
library("httr")
list_sets(config = verbose())

## End(Not run)
Load an updated cache
load_providers(path = NULL, envir = .GlobalEnv)
path |
Location of the cache. Leaving it as NULL loads the version in the installed package. |
envir |
R environment to load data into. |
Loads the data object providers into the global workspace.
Loads the object providers into the working space.
## Not run:
# By default the new providers table goes to directory ".", so just
# load from there
update_providers()
load_providers(path=".")

# Loads the version in the package
load_providers()

## End(Not run)
Silently test if OAI-PMH service is available under the URL provided.
oai_available(u, ...)
u |
base URL to OAI-PMH service |
... |
other arguments passed to |
TRUE if the service is available, FALSE otherwise.
## Not run:
url_list <- list(
  archivesic="http://archivesic.ccsd.cnrs.fr/oai/oai.php",
  datacite = "http://oai.datacite.org/oai",
  # No OAI-PMH here
  google = "http://google.com"
)

sapply(url_list, oai_available)

## End(Not run)
Metadata providers data.frame.
A data.frame of three columns:
repo_name - Name of the OAI repository
base_url - Base URL of the OAI repository
oai_identifier - OAI identifier for the OAI repository
Data comes from http://www.openarchives.org/Register/BrowseSites. It includes the oai-identifier (if the repository has one) and the base URL. The website also lists the name of each data provider; it is not provided in the data pulled down here, but you can grab the name using the example below.
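Given the three columns listed above, looking up a repository by name could be sketched as follows. The providers_demo table and its rows are invented for this illustration and are not copied from the real providers data.

```r
# Illustrative stand-in with the three documented columns; the rows are
# made up for this sketch
providers_demo <- data.frame(
  repo_name = c("Example University Repository", "Sample Data Archive"),
  base_url = c("http://repo.example.org/oai", "http://archive.example.com/oai2"),
  oai_identifier = c("repo.example.org", NA),
  stringsAsFactors = FALSE
)

# Case-insensitive search on the repository name
is_hit <- grepl("archive", providers_demo$repo_name, ignore.case = TRUE)
hits <- providers_demo[is_hit, ]
hits$base_url
```

The same subsetting should work on the real table once it is loaded, provided its column names match those documented above.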
update_providers(path = ".", ...)
path |
Path to put data in. |
... |
Curl options passed on to |
This table is scraped from http://www.openarchives.org/Register/BrowseSites. I would get it from http://www.openarchives.org/pmh/registry/ListFriends, but it does not include repository names.
This function updates the table for you. Does take a while though, so go get a coffee.
## Not run:
update_providers()
load_providers()

## End(Not run)