Title: | Access and Search MedRxiv and BioRxiv Preprint Data |
---|---|
Description: | Preprints, preliminary versions of research articles that have yet to undergo peer review, are an increasingly important source of health-related bibliographic content. The two preprint repositories most relevant to health-related sciences are medRxiv <https://www.medrxiv.org/> and bioRxiv <https://www.biorxiv.org/>, both of which are operated by the Cold Spring Harbor Laboratory. 'medrxivr' provides programmatic access to the 'Cold Spring Harbor Laboratory (CSHL)' API <https://api.biorxiv.org/>, allowing users to easily download medRxiv and bioRxiv preprint metadata (e.g. title, abstract, publication date, author list) into R. 'medrxivr' also provides functions to search the downloaded preprint records using regular expressions and Boolean logic, as well as helper functions that allow users to export their search results to a .BIB file for easy import to a reference manager and to download the full-text PDFs of preprints matching their search criteria. |
Authors: | Yaoxiang Li [aut, cre], Luke McGuinness [aut], Lena Schmidt [aut], Tuija Sonkkila [rev], Najko Jahn [rev] |
Maintainer: | Yaoxiang Li <[email protected]> |
License: | GPL-2 |
Version: | 0.1.1 |
Built: | 2025-01-04 06:12:47 UTC |
Source: | https://github.com/ropensci/medrxivr |
Provides programmatic access to all preprints available through the Cold Spring Harbor Laboratory API, which serves both the medRxiv and bioRxiv preprint repositories.
mx_api_content( from_date = "2013-01-01", to_date = as.character(Sys.Date()), clean = TRUE, server = "medrxiv", include_info = FALSE )
from_date |
Earliest date of interest, written as "YYYY-MM-DD". Defaults to 1st Jan 2013 ("2013-01-01"), ~6 months prior to earliest preprint registration date. |
to_date |
Latest date of interest, written as "YYYY-MM-DD". Defaults to current date. |
clean |
Logical, defaulting to TRUE, indicating whether to clean the data returned by the API. If TRUE, variables containing absolute paths to the preprint's web page ("link_page") and PDF ("link_pdf") are generated from the "server", "DOI", and "version" variables returned by the API. The "title", "abstract" and "authors" variables are converted to title case. Finally, the "type" and "server" variables are dropped. |
server |
Specify the server you wish to use: "medrxiv" (default) or "biorxiv" |
include_info |
Logical, indicating whether to include variables containing information returned by the API (e.g. API status, cursor number, total count of papers, etc). Default is FALSE. |
Dataframe with 1 record per row
Other data-source: mx_api_doi(), mx_snapshot()
if (interactive()) { mx_data <- mx_api_content( from_date = "2020-01-01", to_date = "2020-01-07" ) }
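The link-building step described under `clean` can be sketched in base R. This is a minimal illustration, not the package's own code: the URL pattern (`https://www.<server>.org/content/<doi>v<version>`, with `.full.pdf` appended for the PDF), the helper name `build_links()`, and the second DOI are all assumptions made for the example.

```r
# Hypothetical raw records holding the three fields the description names.
# The second DOI is a made-up placeholder, not a real preprint.
raw <- data.frame(
  server  = c("medrxiv", "biorxiv"),
  doi     = c("10.1101/2020.02.25.20021568", "10.1101/000000"),
  version = c("2", "1"),
  stringsAsFactors = FALSE
)

# Assumed pattern: https://www.<server>.org/content/<doi>v<version>
build_links <- function(df) {
  base <- paste0("https://www.", tolower(df$server), ".org/content/",
                 df$doi, "v", df$version)
  df$link_page <- base
  df$link_pdf  <- paste0(base, ".full.pdf")
  # Mirror the documented last step: the "server" variable is dropped.
  df[, setdiff(names(df), "server"), drop = FALSE]
}

cleaned <- build_links(raw)
print(cleaned[, c("doi", "link_page", "link_pdf")])
```

Deriving both links from the same three fields keeps them consistent with each other, so a record's PDF link always points at the same version as its page link.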
Provides programmatic access to data on a single preprint identified by a unique Digital Object Identifier (DOI).
mx_api_doi(doi, server = "medrxiv", clean = TRUE)
doi |
Digital object identifier of the preprint you wish to retrieve data on. |
server |
Specify the server you wish to use: "medrxiv" (default) or "biorxiv" |
clean |
Logical, defaulting to TRUE, indicating whether to clean the data returned by the API. If TRUE, variables containing absolute paths to the preprint's web page ("link_page") and PDF ("link_pdf") are generated from the "server", "DOI", and "version" variables returned by the API. The "title", "abstract" and "authors" variables are converted to title case. Finally, the "type" and "server" variables are dropped. |
Dataframe containing details on the preprint identified by the DOI.
Other data-source: mx_api_content(), mx_snapshot()
if (interactive()) { mx_data <- mx_api_doi("10.1101/2020.02.25.20021568") }
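For orientation, the requests behind the two API functions can be pictured as plain URL construction. The sketch below makes no network call. It assumes the endpoint layout `details/<server>/<doi>` for a single preprint and `details/<server>/<from>/<to>/<cursor>` for a date range; the helper names `doi_url()` and `content_url()` are hypothetical and are not part of 'medrxivr'.

```r
# Base address of the CSHL API named in the package description.
api_base <- "https://api.biorxiv.org"

# Assumed layout for a single preprint: <base>/details/<server>/<doi>
doi_url <- function(doi, server = "medrxiv") {
  server <- match.arg(server, c("medrxiv", "biorxiv"))
  paste(api_base, "details", server, doi, sep = "/")
}

# Assumed layout for a date range: <base>/details/<server>/<from>/<to>/<cursor>
content_url <- function(from_date, to_date, server = "medrxiv", cursor = 0) {
  server <- match.arg(server, c("medrxiv", "biorxiv"))
  # as.Date() raises an error for strings that are not valid dates.
  from <- format(as.Date(from_date), "%Y-%m-%d")
  to   <- format(as.Date(to_date), "%Y-%m-%d")
  paste(api_base, "details", server, from, to, cursor, sep = "/")
}

print(doi_url("10.1101/2020.02.25.20021568"))
print(content_url("2020-01-01", "2020-01-07"))
```

Validating `server` with `match.arg()` means a misspelt server name fails immediately rather than producing a URL that returns an empty result.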
Inspired by the varying capitalization of "NCOV" during the coronavirus pandemic (e.g. ncov, nCoV, NCOV, nCOV), this function allows for all possible configurations of lower- and upper-case letters in your search term.
mx_caps(x)
x |
Search term to be formatted |
The input string is returned, but with each non-space character repeated in lower- and upper-case, and enclosed in square brackets. For example, mx_caps("ncov") returns "[Nn][Cc][Oo][Vv]"
Other helper: mx_crosscheck(), mx_download(), mx_export()
query <- c("coronavirus", mx_caps("ncov"))
mx_search(mx_snapshot("6c4056d2cccd6031d92ee4269b1785c6ec4d555b"), query)
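The transformation described for mx_caps() can be reproduced in a few lines of base R, which helps to show why the result matches every capitalisation. The function `caps_sketch()` below is a hypothetical stand-in written for illustration, not the package's implementation, and it does not escape characters that are special inside a regular-expression character class.

```r
# Wrap each non-space character in a character class holding its
# upper- and lower-case forms; spaces are passed through unchanged.
caps_sketch <- function(x) {
  chars <- strsplit(x, "")[[1]]
  out <- vapply(chars, function(ch) {
    if (ch == " ") " " else paste0("[", toupper(ch), tolower(ch), "]")
  }, character(1), USE.NAMES = FALSE)
  paste(out, collapse = "")
}

pattern <- caps_sketch("ncov")
print(pattern)  # "[Nn][Cc][Oo][Vv]"

# The pattern matches any capitalisation of the term, but not other words.
print(grepl(pattern, c("nCoV", "NCOV", "novel")))  # TRUE TRUE FALSE
```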
Provides information on how up-to-date the maintained medRxiv snapshot provided by 'mx_snapshot()' is by checking whether there have been any records added to, or updated in, the medRxiv repository since the last snapshot was taken.
mx_crosscheck()
Other helper: mx_caps(), mx_download(), mx_export()
mx_crosscheck()
Download PDFs of all the papers in your search results
mx_download( mx_results, directory, create = TRUE, name = c("ID", "DOI"), print_update = 10 )
mx_results |
Vector containing the links to the medRxiv PDFs |
directory |
The location you want to download the PDFs to |
create |
TRUE or FALSE. If TRUE, creates the directory if it doesn't exist |
name |
How to name the downloaded PDF. By default, both the ID number of the record and the DOI are used. |
print_update |
How frequently to print an update |
Other helper: mx_caps(), mx_crosscheck(), mx_export()
mx_results <- mx_search(mx_snapshot(), query = "10.1101/2020.02.25.20021568")
mx_download(mx_results, directory = tempdir())
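The `create` and `name` arguments suggest two small mechanics: making the target directory on demand, and deriving a file name from the record ID and DOI. The sketch below illustrates one way these could work and downloads nothing. The helper `pdf_path()`, the underscore separator, and the replacement of `/` in the DOI are assumptions, not documented behaviour of mx_download().

```r
# Build a file path for one record from its ID and/or DOI.
# A DOI contains "/", which cannot appear in a file name, so it is
# replaced with "_" here (an assumed convention).
pdf_path <- function(directory, id, doi, name = c("ID", "DOI")) {
  name <- match.arg(name, several.ok = TRUE)
  parts <- c(ID = as.character(id), DOI = gsub("/", "_", doi, fixed = TRUE))
  file.path(directory, paste0(paste(parts[name], collapse = "_"), ".pdf"))
}

# Equivalent of create = TRUE: make the directory only if it is missing.
directory <- file.path(tempdir(), "pdf_sketch")
if (!dir.exists(directory)) dir.create(directory, recursive = TRUE)

print(pdf_path(directory, 12, "10.1101/2020.02.25.20021568"))
print(pdf_path(directory, 12, "10.1101/2020.02.25.20021568", name = "DOI"))
```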
Export references for preprints returned by a search to a .bib file
mx_export(data, file = "medrxiv_export.bib")
data |
Dataframe returned by mx_search() or mx_api_*() functions |
file |
File location to save to. Must have the .bib file extension |
Exports a formatted .BIB file, for import into a reference manager
Other helper: mx_caps(), mx_crosscheck(), mx_download()
mx_results <- mx_search(mx_snapshot(), query = "brain")
mx_export(mx_results, tempfile(fileext = ".bib"))
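To make the shape of the exported file concrete, here is a minimal sketch of turning search results into BibTeX entries with base R. The column names (`title`, `authors`, `date`, `doi`), the entry type, the citation keys, and the example record's title and author are all assumptions made for illustration. The file written by mx_export() may be formatted differently.

```r
# Hypothetical search result; title and author are placeholders.
results <- data.frame(
  title   = "An Example Preprint Title",
  authors = "Doe, J.; Roe, R.",
  date    = "2020-02-27",
  doi     = "10.1101/2020.02.25.20021568",
  stringsAsFactors = FALSE
)

# Format one BibTeX entry per row; sprintf() is vectorised over rows.
to_bibtex <- function(df) {
  sprintf(
    "@article{%s,\n  title = {%s},\n  author = {%s},\n  year = {%s},\n  doi = {%s}\n}",
    paste0("preprint", seq_len(nrow(df))),
    df$title, df$authors, format(as.Date(df$date), "%Y"), df$doi
  )
}

# As documented for mx_export(), the target must use the .bib extension.
bib_file <- tempfile(fileext = ".bib")
writeLines(to_bibtex(results), bib_file)
cat(readLines(bib_file), sep = "\n")
```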
Search and print output for individual search items
mx_reporter(mx_data, num_results, query, fields, deduplicate, NOT)
mx_data |
The mx_dataset filtered for the date limits |
num_results |
The number of results returned by the overall search |
query |
Character string, vector or list |
fields |
Fields of the database to search - default is Title, Abstract, Authors, Category, and DOI. |
deduplicate |
Logical. Only return the most recent version of a record. Default is TRUE. |
NOT |
Vector of regular expressions to exclude from the search. Default is "". |
Other main: mx_search(), print_full_results(), run_search()
Search preprint data
mx_search( data = NULL, query = NULL, fields = c("title", "abstract", "authors", "category", "doi"), from_date = NULL, to_date = NULL, auto_caps = FALSE, NOT = "", deduplicate = TRUE, report = FALSE )
data |
The preprint dataset that is to be searched, created either using mx_api_content() or mx_snapshot() |
query |
Character string, vector or list |
fields |
Fields of the database to search - default is Title, Abstract, Authors, Category, and DOI. |
from_date |
Defines earliest date of interest. Written in the format "YYYY-MM-DD". Note, records published on the date specified will also be returned. |
to_date |
Defines latest date of interest. Written in the format "YYYY-MM-DD". Note, records published on the date specified will also be returned. |
auto_caps |
As the search is case sensitive, this logical specifies whether the search should automatically allow for differing capitalisation of search terms. For example, when TRUE, a search for "dementia" would find both "dementia" and "Dementia". Note that if your term is multi-word (e.g. "systematic review"), only the first word is automatically capitalised (e.g. your search will find both "systematic review" and "Systematic review" but won't find "Systematic Review"). Note that this option will format terms in the query and NOT arguments (if applicable). |
NOT |
Vector of regular expressions to exclude from the search. Default is "". |
deduplicate |
Logical. Only return the most recent version of a record. Default is TRUE. |
report |
Logical. Run mx_reporter. Default is FALSE. |
Other main: mx_reporter(), print_full_results(), run_search()
# Using the daily snapshot
mx_results <- mx_search(data = mx_snapshot(), query = "dementia")
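Two of the argument descriptions above lend themselves to a short illustration: the first-letter behaviour of `auto_caps` and the inclusive date limits. The sketch below assumes `auto_caps` works by wrapping the first character of the term in a character class. The helper `auto_caps_sketch()` is hypothetical and may differ from the package internals.

```r
# Allow either case for the first character only, as the auto_caps
# description states for multi-word terms.
auto_caps_sketch <- function(term) {
  first <- substr(term, 1, 1)
  paste0("[", toupper(first), tolower(first), "]",
         substr(term, 2, nchar(term)))
}

pattern <- auto_caps_sketch("systematic review")
print(pattern)  # "[Ss]ystematic review"

titles <- c("systematic review", "Systematic review", "Systematic Review")
print(grepl(pattern, titles))  # TRUE TRUE FALSE

# Date limits are inclusive: records dated on from_date or to_date are kept.
dates <- as.Date(c("2020-01-01", "2020-01-05", "2020-01-07"))
keep <- dates >= as.Date("2020-01-05") & dates <= as.Date("2020-01-07")
print(keep)  # FALSE TRUE TRUE
```

To match "Systematic Review" as well, the documented alternative is to format the term with mx_caps(), which allows either case for every character.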
[Available for medRxiv only] This function allows users to import a maintained static snapshot of the medRxiv repository, instead of downloading a copy from the API, which can become unavailable during peak usage times. The function dynamically retrieves multiple snapshot parts from the specified repository and combines them into a single dataframe.
mx_snapshot(commit = "main")
commit |
Commit hash or branch name for the snapshot, taken from https://github.com/yaoxiangli/medrxivr-data. Allows for reproducible searching by specifying the exact snapshot used to perform the searches. Defaults to "main", which will return the most recent snapshot from the main branch. |
A formatted dataframe containing the combined data from the snapshot parts, with reconstructed 'link_page' and 'link_pdf' columns.
Other data-source: mx_api_content(), mx_api_doi()
mx_data <- mx_snapshot()
mx_data_specific <- mx_snapshot(commit = "specific_commit_hash")
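The idea of combining snapshot parts, and of pinning them to a commit, can be sketched without any network access. In the example below the parts are small in-memory CSV strings with placeholder DOIs. The helper `snapshot_part_url()`, the file name `part_1.csv`, and the CSV format are assumptions, since the layout of the data repository is not described here. Only the general pattern for raw GitHub file URLs is standard.

```r
# Raw GitHub file URLs follow <host>/<owner>/<repo>/<commit-or-branch>/<path>,
# so fixing the commit fixes the exact data that is read.
snapshot_part_url <- function(commit, file) {
  paste("https://raw.githubusercontent.com/yaoxiangli/medrxivr-data",
        commit, file, sep = "/")
}
print(snapshot_part_url("main", "part_1.csv"))  # hypothetical file name

# Stand-ins for two downloaded parts (placeholder DOIs).
part_text <- c(
  "doi,version\n10.1101/aaa,1\n10.1101/bbb,1\n",
  "doi,version\n10.1101/ccc,2\n"
)

# Read each part, then stack them into a single data frame.
parts <- lapply(part_text, function(txt) {
  read.csv(text = txt, stringsAsFactors = FALSE)
})
snapshot <- do.call(rbind, parts)
print(snapshot)
```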
Search for terms in the dataset
print_full_results(num_results, deduplicate)
num_results |
number of searched terms returned |
deduplicate |
Logical. Only return the most recent version of a record. Default is TRUE. |
Other main:
mx_reporter()
,
mx_search()
,
run_search()
Search for terms in the dataset
run_search(mx_data, query, fields, deduplicate, NOT = "")
mx_data |
The mx_dataset filtered for the date limits |
query |
Character string, vector or list |
fields |
Fields of the database to search - default is Title, Abstract, Authors, Category, and DOI. |
deduplicate |
Logical. Only return the most recent version of a record. Default is TRUE. |
NOT |
Vector of regular expressions to exclude from the search. Default is "". |
Other main: mx_reporter(), mx_search(), print_full_results()