Title: | Parse Full Text XML Documents from PubMed Central |
---|---|
Description: | Parse XML documents from the Open Access subset of Europe PubMed Central <https://europepmc.org> including section paragraphs, tables, captions and references. |
Authors: | Chris Stubben [aut, cre] |
Maintainer: | Chris Stubben <[email protected]> |
License: | GPL-3 |
Version: | 1.8 |
Built: | 2024-12-21 04:56:26 UTC |
Source: | https://github.com/ropensci/tidypmc |
Collapse rows into a semicolon-delimited list of column names and cell values
collapse_rows(pmc, na.string)
pmc |
a list of tables, usually from |
na.string |
additional cell values to skip, default is NA and "" |
A tibble with table and row number and collapsed text
Chris Stubben
x <- data.frame(
  genes = c("aroB", "glnP", "ndhA", "pyrF"),
  fold_change = c(2.5, 1.7, -3.1, -2.6)
)
collapse_rows(list(`Table 1` = x))
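The collapsing step described above can be illustrated with a minimal base R sketch. The helper `collapse_row_sketch` is hypothetical and is not part of tidypmc; the "name=value" pairing and the "; " separator are assumptions chosen for illustration, so the package's exact output format may differ.

```r
# Hypothetical helper (not the tidypmc implementation): collapse each row of a
# data frame into "name=value" pairs joined by "; ", skipping NA, "" and any
# extra values listed in na.string.
collapse_row_sketch <- function(df, na.string = character(0)) {
  skip <- c("", na.string)
  vapply(seq_len(nrow(df)), function(i) {
    # One character value per column for row i
    vals <- vapply(df[i, , drop = FALSE], as.character, character(1))
    keep <- !is.na(vals) & !(vals %in% skip)
    paste(paste0(names(vals)[keep], "=", vals[keep]), collapse = "; ")
  }, character(1))
}

x <- data.frame(
  genes = c("aroB", "glnP"),
  fold_change = c(2.5, NA)
)
collapse_row_sketch(x)
```

The second row has a missing fold change, so only its gene name is kept in the collapsed string.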
Get a list of journal and article metadata in /front tag
pmc_metadata(doc)
doc |
an xml_document, for example from pmc_xml or xml2::read_xml |
a list
Chris Stubben
# doc <- pmc_xml("PMC2231364") # OR
doc <- xml2::read_xml(system.file("extdata/PMC2231364.xml", package = "tidypmc"))
pmc_metadata(doc)
Format references cited
pmc_reference(doc)
doc |
an xml_document, for example from pmc_xml or xml2::read_xml |
a tibble with id, pmid, authors, year, title, journal, volume, pages, and doi.
Mixed citations without any child tags are added to the authors column.
Chris Stubben
# doc <- pmc_xml("PMC2231364")
doc <- xml2::read_xml(system.file("extdata/PMC2231364.xml", package = "tidypmc"))
x <- pmc_reference(doc)
x
Convert PubMed Central table nodes into a list of tibbles
pmc_table(doc)
doc |
an xml_document, for example from pmc_xml or xml2::read_xml |
a list of tibbles
Saves the caption and footnotes as attributes, collapses multiline headers, expands all rowspan and colspan attributes, and adds subheadings to column one.
Chris Stubben
# doc <- pmc_xml("PMC2231364")
doc <- xml2::read_xml(system.file("extdata/PMC2231364.xml", package = "tidypmc"))
x <- pmc_table(doc)
sapply(x, dim)
x
attributes(x[[1]])
Split section paragraph tags into a table with subsection titles and sentences using tokenize_sentences
pmc_text(doc)
doc |
an xml_document, for example from pmc_xml or xml2::read_xml |
a tibble with section, paragraph and sentence number and text
Subsections may be nested to arbitrary depth, and this function returns the entire path to the subsection title as a delimited string like "Results; Predicted functions; Pathogenicity". Tables, figures and formulas that are nested in section paragraphs are removed, superscripted references are replaced with brackets, and any other superscripts or subscripts are separated with ^ and _.
Chris Stubben
# doc <- pmc_xml("PMC2231364")
doc <- xml2::read_xml(system.file("extdata/PMC2231364.xml", package = "tidypmc"))
txt <- pmc_text(doc)
txt
dplyr::count(txt, section, sort = TRUE)
Download XML from PubMed Central
pmc_xml(id)
id |
a PMC id starting with 'PMC' |
xml_document
https://europepmc.org/RestfulWebService
## Not run:
doc <- pmc_xml("PMC2231364")
## End(Not run)
Separate genes and operons mentioned in full text into multiple rows
separate_genes(txt, pattern = "\\b[A-Za-z][a-z]{2}[A-Z0-9]+\\b", genes, operon = 6, column = "text")
txt |
a table |
pattern |
regular expression to match genes; the default, \\b[A-Za-z][a-z]{2}[A-Z0-9]+\\b, matches microbial genes like AbcD |
genes |
an optional vector of genes, set pattern to NA to only match this list. |
operon |
operon length, default 6. Split genes with 6 or more letters into separate genes, for example AbcDEF is split into abcD, abcE and abcF. |
column |
column name to search, default "text" |
a tibble with gene name, matching text and rows.
Check for genes in italics using xml_text(xml_find_all(doc, "//sec//p//italic")) and update the pattern or add additional genes as an optional vector if needed.
Chris Stubben
x <- data.frame(row = 1, text = "Genes like YacK, hmu and sufABC")
separate_genes(x)
separate_genes(x, genes = "hmu")
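To make the default pattern and the operon rule more concrete, here is a small base R sketch. The helper `split_operon_sketch` is hypothetical and is not the tidypmc implementation; it only assumes the behaviour stated in the operon argument (a three-letter prefix in lower case followed by one capital letter per gene).

```r
# Default pattern from the usage line, applied with base R (perl = TRUE so
# that \\b is treated as a word boundary).
pattern <- "\\b[A-Za-z][a-z]{2}[A-Z0-9]+\\b"
text <- "Genes like YacK, hmu and sufABC"
found <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]

# Hypothetical helper: split an operon name such as "sufABC" into single
# genes by pasting the lower-cased three-letter prefix onto each suffix letter.
split_operon_sketch <- function(operon) {
  prefix <- tolower(substr(operon, 1, 3))
  suffix <- strsplit(substring(operon, 4), "")[[1]]
  paste0(prefix, suffix)
}

found
split_operon_sketch("sufABC")
split_operon_sketch("AbcDEF")
```

Note that "hmu" has no trailing capital letter or digit, so the default pattern does not match it; that is why the example above passes it through the genes argument.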
Separates references cited in brackets or parentheses into multiple rows, splitting comma-delimited numeric strings and expanding ranges like 7-9 into new rows
separate_refs(txt, column = "text")
txt |
a table |
column |
column name, default "text" |
a tibble
Chris Stubben
x <- data.frame(row = 1, text = "some important studies [7-9,15]")
separate_refs(x)
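The range expansion described above can be sketched in base R. The helper `expand_refs_sketch` is hypothetical and only illustrates the idea of splitting on commas and expanding "start-end" ranges; it is not how tidypmc is implemented and it ignores the row bookkeeping that separate_refs performs.

```r
# Hypothetical helper: turn a bracketed citation string into a vector of
# reference numbers.
expand_refs_sketch <- function(x) {
  # Keep only digits, commas and hyphens, then split on commas
  parts <- strsplit(gsub("[^0-9,-]", "", x), ",")[[1]]
  unlist(lapply(parts, function(p) {
    ends <- as.integer(strsplit(p, "-")[[1]])
    # Two numbers mean a range; a single number is returned as is
    if (length(ends) == 2) seq(ends[1], ends[2]) else ends
  }))
}

expand_refs_sketch("[7-9,15]")
```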
Separates locus tags mentioned in full text and expands ranges like YPO1970-74 into new rows
separate_tags(txt, pattern, column = "text")
txt |
a table |
pattern |
regular expression to match locus tags like YPO[0-9-]+ or the locus tag prefix like YPO. |
column |
column name to search, default "text" |
a tibble with locus tag, matching text and rows.
Chris Stubben
x <- data.frame(row = 1, text = "some genes like YPO1002 and YPO1970-74")
separate_tags(x, "YPO")
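A range such as YPO1970-74 abbreviates the end of the range, so expanding it means first restoring the full end tag. The base R sketch below shows one way this could work; `expand_tag_range_sketch` is a hypothetical helper written for illustration and is not the tidypmc implementation.

```r
# Hypothetical helper: expand a locus tag range like "YPO1970-74" into the
# individual tags. Tags that are not ranges are returned unchanged.
expand_tag_range_sketch <- function(tag) {
  m <- regmatches(tag, regexec("^([A-Za-z_]+)([0-9]+)-([0-9]+)$", tag))[[1]]
  if (length(m) == 0) {
    return(tag)
  }
  prefix <- m[2]
  start <- m[3]
  end <- m[4]
  # Restore the abbreviated end: "1970" and "74" give "1974"
  full_end <- paste0(substr(start, 1, nchar(start) - nchar(end)), end)
  nums <- seq(as.integer(start), as.integer(full_end))
  # Zero-pad to the width of the start number so tags keep their length
  paste0(prefix, formatC(nums, width = nchar(start), flag = "0"))
}

expand_tag_range_sketch("YPO1970-74")
expand_tag_range_sketch("YPO1002")
```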
Separate all matching text into multiple rows
separate_text(txt, pattern, column = "text")
txt |
a tibble, usually results from |
pattern |
either a regular expression or a vector of words to find in text |
column |
column name, default "text" |
a tibble
The pattern is passed to grepl and str_extract_all.
Chris Stubben
# doc <- pmc_xml("PMC2231364")
doc <- xml2::read_xml(system.file("extdata/PMC2231364.xml", package = "tidypmc"))
txt <- pmc_text(doc)
separate_text(txt, "[ATCGN]{5,}")
separate_text(txt, "\\([A-Z]{3,6}s?\\)")
# pattern can be a vector of words
separate_text(txt, c("hmu", "ybt", "yfe", "yfu"))
# wrappers for separate_text with extra step to expand matched ranges
separate_refs(txt)
separate_genes(txt)
separate_tags(txt, "YPO")
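For readers without the package installed, the filter-then-extract idea behind separate_text can be approximated in base R. This is only a sketch under stated assumptions: `separate_text_sketch` is a hypothetical helper, `regmatches` with `gregexpr` stands in for str_extract_all, and joining a word vector into a single alternation with word boundaries is an assumption about how a vector pattern might be handled.

```r
# Hypothetical helper: keep rows whose text matches the pattern, then return
# one output row per match.
separate_text_sketch <- function(txt, pattern, column = "text") {
  # Assumption: a vector of words is combined into one alternation
  if (length(pattern) > 1) {
    pattern <- paste0("\\b(", paste(pattern, collapse = "|"), ")\\b")
  }
  hits <- txt[grepl(pattern, txt[[column]], perl = TRUE), , drop = FALSE]
  matches <- regmatches(hits[[column]],
                        gregexpr(pattern, hits[[column]], perl = TRUE))
  # Repeat each matching row once per extracted match
  out <- hits[rep(seq_len(nrow(hits)), lengths(matches)), , drop = FALSE]
  out <- data.frame(match = as.character(unlist(matches)), out,
                    stringsAsFactors = FALSE)
  rownames(out) <- NULL
  out
}

x <- data.frame(
  row = 1:2,
  text = c("primers ATCGGA and TTGACCA", "no sequences here")
)
res <- separate_text_sketch(x, "[ATCGN]{5,}")
res
```

Both matches come from the first row, so that row appears twice in the result and the second row is dropped.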