xml_parse() processes a PubMedCentral XML file by extracting XML
elements, transforming them to text or integer representations, and
loading the result into a database.
Value
xml_parse() returns the pmcbioc_db database passed as argument db,
updated to include the tables described here. The tables are as
follows:
article includes columns pmcid, title, journal, year, and pmid.
author is a one-to-many map between pmcid and author surname and givenname.
keyword is a one-to-many map between pmcid and keyword.
refpmid is a one-to-many map between pmcid and the refpmid PubMed identifiers of cited references.
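
As a rough illustration of the one-to-many layout described above, the following sketch builds toy stand-ins for the article and author tables as base R data frames. It does not call xml_parse() or touch a database; every value is invented, and the column types (for instance pmid stored as text) are assumptions rather than the package's documented schema.

```r
# Toy stand-ins for two of the tables; all values are invented.
article <- data.frame(
    pmcid = c("PMC0000001", "PMC0000002"),
    title = c("First example article", "Second example article"),
    journal = c("Journal A", "Journal B"),
    year = c(2019L, 2021L),
    pmid = c("10000001", "10000002")   # storage type is an assumption
)

author <- data.frame(
    pmcid = c("PMC0000001", "PMC0000001", "PMC0000002"),
    surname = c("Smith", "Jones", "Lee"),
    givenname = c("A.", "B.", "C.")
)

# article has one row per pmcid; author may have several rows per pmcid
authors_per_article <- table(author$pmcid)
print(authors_per_article)
```

The keyword and refpmid tables would follow the same pattern as author: a pmcid column repeated once for each associated value.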
Details
xml_parse() can be slow (e.g., 1000 records per minute), and its
memory use can grow large (e.g., 18 GB).
xml_parse() uses XML::xmlEventParse() to iterate through the XML
file. Each //article branch is queried using XPath expressions. The
expressions and subsequent transformations are meant to extract the
following information; currently, not all records are processed
correctly.
pmcid: The PubMedCentral identifier associated with the record.
title: The article title.
journal: The journal in which the article was published.
year: The year of publication, represented as an integer. Articles may have several publication dates (e.g., electronically before physically). year is the earliest date in the record.
pmid: PubMed identifier of the record.
surname: Surname of each author. Some authors have only a surname.
givenname: Given name(s) of each author. Some authors have only given names.
keyword: Keywords associated with the publication. Keywords are not standardized.
refpmid: PubMed identifiers of each reference in the record.
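
The rule for year (the earliest of several publication dates, stored as an integer) can be sketched in base R as below. This is only an illustration of the stated rule, not the code xml_parse() uses; the vector pub_years and its names are hypothetical.

```r
# Hypothetical publication years for one record, extracted as text:
# an electronic date and a later print date.
pub_years <- c(electronic = "2020", print = "2021")

# Convert to integer and keep the earliest year
year <- min(as.integer(pub_years))
print(year)
```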
pmcid is used as a key across database tables.
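
Because pmcid is shared across tables, rows from different tables can be combined by matching on it. The sketch below shows the idea with base R merge() on toy data frames; the values are invented, and working against the actual database would use whatever query interface the pmcbioc_db object provides, which is not shown here.

```r
# Toy data; all values are invented.
article <- data.frame(
    pmcid = c("PMC0000001", "PMC0000002"),
    title = c("First example article", "Second example article"),
    year = c(2019L, 2021L)
)

keyword <- data.frame(
    pmcid = c("PMC0000001", "PMC0000001", "PMC0000002"),
    keyword = c("genomics", "software", "statistics")
)

# Join on the shared key: one output row per (article, keyword) pair
article_keyword <- merge(article, keyword, by = "pmcid")
print(article_keyword)
```

Each article row is repeated once per matching keyword, so counts taken after such a join should be made on distinct pmcid values to avoid counting an article more than once.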