
xml_parse() processes a PubMedCentral XML file by extracting XML elements, transforming them to text or integer representations, and loading the result into a database.

Usage

xml_parse(xml_file, db)

Value

xml_parse() returns the pmcbioc_db database passed as argument db, updated to include the tables described here. The tables are as follows:

  • article includes columns pmcid, title, journal, year, and pmid.

  • author is a one-to-many map between pmcid and author surname and givenname.

  • keyword is a one-to-many map between pmcid and keyword.

  • refpmid is a one-to-many map between pmcid and the refpmid PubMed identifiers of cited references.
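The one-to-many layout can be illustrated with plain data frames. This is only a sketch: the rows are invented, the column types other than the integer year are assumptions, and it uses base R instead of the database returned by xml_parse().

```r
# Hypothetical rows shaped like the 'article' and 'author' tables described
# above; all values are invented for illustration.
article <- data.frame(
    pmcid = c("PMC1", "PMC2"),
    title = c("First title", "Second title"),
    journal = c("Journal A", "Journal B"),
    year = c(2019L, 2021L),          # year is stored as an integer
    pmid = c("101", "102")           # character type is an assumption
)
author <- data.frame(
    pmcid = c("PMC1", "PMC1", "PMC2"),
    surname = c("Doe", "Roe", NA),   # some authors lack a surname
    givenname = c("Jane", NA, "Alex") # some authors lack a given name
)

# One-to-many join on pmcid: each article row repeats once per author.
article_author <- merge(article, author, by = "pmcid")

# Number of authors recorded for each article.
n_author <- table(author$pmcid)
```

The keyword and refpmid tables would join to article in the same way, with pmcid as the shared key.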

Details

xml_parse() can be slow (e.g., about 1000 records per minute), and its memory use can grow large (e.g., 18 GB).
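A rough run-time estimate follows from the quoted rate. The record count below is a hypothetical input, not a measurement, and actual throughput will vary with the machine and the file.

```r
records_per_minute <- 1000   # approximate rate quoted above
n_records <- 250000          # hypothetical number of records in the XML file

minutes <- n_records / records_per_minute
hours <- minutes / 60
```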

xml_parse() uses XML::xmlEventParse() to iterate through the XML file. Each //article branch is queried using XPath expressions. The expressions and subsequent transformations are meant to extract the following information; currently, not all records are processed correctly.

  • pmcid: The PubMedCentral identifier associated with the record.

  • title: The article title.

  • journal: The journal in which the article was published.

  • year: The year of publication, represented as an integer. Articles may have several publication dates (e.g., electronic publication before print). year is the earliest publication year in the record.

  • pmid: PubMed identifier of the record.

  • surname: Surname of each author. Some authors have only a surname.

  • givenname: Given name(s) of each author. Some authors have only given names.

  • keyword: Keywords associated with the publication. Keywords are not standardized.

  • refpmid: PubMed identifiers of each reference in the record.

pmcid is used as a key across database tables.
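The kind of XPath query and transformation described above can be sketched on a tiny in-memory record. This is not the package's implementation: it parses the whole document with XML::xmlParse() rather than streaming with XML::xmlEventParse(), the record is invented, the JATS-style element names and XPath expressions are assumptions, and xtext() is a hypothetical helper.

```r
library(XML)

# A tiny, invented record using JATS-style element names (an assumption
# about the structure of the input file).
xml_text <- paste0(
    '<pmc-articleset><article>',
    '<front>',
    '<journal-meta><journal-title-group>',
    '<journal-title>Journal A</journal-title>',
    '</journal-title-group></journal-meta>',
    '<article-meta>',
    '<article-id pub-id-type="pmc">PMC1</article-id>',
    '<article-id pub-id-type="pmid">101</article-id>',
    '<title-group><article-title>First title</article-title></title-group>',
    '<contrib-group><contrib contrib-type="author">',
    '<name><surname>Doe</surname><given-names>Jane</given-names></name>',
    '</contrib></contrib-group>',
    '<pub-date pub-type="epub"><year>2019</year></pub-date>',
    '<pub-date pub-type="ppub"><year>2020</year></pub-date>',
    '<kwd-group><kwd>genomics</kwd><kwd>R</kwd></kwd-group>',
    '</article-meta>',
    '</front>',
    '<back><ref-list><ref><element-citation>',
    '<pub-id pub-id-type="pmid">55</pub-id>',
    '</element-citation></ref></ref-list></back>',
    '</article></pmc-articleset>'
)

doc <- xmlParse(xml_text, asText = TRUE)
article <- getNodeSet(doc, "//article")[[1]]

# Hypothetical helper: evaluate an XPath relative to one node and return
# the matching text as a character vector (length 0 when nothing matches).
xtext <- function(node, path) {
    as.character(unlist(xpathSApply(node, path, xmlValue)))
}

meta <- "./front/article-meta"
pmcid <- xtext(article, paste0(meta, "/article-id[@pub-id-type='pmc']"))
pmid <- xtext(article, paste0(meta, "/article-id[@pub-id-type='pmid']"))
title <- xtext(article, paste0(meta, "/title-group/article-title"))
journal <- xtext(article, "./front/journal-meta//journal-title")
surname <- xtext(article, paste0(meta, "//contrib/name/surname"))
givenname <- xtext(article, paste0(meta, "//contrib/name/given-names"))
keyword <- xtext(article, paste0(meta, "//kwd"))
refpmid <- xtext(article, "./back/ref-list//pub-id[@pub-id-type='pmid']")

# Several publication dates may be present; keep the earliest year.
year <- min(as.integer(xtext(article, paste0(meta, "/pub-date/year"))))
```

Real records are less regular than this one (missing elements, nested markup in titles, authors without a surname or given name), which is one reason some records may not be processed correctly.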