Skip to contents

Instal and load

Install the development version of pmcbioc from GitHub with:

# install.packages("devtools")
devtools::install_github("mtmorgan/pmcbioc")

Load the package with:

Retrieve and process PubMedCentral data

pmcbioc requires that the user visit the PubMedCentral web site, search for terms or citations of interested, and chooses to save the results to their local disk as an XML file. For the analysis in this document, I visited the Bioconductor web site, located the Publications page, followed the ‘PubMed Central’ link to a predefined search, and saved the results as an XML file. These steps can be automated through the NIH EUtils API, e.g., using renterez.

The XML contains full text of all articles satisfying the query, which means that it can be quite large; 72,446 citations to the term ‘Bioconductor’ retrieved on 2023-10-18 occupy about 8.5GB of disk space. Suppose that the result file is pmc_result.xml, in the current working directory.

Data processing involves four steps

  1. Create a DuckDB database, to be accessed using the duckdb R package.
  2. Parse article metadata (PubMedCentral id, title, journal, year, pmid, authors, keywords, citation PubMed ids) into tables in the database.
  3. Index the XML file to allow fast access to the full record of individual articles, and record the index in the database.
  4. Closing and perhaps renaming the database.

Database creation

Create a new DuckDB database in the current working directory. Use a temporary file name to avoid accidentally over-writing existing work.

db_file <- tempfile(tmpdir = ".", fileext = ".duckdb")
db <- pmcbioc_db(db_file)
db
#> pmcbioc_db: ./file90845f7a26f2.duckdb
#> connected: TRUE
#> read_only: FALSE
#> db_tables():

Connection to a new (empty) database allows for updates (read_only: FALSE). Connecting to an existing database is by default read-only (read_only: TRUE); an existing database can be opened to allow updates with the pmcbioc_db() argument read_only = TRUE.

Article metadata

Parse metadata about each article in the XML file into the database using xml_parse(). The arguments are the path to the XML file, and the database created in the previous step.

## not evaluated
xml_parse("pmc_result.xml", db)

This process is expensive in both time and memory. The Bioconductor query took about 1 minute per 1000 records, and ultimately consumed about 18Gb of memory. Processing using XML::xmlEventParse(), with ‘XPath’ queries used to extract metadata from each article. Details of the paths used are visble using the (non-exported) function

## not evaluated
pmcbioc:::xml_branches(db)

When the process completes, the database will contain four tables

db_tables(db)
#> [1] "article" "author"  "keyword" "refpmid"

XML index

The function xml_index() creates an index on the large XML file downloaded from PubMedCentral, making it easy to query subsets of articles for information not extracted to the database with xml_parse(). Create and add the index with

## not evaluated
xml_index("pmc_result.xml", db)

This process is faster and less memory-intensive than xml_parse(), but still takes several minutes. On completion, the database contains an additional table index.

db_tables(db)
#> [1] "article" "author"  "index"   "keyword" "refpmid"

Final steps

Remember to disconnect from the database, so any buffered data is written to disk.

db_disconnect(db)
#> pmcbioc_db: ./file90845f7a26f2.duckdb
#> connected: FALSE
#> read_only: FALSE

Optionally, rename the temporary file to something permanent.

## not evaluated
file.rename(db_file, "pmc_result.duckdb")

Explore our data

Start by connecting to the database. Since the database already exists, it opens as ‘read only’ so one does not have to worry about accidentally corrupting it.

db <- pmcbioc_db("~/a/github/pmcbioc/pmc_result.duckdb")
db
#> pmcbioc_db: ~/a/github/pmcbioc/pmc_result.duckdb
#> connected: TRUE
#> read_only: TRUE
#> db_tables(): article, author, index, keyword, refpmid

Article metadata

article

Use dplyr functions to open and explore tables. Start with the article table.

tbl(db, "article")
#> # Source:   table<article> [?? x 6]
#> # Database: DuckDB v0.9.1 [root@Darwin 21.6.0:R 4.3.1/~/a/github/pmcbioc/pmc_result.duckdb]
#>       id pmcid    title                                      journal  year pmid 
#>    <int> <chr>    <chr>                                      <chr>   <int> <chr>
#>  1     1 10580555 Coexpression analysis of lncRNAs and mRNA… BMC Ve…  2023 3784…
#>  2     2 10578971 Unbiased gene expression analysis of the … Bone &…  2023 NA   
#>  3     3 10579997 Clinical and biological significance of c… Clinic…  2023 3784…
#>  4     4 10580377 Katdetectr: an R/bioconductor package uti… GigaSc…  2023 NA   
#>  5     5 10580564 Differential transcriptome response follo… Virolo…  2023 NA   
#>  6     6 10577922 Maternal adverse childhood experiences (A… Clinic…  2023 3784…
#>  7     7 10578035 Data-driven identification of total RNA e… Genome…  2023 NA   
#>  8     8 10578040 Analysis of transcriptomic features revea… Genome…  2023 3784…
#>  9     9 10577904 Single-cell RNA sequencing distinctly cha… Genome…  2023 3784…
#> 10    10 10580862 Parvimonas micra, an oral pathobiont asso… Gut Mi…  2023 3784…
#> # ℹ more rows
tbl(db, "article") |>
    count()
#> # Source:   SQL [1 x 1]
#> # Database: DuckDB v0.9.1 [root@Darwin 21.6.0:R 4.3.1/~/a/github/pmcbioc/pmc_result.duckdb]
#>       n
#>   <dbl>
#> 1 72446

Each row represents a result returned by the original PubMedCentral query. The fields are

  • id is an internal identifier, representing the sequence of the article in the original XML file. It is used in conjunction with the index table.
  • pmcid (PubMedCentral identifier), title, journal, and pmid (PubMed identifier) are self-explanatory.
  • year is the earliest year recorded in the PubMedCentral record. Some articles have multiple publication dates associated with, e.g., electronic publication, submission to NIH, etc.

Missing values occur when the XPath used to extract the data failed to identify relevant information. I believe that the content of the XML associated with each record is partly at the discretion of the journal.

Use the article table to plot citations as a function of year. Start by loading the ggplot2 package

Summarize the number of publications per year. Remember to collect() the result so that the tibble is fully in-memory.

pubs_per_year <-
    tbl(db, "article") |>
    count(year) |>
    collect()

Plot the relationship. One row is removed, corresponding to records for which a year was not extracted from the XML. The final year is incomplete, and there is a lag between publication and indexing by PubMedCentral, so the last two points are an anomaly.

ggplot(pubs_per_year) +
    aes(x = year, y = n) +
    geom_point() +
    scale_y_log10()

Journals with more than 100 articles mentioning Bioconductor are summarized as (the unnamed journal corresponds to records for which the XPath did not extract a name).

tbl(db, "article") |>
    count(journal, sort = TRUE) |>
    filter(n >= 100) |>
    collect() |>
    DT::datatable()

author

The author table associates author surname and given names with PubMedCentral identifiers. In some instances individual authors were not extracted (e.g., because the journal reported the author using a format the XPath did not capture). These entries are not reported as NA, but are simply omitted from the table. Thus some authors do not receive credit for all their publications.

tbl(db, "author")
#> # Source:   table<author> [?? x 3]
#> # Database: DuckDB v0.9.1 [root@Darwin 21.6.0:R 4.3.1/~/a/github/pmcbioc/pmc_result.duckdb]
#>    pmcid    surname    givenname
#>    <chr>    <chr>      <chr>    
#>  1 10580555 Xia        Xiaojing 
#>  2 10580555 Hou        Jie      
#>  3 10580555 Ren        Pengfei  
#>  4 10580555 Liu        Mingcheng
#>  5 10580555 Wang       Lei      
#>  6 10580555 Wei        Xiaobing 
#>  7 10580555 Teng       Zhanwei  
#>  8 10580555 Kasianenko Oksana   
#>  9 10580555 Cheng      Likun    
#> 10 10580555 Hu         Jianhe   
#> # ℹ more rows
tbl(db, "author") |>
    count()
#> # Source:   SQL [1 x 1]
#> # Database: DuckDB v0.9.1 [root@Darwin 21.6.0:R 4.3.1/~/a/github/pmcbioc/pmc_result.duckdb]
#>        n
#>    <dbl>
#> 1 746369

The most prolific authors (or at least unique combinations of surname and given names) are found as:

tbl(db, "author") |>
    count(surname, givenname, sort = TRUE) |>
    head(100) |>
    collect() |>
    DT::datatable()

Data inconsistencies are not easily resolved, e.g., are these all me?

tbl(db, "author") |>
    filter(surname == "Morgan", givenname %like% "Martin%") |>
    count(surname, givenname)
#> # Source:   SQL [2 x 3]
#> # Database: DuckDB v0.9.1 [root@Darwin 21.6.0:R 4.3.1/~/a/github/pmcbioc/pmc_result.duckdb]
#>   surname givenname     n
#>   <chr>   <chr>     <dbl>
#> 1 Morgan  Martin       27
#> 2 Morgan  Martin T.     5

Publications of individual authors can be found by joining the author and article tables using PubMedCentral id.

tbl(db, "author") |>
    filter(surname == "Morgan", givenname %like% "Martin%") |>
    left_join(tbl(db, "article"), by = "pmcid") |>
    collect() |>
    mutate(
        author = paste(surname, givenname, sep = ", "),
        ## remove "\n" in one record
        title = gsub("[[:space:]]+", " ", title)
    ) |>
    select(author, year, title, journal, pmcid, pmid) |>
    DT::datatable()

keyword and refpmid

The keyword table summarizes keywords associated with each publication.

tbl(db, "keyword")
#> # Source:   table<keyword> [?? x 2]
#> # Database: DuckDB v0.9.1 [root@Darwin 21.6.0:R 4.3.1/~/a/github/pmcbioc/pmc_result.duckdb]
#>    pmcid    keyword                       
#>    <chr>    <chr>                         
#>  1 10580555 Mastitis                      
#>  2 10580555 lncRNA                        
#>  3 10580555 mRNA                          
#>  4 10580555 Mammary gland epithelial cells
#>  5 10580555 RNA-seq                       
#>  6 10578971 Fracture healing              
#>  7 10578971 Fracture nonunion             
#>  8 10578971 Zucker rats                   
#>  9 10578971 Type 2 diabetes mellitus      
#> 10 10578971 fracture and of the healing   
#> # ℹ more rows

Keywords are not standardized, e.g, bioinformatic and bioinformatics are both used. Nonetheless, this provides some insight into areas where Bioconductor is useful; an analysis could associate keyword with year to model shifting areas of interest.

tbl(db, "keyword") |>
    count(keyword = tolower(keyword)) |>
    filter(n >= 100) |>
    arrange(desc(n)) |>
    collect() |>
    DT::datatable()

The refpmid table summarizes the PubMed identifiers of citations in each article.

tbl(db, "refpmid")
#> # Source:   table<refpmid> [?? x 2]
#> # Database: DuckDB v0.9.1 [root@Darwin 21.6.0:R 4.3.1/~/a/github/pmcbioc/pmc_result.duckdb]
#>    pmcid    refpmid 
#>    <chr>    <chr>   
#>  1 10580555 32051050
#>  2 10580555 32257646
#>  3 10580555 17936907
#>  4 10580555 35021119
#>  5 10580555 31153885
#>  6 10580555 31606213
#>  7 10580555 34406447
#>  8 10580555 30683918
#>  9 10580555 29230038
#> 10 10580555 28626066
#> # ℹ more rows

One can ask which publications mentioning ‘Bioconductor’ are most cited by other publications mentioning ‘Bioconductor’

## FIXME: is this correct?
tbl(db, "refpmid") |>
    count(refpmid, sort = TRUE) |>
    right_join(tbl(db, "article"), by = c(refpmid = "pmid")) |>
    arrange(desc(n)) |>
    head(100) |>
    collect() |>
    select(n_citn = "n", title, journal, year) |>
    DT::datatable()
#> Warning: ORDER BY is ignored in subqueries without LIMIT
#>  Do you need to move arrange() later in the pipeline or use window_order() instead?

Querying individual records

The XML records are extremely rich. Individual records can be retrieved and queried. For instance for the fifth record

tbl(db, "article") |>
    filter(id == 5)
#> # Source:   SQL [1 x 6]
#> # Database: DuckDB v0.9.1 [root@Darwin 21.6.0:R 4.3.1/~/a/github/pmcbioc/pmc_result.duckdb]
#>      id pmcid    title                                       journal  year pmid 
#>   <int> <chr>    <chr>                                       <chr>   <int> <chr>
#> 1     5 10580564 Differential transcriptome response follow… Virolo…  2023 NA

we can use the index table to query the full XML record using XPath to count the number references

xml_file <- "~/a/github/pmcbioc/pmc_result.xml"
tbl(db, "index") |>
    filter(id == 5) |>
    xml_xpath(
        "count(//article/back/ref-list/ref)",
        xml_file = xml_file
    )
#> [1] 79

The full XML of the first five records is

first_five <-
    tbl(db, "index") |>
    filter(id <= 5) |>
    xml_xpath(xml_file = xml_file)

These records can be queried using, e.g, XPath or the xmlChildren() and [[ of the XML package.

first_five["//article-meta//article-title"] |>
    XML::xmlValue()
#> [1] "Coexpression analysis of lncRNAs and mRNAs identifies potential regulatory long noncoding RNAs involved in the inflammatory effects of lipopolysaccharide on bovine mammary epithelial cells"
#> [2] "Unbiased gene expression analysis of the delayed fracture healing observed in Zucker diabetic fatty rats"                                                                                    
#> [3] "Clinical and biological significance of circulating miRNAs in chronic pancreatitis patients undergoing total pancreatectomy with islet autotransplantation"                                  
#> [4] "Katdetectr: an R/bioconductor package utilizing unsupervised changepoint analysis for robust kataegis detection"                                                                             
#> [5] "Differential transcriptome response following infection of porcine ileal enteroids with species A and C rotaviruses"

Final steps

Remember to disconnect from the database.

db_disconnect(db)
#> pmcbioc_db: ~/a/github/pmcbioc/pmc_result.duckdb
#> connected: FALSE
#> read_only: TRUE

Session information

sessionInfo()
#> R version 4.3.1 Patched (2023-10-10 r85312)
#> Platform: aarch64-apple-darwin21.6.0 (64-bit)
#> Running under: macOS Monterey 12.6.8
#> 
#> Matrix products: default
#> BLAS:   /Users/ma38727/bin/R-4-3-branch/lib/libRblas.dylib 
#> LAPACK: /Users/ma38727/bin/R-4-3-branch/lib/libRlapack.dylib;  LAPACK version 3.11.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/New_York
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggplot2_3.4.4 pmcbioc_0.0.4 dplyr_1.1.3  
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.7        utf8_1.2.3        generics_0.1.3    spdl_0.0.5       
#>  [5] stringi_1.7.12    digest_0.6.33     magrittr_2.0.3    evaluate_0.22    
#>  [9] grid_4.3.1        fastmap_1.1.1     blob_1.2.4        rprojroot_2.0.3  
#> [13] jsonlite_1.8.7    DBI_1.1.3         purrr_1.0.2       fansi_1.0.5      
#> [17] crosstalk_1.2.0   scales_1.2.1      XML_3.99-0.14     textshaping_0.3.7
#> [21] jquerylib_0.1.4   duckdb_0.9.1      cli_3.6.1         rlang_1.1.1      
#> [25] dbplyr_2.3.4      ellipsis_0.3.2    munsell_0.5.0     withr_2.5.1      
#> [29] cachem_1.0.8      yaml_2.3.7        tools_4.3.1       memoise_2.0.1    
#> [33] colorspace_2.1-0  DT_0.30           vctrs_0.6.4       R6_2.5.1         
#> [37] lifecycle_1.0.3   RcppSpdlog_0.0.14 stringr_1.5.0     htmlwidgets_1.6.2
#> [41] fs_1.6.3          ragg_1.2.6        pkgconfig_2.0.3   desc_1.4.2       
#> [45] pkgdown_2.0.7     pillar_1.9.0      bslib_0.5.1       gtable_0.3.4     
#> [49] glue_1.6.2        Rcpp_1.0.11       systemfonts_1.0.5 xfun_0.40        
#> [53] tibble_3.2.1      tidyselect_1.2.0  knitr_1.44        farver_2.1.1     
#> [57] htmltools_0.5.6.1 labeling_0.4.3    rmarkdown_2.25    compiler_4.3.1