Compiled: 2023-12-12
The ‘grantpubcite’ package can be used to query the NIH Reporter database for funded grants and the publications associated with those grants. The citation history of publications can be discovered using iCite.
The ‘grantpubcite’ package help pages and resources make extensive use of ‘tidyverse’ concepts. Core tidyverse functions used in the articles include:
- tibble() – representation of a data.frame, with better display of long and wide data frames. tribble() constructs a tibble in a way that makes the relationship between data across rows more transparent.
- glimpse() – provides a quick look into the columns and data in the tibble by transposing the tibble and displaying each ‘column’ on a single line.
- select() – column selection.
- filter(), slice() – row selection.
- pull() – extract a single column as a vector.
- mutate() – column transformation.
- count() – count occurrences in one or more columns.
- arrange() – order rows by values in one or more columns.
- distinct() – reduce a tibble to only unique rows.
- group_by() – perform computations on groups defined by one or several columns.
- summarize() – calculate summary statistics for groups.
- left_join(), right_join() – merge two tibbles based on shared columns, preserving all rows in the first (left_join()) or second (right_join()) tibble.
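As a brief illustration of how several of these verbs combine, the following sketch uses two small made-up tibbles. The column names and values are hypothetical and unrelated to NIH Reporter; the only assumption is that dplyr is installed.

```r
library(dplyr)

## hypothetical toy data, for illustration only
grants <- tribble(
    ~grant, ~year, ~amount,
    "A",    2019,  100,
    "A",    2020,  120,
    "B",    2019,   80,
    "B",    2019,   80   # deliberate duplicate row
)
investigators <- tribble(
    ~grant, ~pi_name,
    "A",    "Smith",
    "C",    "Jones"
)

## drop the duplicate row, add a derived column, order by amount
recent <-
    grants |>
    distinct() |>
    filter(year >= 2019) |>
    mutate(amount_k = amount / 1000) |>
    arrange(desc(amount))

## one row per grant, with the number of funded years and the total amount
totals <-
    grants |>
    distinct() |>
    group_by(grant) |>
    summarize(n_years = n(), total = sum(amount))

## left_join() keeps every row of 'totals'; grant "B" has no match in
## 'investigators', so its pi_name is NA
joined <- left_join(totals, investigators, by = "grant")
joined
```

Swapping left_join() for right_join() would instead keep every row of the second tibble, so grant "C" would appear with missing totals.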
In an interactive session, a useful way to visually navigate the
sometimes large tibbles is to use the DT package, e.g., DT::datatable(projects).
Installation and loading
Install the development version of grantpubcite from GitHub.
if (!nzchar(system.file(package = "remotes")))
    install.packages("remotes", repos = "https://cran.r-project.org")
remotes::install_github("mtmorgan/grantpubcite")
Load the library and other packages to be used in the vignette.
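A minimal sketch of this step, assuming that the packages to attach are the two listed under ‘other attached packages’ in the session information at the end of this article (grantpubcite and dplyr):

```r
library(grantpubcite)  # reporter_projects(), reporter_publications(), icite()
library(dplyr)         # glimpse(), pull(), slice() and the other verbs used below
```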
NIH Reporter projects
reporter_projects() queries the ‘projects’ endpoint; see
the technical description of the NIH Reporter project search
API for details, paying particular attention to the ‘schema’ present
in the executable example.
For illustration we start by identifying Funding Opportunity Announcements (FOAs) that might be of interest, in this case current FOAs under the Information Technology in Cancer Research (ITCR) program.
foas <- c( # one or more criteria, e.g., foa number(s)
    "PAR-15-334", # ITCR (R21)
    "PAR-15-332", # ITCR Early-Stage Development (U01)
    "PAR-15-331", # ITCR Advanced Development (U24)
    "PAR-15-333"  # ITCR Sustained Support (U24)
)
Use reporter_projects() by providing criteria to be used
to query the NIH Reporter projects endpoint. The criteria are provided
as named arguments to reporter_projects(). To query for
projects awarded under the FOAs of interest, use the argument
foa =. Initially, use the limit = 1L argument
to retrieve just a single record.
reporter_projects(foa = foas, limit = 1L) |>
glimpse()
#> Rows: 1
#> Columns: 44
#> $ appl_id <int> 9676260
#> $ subproject_id <lgl> NA
#> $ fiscal_year <int> 2019
#> $ project_num <chr> "5U24CA220242-02"
#> $ project_serial_num <chr> "CA220242"
#> $ organization <df[,17]> <data.frame[1 x 17]>
#> $ award_type <chr> "5"
#> $ activity_code <chr> "U24"
#> $ award_amount <int> 550158
#> $ is_active <lgl> FALSE
#> $ project_num_split <df[,7]> <data.frame[1 x 7]>
#> $ principal_investigators <list> [<data.frame[1 x 7]>]
#> $ contact_pi_name <chr> "ABYZOV, ALEXEJ"
#> $ program_officers <list> [<data.frame[1 x 4]>]
#> $ agency_ic_admin <df[,3]> <data.frame[1 x 3]>
#> $ agency_ic_fundings <list> [<data.frame[1 x 5]>]
#> $ cong_dist <chr> "MN-01"
#> $ spending_categories <list> <108, 132, 1393, 276, 320, 3070>
#> $ project_start_date <date> 2018-05-01
#> $ project_end_date <date> 2023-04-30
#> $ organization_type <df[,3]> <data.frame[1 x 3]>
#> $ opportunity_number <chr> "PAR-15-331"
#> $ full_study_section <df[,6]> <data.frame[1 x 6]>
#> $ award_notice_date <date> 2019-05-01
#> $ is_new <lgl> FALSE
#> $ mechanism_code_dc <chr> "OR"
#> $ core_project_num <chr> "U24CA220242"
#> $ terms <chr> "<Aftercare><post treatment><After-Treatment>…
#> $ pref_terms <chr> "Address;Aftercare;Area;Attention;Basic Scien…
#> $ abstract_text <chr> "Project Summary/Abstract\n Progress in techn…
#> $ project_title <chr> "Detection of somatic, subclonal and mosaic C…
#> $ phr_text <chr> "Narrative\nThe analytical tools that will be…
#> $ spending_categories_desc <chr> "Biotechnology; Cancer; Cancer Genomics; …
#> $ agency_code <chr> "NIH"
#> $ covid_response <lgl> NA
#> $ arra_funded <chr> "N"
#> $ budget_start <chr> "2019-05-01T12:05:00Z"
#> $ budget_end <chr> "2020-04-30T12:04:00Z"
#> $ cfda_code <chr> "399"
#> $ funding_mechanism <chr> "Other Research-Related"
#> $ direct_cost_amt <int> 348999
#> $ indirect_cost_amt <int> 201159
#> $ project_detail_url <chr> "https://reporter.nih.gov/project-details/967…
#> $ date_added <chr> "2019-05-04T07:05:16Z"
This shows that the return value of reporter_projects() is a tibble
(the tidyverse representation of a data.frame) with a single row
corresponding to the requested record, and all possible fields
returned by the query.
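Several fields in the glimpse() output, such as principal_investigators, are list columns holding one data.frame per row. A minimal sketch of flattening such a column is shown here on a hand-built stand-in for a single record, so that it runs without querying NIH Reporter. The inner column names (full_name, is_contact_pi) are assumptions made for illustration; inspect the real record to see the actual names.

```r
library(dplyr)

## hypothetical stand-in for one reporter_projects() record; the inner
## column names are assumptions, not taken from the API
record <- tibble(
    core_project_num = "U24CA220242",
    principal_investigators = list(
        data.frame(full_name = "ABYZOV, ALEXEJ", is_contact_pi = TRUE)
    )
)

## pull() returns the list column as a plain list of data.frames;
## bind_rows() stacks the elements into a single table
pis <-
    record |>
    pull(principal_investigators) |>
    bind_rows() |>
    as_tibble()
pis
```

With several records, the same pipeline returns one row per investigator across all records, although the link back to core_project_num is lost unless it is added to each element first.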
Inspect the fields for those that might be of interest, and define a variable to reference these.
include_fields <- c(
"opportunity_number",
"core_project_num",
"fiscal_year",
"award_amount",
"contact_pi_name",
"project_title",
"project_start_date",
"project_end_date"
)
Now execute the reporter_projects() query on all of our
FOAs of interest, including only the fields of interest in the
response.
projects <- reporter_projects(foa = foas, include_fields = include_fields)
projects
#> # A tibble: 189 × 8
#> opportunity_number core_project_num fiscal_year award_amount contact_pi_name
#> <chr> <chr> <int> <int> <chr>
#> 1 PAR-15-331 U24CA220242 2019 550158 ABYZOV, ALEXEJ
#> 2 PAR-15-331 U24CA220242 2022 561369 ABYZOV, ALEXEJ
#> 3 PAR-15-331 U24CA220242 2021 572828 ABYZOV, ALEXEJ
#> 4 PAR-15-331 U24CA220242 2018 559916 ABYZOV, ALEXEJ
#> 5 PAR-15-331 U24CA220242 2020 383715 ABYZOV, ALEXEJ
#> 6 PAR-15-334 R21CA220352 2019 156788 ARNOLD, COREY W…
#> 7 PAR-15-334 R21CA220352 2018 195568 ARNOLD, COREY W…
#> 8 PAR-15-332 U01CA242871 2019 378519 BAKAS, SPYRIDON
#> 9 PAR-15-332 U01CA242871 2020 360393 BAKAS, SPYRIDON
#> 10 PAR-15-332 U01CA242871 2021 357972 BAKAS, SPYRIDON
#> # ℹ 179 more rows
#> # ℹ 3 more variables: project_title <chr>, project_start_date <date>,
#> # project_end_date <date>
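The displayed rows suggest that the result contains one row per project per fiscal year, so a natural next step is to summarize awards by project. The sketch below re-enters the ten rows shown above by hand (an assumption made so that the example runs without a network query); with a live result, replace projects_shown with projects.

```r
library(dplyr)

## the ten rows displayed above, re-entered by hand
projects_shown <- tribble(
    ~opportunity_number, ~core_project_num, ~fiscal_year, ~award_amount,
    "PAR-15-331", "U24CA220242", 2019L, 550158L,
    "PAR-15-331", "U24CA220242", 2022L, 561369L,
    "PAR-15-331", "U24CA220242", 2021L, 572828L,
    "PAR-15-331", "U24CA220242", 2018L, 559916L,
    "PAR-15-331", "U24CA220242", 2020L, 383715L,
    "PAR-15-334", "R21CA220352", 2019L, 156788L,
    "PAR-15-334", "R21CA220352", 2018L, 195568L,
    "PAR-15-332", "U01CA242871", 2019L, 378519L,
    "PAR-15-332", "U01CA242871", 2020L, 360393L,
    "PAR-15-332", "U01CA242871", 2021L, 357972L
)

## number of funded years and total award for each project
award_totals <-
    projects_shown |>
    group_by(opportunity_number, core_project_num) |>
    summarize(
        n_years = n(),
        total_award = sum(award_amount),
        .groups = "drop"
    ) |>
    arrange(desc(total_award))
award_totals
```

Grouping by opportunity_number alone would instead give totals per FOA.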
NIH Reporter publications
The NIH Reporter publication
search API provides a way to retrieve publications reported as
grant-supported. The search criteria are complicated, but the return
value is always a tibble with columns coreproject, pmid, and applid.
Here we search for all publications from the projects funded by
current ITCR FOAs.
core_project_nums <-
pull(projects, "core_project_num") |>
unique()
publications <- reporter_publications(core_project_nums = core_project_nums)
publications
#> # A tibble: 982 × 3
#> coreproject pmid applid
#> <chr> <int> <int>
#> 1 U24CA237719 31907209 10620674
#> 2 U24CA237719 31779674 10620674
#> 3 U24CA237719 35072136 10620674
#> 4 U24CA237719 35366592 10620674
#> 5 U24CA237719 36949070 10620674
#> 6 U24CA237719 31645350 10620674
#> 7 U24CA237719 31462330 10620674
#> 8 U24CA237719 31796060 10620674
#> 9 U24CA237719 32665297 10620674
#> 10 U24CA237719 32644817 10620674
#> # ℹ 972 more rows
Note that some projects were funded by previous ITCR FOAs, so that publications can appear to be from ‘before’ funding under the current FOAs.
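Two quick checks on a tibble of this shape are the number of publications per project, and whether any publication is reported by more than one project. The sketch below uses a made-up stand-in with the same three columns; the project identifiers and PMIDs are hypothetical, so substitute publications when working with a live result.

```r
library(dplyr)

## hypothetical stand-in with the same columns as 'publications'
pubs <- tribble(
    ~coreproject, ~pmid, ~applid,
    "P1",         1L,    101L,
    "P1",         2L,    101L,
    "P1",         3L,    101L,
    "P2",         3L,    202L,
    "P2",         4L,    202L
)

## publications per project, most productive first
per_project <- count(pubs, coreproject, sort = TRUE)
per_project

## publications reported by more than one project
shared <-
    pubs |>
    count(pmid, name = "n_projects") |>
    filter(n_projects > 1L)
shared
```

Publications shared between projects are one possible reason for a later query keyed on pmid to return fewer rows than the publications tibble contains.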
iCite publication citations
The NIH iCite resource and API can be used to map PMIDs from NIH Reporter to detailed information about publications, including citations and derived citation measures that account for, e.g., time since publication.
Discover available fields by querying iCite using the first row of
the publications tibble obtained from NIH Reporter.
icite() takes as its first argument any tibble, provided it
has a column pmid.
## which fields are available in icite?
icite(slice(publications, 1L)) |>
glimpse()
#> Rows: 1
#> Columns: 25
#> $ pmid <dbl> 31907209
#> $ year <dbl> 2020
#> $ title <chr> "pVACtools: A Computational Toolkit to Ide…
#> $ authors <chr> "Jasreet Hundal, Susanna Kiwala, Joshua Mc…
#> $ journal <chr> "Cancer Immunol Res"
#> $ is_research_article <chr> "Yes"
#> $ relative_citation_ratio <dbl> 5.18
#> $ nih_percentile <dbl> 93.6
#> $ human <dbl> 1
#> $ animal <dbl> 0
#> $ molecular_cellular <dbl> 0
#> $ apt <dbl> 0.95
#> $ is_clinical <chr> "No"
#> $ citation_count <dbl> 95
#> $ citations_per_year <dbl> 31.66667
#> $ expected_citations_per_year <dbl> 6.117195
#> $ field_citation_rate <dbl> 11.1856
#> $ provisional <chr> "No"
#> $ x_coord <dbl> 0
#> $ y_coord <dbl> 1
#> $ cited_by_clin <chr> "37563240 37739939"
#> $ cited_by <chr> "35646870 34927080 33262196 34529669 35611…
#> $ references <chr> "23396013 29170503 31243155 19906713 28694…
#> $ doi <chr> "10.1158/2326-6066.CIR-19-0401"
#> $ last_modified <chr> "11/25/2023, 16:43:52"
Identify fields of interest, and query for all publications associated with the projects.
include_fields <- c(
"pmid", "year", "citation_count", "relative_citation_ratio",
"doi"
)
icite(publications, include_fields)
#> # A tibble: 924 × 5
#> pmid year citation_count relative_citation_ratio doi
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 19898898 2010 32 0.78 10.1245/s10434-009-079…
#> 2 24925914 2014 2728 80.7 10.1126/science.1254257
#> 3 24931973 2014 42 1.25 10.1093/bioinformatics…
#> 4 25086664 2014 93 2.46 10.1038/ng.3051
#> 5 25714012 2015 10 0.35 10.18632/oncotarget.29…
#> 6 26083491 2015 26 0.8 10.1371/journal.pone.0…
#> 7 26463000 2016 34 1.12 10.1093/bib/bbv080
#> 8 26594663 2015 151 4.68 10.1016/j.cels.2015.10…
#> 9 26638175 2015 155 4.33 10.1016/j.molcel.2015.…
#> 10 26644347 2015 28 0.81 10.1038/ncomms9726
#> # ℹ 914 more rows
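Because the citation tibble is keyed on pmid, it can be joined back to the publications and projects to summarize citations by project, and to flag publications that predate funding (see the note in the previous section). The following is a sketch on made-up stand-ins: all values are hypothetical, and the first_fiscal_year column is an assumed per-project summary that could be derived from the fiscal_year field of the projects tibble.

```r
library(dplyr)

## hypothetical stand-ins for the three tibbles
pubs <- tribble(
    ~coreproject, ~pmid,
    "P1", 1, "P1", 2, "P2", 2, "P2", 3
)
cites <- tribble(
    ~pmid, ~year, ~citation_count, ~relative_citation_ratio,
    1,     2016,  10,              0.5,
    2,     2018,  40,              2.0,
    3,     2020,   4,              NA
)
starts <- tribble(
    ~coreproject, ~first_fiscal_year,
    "P1",         2017,
    "P2",         2018
)

## one row per project-publication pair, with citation and funding data
pub_cites <-
    pubs |>
    left_join(cites, by = "pmid") |>
    left_join(starts, by = "coreproject")

## citation summary per project; recent publications may lack a
## relative citation ratio, so NA values are dropped from the median
by_project <-
    pub_cites |>
    group_by(coreproject) |>
    summarize(
        n_pubs = n(),
        total_citations = sum(citation_count),
        median_rcr = median(relative_citation_ratio, na.rm = TRUE)
    )
by_project

## publications that appear to predate the project's first fiscal year
early <- filter(pub_cites, year < first_fiscal_year)
early
```

A publication shared by two projects is counted once for each project in this summary, which is usually the intent when crediting projects but inflates totals summed across projects.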
Session information
The following summarizes packages in use when this article was compiled.
sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] grantpubcite_0.0.3 dplyr_1.1.4
#>
#> loaded via a namespace (and not attached):
#> [1] bit_4.0.5 jsonlite_1.8.8 crayon_1.5.2 compiler_4.3.2
#> [5] tidyselect_1.2.0 stringr_1.5.1 parallel_4.3.2 jquerylib_0.1.4
#> [9] systemfonts_1.0.5 textshaping_0.3.7 yaml_2.3.7 fastmap_1.1.1
#> [13] readr_2.1.4 R6_2.5.1 rjsoncons_1.0.1 generics_0.1.3
#> [17] curl_5.2.0 knitr_1.45 htmlwidgets_1.6.4 tibble_3.2.1
#> [21] desc_1.4.3 tzdb_0.4.0 bslib_0.6.1 pillar_1.9.0
#> [25] rlang_1.1.2 utf8_1.2.4 DT_0.31 cachem_1.0.8
#> [29] stringi_1.8.2 xfun_0.41 fs_1.6.3 sass_0.4.8
#> [33] bit64_4.0.5 memoise_2.0.1 cli_3.6.1 withr_2.5.2
#> [37] pkgdown_2.0.7 magrittr_2.0.3 digest_0.6.33 vroom_1.6.5
#> [41] hms_1.1.3 lifecycle_1.0.4 vctrs_0.6.5 evaluate_0.23
#> [45] glue_1.6.2 ragg_1.2.6 fansi_1.0.6 httr_1.4.7
#> [49] rmarkdown_2.25 purrr_1.0.2 tools_4.3.2 pkgconfig_2.0.3
#> [53] htmltools_0.5.7