Introduction to 'grantpubcite'

Compiled: 2023-12-12

The ‘grantpubcite’ package can be used to query the NIH Reporter database for funded grants and the publications associated with those grants. The citation history of publications can be discovered using iCite.

The ‘grantpubcite’ package help pages and resources make extensive use of ‘tidyverse’ concepts. Core tidyverse functions used in the articles include:

tibble() – representation of a data.frame, with better display of long and wide data frames. tribble() constructs a tibble in a way that makes the relationship between data across rows more transparent.
glimpse() – provides a quick look at the columns and data in a tibble by transposing it and displaying each ‘column’ on a single line.
select() – column selection.
filter(), slice() – row selection.
pull() – extract a single column as a vector.
mutate() – column transformation.
count() – count occurrences in one or more columns.
arrange() – order rows by values in one or more columns.
distinct() – reduce a tibble to only unique rows.
group_by() – perform computations on groups defined by one or several columns.
summarize() – calculate summary statistics for groups.
left_join(), right_join() – merge two tibbles based on shared columns, preserving all rows in the first (left_join()) or second (right_join()) tibble.
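
As a rough sketch of how a few of these verbs combine, the example below builds a small tibble by hand and applies row selection, column selection, counting, and extraction. The object name projects_toy is hypothetical; its values are copied from the first rows of the projects output shown later in this article, and no query is made.

```r
library(dplyr)

## Toy data (hypothetical object name) copied from rows of the `projects`
## output shown later in this article; only three columns are kept.
projects_toy <- tribble(
    ~core_project_num, ~fiscal_year, ~award_amount,
    "U24CA220242",     2019L,        550158L,
    "U24CA220242",     2022L,        561369L,
    "R21CA220352",     2019L,        156788L,
    "R21CA220352",     2018L,        195568L,
    "U01CA242871",     2019L,        378519L
)

## filter() selects rows, select() selects columns
fy2019 <-
    projects_toy |>
    filter(fiscal_year == 2019L) |>
    select(core_project_num, award_amount)

## count() tallies rows per project; arrange() orders by that tally
years_per_project <-
    projects_toy |>
    count(core_project_num) |>
    arrange(desc(n))

## distinct() removes duplicates; pull() returns a plain character vector
project_ids <-
    projects_toy |>
    distinct(core_project_num) |>
    pull(core_project_num)
```

Each step returns a tibble (or, for pull(), a vector), so intermediate results can be inspected by printing them or passing them to glimpse().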

In an interactive session, a useful way to visually navigate the sometimes large tibbles is to use the DT package, e.g., DT::datatable(projects).

Installation and loading

Install the development version of grantpubcite from GitHub.

## install 'remotes' only if it is not already available
if (!nzchar(system.file(package = "remotes")))
    install.packages("remotes", repos = "https://cran.r-project.org")
remotes::install_github("mtmorgan/grantpubcite")

Load the library and other packages to be used in the vignette.

library(grantpubcite)

NIH Reporter projects

reporter_projects() queries the ‘projects’ endpoint; see the technical description of the NIH Reporter project search API for details, paying particular attention to the ‘schema’ present in the executable example.

For illustration we start by identifying Funding Opportunity Announcements (FOA) that might be of interest, in this case current FOA under the Information Technology in Cancer Research (ITCR) program.

foas <- c(        # one or more criteria, e.g., foa number(s)
    "PAR-15-334", # ITCR (R21)
    "PAR-15-332", # ITCR Early-Stage Development (U01)
    "PAR-15-331", # ITCR Advanced Development (U24)
    "PAR-15-333"  # ITCR Sustained Support (U24)
)

Use reporter_projects() by providing criteria to be used to query the NIH Reporter projects endpoint. The criteria are provided as named arguments to reporter_projects(). To query for projects awarded under the FOA of interest, use the argument foa =. Initially, use the limit = 1 argument to retrieve just a single record.

reporter_projects(foa = foas, limit = 1L) |>
    glimpse()
#> Rows: 1
#> Columns: 44
#> $ appl_id                  <int> 9676260
#> $ subproject_id            <lgl> NA
#> $ fiscal_year              <int> 2019
#> $ project_num              <chr> "5U24CA220242-02"
#> $ project_serial_num       <chr> "CA220242"
#> $ organization             <df[,17]> <data.frame[1 x 17]>
#> $ award_type               <chr> "5"
#> $ activity_code            <chr> "U24"
#> $ award_amount             <int> 550158
#> $ is_active                <lgl> FALSE
#> $ project_num_split        <df[,7]> <data.frame[1 x 7]>
#> $ principal_investigators  <list> [<data.frame[1 x 7]>]
#> $ contact_pi_name          <chr> "ABYZOV, ALEXEJ"
#> $ program_officers         <list> [<data.frame[1 x 4]>]
#> $ agency_ic_admin          <df[,3]> <data.frame[1 x 3]>
#> $ agency_ic_fundings       <list> [<data.frame[1 x 5]>]
#> $ cong_dist                <chr> "MN-01"
#> $ spending_categories      <list> <108, 132, 1393, 276, 320, 3070>
#> $ project_start_date       <date> 2018-05-01
#> $ project_end_date         <date> 2023-04-30
#> $ organization_type        <df[,3]> <data.frame[1 x 3]>
#> $ opportunity_number       <chr> "PAR-15-331"
#> $ full_study_section       <df[,6]> <data.frame[1 x 6]>
#> $ award_notice_date        <date> 2019-05-01
#> $ is_new                   <lgl> FALSE
#> $ mechanism_code_dc        <chr> "OR"
#> $ core_project_num         <chr> "U24CA220242"
#> $ terms                    <chr> "<Aftercare><post treatment><After-Treatment>…
#> $ pref_terms               <chr> "Address;Aftercare;Area;Attention;Basic Scien…
#> $ abstract_text            <chr> "Project Summary/Abstract\n Progress in techn…
#> $ project_title            <chr> "Detection of somatic, subclonal and mosaic C…
#> $ phr_text                 <chr> "Narrative\nThe analytical tools that will be…
#> $ spending_categories_desc <chr> "Biotechnology; Cancer; Cancer Genomics; …
#> $ agency_code              <chr> "NIH"
#> $ covid_response           <lgl> NA
#> $ arra_funded              <chr> "N"
#> $ budget_start             <chr> "2019-05-01T12:05:00Z"
#> $ budget_end               <chr> "2020-04-30T12:04:00Z"
#> $ cfda_code                <chr> "399"
#> $ funding_mechanism        <chr> "Other Research-Related"
#> $ direct_cost_amt          <int> 348999
#> $ indirect_cost_amt        <int> 201159
#> $ project_detail_url       <chr> "https://reporter.nih.gov/project-details/967…
#> $ date_added               <chr> "2019-05-04T07:05:16Z"

This shows that the return value of reporter_projects() is a tibble (the tidyverse representation of a data.frame) with a single row corresponding to the requested record, and all possible fields returned by the query.

Inspect the fields for those that might be of interest, and define a variable to reference these.

include_fields <- c(
    "opportunity_number",
    "core_project_num",
    "fiscal_year",
    "award_amount",
    "contact_pi_name",
    "project_title",
    "project_start_date",
    "project_end_date"
)

Now execute the reporter_projects() query on all of our FOA of interest, including only the fields of interest in the response.

projects <- reporter_projects(foa = foas, include_fields = include_fields)
projects
#> # A tibble: 189 × 8
#>    opportunity_number core_project_num fiscal_year award_amount contact_pi_name 
#>    <chr>              <chr>                  <int>        <int> <chr>           
#>  1 PAR-15-331         U24CA220242             2019       550158 ABYZOV, ALEXEJ  
#>  2 PAR-15-331         U24CA220242             2022       561369 ABYZOV, ALEXEJ  
#>  3 PAR-15-331         U24CA220242             2021       572828 ABYZOV, ALEXEJ  
#>  4 PAR-15-331         U24CA220242             2018       559916 ABYZOV, ALEXEJ  
#>  5 PAR-15-331         U24CA220242             2020       383715 ABYZOV, ALEXEJ  
#>  6 PAR-15-334         R21CA220352             2019       156788 ARNOLD, COREY W…
#>  7 PAR-15-334         R21CA220352             2018       195568 ARNOLD, COREY W…
#>  8 PAR-15-332         U01CA242871             2019       378519 BAKAS, SPYRIDON 
#>  9 PAR-15-332         U01CA242871             2020       360393 BAKAS, SPYRIDON 
#> 10 PAR-15-332         U01CA242871             2021       357972 BAKAS, SPYRIDON 
#> # ℹ 179 more rows
#> # ℹ 3 more variables: project_title <chr>, project_start_date <date>,
#> #   project_end_date <date>
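
Because projects has one row per project per fiscal year, a per-project summary needs a grouping step. The sketch below illustrates this with group_by() and summarize() on a hand-built tibble; projects_head is a hypothetical name, and its values are copied from the ten rows printed above rather than retrieved from NIH Reporter.

```r
library(dplyr)

## The ten rows printed above, restricted to three columns
## (hypothetical object name; not the result of a live query).
projects_head <- tribble(
    ~core_project_num, ~fiscal_year, ~award_amount,
    "U24CA220242",     2019L,        550158L,
    "U24CA220242",     2022L,        561369L,
    "U24CA220242",     2021L,        572828L,
    "U24CA220242",     2018L,        559916L,
    "U24CA220242",     2020L,        383715L,
    "R21CA220352",     2019L,        156788L,
    "R21CA220352",     2018L,        195568L,
    "U01CA242871",     2019L,        378519L,
    "U01CA242871",     2020L,        360393L,
    "U01CA242871",     2021L,        357972L
)

## one row per project: funded years, first fiscal year, total award
funding <-
    projects_head |>
    group_by(core_project_num) |>
    summarize(
        n_years = n(),
        first_year = min(fiscal_year),
        total_award = sum(award_amount)
    ) |>
    arrange(desc(total_award))
funding
```

The same pipeline should apply to the full projects tibble, although totals then cover only the fiscal years funded under the FOA included in the query.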

NIH Reporter publications

The NIH Reporter publication search API provides a way to retrieve publications reported as grant-supported. The search criteria are complicated, but the return value is always a tibble with columns coreproject, pmid, and applid. Here we search for all publications from the projects funded by current ITCR FOAs.

core_project_nums <-
    pull(projects, "core_project_num") |>
    unique()
publications <- reporter_publications(core_project_nums = core_project_nums)
publications
#> # A tibble: 982 × 3
#>    coreproject     pmid   applid
#>    <chr>          <int>    <int>
#>  1 U24CA237719 31907209 10620674
#>  2 U24CA237719 31779674 10620674
#>  3 U24CA237719 35072136 10620674
#>  4 U24CA237719 35366592 10620674
#>  5 U24CA237719 36949070 10620674
#>  6 U24CA237719 31645350 10620674
#>  7 U24CA237719 31462330 10620674
#>  8 U24CA237719 31796060 10620674
#>  9 U24CA237719 32665297 10620674
#> 10 U24CA237719 32644817 10620674
#> # ℹ 972 more rows

Note that some projects were funded by previous ITCR FOA, so that publications can appear to be from ‘before’ funding under the current FOA.
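
Publications can be related back to projects by joining on the project number, which is named core_project_num in projects and coreproject in publications. The following is a minimal sketch under stated assumptions: projects_toy and publications_toy are hypothetical hand-built tibbles, and the PMIDs 1, 2, and 3 are placeholders invented for illustration, not real identifiers or query results.

```r
library(dplyr)

## Project numbers and fiscal years copied from the `projects` output;
## one row per project per fiscal year.
projects_toy <- tribble(
    ~core_project_num, ~fiscal_year,
    "U24CA220242",     2019L,
    "U24CA220242",     2022L,
    "R21CA220352",     2019L,
    "U01CA242871",     2019L
)

## PLACEHOLDER publications: the pmid values are invented for illustration.
publications_toy <- tribble(
    ~coreproject,  ~pmid,
    "U24CA220242", 1L,
    "U24CA220242", 2L,
    "R21CA220352", 3L
)

pubs_per_project <-
    projects_toy |>
    ## collapse to one row per project so publications are not duplicated
    ## once for every fiscal year
    distinct(core_project_num) |>
    ## keep every project, including those without publications
    left_join(publications_toy, by = c(core_project_num = "coreproject")) |>
    group_by(core_project_num) |>
    ## unmatched projects have pmid == NA and so contribute 0
    summarize(n_pubs = sum(!is.na(pmid)))
pubs_per_project
```

Using left_join() rather than an inner join keeps projects that have not reported any publications, which appear with n_pubs equal to 0.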

iCite publication citations

The NIH iCite resource and API can be used to map PMIDs from NIH Reporter to detailed information about publications, including citations and derived citation measures that account for, e.g., time since publication.

Discover available fields by querying iCite using the first row of the publications tibble obtained from NIH Reporter. icite() takes as its first argument any tibble, provided it has a column pmid.

## which fields are available in icite?
icite(slice(publications, 1L)) |>
    glimpse()
#> Rows: 1
#> Columns: 25
#> $ pmid                        <dbl> 31907209
#> $ year                        <dbl> 2020
#> $ title                       <chr> "pVACtools: A Computational Toolkit to Ide…
#> $ authors                     <chr> "Jasreet Hundal, Susanna Kiwala, Joshua Mc…
#> $ journal                     <chr> "Cancer Immunol Res"
#> $ is_research_article         <chr> "Yes"
#> $ relative_citation_ratio     <dbl> 5.18
#> $ nih_percentile              <dbl> 93.6
#> $ human                       <dbl> 1
#> $ animal                      <dbl> 0
#> $ molecular_cellular          <dbl> 0
#> $ apt                         <dbl> 0.95
#> $ is_clinical                 <chr> "No"
#> $ citation_count              <dbl> 95
#> $ citations_per_year          <dbl> 31.66667
#> $ expected_citations_per_year <dbl> 6.117195
#> $ field_citation_rate         <dbl> 11.1856
#> $ provisional                 <chr> "No"
#> $ x_coord                     <dbl> 0
#> $ y_coord                     <dbl> 1
#> $ cited_by_clin               <chr> "37563240 37739939"
#> $ cited_by                    <chr> "35646870 34927080 33262196 34529669 35611…
#> $ references                  <chr> "23396013 29170503 31243155 19906713 28694…
#> $ doi                         <chr> "10.1158/2326-6066.CIR-19-0401"
#> $ last_modified               <chr> "11/25/2023, 16:43:52"
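
In the output above, cited_by_clin, cited_by, and references appear as single character strings of space-separated PMIDs. One possible way to turn such a string into a vector of identifiers is sketched below in base R; split_pmids() is a hypothetical helper written for this illustration, not a function provided by grantpubcite.

```r
## Value copied from the `cited_by_clin` field in the output above.
cited_by_clin <- "37563240 37739939"

## Hypothetical helper: split each string on whitespace and convert the
## pieces to integer PMIDs. Returns a list with one element per input string.
split_pmids <- function(x) {
    lapply(strsplit(x, "[[:space:]]+"), as.integer)
}

clin_pmids <- split_pmids(cited_by_clin)[[1]]
clin_pmids
length(clin_pmids)   # number of clinical articles citing this publication
```

Because the helper returns a list, it could also be applied to a whole column, e.g., inside mutate(), to create a list-column of citing PMIDs.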

Identify fields of interest, and query for all publications.

include_fields <- c(
    "pmid", "year", "citation_count", "relative_citation_ratio",
    "doi"
)
icite(publications, include_fields)
#> # A tibble: 924 × 5
#>        pmid  year citation_count relative_citation_ratio doi                    
#>       <dbl> <dbl>          <dbl>                   <dbl> <chr>                  
#>  1 19898898  2010             32                    0.78 10.1245/s10434-009-079…
#>  2 24925914  2014           2728                   80.7  10.1126/science.1254257
#>  3 24931973  2014             42                    1.25 10.1093/bioinformatics…
#>  4 25086664  2014             93                    2.46 10.1038/ng.3051        
#>  5 25714012  2015             10                    0.35 10.18632/oncotarget.29…
#>  6 26083491  2015             26                    0.8  10.1371/journal.pone.0…
#>  7 26463000  2016             34                    1.12 10.1093/bib/bbv080     
#>  8 26594663  2015            151                    4.68 10.1016/j.cels.2015.10…
#>  9 26638175  2015            155                    4.33 10.1016/j.molcel.2015.…
#> 10 26644347  2015             28                    0.81 10.1038/ncomms9726     
#> # ℹ 914 more rows
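
Citation measures can then be summarized with the same tidyverse verbs. As an illustration only, the sketch below groups the ten rows printed above by publication year; citations_head is a hypothetical name, the values are typed in from the printed output (which may be rounded for display), and no request is made to iCite.

```r
library(dplyr)

## The ten rows printed above, without the doi column
## (hypothetical object name; values as displayed, possibly rounded).
citations_head <- tribble(
    ~pmid,    ~year, ~citation_count, ~relative_citation_ratio,
    19898898, 2010,  32,              0.78,
    24925914, 2014,  2728,            80.7,
    24931973, 2014,  42,              1.25,
    25086664, 2014,  93,              2.46,
    25714012, 2015,  10,              0.35,
    26083491, 2015,  26,              0.8,
    26463000, 2016,  34,              1.12,
    26594663, 2015,  151,             4.68,
    26638175, 2015,  155,             4.33,
    26644347, 2015,  28,              0.81
)

## publications, total citations, and median RCR per publication year
by_year <-
    citations_head |>
    group_by(year) |>
    summarize(
        n_pubs = n(),
        total_citations = sum(citation_count),
        median_rcr = median(relative_citation_ratio)
    )
by_year
```

The median is used here instead of the mean because a single highly cited publication (relative citation ratio 80.7 in these rows) would otherwise dominate the summary for its year.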

Next steps

See the Case Studies articles for examples of working with this data.

Session information

The following summarizes packages in use when this article was compiled.

sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] grantpubcite_0.0.3 dplyr_1.1.4       
#> 
#> loaded via a namespace (and not attached):
#>  [1] bit_4.0.5         jsonlite_1.8.8    crayon_1.5.2      compiler_4.3.2   
#>  [5] tidyselect_1.2.0  stringr_1.5.1     parallel_4.3.2    jquerylib_0.1.4  
#>  [9] systemfonts_1.0.5 textshaping_0.3.7 yaml_2.3.7        fastmap_1.1.1    
#> [13] readr_2.1.4       R6_2.5.1          rjsoncons_1.0.1   generics_0.1.3   
#> [17] curl_5.2.0        knitr_1.45        htmlwidgets_1.6.4 tibble_3.2.1     
#> [21] desc_1.4.3        tzdb_0.4.0        bslib_0.6.1       pillar_1.9.0     
#> [25] rlang_1.1.2       utf8_1.2.4        DT_0.31           cachem_1.0.8     
#> [29] stringi_1.8.2     xfun_0.41         fs_1.6.3          sass_0.4.8       
#> [33] bit64_4.0.5       memoise_2.0.1     cli_3.6.1         withr_2.5.2      
#> [37] pkgdown_2.0.7     magrittr_2.0.3    digest_0.6.33     vroom_1.6.5      
#> [41] hms_1.1.3         lifecycle_1.0.4   vctrs_0.6.5       evaluate_0.23    
#> [45] glue_1.6.2        ragg_1.2.6        fansi_1.0.6       httr_1.4.7       
#> [49] rmarkdown_2.25    purrr_1.0.2       tools_4.3.2       pkgconfig_2.0.3  
#> [53] htmltools_0.5.7