Processing NDJSON
Martin Morgan
Roswell Park Comprehensive Cancer Center, Buffalo, NY, US
Source: vignettes/articles/b_ndjson_extended.Rmd
Installation & setup
This article assumes that the rjsoncons, listviewer (for interactively exploring JSON), dplyr (for manipulating results as a tibble), tidyr (for unnesting columns in a tibble), and cli (for a progress indicator) packages are installed.
pkgs <- c("rjsoncons", "listviewer", "dplyr", "tidyr", "cli")
needed <- pkgs[!pkgs %in% rownames(installed.packages())]
install.packages(needed, repos = "https://CRAN.R-project.org")
Start by loading the rjsoncons and dplyr packages into the current session.
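## load the packages used in this article
library(rjsoncons)
library(dplyr)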
We use data from GH Archive, a project to record activity on public GitHub repositories. Create a location for this file in the user-specific ‘cache’ directory for the rjsoncons package.
cache <- tools::R_user_dir("rjsoncons", "cache")
if (!dir.exists(cache))
dir.create(cache, recursive = TRUE)
If necessary, download a single file (1 hour of activity, about 170,000 events, 100 MB) from the GH Archive.
archive_file <- "https://data.gharchive.org/2023-02-08-0.json.gz"
ndjson_file <- file.path(cache, "2023-02-08-0.json.gz")
if (!file.exists(ndjson_file))
download.file(archive_file, ndjson_file)
Data exploration
Ensure that the ndjson_file defined above exists
stopifnot(
file.exists(ndjson_file)
)
To get a sense of the data, read and visualize the first record
j_query(ndjson_file, n_records = 1) |>
listviewer::jsonedit()
This query uses the default path = "@", a JMESpath expression that returns the current element. The n_records = argument is available when processing NDJSON, and restricts the number of records read. This is very useful when exploring the data.
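For example, supplying path = "@" explicitly gives the same result as the default:
## equivalent to the default query above
identical(
    j_query(ndjson_file, n_records = 1),
    j_query(ndjson_file, "@", n_records = 1)
)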
The record contains the information below. Records have this general structure, but the information can differ, e.g., not all events have an org field.
{
"id": "26939254345",
"type": "DeleteEvent",
"actor": {
"id": 19908762,
"login": "lucianHymer",
"display_login": "lucianHymer",
"gravatar_id": "",
"url": "https://api.github.com/users/lucianHymer",
"avatar_url": "https://avatars.githubusercontent.com/u/19908762?"
},
"repo": {
"id": 469847426,
"name": "gitcoinco/passport",
"url": "https://api.github.com/repos/gitcoinco/passport"
},
"payload": {
"ref": "format-alert-messages",
"ref_type": "branch",
"pusher_type": "user"
},
"public": true,
"created_at": "2023-02-08T00:00:00Z",
"org": {
"id": 30044474,
"login": "gitcoinco",
"gravatar_id": "",
"url": "https://api.github.com/orgs/gitcoinco",
"avatar_url": "https://avatars.githubusercontent.com/u/30044474?"
}
}
We will work with the "id" and "type" top-level fields, available using JMESpath as
j_query(ndjson_file, '{id: id, type: type}', n_records = 5)
## [1] "{\"id\":\"26939254345\",\"type\":\"DeleteEvent\"}"
## [2] "{\"id\":\"26939254358\",\"type\":\"PushEvent\"}"
## [3] "{\"id\":\"26939254361\",\"type\":\"CreateEvent\"}"
## [4] "{\"id\":\"26939254365\",\"type\":\"CreateEvent\"}"
## [5] "{\"id\":\"26939254366\",\"type\":\"PushEvent\"}"
A more elaborate query might combine these with other, nested, elements, e.g.,
j_query(ndjson_file, '{id: id, type: type, "org.id": org.id}', n_records = 5)
## [1] "{\"id\":\"26939254345\",\"type\":\"DeleteEvent\",\"org.id\":30044474}"
## [2] "{\"id\":\"26939254358\",\"type\":\"PushEvent\",\"org.id\":123667276}"
## [3] "{\"id\":\"26939254361\",\"type\":\"CreateEvent\",\"org.id\":null}"
## [4] "{\"id\":\"26939254365\",\"type\":\"CreateEvent\",\"org.id\":null}"
## [5] "{\"id\":\"26939254366\",\"type\":\"PushEvent\",\"org.id\":null}"
Note that records 3-5 do not have an organization.
Use JMESpath for queries
JMESpath seems to be most appropriate when working with NDJSON files.
Here’s a JMESpath query extracting just the org information; the query processes five records and returns five results; records 3-5 do not have this key, and are "null".
j_query(ndjson_file, 'org', n_records = 5)
## [1] "{\"id\":30044474,\"login\":\"gitcoinco\",\"gravatar_id\":\"\",\"url\":\"https://api.github.com/orgs/gitcoinco\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/30044474?\"}"
## [2] "{\"id\":123667276,\"login\":\"johnbieren-testing\",\"gravatar_id\":\"\",\"url\":\"https://api.github.com/orgs/johnbieren-testing\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/123667276?\"}"
## [3] "null"
## [4] "null"
## [5] "null"
A JSONpointer path cannot be used, because it is an error when a key does not exist, so the third record cannot be processed
try(
## fails: 'b' does not exist
j_query('{"a": 1}', '/b')
)
## Error : Key not found
try(
## fails: record 3 does not have 'org' key
j_query(ndjson_file, '/org', n_records = 5)
)
## Error : Key not found
Also, JSONpointer does not allow one to create new objects from components of the data, so one could not assemble the id and type keys of the original object into a new object.
JSONpath allows for missing keys
j_query(ndjson_file, "$.org", n_records = 5)
## [1] "[{\"id\":30044474,\"login\":\"gitcoinco\",\"gravatar_id\":\"\",\"url\":\"https://api.github.com/orgs/gitcoinco\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/30044474?\"}]"
## [2] "[{\"id\":123667276,\"login\":\"johnbieren-testing\",\"gravatar_id\":\"\",\"url\":\"https://api.github.com/orgs/johnbieren-testing\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/123667276?\"}]"
## [3] "[]"
## [4] "[]"
## [5] "[]"
but it is not straightforward to assemble new objects, e.g., placing the top-level "id" and "type" keys into a single object.
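For instance (a sketch), each key would require its own JSONpath query, returning arrays of values that would then need to be re-assembled in R.
## separate queries per key; results are arrays of values, not keyed objects
j_query(ndjson_file, "$.id", n_records = 2)
j_query(ndjson_file, "$.type", n_records = 2)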
Use tibble with j_pivot()
j_pivot() is very useful for extracting tabular data from JSON and NDJSON representations. Recall that j_pivot() transforms a JSON array (or the records of an NDJSON file) of objects to an object of arrays
path <- '{id: id, type: type}'
j_pivot(ndjson_file, path, n_records = 5, as = "R") |>
str()
## List of 2
## $ id : chr [1:5] "26939254345" "26939254358" "26939254361" "26939254365" ...
## $ type: chr [1:5] "DeleteEvent" "PushEvent" "CreateEvent" "CreateEvent" ...
This can be represented as a data.frame or tibble
j_pivot(ndjson_file, path, n_records = 5, as = "data.frame")
## id type
## 1 26939254345 DeleteEvent
## 2 26939254358 PushEvent
## 3 26939254361 CreateEvent
## 4 26939254365 CreateEvent
## 5 26939254366 PushEvent
‘Under the hood’, j_pivot() simply computes the as = "R" representation and then calls as.data.frame() on the result. Unfortunately, as.data.frame() fails when some keys are translated to NULL, e.g., when org is absent
path <- '{id: id, type: type, org: org}'
try(
j_pivot(ndjson_file, path, n_records = 5, as = "data.frame")
)
## Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
## arguments imply differing number of rows: 1, 0
The coercion of the R representation to a tibble is robust to this missing data
tbl <- j_pivot(ndjson_file, path, n_records = 5, as = "tibble")
tbl
## # A tibble: 5 × 3
## id type org
## <chr> <chr> <list>
## 1 26939254345 DeleteEvent <named list [5]>
## 2 26939254358 PushEvent <named list [5]>
## 3 26939254361 CreateEvent <NULL>
## 4 26939254365 CreateEvent <NULL>
## 5 26939254366 PushEvent <NULL>
The Hierarchical data chapter of R for Data Science suggests using tidyr::unnest_wider() and tidyr::unnest_longer() for working with nested data. The result of the pivot can be flattened with
tbl |>
tidyr::unnest_wider("org", names_sep = ".")
## # A tibble: 5 × 7
## id type org.id org.login org.gravatar_id org.url org.avatar_url
## <chr> <chr> <int> <chr> <chr> <chr> <chr>
## 1 26939254345 DeleteEv… 3.00e7 gitcoinco "" https:… https://avata…
## 2 26939254358 PushEvent 1.24e8 johnbier… "" https:… https://avata…
## 3 26939254361 CreateEv… NA NA NA NA NA
## 4 26939254365 CreateEv… NA NA NA NA NA
## 5 26939254366 PushEvent NA NA NA NA NA
If one were interested in only some of the keys in the nested org element, these could be incorporated directly into the path. Note that keys containing . need to be quoted, as in "org.id": org.id.
path <- '{id: id, type: type, "org.id": org.id}'
j_pivot(ndjson_file, path, n_records = 5, as = "tibble")
## # A tibble: 5 × 3
## id type org.id
## <chr> <chr> <list>
## 1 26939254345 DeleteEvent <int [1]>
## 2 26939254358 PushEvent <int [1]>
## 3 26939254361 CreateEvent <NULL>
## 4 26939254365 CreateEvent <NULL>
## 5 26939254366 PushEvent <NULL>
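The length-1 list column can be flattened with tidyr::unnest_longer(); a sketch, where keep_empty = TRUE retains records without an organization as NA:
j_pivot(ndjson_file, path, n_records = 5, as = "tibble") |>
    tidyr::unnest_longer("org.id", keep_empty = TRUE)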
Filters with JMESpath
The strategy for filtering NDJSON with JMESpath is to create a length-1 array containing the object of interest, and then filter the array. Thus to discover PushEvents from organizations, form an array with an object containing relevant information, [{id: id, type: type, org: org}], and then filter the array using JMESpath’s query syntax, [?@.type == 'PushEvent' && org != null]. The final | [0] takes the first (only) element of the filtered array, so each record yields either the matching object or null. The type of quotation (single-quote, ') is important in the query, so use double quotes to define the path
path <-
"[{id: id, type: type, org: org}]
[?@.type == 'PushEvent' && org != null] |
[0]"
j_query(ndjson_file, path, n_records = 5)
## [1] "null"
## [2] "{\"id\":\"26939254358\",\"type\":\"PushEvent\",\"org\":{\"id\":123667276,\"login\":\"johnbieren-testing\",\"gravatar_id\":\"\",\"url\":\"https://api.github.com/orgs/johnbieren-testing\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/123667276?\"}}"
## [3] "null"
## [4] "null"
## [5] "null"
j_pivot() removes empty records
path <-
"[{id: id, type: type, org: org}]
[?@.type == 'PushEvent' && org != null] |
[0]"
j_pivot(ndjson_file, path, n_records = 5, as = "tibble")
## # A tibble: 1 × 3
## id type org
## <chr> <chr> <list>
## 1 26939254358 PushEvent <named list [5]>
Performance
rjsoncons is relatively performant when processing large files. Use verbose = TRUE to get a progress indicator.
system.time({
tbl <- j_pivot(
ndjson_file, '{id: id, type: type}',
as = "tibble", verbose = TRUE
)
})
## processing 33464 records
## processing 68962 records
## processing 107092 records
## processing 144248 records
## user system elapsed
## 13.656 0.110 13.766
tbl
## # A tibble: 172,049 × 2
## id type
## <chr> <chr>
## 1 26939254345 DeleteEvent
## 2 26939254358 PushEvent
## 3 26939254361 CreateEvent
## 4 26939254365 CreateEvent
## 5 26939254366 PushEvent
## 6 26939254367 PushEvent
## 7 26939254379 PushEvent
## 8 26939254380 IssuesEvent
## 9 26939254382 PushEvent
## 10 26939254383 PushEvent
## # ℹ 172,039 more rows
tbl |>
count(type, sort = TRUE)
## # A tibble: 15 × 2
## type n
## <chr> <int>
## 1 PushEvent 90250
## 2 CreateEvent 25311
## 3 PullRequestEvent 18326
## 4 IssueCommentEvent 9610
## 5 DeleteEvent 9065
## 6 WatchEvent 5620
## 7 PullRequestReviewEvent 3823
## 8 IssuesEvent 2744
## 9 PullRequestReviewCommentEvent 2098
## 10 ForkEvent 1900
## 11 CommitCommentEvent 1257
## 12 ReleaseEvent 917
## 13 PublicEvent 491
## 14 MemberEvent 388
## 15 GollumEvent 249
On my system, this takes approximately 13s. Memory use is not extensive, because at the R level the file is processed in chunks and only the final result is represented in R data structures.
The performance of rjsoncons is comparable to the purpose-built jq command-line tool. jq takes about 9s when run at the command line, and an additional 3s is required to read the command-line output into R. jq provides greater flexibility than JMESpath, and is widely used.
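A sketch of that command-line comparison, run from R (not evaluated above; assumes a Unix-like system with the jq tool on the PATH; the jq program '{id, type}' matches the JMESpath query used earlier):
## run jq on the decompressed stream and capture its NDJSON output
cmd <- paste0("gunzip -c ", shQuote(ndjson_file), " | jq -c '{id, type}'")
system.time({
    jq_lines <- system(cmd, intern = TRUE)       # command-line jq
    jq_tbl <- j_pivot(jq_lines, as = "tibble")   # read the output into R
})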
The CRAN package jqr provides an R interface to the jq library. Linux and macOS users are required to have the jq library installed. A straightforward use of the library takes about 22 seconds; additional steps are required to translate the result to an R data.frame.
system.time({
jqr <-
jqr::jq(gzfile(ndjson_file), '{id, type}') |>
j_pivot(as = "tibble")
})
## user system elapsed
## 20.812 0.020 20.833
The use case outlined here compares very favorably to the performance of the ndjson CRAN package, which took more than 600s to complete the task above. ndjson reads the entire data set into R, whereas rjsoncons only represents the final object, with columns id and type, in R.
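For reference, a sketch of that approach (not run here; assumes the ndjson package is installed and that ndjson::stream_in() is the entry point):
## flattens every record into a wide table before columns are selected,
## which accounts for the much larger time and memory use
flat <- ndjson::stream_in(ndjson_file)
tbl_ndjson <- dplyr::select(flat, id, type)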
DuckDB offers a CRAN package that supports its SQL interface to JSON, and this is very performant. The following code takes just 3.7s to deliver a data.frame to R.
library(glue)
library(duckdb)
library(DBI)
con <- dbConnect(duckdb())
dbExecute(con, "INSTALL 'json';") # only required once
dbExecute(con, "LOAD 'json';")
sql <- glue(
"SELECT id, type
FROM read_ndjson_auto('{ndjson_file}');"
)
system.time({
res <- dbGetQuery(con, sql)
}) # 3.7 seconds!
The DuckDB SQL interface allows flexible selection, filtering, and data summary. It also treats a collection of JSON files as a single ‘database’, and scales favorably and automatically with the number of files being processed. DuckDB does not require additional software, other than the duckdb CRAN package.
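A sketch of that scalability, assuming additional hourly files had been downloaded to the same cache directory; the glob matches all of them, and filtering is expressed in SQL:
sql <- glue(
    "SELECT id, type
     FROM read_ndjson_auto('{cache}/2023-02-08-*.json.gz')
     WHERE type = 'PushEvent';"
)
res <- dbGetQuery(con, sql)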
A blog post provides additional details of the comparison between solutions, including discussion of design decisions that rjsoncons adopted to achieve reasonable performance.
Other packages
There are two very fast JSON parsers available via CRAN, RcppSimdJson and yyjsonr. RcppSimdJson supports JSONpointer for queries, but as noted for NDJSON this is only useful when all records contain the endpoint. yyjsonr does not support queries or NDJSON at time of writing (18 February, 2024).
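A sketch of RcppSimdJson’s JSONpointer support (assumes RcppSimdJson is installed; its fparse() accepts a query = JSON pointer, which works well when the key is present in every record):
RcppSimdJson::fparse('{"id": "26939254345", "type": "DeleteEvent"}', query = "/type")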
Session information
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_1.1.4 rjsoncons_1.3.1.9100 BiocStyle_2.32.1
##
## loaded via a namespace (and not attached):
## [1] jsonlite_1.8.8 compiler_4.4.1 BiocManager_1.30.25
## [4] tidyselect_1.2.1 tidyr_1.3.1 jquerylib_0.1.4
## [7] systemfonts_1.1.0 textshaping_0.4.0 yaml_2.3.10
## [10] fastmap_1.2.0 R6_2.5.1 generics_0.1.3
## [13] knitr_1.48 tibble_3.2.1 bookdown_0.40
## [16] desc_1.4.3 bslib_0.8.0 pillar_1.9.0
## [19] rlang_1.1.4 utf8_1.2.4 cachem_1.1.0
## [22] xfun_0.47 fs_1.6.4 sass_0.4.9
## [25] lazyeval_0.2.2 cli_3.6.3 pkgdown_2.1.0
## [28] withr_3.0.1 magrittr_2.0.3 digest_0.6.37
## [31] lifecycle_1.0.4 jqr_1.3.4 vctrs_0.6.5
## [34] evaluate_0.24.0 glue_1.7.0 ragg_1.3.2
## [37] fansi_1.0.6 rmarkdown_2.28 purrr_1.0.2
## [40] tools_4.4.1 pkgconfig_2.0.3 htmltools_0.5.8.1