Processing NDJSON
Martin Morgan
Roswell Park Comprehensive Cancer Center, Buffalo, NY, US
Source: vignettes/articles/b_ndjson_extended.Rmd
Installation & setup
This article assumes that the rjsoncons, listviewer (for interactively exploring JSON), dplyr (for manipulating results as a tibble), tidyr (for unnesting columns in a tibble), and cli (for a progress indicator) packages are installed.
pkgs <- c("rjsoncons", "listviewer", "dplyr", "tidyr", "cli")
needed <- pkgs[!pkgs %in% rownames(installed.packages())]
install.packages(needed, repos = "https://CRAN.R-project.org")
Start by loading the rjsoncons and dplyr packages into the current session.
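## load the packages used in this article
library(rjsoncons)
library(dplyr)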
We use data from GH Archive, a project to record activity on public GitHub repositories. Create a location for this file in the user-specific ‘cache’ directory for the rjsoncons package.
cache <- tools::R_user_dir("rjsoncons", "cache")
if (!dir.exists(cache))
dir.create(cache, recursive = TRUE)
If necessary, download a single file (1 hour of activity, about 170,000 events, 100 MB) from the GH Archive.
archive_file <- "https://data.gharchive.org/2023-02-08-0.json.gz"
ndjson_file <- file.path(cache, "2023-02-08-0.json.gz")
if (!file.exists(ndjson_file))
download.file(archive_file, ndjson_file)
Data exploration
Ensure that the ndjson_file defined above exists
stopifnot(
file.exists(ndjson_file)
)
To get a sense of the data, read and visualize the first record
j_query(ndjson_file, n_records = 1) |>
listviewer::jsonedit()
This query uses the default path = "@", a JMESpath expression that returns the current element. The n_records = argument is available when processing NDJSON, and restricts the number of records read. This is very useful when exploring the data.
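For example, supplying path = "@" explicitly gives the same result as the default:
## equivalent to the default query above
identical(
    j_query(ndjson_file, n_records = 1),
    j_query(ndjson_file, "@", n_records = 1)
)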
The record contains the information below. Records have this general structure, but the information can differ, e.g., not all events have an org field.
{
"id": "26939254345",
"type": "DeleteEvent",
"actor": {
"id": 19908762,
"login": "lucianHymer",
"display_login": "lucianHymer",
"gravatar_id": "",
"url": "https://api.github.com/users/lucianHymer",
"avatar_url": "https://avatars.githubusercontent.com/u/19908762?"
},
"repo": {
"id": 469847426,
"name": "gitcoinco/passport",
"url": "https://api.github.com/repos/gitcoinco/passport"
},
"payload": {
"ref": "format-alert-messages",
"ref_type": "branch",
"pusher_type": "user"
},
"public": true,
"created_at": "2023-02-08T00:00:00Z",
"org": {
"id": 30044474,
"login": "gitcoinco",
"gravatar_id": "",
"url": "https://api.github.com/orgs/gitcoinco",
"avatar_url": "https://avatars.githubusercontent.com/u/30044474?"
}
}
We will work with the "id" and "type" top-level fields, available using JMESpath as
j_query(ndjson_file, '{id: id, type: type}', n_records = 5)
## [1] "{\"id\":\"26939254345\",\"type\":\"DeleteEvent\"}"
## [2] "{\"id\":\"26939254358\",\"type\":\"PushEvent\"}"
## [3] "{\"id\":\"26939254361\",\"type\":\"CreateEvent\"}"
## [4] "{\"id\":\"26939254365\",\"type\":\"CreateEvent\"}"
## [5] "{\"id\":\"26939254366\",\"type\":\"PushEvent\"}"
A more elaborate query might combine these with other, nested, elements, e.g.,
j_query(ndjson_file, '{id: id, type: type, "org.id": org.id}', n_records = 5)
## [1] "{\"id\":\"26939254345\",\"type\":\"DeleteEvent\",\"org.id\":30044474}"
## [2] "{\"id\":\"26939254358\",\"type\":\"PushEvent\",\"org.id\":123667276}"
## [3] "{\"id\":\"26939254361\",\"type\":\"CreateEvent\",\"org.id\":null}"
## [4] "{\"id\":\"26939254365\",\"type\":\"CreateEvent\",\"org.id\":null}"
## [5] "{\"id\":\"26939254366\",\"type\":\"PushEvent\",\"org.id\":null}"
Note that records 3-5 do not have an organization.
Use JMESpath for queries
JMESpath seems to be most appropriate when working with NDJSON files.
Here’s a JMESpath query extracting just the org information; the query processes five records and returns five results; records 3-5 do not have this key, and are "null".
j_query(ndjson_file, 'org', n_records = 5)
## [1] "{\"id\":30044474,\"login\":\"gitcoinco\",\"gravatar_id\":\"\",\"url\":\"https://api.github.com/orgs/gitcoinco\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/30044474?\"}"
## [2] "{\"id\":123667276,\"login\":\"johnbieren-testing\",\"gravatar_id\":\"\",\"url\":\"https://api.github.com/orgs/johnbieren-testing\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/123667276?\"}"
## [3] "null"
## [4] "null"
## [5] "null"
A JSONpointer path cannot be used, because it is an error when a key does not exist, so the third record cannot be processed
try(
## fails: 'b' does not exist
j_query('{"a": 1}', '/b')
)
## Error : Key not found
try(
## fails: record 3 does not have 'org' key
j_query(ndjson_file, '/org', n_records = 5)
)
## Error : Key not found
Also, JSONpointer does not allow one to create new objects from components of the data, so one could not assemble the id and type keys of the original object into a new object.
JSONpath allows for missing keys
j_query(ndjson_file, "$.org", n_records = 5)
## [1] "[{\"id\":30044474,\"login\":\"gitcoinco\",\"gravatar_id\":\"\",\"url\":\"https://api.github.com/orgs/gitcoinco\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/30044474?\"}]"
## [2] "[{\"id\":123667276,\"login\":\"johnbieren-testing\",\"gravatar_id\":\"\",\"url\":\"https://api.github.com/orgs/johnbieren-testing\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/123667276?\"}]"
## [3] "[]"
## [4] "[]"
## [5] "[]"
but it is not straightforward to assemble new objects, e.g., placing the top-level "id" and "type" keys into a single object.
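For instance (a sketch), each key would require its own JSONpath query, returning arrays of values that would then need to be re-assembled in R.
## separate queries per key; results are arrays of values, not keyed objects
j_query(ndjson_file, "$.id", n_records = 2)
j_query(ndjson_file, "$.type", n_records = 2)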
Use tibble with j_pivot()
j_pivot() is very useful for extracting tabular data from JSON and NDJSON representations. Recall that j_pivot() transforms a JSON array (or the records of an NDJSON file) of objects to an object of arrays
path <- '{id: id, type: type}'
j_pivot(ndjson_file, path, n_records = 5, as = "R") |>
str()
## List of 2
## $ id : chr [1:5] "26939254345" "26939254358" "26939254361" "26939254365" ...
## $ type: chr [1:5] "DeleteEvent" "PushEvent" "CreateEvent" "CreateEvent" ...
This can be represented as a data.frame or tibble
j_pivot(ndjson_file, path, n_records = 5, as = "data.frame")
## id type
## 1 26939254345 DeleteEvent
## 2 26939254358 PushEvent
## 3 26939254361 CreateEvent
## 4 26939254365 CreateEvent
## 5 26939254366 PushEvent
‘Under the hood’, j_pivot() simply computes the as = "R" representation and then calls as.data.frame() on the result. Unfortunately, as.data.frame() fails when some keys are translated to NULL, e.g., when org is absent
path <- '{id: id, type: type, org: org}'
try(
j_pivot(ndjson_file, path, n_records = 5, as = "data.frame")
)
## Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
## arguments imply differing number of rows: 1, 0
The coercion of the R representation to a tibble is robust to this missing data
tbl <- j_pivot(ndjson_file, path, n_records = 5, as = "tibble")
tbl
## # A tibble: 5 × 3
## id type org
## <chr> <chr> <list>
## 1 26939254345 DeleteEvent <named list [5]>
## 2 26939254358 PushEvent <named list [5]>
## 3 26939254361 CreateEvent <NULL>
## 4 26939254365 CreateEvent <NULL>
## 5 26939254366 PushEvent <NULL>
The Hierarchical data chapter of R for Data Science suggests using tidyr::unnest_wider() and tidyr::unnest_longer() for working with nested data. The result of the pivot can be flattened with
tbl |>
tidyr::unnest_wider("org", names_sep = ".")
## # A tibble: 5 × 7
## id type org.id org.login org.gravatar_id org.url org.avatar_url
## <chr> <chr> <int> <chr> <chr> <chr> <chr>
## 1 26939254345 DeleteEv… 3.00e7 gitcoinco "" https:… https://avata…
## 2 26939254358 PushEvent 1.24e8 johnbier… "" https:… https://avata…
## 3 26939254361 CreateEv… NA NA NA NA NA
## 4 26939254365 CreateEv… NA NA NA NA NA
## 5 26939254366 PushEvent NA NA NA NA NA
If one were interested in only some of the keys in the nested org element, these could be incorporated directly into the path. Note that keys containing . need to be quoted, as in "org.id": org.id.
path <- '{id: id, type: type, "org.id": org.id}'
j_pivot(ndjson_file, path, n_records = 5, as = "tibble")
## # A tibble: 5 × 3
## id type org.id
## <chr> <chr> <list>
## 1 26939254345 DeleteEvent <int [1]>
## 2 26939254358 PushEvent <int [1]>
## 3 26939254361 CreateEvent <NULL>
## 4 26939254365 CreateEvent <NULL>
## 5 26939254366 PushEvent <NULL>
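The length-1 list column can be flattened with tidyr::unnest_longer(); a sketch, where keep_empty = TRUE retains records without an organization as NA:
j_pivot(ndjson_file, path, n_records = 5, as = "tibble") |>
    tidyr::unnest_longer("org.id", keep_empty = TRUE)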
Filters with JMESpath
The strategy for filtering NDJSON with JMESpath is to create a length-1 array containing the object of interest, and then filter the array. Thus to discover PushEvents from organizations, form an array with an object containing relevant information, [{id: id, type: type, org: org}], and then filter the array using JMESpath’s query syntax, [?@.type == 'PushEvent' && org != null]. The final | [0] takes the first (only) element of the filtered array, so each record yields either the matching object or null. The type of quotation (single-quote, ') is important in the query, so use double quotes to define the path
path <-
"[{id: id, type: type, org: org}]
[?@.type == 'PushEvent' && org != null] |
[0]"
j_query(ndjson_file, path, n_records = 5)
## [1] "null"
## [2] "{\"id\":\"26939254358\",\"type\":\"PushEvent\",\"org\":{\"id\":123667276,\"login\":\"johnbieren-testing\",\"gravatar_id\":\"\",\"url\":\"https://api.github.com/orgs/johnbieren-testing\",\"avatar_url\":\"https://avatars.githubusercontent.com/u/123667276?\"}}"
## [3] "null"
## [4] "null"
## [5] "null"
j_pivot() removes empty records
path <-
"[{id: id, type: type, org: org}]
[?@.type == 'PushEvent' && org != null] |
[0]"
j_pivot(ndjson_file, path, n_records = 5, as = "tibble")
## # A tibble: 1 × 3
## id type org
## <chr> <chr> <list>
## 1 26939254358 PushEvent <named list [5]>
Performance
rjsoncons is relatively performant when processing large files. Use verbose = TRUE to get a progress indicator.
system.time({
tbl <- j_pivot(
ndjson_file, '{id: id, type: type}',
as = "tibble", verbose = TRUE
)
})
## processing 33464 records
## processing 68962 records
## processing 107092 records
## processing 144248 records
## user system elapsed
## 13.656 0.110 13.766
tbl
## # A tibble: 172,049 × 2
## id type
## <chr> <chr>
## 1 26939254345 DeleteEvent
## 2 26939254358 PushEvent
## 3 26939254361 CreateEvent
## 4 26939254365 CreateEvent
## 5 26939254366 PushEvent
## 6 26939254367 PushEvent
## 7 26939254379 PushEvent
## 8 26939254380 IssuesEvent
## 9 26939254382 PushEvent
## 10 26939254383 PushEvent
## # ℹ 172,039 more rows
tbl |>
count(type, sort = TRUE)
## # A tibble: 15 × 2
## type n
## <chr> <int>
## 1 PushEvent 90250
## 2 CreateEvent 25311
## 3 PullRequestEvent 18326
## 4 IssueCommentEvent 9610
## 5 DeleteEvent 9065
## 6 WatchEvent 5620
## 7 PullRequestReviewEvent 3823
## 8 IssuesEvent 2744
## 9 PullRequestReviewCommentEvent 2098
## 10 ForkEvent 1900
## 11 CommitCommentEvent 1257
## 12 ReleaseEvent 917
## 13 PublicEvent 491
## 14 MemberEvent 388
## 15 GollumEvent 249
On my system, this takes approximately 13s. Memory use is not extensive, because at the R level the file is processed in chunks and only the final result is represented in R data structures.
The performance of rjsoncons is comparable to the purpose-built jq command-line tool. jq takes about 9s when run at the command line, and an additional 3s is required to read the command-line output into R. jq provides greater flexibility than JMESpath, and is widely used.
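A sketch of that command-line comparison, run from R (not evaluated above; assumes a Unix-like system with the jq tool on the PATH; the jq program '{id, type}' matches the JMESpath query used earlier):
## run jq on the decompressed stream and capture its NDJSON output
cmd <- paste0("gunzip -c ", shQuote(ndjson_file), " | jq -c '{id, type}'")
system.time({
    jq_lines <- system(cmd, intern = TRUE)       # command-line jq
    jq_tbl <- j_pivot(jq_lines, as = "tibble")   # read the output into R
})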
The CRAN package jqr provides an R interface to the jq library. Linux and macOS users are required to have the jq library installed. A straightforward use of the library takes about 22 seconds; additional steps are required to translate the result to an R data.frame.
system.time({
jqr <-
jqr::jq(gzfile(ndjson_file), '{id, type}') |>
j_pivot(as = "tibble")
})
## user system elapsed
## 20.812 0.020 20.833
The use case outlined here compares very favorably to the performance of the ndjson CRAN package, which took more than 600s to complete the task above. ndjson reads the entire data set into R, whereas rjsoncons only represents the final object, with columns id and type, in R.
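For reference, a sketch of that approach (not run here; assumes the ndjson package is installed and that ndjson::stream_in() is the entry point):
## flattens every record into a wide table before columns are selected,
## which accounts for the much larger time and memory use
flat <- ndjson::stream_in(ndjson_file)
tbl_ndjson <- dplyr::select(flat, id, type)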
DuckDB offers a CRAN package that supports its SQL interface to JSON, and this is very performant. The following code takes just 3.7s to deliver a data.frame to R.
library(glue)
library(duckdb)
library(DBI)
con <- dbConnect(duckdb())
dbExecute(con, "INSTALL 'json';") # only required once
dbExecute(con, "LOAD 'json';")
sql <- glue(
"SELECT id, type
FROM read_ndjson_auto('{ndjson_file}');"
)
system.time({
res <- dbGetQuery(con, sql)
}) # 3.7 seconds!
The DuckDB SQL interface allows flexible selection, filtering, and data summary. It also treats a collection of JSON files as a single ‘database’, and scales favorably and automatically with the number of files being processed. DuckDB does not require additional software, other than the duckdb CRAN package.
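A sketch of that scalability, assuming additional hourly files had been downloaded to the same cache directory; the glob matches all of them, and filtering is expressed in SQL:
sql <- glue(
    "SELECT id, type
     FROM read_ndjson_auto('{cache}/2023-02-08-*.json.gz')
     WHERE type = 'PushEvent';"
)
res <- dbGetQuery(con, sql)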
A blog post provides additional details of the comparison between solutions, including discussion of design decisions that rjsoncons adopted to achieve reasonable performance.
Other packages
There are two very fast JSON parsers available via CRAN, RcppSimdJson and yyjsonr. RcppSimdJson supports JSONpointer for queries, but as noted for NDJSON this is only useful when all records contain the endpoint. yyjsonr does not support queries or NDJSON at time of writing (18 February, 2024).
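A sketch of RcppSimdJson’s JSONpointer support (assumes RcppSimdJson is installed; its fparse() accepts a query = JSON pointer, which works well when the key is present in every record):
RcppSimdJson::fparse('{"id": "26939254345", "type": "DeleteEvent"}', query = "/type")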
Session information
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_1.1.4 rjsoncons_1.3.1.9100 BiocStyle_2.32.1
##
## loaded via a namespace (and not attached):
## [1] jsonlite_1.8.8 compiler_4.4.1 BiocManager_1.30.25
## [4] tidyselect_1.2.1 tidyr_1.3.1 jquerylib_0.1.4
## [7] systemfonts_1.1.0 textshaping_0.4.0 yaml_2.3.10
## [10] fastmap_1.2.0 R6_2.5.1 generics_0.1.3
## [13] knitr_1.48 tibble_3.2.1 bookdown_0.40
## [16] desc_1.4.3 bslib_0.8.0 pillar_1.9.0
## [19] rlang_1.1.4 utf8_1.2.4 cachem_1.1.0
## [22] xfun_0.47 fs_1.6.4 sass_0.4.9
## [25] lazyeval_0.2.2 cli_3.6.3 pkgdown_2.1.0
## [28] withr_3.0.1 magrittr_2.0.3 digest_0.6.37
## [31] lifecycle_1.0.4 jqr_1.3.4 vctrs_0.6.5
## [34] evaluate_0.24.0 glue_1.7.0 ragg_1.3.2
## [37] fansi_1.0.6 rmarkdown_2.28 purrr_1.0.2
## [40] tools_4.4.1 pkgconfig_2.0.3 htmltools_0.5.8.1