Support for NDJSON in rjsoncons
https://mtmorgan.github.io/rjsoncons
A DuckDB blog post on ‘Shredding Deeply Nested JSON, One Vector at a Time’ shows some pretty amazing JSON parsing abilities, including the ‘GitHub Archive Examples’. It inspired exploration of NDJSON in rjsoncons, a first iteration of which is available as version 1.1.0.9400.
remotes::install_github("mtmorgan/rjsoncons")
Since NDJSON files can be large, one wants to iterate through them in chunks. So in addition to NDJSON parsing, the package gains the ability to read from connections (files or URLs). Support includes both j_query() and j_pivot(). The path argument refers to a selection in individual records. So one might query a single record
ndjson_file <- "data/gharchive_gz/2023-02-08-0.json.gz"
j_query(ndjson_file, n_records = 1) |>
listviewer::jsonedit()
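The path applies record by record in j_query() as well. As a sketch (using the same JMESPath expression as the pivot below; the result is a character vector with one JSON string per record):

## extract id and type from each of the first 5 records
j_query(ndjson_file, "{id: id, type: type}", n_records = 5)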
Extract relevant information into a tibble
result <- j_pivot(
ndjson_file,
"{id: id, type: type}", # path on individual records
as = "tibble", # output format
verbose = TRUE # progress bar
)
The result is a tibble that can be easily manipulated using dplyr.
> result
# A tibble: 172,049 × 2
id type
<chr> <chr>
1 26939254345 DeleteEvent
2 26939254358 PushEvent
3 26939254361 CreateEvent
4 26939254365 CreateEvent
5 26939254366 PushEvent
6 26939254367 PushEvent
7 26939254379 PushEvent
8 26939254380 IssuesEvent
9 26939254382 PushEvent
10 26939254383 PushEvent
# ℹ 172,039 more rows
# ℹ Use `print(n = ...)` to see more rows
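For instance, one might tally event types (a sketch using dplyr::count(); any of the usual verbs apply):

library(dplyr)
## events of each type, most frequent first
result |>
    count(type, sort = TRUE)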
Current performance is about 10x slower than DuckDB; closing that gap is a topic for subsequent iterations.
Some commentary
Connections
I had hoped to use R’s connection interface at the C level, to avoid realization of R data structures, but the public interface is just too limited.
I also explored implementing input using C++ iostreams, but wanted support for compression. There seems to be a relatively old but functional and light-weight gzstream library. This approach didn't actually seem particularly performant (maybe there is a buffer size that is too small?) and would not be as flexible as R-level connections (e.g., handling https:// URLs).
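R-level connections also mean the data does not need to be local. A sketch, assuming the file above comes from GH Archive (which serves these gzip-compressed NDJSON files over https; the URL follows the GH Archive naming convention):

url <- "https://data.gharchive.org/2023-02-08-0.json.gz"
j_query(url, n_records = 1)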
At the R level, I currently chunk through connections using readLines(). It would be much faster to use readBin(), avoiding the cost of creating R character vectors. One idea is to read raw vectors at the R level, then pass them to C++ for conversion to std::string and jsoncons processing.
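A minimal sketch of the idea; process_chunk() is a hypothetical C++ entry point standing in for the std::string conversion and jsoncons processing:

con <- gzfile(ndjson_file, "rb")
leftover <- raw()
repeat {
    bytes <- readBin(con, raw(), n = 2^20)    # ~1 Mb per chunk
    if (length(bytes) == 0L)
        break
    chunk <- c(leftover, bytes)
    ## records are newline-delimited; carry a trailing partial
    ## record over to the next iteration
    eol <- which(chunk == as.raw(10L))
    if (length(eol) == 0L) {
        leftover <- chunk
        next
    }
    last <- max(eol)
    ## process_chunk() is hypothetical -- it would convert the raw
    ## vector to std::string and hand it to jsoncons
    process_chunk(chunk[seq_len(last)])
    leftover <- chunk[-seq_len(last)]
}
close(con)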
External pointers
For chunk processing it makes sense to have C++ data structures that persist across calls from R, so I created a C++ class r_json and used cpp11's templated external_pointer<> class to return this to R. r_json is templated
template<class Json>
class r_json
{ ... };
so the cpp11 data structure is actually either external_pointer< r_json<json> > or external_pointer< r_json<ojson> >. This leads to the unfortunate situation where calls from R with the external pointer also require the template class specification, and a switch() for dispatch
template<class Json>
sexp do_something_impl(sexp ext)
{
    cpp11::external_pointer< r_json<Json> > extp(ext);
    ...
}

sexp do_something(sexp ext, std::string template_class)
{
    // C++ cannot switch() on a string, so the 'switch' is an if / else chain
    if (template_class == "json")
        return do_something_impl<json>(ext);
    else if (template_class == "ojson")
        return do_something_impl<ojson>(ext);
}
or similar. It is unfortunate that the template class information cannot be more tightly coupled with the external pointer, and that dispatch is being re-invented with a switch-like construct. Maybe I am not clever enough with C++…