Improving NDJSON performance
https://mtmorgan.github.io/rjsoncons
Some of my work this week was on improving the performance of NDJSON parsing in rjsoncons. NDJSON is a JSON format where each line represents a complete JSON record, corresponding approximately to a row in a table. My inspiration came from a DuckDB blog post on ‘Shredding Deeply Nested JSON, One Vector at a Time’. The example data is from GH Archive, a project to record activity on public GitHub repositories. I use a single file, with one hour of data.
ndjson_file <- "data/gharchive_gz/2023-02-08-0.json.gz"
download.file(
    "https://data.gharchive.org/2023-02-08-0.json.gz",
    ndjson_file
)
We can view a single record interactively, using rjsoncons and listviewer.
j_query(ndjson_file, n_records = 1) |>
    listviewer::jsonedit()
Here is one record; structure and content of individual records can vary.
{
  "id": "26939254345",
  "type": "DeleteEvent",
  "actor": {
    "id": 19908762,
    "login": "lucianHymer",
    "display_login": "lucianHymer",
    "gravatar_id": "",
    "url": "https://api.github.com/users/lucianHymer",
    "avatar_url": "https://avatars.githubusercontent.com/u/19908762?"
  },
  "repo": {
    "id": 469847426,
    "name": "gitcoinco/passport",
    "url": "https://api.github.com/repos/gitcoinco/passport"
  },
  "payload": {
    "ref": "format-alert-messages",
    "ref_type": "branch",
    "pusher_type": "user"
  },
  "public": true,
  "created_at": "2023-02-08T00:00:00Z",
  "org": {
    "id": 30044474,
    "login": "gitcoinco",
    "gravatar_id": "",
    "url": "https://api.github.com/orgs/gitcoinco",
    "avatar_url": "https://avatars.githubusercontent.com/u/30044474?"
  }
}
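Since structure varies, a quick way to survey what each record contains is to ask for the top-level keys with a JMESpath expression (a sketch; I have not checked the exact output against this file):
## one JSON array of top-level names per record, e.g. '["id","type",...]'
j_query(ndjson_file, 'keys(@)', n_records = 5)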
One use for rjsoncons is to query JSON, rather than to process the entire data set, and to represent the result in a convenient way in R. So one might extract the "id" and "type" fields as a tibble with
j_pivot(ndjson_file, '{id: id, type: type}', as = "tibble", n_records = 5)
## # A tibble: 5 × 2
## id type
## <chr> <chr>
## 1 26939254345 DeleteEvent
## 2 26939254358 PushEvent
## 3 26939254361 CreateEvent
## 4 26939254365 CreateEvent
## 5 26939254366 PushEvent
But of course one would like to do this for the entire file.
Alternatives / the competition
DuckDB
DuckDB shows some pretty amazing JSON parsing abilities, including the ‘GitHub Archive Examples’.
library(glue)
library(duckdb)
library(DBI)
con <- dbConnect(duckdb())
dbExecute(con, "INSTALL 'json';")
dbExecute(con, "LOAD 'json';")
sql <- glue(
"SELECT id, type
FROM read_ndjson_auto('{ndjson_file}');"
)
The performance is killer, and sets the bar to which we aspire.
system.time({
    res <- dbGetQuery(con, sql)
}) # 3.7 seconds!
res is an R data.frame with 172049 rows. The SELECT statement is very flexible, allowing, e.g., extraction of arbitrarily nested data.
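For instance, nested fields can be pulled out with dot notation on the struct columns that read_ndjson_auto() infers. The following is a sketch I have not benchmarked against this file (the aliases actor_login and repo_name are my own), reusing the connection and glue() pattern from above.
nested_sql <- glue(
    "SELECT id, actor.login AS actor_login, repo.name AS repo_name
    FROM read_ndjson_auto('{ndjson_file}');"
)
nested <- dbGetQuery(con, nested_sql)
head(nested)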
jq
Probably most command-line JSON people use jq for transforming JSON. The following timing shows that it takes about 0.62s just to decompress the file
$ time gzcat data/gharchive_gz/2023-02-08-0.json.gz > /dev/null
0.59s user 0.02s system 97% cpu 0.623 total
And another 8.3s to transform the JSON
$ gzcat data/gharchive_gz/2023-02-08-0.json.gz | \
time jq --compact-output '{id, type}' > /dev/null
8.20s user 0.06s system 99% cpu 8.303 total
Not as fast as DuckDB (though the query language of jq is much richer), and not yet in R.
Adding steps to transform the result to R using rjsoncons increases the time, but not by much; the code uses an R pipe() to evaluate the system command, and then forwards the result (a character vector) to rjsoncons::j_pivot(). This takes about 11.8s.
system.time({
res <-
readLines(pipe(
"gzcat data/gharchive_gz/2023-02-08-0.json.gz | \
jq --compact-output '{id, type}'")
) |>
j_pivot(as = "tibble")
}) # 11.8s
This is quite a flexible solution, but does require the command-line tools gzcat and jq.
The ndjson R package
ndjson advertises itself as a "Wicked-Fast Streaming 'JSON' ('ndjson') Reader". Performance was not compelling.
> system.time(df <- ndjson::stream_in(ndjson_file))
user system elapsed
583.550 12.727 604.492
> dim(df)
[1] 172049 7738
Note that one cannot filter the data, so all of it is read into R.
The ndjson GitHub page suggests much better performance, on a 100,000-record file with simpler structure, but the description of the source of that file was not sufficient for me to easily reconstruct it; it might be that the complexity of the sample NDJSON file used here is problematic. Also, the data is read into a data.table, and on my machine (aarch64-apple-darwin23.2.0) I know that data.table requires special steps to exploit multiple cores; I have not taken those steps, so this is single-core performance.
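A quick way to check the single-core claim (a sketch only; the real fix on this platform involves, I believe, building data.table with OpenMP support) is to ask data.table how many threads it will use:
## reports the number of threads data.table is configured to use;
## a value of 1 indicates single-core operation
data.table::getDTthreads(verbose = TRUE)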
The large number of columns recognized by ndjson (7738) suggests another possible performance hit: the JSON records are fully 'flattened', and the large number of columns does not really make sense in the context of the GH Archive data.
Performance in rjsoncons
My starting point with rjsoncons (version 1.1.0.9400) was not too promising:
> system.time(j_pivot(ndjson_file, '{id: id, type: type}', as = "tibble"))
user system elapsed
78.538 0.574 79.435
Better than ndjson, but 20 times (!) slower than DuckDB.
Original implementation
I want to use R's connections like gzfile(), url(), etc., so that I don't have to re-invent the data input step and so that I have flexibility (e.g., to read from https://, or indeed a pipe like the jq example above). So the basic strategy is to read data at the R level, then pass it to jsoncons for processing in C++.
My use cases mean that the output data (records of {"id": id, "type": type}) is much smaller than the input data. So to manage memory I chose to iterate through the input in chunks. The basic strategy is
## set up C++ data structures, then...
chunk_size <- 10000
con <- gzfile(ndjson_file, "rb")
repeat {
    lines <- readLines(con, chunk_size)
    if (length(lines) == 0L)
        break
    ## process 'lines' at the C++ level
}
close(con)
## retrieve final data from C++
This shows two distinct parts to the code: (1) R data input; and (2) C++ data transformation.
R data input
It takes almost 1/2 the total time just to read through the data, doing nothing else
chunk_size <- 10000
system.time({
    con <- gzfile(ndjson_file, "rb")
    repeat {
        lines <- readLines(con, chunk_size)
        if (length(lines) == 0L)
            break
    }
    close(con)
})
## user system elapsed
## 33.871 0.082 34.077
I know that R's handling of character vectors can be expensive, so I tried reading the data using readBin(); the binary data can be coerced to strings later, in C++.
chunk_size <- 2^20
system.time({
    con <- gzfile(ndjson_file, "rb")
    repeat {
        lines <- readBin(con, raw(), chunk_size)
        if (length(lines) == 0L)
            break
        ## cpp_buffer(lines) # coerce RAW() to std::string; return size()
    }
    close(con)
})
## user system elapsed
## 0.814 0.079 0.899
Wow, that's less than 1 second to read the entire file! This is comparable to the gzcat facility at the command line.
I modified the code to read the data as an R raw vector, passing this to C++. In C++ I wrote a class to parse this into newline-delimited records to be fed to jsoncons.
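The class itself lives in C++, but its bookkeeping can be sketched in R: accumulate raw bytes, emit the complete newline-delimited records in each chunk, and carry any trailing partial record over to the next chunk. The following is an illustration only (the variable names are mine); in the package this work happens at the C++ level.
con <- gzfile(ndjson_file, "rb")
carry <- raw()                                   # partial record from previous chunk
repeat {
    chunk <- readBin(con, raw(), 2^20)
    if (length(chunk) == 0L)
        break
    buffer <- c(carry, chunk)
    newlines <- which(buffer == as.raw(10L))     # positions of '\n'
    if (length(newlines) == 0L) {
        carry <- buffer                          # no complete record yet
        next
    }
    last <- newlines[length(newlines)]
    records <- strsplit(
        rawToChar(buffer[seq_len(last)]), "\n", fixed = TRUE
    )[[1]]
    ## 'records' are complete NDJSON lines, ready to be fed to jsoncons
    carry <- buffer[-seq_len(last)]              # keep the partial trailing record
}
## a final record without a trailing newline would still be in 'carry' here
close(con)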
C++ data transformation
To investigate the costs of C++ data transformation, I wrote a simplified data input and parsing function callable from R. The code uses the zlib compression library to read the data directly at the C level, and then forwards it to jsoncons for processing. The code, written for the cpp11 R / C++ interface, looked something like this:
[[cpp11::register]]
int cpp_test(std::string file_name)
{
    const int buffer_size = 1048576; // 2^20
    auto expr = jsoncons::jmespath::make_expression<jsoncons::json>(
        "{id: id, type: type}");
    char buffer[buffer_size];
    gzFile in_file = gzopen(file_name.c_str(), "rb");
    int n_lines = 0;
    while (true) {
        const char *line = gzgets(in_file, buffer, buffer_size);
        if (line == nullptr)
            break;
        ++n_lines;                                  // 0.77s
        const auto p = jsoncons::json::parse(line); // 29.0s
        const auto q = expr.evaluate(p);            // 31.3s
    }
    gzclose(in_file);
    return n_lines;
}
The comments represent the execution times for reading the data (0.77s), reading and parsing the data (29.0s), and reading, parsing, and evaluating the JMESpath expression on the parsed data (31.3s).
This was very discouraging. It suggested that jsoncons was just too slow to be competitive. So I prepared a stand-alone version of this program, thinking about opening an issue on the jsoncons repository. Here’s the program, with updated timings.
#include "jsoncons/json.hpp"
#include "jsoncons_ext/jmespath/jmespath.hpp"
#include <zlib.h>
const int buffer_size = 1048576; // 2^20
int main(int argc, char *argv[])
{
auto expr = jsoncons::jmespath::make_expression<jsoncons::json>(
"{id: id, type: type}");
char buffer[buffer_size];
gzFile in_file = gzopen(argv[1], "rb");
int n_lines = 0;
while (true) {
const char *line = gzgets(in_file, buffer, buffer_size);
if (line == nullptr)
break;
++n_lines; // 0.78s
const auto p = jsoncons::json::parse(line); // 7.01s
const auto q = expr.evaluate(p); // 7.94s
}
gzclose(in_file);
std::cout << n_lines << std::endl;
return 0;
}
Say what? The stand-alone program is about 4 times faster than the R / cpp11 code. This doesn't make sense; the interface with R is very light-weight (just passing a file path), and the C++ code is very comparable. Somehow I managed to spot the problem…
When compiling the stand-alone program, I used the -O3 flag, because I wanted aggressive compiler optimization.
g++ -std=gnu++17 -lz -I. -O3 j_zlib.cpp
In R I created a file ~/.R/Makevars that sets the same -O3 flag. I was using devtools::load_all(), and it generated the following compilation command (I wrapped the lines for easier reading)
> devtools::load_all()
...
- installing *source* package 'rjsoncons' ... (637ms)
** using staged installation
** libs
using C++ compiler: 'Apple clang version 15.0.0 (clang-1500.1.0.2.5)'
using SDK: 'MacOSX14.2.sdk'
...
g++ -std=gnu++17
-I"/Users/ma38727/bin/R-devel/include" -DNDEBUG
-I../inst/include/
-I'/Users/ma38727/Library/R/arm64/4.4-3.19/cpp11/include'
-I/opt/R/arm64/include -fPIC
-g -O3 -UNDEBUG -Wall -pedantic
-g -O0 -fdiagnostics-color=always -c rjsoncons.cpp -o rjsoncons.o
...
Note that the -O3 flag set in the Makevars file is present. But note also the later -O0 flag, telling the compiler to forget the -O3 flag and do no optimization. Where did that -O0 flag come from????
It turns out that devtools::load_all() compiles the DLL with pkgbuild::compile_dll(), and this function has an argument debug = TRUE. This inserts the -O0 flag. Aargh.
The solution is not as easy as setting an argument debug = FALSE in devtools::load_all() (there is no such argument). Instead, make sure to start with just the source files, compile the DLL, and then load the package
pkgbuild::clean_dll()
pkgbuild::compile_dll(debug = FALSE) # no unexpected -O0 flag
devtools::load_all()
Finally, we’re ready to benchmark our code
> system.time(res <- j_pivot(ndjson_file, '{id: id, type: type}', as = "tibble"))
user system elapsed
12.597 0.199 12.835
This is comparable to the jq + R solution. It is about 3.5 times slower than DuckDB, but still quite respectable. Using verbose = TRUE adds a counter that makes it seem like the time goes by faster…
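For reference, the counter is enabled by adding the argument to the call above, something like (assuming the development version spells the argument verbose, as above):
j_pivot(
    ndjson_file, '{id: id, type: type}',
    as = "tibble", verbose = TRUE
)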
Conclusions
This has been quite an interesting journey.
The performance and flexibility of DuckDB is really amazing.
Even though there is a small performance hit, I still like my decision to use R connections, giving robustness and flexibility for different data sources. I learned quite a bit about R, especially the heavy cost of working with character vectors; I wonder if I am doing something incorrect in using readBin() rather than readLines(), maybe restricting myself to particular encodings?
The need to use pkgbuild::compile_dll() to have better control over compilation was definitely a lesson learned.
A couple of things in cpp11 made me a little cautious, maybe in part reflecting my comparative ignorance of C++. For instance, code like this
std::vector<uint8_t> foo(100);
cpp11::raws raw;
std::copy(foo.begin(), foo.end(), raw.begin());
led to a segfault, because the iterator returned by raw.begin() did not 'stop' at raw.end(), but rather continued to write data beyond the end of the allocated array. Investigating this a little led me to an issue suggesting that functions returning writable types could leak memory. These two issues point to the underlying complexity of cpp11, and its relative newness in the R / C++ world.