Skip to contents

This article summarizes examples drawn from StackOverflow and elsewhere. Start by loading the rjsoncons and dplyr packages

Query and pivot

Data frame columns as NDJSON

https://stackoverflow.com/questions/76447100

This question presented a tibble where one column contained JSON expressions.

df
#> # A tibble: 2 × 2
#>   subject          response                                                     
#> * <chr>            <chr>                                                        
#> 1 dtv85251vucquc45 "{\"P0_Q0\":{\"aktiv\":2,\"bekümmert\":3,\"interessiert\":4,…
#> 2 mcj8vdqz7sxmjcr0 "{\"P0_Q0\":{\"aktiv\":1,\"bekümmert\":3,\"interessiert\":1,…

The goal was to extract the fields of each P0_Q0 element as new columns. The response column can be viewed as NDJSON, so we can use pivot(df$response, "P0_Q0") for the hard work, and bind_cols() to prepend subject

bind_cols(
    df |> select(subject),
    df |> pull(response) |> j_pivot("P0_Q0", as = "tibble")
)
#> # A tibble: 2 × 11
#>   subject aktiv bekümmert interessiert `freudig erregt` verärgert stark schuldig
#>   <chr>   <int>     <int>        <int>            <int>     <int> <int>    <int>
#> 1 dtv852…     2         3            4                2         2     0        1
#> 2 mcj8vd…     1         3            1                1         0     0        2
#> # ℹ 3 more variables: erschrocken <int>, feindselig <int>, angeregt <int>

My initial response was early in package development, and motivated j_pivot() as an easier way to perform the common operation of transforming a JSON array-of-objects to and R data frame.

Constructing a pivot object using JMESPath

https://stackoverflow.com/questions/78029215

This question had an array of objects, each of which had a single unique key-value pair.

json <- '{
    "encrypted_values":[
        {"name_a":"value_a"}, 
        {"name_b":"value_b"}, 
        {"name_c":"value_c"}
    ]
}'

The goal was to create a tibble with keys in one column, and values in another. jsonlite::fromJSON() or j_pivot(json, "encrypted_values") simplify the result to a tibble with a column for each object key, which is not desired.

jsonlite::fromJSON(json)$encrypted_values
#>    name_a  name_b  name_c
#> 1 value_a    <NA>    <NA>
#> 2    <NA> value_b    <NA>
#> 3    <NA>    <NA> value_c
j_pivot(json, "encrypted_values", as = "tibble")
#> # A tibble: 3 × 3
#>   name_a    name_b    name_c   
#>   <list>    <list>    <list>   
#> 1 <chr [1]> <NULL>    <NULL>   
#> 2 <NULL>    <chr [1]> <NULL>   
#> 3 <NULL>    <NULL>    <chr [1]>

Instead, write a JMESPath query that extracts an object with the keys as one element, and values as another. This uses @ to represent the current mode, and keys() and values() functions to extract associated elements. The trailing [] converts an array-of-arrays of keys (for example) to a simple array of keys.

query <- '{
    name : encrypted_values[].keys(@)[],
    value: encrypted_values[].values(@)[]
}'
j_pivot(json, query, as = "tibble")
#> # A tibble: 3 × 2
#>   name   value  
#>   <chr>  <chr>  
#> 1 name_a value_a
#> 2 name_b value_b
#> 3 name_c value_c

Constructing a pivot object using JMESPath: a second example

https://stackoverflow.com/questions/78727724

The question asks about creating a tibble from a complex JSON data structure. Here is a reproducible example to retrieve JSON; unfortunately the host does not resolve when run as a GitHub action, so the example here is not fully evaluated.

start_date <- end_date <- Sys.Date()

res <- httr::GET(
       url = "https://itc.aeso.ca/itc/public/api/v2/interchange",
       query = list(
           beginDate = format(start_date, "%Y%m%d"),
           endDate = format(end_date, "%Y%m%d"),
           Accept = "application/json"
       )
   )
json <- httr::content(res, as = "text", encoding = "UTF-8")

Explore the JSON using listviewer.

listviewer::jsonedit(json)

Write a query that extracts, directly from the JSON, some of the fields of interest. The queries are written using JMESPath. A simple example extracts the ‘date’ from the path return.BcIntertie.Allocations[].date

path <- 'return.BcIntertie.Allocations[].date'
j_query(json, path) |>
    str()
## chr "[\"2024-07-10\",\"2024-07-10\",\"2024-07-10\",\"2024-07-10\",\"2024-07-10\",\"2024-07-10\",\"2024-07-10\",\"202"| __truncated__

Expand on this by querying several different fields, and then re-formating the query into a new JSON object. Develop the code by querying / viewing until things look like a JSON array-of-objects

path <- paste0(
    'return.{',
        'date: BcIntertie.Allocations[].date,',
        'he: BcIntertie.Allocations[].he,',
        'bc_import: BcIntertie.Allocations[].import.atc,',
        'bc_export:  BcIntertie.Allocations[].export.atc,',
        'matl_import: MatlIntertie.Allocations[].import.atc,',
        'matl_export:  MatlIntertie.Allocations[].export.atc',
    '}'
)
j_query(json, path) |>
    listviewer::jsonedit()

Finally, run j_pivot() to transform the JSON to a tibble.

j_pivot(json, path, as = "tibble")

The result is

## # A tibble: 48 × 6
##    date       he    bc_import bc_export matl_import matl_export
##    <chr>      <chr>     <int>     <int>       <int>       <int>
##  1 2024-07-10 4           750       950         295         300
##  2 2024-07-10 5           750       950         295         300
##  3 2024-07-10 6           750       950         295         300
##  4 2024-07-10 7           750       950         295         300
##  5 2024-07-10 8           750       950         295         300
##  6 2024-07-10 9           750       950         295         300
##  7 2024-07-10 10          750       950         295         300
##  8 2024-07-10 11          750       950         295         300
##  9 2024-07-10 12          750       950         295         300
## 10 2024-07-10 13          750       950         295         300
## # ℹ 38 more rows
## # ℹ Use `print(n = ...)` to see more rows

Reshaping nested records

https://stackoverflow.com/questions/78952424

This question wants to transform JSON into a data.table. A previous answer uses rbindlist() (similar to dplyr::bind_rows()) to transform structured lists to data.tables. Here is the sample data

json <- '[
  {
    "version_id": "123456",
    "data": [
      {
        "review_id": "1",
        "rating": 5,
        "review": "This app is great",
        "date": "2024-09-01"
      },
      {
        "review_id": "2",
        "rating": 1,
        "review": "This app is terrible",
        "date": "2024-09-01"
      }
    ]
  },
  {
    "version_id": "789101",
    "data": [
      {
        "review_id": "3",
        "rating": 3,
        "review": "This app is OK",
        "date": "2024-09-01"
      }
    ]
  }
]'

The desired data.table is flattened to include version_id and each field of data[] as columns in the table, with the complication that version_id needs to be replicated for each element of data[].

The rjsoncons answer illustrates several approaches. The approach most cleanly separating data transformation and data.table construction using JMESPath creates an array of objects where each version_id is associated with vectors of review_id, etc., corresponding to that version.

query <-
    "[].{
        version_id: version_id,
        review_id: data[].review_id,
        rating: data[].rating,
        review: data[].review,
        date: data[].date
    }"

As an R object, this is exactly handled by rbindlist(). Note that a pivot is not involved.

records <- j_query(json, query, as = "R")
data.table::rbindlist(records)
#>    version_id review_id rating               review       date
#>        <char>    <char>  <int>               <char>     <char>
#> 1:     123456         1      5    This app is great 2024-09-01
#> 2:     123456         2      1 This app is terrible 2024-09-01
#> 3:     789101         3      3       This app is OK 2024-09-01

dplyr’s bind_rows() behaves similarly:

dplyr::bind_rows(records)
#> # A tibble: 3 × 5
#>   version_id review_id rating review               date      
#>   <chr>      <chr>      <int> <chr>                <chr>     
#> 1 123456     1              5 This app is great    2024-09-01
#> 2 123456     2              1 This app is terrible 2024-09-01
#> 3 789101     3              3 This app is OK       2024-09-01

Reading from URLs

https://stackoverflow.com/questions/78023560

This question illustrates rjsoncons ability to read URLs; the query itself extracts from the fixtures array of objects specific nested elements, and is similar to the previous question. In practice, I used json <- readLines(url) to create a local copy of the data to use while developing the query.

url <- "https://www.nrl.com/draw//data?competition=111&season=2024"
query <- 'fixtures[].{
    homeTeam: homeTeam.nickName, 
    awayTeam: awayTeam.nickName
}'
j_pivot(url, query, as = "tibble")
#> # A tibble: 4 × 2
#>   homeTeam awayTeam  
#>   <chr>    <chr>     
#> 1 Panthers Roosters  
#> 2 Storm    Sharks    
#> 3 Cowboys  Knights   
#> 4 Bulldogs Sea Eagles

The easiest path to a more general answer (extract all members of ‘homeTeam’ and ‘awayTeam’ as a tibble) might, like the posted answer, combine JSON extraction and tidyr.

query <- 'fixtures[].{ homeTeam: homeTeam, awayTeam: awayTeam }'
j_pivot(url, query, as = "tibble") |>
    tidyr::unnest_wider(c("homeTeam", "awayTeam"), names_sep = "_")
#> # A tibble: 4 × 8
#>   homeTeam_teamId homeTeam_nickName homeTeam_teamPosition homeTeam_theme  
#>             <int> <chr>             <chr>                 <list>          
#> 1          500014 Panthers          2nd                   <named list [2]>
#> 2          500021 Storm             1st                   <named list [2]>
#> 3          500012 Cowboys           5th                   <named list [2]>
#> 4          500010 Bulldogs          6th                   <named list [2]>
#> # ℹ 4 more variables: awayTeam_teamId <int>, awayTeam_nickName <chr>,
#> #   awayTeam_teamPosition <chr>, awayTeam_theme <list>

Deeply nested objects

https://stackoverflow.com/questions/77998013

The details of this question are on StackOverflow, and the following code chunks are not evaluated directly. The example has several interesting elements:

  • The JSON is quite large (about 90 Mb), so processing is not immediate. While developing the query, I focused on a subset of the data for a more interactive experience.

    Crime2013 <- j_query(json, "x.calls[9].args")
  • JSON array indexing is 0-based, in contrast to 1-based R indexing.

  • In developing the JSON query, I spent quite a bit of time viewing results using listviewer::jsonedit(), e.g.,

    listviewer::jsonedit(j_query(Crime2013, "[0][*]"))

The objects of interest are polygon coordinates nested deeply in the JSON, at the location implied by JMESPath. One polygon is at

query <- "x.calls[9].args[0][0][0][0]"
j_pivot(json, query, as = "tibble")
## # A tibble: 27 × 2
##      lng   lat
##    <dbl> <dbl>
##  1 -43.3 -22.9
##  2 -43.3 -22.9
##  3 -43.3 -22.9
##  4 -43.3 -22.9
##  5 -43.3 -22.9
##  6 -43.3 -22.9
##  7 -43.3 -22.9
##  8 -43.3 -22.9
##  9 -43.3 -22.9
## 10 -43.3 -22.9
## # ℹ 17 more rows
## # ℹ Use `print(n = ...)` to see more rows

There are 3618 of these polygons, and they are extracted by using a wild-card * in place of an index 0 at a particular place in the path.

query <- "x.calls[9].args[0][*][0][0]"
j_pivot(json, query, as = "tibble")
## # A tibble: 3,618 × 2
##    lng        lat       
##    <list>     <list>    
##  1 <dbl [27]> <dbl [27]>
##  2 <dbl [6]>  <dbl [6]> 
##  3 <dbl [5]>  <dbl [5]> 
##  4 <dbl [42]> <dbl [42]>
##  5 <dbl [6]>  <dbl [6]> 
##  6 <dbl [8]>  <dbl [8]> 
##  7 <dbl [4]>  <dbl [4]> 
##  8 <dbl [6]>  <dbl [6]> 
##  9 <dbl [13]> <dbl [13]>
## 10 <dbl [12]> <dbl [12]>
## # ℹ 3,608 more rows
## # ℹ Use `print(n = ...)` to see more rows

tidyr::unnest() could be used to create a ‘long’ version of this result, as illustrated in my response.

While exploring the data, the JMESPath function length() suggested two anomalies in the data (not all paths in Crime2013 have just one polygon; a path used to identify polygons has only 2022 elements, but there are 3618 polygons); this could well be a misunderstanding on my part.

JSONPath wildcards

https://stackoverflow.com/questions/78029215

This question retrieves a JSON representation of the hierarchy of departments within German research institutions. The interest is in finding the path between two departments.

The answer posted on StackOverflow translates the JSON list-of-lists structure describing the hierarchy into an R list-of-lists and uses a series of complicated manipulations to form a tibble suitable for querying.

The approach here recognizes this as a graph-based problem. The goal is to construct a graph from JSON, and then use graph algorithms to identify the shortest path between nodes.

I followed @margusl to extract JSON from the web page.

library(rvest)
html <- read_html("https://www.gerit.org/en/institutiondetail/10282")

## scrape JSON
xpath <- '//script[contains(text(),"window.__PRELOADED_STATE__")]'
json <- 
  html |>
  html_element(xpath = xpath) |>
  html_text() |>
  sub(pattern = "window.__PRELOADED_STATE__ = ", replacement = "", fixed = TRUE)

I then used listviewer to explore the JSON interactively, to get a feel for the structure.

listviewer::jsonedit(json)

Obtain the root of the institutional hierarchy with a simple path traversal, using JSONPath syntax. JSONPath provides ‘wild-card’ syntax, which is convenient when querying hierarchical data representations.

library(rjsoncons)
library(dplyr)
tree <- j_query(json, "$.institutionDetail.institution.tree")

It looks like each institution is a (directed, acyclic) graph, with nodes representing each division in the institution.

The nodes are easy to reconstruct. Look for keys id, name.de, and name.enusing the wild-card .., meaning ‘at any depth’. Each of these queries returns a vector. Use them to construct a tibble. Although the id key is an integer in the JSON, it seems appropriate to think of these as character-valued.

## these are all nodes, including the query node; each node has an
## 'id' and german ('de') and english ('en') name.
nodes <- tibble(
    ## all keys 'id' , 'name.de', and 'name.en', under 'tree' with
    ## wild-card '..' matching
    id =
        j_query(tree, "$..id", as = "R") |>
        as.character(),
    de = j_query(tree, "$..name.de", as = "R"),
    en = j_query(tree, "$..name.en", as = "R")
)

There are 497 nodes in this institution.

nodes
#> # A tibble: 497 × 3
#>    id        de                                                            en   
#>    <chr>     <chr>                                                         <chr>
#>  1 10282     "Universität zu Köln"                                         Univ…
#>  2 14036     "Fakultät 1: Wirtschafts- und Sozialwissenschaftliche Fakult… Facu…
#>  3 176896857 "Cologne Graduate School in Management, \rEconomics and Soci… Colo…
#>  4 555102855 "Cologne Institute for Information Systems (CIIS)"            Colo…
#>  5 555936524 "Chair of Business Analytics"                                 Chai…
#>  6 537375237 "ECONtribute: Markets & Public Policy"                        ECON…
#>  7 439201502 "Fachbereich Volkswirtschaftslehre"                           Econ…
#>  8 202057753 "Center for Macroeconomic Research (CMR)"                     Cent…
#>  9 228480345 "Seminar für Energiewissenschaft"                             Chai…
#> 10 237641982 "Seminar für Experimentelle Wirtschafts- und Verhaltensforsc… Semi…
#> # ℹ 487 more rows

The edges are a little tricky to reconstruct from the nested structure in the JSON. Start with the id and number of children of each node.

id <-
    j_query(tree, "$..id", as = "R") |>
    as.character()
children_per_node <- j_query(tree, "$..children.length", as = "R")

I developed the edgelist() function (in the folded code chunk below) to transform id and children_per_node into a two-column matrix of from-to relations. In the function, parent_id and n_children are stacks used to capture the hierarchical structure of the data. level represents the current level in the hierarchy. The algorithm walks along the id input, records the from/to relationship implied by n, and then completes that level of the hierarchy (if n_children[level] == 0), pushes the next level onto the stack, or continues to the next id.

edgelist <- function(id, n) {
    stopifnot(identical(length(id), length(n)))
    parent_id <- n_children <- integer()
    from <- to <- integer(length(id) - 1L)
    level <- 0L
    for (i in seq_along(id)) {
        if (i > 1) {
            ## record link from parent to child
            from[i - 1] <- tail(parent_id, 1L)
            to[i - 1] <- id[i]
            n_children[level] <- n_children[level] - 1L
        }
        if (level > 0 && n_children[level] == 0L) {
            ## 'pop' level
            level <- level - 1L
            parent_id <- head(parent_id, -1L)
            n_children <- head(n_children, -1L)
        }
        if (n[i] != 0) {
            ## 'push' level
            level <- level + 1L
            parent_id <- c(parent_id, id[i])
            n_children <- c(n_children, n[i])
        }
    }
    tibble(from, to)
}

The matrix of edges, with columns from and to, is then

edges <- edgelist(id, children_per_node)
edges
#> # A tibble: 496 × 2
#>    from      to       
#>    <chr>     <chr>    
#>  1 10282     14036    
#>  2 14036     176896857
#>  3 14036     555102855
#>  4 555102855 555936524
#>  5 14036     537375237
#>  6 14036     439201502
#>  7 439201502 202057753
#>  8 439201502 228480345
#>  9 439201502 237641982
#> 10 439201502 352878120
#> # ℹ 486 more rows

The computation is tricky, but not too inefficient. There are only two queries of the JSON object (for id and children_per_node) and the R iteration over elements of id is not too extensive.

I used the tidygraph package to represent the graph from the nodes and edges.

tg <- tidygraph::tbl_graph(nodes = nodes, edges = edges, node_key = "id")
tg
#> # A tbl_graph: 497 nodes and 496 edges
#> #
#> # A rooted tree
#> #
#> # Node Data: 497 × 3 (active)
#>    id        de                                                            en   
#>    <chr>     <chr>                                                         <chr>
#>  1 10282     "Universität zu Köln"                                         Univ…
#>  2 14036     "Fakultät 1: Wirtschafts- und Sozialwissenschaftliche Fakult… Facu…
#>  3 176896857 "Cologne Graduate School in Management, \rEconomics and Soci… Colo…
#>  4 555102855 "Cologne Institute for Information Systems (CIIS)"            Colo…
#>  5 555936524 "Chair of Business Analytics"                                 Chai…
#>  6 537375237 "ECONtribute: Markets & Public Policy"                        ECON…
#>  7 439201502 "Fachbereich Volkswirtschaftslehre"                           Econ…
#>  8 202057753 "Center for Macroeconomic Research (CMR)"                     Cent…
#>  9 228480345 "Seminar für Energiewissenschaft"                             Chai…
#> 10 237641982 "Seminar für Experimentelle Wirtschafts- und Verhaltensforsc… Semi…
#> # ℹ 487 more rows
#> #
#> # Edge Data: 496 × 2
#>    from    to
#>   <int> <int>
#> 1     1     2
#> 2     2     3
#> 3     2     4
#> # ℹ 493 more rows

I then used convert() to find the graph with the shortest path between two nodes, and extracted the tibble of nodes.

tg |>
    ## find the shortest path between two nodes...
    tidygraph::convert(
        tidygraph::to_shortest_path,
        de == "Fakultät 1: Wirtschafts- und Sozialwissenschaftliche Fakultät",
        de == "Professur Hölzl"
    ) |>
    ## extract the nodes from the resulting graph
    as_tibble("nodes") |>
    select(id, de)
#> # A tibble: 5 × 2
#>   id        de                                                           
#>   <chr>     <chr>                                                        
#> 1 14036     Fakultät 1: Wirtschafts- und Sozialwissenschaftliche Fakultät
#> 2 222785602 Fakultätsbereich Soziologie und Sozialpsychologie            
#> 3 18065     Institut für Soziologie und Sozialpsychologie (ISS)          
#> 4 18068     Lehrstuhl für Wirtschafts- und Sozialpsychologie             
#> 5 309061914 Professur Hölzl

This is the answer to the question posed in the StackOverflow post.

It can be fun to try and visualize the graph, e.g., using ggraph

ggraph::ggraph(tg, "tree", circular = TRUE) +
    ggraph::geom_edge_elbow()

Patch

Moving elements

https://stackoverflow.com/questions/78047988

The example is not completely reproducible, but the challenge is that the igraph package produces JSON like

data <- '
{
  "nodes": [
     {
        "name": "something"
     },
     {
        "name": "something_else"
     }
  ],
  "links": [
      {
         "source": "something",
         "target": "something_else"
      }
   ],
  "attributes": {
      "directed": false
  }
}'

but the desired data moves the ‘directed’ attribute to a top-level field. From the JSON patch documentation, the patch is a single ‘move’ operation from /attributes/directed to the top-level /:

patch <- '[
    {"op": "move", "from": "/attributes/directed", "path": "/directed"}
]'

The patch is accomplished with

patched_data <- j_patch_apply(data, patch)

This JSON string could be visualized with listviwer::jsonedit(patched_data), or patched_data |> as_r() |> str(), or

patched_data |>
    jsonlite::prettify()
#> {
#>     "nodes": [
#>         {
#>             "name": "something"
#>         },
#>         {
#>             "name": "something_else"
#>         }
#>     ],
#>     "links": [
#>         {
#>             "source": "something",
#>             "target": "something_else"
#>         }
#>     ],
#>     "attributes": {
#> 
#>     },
#>     "directed": false
#> }
#> 

Session information

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] rvest_1.0.4          dplyr_1.1.4          rjsoncons_1.3.1.9100
#> 
#> loaded via a namespace (and not attached):
#>  [1] viridis_0.6.5      sass_0.4.9         utf8_1.2.4         generics_0.1.3    
#>  [5] tidyr_1.3.1        xml2_1.3.6         digest_0.6.37      magrittr_2.0.3    
#>  [9] evaluate_0.24.0    grid_4.4.1         fastmap_1.2.0      jsonlite_1.8.8    
#> [13] ggrepel_0.9.6      gridExtra_2.3      httr_1.4.7         purrr_1.0.2       
#> [17] fansi_1.0.6        viridisLite_0.4.2  scales_1.3.0       tweenr_2.0.3      
#> [21] textshaping_0.4.0  jquerylib_0.1.4    cli_3.6.3          graphlayouts_1.1.1
#> [25] rlang_1.1.4        polyclip_1.10-7    tidygraph_1.3.1    munsell_0.5.1     
#> [29] withr_3.0.1        cachem_1.1.0       yaml_2.3.10        tools_4.4.1       
#> [33] memoise_2.0.1      colorspace_2.1-1   ggplot2_3.5.1      curl_5.2.2        
#> [37] vctrs_0.6.5        R6_2.5.1           lifecycle_1.0.4    fs_1.6.4          
#> [41] MASS_7.3-60.2      ragg_1.3.2         ggraph_2.2.1       pkgconfig_2.0.3   
#> [45] desc_1.4.3         pkgdown_2.1.0      pillar_1.9.0       bslib_0.8.0       
#> [49] gtable_0.3.5       data.table_1.16.0  glue_1.7.0         Rcpp_1.0.13       
#> [53] ggforce_0.4.2      systemfonts_1.1.0  highr_0.11         xfun_0.47         
#> [57] tibble_3.2.1       tidyselect_1.2.1   knitr_1.48         farver_2.1.2      
#> [61] htmltools_0.5.8.1  igraph_2.0.3       labeling_0.4.3     rmarkdown_2.28    
#> [65] compiler_4.4.1