Case Study: CELLxGENE Ontologies

This case study illustrates use of OLSr to facilitate data set discovery in the CELLxGENE data portal. The CELLxGENE data portal provides a graphical user interface to collections of single-cell sequence data processed in standard ways to ‘count matrix’ summaries. The cellxgenedp R / Bioconductor package provides an R-based interface, allowing data discovery, viewing, and downloading.

This vignette uses the cellxgenedp, igraph, and visNetwork packages. Install these with the following command

if (!"BiocManager" %in% rownames(installed.packages()))
    install.packages("BiocManager", repos = "https://cloud.r-project.org")
pkgs <- c("cellxgenedp", "igraph", "visNetwork")
required_pkgs <- pkgs[!pkgs %in% rownames(installed.packages())]
BiocManager::install(required_pkgs)

Start by loading the OLSr and cellxgenedp packages

library(OLSr)
library(cellxgenedp)

CELLxGENE

A central element of the CELLxGENE resource is the collection of datasets. These are discovered programmatically using

db <- db()
datasets(db)
#> # A tibble: 1,720 × 33
#>    dataset_id   dataset_version_id collection_id donor_id assay  batch_condition
#>    <chr>        <chr>              <chr>         <list>   <list> <list>         
#>  1 1f1c5c14-59… 2afef4bd-99af-41f… 2902f08c-f83… <chr>    <list> <lgl [1]>      
#>  2 f9ad5649-f3… d9db936c-41c6-439… 180bff9c-c8a… <chr>    <list> <lgl [1]>      
#>  3 cd77258f-b0… 000198ac-27c7-4b9… 180bff9c-c8a… <chr>    <list> <lgl [1]>      
#>  4 bdacc907-7c… 017837df-b8be-4a4… 180bff9c-c8a… <chr>    <list> <lgl [1]>      
#>  5 b94e3bdf-a3… 3512034b-9a5c-4db… 180bff9c-c8a… <chr>    <list> <lgl [1]>      
#>  6 9f1049ac-f8… bb8f672d-593d-482… 180bff9c-c8a… <chr>    <list> <lgl [1]>      
#>  7 873ff933-4f… a7d3ba30-903d-44f… 180bff9c-c8a… <chr>    <list> <lgl [1]>      
#>  8 75a881cf-5d… 50d7154f-e5b9-48a… 180bff9c-c8a… <chr>    <list> <lgl [1]>      
#>  9 6c600df6-dd… 1f97f38b-2a9e-48a… 180bff9c-c8a… <chr>    <list> <lgl [1]>      
#> 10 2727d83a-0a… dd0627a2-4363-46e… 180bff9c-c8a… <chr>    <list> <lgl [1]>      
#> # ℹ 1,710 more rows
#> # ℹ 27 more variables: cell_count <int>, cell_type <list>, citation <chr>,
#> #   default_embedding <chr>, development_stage <list>, disease <list>,
#> #   embeddings <list>, explorer_url <chr>, feature_biotype <list>,
#> #   feature_count <int>, feature_reference <list>, is_primary_data <list>,
#> #   mean_genes_per_cell <dbl>, organism <list>, primary_cell_count <int>,
#> #   raw_data_location <chr>, schema_version <chr>, …

Data sets are classified using ontologies exposed by cellxgenedp to the user as ‘facets’.

facets <- facets(db)
facets
#> # A tibble: 1,966 × 4
#>    facet label                          ontology_term_id     n
#>    <chr> <chr>                          <chr>            <int>
#>  1 assay 10x 3' v3                      EFO:0009922        807
#>  2 assay 10x 3' v2                      EFO:0009899        364
#>  3 assay Visium Spatial Gene Expression EFO:0010961        298
#>  4 assay Slide-seqV2                    EFO:0030062        240
#>  5 assay 10x 5' v1                      EFO:0011025        118
#>  6 assay Smart-seq2                     EFO:0008931         77
#>  7 assay 10x multiome                   EFO:0030059         72
#>  8 assay 10x 5' v2                      EFO:0009900         58
#>  9 assay sci-RNA-seq3                   EFO:0030028         48
#> 10 assay Drop-seq                       EFO:0008722         27
#> # ℹ 1,956 more rows

Some insight into the ontologies used to construct facets can be gained by extracting the ontology from the ontology_term_id.

facets |>
    ## a small number of terms are not from any ontology; these do not
    ## have a ':' in the ontology_term_id
    filter(grepl(":", ontology_term_id, fixed = TRUE)) |> 
    ## for the remainder, discover the ontology from the ontology_term_id
    mutate(ontology = tolower(sub("(.*):.*", "\\1", ontology_term_id))) |>
    count(facet, ontology)
#> # A tibble: 15 × 3
#>    facet                   ontology                     n
#>    <chr>                   <chr>                    <int>
#>  1 assay                   efo                         42
#>  2 cell_type               cl                         971
#>  3 development_stage       hsapdv                     181
#>  4 development_stage       mmusdv                      68
#>  5 development_stage       uberon                       5
#>  6 disease                 mondo                      147
#>  7 disease                 pato                         1
#>  8 organism                ncbitaxon                    7
#>  9 self_reported_ethnicity hancestro                   34
#> 10 self_reported_ethnicity hancestro:0005,hancestro     1
#> 11 self_reported_ethnicity hancestro:0013,hancestro     1
#> 12 self_reported_ethnicity hancestro:0014,hancestro     1
#> 13 sex                     pato                         2
#> 14 tissue                  cl                          13
#> 15 tissue                  uberon                     487

Our example will work with the ‘disease’ facet, based on the ‘mondo’ ontology. Check that the ontology is in the OLS.

"mondo" %in% get_ontologies()$id
#> [1] TRUE

Retrieve all terms from the ontology.

mondo <- get_terms("mondo")
mondo
#> # A tibble: 26,015 × 14
#>    label     obo_id description iri   synonyms annotation   has_children is_root
#>    <chr>     <chr>  <chr>       <chr> <list>   <list>       <lgl>        <lgl>  
#>  1 disease   MONDO… "A disease… http… <list>   <named list> TRUE         FALSE  
#>  2 adrenoco… MONDO… "An endocr… http… <list>   <named list> TRUE         FALSE  
#>  3 alopecia… MONDO…  NA         http… <NULL>   <named list> TRUE         FALSE  
#>  4 inherite… MONDO…  NA         http… <list>   <named list> TRUE         FALSE  
#>  5 colorbli… MONDO… "Reason of… http… <NULL>   <named list> TRUE         FALSE  
#>  6 classic … MONDO… "A genetic… http… <list>   <named list> TRUE         FALSE  
#>  7 nocturna… MONDO… "Urination… http… <list>   <named list> TRUE         FALSE  
#>  8 infantil… MONDO… "OMIM seri… http… <list>   <named list> TRUE         FALSE  
#>  9 sleep-re… MONDO…  NA         http… <list>   <named list> TRUE         FALSE  
#> 10 febrile … MONDO…  NA         http… <list>   <named list> TRUE         FALSE  
#> # ℹ 26,005 more rows
#> # ℹ 6 more variables: short_form <chr>, in_subset <list>,
#> #   obo_definition_citation <list>, obo_xref <list>, obo_synonym <list>,
#> #   `_links` <list>

There are 26015 terms in this ontology; only a small subset of 148 terms are used in CELLxGENE datasets. Annotate the disease facet with information from the OLS.

disease_terms <-
    facets(db, "disease") |>
    ## join based on identity between CxG's 'ontology_term_id' and
    ## OLS's 'obo_id'
    left_join(mondo, by = c("label", ontology_term_id = "obo_id")) |>
    ## for consistency with OLSr convention, rename the
    ## 'ontology_term_id' to 'obo_id'
    rename(obo_id = "ontology_term_id")
disease_terms
#> # A tibble: 148 × 16
#>    facet label obo_id     n description iri   synonyms annotation   has_children
#>    <chr> <chr> <chr>  <int> <chr>       <chr> <list>   <list>       <lgl>       
#>  1 dise… norm… PATO:…  1448 NA          NA    <NULL>   <NULL>       NA          
#>  2 dise… COVI… MONDO…    66 A disease … http… <list>   <named list> FALSE       
#>  3 dise… deme… MONDO…    50 Loss of in… http… <list>   <named list> TRUE        
#>  4 dise… brea… MONDO…    34 A primary … http… <list>   <named list> TRUE        
#>  5 dise… myoc… MONDO…    29 Gross necr… http… <list>   <named list> TRUE        
#>  6 dise… diab… MONDO…    26 Progressiv… http… <list>   <named list> FALSE       
#>  7 dise… Alzh… MONDO…    24 A progress… http… <list>   <named list> TRUE        
#>  8 dise… auto… MONDO…    24 Autosomal … http… <list>   <named list> TRUE        
#>  9 dise… nonp… MONDO…    19 NA          http… <list>   <named list> TRUE        
#> 10 dise… colo… MONDO…    17 A primary … http… <list>   <named list> TRUE        
#> # ℹ 138 more rows
#> # ℹ 7 more variables: is_root <lgl>, short_form <chr>, in_subset <list>,
#> #   obo_definition_citation <list>, obo_xref <list>, obo_synonym <list>,
#> #   `_links` <list>

Note that the ‘normal’ term with OBO id ‘PATO:0000461’ used by CELLxGENE is not from the ‘mondo’ ontology; we could discover information about this term by also retrieving the ‘pato’ ontology from the OLS, or using get_term("pato", "PATO:0000461").

Identifying carcinoma datasets

CELLxGENE presents the disease facet as a flat list of terms, although the datasets are presented with an ad hoc filter to group terms and facilitate selection of datasets containing samples from multiple terms. A more principled approach is to use the ontology to identify terms of interest.

Suppose one is interested in all datasets annotated with disease term derived from ‘carcinoma’; this term does not itself appear in the disease facet. Find the obo_id of the ‘carcinoma’ term in the the mondo ontology, and use that to find descendants.

carcinoma_id <-
    mondo |>
    filter(label == "carcinoma") |>
    pull("obo_id")
carcinoma <- get_descendants("mondo", carcinoma_id)

There are 973 terms descending from ‘carcinoma’. Terms used by the disease facet are

disease_terms |>
    inner_join(carcinoma) |>
    select(obo_id, n, label, description)
#> Joining with `by = join_by(label, obo_id, description, iri, synonyms,
#> annotation, has_children, is_root, short_form, in_subset,
#> obo_definition_citation, obo_xref, obo_synonym, `_links`)`
#> # A tibble: 22 × 4
#>    obo_id            n label                             description            
#>    <chr>         <int> <chr>                             <chr>                  
#>  1 MONDO:0007763    19 nonpapillary renal cell carcinoma NA                     
#>  2 MONDO:0005061    12 lung adenocarcinoma               A carcinoma that arise…
#>  3 MONDO:0008433    12 small cell lung carcinoma         Small cell lung cancer…
#>  4 MONDO:0004953    11 invasive ductal breast carcinoma  The most common type o…
#>  5 MONDO:0020804    10 basal cell carcinoma              A carcinoma involving …
#>  6 MONDO:0005051     9 invasive lobular breast carcinoma An infiltrating lobula…
#>  7 MONDO:0005086     9 renal cell carcinoma              A carcinoma that arise…
#>  8 MONDO:0004989     8 breast carcinoma                  A carcinoma that arise…
#>  9 MONDO:0024885     5 malignant ovarian serous tumor    An invasive malignant …
#> 10 MONDO:0005097     4 squamous cell lung carcinoma      A carcinoma arising fr…
#> # ℹ 12 more rows

The facets_filter() function of cellxgenedp makes it easy to identify datasets using terms derived from ‘carcinoma’, perhaps in addition filtering to single type of assay.

datasets_of_interest <-
    datasets(db) |>
    filter(
        facets_filter(disease, "ontology_term_id", carcinoma$obo_id),
        facets_filter(assay, "label", "10x 3' v3")
    )
datasets_of_interest
#> # A tibble: 34 × 33
#>    dataset_id   dataset_version_id collection_id donor_id assay  batch_condition
#>    <chr>        <chr>              <chr>         <list>   <list> <list>         
#>  1 b6b5ea88-e0… 915069db-1df2-49a… 3c34e6f1-682… <chr>    <list> <lgl [1]>      
#>  2 e3a7e927-26… 56bbc63f-88a9-4e8… 4796c91c-9d8… <chr>    <list> <lgl [1]>      
#>  3 b252b015-b4… 73fbcec3-f602-4e1… 4796c91c-9d8… <chr>    <list> <lgl [1]>      
#>  4 97d9238c-1a… bcdad03f-c1d3-4c5… 4796c91c-9d8… <chr>    <list> <lgl [1]>      
#>  5 44c93f2b-dd… c8ceafd0-17c9-4da… 4796c91c-9d8… <chr>    <list> <lgl [1]>      
#>  6 0caedec7-1c… ffc98fde-61c6-43e… 4796c91c-9d8… <chr>    <list> <lgl [1]>      
#>  7 d41f45c1-1b… 32149a2b-b637-481… efd94500-1fd… <chr>    <list> <lgl [1]>      
#>  8 e500acbf-f1… a3a5125b-2e73-4a7… 7dd599c5-d25… <chr>    <list> <chr [1]>      
#>  9 5d3fc988-76… 7fd0b766-0d34-4b8… 7dd599c5-d25… <chr>    <list> <chr [1]>      
#> 10 3e4e2c8e-cd… 5f902c20-a2a0-47c… 7dd599c5-d25… <chr>    <list> <chr [1]>      
#> # ℹ 24 more rows
#> # ℹ 27 more variables: cell_count <int>, cell_type <list>, citation <chr>,
#> #   default_embedding <chr>, development_stage <list>, disease <list>,
#> #   embeddings <list>, explorer_url <chr>, feature_biotype <list>,
#> #   feature_count <int>, feature_reference <list>, is_primary_data <list>,
#> #   mean_genes_per_cell <dbl>, organism <list>, primary_cell_count <int>,
#> #   raw_data_location <chr>, schema_version <chr>, …

There are 34 datasets. Further steps, such as visualization using CELLxGENE or downloading datasets for local analysis, are described in the cellxgenedp vignette.

Ontologies as graphs

It can be useful to compute on ontologies as graphs. Use get_descendants_graph() to retrieve, recursively, the children of the ‘carcinoma’ term. This step can take one or two minutes, and requires one API call per child. The API calls are memoised, so the process is time consuming only the first time the graph is constructed.

carcinoma_graph <-
    ## many calls to the OLS API; slow the first time
    get_descendants_graph("mondo", carcinoma_id, mondo)
#> 0 relatives visited; 1 in queue
#> 973 relatives visited; 0 in queue

get_descendants_graph() returns a olsr_graph object, which contains information about the type of graph (ancestors or descendants), the ontology used to construct the graph, and the ids used to seed the query.

carcinoma_graph
#> class: descendants_graph (974 nodes x 1571 edges)
#> ontology: mondo
#> ids (1): MONDO:0004993

The object can be queried from relevant components, for instance annotation of each node

olsr_graph_nodes(carcinoma_graph)
#> # A tibble: 974 × 14
#>    obo_id     label description iri   synonyms annotation   has_children is_root
#>    <chr>      <chr> <chr>       <chr> <list>   <list>       <lgl>        <lgl>  
#>  1 MONDO:000… carc… A malignan… http… <list>   <named list> TRUE         FALSE  
#>  2 MONDO:085… lymp… A lymph no… http… <NULL>   <named list> FALSE        FALSE  
#>  3 MONDO:004… glyc… A carcinom… http… <list>   <named list> TRUE         FALSE  
#>  4 MONDO:004… inva… A carcinom… http… <list>   <named list> TRUE         FALSE  
#>  5 MONDO:002… seco… A carcinom… http… <list>   <named list> TRUE         FALSE  
#>  6 MONDO:002… carc… A carcinom… http… <list>   <named list> TRUE         FALSE  
#>  7 MONDO:002… comb… A malignan… http… <list>   <named list> TRUE         FALSE  
#>  8 MONDO:002… basa… A carcinom… http… <list>   <named list> TRUE         FALSE  
#>  9 MONDO:001… mali… An invasiv… http… <list>   <named list> TRUE         FALSE  
#> 10 MONDO:001… chor… A malignan… http… <list>   <named list> TRUE         FALSE  
#> # ℹ 964 more rows
#> # ℹ 6 more variables: short_form <chr>, in_subset <list>,
#> #   obo_definition_citation <list>, obo_xref <list>, obo_synonym <list>,
#> #   `_links` <list>

Graph manipulation

There are 974 nodes in this graph, but only a dozen or so terms used in the CELLxGENE disease facet. We would like to compute on the graph so that it contains only the nodes involved in paths from the ‘carcinoma’ term to terms used by the disease facet.

Start by coercing the olsr_graph to an object that can be used by the igraph package.

ig <- olsr_graph_as_igraph(carcinoma_graph)

olsr_graph_as_igraph() uses the nodes of carcinoma_graph as vertex attributes, and adds information about the relationship, ontology, and id(s) as a graph attribute. Note that the OLS field obo_id has been translated to the igraph attribute name.

igraph::vertex_attr_names(ig)
#>  [1] "name"                    "label"                  
#>  [3] "description"             "iri"                    
#>  [5] "synonyms"                "annotation"             
#>  [7] "has_children"            "is_root"                
#>  [9] "short_form"              "in_subset"              
#> [11] "obo_definition_citation" "obo_xref"               
#> [13] "obo_synonym"             "_links"
igraph::graph_attr(ig, "olsr_graph") |>
    str()
#> List of 3
#>  $ relation: chr "descendants"
#>  $ ontology: chr "mondo"
#>  $ ids     : chr "MONDO:0004993"

Identifying a subgraph of ig corresponds (I think!) to calculating a minimal directed Steiner tree. An algorithm for this does not exist in igraph or other R packages, so we must develop our own. The solution defined in the directed_steiner_tree_approx() function is to find the shortest paths from the ‘carcinoma’ node to each term in the disease facet, and then to create a subgraph containing just those nodes.

directed_steiner_tree_approx <-
    function(graph, from, to)
{
    ## only an approximation! Use the union of all nodes on all
    ## shortest paths from 'from' to each of 'to'

    ## find shortest paths from node 'from' to each 'to' node
    paths <- igraph::shortest_paths(ig, from, to)$vpath

    ## identify nodes on the shortest paths
    shortest_paths_nodes <-
        lapply(paths, names) |>
        unlist() |>
        unique()

    ## create an induced subgraph on the shortest-path nodes
    igraph::induced_subgraph(ig, shortest_paths_nodes)
}

This solution is not necessarily efficient, and does not guarantee a minimal tree, but it is perhaps good enough for the size of data encountered in a typical ontology.

To use the algorithm, identify the starting node and the relevant nodes of the disease facet, and use these as inputs

keep_ids <- c(
    carcinoma_id,
    intersect(disease_terms$obo_id, carcinoma$obo_id)
)
ig1 <- directed_steiner_tree_approx(ig, head(keep_ids, 1), tail(keep_ids, -1))
ig1
#> IGRAPH 4a81c64 DN-- 39 44 -- 
#> + attr: olsr_graph (g/x), name (v/c), label (v/c), description (v/c),
#> | iri (v/c), synonyms (v/x), annotation (v/x), has_children (v/l),
#> | is_root (v/l), short_form (v/c), in_subset (v/x),
#> | obo_definition_citation (v/x), obo_xref (v/x), obo_synonym (v/x),
#> | _links (v/x)
#> + edges from 4a81c64 (vertex names):
#>  [1] MONDO:0004993->MONDO:0040677 MONDO:0004993->MONDO:0020804
#>  [3] MONDO:0004993->MONDO:0018364 MONDO:0004993->MONDO:0006406
#>  [5] MONDO:0004993->MONDO:0006181 MONDO:0004993->MONDO:0005206
#>  [7] MONDO:0004993->MONDO:0005138 MONDO:0004993->MONDO:0004989
#> + ... omitted several edges

The induced graph returned by directed_steiner_tree_approx() includes vertex and graph annotations from the input graph, and can be coerced to a olsr_graph object.

igraph_as_olsr_graph(ig1)
#> class: descendants_graph (39 nodes x 44 edges)
#> ontology: mondo
#> ids (1): MONDO:0004993

Graph visualization

There are 39 nodes in this subgraph, few enough for convenient interactive visualization using the visNetwork package.

Use igraph commands to color nodes used in graph construction (the ‘from’ and ‘to’ nodes) differently from nodes required to represent paths between nodes. Use the igraph ‘name’ attribute (OBO id) as the ‘title’ (i.e., mouse-over tool-tip). The ‘label’ attribute is used to label nodes.

## colorspace::qualitative_hcl(2, "Set 2")
palette <- c("#ED90A4", "#00C1B2")
igraph::V(ig1)$color <- palette[ (igraph::V(ig1)$name %in% keep_ids) + 1L]
igraph::V(ig1)$title <- igraph::V(ig1)$name # obo_id
## V(ig1)$label             # 'label' (short description) from OLS

Create an interactive visualization using visIgraph(). The ‘sugiyama’ layout is appropriate for directed tree-like graphs. idToLabel = FALSE ensures that the label vertex attribute, rather than the internal igraph id, is used to label nodes.

Note that the graph is interactive – zoom in and out, mouse-over and select nodes, etc.

visNetwork::visIgraph(ig1, layout = "layout_with_sugiyama", idToLabel = FALSE)

A more flexible alternative for visualization is to translate an olsr_graph to a visNetwork object, and to transform that for visualization.

carcinoma_subgraph <- igraph_as_olsr_graph(ig1)
olsr_graph_as_visNetwork(carcinoma_subgraph) |>
    visNetwork::visEdges(arrows = "to") |>
    visNetwork::visHierarchicalLayout(
        sortMethod = "directed",
        direction = "LR"
    )

Session information

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] cellxgenedp_1.10.0 OLSr_0.0.4         dplyr_1.1.4        BiocStyle_2.34.0  
#> 
#> loaded via a namespace (and not attached):
#>  [1] rappdirs_0.3.3       sass_0.4.9           utf8_1.2.4          
#>  [4] generics_0.1.3       tidyr_1.3.1          digest_0.6.37       
#>  [7] magrittr_2.0.3       evaluate_1.0.3       bookdown_0.42       
#> [10] fastmap_1.2.0        jsonlite_1.8.9       promises_1.3.2      
#> [13] BiocManager_1.30.25  httr_1.4.7           purrr_1.0.2         
#> [16] httr2_1.1.0          textshaping_1.0.0    jquerylib_0.1.4     
#> [19] cli_3.6.3            shiny_1.10.0         rlang_1.1.5         
#> [22] visNetwork_2.1.2     withr_3.0.2          cachem_1.1.0        
#> [25] yaml_2.3.10          BiocBaseUtils_1.8.0  tools_4.4.2         
#> [28] memoise_2.0.1        httpuv_1.6.15        DT_0.33             
#> [31] curl_6.1.0           vctrs_0.6.5          rjsoncons_1.3.1.9100
#> [34] R6_2.5.1             mime_0.12            lifecycle_1.0.4     
#> [37] fs_1.6.5             htmlwidgets_1.6.4    ragg_1.3.3          
#> [40] pkgconfig_2.0.3      desc_1.4.3           pkgdown_2.1.1       
#> [43] pillar_1.10.1        bslib_0.8.0          later_1.4.1         
#> [46] glue_1.8.0           Rcpp_1.0.14          systemfonts_1.2.1   
#> [49] xfun_0.50            tibble_3.2.1         tidyselect_1.2.1    
#> [52] knitr_1.49           xtable_1.8-4         igraph_2.1.3        
#> [55] htmltools_0.5.8.1    rmarkdown_2.29       compiler_4.4.2