Case Study: CELLxGENE Ontologies
Martin Morgan
Roswell Park Comprehensive Cancer Center, Buffalo, NY, USSource:
vignettes/b_case_study_cxg.Rmd
b_case_study_cxg.Rmd
This case study illustrates use of OLSr to facilitate data set discovery in the CELLxGENE data portal. The CELLxGENE data portal provides a graphical user interface to collections of single-cell sequence data processed in standard ways to ‘count matrix’ summaries. The cellxgenedp R / Bioconductor package provides an R-based interface, allowing data discovery, viewing, and downloading.
This vignette uses the cellxgenedp, igraph, and visNetwork packages. Install these with the following command
if (!"BiocManager" %in% rownames(installed.packages()))
install.packages("BiocManager", repos = "https://cloud.r-project.org")
pkgs <- c("cellxgenedp", "igraph", "visNetwork")
required_pkgs <- pkgs[!pkgs %in% rownames(installed.packages())]
BiocManager::install(required_pkgs)
Start by loading the OLSr and cellxgenedp packages
CELLxGENE
A central element of the CELLxGENE resource is the collection of datasets. These are discovered programmatically using
db <- db()
datasets(db)
#> # A tibble: 1,720 × 33
#> dataset_id dataset_version_id collection_id donor_id assay batch_condition
#> <chr> <chr> <chr> <list> <list> <list>
#> 1 1f1c5c14-59… 2afef4bd-99af-41f… 2902f08c-f83… <chr> <list> <lgl [1]>
#> 2 f9ad5649-f3… d9db936c-41c6-439… 180bff9c-c8a… <chr> <list> <lgl [1]>
#> 3 cd77258f-b0… 000198ac-27c7-4b9… 180bff9c-c8a… <chr> <list> <lgl [1]>
#> 4 bdacc907-7c… 017837df-b8be-4a4… 180bff9c-c8a… <chr> <list> <lgl [1]>
#> 5 b94e3bdf-a3… 3512034b-9a5c-4db… 180bff9c-c8a… <chr> <list> <lgl [1]>
#> 6 9f1049ac-f8… bb8f672d-593d-482… 180bff9c-c8a… <chr> <list> <lgl [1]>
#> 7 873ff933-4f… a7d3ba30-903d-44f… 180bff9c-c8a… <chr> <list> <lgl [1]>
#> 8 75a881cf-5d… 50d7154f-e5b9-48a… 180bff9c-c8a… <chr> <list> <lgl [1]>
#> 9 6c600df6-dd… 1f97f38b-2a9e-48a… 180bff9c-c8a… <chr> <list> <lgl [1]>
#> 10 2727d83a-0a… dd0627a2-4363-46e… 180bff9c-c8a… <chr> <list> <lgl [1]>
#> # ℹ 1,710 more rows
#> # ℹ 27 more variables: cell_count <int>, cell_type <list>, citation <chr>,
#> # default_embedding <chr>, development_stage <list>, disease <list>,
#> # embeddings <list>, explorer_url <chr>, feature_biotype <list>,
#> # feature_count <int>, feature_reference <list>, is_primary_data <list>,
#> # mean_genes_per_cell <dbl>, organism <list>, primary_cell_count <int>,
#> # raw_data_location <chr>, schema_version <chr>, …
Data sets are classified using ontologies exposed by cellxgenedp to the user as ‘facets’.
facets <- facets(db)
facets
#> # A tibble: 1,966 × 4
#> facet label ontology_term_id n
#> <chr> <chr> <chr> <int>
#> 1 assay 10x 3' v3 EFO:0009922 807
#> 2 assay 10x 3' v2 EFO:0009899 364
#> 3 assay Visium Spatial Gene Expression EFO:0010961 298
#> 4 assay Slide-seqV2 EFO:0030062 240
#> 5 assay 10x 5' v1 EFO:0011025 118
#> 6 assay Smart-seq2 EFO:0008931 77
#> 7 assay 10x multiome EFO:0030059 72
#> 8 assay 10x 5' v2 EFO:0009900 58
#> 9 assay sci-RNA-seq3 EFO:0030028 48
#> 10 assay Drop-seq EFO:0008722 27
#> # ℹ 1,956 more rows
Some insight into the ontologies used to construct facets can be
gained by extracting the ontology from the
ontology_term_id
.
facets |>
## a small number of terms are not from any ontology; these do not
## have a ':' in the ontology_term_id
filter(grepl(":", ontology_term_id, fixed = TRUE)) |>
## for the remainder, discover the ontology from the ontology_term_id
mutate(ontology = tolower(sub("(.*):.*", "\\1", ontology_term_id))) |>
count(facet, ontology)
#> # A tibble: 15 × 3
#> facet ontology n
#> <chr> <chr> <int>
#> 1 assay efo 42
#> 2 cell_type cl 971
#> 3 development_stage hsapdv 181
#> 4 development_stage mmusdv 68
#> 5 development_stage uberon 5
#> 6 disease mondo 147
#> 7 disease pato 1
#> 8 organism ncbitaxon 7
#> 9 self_reported_ethnicity hancestro 34
#> 10 self_reported_ethnicity hancestro:0005,hancestro 1
#> 11 self_reported_ethnicity hancestro:0013,hancestro 1
#> 12 self_reported_ethnicity hancestro:0014,hancestro 1
#> 13 sex pato 2
#> 14 tissue cl 13
#> 15 tissue uberon 487
Annotating the ‘disease’ facet
Our example will work with the ‘disease’ facet, based on the ‘mondo’ ontology. Check that the ontology is in the OLS.
"mondo" %in% get_ontologies()$id
#> [1] TRUE
Retrieve all terms from the ontology.
mondo <- get_terms("mondo")
mondo
#> # A tibble: 26,015 × 14
#> label obo_id description iri synonyms annotation has_children is_root
#> <chr> <chr> <chr> <chr> <list> <list> <lgl> <lgl>
#> 1 disease MONDO… "A disease… http… <list> <named list> TRUE FALSE
#> 2 adrenoco… MONDO… "An endocr… http… <list> <named list> TRUE FALSE
#> 3 alopecia… MONDO… NA http… <NULL> <named list> TRUE FALSE
#> 4 inherite… MONDO… NA http… <list> <named list> TRUE FALSE
#> 5 colorbli… MONDO… "Reason of… http… <NULL> <named list> TRUE FALSE
#> 6 classic … MONDO… "A genetic… http… <list> <named list> TRUE FALSE
#> 7 nocturna… MONDO… "Urination… http… <list> <named list> TRUE FALSE
#> 8 infantil… MONDO… "OMIM seri… http… <list> <named list> TRUE FALSE
#> 9 sleep-re… MONDO… NA http… <list> <named list> TRUE FALSE
#> 10 febrile … MONDO… NA http… <list> <named list> TRUE FALSE
#> # ℹ 26,005 more rows
#> # ℹ 6 more variables: short_form <chr>, in_subset <list>,
#> # obo_definition_citation <list>, obo_xref <list>, obo_synonym <list>,
#> # `_links` <list>
There are 26015 terms in this ontology; only a small subset of 148 terms are used in CELLxGENE datasets. Annotate the disease facet with information from the OLS.
disease_terms <-
facets(db, "disease") |>
## join based on identity between CxG's 'ontology_term_id' and
## OLS's 'obo_id'
left_join(mondo, by = c("label", ontology_term_id = "obo_id")) |>
## for consistency with OLSr convention, rename the
## 'ontology_term_id' to 'obo_id'
rename(obo_id = "ontology_term_id")
disease_terms
#> # A tibble: 148 × 16
#> facet label obo_id n description iri synonyms annotation has_children
#> <chr> <chr> <chr> <int> <chr> <chr> <list> <list> <lgl>
#> 1 dise… norm… PATO:… 1448 NA NA <NULL> <NULL> NA
#> 2 dise… COVI… MONDO… 66 A disease … http… <list> <named list> FALSE
#> 3 dise… deme… MONDO… 50 Loss of in… http… <list> <named list> TRUE
#> 4 dise… brea… MONDO… 34 A primary … http… <list> <named list> TRUE
#> 5 dise… myoc… MONDO… 29 Gross necr… http… <list> <named list> TRUE
#> 6 dise… diab… MONDO… 26 Progressiv… http… <list> <named list> FALSE
#> 7 dise… Alzh… MONDO… 24 A progress… http… <list> <named list> TRUE
#> 8 dise… auto… MONDO… 24 Autosomal … http… <list> <named list> TRUE
#> 9 dise… nonp… MONDO… 19 NA http… <list> <named list> TRUE
#> 10 dise… colo… MONDO… 17 A primary … http… <list> <named list> TRUE
#> # ℹ 138 more rows
#> # ℹ 7 more variables: is_root <lgl>, short_form <chr>, in_subset <list>,
#> # obo_definition_citation <list>, obo_xref <list>, obo_synonym <list>,
#> # `_links` <list>
Note that the ‘normal’ term with OBO id ‘PATO:0000461’ used by
CELLxGENE is not from the ‘mondo’ ontology; we could discover
information about this term by also retrieving the ‘pato’ ontology from
the OLS, or using get_term("pato", "PATO:0000461")
.
Identifying carcinoma datasets
CELLxGENE presents the disease facet as a flat list of terms, although the datasets are presented with an ad hoc filter to group terms and facilitate selection of datasets containing samples from multiple terms. A more principled approach is to use the ontology to identify terms of interest.
Suppose one is interested in all datasets annotated with disease term
derived from ‘carcinoma’; this term does not itself appear in the
disease facet. Find the obo_id
of the ‘carcinoma’ term in
the the mondo ontology, and use that to find descendants.
carcinoma_id <-
mondo |>
filter(label == "carcinoma") |>
pull("obo_id")
carcinoma <- get_descendants("mondo", carcinoma_id)
There are 973 terms descending from ‘carcinoma’. Terms used by the disease facet are
disease_terms |>
inner_join(carcinoma) |>
select(obo_id, n, label, description)
#> Joining with `by = join_by(label, obo_id, description, iri, synonyms,
#> annotation, has_children, is_root, short_form, in_subset,
#> obo_definition_citation, obo_xref, obo_synonym, `_links`)`
#> # A tibble: 22 × 4
#> obo_id n label description
#> <chr> <int> <chr> <chr>
#> 1 MONDO:0007763 19 nonpapillary renal cell carcinoma NA
#> 2 MONDO:0005061 12 lung adenocarcinoma A carcinoma that arise…
#> 3 MONDO:0008433 12 small cell lung carcinoma Small cell lung cancer…
#> 4 MONDO:0004953 11 invasive ductal breast carcinoma The most common type o…
#> 5 MONDO:0020804 10 basal cell carcinoma A carcinoma involving …
#> 6 MONDO:0005051 9 invasive lobular breast carcinoma An infiltrating lobula…
#> 7 MONDO:0005086 9 renal cell carcinoma A carcinoma that arise…
#> 8 MONDO:0004989 8 breast carcinoma A carcinoma that arise…
#> 9 MONDO:0024885 5 malignant ovarian serous tumor An invasive malignant …
#> 10 MONDO:0005097 4 squamous cell lung carcinoma A carcinoma arising fr…
#> # ℹ 12 more rows
The facets_filter()
function of cellxgenedp makes it
easy to identify datasets using terms derived from ‘carcinoma’, perhaps
in addition filtering to single type of assay.
datasets_of_interest <-
datasets(db) |>
filter(
facets_filter(disease, "ontology_term_id", carcinoma$obo_id),
facets_filter(assay, "label", "10x 3' v3")
)
datasets_of_interest
#> # A tibble: 34 × 33
#> dataset_id dataset_version_id collection_id donor_id assay batch_condition
#> <chr> <chr> <chr> <list> <list> <list>
#> 1 b6b5ea88-e0… 915069db-1df2-49a… 3c34e6f1-682… <chr> <list> <lgl [1]>
#> 2 e3a7e927-26… 56bbc63f-88a9-4e8… 4796c91c-9d8… <chr> <list> <lgl [1]>
#> 3 b252b015-b4… 73fbcec3-f602-4e1… 4796c91c-9d8… <chr> <list> <lgl [1]>
#> 4 97d9238c-1a… bcdad03f-c1d3-4c5… 4796c91c-9d8… <chr> <list> <lgl [1]>
#> 5 44c93f2b-dd… c8ceafd0-17c9-4da… 4796c91c-9d8… <chr> <list> <lgl [1]>
#> 6 0caedec7-1c… ffc98fde-61c6-43e… 4796c91c-9d8… <chr> <list> <lgl [1]>
#> 7 d41f45c1-1b… 32149a2b-b637-481… efd94500-1fd… <chr> <list> <lgl [1]>
#> 8 e500acbf-f1… a3a5125b-2e73-4a7… 7dd599c5-d25… <chr> <list> <chr [1]>
#> 9 5d3fc988-76… 7fd0b766-0d34-4b8… 7dd599c5-d25… <chr> <list> <chr [1]>
#> 10 3e4e2c8e-cd… 5f902c20-a2a0-47c… 7dd599c5-d25… <chr> <list> <chr [1]>
#> # ℹ 24 more rows
#> # ℹ 27 more variables: cell_count <int>, cell_type <list>, citation <chr>,
#> # default_embedding <chr>, development_stage <list>, disease <list>,
#> # embeddings <list>, explorer_url <chr>, feature_biotype <list>,
#> # feature_count <int>, feature_reference <list>, is_primary_data <list>,
#> # mean_genes_per_cell <dbl>, organism <list>, primary_cell_count <int>,
#> # raw_data_location <chr>, schema_version <chr>, …
There are 34 datasets. Further steps, such as visualization using CELLxGENE or downloading datasets for local analysis, are described in the cellxgenedp vignette.
Ontologies as graphs
It can be useful to compute on ontologies as graphs. Use
get_descendants_graph()
to retrieve, recursively, the
children of the ‘carcinoma’ term. This step can take one or two minutes,
and requires one API call per child. The API calls are memoised, so the
process is time consuming only the first time the graph is
constructed.
carcinoma_graph <-
## many calls to the OLS API; slow the first time
get_descendants_graph("mondo", carcinoma_id, mondo)
#> 0 relatives visited; 1 in queue
#> 973 relatives visited; 0 in queue
get_descendants_graph()
returns a
olsr_graph
object, which contains information about the
type of graph (ancestors or descendants), the ontology used to construct
the graph, and the ids used to seed the query.
carcinoma_graph
#> class: descendants_graph (974 nodes x 1571 edges)
#> ontology: mondo
#> ids (1): MONDO:0004993
The object can be queried from relevant components, for instance annotation of each node
olsr_graph_nodes(carcinoma_graph)
#> # A tibble: 974 × 14
#> obo_id label description iri synonyms annotation has_children is_root
#> <chr> <chr> <chr> <chr> <list> <list> <lgl> <lgl>
#> 1 MONDO:000… carc… A malignan… http… <list> <named list> TRUE FALSE
#> 2 MONDO:085… lymp… A lymph no… http… <NULL> <named list> FALSE FALSE
#> 3 MONDO:004… glyc… A carcinom… http… <list> <named list> TRUE FALSE
#> 4 MONDO:004… inva… A carcinom… http… <list> <named list> TRUE FALSE
#> 5 MONDO:002… seco… A carcinom… http… <list> <named list> TRUE FALSE
#> 6 MONDO:002… carc… A carcinom… http… <list> <named list> TRUE FALSE
#> 7 MONDO:002… comb… A malignan… http… <list> <named list> TRUE FALSE
#> 8 MONDO:002… basa… A carcinom… http… <list> <named list> TRUE FALSE
#> 9 MONDO:001… mali… An invasiv… http… <list> <named list> TRUE FALSE
#> 10 MONDO:001… chor… A malignan… http… <list> <named list> TRUE FALSE
#> # ℹ 964 more rows
#> # ℹ 6 more variables: short_form <chr>, in_subset <list>,
#> # obo_definition_citation <list>, obo_xref <list>, obo_synonym <list>,
#> # `_links` <list>
Graph manipulation
There are 974 nodes in this graph, but only a dozen or so terms used in the CELLxGENE disease facet. We would like to compute on the graph so that it contains only the nodes involved in paths from the ‘carcinoma’ term to terms used by the disease facet.
Start by coercing the olsr_graph
to an object that can
be used by the igraph package.
ig <- olsr_graph_as_igraph(carcinoma_graph)
olsr_graph_as_igraph()
uses the nodes of
carcinoma_graph
as vertex attributes, and adds information
about the relationship, ontology, and id(s) as a graph attribute. Note
that the OLS field obo_id
has been translated to the igraph
attribute name
.
igraph::vertex_attr_names(ig)
#> [1] "name" "label"
#> [3] "description" "iri"
#> [5] "synonyms" "annotation"
#> [7] "has_children" "is_root"
#> [9] "short_form" "in_subset"
#> [11] "obo_definition_citation" "obo_xref"
#> [13] "obo_synonym" "_links"
igraph::graph_attr(ig, "olsr_graph") |>
str()
#> List of 3
#> $ relation: chr "descendants"
#> $ ontology: chr "mondo"
#> $ ids : chr "MONDO:0004993"
Identifying a subgraph of ig
corresponds (I think!) to
calculating a minimal directed Steiner tree. An algorithm for this does
not exist in igraph or other
R packages, so we must develop our own. The solution defined in
the directed_steiner_tree_approx()
function is to find the
shortest paths from the ‘carcinoma’ node to each term in the disease
facet, and then to create a subgraph containing just those nodes.
directed_steiner_tree_approx <-
function(graph, from, to)
{
## only an approximation! Use the union of all nodes on all
## shortest paths from 'from' to each of 'to'
## find shortest paths from node 'from' to each 'to' node
paths <- igraph::shortest_paths(ig, from, to)$vpath
## identify nodes on the shortest paths
shortest_paths_nodes <-
lapply(paths, names) |>
unlist() |>
unique()
## create an induced subgraph on the shortest-path nodes
igraph::induced_subgraph(ig, shortest_paths_nodes)
}
This solution is not necessarily efficient, and does not guarantee a minimal tree, but it is perhaps good enough for the size of data encountered in a typical ontology.
To use the algorithm, identify the starting node and the relevant nodes of the disease facet, and use these as inputs
keep_ids <- c(
carcinoma_id,
intersect(disease_terms$obo_id, carcinoma$obo_id)
)
ig1 <- directed_steiner_tree_approx(ig, head(keep_ids, 1), tail(keep_ids, -1))
ig1
#> IGRAPH 4a81c64 DN-- 39 44 --
#> + attr: olsr_graph (g/x), name (v/c), label (v/c), description (v/c),
#> | iri (v/c), synonyms (v/x), annotation (v/x), has_children (v/l),
#> | is_root (v/l), short_form (v/c), in_subset (v/x),
#> | obo_definition_citation (v/x), obo_xref (v/x), obo_synonym (v/x),
#> | _links (v/x)
#> + edges from 4a81c64 (vertex names):
#> [1] MONDO:0004993->MONDO:0040677 MONDO:0004993->MONDO:0020804
#> [3] MONDO:0004993->MONDO:0018364 MONDO:0004993->MONDO:0006406
#> [5] MONDO:0004993->MONDO:0006181 MONDO:0004993->MONDO:0005206
#> [7] MONDO:0004993->MONDO:0005138 MONDO:0004993->MONDO:0004989
#> + ... omitted several edges
The induced graph returned by
directed_steiner_tree_approx()
includes vertex and graph
annotations from the input graph, and can be coerced to a
olsr_graph
object.
igraph_as_olsr_graph(ig1)
#> class: descendants_graph (39 nodes x 44 edges)
#> ontology: mondo
#> ids (1): MONDO:0004993
Graph visualization
There are 39 nodes in this subgraph, few enough for convenient interactive visualization using the visNetwork package.
Use igraph commands to color nodes used in graph construction (the ‘from’ and ‘to’ nodes) differently from nodes required to represent paths between nodes. Use the igraph ‘name’ attribute (OBO id) as the ‘title’ (i.e., mouse-over tool-tip). The ‘label’ attribute is used to label nodes.
## colorspace::qualitative_hcl(2, "Set 2")
palette <- c("#ED90A4", "#00C1B2")
igraph::V(ig1)$color <- palette[ (igraph::V(ig1)$name %in% keep_ids) + 1L]
igraph::V(ig1)$title <- igraph::V(ig1)$name # obo_id
## V(ig1)$label # 'label' (short description) from OLS
Create an interactive visualization using visIgraph()
.
The ‘sugiyama’ layout is appropriate for directed tree-like graphs.
idToLabel = FALSE
ensures that the label vertex attribute,
rather than the internal igraph id, is used to label nodes.
Note that the graph is interactive – zoom in and out, mouse-over and select nodes, etc.
visNetwork::visIgraph(ig1, layout = "layout_with_sugiyama", idToLabel = FALSE)
A more flexible alternative for visualization is to translate an
olsr_graph
to a visNetwork
object, and to
transform that for visualization.
carcinoma_subgraph <- igraph_as_olsr_graph(ig1)
olsr_graph_as_visNetwork(carcinoma_subgraph) |>
visNetwork::visEdges(arrows = "to") |>
visNetwork::visHierarchicalLayout(
sortMethod = "directed",
direction = "LR"
)
Session information
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] cellxgenedp_1.10.0 OLSr_0.0.4 dplyr_1.1.4 BiocStyle_2.34.0
#>
#> loaded via a namespace (and not attached):
#> [1] rappdirs_0.3.3 sass_0.4.9 utf8_1.2.4
#> [4] generics_0.1.3 tidyr_1.3.1 digest_0.6.37
#> [7] magrittr_2.0.3 evaluate_1.0.3 bookdown_0.42
#> [10] fastmap_1.2.0 jsonlite_1.8.9 promises_1.3.2
#> [13] BiocManager_1.30.25 httr_1.4.7 purrr_1.0.2
#> [16] httr2_1.1.0 textshaping_1.0.0 jquerylib_0.1.4
#> [19] cli_3.6.3 shiny_1.10.0 rlang_1.1.5
#> [22] visNetwork_2.1.2 withr_3.0.2 cachem_1.1.0
#> [25] yaml_2.3.10 BiocBaseUtils_1.8.0 tools_4.4.2
#> [28] memoise_2.0.1 httpuv_1.6.15 DT_0.33
#> [31] curl_6.1.0 vctrs_0.6.5 rjsoncons_1.3.1.9100
#> [34] R6_2.5.1 mime_0.12 lifecycle_1.0.4
#> [37] fs_1.6.5 htmlwidgets_1.6.4 ragg_1.3.3
#> [40] pkgconfig_2.0.3 desc_1.4.3 pkgdown_2.1.1
#> [43] pillar_1.10.1 bslib_0.8.0 later_1.4.1
#> [46] glue_1.8.0 Rcpp_1.0.14 systemfonts_1.2.1
#> [49] xfun_0.50 tibble_3.2.1 tidyselect_1.2.1
#> [52] knitr_1.49 xtable_1.8-4 igraph_2.1.3
#> [55] htmltools_0.5.8.1 rmarkdown_2.29 compiler_4.4.2