Case Study: CELLxGENE Ontologies
Martin Morgan
Roswell Park Comprehensive Cancer Center, Buffalo, NY, USSource:
vignettes/b_case_study_cxg.Rmd
b_case_study_cxg.Rmd
This case study illustrates use of OLSr to facilitate data set discovery in the CELLxGENE data portal. The CELLxGENE data portal provides a graphical user interface to collections of single-cell sequence data processed in standard ways to ‘count matrix’ summaries. The cellxgenedp R / Bioconductor package provides an R-based interface, allowing data discovery, viewing, and downloading.
This vignette uses the cellxgenedp, igraph, and visNetwork packages. Install these with the following command
if (!"BiocManager" %in% rownames(installed.packages()))
install.packages("BiocManager", repos = "https://cloud.r-project.org")
pkgs <- c("cellxgenedp", "igraph", "visNetwork")
required_pkgs <- pkgs[!pkgs %in% rownames(installed.packages())]
BiocManager::install(required_pkgs)
Start by loading the OLSr and cellxgenedp packages
CELLxGENE
A central element of the CELLxGENE resource is the collection of datasets. These are discovered programmatically using
db <- db()
datasets(db)
#> # A tibble: 1,142 × 31
#> dataset_id dataset_version_id collection_id donor_id assay batch_condition
#> <chr> <chr> <chr> <list> <list> <list>
#> 1 1a38e762-24… 50671879-9cc9-4a9… bacccb91-066… <list> <list> <list [1]>
#> 2 fc0ceb80-d2… d9e98068-b506-434… 1ca90a2d-294… <list> <list> <list [1]>
#> 3 f67f2cfa-ba… 3361dba5-7446-4e3… 1ca90a2d-294… <list> <list> <list [1]>
#> 4 eec3e37d-ed… 427e3253-c230-4d7… 1ca90a2d-294… <list> <list> <list [1]>
#> 5 e8ac3386-31… 75338c67-3e91-428… 1ca90a2d-294… <list> <list> <list [1]>
#> 6 de104f7e-14… 19cf2bd2-44ba-40a… 1ca90a2d-294… <list> <list> <list [1]>
#> 7 d87f3f91-dc… b8883057-f9b4-41e… 1ca90a2d-294… <list> <list> <list [1]>
#> 8 d2fc9880-e6… b0313650-ec3e-416… 1ca90a2d-294… <list> <list> <list [1]>
#> 9 d0ea3ec4-0f… 806eaf48-4832-42e… 1ca90a2d-294… <list> <list> <list [1]>
#> 10 c76098ba-ee… df8e241a-6cf8-43c… 1ca90a2d-294… <list> <list> <list [1]>
#> # ℹ 1,132 more rows
#> # ℹ 25 more variables: cell_count <int>, cell_type <list>,
#> # default_embedding <chr>, development_stage <list>, disease <list>,
#> # embeddings <list>, explorer_url <chr>, feature_biotype <chr>,
#> # feature_count <int>, feature_reference <list>, is_primary_data <list>,
#> # mean_genes_per_cell <dbl>, organism <list>, primary_cell_count <int>,
#> # raw_data_location <chr>, schema_version <chr>, …
Data sets are classified using ontologies exposed by cellxgenedp to the user as ‘facets’.
facets <- facets(db)
facets
#> # A tibble: 1,540 × 4
#> facet label ontology_term_id n
#> <chr> <chr> <chr> <int>
#> 1 assay 10x 3' v3 EFO:0009922 546
#> 2 assay 10x 3' v2 EFO:0009899 245
#> 3 assay Slide-seqV2 EFO:0030062 223
#> 4 assay Visium Spatial Gene Expression EFO:0010961 108
#> 5 assay 10x 5' v1 EFO:0011025 76
#> 6 assay Smart-seq2 EFO:0008931 63
#> 7 assay 10x multiome EFO:0030059 61
#> 8 assay 10x 5' v2 EFO:0009900 18
#> 9 assay sci-RNA-seq3 EFO:0030028 15
#> 10 assay 10x 5' transcription profiling EFO:0030004 13
#> # ℹ 1,530 more rows
Some insight into the ontologies used to construct facets can be
gained by extracting the ontology from the
ontology_term_id
.
facets |>
## a small number of terms are not from any ontology; these do not
## have a ':' in the ontology_term_id
filter(grepl(":", ontology_term_id, fixed = TRUE)) |>
## for the remainder, discover the ontology from the ontology_term_id
mutate(ontology = tolower(sub("(.*):.*", "\\1", ontology_term_id))) |>
count(facet, ontology)
#> # A tibble: 12 × 3
#> facet ontology n
#> <chr> <chr> <int>
#> 1 assay efo 37
#> 2 cell_type cl 757
#> 3 development_stage hsapdv 173
#> 4 development_stage mmusdv 51
#> 5 development_stage uberon 5
#> 6 disease mondo 95
#> 7 disease pato 1
#> 8 organism ncbitaxon 7
#> 9 self_reported_ethnicity hancestro 29
#> 10 sex pato 2
#> 11 tissue cl 12
#> 12 tissue uberon 366
Annotating the ‘disease’ facet
Our example will work with the ‘disease’ facet, based on the ‘mondo’ ontology. Check that the ontology is in the OLS.
"mondo" %in% get_ontologies()$id
#> [1] TRUE
Retrieve all terms from the ontology.
mondo <- get_terms("mondo")
mondo
#> # A tibble: 23,678 × 14
#> label obo_id description iri synonyms annotation has_children is_root
#> <chr> <chr> <chr> <chr> <list> <list> <lgl> <lgl>
#> 1 AIPL1-re… MONDO… A retinopa… http… <list> <named list> TRUE FALSE
#> 2 glycogen… MONDO… Any glycog… http… <list> <named list> FALSE FALSE
#> 3 Asperger… MONDO… An inherit… http… <NULL> <named list> TRUE FALSE
#> 4 GUCY2D-r… MONDO… A retinopa… http… <list> <named list> TRUE FALSE
#> 5 RP2-rela… MONDO… A retinopa… http… <list> <named list> TRUE FALSE
#> 6 RDH5-rel… MONDO… A retinopa… http… <list> <named list> TRUE FALSE
#> 7 RLBP1-re… MONDO… A retinopa… http… <list> <named list> TRUE FALSE
#> 8 LCA5-rel… MONDO… A retinopa… http… <list> <named list> TRUE FALSE
#> 9 CNGB3-re… MONDO… A retinopa… http… <list> <named list> TRUE FALSE
#> 10 ATF6-rel… MONDO… A retinopa… http… <list> <named list> TRUE FALSE
#> # ℹ 23,668 more rows
#> # ℹ 6 more variables: short_form <chr>, in_subset <list>,
#> # obo_definition_citation <list>, obo_xref <list>, obo_synonym <list>,
#> # `_links` <list>
There are 23678 terms in this ontology; only a small subset of 96 terms are used in CELLxGENE datasets. Annotate the disease facet with information from the OLS.
disease_terms <-
facets(db, "disease") |>
## join based on identity between CxG's 'ontology_term_id' and
## OLS's 'obo_id'
left_join(mondo, by = c("label", ontology_term_id = "obo_id")) |>
## for consistency with OLSr convention, rename the
## 'ontology_term_id' to 'obo_id'
rename(obo_id = "ontology_term_id")
disease_terms
#> # A tibble: 96 × 16
#> facet label obo_id n description iri synonyms annotation has_children
#> <chr> <chr> <chr> <int> <chr> <chr> <list> <list> <lgl>
#> 1 dise… norm… PATO:… 966 NA NA <NULL> <NULL> NA
#> 2 dise… COVI… MONDO… 55 A disease … http… <list> <named list> FALSE
#> 3 dise… deme… MONDO… 50 Loss of in… http… <list> <named list> TRUE
#> 4 dise… diab… MONDO… 26 Progressiv… http… <list> <named list> FALSE
#> 5 dise… myoc… MONDO… 26 Gross necr… http… <list> <named list> TRUE
#> 6 dise… auto… MONDO… 24 Autosomal … http… <list> <named list> TRUE
#> 7 dise… Alzh… MONDO… 15 A progress… http… <list> <named list> TRUE
#> 8 dise… brea… MONDO… 13 A primary … http… <list> <named list> TRUE
#> 9 dise… lung… MONDO… 11 A carcinom… http… <list> <named list> TRUE
#> 10 dise… smal… MONDO… 11 Small cell… http… <list> <named list> TRUE
#> # ℹ 86 more rows
#> # ℹ 7 more variables: is_root <lgl>, short_form <chr>, in_subset <list>,
#> # obo_definition_citation <list>, obo_xref <list>, obo_synonym <list>,
#> # `_links` <list>
Note that the ‘normal’ term with OBO id ‘PATO:0000461’ used by
CELLxGENE is not from the ‘mondo’ ontology; we could discover
information about this term by also retrieving the ‘pato’ ontology from
the OLS, or using get_term("pato", "PATO:0000461")
.
Identifying carcinoma datasets
CELLxGENE presents the disease facet as a flat list of terms, although the datasets are presented with an ad hoc filter to group terms and facilitate selection of datasets containing samples from multiple terms. A more principled approach is to use the ontology to identify terms of interest.
Suppose one is interested in all datasets annotated with disease term
derived from ‘carcinoma’; this term does not itself appear in the
disease facet. Find the obo_id
of the ‘carcinoma’ term in
the the mondo ontology, and use that to find descendants.
carcinoma_id <-
mondo |>
filter(label == "carcinoma") |>
pull("obo_id")
carcinoma <- get_descendants("mondo", carcinoma_id)
There are 987 terms descending from ‘carcinoma’. Terms used by the disease facet are
disease_terms |>
inner_join(carcinoma) |>
select(obo_id, n, label, description)
#> Joining with `by = join_by(label, obo_id, description, iri, synonyms,
#> annotation, has_children, is_root, short_form, in_subset,
#> obo_definition_citation, obo_xref, obo_synonym, `_links`)`
#> # A tibble: 14 × 4
#> obo_id n label description
#> <chr> <int> <chr> <chr>
#> 1 MONDO:0005061 11 lung adenocarcinoma A carcinoma that arise…
#> 2 MONDO:0008433 11 small cell lung carcinoma Small cell lung cancer…
#> 3 MONDO:0005086 9 renal cell carcinoma A carcinoma that arise…
#> 4 MONDO:0004989 7 breast carcinoma A carcinoma that arise…
#> 5 MONDO:0024885 5 malignant ovarian serous tumor An invasive malignant …
#> 6 MONDO:0004970 3 adenocarcinoma A common cancer charac…
#> 7 MONDO:0007763 3 nonpapillary renal cell carcinoma NA
#> 8 MONDO:0005097 3 squamous cell lung carcinoma A carcinoma arising fr…
#> 9 MONDO:0005005 2 clear cell renal carcinoma A malignant epithelial…
#> 10 MONDO:0005233 2 non-small cell lung carcinoma A group of at least th…
#> 11 MONDO:0017885 1 chromophobe renal cell carcinoma Chromophobe renal cell…
#> 12 MONDO:0003050 1 lung large cell carcinoma A poorly differentiate…
#> 13 MONDO:0002120 1 neuroendocrine carcinoma A malignant neuroendoc…
#> 14 MONDO:0003573 1 pleomorphic carcinoma A usually aggressive m…
The facets_filter()
function of cellxgenedp makes it
easy to identify datasets using terms derived from ‘carcinoma’, perhaps
in addition filtering to single type of assay.
datasets_of_interest <-
datasets(db) |>
filter(
facets_filter(disease, "ontology_term_id", carcinoma$obo_id),
facets_filter(assay, "label", "10x 3' v3")
)
datasets_of_interest
#> # A tibble: 26 × 31
#> dataset_id dataset_version_id collection_id donor_id assay batch_condition
#> <chr> <chr> <chr> <list> <list> <list>
#> 1 232f6a5a-a0… b9a24331-5b4c-404… edb893ee-406… <list> <list> <list [3]>
#> 2 1e6a6ef9-7e… 04929c2b-00af-4ad… edb893ee-406… <list> <list> <list [3]>
#> 3 e3a7e927-26… f430ff40-4cf5-465… 4796c91c-9d8… <list> <list> <lgl [1]>
#> 4 b252b015-b4… ef1b7fbb-789f-4d6… 4796c91c-9d8… <list> <list> <lgl [1]>
#> 5 97d9238c-1a… 8105f639-52fc-481… 4796c91c-9d8… <list> <list> <lgl [1]>
#> 6 44c93f2b-dd… dc79da35-ebd5-4dc… 4796c91c-9d8… <list> <list> <lgl [1]>
#> 7 0caedec7-1c… 59ccd759-41c5-4d0… 4796c91c-9d8… <list> <list> <lgl [1]>
#> 8 9f222629-9e… e4eeabe8-9852-48e… 6f6d381a-770… <list> <list> <list [1]>
#> 9 d41f45c1-1b… ecd2961a-ad80-4e2… efd94500-1fd… <list> <list> <lgl [1]>
#> 10 f64e1be1-de… aa0c4b0a-e79f-4d8… 62e8f058-9c3… <list> <list> <lgl [1]>
#> # ℹ 16 more rows
#> # ℹ 25 more variables: cell_count <int>, cell_type <list>,
#> # default_embedding <chr>, development_stage <list>, disease <list>,
#> # embeddings <list>, explorer_url <chr>, feature_biotype <chr>,
#> # feature_count <int>, feature_reference <list>, is_primary_data <list>,
#> # mean_genes_per_cell <dbl>, organism <list>, primary_cell_count <int>,
#> # raw_data_location <chr>, schema_version <chr>, …
There are 26 datasets. Further steps, such as visualization using CELLxGENE or downloading datasets for local analysis, are described in the cellxgenedp vignette.
Ontologies as graphs
It can be useful to compute on ontologies as graphs. Use
get_descendants_graph()
to retrieve, recursively, the
children of the ‘carcinoma’ term. This step can take one or two minutes,
and requires one API call per child. The API calls are memoised, so the
process is time consuming only the first time the graph is
constructed.
carcinoma_graph <-
## many calls to the OLS API; slow the first time
get_descendants_graph("mondo", carcinoma_id, mondo)
#> 0 relatives visited; 1 in queue
#> 987 relatives visited; 0 in queue
get_descendants_graph()
returns a
olsr_graph
object, which contains information about the
type of graph (ancestors or descendants), the ontology used to construct
the graph, and the ids used to seed the query.
carcinoma_graph
#> class: descendants_graph (988 nodes x 1578 edges)
#> ontology: mondo
#> ids (1): MONDO:0004993
The object can be queried from relevant components, for instance annotation of each node
olsr_graph_nodes(carcinoma_graph)
#> # A tibble: 988 × 14
#> obo_id label description iri synonyms annotation has_children is_root
#> <chr> <chr> <chr> <chr> <list> <list> <lgl> <lgl>
#> 1 MONDO:000… carc… A malignan… http… <list> <named list> TRUE FALSE
#> 2 MONDO:085… lymp… A lymph no… http… <NULL> <named list> FALSE FALSE
#> 3 MONDO:004… glyc… A carcinom… http… <list> <named list> TRUE FALSE
#> 4 MONDO:004… inva… A carcinom… http… <list> <named list> TRUE FALSE
#> 5 MONDO:002… seco… A carcinom… http… <list> <named list> TRUE FALSE
#> 6 MONDO:002… carc… A carcinom… http… <list> <named list> TRUE FALSE
#> 7 MONDO:002… basa… A carcinom… http… <list> <named list> TRUE FALSE
#> 8 MONDO:002… muci… NA http… <list> <named list> TRUE FALSE
#> 9 MONDO:002… comb… A malignan… http… <list> <named list> TRUE FALSE
#> 10 MONDO:001… chor… A malignan… http… <list> <named list> TRUE FALSE
#> # ℹ 978 more rows
#> # ℹ 6 more variables: short_form <chr>, in_subset <list>,
#> # obo_definition_citation <list>, obo_xref <list>, obo_synonym <list>,
#> # `_links` <list>
Graph manipulation
There are 988 nodes in this graph, but only a dozen or so terms used in the CELLxGENE disease facet. We would like to compute on the graph so that it contains only the nodes involved in paths from the ‘carcinoma’ term to terms used by the disease facet.
Start by coercing the olsr_graph
to an object that can
be used by the igraph package.
ig <- olsr_graph_as_igraph(carcinoma_graph)
olsr_graph_as_igraph()
uses the nodes of
carcinoma_graph
as vertex attributes, and adds information
about the relationship, ontology, and id(s) as a graph attribute. Note
that the OLS field obo_id
has been translated to the igraph
attribute name
.
igraph::vertex_attr_names(ig)
#> [1] "name" "label"
#> [3] "description" "iri"
#> [5] "synonyms" "annotation"
#> [7] "has_children" "is_root"
#> [9] "short_form" "in_subset"
#> [11] "obo_definition_citation" "obo_xref"
#> [13] "obo_synonym" "_links"
igraph::graph_attr(ig, "olsr_graph") |>
str()
#> List of 3
#> $ relation: chr "descendants"
#> $ ontology: chr "mondo"
#> $ ids : chr "MONDO:0004993"
Identifying a subgraph of ig
corresponds (I think!) to
calculating a minimal directed Steiner tree. An algorithm for this does
not exist in igraph or other
R packages, so we must develop our own. The solution defined in
the directed_steiner_tree_approx()
function is to find the
shortest paths from the ‘carcinoma’ node to each term in the disease
facet, and then to create a subgraph containing just those nodes.
directed_steiner_tree_approx <-
function(graph, from, to)
{
## only an approximation! Use the union of all nodes on all
## shortest paths from 'from' to each of 'to'
## find shortest paths from node 'from' to each 'to' node
paths <- igraph::shortest_paths(ig, from, to)$vpath
## identify nodes on the shortest paths
shortest_paths_nodes <-
lapply(paths, names) |>
unlist() |>
unique()
## create an induced subgraph on the shortest-path nodes
igraph::induced_subgraph(ig, shortest_paths_nodes)
}
This solution is not necessarily efficient, and does not guarantee a minimal tree, but it is perhaps good enough for the size of data encountered in a typical ontology.
To use the algorithm, identify the starting node and the relevant nodes of the disease facet, and use these as inputs
keep_ids <- c(
carcinoma_id,
intersect(disease_terms$obo_id, carcinoma$obo_id)
)
ig1 <- directed_steiner_tree_approx(ig, head(keep_ids, 1), tail(keep_ids, -1))
ig1
#> IGRAPH a0e9ef1 DN-- 22 26 --
#> + attr: olsr_graph (g/x), name (v/c), label (v/c), description (v/c),
#> | iri (v/c), synonyms (v/x), annotation (v/x), has_children (v/l),
#> | is_root (v/l), short_form (v/c), in_subset (v/x),
#> | obo_definition_citation (v/x), obo_xref (v/x), obo_synonym (v/x),
#> | _links (v/x)
#> + edges from a0e9ef1 (vertex names):
#> [1] MONDO:0004993->MONDO:0018364 MONDO:0004993->MONDO:0006406
#> [3] MONDO:0004993->MONDO:0005206 MONDO:0004993->MONDO:0005138
#> [5] MONDO:0004993->MONDO:0004989 MONDO:0004993->MONDO:0004970
#> [7] MONDO:0004993->MONDO:0002120 MONDO:0004993->MONDO:0005232
#> + ... omitted several edges
The induced graph returned by
directed_steiner_tree_approx()
includes vertex and graph
annotations from the input graph, and can be coerced to a
olsr_graph
object.
igraph_as_olsr_graph(ig1)
#> class: descendants_graph (22 nodes x 26 edges)
#> ontology: mondo
#> ids (1): MONDO:0004993
Graph visualization
There are 22 nodes in this subgraph, few enough for convenient interactive visualization using the visNetwork package.
Use igraph commands to color nodes used in graph construction (the ‘from’ and ‘to’ nodes) differently from nodes required to represent paths between nodes. Use the igraph ‘name’ attribute (OBO id) as the ‘title’ (i.e., mouse-over tool-tip). The ‘label’ attribute is used to label nodes.
## colorspace::qualitative_hcl(2, "Set 2")
palette <- c("#ED90A4", "#00C1B2")
igraph::V(ig1)$color <- palette[ (igraph::V(ig1)$name %in% keep_ids) + 1L]
igraph::V(ig1)$title <- igraph::V(ig1)$name # obo_id
## V(ig1)$label # 'label' (short description) from OLS
Create an interactive visualization using visIgraph()
.
The ‘sugiyama’ layout is appropriate for directed tree-like graphs.
idToLabel = FALSE
ensures that the label vertex attribute,
rather than the internal igraph id, is used to label nodes.
Note that the graph is interactive – zoom in and out, mouse-over and select nodes, etc.
visNetwork::visIgraph(ig1, layout = "layout_with_sugiyama", idToLabel = FALSE)
A more flexible alternative for visualization is to translate an
olsr_graph
to a visNetwork
object, and to
transform that for visualization.
carcinoma_subgraph <- igraph_as_olsr_graph(ig1)
olsr_graph_as_visNetwork(carcinoma_subgraph) |>
visNetwork::visEdges(arrows = "to") |>
visNetwork::visHierarchicalLayout(
sortMethod = "directed",
direction = "LR"
)
Session information
sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] cellxgenedp_1.4.1 OLSr_0.0.3 dplyr_1.1.4 BiocStyle_2.28.1
#>
#> loaded via a namespace (and not attached):
#> [1] rappdirs_0.3.3 sass_0.4.7 utf8_1.2.4
#> [4] generics_0.1.3 tidyr_1.3.0 stringi_1.8.2
#> [7] digest_0.6.33 magrittr_2.0.3 evaluate_0.23
#> [10] bookdown_0.36 fastmap_1.1.1 rprojroot_2.0.4
#> [13] jsonlite_1.8.7 promises_1.2.1 BiocManager_1.30.22
#> [16] httr_1.4.7 purrr_1.0.2 fansi_1.0.5
#> [19] httr2_1.0.0 textshaping_0.3.7 jquerylib_0.1.4
#> [22] cli_3.6.1 shiny_1.8.0 rlang_1.1.2
#> [25] visNetwork_2.1.2 ellipsis_0.3.2 withr_2.5.2
#> [28] cachem_1.0.8 yaml_2.3.7 BiocBaseUtils_1.2.0
#> [31] tools_4.3.2 memoise_2.0.1 httpuv_1.6.12
#> [34] DT_0.30 curl_5.1.0 mime_0.12
#> [37] vctrs_0.6.4 rjsoncons_1.0.0.9200 R6_2.5.1
#> [40] lifecycle_1.0.4 stringr_1.5.1 fs_1.6.3
#> [43] htmlwidgets_1.6.3 ragg_1.2.6 pkgconfig_2.0.3
#> [46] desc_1.4.2 later_1.3.1 pkgdown_2.0.7
#> [49] pillar_1.9.0 bslib_0.6.1 Rcpp_1.0.11
#> [52] glue_1.6.2 systemfonts_1.0.5 xfun_0.41
#> [55] tibble_3.2.1 tidyselect_1.2.0 knitr_1.45
#> [58] xtable_1.8-4 igraph_1.5.1 htmltools_0.5.7
#> [61] rmarkdown_2.25 compiler_4.3.2