D. From CELLxGENE to R and Bioconductor • XM2023

We use the following packages during this section of the workshop.

## package developed for this workshop
library(XM2023)

## general programming tools
library(dplyr)

## data access
library(cellxgenedp)

## single cell data representation in R
library(SingleCellExperiment) # Bioconductor representation
library(Seurat)               # Seurat representation

Data Transformations

FASTQ files

DNA sequences and quality scores
Very large
Processed e.g., by CellRanger or other software
Quality control, summary to count matrix (below)

Count matrix

Usually genes (rows) x cells (columns)
Can be large (e.g., 30,000 genes x 50,000 cells)
Usually very sparse, e.g., 95% of cells ‘0’.
Still large enough to require a decent amount of computing power, e.g., 32 GB RAM, 8 CPU for some steps.

Common formats for count matrix data

CSV file – lots of zero’s so very wasteful of space.
‘Matrix Market’ sparse matrix files, e.g., tuples of <row, column, count> for non-zero values.
HDF5, e.g., .loom or .h5ad (anndata).
RDS – An R file, in CELLxGENE these are Seurat objects but in general they could contain any R object.

Matrix representation in R

In-memory sparse matrices: dgCMatrix class from the Matrix package
On-disk representation via Bioconductor’s DelayedArray / HDF5Array.

Counts & annotations

CSV and Matrix Market files store just counts; usually annotations on columns (e.g., what sample did each cell come from?) are stored separately.
HDF5 file formats coordinate row and column annotations with count data.
R software tries to offer a coordinated representation of counts and metadata, e.g., the Seurat or Bioconductor SingleCellExperiment objects.

CELLxGENE

Data Portal

What’s available?

Collections and datasets contributed by the single-cell community, with some overlap with data sets in the HCA Data Portal.
FASTQ files
.h5ad or .RDS summarized count data and cell metadata, as well as reduced-dimension (e.g., UMAP) representations
- Summarized count files provided by the contributor / individual lab, so of uncertain provenance
Easy to download count data
Easy to visualize (!)

Programatic Discovery

Why use an R script when the Data Portals exist?

Easily reproducible
Flexible exploration of rich & complex data
Direct integration with Seurat or Bioconductor single-cell workflows

Load the cellxgenedp package

library(cellxgenedp)

Retrieve the current database, and use ‘tidy’ functionality to mimic the graphical selection in the web browser – 10x 3’ v3 (EFO:0009922) assay, African American ethnicity, female gender)

db <- db()
african_american_female <-
    datasets(db) |>
    dplyr::filter(
        facets_filter(assay, "ontology_term_id", "EFO:0009922"),
        facets_filter(self_reported_ethnicity, "label", "African American"),
        facets_filter(sex, "label", "female")
    )
african_american_female
#> # A tibble: 32 × 26
#>    dataset_id                 collection_id donor_id assay  cell_count cell_type
#>    <chr>                      <chr>         <list>   <list>      <int> <list>   
#>  1 8693bc2c-d84f-450f-983a-d… bcb61471-2a4… <list>   <list>     304652 <list>   
#>  2 37f30005-8a5c-4bdf-88aa-a… bcb61471-2a4… <list>   <list>     172847 <list>   
#>  3 54914718-c0d3-4e17-9e06-d… bcb61471-2a4… <list>   <list>     107344 <list>   
#>  4 4c4dedb7-1f74-4be7-9916-2… b953c942-f5d… <list>   <list>      23197 <list>   
#>  5 8716534a-4c1c-414c-aa98-4… b953c942-f5d… <list>   <list>      96129 <list>   
#>  6 24205601-0780-4bf2-b1d9-0… c9706a92-0e5… <list>   <list>      31696 <list>   
#>  7 a963dd1b-ee7e-42b3-8afe-0… a98b828a-622… <list>   <list>      48783 <list>   
#>  8 36e76662-60b8-4193-8a70-1… 625f6bf4-2f3… <list>   <list>      46500 <list>   
#>  9 611c3577-81cb-4795-ad3a-4… b9fc3d70-5a7… <list>   <list>      49139 <list>   
#> 10 0ed93577-0ff5-435e-b599-b… b9fc3d70-5a7… <list>   <list>     109995 <list>   
#> # ℹ 22 more rows
#> # ℹ 20 more variables: dataset_deployments <chr>, development_stage <list>,
#> #   disease <list>, is_primary_data <chr>, is_valid <lgl>,
#> #   mean_genes_per_cell <dbl>, name <chr>, organism <list>,
#> #   processing_status <list>, published <lgl>, revision <int>,
#> #   schema_version <chr>, self_reported_ethnicity <list>, sex <list>,
#> #   suspension_type <list>, tissue <list>, tombstone <lgl>, …

A particular collection is

collection_id <- "c9706a92-0e5f-46c1-96d8-20e42467f287"
african_american_female |>
    dplyr::filter(collection_id %in% .env$collection_id)
#> # A tibble: 1 × 26
#>   dataset_id                  collection_id donor_id assay  cell_count cell_type
#>   <chr>                       <chr>         <list>   <list>      <int> <list>   
#> 1 24205601-0780-4bf2-b1d9-0e… c9706a92-0e5… <list>   <list>      31696 <list>   
#> # ℹ 20 more variables: dataset_deployments <chr>, development_stage <list>,
#> #   disease <list>, is_primary_data <chr>, is_valid <lgl>,
#> #   mean_genes_per_cell <dbl>, name <chr>, organism <list>,
#> #   processing_status <list>, published <lgl>, revision <int>,
#> #   schema_version <chr>, self_reported_ethnicity <list>, sex <list>,
#> #   suspension_type <list>, tissue <list>, tombstone <lgl>, created_at <date>,
#> #   published_at <date>, updated_at <date>

The dataset associated with this collection is

dataset <-
    african_american_female |>
    dplyr::filter(collection_id %in% .env$collection_id) |>
    dplyr::select(dataset_id)
dataset_id <-
    dataset |>
    dplyr::pull(dataset_id)
dataset_id
#> [1] "24205601-0780-4bf2-b1d9-0e3cacbc2cd6"

The helper function xm2023_cxg_dataset() provides a summary of this dataset.

xm2023_cxg_dataset(dataset_id)
#> title: A single-cell atlas of the healthy breast tissues reveals
#>     clinically relevant clusters of breast epithelial cells
#> description: Single-cell RNA sequencing (scRNA-seq) is an evolving
#>     technology used to elucidate the cellular architecture of adult
#>     organs. Previous scRNA-seq on breast tissue utilized reduction
#>     mammoplasty samples, which are often histologically abnormal. We
#>     report a rapid tissue collection/processing protocol to perform
#>     scRNA-seq of breast biopsies of healthy women and identify 23
#>     breast epithelial cell clusters. Putative cell-of-origin signatures
#>     derived from these clusters are applied to analyze transcriptomes
#>     of ~3,000 breast cancers. Gene signatures derived from mature
#>     luminal cell clusters are enriched in ~68% of breast cancers,
#>     whereas a signature from a luminal progenitor cluster is enriched
#>     in ~20% of breast cancers. Overexpression of luminal progenitor
#>     cluster-derived signatures in HER2+, but not in other subtypes, is
#>     associated with unfavorable outcome. We identify TBX3 and PDK4 as
#>     genes co-expressed with estrogen receptor (ER) in the normal
#>     breasts, and their expression analyses in >550 breast cancers
#>     enable prognostically relevant subclassification of ER+ breast
#>     cancers.
#> authors: Bhat-Nakshatri, Poornima; Gao, Hongyu; Sheng, Liu; McGuire,
#>     Patrick C.; Xuei, Xiaoling; Wan, Jun; Liu, Yunlong; Althouse,
#>     Sandra K.; Colter, Austyn; Sandusky, George; Storniolo, Anna Maria;
#>     Nakshatri, Harikrishna
#> journal: Cell Reports Medicine
#> assays: 10x 3' v2; 10x 3' v3
#> organism: Homo sapiens
#> ethnicity: African American; Chinese; European

‘Join’ selected datasets and files to identify the files associated with these datasets.

selected_files <-
    dplyr::left_join(
        dataset,
        files(db),
        by = "dataset_id"
    )
selected_files
#> # A tibble: 4 × 8
#>   dataset_id          file_id filename filetype s3_uri user_submitted created_at
#>   <chr>               <chr>   <chr>    <chr>    <chr>  <lgl>          <date>    
#> 1 24205601-0780-4bf2… e1c842… "raw.h5… RAW_H5AD s3://… TRUE           1970-01-01
#> 2 24205601-0780-4bf2… 15e9d9… "local.… H5AD     s3://… TRUE           1970-01-01
#> 3 24205601-0780-4bf2… 0d3974… ""       CXG      s3://… TRUE           1970-01-01
#> 4 24205601-0780-4bf2… e254f9… "local.… RDS      s3://… TRUE           1970-01-01
#> # ℹ 1 more variable: updated_at <date>

Select the first ‘CXG’ file available in this subset of data; for reproducibility we retrieve the dataset id…

selected_files |>
    dplyr::filter(filetype == "CXG")
#> # A tibble: 1 × 8
#>   dataset_id          file_id filename filetype s3_uri user_submitted created_at
#>   <chr>               <chr>   <chr>    <chr>    <chr>  <lgl>          <date>    
#> 1 24205601-0780-4bf2… 0d3974… ""       CXG      s3://… TRUE           1970-01-01
#> # ℹ 1 more variable: updated_at <date>

Visualize this ‘CXG’ file in the browser…

selected_files |>
    dplyr::filter(filetype == "CXG", dataset_id %in% .env$dataset_id) |>
    datasets_visualize()

…or select the ‘H5AD’ or ‘RDS’ (Seurat) file associated with the dataset and download it for subsequent processing in R

seurat_file <-
    selected_files |>
    dplyr::filter(filetype == "RDS", dataset_id %in% .env$dataset_id) |>
    files_download(dry.run = FALSE)

h5ad_file <-
    selected_files |>
    dplyr::filter(filetype == "H5AD", dataset_id == .env$dataset_id) |>
    files_download(dry.run = FALSE)

The downloaded file is cached, so the next time access is fast.

Seurat

Data input & representation

Fortunately, CELLxGENE distributes Seurat (v. 4) files, and they can be input directly.

library(Seurat)
seurat <- readRDS(seurat_file)
seurat
#> An object of class Seurat 
#> 33234 features across 31696 samples within 1 assay 
#> Active assay: RNA (33234 features, 0 variable features)
#>  3 dimensional reductions calculated: pca, tsne, umap

Standard workflow

Bioconductor

Data input & representation

The SingleCellExperiment is used to represent ‘rectangular’ single cell expression and other data in R / Bioconductor. It coordinates a gene x cell count matricies (assay()) with annotations on the genes (rowData()) and columns (cellData()), and with reduced-dimension summaries.

An effective way to represent the .h5ad data as a SingleCellExperiment is using zellkonverter::readH5AD().

h5ad <- zellkonverter::readH5AD(h5ad_file, reader = "R", use_hdf5 = TRUE)
#> Registered S3 methods overwritten by 'zellkonverter':
#>   method                                             from      
#>   py_to_r.numpy.ndarray                              reticulate
#>   py_to_r.pandas.core.arrays.categorical.Categorical reticulate
#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

#> Warning in H5Aread(A, ...): Reading attribute data of type 'ENUM' not yet
#> implemented. Values replaced by NA's.

Displaying the object…

h5ad
#> class: SingleCellExperiment 
#> dim: 33234 31696 
#> metadata(3): default_embedding schema_version title
#> assays(1): X
#> rownames(33234): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
#>   ENSG00000268674
#> rowData names(4): feature_is_filtered feature_name feature_reference
#>   feature_biotype
#> colnames(31696): CMGpool_AAACCCAAGGACAACC CMGpool_AAACCCACAATCTCTT ...
#>   K109064_TTTGTTGGTTGCATCA K109064_TTTGTTGGTTGGACCC
#> colData names(34): donor_id self_reported_ethnicity_ontology_term_id
#>   ... self_reported_ethnicity development_stage
#> reducedDimNames(3): X_pca X_tsne X_umap
#> mainExpName: NULL
#> altExpNames(0):

…suggests the data available and how to access it – there are 33234 genes and 31696 cells. The ‘raw’ data include counts assays(h5ad, "counts"), annotations on each gene (rowData()) and cell (colData()), etc…

Working with SingleCellExperiment objects is described in additional detail subsequent articles.

Standard workflow

Session information

This document was produced with the following R software:

sessionInfo()
#> R version 4.3.0 (2023-04-21)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] SeuratObject_4.1.3          Seurat_4.3.0               
#>  [3] SingleCellExperiment_1.23.0 SummarizedExperiment_1.31.1
#>  [5] Biobase_2.61.0              GenomicRanges_1.53.1       
#>  [7] GenomeInfoDb_1.37.1         IRanges_2.35.1             
#>  [9] S4Vectors_0.39.1            BiocGenerics_0.47.0        
#> [11] MatrixGenerics_1.13.0       matrixStats_1.0.0          
#> [13] cellxgenedp_1.5.0           dplyr_1.1.2                
#> [15] XM2023_0.0.1.1             
#> 
#> loaded via a namespace (and not attached):
#>   [1] RColorBrewer_1.1-3      jsonlite_1.8.5          magrittr_2.0.3         
#>   [4] spatstat.utils_3.0-3    rmarkdown_2.22          fs_1.6.2               
#>   [7] zlibbioc_1.47.0         ragg_1.2.5              vctrs_0.6.3            
#>  [10] ROCR_1.0-11             spatstat.explore_3.2-1  memoise_2.0.1          
#>  [13] RCurl_1.98-1.12         htmltools_0.5.5         S4Arrays_1.1.4         
#>  [16] curl_5.0.1              Rhdf5lib_1.23.0         rhdf5_2.45.0           
#>  [19] SparseArray_1.1.10      sctransform_0.3.5       sass_0.4.6             
#>  [22] parallelly_1.36.0       KernSmooth_2.23-21      bslib_0.5.0            
#>  [25] basilisk_1.13.1         htmlwidgets_1.6.2       desc_1.4.2             
#>  [28] ica_1.0-3               plyr_1.8.8              plotly_4.10.2          
#>  [31] zoo_1.8-12              cachem_1.0.8            igraph_1.5.0           
#>  [34] mime_0.12               lifecycle_1.0.3         pkgconfig_2.0.3        
#>  [37] Matrix_1.5-4.1          R6_2.5.1                fastmap_1.1.1          
#>  [40] GenomeInfoDbData_1.2.10 fitdistrplus_1.1-11     future_1.32.0          
#>  [43] shiny_1.7.4             digest_0.6.31           colorspace_2.1-0       
#>  [46] patchwork_1.1.2         tensor_1.5              rprojroot_2.0.3        
#>  [49] irlba_2.3.5.1           textshaping_0.3.6       filelock_1.0.2         
#>  [52] progressr_0.13.0        spatstat.sparse_3.0-1   fansi_1.0.4            
#>  [55] polyclip_1.10-4         abind_1.4-5             httr_1.4.6             
#>  [58] compiler_4.3.0          withr_2.5.0             rjsoncons_1.0.0        
#>  [61] HDF5Array_1.29.3        MASS_7.3-60             DelayedArray_0.27.5    
#>  [64] tools_4.3.0             lmtest_0.9-40           httpuv_1.6.11          
#>  [67] future.apply_1.11.0     goftest_1.2-3           glue_1.6.2             
#>  [70] rhdf5filters_1.13.3     nlme_3.1-162            promises_1.2.0.1       
#>  [73] grid_4.3.0              Rtsne_0.16              reshape2_1.4.4         
#>  [76] cluster_2.1.4           generics_0.1.3          spatstat.data_3.0-1    
#>  [79] gtable_0.3.3            tidyr_1.3.0             data.table_1.14.8      
#>  [82] sp_1.6-1                utf8_1.2.3              XVector_0.41.1         
#>  [85] spatstat.geom_3.2-1     RcppAnnoy_0.0.20        ggrepel_0.9.3          
#>  [88] RANN_2.6.1              pillar_1.9.0            stringr_1.5.0          
#>  [91] later_1.3.1             splines_4.3.0           lattice_0.21-8         
#>  [94] deldir_1.0-9            survival_3.5-5          tidyselect_1.2.0       
#>  [97] miniUI_0.1.1.1          pbapply_1.7-0           knitr_1.43             
#> [100] gridExtra_2.3           scattermore_1.2         xfun_0.39              
#> [103] DT_0.28                 stringi_1.7.12          lazyeval_0.2.2         
#> [106] yaml_2.3.7              evaluate_0.21           codetools_0.2-19       
#> [109] tibble_3.2.1            cli_3.6.1               uwot_0.1.14            
#> [112] xtable_1.8-4            reticulate_1.30         systemfonts_1.0.4      
#> [115] munsell_0.5.0           jquerylib_0.1.4         zellkonverter_1.11.1   
#> [118] Rcpp_1.0.10             dir.expiry_1.9.0        spatstat.random_3.1-5  
#> [121] globals_0.16.2          png_0.1-8               parallel_4.3.0         
#> [124] ellipsis_0.3.2          pkgdown_2.0.7           ggplot2_3.4.2          
#> [127] basilisk.utils_1.13.1   bitops_1.0-7            listenv_0.9.0          
#> [130] viridisLite_0.4.2       scales_1.2.1            ggridges_0.5.4         
#> [133] leiden_0.4.3            purrr_1.0.1             crayon_1.5.2           
#> [136] rlang_1.1.1             cowplot_1.1.1