B. Accessing HCA Data from R / Bioconductor • HCABiocTraining

We use the following packages during this section of the workshop.

## package developed for this workshop
library(HCABiocTraining)

## general programming tools
library(dplyr)

## data access
library(hca)
library(cellxgenedp)

## single cell data representation in R
library(SingleCellExperiment) # Bioconductor representation
library(Seurat)               # Seurat representation

Data Transformations

FASTQ files

DNA sequences and quality scores
Very large
Processed e.g., by CellRanger or other software
Quality control, summary to count matrix (below)

Count matrix

Usually genes (rows) x cells (columns)
Can be large (e.g., 30,000 genes x 50,000 cells)
Usually very sparse, e.g., 95% of cells ‘0’.
Still large enough to require a decent amount of computing power, e.g., 32 GB RAM, 8 CPU for some steps.
Common formats
- CSV file – lots of zero’s so very wasteful of space.
- ‘Matrix Market’ sparse matrix files, e.g., tuples of <row, column, count> for non-zero values.
- HDF5, e.g., .loom or .h5ad (anndata).
- RDS – An R file, in CELLxGENE these are Seurat objects but in general they could contain any R object.
Representation in R
- In-memory sparse matrices: dgCMatrix class from the Matrix package
- On-disk representation via Bioconductor’s DelayedArray / HDF5Array.

Counts & annotations

CSV and Matrix Market files store just counts; usually annotations on columns (e.g., what sample did each cell come from?) are stored separately.
HDF5 file formats coordinate row and column annotations with count data.
R software tries to offer a coordinated representation of counts and metadata, e.g., the Seurat or Bioconductor SingleCellExperiment objects.

Data Discovery

Human Cell Atlas

Data portal

What’s available?

Project & sample annotations for HCA-funded projects
FASTQ files
‘Legacy’ .loom files for about 50 experiments
Ad hoc count matrix data – mostly CSV or Matrix Market files – easy to download, but…
Count matrix data have uncertain provenance (how were they computed?). Often considerable work required to create usable data, e.g., hcaCaseStudies

CellXGene

Data Portal

What’s available?

Collections and datasets contributed by the single-cell community, with some overlap with data sets in the HCA Data Portal.
FASTQ files
.h5ad or .RDS summarized count data and cell metadata, as well as reduced-dimension (e.g., UMAP) representations
- Summarized count files provided by the contributor / individual lab, so of uncertain provenance
Easy to download count data
Easy to visualize (!)

Programatic Discovery

Why use an R script when the Data Portals exist?

Easily reproducible
Flexible exploration of rich & complex data
Direct integration with Seurat or Bioconductor single-cell workflows

Human Cell Atlas

See the ExploratingHCACxG workshop on HCA data retrieval for retrieving legacy .loom files.
See hcaCaseStudies for examples of processing CSV and Matrix Market files.

CELLxGENE

Load the cellxgenedp package

library(cellxgenedp)

Retrieve the current database, and use ‘tidy’ functionality to mimic the graphical selection in the web browser – 10x 3’ v3 (EFO:0009922) assay, Affrican American ethnicity, female gender)

db <- db()
african_american_female <-
    datasets(db) |>
    dplyr::filter(
        facets_filter(assay, "ontology_term_id", "EFO:0009922"),
        facets_filter(self_reported_ethnicity, "label", "African American"),
        facets_filter(sex, "label", "female")
    )
african_american_female
## # A tibble: 21 × 28
##    dataset_id     colle…¹ donor…² assay  cell_…³ cell_…⁴ datas…⁵ devel…⁶ disease
##    <chr>          <chr>   <list>  <list>   <int> <list>  <chr>   <list>  <list> 
##  1 de985818-285f… c9706a… <list>  <list>   31696 <list>  https:… <list>  <list> 
##  2 f72958f5-7f42… 2f75d2… <list>  <list>  982538 <list>  https:… <list>  <list> 
##  3 bc2a7b3d-f04e… b9fc3d… <list>  <list>  109995 <list>  https:… <list>  <list> 
##  4 96a3f64b-0ee9… b9fc3d… <list>  <list>  239696 <list>  https:… <list>  <list> 
##  5 d9b4bc69-ed90… b9fc3d… <list>  <list>   20000 <list>  https:… <list>  <list> 
##  6 59b69042-47c2… b9fc3d… <list>  <list>   49139 <list>  https:… <list>  <list> 
##  7 e763ed0d-0e5a… b9fc3d… <list>  <list>    7274 <list>  https:… <list>  <list> 
##  8 db0752b9-f20e… b9fc3d… <list>  <list>   55348 <list>  https:… <list>  <list> 
##  9 e9175006-8978… 62e8f0… <list>  <list>   14072 <list>  https:… <list>  <list> 
## 10 d224c8e0-c28e… 62e8f0… <list>  <list>    8030 <list>  https:… <list>  <list> 
## # … with 11 more rows, 19 more variables: is_primary_data <chr>,
## #   is_valid <lgl>, linked_genesets <lgl>, mean_genes_per_cell <dbl>,
## #   name <chr>, organism <list>, processing_status <list>, published <lgl>,
## #   revision <int>, schema_version <chr>, self_reported_ethnicity <list>,
## #   sex <list>, suspension_type <list>, tissue <list>, tombstone <lgl>,
## #   created_at <date>, published_at <date>, revised_at <date>,
## #   updated_at <date>, and abbreviated variable names ¹collection_id, …

‘Join’ selected datasets and files to identify the files associated with these datasets.

selected_files <-
    left_join(
        african_american_female |> select(dataset_id),
        files(db),
        by = "dataset_id"
    )
selected_files
## # A tibble: 63 × 8
##    dataset_id       file_id filen…¹ filet…² s3_uri user_…³ created_at updated_at
##    <chr>            <chr>   <chr>   <chr>   <chr>  <lgl>   <date>     <date>    
##  1 de985818-285f-4… 15e9d9… local.… H5AD    s3://… TRUE    2022-10-21 2022-10-21
##  2 de985818-285f-4… 0d3974… explor… CXG     s3://… TRUE    2022-10-21 2022-10-21
##  3 de985818-285f-4… e254f9… local.… RDS     s3://… TRUE    2022-10-21 2022-10-21
##  4 f72958f5-7f42-4… 59bd46… local.… RDS     s3://… TRUE    2022-10-18 2022-10-18
##  5 f72958f5-7f42-4… 3a2467… explor… CXG     s3://… TRUE    2022-10-18 2022-10-18
##  6 f72958f5-7f42-4… d9f9d0… local.… H5AD    s3://… TRUE    2022-10-18 2022-10-18
##  7 bc2a7b3d-f04e-4… f6d9f2… local.… H5AD    s3://… TRUE    2022-10-18 2022-10-18
##  8 bc2a7b3d-f04e-4… 46de9c… explor… CXG     s3://… TRUE    2022-10-18 2022-10-18
##  9 bc2a7b3d-f04e-4… 5331f2… local.… RDS     s3://… TRUE    2022-10-18 2022-10-18
## 10 96a3f64b-0ee9-4… b77452… local.… H5AD    s3://… TRUE    2022-10-18 2022-10-18
## # … with 53 more rows, and abbreviated variable names ¹filename, ²filetype,
## #   ³user_submitted

Select the first ‘CXG’ file available in this subset of data; for reproducibility we retrieve the dataset id…

selected_files |>
    dplyr::filter(filetype == "CXG") |>
    dplyr::slice(1) |>
    pull(dataset_id)
## [1] "de985818-285f-4f59-9dbd-d74968fddba3"

… and set it; the helper function training_cxg_dataset() provides a summary of this dataset.

dataset <- "de985818-285f-4f59-9dbd-d74968fddba3"
training_cxg_dataset(dataset)
## title: A single-cell atlas of the healthy breast tissues reveals
##     clinically relevant clusters of breast epithelial cells
## description: Single-cell RNA sequencing (scRNA-seq) is an evolving
##     technology used to elucidate the cellular architecture of adult
##     organs. Previous scRNA-seq on breast tissue utilized reduction
##     mammoplasty samples, which are often histologically abnormal. We
##     report a rapid tissue collection/processing protocol to perform
##     scRNA-seq of breast biopsies of healthy women and identify 23
##     breast epithelial cell clusters. Putative cell-of-origin signatures
##     derived from these clusters are applied to analyze transcriptomes
##     of ~3,000 breast cancers. Gene signatures derived from mature
##     luminal cell clusters are enriched in ~68% of breast cancers,
##     whereas a signature from a luminal progenitor cluster is enriched
##     in ~20% of breast cancers. Overexpression of luminal progenitor
##     cluster-derived signatures in HER2+, but not in other subtypes, is
##     associated with unfavorable outcome. We identify TBX3 and PDK4 as
##     genes co-expressed with estrogen receptor (ER) in the normal
##     breasts, and their expression analyses in >550 breast cancers
##     enable prognostically relevant subclassification of ER+ breast
##     cancers.
## authors: Bhat-Nakshatri, Poornima; Gao, Hongyu; Sheng, Liu; McGuire,
##     Patrick C.; Xuei, Xiaoling; Wan, Jun; Liu, Yunlong; Althouse,
##     Sandra K.; Colter, Austyn; Sandusky, George; Storniolo, Anna Maria;
##     Nakshatri, Harikrishna
## journal: Cell Reports Medicine
## assays: 10x 3' v2; 10x 3' v3
## organism: Homo sapiens
## ethnicity: African American; Chinese; European

Visualize this ‘CXG’ file in the browser…

selected_files |>
    dplyr::filter(filetype == "CXG", dataset_id == dataset) |>
    datasets_visualize()

…or select the ‘H5AD’ or ‘RDS’ (Seurat) file associated with the dataset and download it for subsequent processing in R

seurat_file <-
    selected_files |>
    dplyr::filter(filetype == "RDS", dataset_id == dataset) |>
    files_download(dry.run = FALSE)

h5ad_file <-
    selected_files |>
    dplyr::filter(filetype == "H5AD", dataset_id == dataset) |>
    files_download(dry.run = FALSE)

The downloaded file is cached, so the next time access is fast.

Representation in R

Seurat

Fortunately, CELLxGENE distributes Seurat (v. 4) files, and they can be input directly.

library(Seurat)
seurat <- readRDS(seurat_file)
seurat
## An object of class Seurat 
## 33234 features across 31696 samples within 1 assay 
## Active assay: RNA (33234 features, 0 variable features)
##  3 dimensional reductions calculated: pca, tsne, umap

We will work with this object a little in a subsequent article.

The .h5ad files from CELLxGENE can also be used to construct a Seurat object. Normally, a Seurat object can be created using SeuratDisk using this workflow

h5seurat_file <- sub(".h5ad$", "h5seurat", h5ad_file)
Convert(h5ad_file, dest = "h5seurat", overwrite = TRUE)
h5seurat <- LoadH5Seurat(h5seurat_file)

Unfortunately this does not currently handle the .h5ad file format made available by CELLxGENE. The helper function training_read_h5ad_as_seurat() sketches an approach to data import using the anndata R package and anndata module.

SingleCellExperiment

The SingleCellExperiment is used to represent ‘rectangular’ single cell expression and other data in R / Bioconductor. It coordinates a gene x cell count matricies (assay()) with annotations on the genes (rowData()) and columns (cellData()), and with reduced-dimension summaries.

Usually, an effective way to represent the .h5ad data as a SingleCellExperiment is using zellkonverter::readH5AD(). Recent updates to the .h5ad file format have unfortunately broken that function; it will likely be fixed in a timely manner. Instead, we use the helper function training_read_h5ad_as_sce() to load the data

h5ad <- training_read_h5ad_as_sce(h5ad_file)

The approach uses the anndata R package and anndata module. Displaying the object…

h5ad
## class: SingleCellExperiment 
## dim: 33234 31696 
## metadata(3): default_embedding schema_version title
## assays(1): counts
## rownames(33234): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
##   ENSG00000268674
## rowData names(4): feature_is_filtered feature_name feature_reference
##   feature_biotype
## colnames(31696): CMGpool_AAACCCAAGGACAACC CMGpool_AAACCCACAATCTCTT ...
##   K109064_TTTGTTGGTTGCATCA K109064_TTTGTTGGTTGGACCC
## colData names(34): donor_id self_reported_ethnicity_ontology_term_id
##   ... self_reported_ethnicity development_stage
## reducedDimNames(3): X_pca X_tsne X_umap
## mainExpName: NULL
## altExpNames(0):

…suggests the data available and how to access it – there are 33234 genes and 31696 cells. The ‘raw’ data include counts assays(h5ad, "counts"), annotations on each gene (rowData()) and cell (colData()), etc…

Working with SingleCellExperiment objects is described in additional detail subsequent articles.