B. Accessing HCA Data from R / Bioconductor
b-hca-data-access.Rmd
We use the following packages during this section of the workshop.
## package developed for this workshop
library(HCABiocTraining)
## general programming tools
library(dplyr)
## data access
library(hca)
library(cellxgenedp)
## single cell data representation in R
library(SingleCellExperiment) # Bioconductor representation
library(Seurat) # Seurat representation
Data Transformations
FASTQ files
- DNA sequences and quality scores
- Very large
- Processed e.g., by CellRanger or other software
- Quality control, summary to count matrix (below)
Count matrix
Usually genes (rows) x cells (columns)
Can be large (e.g., 30,000 genes x 50,000 cells)
Usually very sparse, e.g., 95% of cells ‘0’.
Still large enough to require a decent amount of computing power, e.g., 32 GB RAM, 8 CPU for some steps.
-
Common formats
-
Representation in R
- In-memory sparse matrices:
dgCMatrix
class from the Matrix package - On-disk representation via Bioconductor’s DelayedArray / HDF5Array.
- In-memory sparse matrices:
Counts & annotations
- CSV and Matrix Market files store just counts; usually annotations on columns (e.g., what sample did each cell come from?) are stored separately.
- HDF5 file formats coordinate row and column annotations with count data.
- R software tries to offer a coordinated representation of counts and metadata, e.g., the Seurat or Bioconductor SingleCellExperiment objects.
Data Discovery
Human Cell Atlas
What’s available?
- Project & sample annotations for HCA-funded projects
- FASTQ files
- ‘Legacy’
.loom
files for about 50 experiments - Ad hoc count matrix data – mostly CSV or Matrix Market files – easy to download, but…
- Count matrix data have uncertain provenance (how were they computed?). Often considerable work required to create usable data, e.g., hcaCaseStudies
CellXGene
What’s available?
- Collections and datasets contributed by the single-cell community, with some overlap with data sets in the HCA Data Portal.
- FASTQ files
-
.h5ad
or.RDS
summarized count data and cell metadata, as well as reduced-dimension (e.g., UMAP) representations- Summarized count files provided by the contributor / individual lab, so of uncertain provenance
- Easy to download count data
- Easy to visualize (!)
Programatic Discovery
Why use an R script when the Data Portals exist?
- Easily reproducible
- Flexible exploration of rich & complex data
- Direct integration with Seurat or Bioconductor single-cell workflows
Human Cell Atlas
See the ExploratingHCACxG workshop on HCA data retrieval for retrieving legacy
.loom
files.See hcaCaseStudies for examples of processing CSV and Matrix Market files.
CELLxGENE
Load the cellxgenedp package
library(cellxgenedp)
Retrieve the current database, and use ‘tidy’ functionality to mimic the graphical selection in the web browser – 10x 3’ v3 (EFO:0009922) assay, Affrican American ethnicity, female gender)
db <- db()
african_american_female <-
datasets(db) |>
dplyr::filter(
facets_filter(assay, "ontology_term_id", "EFO:0009922"),
facets_filter(self_reported_ethnicity, "label", "African American"),
facets_filter(sex, "label", "female")
)
african_american_female
## # A tibble: 21 × 28
## dataset_id colle…¹ donor…² assay cell_…³ cell_…⁴ datas…⁵ devel…⁶ disease
## <chr> <chr> <list> <list> <int> <list> <chr> <list> <list>
## 1 de985818-285f… c9706a… <list> <list> 31696 <list> https:… <list> <list>
## 2 f72958f5-7f42… 2f75d2… <list> <list> 982538 <list> https:… <list> <list>
## 3 bc2a7b3d-f04e… b9fc3d… <list> <list> 109995 <list> https:… <list> <list>
## 4 96a3f64b-0ee9… b9fc3d… <list> <list> 239696 <list> https:… <list> <list>
## 5 d9b4bc69-ed90… b9fc3d… <list> <list> 20000 <list> https:… <list> <list>
## 6 59b69042-47c2… b9fc3d… <list> <list> 49139 <list> https:… <list> <list>
## 7 e763ed0d-0e5a… b9fc3d… <list> <list> 7274 <list> https:… <list> <list>
## 8 db0752b9-f20e… b9fc3d… <list> <list> 55348 <list> https:… <list> <list>
## 9 e9175006-8978… 62e8f0… <list> <list> 14072 <list> https:… <list> <list>
## 10 d224c8e0-c28e… 62e8f0… <list> <list> 8030 <list> https:… <list> <list>
## # … with 11 more rows, 19 more variables: is_primary_data <chr>,
## # is_valid <lgl>, linked_genesets <lgl>, mean_genes_per_cell <dbl>,
## # name <chr>, organism <list>, processing_status <list>, published <lgl>,
## # revision <int>, schema_version <chr>, self_reported_ethnicity <list>,
## # sex <list>, suspension_type <list>, tissue <list>, tombstone <lgl>,
## # created_at <date>, published_at <date>, revised_at <date>,
## # updated_at <date>, and abbreviated variable names ¹collection_id, …
‘Join’ selected datasets and files to identify the files associated with these datasets.
selected_files <-
left_join(
african_american_female |> select(dataset_id),
files(db),
by = "dataset_id"
)
selected_files
## # A tibble: 63 × 8
## dataset_id file_id filen…¹ filet…² s3_uri user_…³ created_at updated_at
## <chr> <chr> <chr> <chr> <chr> <lgl> <date> <date>
## 1 de985818-285f-4… 15e9d9… local.… H5AD s3://… TRUE 2022-10-21 2022-10-21
## 2 de985818-285f-4… 0d3974… explor… CXG s3://… TRUE 2022-10-21 2022-10-21
## 3 de985818-285f-4… e254f9… local.… RDS s3://… TRUE 2022-10-21 2022-10-21
## 4 f72958f5-7f42-4… 59bd46… local.… RDS s3://… TRUE 2022-10-18 2022-10-18
## 5 f72958f5-7f42-4… 3a2467… explor… CXG s3://… TRUE 2022-10-18 2022-10-18
## 6 f72958f5-7f42-4… d9f9d0… local.… H5AD s3://… TRUE 2022-10-18 2022-10-18
## 7 bc2a7b3d-f04e-4… f6d9f2… local.… H5AD s3://… TRUE 2022-10-18 2022-10-18
## 8 bc2a7b3d-f04e-4… 46de9c… explor… CXG s3://… TRUE 2022-10-18 2022-10-18
## 9 bc2a7b3d-f04e-4… 5331f2… local.… RDS s3://… TRUE 2022-10-18 2022-10-18
## 10 96a3f64b-0ee9-4… b77452… local.… H5AD s3://… TRUE 2022-10-18 2022-10-18
## # … with 53 more rows, and abbreviated variable names ¹filename, ²filetype,
## # ³user_submitted
Select the first ‘CXG’ file available in this subset of data; for reproducibility we retrieve the dataset id…
selected_files |>
dplyr::filter(filetype == "CXG") |>
dplyr::slice(1) |>
pull(dataset_id)
## [1] "de985818-285f-4f59-9dbd-d74968fddba3"
… and set it; the helper function training_cxg_dataset()
provides a summary of this dataset.
dataset <- "de985818-285f-4f59-9dbd-d74968fddba3"
training_cxg_dataset(dataset)
## title: A single-cell atlas of the healthy breast tissues reveals
## clinically relevant clusters of breast epithelial cells
## description: Single-cell RNA sequencing (scRNA-seq) is an evolving
## technology used to elucidate the cellular architecture of adult
## organs. Previous scRNA-seq on breast tissue utilized reduction
## mammoplasty samples, which are often histologically abnormal. We
## report a rapid tissue collection/processing protocol to perform
## scRNA-seq of breast biopsies of healthy women and identify 23
## breast epithelial cell clusters. Putative cell-of-origin signatures
## derived from these clusters are applied to analyze transcriptomes
## of ~3,000 breast cancers. Gene signatures derived from mature
## luminal cell clusters are enriched in ~68% of breast cancers,
## whereas a signature from a luminal progenitor cluster is enriched
## in ~20% of breast cancers. Overexpression of luminal progenitor
## cluster-derived signatures in HER2+, but not in other subtypes, is
## associated with unfavorable outcome. We identify TBX3 and PDK4 as
## genes co-expressed with estrogen receptor (ER) in the normal
## breasts, and their expression analyses in >550 breast cancers
## enable prognostically relevant subclassification of ER+ breast
## cancers.
## authors: Bhat-Nakshatri, Poornima; Gao, Hongyu; Sheng, Liu; McGuire,
## Patrick C.; Xuei, Xiaoling; Wan, Jun; Liu, Yunlong; Althouse,
## Sandra K.; Colter, Austyn; Sandusky, George; Storniolo, Anna Maria;
## Nakshatri, Harikrishna
## journal: Cell Reports Medicine
## assays: 10x 3' v2; 10x 3' v3
## organism: Homo sapiens
## ethnicity: African American; Chinese; European
Visualize this ‘CXG’ file in the browser…
selected_files |>
dplyr::filter(filetype == "CXG", dataset_id == dataset) |>
datasets_visualize()
…or select the ‘H5AD’ or ‘RDS’ (Seurat) file associated with the dataset and download it for subsequent processing in R
seurat_file <-
selected_files |>
dplyr::filter(filetype == "RDS", dataset_id == dataset) |>
files_download(dry.run = FALSE)
h5ad_file <-
selected_files |>
dplyr::filter(filetype == "H5AD", dataset_id == dataset) |>
files_download(dry.run = FALSE)
The downloaded file is cached, so the next time access is fast.
Representation in R
Seurat
Fortunately, CELLxGENE distributes Seurat (v. 4) files, and they can be input directly.
library(Seurat)
seurat <- readRDS(seurat_file)
seurat
## An object of class Seurat
## 33234 features across 31696 samples within 1 assay
## Active assay: RNA (33234 features, 0 variable features)
## 3 dimensional reductions calculated: pca, tsne, umap
We will work with this object a little in a subsequent article.
The .h5ad
files from CELLxGENE can also be used to
construct a Seurat
object. Normally, a Seurat
object can be created using SeuratDisk
using this workflow
h5seurat_file <- sub(".h5ad$", "h5seurat", h5ad_file)
Convert(h5ad_file, dest = "h5seurat", overwrite = TRUE)
h5seurat <- LoadH5Seurat(h5seurat_file)
Unfortunately this does not currently handle the .h5ad
file format made available by CELLxGENE. The helper function
training_read_h5ad_as_seurat()
sketches an approach to data
import using the anndata
R package and anndata
module.
SingleCellExperiment
The SingleCellExperiment
is used to represent
‘rectangular’ single cell expression and other data in R /
Bioconductor. It coordinates a gene x cell count matricies
(assay()
) with annotations on the genes
(rowData()
) and columns (cellData()
), and with
reduced-dimension summaries.
Usually, an effective way to represent the .h5ad
data as
a SingleCellExperiment
is using
zellkonverter::readH5AD()
. Recent updates to the
.h5ad
file format have unfortunately broken that function;
it will likely be fixed in a timely manner. Instead, we use the helper
function training_read_h5ad_as_sce()
to load the data
h5ad <- training_read_h5ad_as_sce(h5ad_file)
The approach uses the anndata R package and anndata module. Displaying the object…
h5ad
## class: SingleCellExperiment
## dim: 33234 31696
## metadata(3): default_embedding schema_version title
## assays(1): counts
## rownames(33234): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
## ENSG00000268674
## rowData names(4): feature_is_filtered feature_name feature_reference
## feature_biotype
## colnames(31696): CMGpool_AAACCCAAGGACAACC CMGpool_AAACCCACAATCTCTT ...
## K109064_TTTGTTGGTTGCATCA K109064_TTTGTTGGTTGGACCC
## colData names(34): donor_id self_reported_ethnicity_ontology_term_id
## ... self_reported_ethnicity development_stage
## reducedDimNames(3): X_pca X_tsne X_umap
## mainExpName: NULL
## altExpNames(0):
…suggests the data available and how to access it – there are 33234
genes and 31696 cells. The ‘raw’ data include counts
assays(h5ad, "counts")
, annotations on each gene
(rowData()
) and cell (colData()
), etc…
Working with SingleCellExperiment
objects is described
in additional detail subsequent articles.