Discover and download datasets and files from the cellxgene data portal
Martin Morgan
Roswell Park Comprehensive Cancer Center
Martin.Morgan@RoswellPark.org
Source: vignettes/using_cellxgenedp.Rmd
Abstract
The cellxgene data portal (https://cellxgene.cziscience.com/) provides a graphical user interface to collections of single-cell sequence data processed in standard ways to ‘count matrix’ summaries. The cellxgenedp package provides an alternative, R-based interface, allowing flexible data discovery, viewing, and downloading.
NOTE: The interface to CELLxGENE has changed; versions of cellxgenedp prior to 1.4.1 / 1.5.2 will cease to work when CELLxGENE removes the previous interface. See the vignette section ‘API changes’ for additional details.
Installation and use
This package is available in Bioconductor version 3.15 and later. The following code installs cellxgenedp.
if (!"BiocManager" %in% rownames(installed.packages()))
    install.packages("BiocManager", repos = "https://CRAN.R-project.org")
BiocManager::install("cellxgenedp")
Alternatively, install the ‘development’ version from GitHub
if (!"remotes" %in% rownames(installed.packages()))
    install.packages("remotes", repos = "https://CRAN.R-project.org")
remotes::install_github("mtmorgan/cellxgenedp")
To also install additional packages required for this vignette, use
pkgs <- c("zellkonverter", "SingleCellExperiment", "HDF5Array")
required_pkgs <- pkgs[!pkgs %in% rownames(installed.packages())]
BiocManager::install(required_pkgs)
Load the package into your current R session. We make extensive use of the dplyr package, and at the end of the vignette use SingleCellExperiment and zellkonverter, so load those as well.
library(zellkonverter)
library(SingleCellExperiment) # load early to avoid masking dplyr::count()
library(dplyr)
library(cellxgenedp)
cxg() Provides a ‘shiny’ interface
The following sections outline how to use the cellxgenedp package in an R script; most functionality is also available in the cxg() shiny application, providing an easy way to identify, download, and visualize one or several datasets. Start the app
cxg()
then choose a project on the first tab, and a dataset for visualization or one or more datasets for download.
Collections, datasets and files
Retrieve metadata about resources available at the cellxgene data portal using db():
db <- db()
Printing the db object provides a brief overview of the available data, as well as hints, in the form of functions like collections(), for further exploration.
db
## cellxgene_db
## number of collections(): 182
## number of datasets(): 1167
## number of files(): 2314
The portal organizes data hierarchically, with ‘collections’ (research studies, approximately), ‘datasets’, and ‘files’. Discover data using the corresponding functions.
collections(db)
## # A tibble: 182 × 18
## collection_id collection_version_id collection_url consortia contact_email
## <chr> <chr> <chr> <list> <chr>
## 1 ceb895f4-ff9f-4… ee098b5a-4f33-473b-b… https://cellx… <list> panagiotis.r…
## 2 af893e86-8e9f-4… 768170a6-c590-4900-a… https://cellx… <list> ruichen@bcm.…
## 3 1d1c7275-476a-4… 609becde-c797-41bb-8… https://cellx… <list> wey334@g.har…
## 4 1b014f39-f202-4… 1d88cb46-6e84-4b5b-b… https://cellx… <lgl [1]> kimberly.ald…
## 5 48d354f5-a5ca-4… 2862daa3-c933-43c8-9… https://cellx… <list> Nathan.Salom…
## 6 43d4bb39-21af-4… 78360f02-1acc-415c-a… https://cellx… <lgl [1]> raymond.cho@…
## 7 f7cecffa-00b4-4… 43224f82-db2a-443c-9… https://cellx… <list> st9@sanger.a…
## 8 f17b9205-f61f-4… 21ff4724-95e2-491b-8… https://cellx… <list> genevieve.ko…
## 9 64b24fda-6591-4… e414854b-2666-4977-9… https://cellx… <lgl [1]> magness@med.…
## 10 48259aa8-f168-4… 44601b80-bd11-49d8-a… https://cellx… <lgl [1]> wtk22@cam.ac…
## # ℹ 172 more rows
## # ℹ 13 more variables: contact_name <chr>, curator_name <chr>,
## # description <chr>, doi <chr>, links <list>, name <chr>,
## # publisher_metadata <list>, revising_in <lgl>, revision_of <lgl>,
## # visibility <chr>, created_at <date>, published_at <date>, revised_at <date>
datasets(db)
## # A tibble: 1,167 × 31
## dataset_id dataset_version_id collection_id donor_id assay batch_condition
## <chr> <chr> <chr> <list> <list> <list>
## 1 53ce2631-36… 2f17c183-388a-4c0… ceb895f4-ff9… <list> <list> <list [2]>
## 2 1d4128f6-c2… 94762ee1-9f9f-49e… ceb895f4-ff9… <list> <list> <list [2]>
## 3 ed419b4e-db… 758b30a8-5fb0-46c… af893e86-8e9… <list> <list> <lgl [1]>
## 4 aad97cb5-f3… d6966985-89f9-485… af893e86-8e9… <list> <list> <lgl [1]>
## 5 8f10185b-e0… 63d7a3a3-9691-41d… af893e86-8e9… <list> <list> <lgl [1]>
## 6 359f7af4-87… 0f461193-282f-443… af893e86-8e9… <list> <list> <lgl [1]>
## 7 11ef37ee-21… 74253a67-927c-4cd… af893e86-8e9… <list> <list> <lgl [1]>
## 8 0129dbd9-a7… a970179d-2e9e-4d2… af893e86-8e9… <list> <list> <lgl [1]>
## 9 00e5dedd-b9… 94c0e74c-b269-4ce… af893e86-8e9… <list> <list> <lgl [1]>
## 10 d319af7f-be… 3c80a5bb-8c89-433… 1d1c7275-476… <list> <list> <lgl [1]>
## # ℹ 1,157 more rows
## # ℹ 25 more variables: cell_count <int>, cell_type <list>, citation <chr>,
## # development_stage <list>, disease <list>, embeddings <list>,
## # explorer_url <chr>, feature_biotype <list>, feature_count <int>,
## # feature_reference <list>, is_primary_data <list>,
## # mean_genes_per_cell <dbl>, organism <list>, primary_cell_count <int>,
## # raw_data_location <chr>, schema_version <chr>, …
files(db)
## # A tibble: 2,314 × 4
## dataset_id filesize filetype url
## <chr> <dbl> <chr> <chr>
## 1 53ce2631-3646-4172-bbd9-38b0a44d8214 406108808 H5AD https://datasets.ce…
## 2 53ce2631-3646-4172-bbd9-38b0a44d8214 399752425 RDS https://datasets.ce…
## 3 1d4128f6-c27b-40c4-af77-b1c7e2b694e7 906795740 H5AD https://datasets.ce…
## 4 1d4128f6-c27b-40c4-af77-b1c7e2b694e7 1060800682 RDS https://datasets.ce…
## 5 ed419b4e-db9b-40f1-8593-68fdf8dfb076 1071401902 H5AD https://datasets.ce…
## 6 ed419b4e-db9b-40f1-8593-68fdf8dfb076 1419579253 RDS https://datasets.ce…
## 7 aad97cb5-f375-45ef-ae9d-178e7f5d5180 785137201 H5AD https://datasets.ce…
## 8 aad97cb5-f375-45ef-ae9d-178e7f5d5180 1025253758 RDS https://datasets.ce…
## 9 8f10185b-e0b3-46a5-8706-7f1799225d79 3077438912 H5AD https://datasets.ce…
## 10 8f10185b-e0b3-46a5-8706-7f1799225d79 4090930879 RDS https://datasets.ce…
## # ℹ 2,304 more rows
Collections and datasets each have a unique primary identifier (collection_id, dataset_id), and datasets and files also carry the identifier of the resource they belong to (collection_id in datasets(), dataset_id in files()). These identifiers can be used to ‘join’ information across tables.
Using dplyr to navigate data
A collection may have several datasets, and datasets may have several files. For instance, here is the collection with the most datasets
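The collection with the most datasets can be found by counting the rows of datasets() for each collection_id and keeping the largest count. The sketch below is illustrative only: toy_datasets is a made-up stand-in for datasets(db), used so that the example runs without network access.

```r
library(dplyr)

## Toy stand-in for datasets(db); the identifiers are made up
toy_datasets <- tibble(
    dataset_id = c("d1", "d2", "d3", "d4"),
    collection_id = c("c1", "c2", "c2", "c2")
)

## Count datasets per collection, largest first, and keep the top row
toy_collection_with_most_datasets <-
    toy_datasets |>
    count(collection_id, sort = TRUE) |>
    slice(1)

toy_collection_with_most_datasets
```

Applying the same pipeline to datasets(db) is one way to create the collection_with_most_datasets object used in the joins that follow.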
We can find out about this collection by joining with the collections() table.
left_join(
    collection_with_most_datasets |> select(collection_id),
    collections(db),
    by = "collection_id"
) |> glimpse()
## Rows: 1
## Columns: 18
## $ collection_id <chr> "283d65eb-dd53-496d-adb7-7570c7caa443"
## $ collection_version_id <chr> "4c16c611-00a9-42f9-a8c4-7b42daa226fe"
## $ collection_url <chr> "https://cellxgene.cziscience.com/collections/28…
## $ consortia <list> ["BRAIN Initiative", "CZI Single-Cell Biology"]
## $ contact_email <chr> "kimberly.siletti@ki.se"
## $ contact_name <chr> "Kimberly Siletti"
## $ curator_name <chr> "James Chaffer"
## $ description <chr> "First draft atlas of human brain transcriptomic…
## $ doi <chr> "10.1126/science.add7046"
## $ links <list> [["", "RAW_DATA", "http://data.nemoarchive.org/b…
## $ name <chr> "Human Brain Cell Atlas v1.0"
## $ publisher_metadata <list> [[["Siletti", "Kimberly"], ["Hodge", "Rebecca"]…
## $ revising_in <lgl> NA
## $ revision_of <lgl> NA
## $ visibility <chr> "PUBLIC"
## $ created_at <date> 2023-12-12
## $ published_at <date> 2022-12-09
## $ revised_at <date> 2023-12-13
We can take a similar strategy to identify all datasets belonging to this collection
left_join(
    collection_with_most_datasets |> select(collection_id),
    datasets(db),
    by = "collection_id"
)
## # A tibble: 138 × 31
## collection_id dataset_id dataset_version_id donor_id assay batch_condition
## <chr> <chr> <chr> <list> <list> <list>
## 1 283d65eb-dd53-… ff7d15fa-… 51e05270-1f00-452… <list> <list> <list [1]>
## 2 283d65eb-dd53-… fe1a73ab-… 4e124ecc-7885-465… <list> <list> <list [1]>
## 3 283d65eb-dd53-… fbf173f9-… 5a52f557-aeaf-4fc… <list> <list> <list [1]>
## 4 283d65eb-dd53-… fa554686-… 6606e9aa-e4c4-452… <list> <list> <list [1]>
## 5 283d65eb-dd53-… f9034091-… 8f5b1977-8317-447… <list> <list> <list [1]>
## 6 283d65eb-dd53-… f8dda921-… 1ad58833-956c-454… <list> <list> <list [1]>
## 7 283d65eb-dd53-… f7d003d4-… 4d002ac1-4671-490… <list> <list> <list [1]>
## 8 283d65eb-dd53-… f6d9f2ad-… 2102f4b8-c1fe-4ee… <list> <list> <list [1]>
## 9 283d65eb-dd53-… f5a04dff-… b92375fd-dafe-44c… <list> <list> <list [1]>
## 10 283d65eb-dd53-… f502c312-… b750310e-1abb-4c7… <list> <list> <list [1]>
## # ℹ 128 more rows
## # ℹ 25 more variables: cell_count <int>, cell_type <list>, citation <chr>,
## # development_stage <list>, disease <list>, embeddings <list>,
## # explorer_url <chr>, feature_biotype <list>, feature_count <int>,
## # feature_reference <list>, is_primary_data <list>,
## # mean_genes_per_cell <dbl>, organism <list>, primary_cell_count <int>,
## # raw_data_location <chr>, schema_version <chr>, …
facets() provides information on ‘levels’ present in specific columns
Notice that some columns are ‘lists’ rather than atomic vectors like ‘character’ or ‘integer’.
## # A tibble: 1,167 × 15
## donor_id assay batch_condition cell_type development_stage disease
## <list> <list> <list> <list> <list> <list>
## 1 <list [12]> <list [1]> <list [2]> <list [13]> <list [10]> <list>
## 2 <list [12]> <list [1]> <list [2]> <list [13]> <list [10]> <list>
## 3 <list [6]> <list [1]> <lgl [1]> <list [4]> <list [5]> <list>
## 4 <list [6]> <list [1]> <lgl [1]> <list [1]> <list [5]> <list>
## 5 <list [6]> <list [1]> <lgl [1]> <list [2]> <list [5]> <list>
## 6 <list [6]> <list [1]> <lgl [1]> <list [1]> <list [5]> <list>
## 7 <list [6]> <list [1]> <lgl [1]> <list [1]> <list [5]> <list>
## 8 <list [6]> <list [1]> <lgl [1]> <list [6]> <list [5]> <list>
## 9 <list [6]> <list [1]> <lgl [1]> <list [2]> <list [5]> <list>
## 10 <list [7]> <list [2]> <lgl [1]> <list [1]> <list [7]> <list>
## # ℹ 1,157 more rows
## # ℹ 9 more variables: embeddings <list>, feature_biotype <list>,
## # feature_reference <list>, is_primary_data <list>, organism <list>,
## # self_reported_ethnicity <list>, sex <list>, suspension_type <list>,
## # tissue <list>
This indicates that at least some of the datasets had more than one type of assay, cell_type, etc. The facets() function provides a convenient way of discovering possible levels of each column, e.g., assay, organism, self_reported_ethnicity, or sex, and the number of datasets with each label.
facets(db, "assay")
## # A tibble: 38 × 4
## facet label ontology_term_id n
## <chr> <chr> <chr> <int>
## 1 assay 10x 3' v3 EFO:0009922 563
## 2 assay 10x 3' v2 EFO:0009899 254
## 3 assay Slide-seqV2 EFO:0030062 223
## 4 assay Visium Spatial Gene Expression EFO:0010961 108
## 5 assay 10x 5' v1 EFO:0011025 81
## 6 assay Smart-seq2 EFO:0008931 63
## 7 assay 10x multiome EFO:0030059 61
## 8 assay 10x 5' v2 EFO:0009900 23
## 9 assay sci-RNA-seq3 EFO:0030028 15
## 10 assay Drop-seq EFO:0008722 14
## # ℹ 28 more rows
facets(db, "self_reported_ethnicity")
## # A tibble: 32 × 4
## facet label ontology_term_id n
## <chr> <chr> <chr> <int>
## 1 self_reported_ethnicity European HANCESTRO:0005 499
## 2 self_reported_ethnicity unknown unknown 411
## 3 self_reported_ethnicity na na 314
## 4 self_reported_ethnicity Asian HANCESTRO:0008 141
## 5 self_reported_ethnicity African American HANCESTRO:0568 61
## 6 self_reported_ethnicity Native American,Hispanic or L… HANCESTRO:0013,… 50
## 7 self_reported_ethnicity Hispanic or Latin American HANCESTRO:0014 48
## 8 self_reported_ethnicity African American or Afro-Cari… HANCESTRO:0016 26
## 9 self_reported_ethnicity Greater Middle Eastern (Midd… HANCESTRO:0015 22
## 10 self_reported_ethnicity South Asian HANCESTRO:0006 11
## # ℹ 22 more rows
facets(db, "sex")
## # A tibble: 3 × 4
## facet label ontology_term_id n
## <chr> <chr> <chr> <int>
## 1 sex male PATO:0000384 903
## 2 sex female PATO:0000383 677
## 3 sex unknown unknown 173
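The counts reported by facets() can be thought of as the number of datasets whose list column contains a given label. The sketch below illustrates that idea on made-up data. It assumes that each list-column element is a list of terms with label and ontology_term_id entries, which is how the printed columns appear to be organised but is not confirmed here; toy_datasets and its values are hypothetical.

```r
library(dplyr)

## Toy stand-in for datasets(db), with one list column (assumed structure)
toy_datasets <- tibble(
    dataset_id = c("d1", "d2", "d3"),
    cell_count = c(100L, 250L, 40L),
    assay = list(
        list(list(label = "10x 3' v3", ontology_term_id = "EFO:0009922")),
        list(
            list(label = "10x 3' v3", ontology_term_id = "EFO:0009922"),
            list(label = "Smart-seq2", ontology_term_id = "EFO:0008931")
        ),
        list(list(label = "Smart-seq2", ontology_term_id = "EFO:0008931"))
    )
)

## Keep only the list columns
list_columns <- toy_datasets |> select(where(is.list))

## Distinct labels present in each dataset
labels_per_dataset <- lapply(toy_datasets$assay, function(terms) {
    unique(vapply(terms, function(term) term$label, character(1)))
})

## Number of datasets carrying each label
assay_counts <-
    tibble(label = unlist(labels_per_dataset)) |>
    count(label, sort = TRUE)

assay_counts
```

Because a dataset can carry several labels, the counts for a facet can add up to more than the number of datasets.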
Filtering faceted columns
Suppose we were interested in finding datasets from the 10x 3’ v3 assay (ontology_term_id of EFO:0009922) containing individuals of African American ethnicity, and female sex. Use the facets_filter() utility function to filter data sets as needed
african_american_female <-
    datasets(db) |>
    filter(
        facets_filter(assay, "ontology_term_id", "EFO:0009922"),
        facets_filter(self_reported_ethnicity, "label", "African American"),
        facets_filter(sex, "label", "female")
    )
Use nrow(african_american_female) to find the number of datasets satisfying our criteria. It looks like there are up to 4,320,736 cells sequenced (each dataset may contain cells from several ethnicities, as well as males or individuals of unknown gender, so we do not know the actual number of cells available without downloading files). Use left_join to identify the corresponding collections:
## collections
left_join(
    african_american_female |> select(collection_id) |> distinct(),
    collections(db),
    by = "collection_id"
)
## # A tibble: 13 × 18
## collection_id collection_version_id collection_url consortia contact_email
## <chr> <chr> <chr> <list> <chr>
## 1 f17b9205-f61f-4… 21ff4724-95e2-491b-8… https://cellx… <list> genevieve.ko…
## 2 625f6bf4-2f33-4… 0c0d607f-00b8-4f3d-8… https://cellx… <list> a5wang@healt…
## 3 c9706a92-0e5f-4… bc627471-7137-4518-a… https://cellx… <list> hnakshat@iup…
## 4 a98b828a-622a-4… cee0b899-009a-40ec-a… https://cellx… <list> markusbi@med…
## 5 bcb61471-2a44-4… 39fca0ca-2b0f-47b5-9… https://cellx… <list> info@kpmp.org
## 6 72d37bc9-76cc-4… 3e396ffb-b0d8-4ce4-b… https://cellx… <list> m.sepp@zmbh.…
## 7 b953c942-f5d8-4… 7727e578-1805-47c8-b… https://cellx… <lgl [1]> icobos@stanf…
## 8 62e8f058-9c37-4… addce074-53d2-4f21-9… https://cellx… <list> chanj3@mskcc…
## 9 71f4bccf-53d4-4… 5a524bd4-231b-4941-a… https://cellx… <list> kevinmbyrd@g…
## 10 e1fa9900-3fc9-4… 85624898-8006-4209-a… https://cellx… <lgl [1]> j.ma@yale.edu
## 11 4195ab4c-20bd-4… f7da9dd1-b0ec-401f-9… https://cellx… <list> nnavin@mdand…
## 12 6b701826-37bb-4… 95ab05df-9716-4fc8-a… https://cellx… <list> astreets@ber…
## 13 b9fc3d70-5a72-4… 6701e565-6dfe-4649-b… https://cellx… <list> bruce.aronow…
## # ℹ 13 more variables: contact_name <chr>, curator_name <chr>,
## # description <chr>, doi <chr>, links <list>, name <chr>,
## # publisher_metadata <list>, revising_in <lgl>, revision_of <lgl>,
## # visibility <chr>, created_at <date>, published_at <date>, revised_at <date>
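To make the filtering step concrete, the sketch below re-creates the idea on made-up data: keep the datasets whose list column contains a matching term, then sum cell_count to get an upper bound on the number of cells. has_facet() is a hypothetical stand-in written for this illustration and is not the package's facets_filter() implementation; the structure of the sex column (a list of terms with label and ontology_term_id) is likewise an assumption.

```r
library(dplyr)

## Hypothetical stand-in for facets_filter(): TRUE for each dataset whose
## facet (a list of terms) has a term with term[[key]] equal to value
has_facet <- function(facet, key, value) {
    vapply(facet, function(terms) {
        any(vapply(terms, function(term) {
            is.list(term) && identical(term[[key]], value)
        }, logical(1)))
    }, logical(1))
}

## Toy stand-in for datasets(db); all values are made up
toy_datasets <- tibble(
    dataset_id = c("d1", "d2", "d3"),
    cell_count = c(1000L, 2500L, 400L),
    sex = list(
        list(list(label = "female", ontology_term_id = "PATO:0000383")),
        list(
            list(label = "male", ontology_term_id = "PATO:0000384"),
            list(label = "female", ontology_term_id = "PATO:0000383")
        ),
        list(list(label = "male", ontology_term_id = "PATO:0000384"))
    )
)

toy_female <- toy_datasets |> filter(has_facet(sex, "label", "female"))

## Upper bound on cells: dataset d2 also contains cells from males
toy_total <- toy_female |> summarise(total_cell_count = sum(cell_count))
toy_total
```

The total is an upper bound for the same reason given in the text: a dataset that matches the filter may also contain cells that do not.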
Publication and other external data
Many collections include publication information and other external data. This information is available in the return value of collections(), but the helper functions publisher_metadata(), authors(), and links() may facilitate access.
Suppose one is interested in the publication “A single-cell atlas of the healthy breast tissues reveals clinically relevant clusters of breast epithelial cells”. Discover it in the collections
title_of_interest <- paste(
    "A single-cell atlas of the healthy breast tissues reveals clinically",
    "relevant clusters of breast epithelial cells"
)
collection_of_interest <-
    collections(db) |>
    dplyr::filter(startsWith(name, title_of_interest))
collection_of_interest |>
    glimpse()
## Rows: 1
## Columns: 18
## $ collection_id <chr> "c9706a92-0e5f-46c1-96d8-20e42467f287"
## $ collection_version_id <chr> "bc627471-7137-4518-a593-2f679bac054e"
## $ collection_url <chr> "https://cellxgene.cziscience.com/collections/c9…
## $ consortia <list> ["CZI Single-Cell Biology"]
## $ contact_email <chr> "hnakshat@iupui.edu"
## $ contact_name <chr> "Harikrishna Nakshatri"
## $ curator_name <chr> "Jennifer Yu-Sheng Chien"
## $ description <chr> "Single-cell RNA sequencing (scRNA-seq) is an ev…
## $ doi <chr> "10.1016/j.xcrm.2021.100219"
## $ links <list> [["", "RAW_DATA", "https://data.humancellatlas.o…
## $ name <chr> "A single-cell atlas of the healthy breast tiss…
## $ publisher_metadata <list> [[["Bhat-Nakshatri", "Poornima"], ["Gao", "Hongy…
## $ revising_in <lgl> NA
## $ revision_of <lgl> NA
## $ visibility <chr> "PUBLIC"
## $ created_at <date> 2023-12-12
## $ published_at <date> 2021-03-25
## $ revised_at <date> 2023-12-13
Use the collection_id to extract publisher metadata (including a DOI if available) and author information, from publisher_metadata() and authors() respectively
collection_id_of_interest <- pull(collection_of_interest, "collection_id")
publisher_metadata(db) |>
    filter(collection_id == collection_id_of_interest) |>
    glimpse()
## Rows: 1
## Columns: 9
## $ collection_id <chr> "c9706a92-0e5f-46c1-96d8-20e42467f287"
## $ name <chr> "A single-cell atlas of the healthy breast tissues rev…
## $ is_preprint <lgl> FALSE
## $ journal <chr> "Cell Reports Medicine"
## $ published_at <date> 2021-03-01
## $ published_year <int> 2021
## $ published_month <int> 3
## $ published_day <int> 1
## $ doi <chr> NA
## # A tibble: 12 × 4
## collection_id family given consortium
## <chr> <chr> <chr> <chr>
## 1 c9706a92-0e5f-46c1-96d8-20e42467f287 Bhat-Nakshatri Poornima NA
## 2 c9706a92-0e5f-46c1-96d8-20e42467f287 Gao Hongyu NA
## 3 c9706a92-0e5f-46c1-96d8-20e42467f287 Sheng Liu NA
## 4 c9706a92-0e5f-46c1-96d8-20e42467f287 McGuire Patrick C. NA
## 5 c9706a92-0e5f-46c1-96d8-20e42467f287 Xuei Xiaoling NA
## 6 c9706a92-0e5f-46c1-96d8-20e42467f287 Wan Jun NA
## 7 c9706a92-0e5f-46c1-96d8-20e42467f287 Liu Yunlong NA
## 8 c9706a92-0e5f-46c1-96d8-20e42467f287 Althouse Sandra K. NA
## 9 c9706a92-0e5f-46c1-96d8-20e42467f287 Colter Austyn NA
## 10 c9706a92-0e5f-46c1-96d8-20e42467f287 Sandusky George NA
## 11 c9706a92-0e5f-46c1-96d8-20e42467f287 Storniolo Anna Maria NA
## 12 c9706a92-0e5f-46c1-96d8-20e42467f287 Nakshatri Harikrishna NA
Collections may have links to additional external data, in this case a DOI and two links to RAW_DATA.
external_links <- links(db)
external_links
## # A tibble: 716 × 4
## collection_id link_name link_type link_url
## <chr> <chr> <chr> <chr>
## 1 ceb895f4-ff9f-403a-b7c3-187a9657ac2c SCP1859 OTHER https://singl…
## 2 ceb895f4-ff9f-403a-b7c3-187a9657ac2c NA LAB_WEBSITE https://labs.…
## 3 ceb895f4-ff9f-403a-b7c3-187a9657ac2c NA OTHER http://genome…
## 4 ceb895f4-ff9f-403a-b7c3-187a9657ac2c GSE204684 RAW_DATA https://www.n…
## 5 ceb895f4-ff9f-403a-b7c3-187a9657ac2c analysis code OTHER https://zenod…
## 6 af893e86-8e9f-41f1-a474-ef05359b1fb7 NA OTHER https://retin…
## 7 af893e86-8e9f-41f1-a474-ef05359b1fb7 NA RAW_DATA https://data.…
## 8 af893e86-8e9f-41f1-a474-ef05359b1fb7 GSE226108 RAW_DATA https://www.n…
## 9 1d1c7275-476a-49e2-9022-ad1b1c793594 GSE148077 RAW_DATA https://www.n…
## 10 1d1c7275-476a-49e2-9022-ad1b1c793594 NA OTHER https://singl…
## # ℹ 706 more rows
external_links |>
    count(link_type)
## # A tibble: 5 × 2
## link_type n
## <chr> <int>
## 1 DATA_SOURCE 35
## 2 LAB_WEBSITE 38
## 3 OTHER 329
## 4 PROTOCOL 44
## 5 RAW_DATA 270
external_links |>
    filter(collection_id == collection_id_of_interest)
## # A tibble: 2 × 4
## collection_id link_name link_type link_url
## <chr> <chr> <chr> <chr>
## 1 c9706a92-0e5f-46c1-96d8-20e42467f287 NA RAW_DATA https://data.humance…
## 2 c9706a92-0e5f-46c1-96d8-20e42467f287 NA RAW_DATA https://www.ncbi.nlm…
Conversely, knowledge of a DOI, etc., can be used to discover details of the corresponding collection.
doi_of_interest <- "https://doi.org/10.1016/j.stem.2018.12.011"
links(db) |>
    filter(link_url == doi_of_interest) |>
    left_join(collections(db), by = "collection_id") |>
    glimpse()
## Rows: 1
## Columns: 21
## $ collection_id <chr> "b1a879f6-5638-48d3-8f64-f6592c1b1561"
## $ link_name <chr> "PSC-ATO protocol"
## $ link_type <chr> "PROTOCOL"
## $ link_url <chr> "https://doi.org/10.1016/j.stem.2018.12.011"
## $ collection_version_id <chr> "aa814356-20ba-4066-88be-fcbf89c84899"
## $ collection_url <chr> "https://cellxgene.cziscience.com/collections/b1…
## $ consortia <list> ["CZI Single-Cell Biology", "Wellcome HCA Strate…
## $ contact_email <chr> "st9@sanger.ac.uk"
## $ contact_name <chr> "Sarah Teichmann"
## $ curator_name <chr> "Batuhan Cakir"
## $ description <chr> "Single-cell genomics studies have decoded the i…
## $ doi <chr> "10.1126/science.abo0510"
## $ links <list> [["scVI Models", "DATA_SOURCE", "https://develop…
## $ name <chr> "Mapping the developing human immune system acro…
## $ publisher_metadata <list> [[["Suo", "Chenqu"], ["Dann", "Emma"], ["Goh", "…
## $ revising_in <lgl> NA
## $ revision_of <lgl> NA
## $ visibility <chr> "PUBLIC"
## $ created_at <date> 2023-12-11
## $ published_at <date> 2022-10-04
## $ revised_at <date> 2023-12-13
Visualizing data in cellxgene
Visualization is straightforward once dataset_id is available. For example, to visualize the first dataset in african_american_female, use
african_american_female |>
    ## use criteria to identify a single dataset (here just the
    ## 'first' dataset), then visualize
    slice(1) |>
    datasets_visualize()
Visualization is an interactive process, so datasets_visualize() will only open up to 5 browser tabs per call.
File download and use
Datasets usually contain H5AD (files produced by the python AnnData module), and Rds (serialized files produced by the R Seurat package). The Rds files may be unreadable if the version of Seurat used to create the file is different from the version used to read the file. We therefore focus on the H5AD files.
For illustration, we find all files associated with studies with African American females and download one of our selected files.
selected_files <-
    left_join(
        african_american_female |> select(dataset_id),
        files(db),
        by = "dataset_id"
    )
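Before downloading, it can help to check how large the candidate files are. The sketch below keeps only H5AD files and reports their sizes; it assumes that filesize is measured in bytes, which the magnitudes printed by files(db) suggest but which is not stated here. toy_files is a stand-in for selected_files that reuses the first four rows printed by files(db) earlier, with the url column omitted.

```r
library(dplyr)

## Toy stand-in for selected_files (first four rows printed by files(db))
toy_files <- tibble(
    dataset_id = c(
        "53ce2631-3646-4172-bbd9-38b0a44d8214",
        "53ce2631-3646-4172-bbd9-38b0a44d8214",
        "1d4128f6-c27b-40c4-af77-b1c7e2b694e7",
        "1d4128f6-c27b-40c4-af77-b1c7e2b694e7"
    ),
    filesize = c(406108808, 399752425, 906795740, 1060800682),
    filetype = c("H5AD", "RDS", "H5AD", "RDS")
)

h5ad_sizes <-
    toy_files |>
    filter(filetype == "H5AD") |>
    ## assumption: filesize is in bytes; 1e9 bytes per (decimal) gigabyte
    mutate(size_gb = filesize / 1e9) |>
    arrange(desc(size_gb))

## Number of H5AD files and their combined size
h5ad_sizes |>
    summarise(n_files = n(), total_gb = sum(size_gb))
```

The same pipeline applied to selected_files gives a rough idea of the download volume before calling files_download().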
And then choose a single dataset and its H5AD file for download
local_file <-
    selected_files |>
    filter(
        dataset_id == "de985818-285f-4f59-9dbd-d74968fddba3",
        filetype == "H5AD"
    ) |>
    files_download(dry.run = FALSE)
basename(local_file)
## [1] "64f14a2b-d754-4bc9-b496-b26f05ebfe4e.h5ad"
These are downloaded to a local cache (use the internal function cellxgenedp:::.cellxgenedb_cache_path() for the location of the cache), so the process is only time-consuming the first time.
H5AD files can be converted to R / Bioconductor objects using the zellkonverter package.
h5ad <- readH5AD(local_file, use_hdf5 = TRUE, reader = "R")
h5ad
## class: SingleCellExperiment
## dim: 33234 31696
## metadata(5): citation default_embedding schema_reference schema_version
## title
## assays(1): X
## rownames(33234): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
## ENSG00000268674
## rowData names(5): feature_is_filtered feature_name feature_reference
## feature_biotype feature_length
## colnames(31696): CMGpool_AAACCCAAGGACAACC CMGpool_AAACCCACAATCTCTT ...
## K109064_TTTGTTGGTTGCATCA K109064_TTTGTTGGTTGGACCC
## colData names(36): donor_id self_reported_ethnicity_ontology_term_id
## ... development_stage observation_joinid
## reducedDimNames(3): X_pca X_tsne X_umap
## mainExpName: NULL
## altExpNames(0):
The SingleCellExperiment object is a matrix-like object with rows corresponding to genes and columns to cells. Thus we can easily explore the cells present in the data.
## # A tibble: 7 × 3
## sex donor_id n
## <fct> <fct> <int>
## 1 female D1 2303
## 2 female D2 864
## 3 female D3 2517
## 4 female D4 1771
## 5 female D5 2244
## 6 female D11 7454
## 7 female pooled [D9,D7,D8,D10,D6] 14543
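A table like the one above can be produced by tabulating the cell-level annotations stored in colData(). The sketch below uses a small made-up SingleCellExperiment (toy_sce) rather than the downloaded h5ad object, so the sex and donor_id values are invented; it is one possible way to build such a summary, not necessarily the code used to produce the output shown.

```r
library(SingleCellExperiment) # load before dplyr, as earlier in the vignette
library(dplyr)

## Toy object: 2 genes x 4 cells, with two cell-level annotations
toy_sce <- SingleCellExperiment(
    assays = list(X = matrix(0, nrow = 2, ncol = 4)),
    colData = DataFrame(
        sex = factor(rep("female", 4)),
        donor_id = factor(c("D1", "D1", "D2", "D2"))
    )
)

## One row per cell in colData(); count cells by sex and donor
cell_summary <-
    colData(toy_sce) |>
    as.data.frame() |>
    count(sex, donor_id)

cell_summary
```

Replacing toy_sce with h5ad should give the per-donor cell counts for the downloaded dataset.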
Next steps
The Orchestrating Single-Cell Analysis with Bioconductor online resource provides an excellent introduction to analysis and visualization of single-cell data in R / Bioconductor. Extensive opportunities for working with AnnData objects in R but using the native python interface are briefly described in, e.g., the ?AnnData2SCE help page of zellkonverter.
The hca package provides programmatic access to the Human Cell Atlas data portal, allowing retrieval of primary as well as derived single-cell data files.
API changes
Data access provided by CELLxGENE has changed to a new ‘Discover’ API. The main functionality of the cellxgenedp package has not changed, but specific columns have been removed, replaced or added, as follows:
collections()
- Removed: access_type, data_submission_policy_version
- Replaced: updated_at replaced with revised_at
- Added: collection_version_id, collection_url, doi, revising_in, revision_of
datasets()
- Removed: is_valid, processing_status, published, revision, created_at
- Replaced: dataset_deployments replaced with explorer_url, name replaced with title, updated_at replaced with revised_at
- Added: dataset_version_id, batch_condition, x_approximate_distribution
files()
- Removed: file_id, filename, s3_uri, user_submitted, created_at, updated_at
- Added: filesize, url
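An older script that still refers to replaced column names can be bridged, at least temporarily, by renaming the new columns back to their previous names. The sketch below shows one way to do this with dplyr on a made-up table; new_style and legacy_names are hypothetical, and only the name/title and updated_at/revised_at pairs from the 'Replaced' entries above are used. Updating the script to the new names is likely the cleaner long-term fix.

```r
library(dplyr)

## Toy table using the new column names; values are made up
new_style <- tibble(
    dataset_id = c("d1", "d2"),
    title = c("toy dataset 1", "toy dataset 2"),
    revised_at = as.Date(c("2023-01-01", "2023-06-01"))
)

## legacy name = new name, following the 'Replaced' entries above
legacy_names <- c(name = "title", updated_at = "revised_at")

## any_of() ignores new names that are absent from the table
legacy_style <- new_style |> rename(any_of(legacy_names))
names(legacy_style)
```

Columns listed as removed have no replacement, so code that depends on them cannot be bridged this way.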
Session info
## R version 4.3.2 (2023-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] cellxgenedp_1.7.1.9000 dplyr_1.1.4
## [3] SingleCellExperiment_1.24.0 SummarizedExperiment_1.32.0
## [5] Biobase_2.62.0 GenomicRanges_1.54.1
## [7] GenomeInfoDb_1.38.5 IRanges_2.36.0
## [9] S4Vectors_0.40.2 BiocGenerics_0.48.1
## [11] MatrixGenerics_1.14.0 matrixStats_1.2.0
## [13] zellkonverter_1.12.1 BiocStyle_2.30.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0 filelock_1.0.3 bitops_1.0-7
## [4] fastmap_1.1.1 RCurl_1.98-1.14 promises_1.2.1
## [7] digest_0.6.34 mime_0.12 lifecycle_1.0.4
## [10] ellipsis_0.3.2 magrittr_2.0.3 compiler_4.3.2
## [13] rlang_1.1.3 sass_0.4.8 tools_4.3.2
## [16] utf8_1.2.4 yaml_2.3.8 knitr_1.45
## [19] S4Arrays_1.2.0 htmlwidgets_1.6.4 curl_5.2.0
## [22] reticulate_1.34.0 DelayedArray_0.28.0 abind_1.4-5
## [25] HDF5Array_1.30.0 withr_2.5.2 purrr_1.0.2
## [28] desc_1.4.3 grid_4.3.2 fansi_1.0.6
## [31] xtable_1.8-4 Rhdf5lib_1.24.1 cli_3.6.2
## [34] rmarkdown_2.25 crayon_1.5.2 ragg_1.2.7
## [37] generics_0.1.3 httr_1.4.7 rhdf5_2.46.1
## [40] cachem_1.0.8 stringr_1.5.1 zlibbioc_1.48.0
## [43] parallel_4.3.2 BiocManager_1.30.22 XVector_0.42.0
## [46] basilisk_1.14.1 vctrs_0.6.5 Matrix_1.6-1.1
## [49] jsonlite_1.8.8 dir.expiry_1.10.0 bookdown_0.37
## [52] systemfonts_1.0.5 jquerylib_0.1.4 glue_1.7.0
## [55] pkgdown_2.0.7 DT_0.31 stringi_1.8.3
## [58] later_1.3.2 tibble_3.2.1 pillar_1.9.0
## [61] rhdf5filters_1.14.1 basilisk.utils_1.14.1 htmltools_0.5.7
## [64] GenomeInfoDbData_1.2.11 R6_2.5.1 textshaping_0.3.7
## [67] evaluate_0.23 shiny_1.8.0 lattice_0.21-9
## [70] png_0.1-8 memoise_2.0.1 httpuv_1.6.13
## [73] bslib_0.6.1 rjsoncons_1.1.0 Rcpp_1.0.12
## [76] SparseArray_1.2.3 xfun_0.41 fs_1.6.3
## [79] pkgconfig_2.0.3