vignettes/c_seurat_bioconductor.Rmd
c_seurat_bioconductor.Rmd
Register or log-in
choose the ‘ARTNet 2023’
Wait a minute or so
Click to open RStudio
In RStudio, choose ‘File’ / ‘Open File…’ / ‘vignettes/a_r.Rmd’
There are rich, well-documented workflows for single-cell analysis in R. We’ll point to these resources and outline overall workflows, but will only perform illustrative calculations. Some steps require both computing time and resources that aren’t suitable for workshop presentation so we focus on a ‘classic’ data set, the 10x Genomics Peripheral Blood Mononuclear Cells (PBMC) dataset conssists of 2,700 cells. Analysis of data sets with say 50,000 cells might require 32- or 64 GB of memory and take up to an hour for an end-to-end analysis; one way to access a computer with the much memory might be through a commercial (Google, Azure, AWS) cloud, or on a research network like the AnVIL computational cloud.
Based on Getting Started with Seurat and the Seurat - Guided Clustering Tutorial. The guided tutorial covers
To illustrating working with Seurat
, we take the first
few steps through this workflow.
Here we re-capitulate some of the initial steps in the Seurat tutorial.
Data retrieval. Here we retrieve the raw data to a local cache. It is
distributed as a compressed ‘tar’ archive, so we untar()
it
to a temporary location.
## download to local cache
url <- paste0(
"https://cf.10xgenomics.com/samples/cell/pbmc3k/",
"pbmc3k_filtered_gene_bc_matrices.tar.gz"
)
pbmc3k_tar_gz <- BiocFileCache::bfcrpath(rname = url)
## 'untar' to a temporary location
pbmc3k_directory <- tempfile()
untar(pbmc3k_tar_gz, exdir = pbmc3k_directory)
## location of the relevant data
pbmc3k_data_dir <-
file.path(pbmc3k_directory, "filtered_gene_bc_matrices", "hg19")
Following the Seurat tutorial, we read the data into a sparse matrix, and then into a Seurat object.
library(Seurat)
#> The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
#> which was just loaded, will retire in October 2023.
#> Please refer to R-spatial evolution reports for details, especially
#> https://r-spatial.org/r/2023/05/15/evolution4.html.
#> It may be desirable to make the sf package available;
#> package maintainers should consider adding sf to Suggests:.
#> The sp package is now running under evolution status 2
#> (status 2 uses the sf package in place of rgdal)
#> Attaching SeuratObject
## read the 10x data as a sparse matrix
pbmc.data <- Read10X(data.dir = pbmc3k_data_dir)
class(pbmc.data)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"
dim(pbmc.data) # 32738 features (genes) x 2700 samples (cells)
#> [1] 32738 2700
## create a Seurat object from the count matrix
pbmc <- CreateSeuratObject(
counts = pbmc.data,
project = "pbmc3k",
## filter some features (present in less than 3 cells) and cells
## (fewer than 200 features)
min.cells = 3, min.features = 200
)
#> Warning: Feature names cannot have underscores ('_'), replacing with dashes
#> ('-')
pbmc # 13714 features x 2700 samples
#> An object of class Seurat
#> 13714 features across 2700 samples within 1 assay
#> Active assay: RNA (13714 features, 0 variable features)
pbmc[["RNA"]]
#> Assay data with 13714 features for 2700 cells
#> First 10 features:
#> AL627309.1, AP006222.2, RP11-206L10.2, RP11-206L10.9, LINC00115, NOC2L,
#> KLHL17, PLEKHN1, RP11-54O7.17, HES4
The first step in the workflow is to perform quality control. Start by identifying the mitochondrial genes. In this particular example, mitchondrial genes are annotated based on the feature name.
## The [[ operator can add columns to object metadata. This is a great
## place to stash QC stats
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
Visualize some metrics…
…and use ad hoc criteria to filter features;
subset()
is a base R function.
pbmc <- subset(
pbmc,
subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5
)
pbmc
#> An object of class Seurat
#> 13714 features across 2638 samples within 1 assay
#> Active assay: RNA (13714 features, 0 variable features)
The workflow continues from here…
The following is based on Orchestrating Single-Cell Analysis with Bioconductor. This resource is separated into introductory, basic, advanced, and multi-sample sections, with a collection of workflows illustrating use. A good place to start is with a workflow to get a feel for data analysis, and then to refer back to earlier sections for more detailed operations or understanding.
We’ll follow the Unfiltered human PBMCs workflow for another PBMC data, but use the same dataset as in the Seurat section earlier in this article.
Data retrieval is a outlined above. After untar()
ing the
data pbmc3k_data_dir
is the location to use for data
input.
pbmc3k <- read10xCounts(pbmc3k_data_dir, col.names = TRUE)
pbmc3k
#> class: SingleCellExperiment
#> dim: 32738 2700
#> metadata(1): Samples
#> assays(1): counts
#> rownames(32738): ENSG00000243485 ENSG00000237613 ... ENSG00000215616
#> ENSG00000215611
#> rowData names(2): ID Symbol
#> colnames(2700): AAACATACAACCAC-1 AAACATTGAGCTAC-1 ... TTTGCATGAGAGGC-1
#> TTTGCATGCCTCAC-1
#> colData names(2): Sample Barcode
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
Skipping to the quality control step, let’s identify mitochondrial
genes. In this data set, we can identify the mitochondrial genes by
looking at the Symbol
column of rowData()
rowData(pbmc3k) |>
dplyr::as_tibble() |>
dplyr::filter(startsWith(Symbol, "MT-"))
#> # A tibble: 13 × 2
#> ID Symbol
#> <chr> <chr>
#> 1 ENSG00000198888 MT-ND1
#> 2 ENSG00000198763 MT-ND2
#> 3 ENSG00000198804 MT-CO1
#> 4 ENSG00000198712 MT-CO2
#> 5 ENSG00000228253 MT-ATP8
#> 6 ENSG00000198899 MT-ATP6
#> 7 ENSG00000198938 MT-CO3
#> 8 ENSG00000198840 MT-ND3
#> 9 ENSG00000212907 MT-ND4L
#> 10 ENSG00000198886 MT-ND4
#> 11 ENSG00000198786 MT-ND5
#> 12 ENSG00000198695 MT-ND6
#> 13 ENSG00000198727 MT-CYB
is_mitochondrial <-
rowData(pbmc3k)$Symbol |>
startsWith("MT-") |>
which()
Calculate per-cell quality control metrics…
stats <- perCellQCMetrics(pbmc3k, subsets=list(Mito=is_mitochondrial))
stats |> dplyr::as_tibble()
#> # A tibble: 2,700 × 6
#> sum detected subsets_Mito_sum subsets_Mito_detected subsets_Mito_percent
#> <dbl> <int> <dbl> <int> <dbl>
#> 1 2421 781 73 10 3.02
#> 2 4903 1352 186 10 3.79
#> 3 3149 1131 28 8 0.889
#> 4 2639 960 46 10 1.74
#> 5 981 522 12 5 1.22
#> 6 2164 782 36 7 1.66
#> 7 2176 783 83 10 3.81
#> 8 2260 790 70 9 3.10
#> 9 1276 533 15 6 1.18
#> 10 1103 550 32 7 2.90
#> # ℹ 2,690 more rows
#> # ℹ 1 more variable: total <dbl>
high_mito <- isOutlier(stats$subsets_Mito_percent, type="higher")
table(high_mito)
#> high_mito
#> FALSE TRUE
#> 2607 93
…and use high_mito
to create a subset of the results
pbmc3k[, !high_mito]
#> class: SingleCellExperiment
#> dim: 32738 2607
#> metadata(1): Samples
#> assays(1): counts
#> rownames(32738): ENSG00000243485 ENSG00000237613 ... ENSG00000215616
#> ENSG00000215611
#> rowData names(2): ID Symbol
#> colnames(2607): AAACATACAACCAC-1 AAACATTGAGCTAC-1 ... TTTGCATGAGAGGC-1
#> TTTGCATGCCTCAC-1
#> colData names(2): Sample Barcode
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
The workflow continues…!
It is worthwhile to compare the Seurat and SingleCellExperiment work flows.
Actually, a first question is ‘why not just have a single function that takes the analysis from begining to end?’ Answers might be
It is educational to understand the data transformations and their assummptions required in an analysis.
No two data sets are the same, so at any step some manipulation (e.g., to identify mitochondrial genes), choice of algorithm, or interpretation of results might be required.
What about the difference between the Seurat and Bioconductor workflows? In broad detail the analyses accomplish the same ends. Seurat is more standardized and widely used, while Bioconductor may be less standardized and more flexible.
This document was produced with the following R software:
sessionInfo()
#> R version 4.3.0 (2023-04-21)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] scuttle_1.11.0 DropletUtils_1.21.0
#> [3] SingleCellExperiment_1.23.0 SummarizedExperiment_1.31.1
#> [5] Biobase_2.61.0 GenomicRanges_1.53.1
#> [7] GenomeInfoDb_1.37.2 IRanges_2.35.2
#> [9] S4Vectors_0.39.1 BiocGenerics_0.47.0
#> [11] MatrixGenerics_1.13.0 matrixStats_1.0.0
#> [13] SeuratObject_4.1.3 Seurat_4.3.0.1
#>
#> loaded via a namespace (and not attached):
#> [1] RcppAnnoy_0.0.20 splines_4.3.0
#> [3] later_1.3.1 bitops_1.0-7
#> [5] filelock_1.0.2 tibble_3.2.1
#> [7] R.oo_1.25.0 polyclip_1.10-4
#> [9] lifecycle_1.0.3 edgeR_3.43.7
#> [11] rprojroot_2.0.3 globals_0.16.2
#> [13] lattice_0.21-8 MASS_7.3-60
#> [15] magrittr_2.0.3 limma_3.57.6
#> [17] plotly_4.10.2 sass_0.4.6
#> [19] rmarkdown_2.22 jquerylib_0.1.4
#> [21] yaml_2.3.7 httpuv_1.6.11
#> [23] sctransform_0.3.5 sp_2.0-0
#> [25] spatstat.sparse_3.0-2 reticulate_1.30
#> [27] cowplot_1.1.1 pbapply_1.7-2
#> [29] DBI_1.1.3 RColorBrewer_1.1-3
#> [31] abind_1.4-5 zlibbioc_1.47.0
#> [33] Rtsne_0.16 purrr_1.0.1
#> [35] R.utils_2.12.2 RCurl_1.98-1.12
#> [37] GenomeInfoDbData_1.2.10 ggrepel_0.9.3
#> [39] irlba_2.3.5.1 listenv_0.9.0
#> [41] spatstat.utils_3.0-3 goftest_1.2-3
#> [43] dqrng_0.3.0 spatstat.random_3.1-5
#> [45] fitdistrplus_1.1-11 parallelly_1.36.0
#> [47] pkgdown_2.0.7 DelayedMatrixStats_1.23.0
#> [49] leiden_0.4.3 codetools_0.2-19
#> [51] DelayedArray_0.27.5 tidyselect_1.2.0
#> [53] farver_2.1.1 BiocFileCache_2.9.0
#> [55] spatstat.explore_3.2-1 jsonlite_1.8.5
#> [57] ellipsis_0.3.2 progressr_0.13.0
#> [59] ggridges_0.5.4 survival_3.5-5
#> [61] systemfonts_1.0.4 tools_4.3.0
#> [63] ragg_1.2.5 ica_1.0-3
#> [65] Rcpp_1.0.10 glue_1.6.2
#> [67] gridExtra_2.3 SparseArray_1.1.10
#> [69] xfun_0.39 HDF5Array_1.29.3
#> [71] dplyr_1.1.2 withr_2.5.0
#> [73] fastmap_1.1.1 rhdf5filters_1.13.3
#> [75] fansi_1.0.4 digest_0.6.32
#> [77] R6_2.5.1 mime_0.12
#> [79] textshaping_0.3.6 colorspace_2.1-0
#> [81] scattermore_1.2 tensor_1.5
#> [83] spatstat.data_3.0-1 RSQLite_2.3.1
#> [85] R.methodsS3_1.8.2 utf8_1.2.3
#> [87] tidyr_1.3.0 generics_0.1.3
#> [89] data.table_1.14.8 httr_1.4.6
#> [91] htmlwidgets_1.6.2 S4Arrays_1.1.4
#> [93] uwot_0.1.15 pkgconfig_2.0.3
#> [95] gtable_0.3.3 blob_1.2.4
#> [97] lmtest_0.9-40 XVector_0.41.1
#> [99] htmltools_0.5.5 scales_1.2.1
#> [101] png_0.1-8 knitr_1.43
#> [103] reshape2_1.4.4 nlme_3.1-162
#> [105] curl_5.0.1 rhdf5_2.45.0
#> [107] cachem_1.0.8 zoo_1.8-12
#> [109] stringr_1.5.0 KernSmooth_2.23-21
#> [111] parallel_4.3.0 miniUI_0.1.1.1
#> [113] desc_1.4.2 pillar_1.9.0
#> [115] grid_4.3.0 vctrs_0.6.3
#> [117] RANN_2.6.1 promises_1.2.0.1
#> [119] beachmat_2.17.8 dbplyr_2.3.2
#> [121] xtable_1.8-4 cluster_2.1.4
#> [123] evaluate_0.21 locfit_1.5-9.8
#> [125] cli_3.6.1 compiler_4.3.0
#> [127] rlang_1.1.1 crayon_1.5.2
#> [129] future.apply_1.11.0 labeling_0.4.2
#> [131] plyr_1.8.8 fs_1.6.2
#> [133] stringi_1.7.12 viridisLite_0.4.2
#> [135] deldir_1.0-9 BiocParallel_1.35.2
#> [137] munsell_0.5.0 lazyeval_0.2.2
#> [139] spatstat.geom_3.2-1 Matrix_1.5-4.1
#> [141] patchwork_1.1.2 sparseMatrixStats_1.13.0
#> [143] bit64_4.0.5 future_1.32.0
#> [145] Rhdf5lib_1.23.0 ggplot2_3.4.2
#> [147] shiny_1.7.4 highr_0.10
#> [149] ROCR_1.0-11 igraph_1.5.0
#> [151] memoise_2.0.1 bslib_0.5.0
#> [153] bit_4.0.5