Week 5 Bioinformatics with Bioconductor

5.1 Day 29 (Monday) Zoom check-in

5.1.1 Review and trouble shoot

5.1.2 Bioconductor, packages, and biological data

Repository of more than 1900 packages for the ‘analysis and comprehension of high throughput genomic data’
Bulk and single cell RNASeq; ChIP and other gene regulation, called variants, flow cytometery, proteomics, …
Package discovery using biocViews and workflows
Landing pages & vignettes

Packages

One-time installation of BiocManager

if (!requireNamespace("BiocManager"))
    install.packages("BiocManager", repos = "https://cran.r-project.org")

Install Bioconductor or CRAN packages

## Biostrings and genbankr are Bioconductor packages, rentrez is from CRAN
BiocManager::install(c("Biostrings", "rentrez"))

Validate installation

BiocManager::version() # current release is 3.11; needs R 4.0
## [1] '3.10'
BiocManager::valid()
## Warning: 10 packages out-of-date; 0 packages too new
## 
## * sessionInfo()
## 
## R version 3.6.3 Patched (2020-02-29 r77906)
## Platform: x86_64-apple-darwin17.7.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS:   /Users/ma38727/bin/R-3-6-branch/lib/libRblas.dylib
## LAPACK: /Users/ma38727/bin/R-3-6-branch/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] BiocManager_1.30.10 compiler_3.6.3      magrittr_1.5       
##  [4] bookdown_0.18       htmltools_0.4.0     tools_3.6.3        
##  [7] yaml_2.2.1          Rcpp_1.0.4.6        stringi_1.4.6      
## [10] rmarkdown_2.1       knitr_1.28          stringr_1.4.0      
## [13] digest_0.6.25       xfun_0.13           rlang_0.4.6        
## [16] evaluate_0.14      
## 
## Bioconductor version '3.10'
## 
##   * 10 packages out-of-date
##   * 0 packages too new
## 
## create a valid installation with
## 
##   BiocManager::install(c(
##     "bookdown", "ellipsis", "modelr", "natserv", "RcppArmadillo", "sp",
##     "taxize", "tidyr", "tinytex", "xfun"
##   ), update = TRUE, ask = FALSE)
## 
## more details: BiocManager::valid()$too_new, BiocManager::valid()$out_of_date

Types of packages
- Software (e.g., Biostrings, SingleCellExperiment)
- Annotation (e.g., org.Hs.eg.db, TxDb.Hsapiens.UCSC.hg38.knownGene, GO.db)
- Experiment data (e.g., airway, scRNAseq)

Key packages

Biostrings for representing sequence data
GenomicRanges for working with genomic coordinates
SummarizedExperiment and SingleCellExperiment for representing experimental summaries, e.g., a gene x sample matrix of RNA-seq expression values.

‘Class’ and ‘method’

Classes represent data in ways that allow the specialized nature of the data to be exploited, e.g., it makes sense to calculate the reverseComplement() of a DNA sequence, but not of an arbitrary character vector.

library(Biostrings)
sequences <- c(
    chr1 = "CTGACT",
    chr2 = "CTGGGAACT"
)
dna <- DNAStringSet(sequences)
dna
##   A DNAStringSet instance of length 2
##     width seq                                               names               
## [1]     6 CTGACT                                            chr1
## [2]     9 CTGGGAACT                                         chr2

## fails! `T` not in the RNA alphabet
result <- try( RNAStringSet(sequences) )
## Error in .Call2("new_XStringSet_from_CHARACTER", class(x0), elementType(x0),  : 
##   key 84 (char 'T') not in lookup table

Methods represent the implementation of common (‘generic’) operations, specialized to the specific classes that the method operates on.

length(dna)
## [1] 2
reverseComplement(dna)
##   A DNAStringSet instance of length 2
##     width seq                                               names               
## [1]     6 AGTCAG                                            chr1
## [2]     9 AGTTCCCAG                                         chr2
translate(dna)
##   A AAStringSet instance of length 2
##     width seq                                               names               
## [1]     2 MT                                                chr1
## [2]     3 MGT                                               chr2

E.g., one could translate() both a DNA sequence, and an RNA sequence, so there is both translate,DNAStringSet-method and translate,RNAStringSet-method

Not all operations are implemented as methods on a class
- E.g., length() is a concept shared by collections of DNA and RNA sequences, so the method is defined on a common ‘parent’ class
- E.g., dim() of a two-dimensional object is sufficient to implement nrow() (as dim()[1]) and ncol() (as dim()[2]) so one could use a ‘plain old’ function nrow <- function(x) dim(x)[1] for nrow() / ncol(), relying on a dim() generic and methods to provide specialized behavior.

Getting help

Vignettes, e.g., DESeq2 vignette! browseVignettes(package = "DESeq2")
Help pages for classes, e.g., ?DNAStringSet, ?GRanges

Discovering methods

methods("reverse")
methods(class = "DNAStringSet")

Looking at symbols made available by packages (after the package is attached to the search() path): ls("package:GenomicRanges")

5.1.3 This week: SARS-CoV-2 sequence and human tissue-specific gene expression data

Biological background

RNA virus (source, license: CC BY-SA 4.0)
Key domains (source, license: CC BY 2)

Setup

Some essential Bioconductor data structures

library(Biostrings)
library(GenomicRanges)

Additional software

library(rentrez)
library(genbankr)
library(DECIPHER)
library(ggtree)

Need to install software?

Install BiocManager

if (!requireNamespace("BiocManager"))
    install.packages("BiocManager", repos = "https://cran.r-project.org")

Install other CRAN or Bioconductor packges

pkgs <- c(
    ## packages needing installation, e.g.,
    "Biostrings", "GenomicRanges", "genbankr", "rentrez", "DECIPHER",
    "ggtree"
)
BiocManager::install(pkgs)
BiocManager::valid()

Biostrings and GenomicRanges

Biostrings: representing DNA sequences

sequences <- c(
    chr1 = "CTGACT",
    chr2 = "CTGGGAACT"
)
dna <- DNAStringSet(sequences)
dna
##   A DNAStringSet instance of length 2
##     width seq                                               names               
## [1]     6 CTGACT                                            chr1
## [2]     9 CTGGGAACT                                         chr2

GenomicRanges: annotations in genome space, e.g., create a variable representing ‘regions of interest’

regions_of_interest <- c(
    "chr1:1-3",
    "chr1:3-6",
    "chr2:4-6"
)
roi <- GRanges(regions_of_interest)
roi
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1       1-3      *
##   [2]     chr1       3-6      *
##   [3]     chr2       4-6      *
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

Many fun operations, e.g., extract sequences of regions of interest

getSeq(dna, roi)
##   A DNAStringSet instance of length 3
##     width seq
## [1]     3 CTG
## [2]     4 GACT
## [3]     3 GGA

Sequences and annotations

Input FASTA files from New York State samples (it makes little biological sense to use just New York samples, but we’ll do it anyway!) to DNAStringSet
Work the Genbank accession of the reference sequence
- DNA sequence
- Coding sequence annotations
- Coding sequences

Alignment and visualization

Multiple alignment using DECIPHER
Visualization using ggtree

Gene expression

Retrieving example data from ExperimentHub
Annotating cell types

5.2 Day 30 DNA sequences and annotations

5.2.1 Setup

Set up a directory for working

workdir <- "workdir"
if (!dir.exists(workdir))
    dir.create(workdir)

Make sure the following packages are installed

library(Biostrings)
library(GenomicRanges)
library(rentrez)
library(genbankr)

Install necessary packages, if needed

if (!requireNamespace("BiocManager"))
    install.packages("BiocManager", repos = "https://cran.r-project.org")

## for example...
pkgs <- c("Biostrings", "GenomicRanges", "rentrez", "genbankr")
BiocManager::install(pkgs)

5.2.2 Representing DNA sequences

Sample information

Visit the NCBI SARS-CoV-2 site for orientation. We’ll retrieve and analyse some of the samples sequenced in New York state

Read a csv file summarizing records available from the US NCBI.

url <- "https://raw.githubusercontent.com/mtmorgan/QuaRantine/master/assets/05-genbank-accessions.csv"
csv_file <- file.path(workdir, basename(url))
if ( !file.exists(csv_file) )
    download.file(url, csv_file)

records <- readr::read_csv(csv_file)
## Parsed with column specification:
## cols(
##   accession = col_character(),
##   link = col_character(),
##   date = col_date(format = ""),
##   country = col_character(),
##   region = col_character()
## )
records
## # A tibble: 2,434 x 5
##    accession link                                    date       country   region
##    <chr>     <chr>                                   <date>     <chr>     <chr> 
##  1 NC_045512 https://www.ncbi.nlm.nih.gov/nuccore/N… NA         China     <NA>  
##  2 MT447189  https://www.ncbi.nlm.nih.gov/nuccore/M… 2020-03-02 Uzbekist… <NA>  
##  3 MT447188  https://www.ncbi.nlm.nih.gov/nuccore/M… 2020-03-02 Uzbekist… <NA>  
##  4 MT447177  https://www.ncbi.nlm.nih.gov/nuccore/M… 2020-03-26 Iran      <NA>  
##  5 MT447176  https://www.ncbi.nlm.nih.gov/nuccore/M… 2020-03-20 Thailand  <NA>  
##  6 MT447175  https://www.ncbi.nlm.nih.gov/nuccore/M… 2020-03-19 Thailand  <NA>  
##  7 MT447174  https://www.ncbi.nlm.nih.gov/nuccore/M… 2020-03-01 Thailand  <NA>  
##  8 MT447173  https://www.ncbi.nlm.nih.gov/nuccore/M… 2020-03-18 Thailand  <NA>  
##  9 MT447172  https://www.ncbi.nlm.nih.gov/nuccore/M… 2020-03-18 Thailand  <NA>  
## 10 MT447171  https://www.ncbi.nlm.nih.gov/nuccore/M… 2020-03-16 Thailand  <NA>  
## # … with 2,424 more rows

FASTA sequence files

Download a FASTA file containing DNA sequences of all samples sequenced in New York State (this doesn’t make much sense biologically!)

url <- "https://raw.githubusercontent.com/mtmorgan/QuaRantine/master/assets/05-SARS-CoV2-NY.fasta"
fasta_file <- file.path(workdir, basename(url))
if ( !file.exists(fasta_file) )
    download.file(url, fasta_file)

This is a plain text file with a simple format. Each DNA sequence is on a line starting with an identifier, >MT..., and then lines representing the DNA (or RNA or amino acid) sequence.

readLines(fasta_file, 5)
## [1] ">MT434817"                                                                       
## [2] "ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAA"
## [3] "AATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGG"
## [4] "ACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTT"
## [5] "TGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGC"

Load the Biostrings package

library(Biostrings)
options(Biostrings.coloring = FALSE)

Read the DNA sequences and explore basic properties

dna <- readDNAStringSet(fasta_file)
dna
##   A DNAStringSet instance of length 245
##       width seq                                             names               
##   [1] 29873 ATTAAAGGTTTATACCTTCCCA...GGAGAATGACAAAAAAAAAAAA MT434817
##   [2] 29879 ATTAAAGGTTTATACCTTCCCA...GGAGAATGACAAAAAAAAAAAA MT434816
##   [3] 29873 ATTAAAGGTTTATACCTTCCCA...GGAGAATGACAAAAAAAAAAAA MT434815
##   [4] 29882 ATTAAAGGTTTATACCTTCCCA...GGAGAATGACAAAAAAAAAAAA MT434814
##   [5] 29882 ATTAAAGGTTTATACCTTCCCA...GGAGAATGACAAAAAAAAAAAA MT434813
##   ...   ... ...
## [241] 29724 GATCTGTTCTCTAAACGAACTT...TGTACAGTGAACAATGCTAGGG MT370833
## [242] 29717 TCTAAACGAACTTTAAAATCTG...TACAGTGAACAATGCTAGGGAG MT370832
## [243] 29716 TCTAAACGAACTTTAAAATCTG...GTACAGTGAACAATGCTAGGGA MT370831
## [244] 29882 ATTAAAGGTTTATACCTTCCCA...GGAGAATGACAAAAAAAAAAAA MT325627
## [245] 29882 ATTAAAGGTTTATACCTTCCCA...GGAGAATGACAAAAAAAAAAAA MT304486

Perform simple manipulations

## how many records?
length(dna)
## [1] 245

## subset
dna[1:3]
##   A DNAStringSet instance of length 3
##     width seq                                               names               
## [1] 29873 ATTAAAGGTTTATACCTTCCCAG...AGGAGAATGACAAAAAAAAAAAA MT434817
## [2] 29879 ATTAAAGGTTTATACCTTCCCAG...AGGAGAATGACAAAAAAAAAAAA MT434816
## [3] 29873 ATTAAAGGTTTATACCTTCCCAG...AGGAGAATGACAAAAAAAAAAAA MT434815
dna[c('MT434817', 'MT434815')]
##   A DNAStringSet instance of length 2
##     width seq                                               names               
## [1] 29873 ATTAAAGGTTTATACCTTCCCAG...AGGAGAATGACAAAAAAAAAAAA MT434817
## [2] 29873 ATTAAAGGTTTATACCTTCCCAG...AGGAGAATGACAAAAAAAAAAAA MT434815

## extract
dna[[1]]
##   29873-letter "DNAString" instance
## seq: ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAAC...TTAATAGCTTCTTAGGAGAATGACAAAAAAAAAAAA

## number of characters in each record
head(nchar(dna))
## [1] 29873 29879 29873 29882 29882 29882
hist(nchar(dna)) # histogram of sequence lengths

5.2.3 Working with `GenBankRecord` objects

Data is often represented in complicated formats that do not fit into the tidy ‘table’ format of a CSV file. An example is the information displayed on the SARS-CoV-2 reference genome web page.

Often ‘someone’ has written software to manipulate this data in a more convient way.

Attach the rentrez package, which provides a way to query the NCBI’s ‘Entrez’ collection of data bases. Use BiocManager::install('rentrez') if you need to install the package

library(rentrez)

## who's responsible?
maintainer("rentrez")
## [1] "David Winter <david.winter@gmail.com>"

## how to cite?
citation("rentrez")
## 
## To cite rentrez in publications use:
## 
##   Winter, D. J. (2017) rentrez: an R package for the NCBI eUtils API
##   The R Journal 9(2):520-526
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {{rentrez}: an R package for the NCBI eUtils API},
##     author = {David J. Winter},
##     journal = {The R Journal},
##     year = {2017},
##     volume = {9},
##     issue = {2},
##     pages = {520--526},
##   }

The accession number for the SARS-CoV-2 reference sequence is NC_045512. Use entrez_search() to search the NCBI ‘nuccore’ database to discover information about this accession.
```
accession <- "NC_045512"
nuccore_record <- entrez_search("nuccore", accession)
id <- nuccore_record$ids
id
## [1] "1798174254"
```

The id is an identifier that can be used to actually retrieve the genbank record

gb_record <-
    entrez_fetch("nuccore", id, rettype = "gbwithparts", retmode = "text")

The record contains all the text in the record; if you like visualize it with
```
cat(gb_record)
```

Represent the data as GenBankR object

Rather than try to ‘munge’ (e.g., copy and paste) relevant information from the GenBank record into R, use the genbankr package. Again, install with BiocManager::install("genbankr") if necessary

library(genbankr)

maintainer("genbankr")
## [1] "Gabriel Becker <becker.gabriel@gene.com>"

citation("genbankr")
## 
## To cite package 'genbankr' in publications use:
## 
##   Gabriel Becker and Michael Lawrence (2019). genbankr: Parsing GenBank
##   files into semantically useful objects. R package version 1.14.0.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {genbankr: Parsing GenBank files into semantically useful objects},
##     author = {Gabriel Becker and Michael Lawrence},
##     year = {2019},
##     note = {R package version 1.14.0},
##   }

Use the function readGenBank() to parse the complicated text of gb_record

gb <- readGenBank(text = gb_record)
gb
## GenBank Annotations
## Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome. 
## Accession: NC_045512
## 1 Sequence(s) with total length length: 29903
## 11 genes
## 12 transcripts
## 13 exons/cds elements
## 0 variations
## 34 other features

Extract the reference sequence from the object

ref <- getSeq(gb)
ref
##   A DNAStringSet instance of length 1
##     width seq                                               names               
## [1] 29903 ATTAAAGGTTTATACCTTCCCAG...AAAAAAAAAAAAAAAAAAAAAAA Severe acute resp...

Extract the coding sequences from the reference sequence; the cds() command is explained a little later.

cds <- getSeq(ref, cds(gb))
cds
##   A DNAStringSet instance of length 13
##      width seq
##  [1] 13203 ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAA...CTTCAGTCAGCTGATGCACAATCGTTTTTAAAC
##  [2]  8088 CGGGTTTGCGGTGTAAGTGCAGCCCGTCTTACA...GTTATTTCTAGTGATGTTCTTGTTAACAACTAA
##  [3] 13218 ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAA...GCACAATCGTTTTTAAACGGGTTTGCGGTGTAA
##  [4]  3822 ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTC...GTGCTCAAAGGAGTCAAATTACATTACACATAA
##  [5]   828 ATGGATTTGTTTATGAGAATCTTCACAATTGGA...GAACCGACGACGACTACTAGCGTGCCTTTGTAA
##  ...   ... ...
##  [9]   366 ATGAAAATTATTCTTTTCTTGGCACTGATAACA...CTTTGCTTCACACTCAAAAGAAAGACAGAATGA
## [10]   132 ATGATTGAACTTTCATTAATTGACTTCTATTTG...CTGCAAGATCATAATGAAACTTGTCACGCCTAA
## [11]   366 ATGAAATTTCTTGTTTTCTTAGGAATCATCACA...CATGACGTTCGTGTTGTTTTAGATTTCATCTAA
## [12]  1260 ATGTCTGATAATGGACCCCAAAATCAGCGAAAT...TCCATGAGCAGTGCTGACTCAACTCAGGCCTAA
## [13]   117 ATGGGCTATATAAACGTTTTCGCTTTTCCGTTT...CAAGTAGATGTAGTTAACTTTAATCTCACATAG

There are 13 coding sequences in the genome. Translate these to their amino acid sequences

translate(cds)
##   A AAStringSet instance of length 13
##      width seq
##  [1]  4401 MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGD...VCTVCGMWKGYGCSCDQLREPMLQSADAQSFLN
##  [2]  2696 RVCGVSAARLTPCGTGTSTDVVYRAFDIYNDKV...INDMILSLLSKGRLIIRENNRVVISSDVLVNN*
##  [3]  4406 MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGD...GMWKGYGCSCDQLREPMLQSADAQSFLNGFAV*
##  [4]  1274 MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFT...SCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT*
##  [5]   276 MDLFMRIFTIGTVTLKQGEIKDATPSDFVRATA...VQIHTIDGSSGVVNPVMEPIYDEPTTTTSVPL*
##  ...   ... ...
##  [9]   122 MKIILFLALITLATCELYHYQECVRGTTVLLKE...QEEVQELYSPIFLIVAAIVFITLCFTLKRKTE*
## [10]    44 MIELSLIDFYLCFLAFLLFLVLIMLIIFWFSLELQDHNETCHA*
## [11]   122 MKFLVFLGIITTVAAFHQECSLQSCTQHQPYVV...CQEPKLGSLVVRCSFYEDFLEYHDVRVVLDFI*
## [12]   420 MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERS...KQQTVTLLPAADLDDFSKQLQQSMSSADSTQA*
## [13]    39 MGYINVFAFPFTIYSLLLCRMNSRNYIAQVDVVNFNLT*

5.2.4 Genomic ranges and DNA sequences

Genomic ranges

Attach the GenomicRanges package. If necessary, install with BiocManager::install("GenomicRanges")
```
library(GenomicRanges)
```
Extract the coordinates of the coding sequences contained in the GenBank accession
```
cds <- cds(gb)
```

Look at the genomic ranges defined in cds

granges(cds)
## GRanges object with 13 ranges and 0 metadata columns:
##                                              seqnames      ranges strand
##                                                 <Rle>   <IRanges>  <Rle>
##    [1] Severe acute respiratory syndrome coronavirus2   266-13468      +
##    [2] Severe acute respiratory syndrome coronavirus2 13468-21555      +
##    [3] Severe acute respiratory syndrome coronavirus2   266-13483      +
##    [4] Severe acute respiratory syndrome coronavirus2 21563-25384      +
##    [5] Severe acute respiratory syndrome coronavirus2 25393-26220      +
##    ...                                            ...         ...    ...
##    [9] Severe acute respiratory syndrome coronavirus2 27394-27759      +
##   [10] Severe acute respiratory syndrome coronavirus2 27756-27887      +
##   [11] Severe acute respiratory syndrome coronavirus2 27894-28259      +
##   [12] Severe acute respiratory syndrome coronavirus2 28274-29533      +
##   [13] Severe acute respiratory syndrome coronavirus2 29558-29674      +
##   -------
##   seqinfo: 1 sequence from NC_045512.2 genome

There are 13 coding sequences in the reference genome. The sequence is named ‘Severe acute respiratory syndrome coronavirus2’. The first region starts at position 266 of the reference genome, and goes to position 13468. The coding sequence is on the ‘+’ strand.

The full cds object has additional detail about each coding sequence. Display the full object (the display ‘wraps’ around several lines

cds
## GRanges object with 13 ranges and 14 metadata columns:
##                                              seqnames      ranges strand |
##                                                 <Rle>   <IRanges>  <Rle> |
##    [1] Severe acute respiratory syndrome coronavirus2   266-13468      + |
##    [2] Severe acute respiratory syndrome coronavirus2 13468-21555      + |
##    [3] Severe acute respiratory syndrome coronavirus2   266-13483      + |
##    [4] Severe acute respiratory syndrome coronavirus2 21563-25384      + |
##    [5] Severe acute respiratory syndrome coronavirus2 25393-26220      + |
##    ...                                            ...         ...    ... .
##    [9] Severe acute respiratory syndrome coronavirus2 27394-27759      + |
##   [10] Severe acute respiratory syndrome coronavirus2 27756-27887      + |
##   [11] Severe acute respiratory syndrome coronavirus2 27894-28259      + |
##   [12] Severe acute respiratory syndrome coronavirus2 28274-29533      + |
##   [13] Severe acute respiratory syndrome coronavirus2 29558-29674      + |
##               type        gene   locus_tag ribosomal_slippage
##        <character> <character> <character>          <logical>
##    [1]         CDS      ORF1ab  GU280_gp01               TRUE
##    [2]         CDS      ORF1ab  GU280_gp01               TRUE
##    [3]         CDS      ORF1ab  GU280_gp01              FALSE
##    [4]         CDS           S  GU280_gp02              FALSE
##    [5]         CDS       ORF3a  GU280_gp03              FALSE
##    ...         ...         ...         ...                ...
##    [9]         CDS       ORF7a  GU280_gp07              FALSE
##   [10]         CDS       ORF7b  GU280_gp08              FALSE
##   [11]         CDS        ORF8  GU280_gp09              FALSE
##   [12]         CDS           N  GU280_gp10              FALSE
##   [13]         CDS       ORF10  GU280_gp11              FALSE
##                                                note codon_start
##                                         <character>   <numeric>
##    [1] pp1ab; translated by -1 ribosomal frameshift           1
##    [2] pp1ab; translated by -1 ribosomal frameshift           1
##    [3]                                         pp1a           1
##    [4]            structural protein; spike protein           1
##    [5]                                         <NA>           1
##    ...                                          ...         ...
##    [9]                                         <NA>           1
##   [10]                                         <NA>           1
##   [11]                                         <NA>           1
##   [12]                     ORF9; structural protein           1
##   [13]                                         <NA>           1
##                            product     protein_id         db_xref   translation
##                        <character>    <character> <CharacterList> <AAStringSet>
##    [1]          ORF1ab polyprotein YP_009724389.1 GeneID:43740578     MES...VNN
##    [2]          ORF1ab polyprotein YP_009724389.1 GeneID:43740578     MES...VNN
##    [3]           ORF1a polyprotein YP_009725295.1 GeneID:43740578     MES...FAV
##    [4]        surface glycoprotein YP_009724390.1 GeneID:43740568     MFV...HYT
##    [5]               ORF3a protein YP_009724391.1 GeneID:43740569     MDL...VPL
##    ...                         ...            ...             ...           ...
##    [9]               ORF7a protein YP_009724395.1 GeneID:43740573     MKI...KTE
##   [10]                       ORF7b YP_009725318.1 GeneID:43740574     MIE...CHA
##   [11]                ORF8 protein YP_009724396.1 GeneID:43740577     MKF...DFI
##   [12] nucleocapsid phosphoprotein YP_009724397.2 GeneID:43740575     MSD...TQA
##   [13]               ORF10 protein YP_009725255.1 GeneID:43740576     MGY...NLT
##            loctype       gene_synonym     gene_id transcript_id
##        <character>    <CharacterList> <character>   <character>
##    [1]      normal                 NA  GU280_gp01  GU280_gp01.1
##    [2]      normal                 NA  GU280_gp01  GU280_gp01.1
##    [3]      normal                 NA  GU280_gp01  GU280_gp01.2
##    [4]      normal spike glycoprotein  GU280_gp02  GU280_gp02.1
##    [5]      normal                 NA  GU280_gp03  GU280_gp03.1
##    ...         ...                ...         ...           ...
##    [9]      normal                 NA  GU280_gp07  GU280_gp07.1
##   [10]      normal                 NA  GU280_gp08  GU280_gp08.1
##   [11]      normal                 NA  GU280_gp09  GU280_gp09.1
##   [12]      normal                 NA  GU280_gp10  GU280_gp10.1
##   [13]      normal                 NA  GU280_gp11  GU280_gp11.1
##   -------
##   seqinfo: 1 sequence from NC_045512.2 genome

or coerce to a tibble and explore

library(tibble)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:GenomicRanges':
## 
##     intersect, setdiff, union
## The following object is masked from 'package:GenomeInfoDb':
## 
##     intersect
## The following objects are masked from 'package:Biostrings':
## 
##     collapse, intersect, setdiff, setequal, union
## The following object is masked from 'package:XVector':
## 
##     slice
## The following objects are masked from 'package:IRanges':
## 
##     collapse, desc, intersect, setdiff, slice, union
## The following objects are masked from 'package:S4Vectors':
## 
##     first, intersect, rename, setdiff, setequal, union
## The following objects are masked from 'package:BiocGenerics':
## 
##     combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
as_tibble(cds)
## # A tibble: 13 x 19
##    seqnames start   end width strand type  gene  locus_tag ribosomal_slipp…
##    <fct>    <int> <int> <int> <fct>  <chr> <chr> <chr>     <lgl>           
##  1 Severe …   266 13468 13203 +      CDS   ORF1… GU280_gp… TRUE            
##  2 Severe … 13468 21555  8088 +      CDS   ORF1… GU280_gp… TRUE            
##  3 Severe …   266 13483 13218 +      CDS   ORF1… GU280_gp… FALSE           
##  4 Severe … 21563 25384  3822 +      CDS   S     GU280_gp… FALSE           
##  5 Severe … 25393 26220   828 +      CDS   ORF3a GU280_gp… FALSE           
##  6 Severe … 26245 26472   228 +      CDS   E     GU280_gp… FALSE           
##  7 Severe … 26523 27191   669 +      CDS   M     GU280_gp… FALSE           
##  8 Severe … 27202 27387   186 +      CDS   ORF6  GU280_gp… FALSE           
##  9 Severe … 27394 27759   366 +      CDS   ORF7a GU280_gp… FALSE           
## 10 Severe … 27756 27887   132 +      CDS   ORF7b GU280_gp… FALSE           
## 11 Severe … 27894 28259   366 +      CDS   ORF8  GU280_gp… FALSE           
## 12 Severe … 28274 29533  1260 +      CDS   N     GU280_gp… FALSE           
## 13 Severe … 29558 29674   117 +      CDS   ORF10 GU280_gp… FALSE           
## # … with 10 more variables: note <chr>, codon_start <dbl>, product <chr>,
## #   protein_id <chr>, db_xref <I<list>>, translation <chr>, loctype <chr>,
## #   gene_synonym <I<list>>, gene_id <chr>, transcript_id <chr>

For instance, we can find the gene associated with each coding sequence, using the following equivalent commands

cds$gene
##  [1] "ORF1ab" "ORF1ab" "ORF1ab" "S"      "ORF3a"  "E"      "M"      "ORF6"  
##  [9] "ORF7a"  "ORF7b"  "ORF8"   "N"      "ORF10"

subset(cds, , gene)
## GRanges object with 13 ranges and 1 metadata column:
##                                              seqnames      ranges strand |
##                                                 <Rle>   <IRanges>  <Rle> |
##    [1] Severe acute respiratory syndrome coronavirus2   266-13468      + |
##    [2] Severe acute respiratory syndrome coronavirus2 13468-21555      + |
##    [3] Severe acute respiratory syndrome coronavirus2   266-13483      + |
##    [4] Severe acute respiratory syndrome coronavirus2 21563-25384      + |
##    [5] Severe acute respiratory syndrome coronavirus2 25393-26220      + |
##    ...                                            ...         ...    ... .
##    [9] Severe acute respiratory syndrome coronavirus2 27394-27759      + |
##   [10] Severe acute respiratory syndrome coronavirus2 27756-27887      + |
##   [11] Severe acute respiratory syndrome coronavirus2 27894-28259      + |
##   [12] Severe acute respiratory syndrome coronavirus2 28274-29533      + |
##   [13] Severe acute respiratory syndrome coronavirus2 29558-29674      + |
##               gene
##        <character>
##    [1]      ORF1ab
##    [2]      ORF1ab
##    [3]      ORF1ab
##    [4]           S
##    [5]       ORF3a
##    ...         ...
##    [9]       ORF7a
##   [10]       ORF7b
##   [11]        ORF8
##   [12]           N
##   [13]       ORF10
##   -------
##   seqinfo: 1 sequence from NC_045512.2 genome

as_tibble(cds) %>%
    ## info on gene and protein product
    dplyr::select(gene, product)
## # A tibble: 13 x 2
##    gene   product                    
##    <chr>  <chr>                      
##  1 ORF1ab ORF1ab polyprotein         
##  2 ORF1ab ORF1ab polyprotein         
##  3 ORF1ab ORF1a polyprotein          
##  4 S      surface glycoprotein       
##  5 ORF3a  ORF3a protein              
##  6 E      envelope protein           
##  7 M      membrane glycoprotein      
##  8 ORF6   ORF6 protein               
##  9 ORF7a  ORF7a protein              
## 10 ORF7b  ORF7b                      
## 11 ORF8   ORF8 protein               
## 12 N      nucleocapsid phosphoprotein
## 13 ORF10  ORF10 protein

Using genomic ranges to extract DNA sequences

Recall the sequence of the entire genome

ref <- getSeq(gb)
ref
##   A DNAStringSet instance of length 1
##     width seq                                               names               
## [1] 29903 ATTAAAGGTTTATACCTTCCCAG...AAAAAAAAAAAAAAAAAAAAAAA Severe acute resp...

Use getSeq() to extract the DNAStringSet() corresponding to the genomic ranges specified by cds

dna <- getSeq(ref, cds)
dna
##   A DNAStringSet instance of length 13
##      width seq
##  [1] 13203 ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAA...CTTCAGTCAGCTGATGCACAATCGTTTTTAAAC
##  [2]  8088 CGGGTTTGCGGTGTAAGTGCAGCCCGTCTTACA...GTTATTTCTAGTGATGTTCTTGTTAACAACTAA
##  [3] 13218 ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAA...GCACAATCGTTTTTAAACGGGTTTGCGGTGTAA
##  [4]  3822 ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTC...GTGCTCAAAGGAGTCAAATTACATTACACATAA
##  [5]   828 ATGGATTTGTTTATGAGAATCTTCACAATTGGA...GAACCGACGACGACTACTAGCGTGCCTTTGTAA
##  ...   ... ...
##  [9]   366 ATGAAAATTATTCTTTTCTTGGCACTGATAACA...CTTTGCTTCACACTCAAAAGAAAGACAGAATGA
## [10]   132 ATGATTGAACTTTCATTAATTGACTTCTATTTG...CTGCAAGATCATAATGAAACTTGTCACGCCTAA
## [11]   366 ATGAAATTTCTTGTTTTCTTAGGAATCATCACA...CATGACGTTCGTGTTGTTTTAGATTTCATCTAA
## [12]  1260 ATGTCTGATAATGGACCCCAAAATCAGCGAAAT...TCCATGAGCAGTGCTGACTCAACTCAGGCCTAA
## [13]   117 ATGGGCTATATAAACGTTTTCGCTTTTCCGTTT...CAAGTAGATGTAGTTAACTTTAATCTCACATAG

It might be useful to coordinate information about each coding sequence with the sequence itself

cds$DNA <- dna
cds
## GRanges object with 13 ranges and 15 metadata columns:
##                                              seqnames      ranges strand |
##                                                 <Rle>   <IRanges>  <Rle> |
##    [1] Severe acute respiratory syndrome coronavirus2   266-13468      + |
##    [2] Severe acute respiratory syndrome coronavirus2 13468-21555      + |
##    [3] Severe acute respiratory syndrome coronavirus2   266-13483      + |
##    [4] Severe acute respiratory syndrome coronavirus2 21563-25384      + |
##    [5] Severe acute respiratory syndrome coronavirus2 25393-26220      + |
##    ...                                            ...         ...    ... .
##    [9] Severe acute respiratory syndrome coronavirus2 27394-27759      + |
##   [10] Severe acute respiratory syndrome coronavirus2 27756-27887      + |
##   [11] Severe acute respiratory syndrome coronavirus2 27894-28259      + |
##   [12] Severe acute respiratory syndrome coronavirus2 28274-29533      + |
##   [13] Severe acute respiratory syndrome coronavirus2 29558-29674      + |
##               type        gene   locus_tag ribosomal_slippage
##        <character> <character> <character>          <logical>
##    [1]         CDS      ORF1ab  GU280_gp01               TRUE
##    [2]         CDS      ORF1ab  GU280_gp01               TRUE
##    [3]         CDS      ORF1ab  GU280_gp01              FALSE
##    [4]         CDS           S  GU280_gp02              FALSE
##    [5]         CDS       ORF3a  GU280_gp03              FALSE
##    ...         ...         ...         ...                ...
##    [9]         CDS       ORF7a  GU280_gp07              FALSE
##   [10]         CDS       ORF7b  GU280_gp08              FALSE
##   [11]         CDS        ORF8  GU280_gp09              FALSE
##   [12]         CDS           N  GU280_gp10              FALSE
##   [13]         CDS       ORF10  GU280_gp11              FALSE
##                                                note codon_start
##                                         <character>   <numeric>
##    [1] pp1ab; translated by -1 ribosomal frameshift           1
##    [2] pp1ab; translated by -1 ribosomal frameshift           1
##    [3]                                         pp1a           1
##    [4]            structural protein; spike protein           1
##    [5]                                         <NA>           1
##    ...                                          ...         ...
##    [9]                                         <NA>           1
##   [10]                                         <NA>           1
##   [11]                                         <NA>           1
##   [12]                     ORF9; structural protein           1
##   [13]                                         <NA>           1
##                            product     protein_id         db_xref   translation
##                        <character>    <character> <CharacterList> <AAStringSet>
##    [1]          ORF1ab polyprotein YP_009724389.1 GeneID:43740578     MES...VNN
##    [2]          ORF1ab polyprotein YP_009724389.1 GeneID:43740578     MES...VNN
##    [3]           ORF1a polyprotein YP_009725295.1 GeneID:43740578     MES...FAV
##    [4]        surface glycoprotein YP_009724390.1 GeneID:43740568     MFV...HYT
##    [5]               ORF3a protein YP_009724391.1 GeneID:43740569     MDL...VPL
##    ...                         ...            ...             ...           ...
##    [9]               ORF7a protein YP_009724395.1 GeneID:43740573     MKI...KTE
##   [10]                       ORF7b YP_009725318.1 GeneID:43740574     MIE...CHA
##   [11]                ORF8 protein YP_009724396.1 GeneID:43740577     MKF...DFI
##   [12] nucleocapsid phosphoprotein YP_009724397.2 GeneID:43740575     MSD...TQA
##   [13]               ORF10 protein YP_009725255.1 GeneID:43740576     MGY...NLT
##            loctype       gene_synonym     gene_id transcript_id            DNA
##        <character>    <CharacterList> <character>   <character> <DNAStringSet>
##    [1]      normal                 NA  GU280_gp01  GU280_gp01.1      ATG...AAC
##    [2]      normal                 NA  GU280_gp01  GU280_gp01.1      CGG...TAA
##    [3]      normal                 NA  GU280_gp01  GU280_gp01.2      ATG...TAA
##    [4]      normal spike glycoprotein  GU280_gp02  GU280_gp02.1      ATG...TAA
##    [5]      normal                 NA  GU280_gp03  GU280_gp03.1      ATG...TAA
##    ...         ...                ...         ...           ...            ...
##    [9]      normal                 NA  GU280_gp07  GU280_gp07.1      ATG...TGA
##   [10]      normal                 NA  GU280_gp08  GU280_gp08.1      ATG...TAA
##   [11]      normal                 NA  GU280_gp09  GU280_gp09.1      ATG...TAA
##   [12]      normal                 NA  GU280_gp10  GU280_gp10.1      ATG...TAA
##   [13]      normal                 NA  GU280_gp11  GU280_gp11.1      ATG...TAG
##   -------
##   seqinfo: 1 sequence from NC_045512.2 genome

5.3 Day 31 Sequence alignment and visualization

5.3.1 ‘S’ gene exploration

Read the ‘S’ gene sequence extracted from all genbank records into a DNAStringSet

url <- "https://raw.githubusercontent.com/mtmorgan/QuaRantine/master/assets/05-SARS-CoV2-S.fasta"
fasta_file <- file.path(workdir, basename(url))
if ( !file.exists(fasta_file) )
    download.file(url, fasta_file)
dna <- readDNAStringSet(fasta_file)
dna
##   A DNAStringSet instance of length 2289
##        width seq                                            names               
##    [1]  3822 ATGTTTGTTTTTCTTGTTTTAT...GTCAAATTACATTACACATAA LC528232
##    [2]  3822 ATGTTTGTTTTTCTTGTTTTAT...GTCAAATTACATTACACATAA LC528233
##    [3]  3822 ATGTTTGTTTTTCTTGTTTTAT...GTCAAATTACATTACACATAA LC529905
##    [4]  3822 ATGTTTGTTTTTCTTGTTTTAT...GTCAAATTACATTACACATAA LC534418
##    [5]  3822 ATGTTTGTTTTTCTTGTTTTAT...GTCAAATTACATTACACATAA LC534419
##    ...   ... ...
## [2285]  3822 ATGTTTGTTTTTCTTGTTTTAT...GTCAAATTACATTACACATAA MT447174
## [2286]  3822 ATGTTTGTTTTTCTTGTTTTAT...GTCAAATTACATTACACATAA MT447175
## [2287]  3822 ATGTTTGTTTTTCTTGTTTTAT...GTCAAATTACATTACACATAA MT447176
## [2288]  3822 ATGTTTGTTTTTCTTGTTTTAT...GTCAAATTACATTACACATAA MT447177
## [2289]  3822 ATGTTTGTTTTTCTTGTTTTAT...GTCAAATTACATTACACATAA NC_045512

Explore the data. How many sequences? How many unique sequences?

length(dna)
## [1] 2289

dna %>% unique() %>% length()
## [1] 427

Note that the DNA alphabet contains ‘ambiguity’ letters, corresponding to base sequences that were uncertain, e.g., ‘A’ or ‘T’. Check out the help page for ?translate to see options for translating ambiguity codes. How many unique protein sequences?

IUPAC_CODE_MAP
##      A      C      G      T      M      R      W      S      Y      K      V 
##    "A"    "C"    "G"    "T"   "AC"   "AG"   "AT"   "CG"   "CT"   "GT"  "ACG" 
##      H      D      B      N 
##  "ACT"  "AGT"  "CGT" "ACGT"

dna %>% alphabetFrequency() %>% colSums()
##       A       C       G       T       M       R       W       S       Y       K 
## 2563642 1649084 1604491 2898077       2       9       6       0      17       6 
##       V       H       D       B       N       -       +       . 
##       1       0       1       1   33212       0       0       0

aa <- dna %>% translate(if.fuzzy.codon = "solve") %>% unique()

aa
##   A AAStringSet instance of length 343
##       width seq                                             names               
##   [1]  1274 MFVFLVLLPLVSSQCVNLTTRT...CCKFDEDDSEPVLKGVKLHYT* LC528232
##   [2]  1274 MFVFLVLLPLVSSQCVNLTTRT...CCKFDEDDSEPVLKGVKLHYT* MT007544
##   [3]  1273 MFVFLVLLPLVSSQCVNLTTRT...CCKFDEDDSEPVLKGVKLHYT* MT012098
##   [4]  1274 MFVFLVLLPLVSSQCVNLTTRT...CCKFDEDDSEPVLKGVKLHYT* MT020781
##   [5]  1274 MFVFLVLLPLVSSQCVNLTTRT...CCKFDEDDSEPVLKGVKLHYT* MT039890
##   ...   ... ...
## [339]  1274 MFVFLVLLPLVSSQCVNLTTRT...CCKFDEDDSEPVLKGVKLHYT* MT446361
## [340]  1274 MFVFLVLLPLVSSQCVNLTTRT...CCKFDEDDSEPVLKGVKLHYT* MT447160
## [341]  1274 MFVFLVLLPLVSSQCVNLTTRT...CCKFDEDDSEPVLKGVKLHYT* MT447163
## [342]  1274 MFVFLVLLPLVSSQCVNLTTRT...CCKFDEDDSEPVLKGVKLHYT* MT447168
## [343]  1274 MFVFLVLLPLVSSQCVNLTTRI...CCKFDEDDSEPVLKGVKLHYT* MT447177

The table() function counts the number of occurrences of each element, so

nchar(aa) %>% table() %>% as_tibble()
## # A tibble: 2 x 2
##   .         n
##   <chr> <int>
## 1 1273      3
## 2 1274    340

shows us that most sequences have 1274 amino acids.

The names(aa) are misleading – these are the identifiers of a representative accession, but several accessions may map to the same identifier. The following code creates a table that indicates which cluster each original identifier belongs to; the names of aa are updated accordingly.

all <- dna %>% translate(if.fuzzy.codon = "solve")
## match(x, y) finds the index of each element of x in y.
## match(c("apple", "pear"), c("pear", "apple", "banana")) finds
## that 'apple' in x matches the second element of y, and 'pear'
## matches the first element of y. So the result is c(2, 1).
id <- match(all, aa)
clusters <- tibble(
    accession = names(all),
    label = as.character(id)
)
names(aa) <- match(aa, aa)

With this information, it’s easy to count the number of times each sequence was represented. The reference sequence belongs to cluster 1.

clusters %>% filter(accession == "NC_045512")
## # A tibble: 1 x 2
##   accession label
##   <chr>     <chr>
## 1 NC_045512 1

clusters %>%
    count(label) %>%
    arrange(desc(n))
## # A tibble: 343 x 2
##    label     n
##    <chr> <int>
##  1 11     1117
##  2 1       723
##  3 126      21
##  4 34       10
##  5 341      10
##  6 125       6
##  7 167       6
##  8 47        6
##  9 59        6
## 10 170       5
## # … with 333 more rows

5.3.2 Sequence alignment

Setup

Create a working directory, if necessary. E.g.,

workdir <- "workdir"
if (!dir.exists(workdir))
    dir.create(workdir)

Install (if necessary, using BiocManager::install("DECIPHER")) and attach the DECIPHER package

library(DECIPHER)

maintainer("DECIPHER")
## [1] "Erik Wright <eswright@pitt.edu>"

citation("DECIPHER")
## 
## Wright ES (2016). "Using DECIPHER v2.0 to Analyze Big Biological
## Sequence Data in R." _The R Journal_, *8*(1), 352-359.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {Using DECIPHER v2.0 to Analyze Big Biological Sequence Data in R},
##     author = {Erik S. Wright},
##     journal = {The R Journal},
##     year = {2016},
##     volume = {8},
##     number = {1},
##     pages = {352-359},
##   }

This package contains tools for multiple sequence alignment. Explore the introductory and other vignettes in this package.
```
browseVignettes(package = "DECIPHER")
```
Check out one of the help pages, e.g.,
```
?AlignSeqs
```
Use example(AlignSeqs) to run the examples on the help page.

Perform multiple alignment

Perform multiple alignment on the unique amino acid sequences using DECIPHER; this has minimal effect, with single insertions in the few sequences with unusual nchar().

aln <- AlignSeqs(aa, verbose = FALSE)
aln
##   A AAStringSet instance of length 343
##       width seq                                             names               
##   [1]  1274 MFVFLVLLPLVSSQCVNLTTRT...CCKFDEDDSEPVLKGVKLHYT* 1
##   [2]  1274 MFVFLVLLPLVSSQCVNLTTRT...CCKFDEDDSEPVLKGVKLHYT* 2
##   [3]  1274 MFVFLVLLPLVSSQCVNLTTRT...CCKFDEDDSEPVLKGVKLHYT* 3
##   [4]  1274 MFVFLVLLPLVSSQCVNLTTRT...CCKFDEDDSEPVLKGVKLHYT* 4
##   [5]  1274 MFVFLVLLPLVSSQCVNLTTRT...CCKFDEDDSEPVLKGVKLHYT* 5
##   ...   ... ...
## [339]  1274 MFVFLVLLPLVSSQCVNLTTRT...CCKFDEDDSEPVLKGVKLHYT* 339
## [340]  1274 MFVFLVLLPLVSSQCVNLTTRT...CCKFDEDDSEPVLKGVKLHYT* 340
## [341]  1274 MFVFLVLLPLVSSQCVNLTTRT...CCKFDEDDSEPVLKGVKLHYT* 341
## [342]  1274 MFVFLVLLPLVSSQCVNLTTRT...CCKFDEDDSEPVLKGVKLHYT* 342
## [343]  1274 MFVFLVLLPLVSSQCVNLTTRI...CCKFDEDDSEPVLKGVKLHYT* 343

nchar(aln) %>% table() %>% as_tibble()
## # A tibble: 1 x 2
##   .         n
##   <chr> <int>
## 1 1274    343

The ‘distance matrix’ measures the proximity (Hamming distance) of each sequence to other sequences in sequence space. The result is a large matrix of pairwise distances

dist <- DistanceMatrix(aln, verbose = FALSE)
class(dist)
## [1] "matrix"
dim(dist)
## [1] 343 343
dist[1:5, 1:5]
##              1            2           3            4            5
## 1 0.0000000000 0.0007849294 0.001569859 0.0007849294 0.0007849294
## 2 0.0007849294 0.0000000000 0.002354788 0.0015698587 0.0015698587
## 3 0.0015698587 0.0023547881 0.000000000 0.0023547881 0.0023547881
## 4 0.0007849294 0.0015698587 0.002354788 0.0000000000 0.0015698587
## 5 0.0007849294 0.0015698587 0.002354788 0.0015698587 0.0000000000

Use the distance matrix to cluster sequences using the ‘Neighbor Joining’ algorithm; visualize the result (see ?plot.dendrogram for plotting options when the first argument is of class dendrogram)

dendrogram <- IdClusters(dist, method = "NJ", type = "dendrogram")
## ================================================================================
## 
## Time difference of 0.61 secs
dendrogram
## 'dendrogram' with 2 branches and 343 members total, at height 0.009898669
plot(dendrogram, leaflab = "none")

Save the tree to the working directory using the ‘Newick’ format

dendrogram_file <- file.path(workdir, "SARS-CoV2-S.newick")
WriteDendrogram(dendrogram, dendrogram_file, quoteLabels = FALSE)

5.3.3 Phylogenetic tree visualization

Setup

Install (if necessary, using BiocManager::install("ggtree")) and attach the ggtree package

library(ggtree)

maintainer("ggtree")
## [1] "Guangchuang Yu <guangchuangyu@gmail.com>"

citation("ggtree")
## 
## To cite ggtree in publications use:
## 
##   Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk
##   Lam. ggtree: an R package for visualization and annotation of
##   phylogenetic trees with their covariates and other associated data.
##   Methods in Ecology and Evolution 2017, 8(1):28-36
## 
##   Guangchuang Yu, Tommy Tsan-Yuk Lam, Huachen Zhu, Yi Guan. Two methods
##   for mapping and visualizing associated data on phylogeny using
##   ggtree. Molecular Biology and Evolution 2018, 35(2):3041-3043. doi:
##   10.1093/molbev/msy194
## 
## To see these entries in BibTeX format, use 'print(<citation>,
## bibtex=TRUE)', 'toBibtex(.)', or set
## 'options(citation.bibtex.max=999)'.

The visualizations this package provides are not well-suited to R’s help pages; check out the online ggtree book to orient yourself.

Read the data from the Newick-format file created by DECIPHER

dendrogram_file <- file.path(workdir, "SARS-CoV2-S.newick")
tree <- treeio::read.newick(dendrogram_file)
tree
## 
## Phylogenetic tree with 343 tips and 288 internal nodes.
## 
## Tip labels:
##  228, 267, 298, 294, 77, 189, ...
## 
## Rooted; includes branch lengths.
tibble::as_tibble(tree)
## # A tibble: 631 x 4
##    parent  node branch.length label
##     <int> <int>         <dbl> <chr>
##  1    346     1      0        228  
##  2    346     2      0.00385  267  
##  3    347     3      0        298  
##  4    348     4      0        294  
##  5    349     5      0        77   
##  6    349     6      0.00600  189  
##  7    350     7      0        295  
##  8    351     8      0.000580 280  
##  9    352     9      0        297  
## 10    353    10      0        87   
## # … with 621 more rows

Plot the tree, adding tip labels to the longest branches

ggtree(tree) +
    geom_tiplab(
        aes(
            subset = branch.length > .001,
            label = label
        ),
        size = 3
    ) +
    ggplot2::ggtitle("SARS-CoV-2 S-locus phylogeny; genbank, 10 May 2020")
## Warning: `data_frame()` is deprecated as of tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

Color sequences by region

Find nodes representing just a single sequence – ‘singletons’
Use GenBank country to identify which country these were sequenced from
Color lineages according to singleton country of origin

5.4 Day 32 Single-cell expression data

This is a very preliminary foray into COVID-19 single-cell RNA-seq gene expression data. It is derived from a preprint by Muus et al, including a ‘Terra’ (NHGRI) cloud computing workspace.

RNA-seq is a technology for measuring RNA expression levels as a proxy for gene expression. Single-cell RNA seq measures RNA expression levels at the single-cell resolution, sometimes in conjunction with imaging or other spatial information. Often, the overall pattern of expression of a single cell can be used to classify the cell to a particular type, e.g., epithelial.

Many interesting questions present themselves in the context of COVID-19. It would be interesting to investigate patterns of expression of human genes known to interact with the SARS-CoV-2 virus, especially in tissues likely to be involved in the introduction of the virus to the human body. One could imagine further studies identifying, e.g., genetic variants that influence susceptibility or response to infection.

Our goal is to do some preliminary visualization of scRNA seq at human genes whose product is known to interact with virus proteins. A much more comprehensive guide is available as Orchestrating Single Cell Analysis with Bioconductor.

5.4.1 Setup

This module requires a recent version of R and Bioconductor. It also requires some advanced packages that may be difficult to install. Definitely it is necessary that

as.package_version(R.version) >= "3.6"
## [1] TRUE
BiocManager::version() >= "3.10"
## [1] TRUE

If those conditions are met, load the following libraries (use BiocManager::install(<pkgs>) if necessary)

library(SingleCellExperiment)
library(Matrix)
library(rhdf5)
library(dplyr)
library(hexbin)

Create a working directory for today.

workdir <- "workdir"
if (!dir.exists(workdir))
    dir.create(workdir)

A couple of scripts provide helper functions for working with the data. Download the following file to the working directory

url <- "https://raw.githubusercontent.com/mtmorgan/QuaRantine/master/assets/05-scRNAseq-import.R"
destination <- file.path(workdir, basename(url))
if (!file.exists(destination))
    download.file(url, destination)

Take a quick look at the downloaded file (e.g., in RStudio File -> open) and verify that it is an R script. Source the script in to your R session so that you have access to the functions in the file.

source(destination)

Download our sample data set

url <- "https://github.com/mtmorgan/QuaRantine/raw/master/assets/05-internal_nonsmokerslung.h5ad"
destination <- file.path(workdir, basename(url))
if (!file.exists(destination))
    download.file(url, destination)

The file is an ‘hdf5’ file format, used to store relatively large data sets in a binary format that allows for faster and more robust input than a CSV file. R and other software packages can read hdf5 files, but Excel, etc.

5.4.2 Data input and exploration

There are two main components to the data. The first describes each observation (cell).

obs <- read_obs(destination)
obs
## # A tibble: 240,511 x 7
##      Age `Cell Type` Count Gender Patient_ID Tissue              cell_subset    
##    <dbl> <fct>       <int> <fct>  <fct>      <fct>               <fct>          
##  1    57 Basal           1 F      HU28       internal_nonsmoker… Epithelial (ba…
##  2    57 Basal           1 F      HU28       internal_nonsmoker… Epithelial (ba…
##  3    57 Basal           1 F      HU28       internal_nonsmoker… Epithelial (ba…
##  4    57 Basal           1 F      HU28       internal_nonsmoker… Epithelial (ba…
##  5    57 Secretory       1 F      HU28       internal_nonsmoker… Secretory      
##  6    57 Basal           1 F      HU28       internal_nonsmoker… Epithelial (ba…
##  7    57 Basal           1 F      HU28       internal_nonsmoker… Epithelial (ba…
##  8    57 Basal           1 F      HU28       internal_nonsmoker… Epithelial (ba…
##  9    57 Basal           1 F      HU28       internal_nonsmoker… Epithelial (ba…
## 10    57 Secretory       1 F      HU28       internal_nonsmoker… Secretory      
## # … with 240,501 more rows

If R complains that it can’t find read_obs, then remember to source() the script downloaded earlier.

This experiment measured gene expression in 240511 cells. Use standard dplyr commands to explore information about the cells

How many patients are represented? (we need to use dplyr::count() because another package attached to our search path also has a count() function)

obs %>% dplyr::count(Patient_ID)
## # A tibble: 7 x 2
##   Patient_ID     n
##   <fct>      <int>
## 1 HU28       12301
## 2 HU30       33971
## 3 HU37       66594
## 4 HU39       29908
## 5 HU49       56575
## 6 HU52       17768
## 7 HU62       23394

What is the age and gender of each patient?

obs %>% distinct(Patient_ID, Age, Gender)
## # A tibble: 7 x 3
##   Patient_ID   Age Gender
##   <fct>      <dbl> <fct> 
## 1 HU28        57   F     
## 2 HU39         0.3 M     
## 3 HU62        46   M     
## 4 HU30        59   F     
## 5 HU49        18   F     
## 6 HU37        42   F     
## 7 HU52        66   F

Which cell types are represented (the column contains a space in its name, which causes problems for R unless the column is quoted)

obs %>% dplyr::count(`Cell Type`) %>% arrange(desc(n))
## # A tibble: 17 x 2
##    `Cell Type`        n
##    <fct>          <int>
##  1 Basal          67554
##  2 AT2            28326
##  3 Fibroblast     27574
##  4 T.NK.cells     26709
##  5 Myeloid        19730
##  6 Secretory      18643
##  7 AT1            11722
##  8 Endothelial     9611
##  9 Ciliated        9445
## 10 SmoothMuscle    9392
## 11 Pericytes       3385
## 12 B.cells         3232
## 13 Mesothelium     2179
## 14 Mast            1991
## 15 Tuft.like        677
## 16 Neuroendocrine   179
## 17 Ionocytes        162

Each cell was classified to cell type based on overall gene expression. How many cells of each type were observed?

obs %>% dplyr::count(cell_subset)
## # A tibble: 17 x 2
##    cell_subset                            n
##    <fct>                              <int>
##  1 Endothelial cell                    9611
##  2 Epithelial (basal)                 67554
##  3 Epithelial (ciliated)               9445
##  4 Epithelial cell (alveolar type I)  11722
##  5 Epithelial cell (alveolar type II) 28326
##  6 Fibroblast                         27574
##  7 Immune (B cell)                     3232
##  8 Immune (NKT cell)                  26709
##  9 Immune (mast cell)                  1991
## 10 Immune (myeloid)                   19730
## 11 Ionocyte                             162
## 12 Mesothelium                         2179
## 13 Neuroendocrine                       179
## 14 Pericyte                            3385
## 15 Secretory                          18643
## 16 Smooth muscle cell                  9392
## 17 Tuft                                 677

Single-cell RNA assays expression across all genes, but the current data set has been restricted to 22 genes relevant to virus interaction. Read in the expression levels of these 22 genes across all cells.

x <- read_expression(destination)
dim(x)
## [1]     22 240511
class(x)
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"

x[, 1:3]
## 22 x 3 sparse Matrix of class "dgCMatrix"
##         1B_cell57_HU28 1B_cell418_HU28 1B_cell611_HU28
## ACE2         .               .               .        
## TMPRSS2      0.6002854       0.7688452       .        
## IL6          .               .               .        
## IL6R         .               .               .        
## IL6ST        .               0.7688452       .        
## PCSK1        .               .               .        
## PCSK2        .               .               .        
## FURIN        .               .               .        
## PCSK4        .               .               .        
## PCSK5        .               .               .        
## PCSK6        .               .               .        
## PCSK7        .               .               0.7382636
## C1R          .               .               .        
## C2           .               .               .        
## C3           .               .               .        
## C5           .               .               .        
## CFI          .               .               .        
## CTSS         0.9727762       0.7688452       .        
## CTSL         .               .               .        
## CTSB         1.2435541       1.1983210       .        
## CTSC         .               0.7688452       .        
## CTSE         .               .               .

This is a matrix of 22 x 240511 values, one entry for each gene and cell. However, single-cell assays are ‘sparse’, with a large number of zeros. To take advantage of this sparse data, R has represented the matrix as a special sparse matrix. The . in the display above represent zeros; the object x only contains information about the non-zero values.

How sparse is the data, i.e., what is the fraction of zeros in the data?

Total number of elements
```
nrow(x) * ncol(x)
## [1] 5291242
```
Total number of zero elements (x == 0 returns a logical vector of TRUE and FALSE values; sum() treats TRUE as 1, and FALSE as 0, so sum(x -- 0) is the number of zero elements)
```
sum(x == 0)
## [1] 4321307
```

Fraction of zero elements

sum(x == 0) / (nrow(x) * ncol(x))
## [1] 0.8166905

More than 80% of the values are zeros! This primarily represents limitations of the technology, where relatively few ‘reads’ are available from each cell, so only a fraction of the genes being expressed are detected.

Use the function rowMeans() to calculate average expression of each gene; present this as a tibble

tibble(
    gene = rownames(x),
    average_expression = rowMeans(x)
)
## # A tibble: 22 x 2
##    gene    average_expression
##    <chr>                <dbl>
##  1 ACE2               0.00539
##  2 TMPRSS2            0.0877 
##  3 IL6                0.144  
##  4 IL6R               0.0923 
##  5 IL6ST              0.386  
##  6 PCSK1              0.00302
##  7 PCSK2              0.0117 
##  8 FURIN              0.0908 
##  9 PCSK4              0.00728
## 10 PCSK5              0.0740 
## # … with 12 more rows

5.4.3 Coordinating cell and expression data

Notice that we have two separate data structures – a large matrix of expression values, and a tibble describing each cell. It seems error-prone to manage these data separately, e.g., what if we remove some rows from the cell annotations without removing the corresponding columns from the expression matrix?

A better strategy is to introduce an object that can manage both parts of the data in a coordinated way. This container is provided by the SingleCellExperiment package

library(SingleCellExperiment)

Use the constructor SingleCellExperiment() to create this object. The basic metaphor is that a SingleCellExperiment is like a matrix, but with column (and row) annotations. The column annotations are called colData.

knitr::include_graphics('images/SummarizedExperiment.png')

Create and view a SingleCellExperiment

sce <- SingleCellExperiment(x, colData = obs)
sce
## class: SingleCellExperiment 
## dim: 22 240511 
## metadata(0):
## assays(1): ''
## rownames(22): ACE2 TMPRSS2 ... CTSC CTSE
## rowData names(0):
## colnames(240511): 1B_cell57_HU28 1B_cell418_HU28 ...
##   RP2_neg_cell6794209_HU52 RP2_neg_cell6794225_HU52
## colData names(7): Age Cell Type ... Tissue cell_subset
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(0):

dim(sce)
## [1]     22 240511

Use subset() to select particular rows and columns. The subset operates in a coordinated fashion on both the cell annotations and expression data, so there is no risk of ‘book-keeping’ error.

my_sce <-
    sce %>%
    subset(
        rownames(.) %in% "ACE2",
        `Cell Type` %in% c("Fibroblast", "SmoothMuscle")
    )

my_sce
## class: SingleCellExperiment 
## dim: 1 36966 
## metadata(0):
## assays(1): ''
## rownames(1): ACE2
## rowData names(0):
## colnames(36966): 1B_cell31801_HU28 1B_cell48220_HU28 ...
##   RP2_neg_cell6778193_HU52 RP2_neg_cell6785080_HU52
## colData names(7): Age Cell Type ... Tissue cell_subset
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(0):

dim(my_sce)
## [1]     1 36966

For the exercises below, focus on the ACE2 gene. ACE2 is a gene on the outer surface of the cell membrane in lung and other tissue.

ace2 <- subset(sce, rownames(sce) %in% "ACE2")
ace2
## class: SingleCellExperiment 
## dim: 1 240511 
## metadata(0):
## assays(1): ''
## rownames(1): ACE2
## rowData names(0):
## colnames(240511): 1B_cell57_HU28 1B_cell418_HU28 ...
##   RP2_neg_cell6794209_HU52 RP2_neg_cell6794225_HU52
## colData names(7): Age Cell Type ... Tissue cell_subset
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(0):

We will perform some simple visualization, as we start to explore the data. Create a tibble containing the cell_subset annotation and the expression values of the ACE2 gene. We make use of $ to access columns of the cell annotation (colData) and assay() to access the expression data. as.vector() changes the one-dimensional array returned by assay() into a numeric vector.

tbl <- tibble(
    cell_subset = ace2$cell_subset,
    expression = as.vector(assay(ace2))
)
tbl
## # A tibble: 240,511 x 2
##    cell_subset        expression
##    <fct>                   <dbl>
##  1 Epithelial (basal)          0
##  2 Epithelial (basal)          0
##  3 Epithelial (basal)          0
##  4 Epithelial (basal)          0
##  5 Secretory                   0
##  6 Epithelial (basal)          0
##  7 Epithelial (basal)          0
##  8 Epithelial (basal)          0
##  9 Epithelial (basal)          0
## 10 Secretory                   0
## # … with 240,501 more rows

Summarize the number of cells, non-zero expression values, and average expression of ACE2 for each cell subset.

tbl %>%
    group_by(cell_subset) %>%
    summarize(
        n = n(),
        n_non_zero = sum(expression != 0),
        mean_expression = mean(expression)
    )
## # A tibble: 17 x 4
##    cell_subset                            n n_non_zero mean_expression
##    <fct>                              <int>      <int>           <dbl>
##  1 Endothelial cell                    9611          2       0.0000691
##  2 Epithelial (basal)                 67554        854       0.00726  
##  3 Epithelial (ciliated)               9445        229       0.0167   
##  4 Epithelial cell (alveolar type I)  11722         67       0.00341  
##  5 Epithelial cell (alveolar type II) 28326        594       0.00967  
##  6 Fibroblast                         27574         25       0.000648 
##  7 Immune (B cell)                     3232          1       0.000226 
##  8 Immune (NKT cell)                  26709          5       0.000270 
##  9 Immune (mast cell)                  1991          0       0        
## 10 Immune (myeloid)                   19730          2       0.0000694
## 11 Ionocyte                             162          0       0        
## 12 Mesothelium                         2179          2       0.000418 
## 13 Neuroendocrine                       179          0       0        
## 14 Pericyte                            3385         23       0.00765  
## 15 Secretory                          18643        533       0.0146   
## 16 Smooth muscle cell                  9392          7       0.000642 
## 17 Tuft                                 677          4       0.00291

Pictures speak louder than words. Use the ggplot2 package

library(ggplot2)

to visualize as a horizontal boxplot the ACE2 expression values of each cell subtype. I had initially used geom_boxplot() without coord_flip(), but the cell subtype labels overlapped. I experimented with rotating the labels, but that just made them difficult to read. So coord_flip() makes the boxplot horizontal rather than vertical, the labels are easy to read, and the figure easy to interpret.

tbl %>%
    filter(expression > 0) %>%
    ggplot(aes(cell_subset, expression)) +
        geom_boxplot() +
        coord_flip() +
        ggtitle("ACE2")

I want to explore co-expression of two genes, ACE2 and TMPRSS2. I used subset() to select the relevant expression and cell annotation values, then created a tibble with gene, cell, and expression columns.

The rep() function replicates each element of its first argument the number of times indicated by it’s second argument; explore it’s behavior with simple examples, e.g.,
```
1:3
## [1] 1 2 3

rep(1:3, 2)
## [1] 1 2 3 1 2 3

rep(1:3, each = 2)
## [1] 1 1 2 2 3 3
```

When a two-dimensional array is coerced to a vector, it is ‘flattened’ by appending columns

m <- matrix(1:12, nrow = 4)
m
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12

as.vector(m)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12

Putting this together

two_genes <- subset(sce, rownames(sce) %in% c("ACE2", "TMPRSS2"))
tbl <- tibble(
    gene = rep(rownames(two_genes), ncol(two_genes)),
    cell = rep(colnames(two_genes), each = nrow(two_genes)),
    expression = as.vector(assay(two_genes))
)
tbl
## # A tibble: 481,022 x 3
##    gene    cell            expression
##    <chr>   <chr>                <dbl>
##  1 ACE2    1B_cell57_HU28       0    
##  2 TMPRSS2 1B_cell57_HU28       0.600
##  3 ACE2    1B_cell418_HU28      0    
##  4 TMPRSS2 1B_cell418_HU28      0.769
##  5 ACE2    1B_cell611_HU28      0    
##  6 TMPRSS2 1B_cell611_HU28      0    
##  7 ACE2    1B_cell617_HU28      0    
##  8 TMPRSS2 1B_cell617_HU28      0    
##  9 ACE2    1B_cell885_HU28      0    
## 10 TMPRSS2 1B_cell885_HU28      1.09 
## # … with 481,012 more rows

I’d like to plot the expression of one gene against the expression of the other. It is convenient to transform the tidy tibble we just created to have a separate column for each gene. Use tidyr::pivot_wider() for this operation

wider_tbl <-
    tbl %>%
    mutate(expression = log(expression)) %>%
    tidyr::pivot_wider(names_from=gene, values_from=expression)

wider_tbl
## # A tibble: 240,511 x 3
##    cell              ACE2   TMPRSS2
##    <chr>            <dbl>     <dbl>
##  1 1B_cell57_HU28    -Inf   -0.510 
##  2 1B_cell418_HU28   -Inf   -0.263 
##  3 1B_cell611_HU28   -Inf -Inf     
##  4 1B_cell617_HU28   -Inf -Inf     
##  5 1B_cell885_HU28   -Inf    0.0879
##  6 1B_cell968_HU28   -Inf -Inf     
##  7 1B_cell1008_HU28  -Inf -Inf     
##  8 1B_cell1060_HU28  -Inf   -0.300 
##  9 1B_cell1156_HU28  -Inf   -0.675 
## 10 1B_cell1370_HU28  -Inf -Inf     
## # … with 240,501 more rows

We’re ready to plot expression values of TMPRSS2 against ACE2. If we were to use geom_point(), many points would over-plot and we would have a difficult time seeing overall pattern. So instead we use geom_hex() to divide the plot area into hexagonal regions, and to fill each region with a color that indicates the number of points in the region. coord_fix() ensures that the x- and y- axes have the same scale, and the plot is printed so that the ‘aspect ratio’ (height / width) is 1.

wider_tbl %>%
    ggplot(aes(ACE2, TMPRSS2)) +
    geom_hex() +
    coord_fixed()
## Warning: Removed 239728 rows containing non-finite values (stat_binhex).

The plot shows a positive relation between expression of the two genes.

A major caveat to any conclusion of co-expression from the figure is that covariates may introduce spurious correlations. For instance, sequencing of one cell can be more efficient than sequencing of another, so all genes in one cell might appear to have higher expression than in another cell, but for completely un-interesting reasons (i.e., a technological artifact of sequencing efficiency). We would need to do a more sophisticated analysis to reach robust conclusions.

5.5 Day 33 (Friday) Zoom check-in

Biological background

RNA virus (source, license: CC BY-SA 4.0)
Key domains (source, license: CC BY 2)

5.5.1 Monday troubles (10 minutes)

Where we ran into trouble on Monday

Load specific packages

library(Biostrings)
options(Biostrings.coloring = FALSE)
library(GenomicRanges)

DNA sequences and genomic ranges

sequences <- c(chr1 = "CTGACT", chr2 = "CTGGGAACT")
dna <- DNAStringSet(sequences)

regions_of_interest <- c("chr1:1-3", "chr1:3-6", "chr2:4-6")
roi <- GRanges(regions_of_interest)

So far so good, but…

> getSeq(dna, roi)
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function 'getSeq' for signature '"DNAStringSet"'

What went wrong? The getSeq() generic is defined in Biostrings, but the getSeq() method getSeq,DNAStringSet-method is not! Hence ‘unable to find an inherited method … for signature …’.
Why did it work in the QuaRantine document? I had also loaded another package
```
library(genbankr)
```
- genbankr depends on the package BSgenome
- BSgenome defines a method getSeq,DNAStringSet-method.
- The generic in Biostrings now finds the method in BSgenome!
```
getSeq(dna, roi)
##   A DNAStringSet instance of length 3
##     width seq
## [1]     3 CTG
## [2]     4 GACT
## [3]     3 GGA
```

5.5.2 Review and trouble shoot (40 minutes)

Bioconductor

Useful resource: statistical analysis and comprehension of high-throughput genomic data.
BiocManager::install(), BiocManager::version(), BiocManager::valid()
Key concepts: package, vignette, class, method.

SARS-CoV-2 sequence data

Querying Entrez for NC_045512 and parsing GenBank records

library(genbankr)
gb <- readGenBank(GBAccession("NC_045512"))
## No exons read from genbank file. Assuming sections of CDS are full exons
## No transcript features (mRNA) found, using spans of CDSs
gb
## GenBank Annotations
## Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome. 
## Accession: NC_045512
## 1 Sequence(s) with total length length: 29903
## 11 genes
## 12 transcripts
## 13 exons/cds elements
## 0 variations
## 34 other features

Extract whole-genome sequence and annotation

ref <- getSeq(gb)
ref
##   A DNAStringSet instance of length 1
##     width seq                                               names               
## [1] 29903 ATTAAAGGTTTATACCTTCCCAG...AAAAAAAAAAAAAAAAAAAAAAA Severe acute resp...

cds <- cds(gb)
getSeq(ref, subset(cds, gene == "S"))
##   A DNAStringSet instance of length 1
##     width seq
## [1]  3822 ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCT...GTGCTCAAAGGAGTCAAATTACATTACACATAA

Multiple alignment and phylogenetic tree construction

library(DECIPHER)
url <- "https://raw.githubusercontent.com/mtmorgan/QuaRantine/master/assets/05-SARS-CoV2-S.fasta"
fasta_file <- file.path(workdir, basename(url))
if ( !file.exists(fasta_file) )
    download.file(url, fasta_file)
dna <- readDNAStringSet(fasta_file)
aa <- dna %>% translate(if.fuzzy.codon = "solve") %>% unique()
aln <- AlignSeqs(aa, verbose = FALSE)
dist <- DistanceMatrix(aln, verbose = FALSE)
dendrogram <- 
    IdClusters(dist, method = "NJ", type = "dendrogram", verbose = FALSE)

Save intermediate result and take a deep breath

dendrogram_file <- file.path(workdir, "SARS-CoV2-S.newick")
WriteDendrogram(dendrogram, dendrogram_file, quoteLabels = FALSE)

Visualization

library(ggtree)
tree <- treeio::read.newick(dendrogram_file)
ggtree(tree) +
    geom_tiplab(
        aes(subset = branch.length > .001, label = label), 
        size = 3
    ) +
    ggplot2::ggtitle("SARS-CoV-2 S-locus phylogeny; genbank, 10 May 2020")

Single cell expression

Retrieve script and file to a local working directory

workdir <- "workdir"
if (!dir.exists(workdir))
    dir.create(workdir)

url <- "https://raw.githubusercontent.com/mtmorgan/QuaRantine/master/assets/05-scRNAseq-import.R"
destination <- file.path(workdir, basename(url))
if (!file.exists(destination))
    download.file(url, destination)

url <- "https://github.com/mtmorgan/QuaRantine/raw/master/assets/05-internal_nonsmokerslung.h5ad"
h5_file <- file.path(workdir, basename(url))
if (!file.exists(h5_file))
    download.file(url, h5_file)

Input and explore cell annotations and expression values

obs <- read_obs(h5_file)      # tibble
x <- read_expression(h5_file) # sparse matrix

Coordinate cell annotation and expression values as SingleCellExperiment

sce <- SingleCellExperiment(x, colData = obs)
sce
## class: SingleCellExperiment 
## dim: 22 240511 
## metadata(0):
## assays(1): ''
## rownames(22): ACE2 TMPRSS2 ... CTSC CTSE
## rowData names(0):
## colnames(240511): 1B_cell57_HU28 1B_cell418_HU28 ...
##   RP2_neg_cell6794209_HU52 RP2_neg_cell6794225_HU52
## colData names(7): Age Cell Type ... Tissue cell_subset
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(0):

Etc., e.g., cell subtype expression

library(ggplot2)

ace2 <- subset(sce, rownames(sce) %in% "ACE2")
tbl <- tibble(
    cell_subset = ace2$cell_subset,
    expression = as.vector(assay(ace2))
)

tbl %>%
    filter(expression > 0) %>%
    ggplot(aes(cell_subset, expression)) +
        geom_boxplot() +
        coord_flip() +
        ggtitle("ACE2")

Many more single cell resources described in Orchestrating Single Cell Analysis with Bioconductor.

5.5.3 This weekend (10 minutes)

Saturday

Tree visualization: color lineages based on observed attributes

Sunday

Package exploration: identifying solutions for your research challenges.

5.6 Day 34 Revising tree visualization

The goal for today is to return to the Wednesday challenge: to color the S-locus gene phylogeny based on the country the lineage was sequenced in. The steps are as follows:

Find nodes representing just a single sequence – ‘singletons’
Use GenBank country to identify which country these were sequenced from
Color lineages according to singleton country of origin

5.7 Day 35 Exploring Bioconductor

The exciting part of science is that we are always on the edge – on our own edge of new discovery, where no one has gone before. We’ve barely scratched the surface of Bioconductor this week. Explore available packages for solutions that might be appropriate for your particular problem.

Week 5 Bioinformatics with Bioconductor

5.1 Day 29 (Monday) Zoom check-in

5.1.1 Review and trouble shoot

5.1.2 Bioconductor, packages, and biological data

5.1.3 This week: SARS-CoV-2 sequence and human tissue-specific gene expression data

5.2 Day 30 DNA sequences and annotations

5.2.1 Setup

5.2.2 Representing DNA sequences

5.2.3 Working with GenBankRecord objects

5.2.4 Genomic ranges and DNA sequences

5.3 Day 31 Sequence alignment and visualization

5.3.1 ‘S’ gene exploration

5.3.2 Sequence alignment

5.3.3 Phylogenetic tree visualization

5.4 Day 32 Single-cell expression data

5.4.1 Setup

5.4.2 Data input and exploration

5.4.3 Coordinating cell and expression data

5.5 Day 33 (Friday) Zoom check-in

5.5.1 Monday troubles (10 minutes)

5.5.2 Review and trouble shoot (40 minutes)

5.5.3 This weekend (10 minutes)

5.6 Day 34 Revising tree visualization

5.7 Day 35 Exploring Bioconductor

5.2.3 Working with `GenBankRecord` objects