Week 5 Bioinformatics with Bioconductor

5.1 Day 29 (Monday) Zoom check-in

5.1.1 Review and trouble shoot

5.1.2 Bioconductor, packages, and biological data

Bioconductor

  • Repository of more than 1900 packages for the ‘analysis and comprehension of high throughput genomic data’

  • Bulk and single cell RNASeq; ChIP and other gene regulation, called variants, flow cytometery, proteomics, …

  • Package discovery using biocViews and workflows

  • Landing pages & vignettes

Packages

Key packages

‘Class’ and ‘method’

Getting help

  • Vignettes, e.g., DESeq2 vignette! browseVignettes(package = "DESeq2")
  • Help pages for classes, e.g., ?DNAStringSet, ?GRanges
  • Discovering methods

  • Looking at symbols made available by packages (after the package is attached to the search() path): ls("package:GenomicRanges")

5.1.3 This week: SARS-CoV-2 sequence and human tissue-specific gene expression data

Biological background

Setup

Biostrings and GenomicRanges

Sequences and annotations

  • Input FASTA files from New York State samples (it makes little biological sense to use just New York samples, but we’ll do it anyway!) to DNAStringSet

  • Work the Genbank accession of the reference sequence

    • DNA sequence
    • Coding sequence annotations
    • Coding sequences

Alignment and visualization

Gene expression

  • Retrieving example data from ExperimentHub
  • Annotating cell types

5.2 Day 30 DNA sequences and annotations

5.2.2 Representing DNA sequences

Sample information

FASTA sequence files

5.2.3 Working with GenBankRecord objects

Data is often represented in complicated formats that do not fit into the tidy ‘table’ format of a CSV file. An example is the information displayed on the SARS-CoV-2 reference genome web page.

Often ‘someone’ has written software to manipulate this data in a more convient way.

Represent the data as GenBankR object

5.2.4 Genomic ranges and DNA sequences

Genomic ranges

Using genomic ranges to extract DNA sequences

5.3 Day 31 Sequence alignment and visualization

5.3.1 ‘S’ gene exploration

Read the ‘S’ gene sequence extracted from all genbank records into a DNAStringSet

Explore the data. How many sequences? How many unique sequences?

Note that the DNA alphabet contains ‘ambiguity’ letters, corresponding to base sequences that were uncertain, e.g., ‘A’ or ‘T’. Check out the help page for ?translate to see options for translating ambiguity codes. How many unique protein sequences?

The table() function counts the number of occurrences of each element, so

shows us that most sequences have 1274 amino acids.

The names(aa) are misleading – these are the identifiers of a representative accession, but several accessions may map to the same identifier. The following code creates a table that indicates which cluster each original identifier belongs to; the names of aa are updated accordingly.

With this information, it’s easy to count the number of times each sequence was represented. The reference sequence belongs to cluster 1.

5.3.2 Sequence alignment

Setup

Perform multiple alignment

Save the tree to the working directory using the ‘Newick’ format

5.3.3 Phylogenetic tree visualization

Setup

Read the data from the Newick-format file created by DECIPHER

Plot the tree, adding tip labels to the longest branches

Color sequences by region

  • Find nodes representing just a single sequence – ‘singletons’
  • Use GenBank country to identify which country these were sequenced from
  • Color lineages according to singleton country of origin

5.4 Day 32 Single-cell expression data

This is a very preliminary foray into COVID-19 single-cell RNA-seq gene expression data. It is derived from a preprint by Muus et al, including a ‘Terra’ (NHGRI) cloud computing workspace.

RNA-seq is a technology for measuring RNA expression levels as a proxy for gene expression. Single-cell RNA seq measures RNA expression levels at the single-cell resolution, sometimes in conjunction with imaging or other spatial information. Often, the overall pattern of expression of a single cell can be used to classify the cell to a particular type, e.g., epithelial.

Many interesting questions present themselves in the context of COVID-19. It would be interesting to investigate patterns of expression of human genes known to interact with the SARS-CoV-2 virus, especially in tissues likely to be involved in the introduction of the virus to the human body. One could imagine further studies identifying, e.g., genetic variants that influence susceptibility or response to infection.

Our goal is to do some preliminary visualization of scRNA seq at human genes whose product is known to interact with virus proteins. A much more comprehensive guide is available as Orchestrating Single Cell Analysis with Bioconductor.

5.4.1 Setup

This module requires a recent version of R and Bioconductor. It also requires some advanced packages that may be difficult to install. Definitely it is necessary that

If those conditions are met, load the following libraries (use BiocManager::install(<pkgs>) if necessary)

Create a working directory for today.

A couple of scripts provide helper functions for working with the data. Download the following file to the working directory

Take a quick look at the downloaded file (e.g., in RStudio File -> open) and verify that it is an R script. Source the script in to your R session so that you have access to the functions in the file.

Download our sample data set

The file is an ‘hdf5’ file format, used to store relatively large data sets in a binary format that allows for faster and more robust input than a CSV file. R and other software packages can read hdf5 files, but Excel, etc.

5.4.2 Data input and exploration

There are two main components to the data. The first describes each observation (cell).

If R complains that it can’t find read_obs, then remember to source() the script downloaded earlier.

This experiment measured gene expression in 240511 cells. Use standard dplyr commands to explore information about the cells

Single-cell RNA assays expression across all genes, but the current data set has been restricted to 22 genes relevant to virus interaction. Read in the expression levels of these 22 genes across all cells.

This is a matrix of 22 x 240511 values, one entry for each gene and cell. However, single-cell assays are ‘sparse’, with a large number of zeros. To take advantage of this sparse data, R has represented the matrix as a special sparse matrix. The . in the display above represent zeros; the object x only contains information about the non-zero values.

How sparse is the data, i.e., what is the fraction of zeros in the data?

5.4.3 Coordinating cell and expression data

Notice that we have two separate data structures – a large matrix of expression values, and a tibble describing each cell. It seems error-prone to manage these data separately, e.g., what if we remove some rows from the cell annotations without removing the corresponding columns from the expression matrix?

A better strategy is to introduce an object that can manage both parts of the data in a coordinated way. This container is provided by the SingleCellExperiment package

Use the constructor SingleCellExperiment() to create this object. The basic metaphor is that a SingleCellExperiment is like a matrix, but with column (and row) annotations. The column annotations are called colData.

Create and view a SingleCellExperiment

Use subset() to select particular rows and columns. The subset operates in a coordinated fashion on both the cell annotations and expression data, so there is no risk of ‘book-keeping’ error.

For the exercises below, focus on the ACE2 gene. ACE2 is a gene on the outer surface of the cell membrane in lung and other tissue.

We will perform some simple visualization, as we start to explore the data. Create a tibble containing the cell_subset annotation and the expression values of the ACE2 gene. We make use of $ to access columns of the cell annotation (colData) and assay() to access the expression data. as.vector() changes the one-dimensional array returned by assay() into a numeric vector.

Summarize the number of cells, non-zero expression values, and average expression of ACE2 for each cell subset.

Pictures speak louder than words. Use the ggplot2 package

to visualize as a horizontal boxplot the ACE2 expression values of each cell subtype. I had initially used geom_boxplot() without coord_flip(), but the cell subtype labels overlapped. I experimented with rotating the labels, but that just made them difficult to read. So coord_flip() makes the boxplot horizontal rather than vertical, the labels are easy to read, and the figure easy to interpret.

I want to explore co-expression of two genes, ACE2 and TMPRSS2. I used subset() to select the relevant expression and cell annotation values, then created a tibble with gene, cell, and expression columns.

I’d like to plot the expression of one gene against the expression of the other. It is convenient to transform the tidy tibble we just created to have a separate column for each gene. Use tidyr::pivot_wider() for this operation

We’re ready to plot expression values of TMPRSS2 against ACE2. If we were to use geom_point(), many points would over-plot and we would have a difficult time seeing overall pattern. So instead we use geom_hex() to divide the plot area into hexagonal regions, and to fill each region with a color that indicates the number of points in the region. coord_fix() ensures that the x- and y- axes have the same scale, and the plot is printed so that the ‘aspect ratio’ (height / width) is 1.

The plot shows a positive relation between expression of the two genes.

A major caveat to any conclusion of co-expression from the figure is that covariates may introduce spurious correlations. For instance, sequencing of one cell can be more efficient than sequencing of another, so all genes in one cell might appear to have higher expression than in another cell, but for completely un-interesting reasons (i.e., a technological artifact of sequencing efficiency). We would need to do a more sophisticated analysis to reach robust conclusions.

5.5 Day 33 (Friday) Zoom check-in

Biological background

5.5.1 Monday troubles (10 minutes)

Where we ran into trouble on Monday

5.5.2 Review and trouble shoot (40 minutes)

Bioconductor

  • Useful resource: statistical analysis and comprehension of high-throughput genomic data.

  • BiocManager::install(), BiocManager::version(), BiocManager::valid()

  • Key concepts: package, vignette, class, method.

SARS-CoV-2 sequence data

Single cell expression

5.5.3 This weekend (10 minutes)

Saturday

  • Tree visualization: color lineages based on observed attributes

Sunday

  • Package exploration: identifying solutions for your research challenges.

5.6 Day 34 Revising tree visualization

The goal for today is to return to the Wednesday challenge: to color the S-locus gene phylogeny based on the country the lineage was sequenced in. The steps are as follows:

  • Find nodes representing just a single sequence – ‘singletons’
  • Use GenBank country to identify which country these were sequenced from
  • Color lineages according to singleton country of origin

5.7 Day 35 Exploring Bioconductor

The exciting part of science is that we are always on the edge – on our own edge of new discovery, where no one has gone before. We’ve barely scratched the surface of Bioconductor this week. Explore available packages for solutions that might be appropriate for your particular problem.