Week 6 Collaboration

In this final week of QuaRantine, we focus on communicating reproducible analyses with our peers. We will

  • Write ‘markdown’ vignettes to describe and share our analyses

  • Create and document functions that encapsulate common tasks or steps in a work flow

  • Combine vignettes and documented functions into an R package that can be easily shared with others

Instead of counting upward from the begining of our quarantine, we count down to the end.

6.1 5 Days (Monday) Zoom check-in

6.1.1 Weekend review (5 minutes)

6.1.2 Vignettes (25 minutes; Shawn)

Vignette preparation

Create an R markdown document

  • File -> New file -> R markdown...

  • Enter a Title and your name as Author

  • Use HTML as default output

  • RStudio creates a template

  • Save it (e.g., click the floppy disk icon) in the Week-06 folder. Use a title such as mtcars_regression.Rmd.

Preview (knit) the markdown template

  • Click the knit button

  • Install the knitr package if button is missing

Customize the template

  • Let’s add a demonstration of ggplot2 using the mtcars data set:

    ---
    title: "multiple regression plots"
    author: "Shawn Matott"
    date: "5/13/2020"
    output: html_document
    ---
    
    ## Muliple Regression Plots
    
    - Scatterplot by category with regression lines (using ggplot2)
    
    ```{r}
    library(ggplot2)
    
    data("mtcars")
    
    ## convert cyl to a factor-level object
    mtcars <-
        mtcars %>%
        mutate(
            cyl = factor(
                cyl,
                levels = c(4, 6, 8),
                labels = c("4 cyl", "6 cyl", "8 cyl")
            )
        )
    
    ggplot(mtcars, aes(wt, mpg, color=cyl))+
        geom_point()+
        geom_smooth(method="lm")+
        labs(title="Regression of MPG on Weight by # Cylinders",
             x = "Weight",
             y = "Miles per Gallon",
             color = "Cylinders")
    ```
    

Preview your edits

  • Save your changes

  • Click the knit button

After a bit of processing, the Rmarkdown is rendered as an .html page. Notice how the R code block has been evaluated and resulting console output and plot are included in the .html. This is great because we can be sure that the vignette is actually working (e.g. no syntax errors or other coding problems).

There’s a related package called bookdown that allows you to assemble documents and publish them directly to the web for all to see!

6.1.3 Writing and documenting R functions (25 minutes)

Writing our own functions can be useful in several situations

  • Capturing a common operation, such as retieving data from an internet resource.

  • Representing a transformation that can be applied to different parts of the data, e.g., calculuating the difference in new cases, applied to different counties or states

  • Summarizing overall steps in a work flow, e.g., translating, aligning, and clustering a DNA sequence.

As an example, the following function takes as input a vector representing the cummulative number of observations (e.g., new cases, deaths) over successive days. It calculates the difference in cases, and then uses stats::filter() to return the trailing average of the difference

Here’s a vector and the result of applying the function

This function could be used, for instance, to calculate the seven-day average number of new cases in each county of the US.

Let’s formalize this function, including documentation, in a separate file.

  • Use RStudio File -> New Script to create a file workdir/Week-06/trailing_difference.R

  • Document the function using roxygen formatting. This involves placing special comment characters #' at the start of lines, and using ‘tags’ that describe different parts of the function.

    • @title: a one-line description of the help page

    • @description: a short (paragraph-length?) summary of what capabilities the help page documents

    • @param: one for each argument, describing the value (e.g., ‘numeric()’ and meaning (e.g., ‘vector of observations over successive days’)

    • @return: the value returned by the function

    • @examples: valid R code illustrating how the function works.

We use the special tags @importFrom to tell R that we want to use the filter from the stats package, and @export to indicate that the function is meant for the ‘end user’ (i.e., us!).

```
#' @title Trailing difference of a vector
#'
#' @description Calculate the difference of successive elements of a
#'     vector, and then the running average of the difference. The
#'     width of the difference can be specied as an argument.
#'
#' @param x numeric() vector of observations.
#'
#' @param n_days scalar (length 1) numeric() number of days used to
#'     calculate the trailing average. The length of `x` should be
#'     greater than `n_days`.
#'
#' @return numeric() vector with the same length of x, representing
#'     the n_day average difference in x. Initial values are `NA`.
#'
#' @examples
#' obs <- c(1, 2, 4, 6, 7, 12, 14, 18, 19, 20, 20, 21, 22, 24)
#' trailing_difference(obs, n_day = 4
#'
#' @importFrom stats filter
#'
#' @export
trailing_difference <- function(x, n_days = 7) {
    diff <- diff(c(0, x))
    average_weights <- rep(1 / n_days, n_days)
    lag <- stats::filter(diff, average_weights, sides = 1)
    as.vector(lag)
}
```
    

6.1.4 Create an R pacakge!

The vignette and documents function are great for our own use, but we’d really like to share these with our colleagues so that they too can benefit from our work. This is very easy to do.

  • In RStudio, choose File -> New project -> New directory -> R package
  • Enter a package name, e.g., MyQuarantine, and use the add button to select the vignette mtcars_regression.Rmd and R trailing_difference.R files.
  • Choose a location for your package, select the ‘Open in a new session’ button, and click ‘Create project’.

The end result is a directory structure that actually represents an R package that you can build and share with your colleagues. The directory contains

  • An R/ folder, containing your R source code
  • A vignettes/ folder, containing the vignette you wrote.

Later in the week we’ll see how to

  • Edit the DESCRIPTION file to describe your package
  • Generate help pages in the man/ folder from the roxygen comments in the .R files.
  • Create the knit vignette from the source .Rmd file.
  • Build the package for distribution and sharing with others.

6.2 4 Days Write a vignette!

Choose one week from the quarantine, and write a vignette summarizing the material. Start with an outline using using level 1 # and level 2 ## headings as well as bulleted lists / short paragraphs. One could think of this as structured more-or-less along the lines of classic scientific paper, with introductino, methods, results (use case), and discussion

```
# Introduction to tidy data

# Tools for working with tidy data

## Key packages

- dplyr
- ggplot2

## 5 Essential functions

- `mutate()`
- `filter()`
- `select()`
- `group_by()` / `ungroup()`

# Use case

E.g., summarizing and visualizing cell subtypes in Week 5.

- narrative text describing what steps are being taken
- include code chunks for reproducible analysis
- include figures and / or summary tables to help communicate your results

# Discussion

A paragraph on strengths and limitations of the tidy approach

Narrative on use case / insights

Future directions

## Session information

- include a code chunk that has the command `sessionInfo()`; this 
  documents  the specfici versions of packages you used.
```

6.3 3 Days Create documented, reusable functions!

Create functions that represents key operations (e.g., data retrieval), data transformations (e.g., trailing difference in new cases), or that integrate several related steps in an analysis (e.g., translate(), unique(), align, and cluster DNA sequences).

Place the function(s) in separate files (one or several functions per file). Document them using the notation introduced on Monday.

Make sure the functions work by writing simple examples.

6.4 2 Days Share your work as a package!

Use the steps outlined on Monday to create an R package from Tuesday’s vignette and Wednesday’s functions.

6.4.1 DESCRIPTION

Edit the DESCRIPTION file to include a Title and Description. Update the Author infromation to include your name. Add as maintainer your name and email address, using the format Ima Maintainer <my@email.com>.

6.4.2 Documentation

Make sure that getwd() returns the path to the package. Run the command (it may be necessary to install additional packages.

There may be problems with the roxygen that you wrote; investigate how to fix these.

This creates file(s) in the man/ directory that transform the ‘roxygen’ comments (lines with #') in the R/ file. Open one of the man files and use the ‘preview’ button to see the help page.

6.4.3 Vignettes

Add the following lines to the end of the DESCRIPTION file:

Suggests: knitr
VignetteBuilder: knitr

Update the ‘yaml’ at the top of the vignette

---
title: "Multiple regression plots"
author: "Shawn Matott"
date: "5/13/2020"
output: html_document
vignette: |
    %\VignetteIndexEntry{ Multiple Regression Plots }
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

Use the following command to build the vignette

6.4.4 Build and share

With all the pieces now in place, choose Build -> Build Source Package. This creates a single file with a name like MyQuarantine_0.1.0.tar.gz that you can share with colleagues – use

install.packages("path/to/MyQuarantine_0.1.0.tar.gz", repos = NULL)

to install your pacakge!

6.5 Today! (Friday) Zoom check-in

6.5.1 Review and troubleshoot

Vignettes

  • Any vignettes to share?
  • Shruti’s QuaRantine Learnings

Documented functions

Packages

6.5.2 Course review

Basic R

  • Numeric, character, logical vectors
  • subsetting, applying functions
  • The data.frame

‘Tidy’ R and visualization

  • The tibble and pipe (%>%)
  • readr: read_csv()
  • dplyr: mutate(), filter(), select(), group_by(), summarize()
  • ggplot2
  • Visualizing the pandemic locally and globally

Machine learning

  • Underlying concepts
  • Support vector machines, k-nearest neighbors, KNN
  • Accuracy and confusion matrix; ROC curves and AUC

Bioinformatics

  • Bioconductor packages
  • Classes, generics, and methods for representing sequences and ranges
  • Virus phylogeny
  • Host gene expression

Reproducible communication

  • Vignettes
  • Documented functions
  • Packages

6.5.3 Feedback & next steps