Week 6 Collaboration
In this final week of QuaRantine, we focus on communicating reproducible analyses with our peers. We will
Write ‘markdown’ vignettes to describe and share our analyses
Create and document functions that encapsulate common tasks or steps in a work flow
Combine vignettes and documented functions into an R package that can be easily shared with others
Instead of counting upward from the begining of our quarantine, we count down to the end.
6.1 5 Days (Monday) Zoom check-in
6.1.1 Weekend review (5 minutes)
6.1.2 Vignettes (25 minutes; Shawn)
Vignette preparation
Create a directory
Week-06
in the working directory
Create an R markdown document
File
->New file
->R markdown...
Enter a
Title
and your name asAuthor
Use
HTML
as default outputRStudio creates a template
Save it (e.g., click the floppy disk icon) in the
Week-06
folder. Use a title such asmtcars_regression.Rmd
.
Preview (knit) the markdown template
Click the
knit
buttonInstall the
knitr
package if button is missing
Customize the template
Let’s add a demonstration of
ggplot2
using themtcars
data set:--- title: "multiple regression plots" author: "Shawn Matott" date: "5/13/2020" output: html_document --- ## Muliple Regression Plots - Scatterplot by category with regression lines (using ggplot2) ```{r} library(ggplot2) data("mtcars") ## convert cyl to a factor-level object mtcars <- mtcars %>% mutate( cyl = factor( cyl, levels = c(4, 6, 8), labels = c("4 cyl", "6 cyl", "8 cyl") ) ) ggplot(mtcars, aes(wt, mpg, color=cyl))+ geom_point()+ geom_smooth(method="lm")+ labs(title="Regression of MPG on Weight by # Cylinders", x = "Weight", y = "Miles per Gallon", color = "Cylinders") ```
Preview your edits
Save your changes
Click the
knit
button
After a bit of processing, the Rmarkdown
is rendered as an .html
page. Notice how the R code block has been evaluated and resulting console output and plot are included in the .html
. This is great because we can be sure that the vignette is actually working (e.g. no syntax errors or other coding problems).
There’s a related package called bookdown
that allows you to assemble documents and publish them directly to the web for all to see!
6.1.3 Writing and documenting R functions (25 minutes)
Writing our own functions can be useful in several situations
Capturing a common operation, such as retieving data from an internet resource.
Representing a transformation that can be applied to different parts of the data, e.g., calculuating the difference in new cases, applied to different counties or states
Summarizing overall steps in a work flow, e.g., translating, aligning, and clustering a DNA sequence.
As an example, the following function takes as input a vector representing the cummulative number of observations (e.g., new cases, deaths) over successive days. It calculates the difference in cases, and then uses stats::filter()
to return the trailing average of the difference
trailing_difference <- function(x, n_days = 7) {
diff <- diff(c(0, x))
average_weights <- rep(1 / n_days, n_days)
lag <- stats::filter(diff, average_weights, sides = 1)
as.numeric(lag)
}
Here’s a vector and the result of applying the function
obs <- c(1, 2, 4, 6, 7, 12, 14, 18, 19, 20, 20, 21, 22, 24)
trailing_difference(obs, n_days = 4)
## [1] NA NA NA 1.50 1.50 2.50 2.50 3.00 3.00 2.00 1.50 0.75 0.75 1.00
This function could be used, for instance, to calculate the seven-day average number of new cases in each county of the US.
Let’s formalize this function, including documentation, in a separate file.
Use RStudio File -> New Script to create a file
workdir/Week-06/trailing_difference.R
Document the function using
roxygen
formatting. This involves placing special comment characters#'
at the start of lines, and using ‘tags’ that describe different parts of the function.@title
: a one-line description of the help page@description
: a short (paragraph-length?) summary of what capabilities the help page documents@param
: one for each argument, describing the value (e.g., ‘numeric()
’ and meaning (e.g., ‘vector of observations over successive days’)@return
: the value returned by the function@examples
: valid R code illustrating how the function works.
We use the special tags @importFrom
to tell R that we want to use the filter
from the stats package, and @export
to indicate that the function is meant for the ‘end user’ (i.e., us!).
```
#' @title Trailing difference of a vector
#'
#' @description Calculate the difference of successive elements of a
#' vector, and then the running average of the difference. The
#' width of the difference can be specied as an argument.
#'
#' @param x numeric() vector of observations.
#'
#' @param n_days scalar (length 1) numeric() number of days used to
#' calculate the trailing average. The length of `x` should be
#' greater than `n_days`.
#'
#' @return numeric() vector with the same length of x, representing
#' the n_day average difference in x. Initial values are `NA`.
#'
#' @examples
#' obs <- c(1, 2, 4, 6, 7, 12, 14, 18, 19, 20, 20, 21, 22, 24)
#' trailing_difference(obs, n_day = 4
#'
#' @importFrom stats filter
#'
#' @export
trailing_difference <- function(x, n_days = 7) {
diff <- diff(c(0, x))
average_weights <- rep(1 / n_days, n_days)
lag <- stats::filter(diff, average_weights, sides = 1)
as.vector(lag)
}
```
6.1.4 Create an R pacakge!
The vignette and documents function are great for our own use, but we’d really like to share these with our colleagues so that they too can benefit from our work. This is very easy to do.
- In RStudio, choose File -> New project -> New directory -> R package
- Enter a package name, e.g.,
MyQuarantine
, and use theadd
button to select the vignettemtcars_regression.Rmd
and Rtrailing_difference.R
files. - Choose a location for your package, select the ‘Open in a new session’ button, and click ‘Create project’.
The end result is a directory structure that actually represents an R package that you can build and share with your colleagues. The directory contains
- An R/ folder, containing your R source code
- A vignettes/ folder, containing the vignette you wrote.
Later in the week we’ll see how to
- Edit the
DESCRIPTION
file to describe your package - Generate help pages in the
man/
folder from the roxygen comments in the.R
files. - Create the knit vignette from the source .Rmd file.
- Build the package for distribution and sharing with others.
6.2 4 Days Write a vignette!
Choose one week from the quarantine, and write a vignette summarizing the material. Start with an outline using using level 1 #
and level 2 ##
headings as well as bulleted lists / short paragraphs. One could think of this as structured more-or-less along the lines of classic scientific paper, with introductino, methods, results (use case), and discussion
```
# Introduction to tidy data
# Tools for working with tidy data
## Key packages
- dplyr
- ggplot2
## 5 Essential functions
- `mutate()`
- `filter()`
- `select()`
- `group_by()` / `ungroup()`
# Use case
E.g., summarizing and visualizing cell subtypes in Week 5.
- narrative text describing what steps are being taken
- include code chunks for reproducible analysis
- include figures and / or summary tables to help communicate your results
# Discussion
A paragraph on strengths and limitations of the tidy approach
Narrative on use case / insights
Future directions
## Session information
- include a code chunk that has the command `sessionInfo()`; this
documents the specfici versions of packages you used.
```
6.3 3 Days Create documented, reusable functions!
Create functions that represents key operations (e.g., data retrieval), data transformations (e.g., trailing difference in new cases), or that integrate several related steps in an analysis (e.g., translate()
, unique()
, align, and cluster DNA sequences).
Place the function(s) in separate files (one or several functions per file). Document them using the notation introduced on Monday.
Make sure the functions work by writing simple examples.
6.5 Today! (Friday) Zoom check-in
6.5.1 Review and troubleshoot
Vignettes
- Any vignettes to share?
- Shruti’s QuaRantine Learnings
Documented functions
Packages
6.5.2 Course review
Basic R
- Numeric, character, logical vectors
- subsetting, applying functions
- The data.frame
‘Tidy’ R and visualization
- The tibble and pipe (
%>%
) - readr:
read_csv()
- dplyr:
mutate()
,filter()
,select()
,group_by()
,summarize()
- ggplot2
- Visualizing the pandemic locally and globally
Machine learning
- Underlying concepts
- Support vector machines, k-nearest neighbors, KNN
- Accuracy and confusion matrix; ROC curves and AUC
Bioinformatics
- Bioconductor packages
- Classes, generics, and methods for representing sequences and ranges
- Virus phylogeny
- Host gene expression
Reproducible communication
- Vignettes
- Documented functions
- Packages