40 Days and 40 Nights
2020-05-22
Motivation
This is a WORK IN PROGRESS.
This course was suggested and enabled by Adam Kisailus and Richard Hershberger. It is available for Roswell Park graduate students.
Introduction
The word ‘quarantine’ dates from the 1660s and refers to the forty days (Italian quaranta giorni) that a ship suspected of carrying disease was kept in isolation.
What to do in a quarantine? The astronaut Scott Kelly spent nearly a year on the International Space Station. In a New York Times opinion piece he says, among other things, that ‘you need a hobby’, and what better hobby than a useful one? Let’s take the opportunity provided by COVID-19 to learn R for statistical analysis and comprehension of data. Who knows, it may be useful after all this is over!
What to expect
We’ll meet via Zoom twice a week, Mondays and Fridays, for one hour. We’ll use this time to make sure everyone is making progress, and to introduce new or more difficult topics. On other days we’ll have short exercises and activities that hopefully provide an opportunity to learn at your own speed.
We haven’t thought this through much, but roughly we might cover:
Week 1: We’ll start with the basics of installing and using R. We’ll set up R and RStudio on your local computer, or if that doesn’t work use a cloud-based RStudio. We’ll learn the basics of R – numeric, character, logical, and other vectors; variables; and slightly more complicated representations of ‘factors’ and dates. We’ll also use RStudio to write a script that allows us to easily re-create an analysis, illustrating the powerful concept of reproducible research.
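As a taste of the Week 1 material, here is a minimal sketch of the vector types mentioned above. All variable names and values are invented for illustration; they are not part of the course data.

```r
# Atomic vectors: every element of a vector has the same type
heights <- c(1.62, 1.75, 1.80)            # numeric
people <- c("Ana", "Ben", "Cy")           # character
is_tall <- heights > 1.70                 # logical, computed element-wise

# A factor represents a categorical variable with a fixed set of levels
group <- factor(
    c("control", "treated", "control"),
    levels = c("control", "treated")
)

# Dates support arithmetic in units of days
start <- as.Date("2020-05-22")
end <- start + 40

# Use the logical vector to select elements, then summarize
mean(heights[is_tall])
table(group)
format(end, "%Y-%m-%d")
```

Saving lines like these in a script in RStudio, rather than typing them at the console, is what makes the analysis easy to re-run later.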
Week 2: The data.frame. This week is all about R’s data.frame, a versatile way of representing and manipulating a table (like an Excel spreadsheet) of data. We’ll learn how to create, write, and read a data.frame; how to go from data in a spreadsheet in Excel to a data.frame in R; and how to perform simple manipulations on a data.frame, like creating a subset of data, summarizing values in a column, and summarizing values in one column based on a grouping variable in another column.

    url = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
    cases <- read.csv(url)
    erie <- subset(cases, county == "Erie" & state == "New York")
    tail(erie)
    ##              date county    state  fips cases deaths
    ## 148952 2020-05-16   Erie New York 36029  4867    428
    ## 151883 2020-05-17   Erie New York 36029  4954    438
    ## 154819 2020-05-18   Erie New York 36029  4993    444
    ## 157757 2020-05-19   Erie New York 36029  5037    450
    ## 160706 2020-05-20   Erie New York 36029  5131    455
    ## 163659 2020-05-21   Erie New York 36029  5270    463
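The create, write, read, subset, and summarize steps listed for Week 2 can also be sketched on a small table that needs no network access. The data below are toy values invented for illustration, and the object names are assumptions rather than anything used in the course.

```r
# Toy data, invented for illustration only
df <- data.frame(
    county = c("A", "A", "B", "B"),
    day = c(1, 2, 1, 2),
    cases = c(10, 15, 3, 8)
)

# Write the data.frame to a temporary CSV file and read it back
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)
df2 <- read.csv(path)

# Create a subset of rows
county_a <- subset(df2, county == "A")

# Summarize the values in one column
max_cases <- max(df2$cases)

# Summarize one column based on a grouping variable in another column
by_county <- aggregate(cases ~ county, data = df2, FUN = max)
by_county
```

Here aggregate() with a formula is one base R way to do a grouped summary; tapply() would be another reasonable choice.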
Week 3: Packages for extending R. A great strength of R is its extensibility through packages. We’ll learn about CRAN, and install and use the ‘tidyverse’ suite of packages. The tidyverse provides us with an alternative set of tools for working with tabular data, and we’ll use publicly available data to explore the spread of COVID-19 in the US. We’ll read, filter, mutate (change), and select subsets of the data, and group data by one column (e.g., ‘state’) to create summaries (e.g., cases per state). We’ll also start to explore data visualization, creating our first plots of the spread of COVID-19.
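A rough sketch of the filter, select, group, summarize, and mutate steps described for Week 3, assuming the dplyr package (part of the tidyverse) has already been installed from CRAN, e.g. with install.packages("tidyverse"). The table and its values are hypothetical toy data, not the real COVID-19 counts.

```r
library(dplyr)

# Toy county-level data, invented for illustration only
counties <- tibble(
    state = c("X", "X", "Y", "Y"),
    county = c("a", "b", "c", "d"),
    cases = c(100, 50, 20, 5),
    deaths = c(10, 2, 1, 0)
)

per_state <- counties %>%
    filter(cases >= 10) %>%                  # keep counties with at least 10 cases
    select(state, cases, deaths) %>%         # keep only the columns we need
    group_by(state) %>%                      # group rows by state
    summarize(cases = sum(cases), deaths = sum(deaths)) %>%
    mutate(fatality = deaths / cases)        # add a derived column

per_state
```

Each verb takes a table and returns a table, which is why the steps chain together with the pipe.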
Week 4: Machine learning. This week will develop basic machine learning models for exploring data.
Week 5: Bioinformatic analysis with Bioconductor. Bioconductor is a collection of more than 1800 R packages for the statistical analysis and comprehension of high-throughput genomic data. We’ll use Bioconductor to look at COVID-19 genome sequences, and to explore emerging genomic data relevant to the virus.
Week 6: COVID-19 has really shown the value of open data and collaboration. In the final week of our quarantine, we’ll explore collaboration. We’ll learn about writing ‘markdown’ vignettes (reports) to share our results with others, such as our lab colleagues. We’ll write and document functions so that we can easily re-do steps in an analysis. And we will synthesize the vignettes and functions into a package for documenting and sharing our work.
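To give a sense of what writing and documenting a function might look like, here is a minimal sketch. The function name new_cases() is hypothetical, and the comment style assumes the roxygen2 convention commonly used when building R packages; the course may well take a different approach.

```r
#' Daily new cases from cumulative counts
#'
#' @param cumulative numeric vector of cumulative case counts, in date order.
#' @return numeric vector, the same length as `cumulative`, of new cases
#'     per day. The first element is the first cumulative count itself.
#' @examples
#' new_cases(c(4867, 4954, 4993))
new_cases <- function(cumulative) {
    stopifnot(is.numeric(cumulative))
    # Prepend 0 so the result has one value per input day
    diff(c(0, cumulative))
}

# Cumulative counts for Erie County, 2020-05-16 to 2020-05-18
new_cases(c(4867, 4954, 4993))
```

Once a step like this lives in a documented function, it can be reused in a vignette and later bundled into a package.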
Roswell Park Comprehensive Cancer Center, Martin.Morgan@RoswellPark.org
Roswell Park Comprehensive Cancer Center