Week 2 The data frame

2.1 Day 8 (Monday) Zoom check-in

Logistics

  • Remember to use the QuaRantine Microsoft Team. Your Roswell credentials are required, and you must have been invited (by Adam.Kisailus at RoswellPark.org)

  • We’re thinking of having a ‘networking’ hour after Friday’s class (so 3pm and after) where we’ll break into smaller groups (if necessary) and provide an opportunity for people to turn on their video and audio so that we can increase the amount of interaction. Likely the first networking hour will be a round of introductions / what you hope to get out of the course / etc., and maybe a brief discussion of topics arising.

Review and troubleshoot (15 minutes)

Saving and loading objects

Scripts

The data frame (40 minutes)

Concept

Recall from Day 1:

  • Data frames are handy containers for experimental data.

  • Like a spreadsheet, a data frame has rows and columns

  • The columns of a data frame contain measurements describing each individual

    • Each measurement could refer to a different type of object (numeric, string, etc.)
    • Measurements could be physical observations on samples, e.g., height, weight, age, minutes an activity lasts, etc.
    • Measurements might also describe how the row is classified, e.g., activity, is work?, classification, date, etc.
  • The rows of a data frame represent a ‘tuple’ of measurements corresponding to an experimental observation, e.g.,

    • Note: you must ensure units are consistent across tuples!
  • Rows and columns can be assigned names.

Adding and deleting rows

Adding rows
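One way to add rows is rbind(), sketched here on a made-up two-column data frame (the column names and values are illustrative assumptions, not the df used in class):

```r
## hypothetical starting data frame
df <- data.frame(
    activity = c("check e-mail", "breakfast"),
    minutes = c(20, 30),
    stringsAsFactors = FALSE
)

## the new row(s) must use the same column names as 'df'
more <- data.frame(activity = "walk", minutes = 60, stringsAsFactors = FALSE)

## rbind() stacks the rows of 'more' below the rows of 'df'
df <- rbind(df, more)
df
```

The result has three rows; the original two are unchanged and the new row is last.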

Delete rows using a logical vector…

…or a numeric vector
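Both styles of deletion can be sketched on a small, made-up data frame (names and values are assumptions for illustration). A logical vector keeps the rows where it is TRUE; a negative numeric vector drops rows by position.

```r
## hypothetical data frame
df <- data.frame(
    activity = c("check e-mail", "breakfast", "walk"),
    minutes = c(20, 30, 60),
    stringsAsFactors = FALSE
)

## logical vector: one TRUE / FALSE per row; FALSE rows are dropped
keep <- df$minutes >= 30
df_logical <- df[keep, ]

## numeric vector: negative indexes remove the rows at those positions
df_numeric <- df[-c(1, 3), ]

df_logical
df_numeric
```

Neither operation changes df itself; assign the result back to df to make the deletion permanent.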

Some useful data frame operations

Try these out on your simple data frames df and named_df:

  • str(df) # structure (NOT string!)(sorry Python programmers ;)
  • dim(df) # dimensions
  • View(df) # open tabular view of data frame
  • head(df) # first few rows
  • tail(df) # last few rows
  • names(df) # column names
  • colnames(df) # column names
  • rownames(df) # row names

Writing, reading, and spreadsheets

Saving a data.frame

R and spreadsheets

An alternative way of working with data.frame()

Summarization

Summarization by group

This week’s activities (5 minutes)

Goal: retrieve and summarize COVID-19 cases in Erie county and nationally

2.2 Day 9: Creation and manipulation

Creation

Last week we created vectors summarizing our quarantine activities

Each of these vectors is the same length, and they are related to one another in a specific way – the first element of activity, ‘check e-mail’, is related to the first element of minutes, ‘20’, and to the first element of is_work, etc.

Use data.frame() to construct an object containing each of these vectors
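A minimal sketch, assuming three vectors like those from last week. Only the first elements (‘check e-mail’ and 20) come from the text above; the remaining values are made up for illustration.

```r
## vectors of equal length; element i of each describes the same activity
activity <- c("check e-mail", "breakfast", "conference call", "webinar", "walk")
minutes <- c(20, 30, 60, 60, 60)
is_work <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

## each vector becomes a column, named after the variable
activities <- data.frame(activity, minutes, is_work, stringsAsFactors = FALSE)
activities

dim(activities)    # number of rows, number of columns
```

stringsAsFactors = FALSE keeps activity as a character vector regardless of the version of R in use.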

Column selection

Use [ to select rows and columns

Use $ or [[ to select a column

Column selection and subsetting are often combined, e.g., to create a data.frame of work-related activities, or work-related activities lasting 60 minutes or longer
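These selections might look as follows, using an illustrative activities data frame (values other than ‘check e-mail’ and 20 are assumptions):

```r
activities <- data.frame(
    activity = c("check e-mail", "breakfast", "conference call", "webinar", "walk"),
    minutes = c(20, 30, 60, 60, 60),
    is_work = c(TRUE, FALSE, TRUE, TRUE, FALSE),
    stringsAsFactors = FALSE
)

## [ with rows before the comma and columns after it
activities[1:2, c("activity", "minutes")]

## $ and [[ both extract a single column as a vector
activities$minutes
activities[["minutes"]]

## work-related activities
work <- activities[activities$is_work, ]

## work-related activities lasting 60 minutes or longer
long_work <- activities[activities$is_work & activities$minutes >= 60, ]
long_work
```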

Reading and writing

Create a file path to store a ‘csv’ file. From day 7, the path could be temporary, chosen interactively, a relative path, or an absolute path

Use write.csv() to save the data.frame to disk as a plain text file in ‘csv’ (comma-separated value) format. The row.names = FALSE argument means that the row indexes are not saved to the file (row names are created when data is read in using read.csv()).

If you wish, use RStudio File -> Open File to navigate to the location where you saved the file, and open it. You could also open the file in Excel or other spreadsheet. Conversely, you can take an Excel sheet and export it as a csv file for reading into R.

Use read.csv() to import a plain text file formatted as csv

Note that some information has not survived the round-trip – the classification and date columns are plain character vectors.

Update these to be a factor() with specific levels, and a Date.
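The whole round trip can be sketched with a temporary file. The activities, classification levels, and dates here are hypothetical stand-ins for your own data.

```r
## hypothetical data with a factor and a Date column
activities <- data.frame(
    activity = c("check e-mail", "breakfast", "walk"),
    classification = factor(
        c("connect", "basics", "exercise"),
        levels = c("connect", "exercise", "basics")
    ),
    date = as.Date(c("2020-04-13", "2020-04-13", "2020-04-14")),
    stringsAsFactors = FALSE
)

## write to a temporary csv file, then read it back in
path <- tempfile(fileext = ".csv")
write.csv(activities, path, row.names = FALSE)
round_trip <- read.csv(path, stringsAsFactors = FALSE)

## both columns came back as plain character vectors
class(round_trip$classification)
class(round_trip$date)

## restore the factor (with the levels in the order we want) and the Date
round_trip$classification <- factor(
    round_trip$classification,
    levels = c("connect", "exercise", "basics")
)
round_trip$date <- as.Date(round_trip$date, format = "%Y-%m-%d")
```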

Reading from a remote file (!)

2.3 Day 10: subset(), with(), and within()

with()

Use with() to simplify column references

  • Goal: calculate maximum number of cases in the Erie county data subset

  • First argument: a data.frame containing data to be manipulated – erie

  • Second argument: an expression to be evaluated, usually referencing columns in the data set – max(cases)

Second argument can be more complicated, using {} to enclose several lines.
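Both forms are sketched here on a three-row stand-in for the erie data (the first three days of the Erie county record); with the full data the calls are the same.

```r
## small stand-in for the Erie county subset
erie <- data.frame(
    date = as.Date(c("2020-03-15", "2020-03-16", "2020-03-17")),
    cases = c(3, 6, 7)
)

## single expression: the same as max(erie$cases)
max_cases <- with(erie, max(cases))
max_cases    # 7

## several lines inside {}; the value of the last line is returned
max_new <- with(erie, {
    new_cases <- diff(c(0, cases))
    max(new_cases)
})
max_new      # 3
```

Variables created inside the braces, such as new_cases, are temporary; with() does not modify erie.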

within()

Adding and updating columns within() a data.frame
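A minimal sketch on a three-row stand-in for the Erie county data, updating one existing column and adding a new one:

```r
## stand-in data; 'date' starts out as a character vector
erie <- data.frame(
    date = c("2020-03-15", "2020-03-16", "2020-03-17"),
    cases = c(3, 6, 7),
    stringsAsFactors = FALSE
)

erie <- within(erie, {
    ## update an existing column
    date <- as.Date(date, format = "%Y-%m-%d")
    ## add a new column: day-to-day change in cumulative cases
    new_cases <- diff(c(0, cases))
})
erie
```

Unlike with(), within() returns a modified copy of the data frame, so the result needs to be assigned (here, back to erie).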

2.4 Day 11: aggregate() and an initial work flow

aggregate() for summarizing columns by group

Goal: summarize maximum number of cases by county in New York state

Setup

aggregate()

  • First argument: a formula – cases ~ county

    • Right-hand side: the variable to be used to subset (group) the data – county

    • Left-hand side: the variable to be used in the aggregation function – cases

  • Second argument: source of data – ny_state

  • Third argument: the function to be applied to each subset of data – max

  • Maximum number of cases by county:
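The three arguments can be seen in action on a toy version of ny_state (two counties, with made-up values for Albany):

```r
## toy data: cumulative cases on successive days for two counties
ny_state <- data.frame(
    county = c("Erie", "Erie", "Erie", "Albany", "Albany"),
    cases = c(3, 6, 7, 1, 4),
    stringsAsFactors = FALSE
)

## group rows by 'county', then apply max() to 'cases' within each group
max_cases <- aggregate(cases ~ county, ny_state, max)
max_cases
```

The result is itself a data.frame with one row per county, so it can be subset or summarized like any other.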

Exploring the data summary

Help: ?aggregate.formula

An initial work flow

Data input

Cleaning

Subset to only Erie county, New York state

Manipulation

Summary: calculate maximum (total) number of cases per county in New York state

Summary: calculate maximum (total) number of cases per state

2.5 Day 12 (Friday) Zoom check-in

Review and troubleshoot (20 minutes)

## retrieve and clean the current data set
url <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
us <- read.csv(url, stringsAsFactors = FALSE)
us <- within(us, {
    date = as.Date(date, format = "%Y-%m-%d")
})

## subset
erie <- subset(us, (county == "Erie") & (state == "New York"))

## manipulate
erie <- within(erie, {
    new_cases <- diff( c(0, cases) )
})

## record of cases to date
erie
##              date county    state  fips cases deaths new_cases
## 2569   2020-03-15   Erie New York 36029     3      0         3
## 3028   2020-03-16   Erie New York 36029     6      0         3
## 3544   2020-03-17   Erie New York 36029     7      0         1
## 4141   2020-03-18   Erie New York 36029     7      0         0
## 4870   2020-03-19   Erie New York 36029    28      0        21
## 5717   2020-03-20   Erie New York 36029    31      0         3
## 6711   2020-03-21   Erie New York 36029    38      0         7
## 7805   2020-03-22   Erie New York 36029    54      0        16
## 9003   2020-03-23   Erie New York 36029    87      0        33
## 10310  2020-03-24   Erie New York 36029   107      0        20
## 11741  2020-03-25   Erie New York 36029   122      0        15
## 13345  2020-03-26   Erie New York 36029   134      2        12
## 15081  2020-03-27   Erie New York 36029   219      6        85
## 16912  2020-03-28   Erie New York 36029   354      6       135
## 18839  2020-03-29   Erie New York 36029   380      6        26
## 20878  2020-03-30   Erie New York 36029   443      8        63
## 23003  2020-03-31   Erie New York 36029   438      8        -5
## 25193  2020-04-01   Erie New York 36029   553     12       115
## 27445  2020-04-02   Erie New York 36029   734     19       181
## 29765  2020-04-03   Erie New York 36029   802     22        68
## 32150  2020-04-04   Erie New York 36029   945     26       143
## 34579  2020-04-05   Erie New York 36029  1059     27       114
## 37047  2020-04-06   Erie New York 36029  1163     30       104
## 39558  2020-04-07   Erie New York 36029  1163     36         0
## 42109  2020-04-08   Erie New York 36029  1205     38        42
## 44685  2020-04-09   Erie New York 36029  1362     46       157
## 47294  2020-04-10   Erie New York 36029  1409     58        47
## 49946  2020-04-11   Erie New York 36029  1472     62        63
## 52616  2020-04-12   Erie New York 36029  1571     75        99
## 55297  2020-04-13   Erie New York 36029  1624     86        53
## 57993  2020-04-14   Erie New York 36029  1668     99        44
## 60705  2020-04-15   Erie New York 36029  1751    110        83
## 63430  2020-04-16   Erie New York 36029  1850    115        99
## 66171  2020-04-17   Erie New York 36029  1929    115        79
## 68926  2020-04-18   Erie New York 36029  1997    115        68
## 71690  2020-04-19   Erie New York 36029  2070    146        73
## 74463  2020-04-20   Erie New York 36029  2109    153        39
## 77241  2020-04-21   Erie New York 36029  2147    161        38
## 80028  2020-04-22   Erie New York 36029  2233    174        86
## 82826  2020-04-23   Erie New York 36029  2450    179       217
## 85627  2020-04-24   Erie New York 36029  2603    184       153
## 88436  2020-04-25   Erie New York 36029  2773    199       170
## 91248  2020-04-26   Erie New York 36029  2954    205       181
## 94071  2020-04-27   Erie New York 36029  3021    208        67
## 96904  2020-04-28   Erie New York 36029  3089    216        68
## 99747  2020-04-29   Erie New York 36029  3196    220       107
## 102596 2020-04-30   Erie New York 36029  3319    227       123
## 105454 2020-05-01   Erie New York 36029  3481    233       162
## 108317 2020-05-02   Erie New York 36029  3598    243       117
## 111186 2020-05-03   Erie New York 36029  3710    250       112
## 114062 2020-05-04   Erie New York 36029  3802    254        92
## 116938 2020-05-05   Erie New York 36029  3891    264        89
## 119821 2020-05-06   Erie New York 36029  4008    338       117
## 122719 2020-05-07   Erie New York 36029  4136    350       128
## 125624 2020-05-08   Erie New York 36029  4255    356       119
## 128534 2020-05-09   Erie New York 36029  4337    368        82
## 131444 2020-05-10   Erie New York 36029  4453    376       116
## 134355 2020-05-11   Erie New York 36029  4483    387        30
## 137266 2020-05-12   Erie New York 36029  4530    395        47
## 140183 2020-05-13   Erie New York 36029  4606    402        76
## 143100 2020-05-14   Erie New York 36029  4671    411        65
## 146023 2020-05-15   Erie New York 36029  4782    417       111
## 148952 2020-05-16   Erie New York 36029  4867    428        85
## 151883 2020-05-17   Erie New York 36029  4954    438        87
## 154819 2020-05-18   Erie New York 36029  4993    444        39
## 157757 2020-05-19   Erie New York 36029  5037    450        44
## 160706 2020-05-20   Erie New York 36029  5131    455        94
## 163659 2020-05-21   Erie New York 36029  5270    463       139

## aggregate() cases in each county to find total (max) number
ny_state <- subset(us, state == "New York")
head( aggregate(cases ~ county, ny_state, max) )
##        county cases
## 1      Albany  1700
## 2    Allegany    44
## 3      Broome   451
## 4 Cattaraugus    71
## 5      Cayuga    72
## 6  Chautauqua    58

User-defined functions

Basic structure:

my_function_name <- function(arg1, arg2, ...)
{
   statements

   return(object)
}

A concrete example:

Functions can be loaded from a separate file using the source command. Enter the temperature conversion function into an R script and save as myFunctions.R.
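As a sketch of what such a file might contain, assuming the conversion is from Fahrenheit to Celsius (the function name fahrenheit_to_celsius is made up for illustration):

```r
## contents of myFunctions.R (hypothetical)
fahrenheit_to_celsius <- function(temp_f)
{
    temp_c <- (temp_f - 32) * 5 / 9

    return(temp_c)
}

## after saving the file, load the function into a new session with
##     source("myFunctions.R")
## and then use it like any other function
fahrenheit_to_celsius(212)           # boiling point of water: 100
fahrenheit_to_celsius(c(32, 98.6))   # works on vectors, too
```

source() evaluates every line of the file, so the file should contain only function definitions and other code that is safe to re-run.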

Statistical functions in R

R has many built-in statistical functions. Some of the more commonly used are listed below:

  • mean() # average
  • median() # median (middle value of sorted data)
  • range() # minimum and maximum (a vector of length 2; use diff(range(x)) for max - min)
  • var() # variance
  • sd() # standard deviation
  • summary() # prints a combination of useful measures
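For instance, applied to the first five new_cases values from the Erie county record:

```r
new_cases <- c(3, 3, 1, 0, 21)

mean(new_cases)           # 5.6
median(new_cases)         # 3
range(new_cases)          # 0 21
diff(range(new_cases))    # 21, the spread between smallest and largest
var(new_cases)
sd(new_cases)             # square root of the variance
summary(new_cases)
```

The single large value (21) pulls the mean well above the median, which is one reason to look at both.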

Plotting data

Review of Plot Types

  • Pie chart
    • Display proportions of different values for some variable
  • Bar plot
    • Display counts of values for categorical variables
  • Histogram, density plot
    • Display counts of values for a binned, numeric variable
  • Scatter plot
    • Display y vs. x
  • Box plot
    • Display distributions over different values of a variable

Plotting packages

3 Main Plotting Packages

  • Base graphics, lattice, and ggplot2

ggplot2

  • The “Cadillac” of plotting packages.
  • Part of the “tidyverse”
  • Beautiful plots
  • To install: install.packages('ggplot2')

Words of wisdom on using plotting packages

  • A good approach is to learn by doing but don’t start from scratch
  • Find an example that is similar in appearance to what you are trying to achieve - many R galleries are available on the net.
  • When you find something you like, grab the code and modify it to use your own data.
  • Fine tune things like labels and fonts at the end, after you are sure you like the way the data is being displayed.

Some example plots

Many of these are based on the diamonds dataset (included with ggplot2). Others are based on the mtcars dataset, which requires a bit of cleaning in preparation for the corresponding plots:

2.6 Day 13: Basic visualization

Let’s get the current Erie county data, and create the new_cases column

Simple visualization

  • We’ll use the plot() function to create a visualization of the progression of COVID cases in Erie county.

  • plot() can be used with a formula, similar to how we used aggregate().

  • The formula describes the dependent (y-axis) variable as a function of the independent (x-axis) variable

  • For our case, the formula will be cases ~ date, i.e., plot the number of cases on the y-axis, and date on the x-axis.

  • As with aggregate(), we need to provide, in the second argument, the data.frame where the variables to be plotted can be found.

  • Ok, here we go…

  • It might be more informative to plot new cases, so that we can see more easily whether social distancing and other measures are having an effect on the spread of COVID cases. Using log-transformed new cases helps to convey the proportional increase.

  • See ?plot.formula for some options available when using the formula interface to plot. Additional arguments are described on the help page ?plot.default.
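The two plots might be produced like this. The five rows are copied from the Erie county record for 19 to 23 March; the log = "y" argument and the plot title are one reasonable choice, not necessarily the one used in class.

```r
## five days of the Erie county record
erie <- data.frame(
    date = as.Date(c(
        "2020-03-19", "2020-03-20", "2020-03-21", "2020-03-22", "2020-03-23"
    )),
    cases = c(28, 31, 38, 54, 87),
    new_cases = c(21, 3, 7, 16, 33)
)

## cumulative cases (y-axis) as a function of date (x-axis)
plot(cases ~ date, erie)

## new cases on a log-transformed y-axis
plot(new_cases ~ date, erie, log = "y", main = "New cases, Erie county")
```

With the full data, days with zero (or negative) new cases cannot be shown on a log axis; R omits them with a warning.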

2.7 Day 14: Functions

Yesterday we created a plot for Erie county. The steps to create this plot can be separated into two parts

  1. Get the full data

  2. Subset, update, and plot the data for county of interest

What if we were interested in a different county? We could repeat (cut-and-paste) step 2, updating and generalizing a little

It would be tedious and error-prone to copy and paste this code for each county we were interested in.

A better approach is to write a function that takes as inputs the us data.frame, and the name of the county that we want to plot. Functions are easy to write
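A minimal sketch of such a function follows. The name plot_county, its argument names, and the toy data used to try it out are assumptions for illustration; rows are assumed to be in date order, as in the original file.

```r
plot_county <- function(us, county_of_interest, state_of_interest = "New York")
{
    ## subset to the county of interest
    county_data <- subset(
        us,
        (county == county_of_interest) & (state == state_of_interest)
    )

    ## new cases are the day-to-day differences in cumulative cases
    county_data <- within(county_data, {
        new_cases <- diff(c(0, cases))
    })

    plot(
        new_cases ~ date, county_data,
        main = paste("New cases,", county_of_interest, "county")
    )

    ## return the data invisibly, in case it is useful later
    invisible(county_data)
}

## try it out on a toy version of the 'us' data
us_toy <- data.frame(
    date = as.Date(c("2020-03-15", "2020-03-16", "2020-03-15")),
    county = c("Erie", "Erie", "Albany"),
    state = "New York",
    cases = c(3, 6, 1),
    stringsAsFactors = FALSE
)
erie <- plot_county(us_toy, "Erie")
```

The argument is called county_of_interest rather than county so that it cannot be confused with the county column inside subset().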

Hmm, come to think of it, we might want to write a simple function to get and clean the US data.
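One possible version wraps the input and cleaning steps from the review above; the function name get_us_data and its url argument are hypothetical.

```r
get_us_data <- function(
    url = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
)
{
    ## read.csv() accepts a URL or a path to a local file
    us <- read.csv(url, stringsAsFactors = FALSE)

    ## clean: 'date' is read as character; convert it to a Date
    within(us, {
        date <- as.Date(date, format = "%Y-%m-%d")
    })
}

## retrieve the current data (requires an internet connection)
##     us <- get_us_data()
```

Making the location an argument with a default means the same function can be pointed at a saved copy of the file.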