Week 2 The data frame

2.1 Day 8 (Monday) Zoom check-in

Logistics

Remember to use the QuaRantine Microsoft Team. Your Roswell credentials are required, and you must have been invited (by Adam.Kisailus at RoswellPark.org)
We’re thinking of having a ‘networking’ hour after Friday’s class (so 3pm and after) where we’ll break into smaller groups (if necessary) and provide an opportunity for people to turn on their video and audio so that we can increase the amount of intereaction. Likely the first networking hour will be a round of introductions / what you hope to get out of the course / etc., and maybe brief discussion of topics arising.

Review and troubleshoot (15 minutes)

Saving and loading objects

Scripts

The data frame (40 minutes)

Concept

Recall from Day 1:

Data frames are handy containers for experimental data.
Like a spreadsheet, a data frame has rows and columns
The columns of a data frame contain measurements describing each individual
- Each measurement could refer to a different type of object (numeric, string, etc.)
- Measurements could be physical observations on samples, e.g., height, weight, age, minutes an activity lasts, etc.
- Measurements might also describe how the row is classified, e.g., activity, is work?, classification, date, etc.
The rows of a data frame represent a ‘tuple’ of measurements corresponding to an experimental observation, e.g.,
- Note: you must ensure units are consistent across tuples!
Rows and columns can be assigned names.

Create a simple data frame

heights <- c(72, 65, 68)
weights <- c(190, 130, 150)
ages <-  c(44, 35, 37)
df <- data.frame(heights, weights, ages)
df
##   heights weights ages
## 1      72     190   44
## 2      65     130   35
## 3      68     150   37

It’s possible to update the column names, and to provide row names…

named_df <- data.frame(heights, weights, ages)
colnames(named_df) <- c("hgt_inches", "wgt_lbs", "age_years")
rownames(named_df) <- c("John Doe", "Pat Jones", "Sara Grant")
named_df
##            hgt_inches wgt_lbs age_years
## John Doe           72     190        44
## Pat Jones          65     130        35
## Sara Grant         68     150        37

…but it’s often better practice to name columns at time of creation, and to store all information as columns (rather than designating one column as a ‘special’ row name)

Here’s our first attempt

data.frame(
    person = c("John Doe", "Pat Jones", "Sara Grant"),
    hgt_inches = heights, wgt_lbs = weights, age_years = ages
)
##       person hgt_inches wgt_lbs age_years
## 1   John Doe         72     190        44
## 2  Pat Jones         65     130        35
## 3 Sara Grant         68     150        37

It’s unsatisfactory because by default R treats character vectors as factor. We’d like them to plain-old character vectors. To accomplish this, we add the stringsAsFactors = FALSE argument

df <- data.frame(
    person = c("John Doe", "Pat Jones", "Sara Grant"),
    hgt_inches = heights, wgt_lbs = weights, age_years = ages,
    stringsAsFactors = FALSE
)
df
##       person hgt_inches wgt_lbs age_years
## 1   John Doe         72     190        44
## 2  Pat Jones         65     130        35
## 3 Sara Grant         68     150        37

Adding and deleting rows

Adding rows

Add a row with rbind()

more_people <- c("Bob Kane", "Kari Patra", "Sam Groe")
more_heights <- c(61, 68, 70)
more_weights <- c(101, 134, 175)
more_ages <-  c(13, 16, 24)
more_df <- data.frame(
    person = more_people,
    hgt_inches = more_heights, wgt_lbs = more_weights, age_years = more_ages,
    stringsAsFactors = FALSE
)

df_all <- rbind(df, more_df)
df_all
##       person hgt_inches wgt_lbs age_years
## 1   John Doe         72     190        44
## 2  Pat Jones         65     130        35
## 3 Sara Grant         68     150        37
## 4   Bob Kane         61     101        13
## 5 Kari Patra         68     134        16
## 6   Sam Groe         70     175        24

R often has more than one way to perform an operation. We’ll see add_rows() later in the course.

Delete rows using a logical vector…

Create a logical or numeric index indicating the rows to be deleted

## suppose the study has some dropouts ....
dropouts <- c("Bob Kane", "John Doe")

## create a logical vector indicating which rows should be dropped
drop <- df_all$person %in% dropouts

## ...but we actually want to know which rows to `keep`
keep <- !drop

Subset the data frame with the logical vector indicating the rows we would like to keep

df_all[keep,]
##       person hgt_inches wgt_lbs age_years
## 2  Pat Jones         65     130        35
## 3 Sara Grant         68     150        37
## 5 Kari Patra         68     134        16
## 6   Sam Groe         70     175        24

…or a numeric vector

Create a vector containing the rows to be deleted

# suppose the study has some dropouts ....
dropouts = c(2, 3)

Use a minus sign - to indicated that these rows should be dropped, rather than kept

df_all # referesh my memory about df contents ....
##       person hgt_inches wgt_lbs age_years
## 1   John Doe         72     190        44
## 2  Pat Jones         65     130        35
## 3 Sara Grant         68     150        37
## 4   Bob Kane         61     101        13
## 5 Kari Patra         68     134        16
## 6   Sam Groe         70     175        24
df_remaining <- df_all[-dropouts, ]

df_remaining # items 2 and 3 are dropped!
##       person hgt_inches wgt_lbs age_years
## 1   John Doe         72     190        44
## 4   Bob Kane         61     101        13
## 5 Kari Patra         68     134        16
## 6   Sam Groe         70     175        24

Some useful data frame operations

Try these out on your simple data frames df and named_df:

str(df) # structure (NOT string!)(sorry Python programmers ;)
dim(df) # dimensions
View(df) # open tabular view of data frame
head(df) # first few rows
tail(df) # last few rows
names(df) # column names
colnames(df) # column names
rownames(df) # row names

Writing, reading, and spreadhseets

Saving a data.frame

We could save the data.frame as an R object, using the methods from quarantine day 7
Often, better practice (e.g., to make it easy to share data with others in our lab) is to save data as a text file
A ‘csv’ file is one example
- A plain text file
- The first line contains column names
- Each line of the text file represents a row of the data frame
- Columns within a row are separated by a comma, ,

Example: save df_all to a temporary file location

file <- tempfile() # temporary file
## file <- file.choose()
## file <- "df_all.csv"
## file <- "/Users/ma38737/MyQuarantine/df_all.csv"
write.csv(df_all, file, row.names = FALSE)

now, read the data back in from the temporary location

df_all_from_file <- read.csv(file, stringsAsFactors = FALSE)
df_all_from_file
##       person hgt_inches wgt_lbs age_years
## 1   John Doe         72     190        44
## 2  Pat Jones         65     130        35
## 3 Sara Grant         68     150        37
## 4   Bob Kane         61     101        13
## 5 Kari Patra         68     134        16
## 6   Sam Groe         70     175        24

R and spreadsheets

A CSV file is a common way to move data from R to a spreadsheet, and vice versa. Following along with the example above, write df_all to a CSV file.
```
file <- "/Users/ma38727/MyQuarantine/df_all.csv" # a location on (my) disk
write.csv(df_all, file, row.names = FALSE)
```
Now open a spreadsheet application like Excel and navigate to the directory containing the file
… and open the file
To go from Excel to R, make sure your spreadsheet is a simple rectangular table of rows and columns, without ‘merged’ cells or fancy column formating, or… i.e., your spreadsheet should be as simple as the one we imported from R. Then save the file in CSV format

…and import it into R

df_all_from_Excel <- read.csv(
    "/Users/ma38727/MyQuarantine/df_all_from_Excel.csv",
    stringsAsFactors = FALSE
)

An alternative way of working with `data.frame()`

with(): column selection and computation
within(): update or add columns
subset(): row and column subset

Our quarantine log, day 1

activity <- c("check e-mail", "breakfast", "conference call", "webinar", "walk")
minutes <- c(20, 30, 60, 60, 60)
is_work <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

levels <- c("connect", "exercise", "consult", "hobby", "essential")
classification <- factor(
    c("connect", "essential", "connect", "consult", "exercise"),
    levels = levels
)

dates <- rep("04-14-2020", length(activity))
date <- as.Date(dates, format = "%m-%d-%Y")

log <- data.frame(
    activity, minutes, is_work, classification, date,
    stringsAsFactors = FALSE
)
log
##          activity minutes is_work classification       date
## 1    check e-mail      20    TRUE        connect 2020-04-14
## 2       breakfast      30   FALSE      essential 2020-04-14
## 3 conference call      60    TRUE        connect 2020-04-14
## 4         webinar      60    TRUE        consult 2020-04-14
## 5            walk      60   FALSE       exercise 2020-04-14

Summarization

Use with() to simplify variable reference

Create a new data.frame() containing the summary

with(log, {
    data.frame(
        days_in_quarantine = length(unique(date)),
        total_minutes = sum(minutes),
        work_activities = sum(is_work),
        other_activities = sum(!is_work)
    )
})
##   days_in_quarantine total_minutes work_activities other_activities
## 1                  1           230               3                2

Summarization by group

aggregate()

## minutes per day spent on each activity, from the quarantine_log
aggregate(minutes ~ activity, log, sum)
##          activity minutes
## 1       breakfast      30
## 2    check e-mail      20
## 3 conference call      60
## 4            walk      60
## 5         webinar      60

## minutes per day spent on each clasasification
aggregate(minutes ~ classification, log, sum)
##   classification minutes
## 1        connect      80
## 2       exercise      60
## 3        consult      60
## 4      essential      30

## non-work activities per day
aggregate(!is_work ~ date, log, sum)
##         date !is_work
## 1 2020-04-14        2

This week’s activities (5 minutes)

Goal: retrieve and summarize COVID 19 cases in Erie county and nationally

2.2 Day 9: Creation and manipulation

Creation

Last week we created vectors summarizing our quarantine activities

activity <- c("check e-mail", "breakfast", "conference call", "webinar", "walk")
minutes <- c(20, 30, 60, 60, 60)
is_work <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

levels <- c("connect", "exercise", "consult", "hobby", "essential")
classification <- factor(
    c("connect", "essential", "connect", "consult", "exercise"),
    levels = levels
)

dates <- rep("04-14-2020", length(activity))
date <- as.Date(dates, format = "%m-%d-%Y")

Each of these vectors is the same length, and are related to one another in a specific way – the first element of activity, ‘check e-mail’, is related to the first element of minutes, ‘20’, and to is_work, etc.

Use data.frame() to construct an object containing each of these vectors

Each argument to data.frame() is a vector representing a column

The stringsAsFactors = FALSE argument says that character vectors should NOT be automatically coerced to factors

activities <- data.frame(
    activity, minutes, is_work, classification, date,
    stringsAsFactors = FALSE
)
activities
##          activity minutes is_work classification       date
## 1    check e-mail      20    TRUE        connect 2020-04-14
## 2       breakfast      30   FALSE      essential 2020-04-14
## 3 conference call      60    TRUE        connect 2020-04-14
## 4         webinar      60    TRUE        consult 2020-04-14
## 5            walk      60   FALSE       exercise 2020-04-14

We can query the object we’ve created for its class(), dim()ensions, take a look at the head() or tail() of the object, etc. names() returns the column names.

class(activities)
## [1] "data.frame"
dim(activities)     # number of rows and columns
## [1] 5 5
head(activities, 3) # first three rows
##          activity minutes is_work classification       date
## 1    check e-mail      20    TRUE        connect 2020-04-14
## 2       breakfast      30   FALSE      essential 2020-04-14
## 3 conference call      60    TRUE        connect 2020-04-14
names(activities)
## [1] "activity"       "minutes"        "is_work"        "classification"
## [5] "date"

Column selection

Use [ to select rows and columns

activities is a two-dimensional object

Subset the data to contain the first and third rows and the first and fourth columns

activities[c(1, 3), c(1, 4)]
##          activity classification
## 1    check e-mail        connect
## 3 conference call        connect

Subset columns by name

activities[c(1, 3), c("activity", "is_work")]
##          activity is_work
## 1    check e-mail    TRUE
## 3 conference call    TRUE

Subset only by row or only by column by omiting the subscript index for that dimension

activities[c(1, 3), ]                  # all columns for rows 1 and 3
##          activity minutes is_work classification       date
## 1    check e-mail      20    TRUE        connect 2020-04-14
## 3 conference call      60    TRUE        connect 2020-04-14
activities[, c("activity", "minutes")] # all rows for columns 1 and 2
##          activity minutes
## 1    check e-mail      20
## 2       breakfast      30
## 3 conference call      60
## 4         webinar      60
## 5            walk      60

Be careful when selecting a single column!

By default, R returns a vector

activities[, "classification"]
## [1] connect   essential connect   consult   exercise 
## Levels: connect exercise consult hobby essential

Use drop = FALSE to return a data.frame

activities[, "classification", drop = FALSE]
##   classification
## 1        connect
## 2      essential
## 3        connect
## 4        consult
## 5       exercise

Use $ or [[ to select a column

Selection of individual columns as vectors is easy

activities$classification
## [1] connect   essential connect   consult   exercise 
## Levels: connect exercise consult hobby essential

An alternative, often used in scripts, is to use [[, which requires the name of a variable provided as a character vector

activities[["classification"]]
## [1] connect   essential connect   consult   exercise 
## Levels: connect exercise consult hobby essential

colname <- "classification"
activities[[colname]]
## [1] connect   essential connect   consult   exercise 
## Levels: connect exercise consult hobby essential

Column selection and subsetting are often combined, e.g., to create a data.frame of work-related activities, or work-related activities lasting 60 minutes or longer

work_related_activities <- activities[ activities$is_work == TRUE, ]
work_related_activities
##          activity minutes is_work classification       date
## 1    check e-mail      20    TRUE        connect 2020-04-14
## 3 conference call      60    TRUE        connect 2020-04-14
## 4         webinar      60    TRUE        consult 2020-04-14

row_idx <- activities$is_work & (activities$minutes >= 60)
activities[row_idx,]
##          activity minutes is_work classification       date
## 3 conference call      60    TRUE        connect 2020-04-14
## 4         webinar      60    TRUE        consult 2020-04-14

Adding or updating columns

Use $ or [ or [[ to add a new column,

activities$is_long_work <- activities$is_work & (activities$minutes >= 60)
activities
##          activity minutes is_work classification       date is_long_work
## 1    check e-mail      20    TRUE        connect 2020-04-14        FALSE
## 2       breakfast      30   FALSE      essential 2020-04-14        FALSE
## 3 conference call      60    TRUE        connect 2020-04-14         TRUE
## 4         webinar      60    TRUE        consult 2020-04-14         TRUE
## 5            walk      60   FALSE       exercise 2020-04-14        FALSE

## ...another way of doing the same thing
activities[["is_long_work"]] <- activities$is_work & (activities$minutes >= 60)

## ...and another way
activities[,"is_long_work"] <- activities$is_work & (activities$minutes >= 60)

Columns can be updated in the same way

activities$activity <- toupper(activities$activity)
activities
##          activity minutes is_work classification       date is_long_work
## 1    CHECK E-MAIL      20    TRUE        connect 2020-04-14        FALSE
## 2       BREAKFAST      30   FALSE      essential 2020-04-14        FALSE
## 3 CONFERENCE CALL      60    TRUE        connect 2020-04-14         TRUE
## 4         WEBINAR      60    TRUE        consult 2020-04-14         TRUE
## 5            WALK      60   FALSE       exercise 2020-04-14        FALSE

Reading and writing

Create a file path to store a ‘csv’ file. From day 7, the path could be temporary, chosen interactively, a relative path, or an absolute path

## could be any of these...
##
## interactive_file_path <- file.choose(new = TRUE)
## getwd()
## relative_file_path <- "my_activity.rds"
## absolute_file_path_on_macOS <- "/Users/ma38727/my_activity.rda"
##
## ... but we'll use
temporary_file_path <- tempfile(fileext = ".csv")

Use write.csv() to save the data.frame to disk as a plain text file in ‘csv’ (comma-separated value) format. The row.names = FALSE argument means that the row indexes are not saved to the file (row names are created when data is read in using read.csv()).

write.csv(activities, temporary_file_path, row.names = FALSE)

If you wish, use RStudio File -> Open File to navigate to the location where you saved the file, and open it. You could also open the file in Excel or other spreadsheet. Conversely, you can take an Excel sheet and export it as a csv file for reading into R.

Use read.csv() to import a plain text file formatted as csv

imported_activities <- read.csv(temporary_file_path, stringsAsFactors = FALSE)
imported_activities
##          activity minutes is_work classification       date is_long_work
## 1    CHECK E-MAIL      20    TRUE        connect 2020-04-14        FALSE
## 2       BREAKFAST      30   FALSE      essential 2020-04-14        FALSE
## 3 CONFERENCE CALL      60    TRUE        connect 2020-04-14         TRUE
## 4         WEBINAR      60    TRUE        consult 2020-04-14         TRUE
## 5            WALK      60   FALSE       exercise 2020-04-14        FALSE

Note that some information has not survived the round-trip – the classification and date columns are plain character vectors.

class(imported_activities$classification)
## [1] "character"
class(imported_activities$date)
## [1] "character"

Update these to be a factor() with specific levels, and a Date. `

levels <- c("connect", "exercise", "consult", "hobby", "essential")
imported_activities$classification <- factor(
    imported_activities$classification,
    levels = levels
)

imported_activities$date <- as.Date(imported_activities$date, format = "%Y-%m-%d")

imported_activities
##          activity minutes is_work classification       date is_long_work
## 1    CHECK E-MAIL      20    TRUE        connect 2020-04-14        FALSE
## 2       BREAKFAST      30   FALSE      essential 2020-04-14        FALSE
## 3 CONFERENCE CALL      60    TRUE        connect 2020-04-14         TRUE
## 4         WEBINAR      60    TRUE        consult 2020-04-14         TRUE
## 5            WALK      60   FALSE       exercise 2020-04-14        FALSE

Reading from a remote file (!)

Visit the New York Times csv file daily tally of COVID-19 cases in all US counties.

Read the data into an R data.frame

url <-
  "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
us <- read.csv(url, stringsAsFactors = FALSE)

Explore the data

class(us)
## [1] "data.frame"
dim(us)
## [1] 164886      6
head(us)
##         date    county      state  fips cases deaths
## 1 2020-01-21 Snohomish Washington 53061     1      0
## 2 2020-01-22 Snohomish Washington 53061     1      0
## 3 2020-01-23 Snohomish Washington 53061     1      0
## 4 2020-01-24      Cook   Illinois 17031     1      0
## 5 2020-01-24 Snohomish Washington 53061     1      0
## 6 2020-01-25    Orange California  6059     1      0

Subset the data to only New York state or Erie county

ny_state <- us[us$state == "New York",]
dim(ny_state)
## [1] 3813    6

erie <- us[(us$state == "New York") & (us$county == "Erie"), ]
erie
##              date county    state  fips cases deaths
## 2569   2020-03-15   Erie New York 36029     3      0
## 3028   2020-03-16   Erie New York 36029     6      0
## 3544   2020-03-17   Erie New York 36029     7      0
## 4141   2020-03-18   Erie New York 36029     7      0
## 4870   2020-03-19   Erie New York 36029    28      0
## 5717   2020-03-20   Erie New York 36029    31      0
## 6711   2020-03-21   Erie New York 36029    38      0
## 7805   2020-03-22   Erie New York 36029    54      0
## 9003   2020-03-23   Erie New York 36029    87      0
## 10310  2020-03-24   Erie New York 36029   107      0
## 11741  2020-03-25   Erie New York 36029   122      0
## 13345  2020-03-26   Erie New York 36029   134      2
## 15081  2020-03-27   Erie New York 36029   219      6
## 16912  2020-03-28   Erie New York 36029   354      6
## 18839  2020-03-29   Erie New York 36029   380      6
## 20878  2020-03-30   Erie New York 36029   443      8
## 23003  2020-03-31   Erie New York 36029   438      8
## 25193  2020-04-01   Erie New York 36029   553     12
## 27445  2020-04-02   Erie New York 36029   734     19
## 29765  2020-04-03   Erie New York 36029   802     22
## 32150  2020-04-04   Erie New York 36029   945     26
## 34579  2020-04-05   Erie New York 36029  1059     27
## 37047  2020-04-06   Erie New York 36029  1163     30
## 39558  2020-04-07   Erie New York 36029  1163     36
## 42109  2020-04-08   Erie New York 36029  1205     38
## 44685  2020-04-09   Erie New York 36029  1362     46
## 47294  2020-04-10   Erie New York 36029  1409     58
## 49946  2020-04-11   Erie New York 36029  1472     62
## 52616  2020-04-12   Erie New York 36029  1571     75
## 55297  2020-04-13   Erie New York 36029  1624     86
## 57993  2020-04-14   Erie New York 36029  1668     99
## 60705  2020-04-15   Erie New York 36029  1751    110
## 63430  2020-04-16   Erie New York 36029  1850    115
## 66171  2020-04-17   Erie New York 36029  1929    115
## 68926  2020-04-18   Erie New York 36029  1997    115
## 71690  2020-04-19   Erie New York 36029  2070    146
## 74463  2020-04-20   Erie New York 36029  2109    153
## 77241  2020-04-21   Erie New York 36029  2147    161
## 80028  2020-04-22   Erie New York 36029  2233    174
## 82826  2020-04-23   Erie New York 36029  2450    179
## 85627  2020-04-24   Erie New York 36029  2603    184
## 88436  2020-04-25   Erie New York 36029  2773    199
## 91248  2020-04-26   Erie New York 36029  2954    205
## 94071  2020-04-27   Erie New York 36029  3021    208
## 96904  2020-04-28   Erie New York 36029  3089    216
## 99747  2020-04-29   Erie New York 36029  3196    220
## 102596 2020-04-30   Erie New York 36029  3319    227
## 105454 2020-05-01   Erie New York 36029  3481    233
## 108317 2020-05-02   Erie New York 36029  3598    243
## 111186 2020-05-03   Erie New York 36029  3710    250
## 114062 2020-05-04   Erie New York 36029  3802    254
## 116938 2020-05-05   Erie New York 36029  3891    264
## 119821 2020-05-06   Erie New York 36029  4008    338
## 122719 2020-05-07   Erie New York 36029  4136    350
## 125624 2020-05-08   Erie New York 36029  4255    356
## 128534 2020-05-09   Erie New York 36029  4337    368
## 131444 2020-05-10   Erie New York 36029  4453    376
## 134355 2020-05-11   Erie New York 36029  4483    387
## 137266 2020-05-12   Erie New York 36029  4530    395
## 140183 2020-05-13   Erie New York 36029  4606    402
## 143100 2020-05-14   Erie New York 36029  4671    411
## 146023 2020-05-15   Erie New York 36029  4782    417
## 148952 2020-05-16   Erie New York 36029  4867    428
## 151883 2020-05-17   Erie New York 36029  4954    438
## 154819 2020-05-18   Erie New York 36029  4993    444
## 157757 2020-05-19   Erie New York 36029  5037    450
## 160706 2020-05-20   Erie New York 36029  5131    455
## 163659 2020-05-21   Erie New York 36029  5270    463

2.3 Day 10: `subset()`, `with()`, and `within()`

`subset()`

subset()ing a data.frame

Read the New York Times csv file summarizing COVID cases in the US.

url <-
  "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
us <- read.csv(url, stringsAsFactors = FALSE)

Create subsets, e.g., to include only New York state, or only Erie county

ny_state <- subset(us, state == "New York")
dim(ny_state)
## [1] 3813    6
tail(ny_state)
##              date      county    state  fips cases deaths
## 163698 2020-05-21      Warren New York 36113   241     28
## 163699 2020-05-21  Washington New York 36115   222     11
## 163700 2020-05-21       Wayne New York 36117   103      1
## 163701 2020-05-21 Westchester New York 36119 32672   1438
## 163702 2020-05-21     Wyoming New York 36121    79      3
## 163703 2020-05-21       Yates New York 36123    34      6

erie <- subset(us, (state == "New York") & county == "Erie")
dim(erie)
## [1] 68  6
tail(erie)
##              date county    state  fips cases deaths
## 148952 2020-05-16   Erie New York 36029  4867    428
## 151883 2020-05-17   Erie New York 36029  4954    438
## 154819 2020-05-18   Erie New York 36029  4993    444
## 157757 2020-05-19   Erie New York 36029  5037    450
## 160706 2020-05-20   Erie New York 36029  5131    455
## 163659 2020-05-21   Erie New York 36029  5270    463

`with()`

Use with() to simplify column references

Goal: calculate maximum number of cases in the Erie county data subset
First argument: a data.frame containing data to be manipulated – erie
Second argument: an expression to be evaluated, usually referencing columns in the data set – max(cases)
- E.g., Calculate the maximum number of cases in the erie subset
```
with(erie, max(cases))
## [1] 5270
```

Second argument can be more complicated, using {} to enclose several lines.

E.g., Calculate the number of new cases, and then reports the average number of new cases per day. We will use diff()
- diff() calculates the difference between successive values of a vector
```
x <- c(1, 1, 2, 3, 5, 8)
diff(x)
## [1] 0 1 1 2 3
```
- The length of diff(x) is one less than the length of x
```
length(x)
## [1] 6
length(diff(x))
## [1] 5
```

new_cases is the diff() of successive values of cases, with an implicit initial value equal to 0.

with(erie, {
    new_cases <- diff(c(0, cases))
    mean(new_cases)
})
## [1] 77.5

`within()`

Adding and updating columns within() a data.frame

First argument: a data.frame containing data to be updated – erie
Second argument: an expression of one or more variable assignments, the assignments create new columns in the data.frame.

Example: add a new_cases column

erie_new_cases <- within(erie, {
    new_cases <- diff(c(0, cases))
})
head(erie_new_cases)
##            date county    state  fips cases deaths new_cases
## 2569 2020-03-15   Erie New York 36029     3      0         3
## 3028 2020-03-16   Erie New York 36029     6      0         3
## 3544 2020-03-17   Erie New York 36029     7      0         1
## 4141 2020-03-18   Erie New York 36029     7      0         0
## 4870 2020-03-19   Erie New York 36029    28      0        21
## 5717 2020-03-20   Erie New York 36029    31      0         3

2.4 Day 11: `aggregate()` and an initial work flow

`aggregate()` for summarizing columns by group

Goal: summarize maximum number of cases by county in New York state

Setup

Read and subset the New York Times data to contain only New York state data

url <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
us <- read.csv(url, stringsAsFactors = FALSE)

ny_state <- subset(us, state == "New York")

aggregate()

First argument: a formula – cases ~ county
- Right-hand side: the variable to be used to subset (group) the data – county
- Left-hand side: the variable to be used in the aggregation function – cases
Second argument: source of data – ny_state
Third argument: the function to be applied to each subset of data – max

Maximum number of cases by county:

max_cases_by_county <- aggregate( cases ~ county, ny_state, max )

Exploring the data summary

Subset to some interesting ‘counties’

head(max_cases_by_county)
##        county cases
## 1      Albany  1700
## 2    Allegany    44
## 3      Broome   451
## 4 Cattaraugus    71
## 5      Cayuga    72
## 6  Chautauqua    58
subset(
    max_cases_by_county,
    county %in% c("New York City", "Westchester", "Erie")
)
##           county  cases
## 14          Erie   5270
## 29 New York City 200507
## 57   Westchester  32672

Help: ?aggregate.formula

An initial work flow

Data input

From a remote location

url <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
us <- read.csv(url, stringsAsFactors = FALSE)

class(us)
## [1] "data.frame"
dim(us)
## [1] 164886      6
head(us)
##         date    county      state  fips cases deaths
## 1 2020-01-21 Snohomish Washington 53061     1      0
## 2 2020-01-22 Snohomish Washington 53061     1      0
## 3 2020-01-23 Snohomish Washington 53061     1      0
## 4 2020-01-24      Cook   Illinois 17031     1      0
## 5 2020-01-24 Snohomish Washington 53061     1      0
## 6 2020-01-25    Orange California  6059     1      0

Cleaning

date is a plain-old character vector, but should be a Date.

class(us$date) # oops, should be 'Date'
## [1] "character"

Update, method 1

us$date <- as.Date(us$date, format = "%Y-%m-%d")
head(us)
##         date    county      state  fips cases deaths
## 1 2020-01-21 Snohomish Washington 53061     1      0
## 2 2020-01-22 Snohomish Washington 53061     1      0
## 3 2020-01-23 Snohomish Washington 53061     1      0
## 4 2020-01-24      Cook   Illinois 17031     1      0
## 5 2020-01-24 Snohomish Washington 53061     1      0
## 6 2020-01-25    Orange California  6059     1      0

Update, method 2

us <- within(us, {
    date = as.Date(date, format = "%Y-%m-%d")
})
head(us)
##         date    county      state  fips cases deaths
## 1 2020-01-21 Snohomish Washington 53061     1      0
## 2 2020-01-22 Snohomish Washington 53061     1      0
## 3 2020-01-23 Snohomish Washington 53061     1      0
## 4 2020-01-24      Cook   Illinois 17031     1      0
## 5 2020-01-24 Snohomish Washington 53061     1      0
## 6 2020-01-25    Orange California  6059     1      0

Subset to only Erie county, New York state

Subset, method 1

row_idx <- (us$county == "Erie") & (us$state == "New York")
erie <- us[row_idx,]
dim(erie)
## [1] 68  6

Subset, method 2

erie <- subset(us, (county == "Erie") & (state == "New York"))
dim(erie)
## [1] 68  6

Manipulation

Goal: calculate new_cases as the difference between succesive days, using diff()

Remember use of diff()

## example: `diff()` between successive numbers in a vector
x <- c(1, 1, 2, 3, 5, 8, 13)
diff(x)
## [1] 0 1 1 2 3 5

Update, methods 1 & 2 (prepend a 0 when using diff(), to get the intial number of new cases)

## one way...
erie$new_cases <- diff( c(0, erie$cases) )

## ...or another
erie <- within(erie, {
    new_cases <- diff( c(0, cases) )
})

Summary: calculate maximum (total) number of cases per county in New York state

For Erie county, let’s see how to calculate the maximum (total) number of cases

max(erie$cases)        # one way...
## [1] 5270
with(erie, max(cases)) # ... another
## [1] 5270

Subset US data to New York state

ny_state <- subset(us, state == "New York")

Summarize each county in the state using aggregate().

First argument: summarize cases grouped by county – cases ~ county
Second argument: data source – ny_state

Third argument: function to apply to each subset – max

max_cases_by_county <- aggregate( cases ~ county, ny_state, max)
head(max_cases_by_county)
##        county cases
## 1      Albany  1700
## 2    Allegany    44
## 3      Broome   451
## 4 Cattaraugus    71
## 5      Cayuga    72
## 6  Chautauqua    58

subset() to select counties

subset(
    max_cases_by_county,
    county %in% c("New York City", "Westchester", "Erie")
)
##           county  cases
## 14          Erie   5270
## 29 New York City 200507
## 57   Westchester  32672

Summary: calculate maximum (total) number of cases per state

Use entire data set, us

aggregate() cases by county and state – cases ~ county + state

max_cases_by_county_state <-
    aggregate( cases ~ county + state, us, max )
dim(max_cases_by_county_state)
## [1] 2975    3
head(max_cases_by_county_state)
##    county   state cases
## 1 Autauga Alabama   147
## 2 Baldwin Alabama   270
## 3 Barbour Alabama   100
## 4    Bibb Alabama    52
## 5  Blount Alabama    48
## 6 Bullock Alabama    71

aggregate() a second time, using max_cases_by_county_state and aggregtaing by state

max_cases_by_state <-
    aggregate( cases ~ state, max_cases_by_county_state, max )

Explore the data

head(max_cases_by_state)
##        state cases
## 1    Alabama  1874
## 2     Alaska   207
## 3    Arizona  7835
## 4   Arkansas   966
## 5 California 42037
## 6   Colorado  4948
subset(
    max_cases_by_state,
    state %in% c("California", "Illinois", "New York", "Washington")
)
##         state  cases
## 5  California  42037
## 15   Illinois  67551
## 34   New York 200507
## 52 Washington   7647

2.5 Day 12 (Friday) Zoom check-in

Review and troubleshoot (20 minutes)

## retrieve and clean the current data set
url <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
us <- read.csv(url, stringsAsFactors = FALSE)
us <- within(us, {
    date = as.Date(date, format = "%Y-%m-%d")
})

## subset
erie <- subset(us, (county == "Erie") & (state == "New York"))

## manipulate
erie <- within(erie, {
    new_cases <- diff( c(0, cases) )
})

## record of cases to date
erie
##              date county    state  fips cases deaths new_cases
## 2569   2020-03-15   Erie New York 36029     3      0         3
## 3028   2020-03-16   Erie New York 36029     6      0         3
## 3544   2020-03-17   Erie New York 36029     7      0         1
## 4141   2020-03-18   Erie New York 36029     7      0         0
## 4870   2020-03-19   Erie New York 36029    28      0        21
## 5717   2020-03-20   Erie New York 36029    31      0         3
## 6711   2020-03-21   Erie New York 36029    38      0         7
## 7805   2020-03-22   Erie New York 36029    54      0        16
## 9003   2020-03-23   Erie New York 36029    87      0        33
## 10310  2020-03-24   Erie New York 36029   107      0        20
## 11741  2020-03-25   Erie New York 36029   122      0        15
## 13345  2020-03-26   Erie New York 36029   134      2        12
## 15081  2020-03-27   Erie New York 36029   219      6        85
## 16912  2020-03-28   Erie New York 36029   354      6       135
## 18839  2020-03-29   Erie New York 36029   380      6        26
## 20878  2020-03-30   Erie New York 36029   443      8        63
## 23003  2020-03-31   Erie New York 36029   438      8        -5
## 25193  2020-04-01   Erie New York 36029   553     12       115
## 27445  2020-04-02   Erie New York 36029   734     19       181
## 29765  2020-04-03   Erie New York 36029   802     22        68
## 32150  2020-04-04   Erie New York 36029   945     26       143
## 34579  2020-04-05   Erie New York 36029  1059     27       114
## 37047  2020-04-06   Erie New York 36029  1163     30       104
## 39558  2020-04-07   Erie New York 36029  1163     36         0
## 42109  2020-04-08   Erie New York 36029  1205     38        42
## 44685  2020-04-09   Erie New York 36029  1362     46       157
## 47294  2020-04-10   Erie New York 36029  1409     58        47
## 49946  2020-04-11   Erie New York 36029  1472     62        63
## 52616  2020-04-12   Erie New York 36029  1571     75        99
## 55297  2020-04-13   Erie New York 36029  1624     86        53
## 57993  2020-04-14   Erie New York 36029  1668     99        44
## 60705  2020-04-15   Erie New York 36029  1751    110        83
## 63430  2020-04-16   Erie New York 36029  1850    115        99
## 66171  2020-04-17   Erie New York 36029  1929    115        79
## 68926  2020-04-18   Erie New York 36029  1997    115        68
## 71690  2020-04-19   Erie New York 36029  2070    146        73
## 74463  2020-04-20   Erie New York 36029  2109    153        39
## 77241  2020-04-21   Erie New York 36029  2147    161        38
## 80028  2020-04-22   Erie New York 36029  2233    174        86
## 82826  2020-04-23   Erie New York 36029  2450    179       217
## 85627  2020-04-24   Erie New York 36029  2603    184       153
## 88436  2020-04-25   Erie New York 36029  2773    199       170
## 91248  2020-04-26   Erie New York 36029  2954    205       181
## 94071  2020-04-27   Erie New York 36029  3021    208        67
## 96904  2020-04-28   Erie New York 36029  3089    216        68
## 99747  2020-04-29   Erie New York 36029  3196    220       107
## 102596 2020-04-30   Erie New York 36029  3319    227       123
## 105454 2020-05-01   Erie New York 36029  3481    233       162
## 108317 2020-05-02   Erie New York 36029  3598    243       117
## 111186 2020-05-03   Erie New York 36029  3710    250       112
## 114062 2020-05-04   Erie New York 36029  3802    254        92
## 116938 2020-05-05   Erie New York 36029  3891    264        89
## 119821 2020-05-06   Erie New York 36029  4008    338       117
## 122719 2020-05-07   Erie New York 36029  4136    350       128
## 125624 2020-05-08   Erie New York 36029  4255    356       119
## 128534 2020-05-09   Erie New York 36029  4337    368        82
## 131444 2020-05-10   Erie New York 36029  4453    376       116
## 134355 2020-05-11   Erie New York 36029  4483    387        30
## 137266 2020-05-12   Erie New York 36029  4530    395        47
## 140183 2020-05-13   Erie New York 36029  4606    402        76
## 143100 2020-05-14   Erie New York 36029  4671    411        65
## 146023 2020-05-15   Erie New York 36029  4782    417       111
## 148952 2020-05-16   Erie New York 36029  4867    428        85
## 151883 2020-05-17   Erie New York 36029  4954    438        87
## 154819 2020-05-18   Erie New York 36029  4993    444        39
## 157757 2020-05-19   Erie New York 36029  5037    450        44
## 160706 2020-05-20   Erie New York 36029  5131    455        94
## 163659 2020-05-21   Erie New York 36029  5270    463       139

## aggregate() cases in each county to find total (max) number
ny_state <- subset(us, state == "New York")
head( aggregate(cases ~ county, ny_state, max) )
##        county cases
## 1      Albany  1700
## 2    Allegany    44
## 3      Broome   451
## 4 Cattaraugus    71
## 5      Cayuga    72
## 6  Chautauqua    58

User-defined functions

Basic structure:

my_function_name <- function(arg1, arg2, ...)
{
   statements

   return(object)
}

A conrete example:

# declare a function to convert temperatures
toFahrenheit <- function(celsius)
{
  f <- (9/5) * celsius + 32
  return(f)
}

# invoke the function
temps <- c(20:25)
toFahrenheit(temps)
## [1] 68.0 69.8 71.6 73.4 75.2 77.0

Functions can be loaded from a separate file using the source command. Enter the temperature conversion function into an R script and save as myFunctions.R.

my_R_funcs <-
    file.path("C:\\Matott\\MyQuarantine", "myFunctions.R")
source(my_R_funcs)

toFahrenheit(c(40, 45, 78, 92, 12, 34))
## [1] 104.0 113.0 172.4 197.6  53.6  93.2

Statistical functions in R

R has many built-in statistical functions. Some of the more commonly used are listed below:

mean() # average
median() # median (middle value of sorted data)
range() # max - min
var() # variance
sd() # standard deviation
summary() # prints a combination of useful measures

Plotting data

Review of Plot Types

Pie chart
- Display proportions of different values for some variable
Bar plot
- Display counts of values for categorical variables
Histogram, density plot
- Display counts of values for a binned, numeric variable
Scatter plot
- Display y vs. x
Box plot
- Display distributions over different values of a variable

Plotting packages

3 Main Plotting Packages

Base graphics, lattice, and ggplot2

ggplot2

The “Cadillac” of plotting packages.
Part of the “tidyverse”
Beautiful plots
To install: install.packages('ggplot2')

Words of wisdom on using plotting packages

A good approach is to learn by doing but don’t start from scratch
Find an example that is similar in appearance to what you are trying to achieve - many R galleries are available on the net.
When you find something you like, grab the code and modify it to use your own data.
Fine tune things like labels and fonts at the end, after you are sure you like the way the data is being displayed.

Some example plots

Many of these are based on the Diamonds dataset. Others are based on the mtcars dataset and this requires a bit of cleaning in preparation for the corresponding plots:

data("mtcars")

# convert gear to a factor-level object
mtcars$gear = factor(
    mtcars$gear,
    levels = c(3, 4, 5),
    labels = c("3 gears", "4 gears", "5 gears")
)

# convert cyl to a factor-level object
mtcars$cyl = factor(
    mtcars$cyl,
    levels = c(4, 6, 8),
    labels = c("4 cyl", "6 cyl", "8 cyl")
)

Scatterplot using base graphics

library(ggplot2) # for diamonds dataset
data("diamonds")

# using base graphics plot() function
plot(formula = price ~ carat,  # price vs. carat
        data = diamonds,
        col  = "slategray",
        pch  = 20,
        main = "Diamond Price vs. Size"
)

Scatterplot using ggplot

library(ggplot2)
data("diamonds")
ggplot(diamonds,
    aes(x=carat, y=price, colour=clarity)) +
    geom_point()

Scatterplot by category (using ggplot)

library(ggplot2)
data("diamonds")
ggplot(diamonds) +
    geom_point(aes(x=carat, y=price, colour=cut)) +
    scale_y_log10()+
    facet_wrap(~cut)+
    ggtitle("Diamonds: Carat vs. Price by Cut")

Scatterplot by category with regression lines (using ggplot2)

library(ggplot2)
ggplot(mtcars, aes(wt, mpg, color=cyl))+
    geom_point()+
    geom_smooth(method="lm")+
    labs(title="Regression of MPG on Weight by # Cylinders",
         x = "Weight",
         y = "Miles per Gallon",
         color = "Cylinders")
## `geom_smooth()` using formula 'y ~ x'

Bar Plot using Base Graphics

library(ggplot2) # for diamonds dataset
data("diamonds")
ideal_cut_colors = diamonds[diamonds$cut == "Ideal", "color"]
# using base graphics barplot() function
barplot(table(ideal_cut_colors),
        xlab = "color",
        ylab = "count",
        main = "Ideal cut diamonds by Color",
        col = "limegreen")

limegreen is one of many cool color choices supported by R. See http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf for a nice 8-page compendium!

Frequency histogram using base graphics

The MASS library provides datasets for Venables and Ripley’s MASS textbook

library(MASS)
# using the base graphics hist() function
hist(Cars93$RPM,
     xlab = "engine RPM",
     main = "Frequency historgram of engine RPM",
     col = "skyblue")

Density plot using base graphics

library(MASS)
plot(density(Cars93$RPM),
     xlab = "engine RPM",
     main = "Frequency historgram of engine RPM",
     col = "firebrick")

Density plot using ggplot2

The qplot() function is a “quick plot” wrapper for ggplot().

It uses an interface that is similar to the plot() function of the base graphics package.

library(ggplot2)

qplot(mpg,
      data = mtcars,
      geom = "density",
      fill = gear, alpha=I(0.5),
      main = "Distribution of Gas Mileage",
      xlab = "Miles per Gallon",
      ylab = "Density")

Box-and-whisker plot using base graphics

boxplot(formula = mpg ~ gear,
        data = mtcars,
        main = "Mileage by # of Gears",
        xlab = "Num. Gears",
        ylab = "Miles per Gallon",
        col = c("tomato", "forestgreen", "dodgerblue"))

Box-and-whisker plot using ggplot

library(ggplot2)
ggplot(data = mtcars,
       aes(gear, mpg, fill=gear))+
        geom_boxplot()+
        labs(title="Mileage by # of Gears",
             x = "",
             y = "Miles per Gallon")

2.6 Day 13: Basic visualization

Let’s get the current Erie county data, and create the new_cases column

## retrieve and clean the current data set
url <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
us <- read.csv(url, stringsAsFactors = FALSE)
us <- within(us, {
    date = as.Date(date, format = "%Y-%m-%d")
})

## get the Erie county subset
erie <- subset(us, (county == "Erie") & (state == "New York"))

## add the `new_cases` column
erie <- within(erie, {
    new_cases <- diff( c(0, cases) )
})

Simple visualization

We’ll use the plot() function to create a visualization of the progression of COVID cases in Erie county.
plot() can be used with a formula, similar to how we used aggregate().
The formula describes the independent (y-axis) variable as a function of the dependent (x-axis) variable
For our case, the formula will be cases ~ date, i.e., plot the number of cases on the y-axis, and date on the x-axis.
As with aggregate(), we need to provide, in the second argument, the data.frame where the variables to be plotted can be found.
Ok, here we go…
```
plot( cases ~ date, erie)
```
It might be maybe more informative to plot new cases (so that we can see more easily whether social distancing and other measures are having an effect on the spread of COVID cases. Using log-transformed new cases helps to convey the proportional increase
```
plot( new_cases ~ date, erie, log = "y", main = "New Cases, Erie County" )
## Warning in xy.coords(x, y, xlabel, ylabel, log): 3 y values <= 0 omitted from
## logarithmic plot
```
See ?plot.formula for some options available when using the formula interface to plot. Additional arguments are described on the help page ?help.default.

2.7 Day 14: Functions

Yesterday we created a plot for Erie county. The steps to create this plot can be separated into two parts

Get the full data

url <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
us <- read.csv(url, stringsAsFactors = FALSE)
us <- within(us, {
    date = as.Date(date, format = "%Y-%m-%d")
})

Subset, update, and plot the data for county of interest

erie <- subset(us, (county == "Erie") & (state == "New York"))
erie <- within(erie, {
    new_cases <- diff( c(0, cases) )
})
plot( new_cases ~ date, erie, log = "y", main = "New Cases, Erie County" )
## Warning in xy.coords(x, y, xlabel, ylabel, log): 3 y values <= 0 omitted from
## logarithmic plot

What if we were interested in a different county? We could repeat (cut-and-paste) step 2, updating and generalizing a little

Define a new variable to indicate the county we are interested in plotting
```
county_of_interest <- "Westchester"
```
paste() concatenates its arguments together into a single character vector. We use this to construct the title of the plot
```
main_title <- paste("New Cases,", county_of_interest, "County")
```

Now create and update a subset of the data for the county that we are interested in

    county_data <-
   subset(us, (county == county_of_interest) & (state == "New York"))
    county_data <- within(county_data, {
   new_cases <- diff( c(0, cases) )
    })

… and finally plot the county data

plot( new_cases ~ date, county_data, log = "y", main = main_title)

Here is the generalization

county_of_interest <- "Westchester"

main_title <- paste("New Cases,", county_of_interest, "County")
county_data <-
    subset(us_data, (county == county_of_interest) & (state == "New York"))
county_data <- within(county_data, {
    new_cases <- diff( c(0, cases) )
})
plot( new_cases ~ date, county_data, log = "y", main = main_title)

It would be tedious and error-prone to copy and paste this code for each county we were interested in.

A better approach is to write a function that takes as inputs the us data.frame, and the name of the county that we want to plot. Functions are easy to write

Create a variable to contain the function, use the keyword function and then the arguments you want to pass in.
```
plot_county <-
    function(us_data, county_of_interest)
```

… then provide the ‘body’ of the function between curly braces

{
    main_title <- paste("New Cases,", county_of_interest, "County")
    county_data <-
        subset(us_data, (county == county_of_interest) & (state == "New York"))
    county_data <- within(county_data, {
        new_cases <- diff( c(0, cases) )
    })

    plot( new_cases ~ date, county_data, log = "y", main = main_title)
}

Normally, the last evaluated line of the code (the plot() statement in our example) is returned from the function and can be captured by a variable. In our specific case, plot() creates the plot as a side effect, and the return value is actually the special symbol NULL.

Here’s the full definition

plot_county <-
    function(us_data, county_of_interest)
{
    main_title <- paste("New Cases,", county_of_interest, "County")
    county_data <-
        subset(us_data, (county == county_of_interest) & (state == "New York"))
    county_data <- within(county_data, {
        new_cases <- diff( c(0, cases) )
    })

    plot( new_cases ~ date, county_data, log = "y", main = main_title)
}

Run the code defining the function in the R console, then use it to plot different counties:

plot_county(us, "Erie")
## Warning in xy.coords(x, y, xlabel, ylabel, log): 3 y values <= 0 omitted from
## logarithmic plot

plot_county(us, "Westchester")

Hmm, come to think of it, we might want to write a simple function to get and clean the US data.

Get and clean the US data; we don’t need any arguments, and the return value (the last line of code evaluated) is the cleaned data

get_US_data <-
    function()
{
    url <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
    us <- read.csv(url, stringsAsFactors = FALSE)
    within(us, {
        date = as.Date(date, format = "%Y-%m-%d")
    })
}

Verify that it is now just two lines to plot county-level data

us <- get_US_data()
plot_county(us, "Erie")
## Warning in xy.coords(x, y, xlabel, ylabel, log): 3 y values <= 0 omitted from
## logarithmic plot

How could you generalize plot_county() to plot county-level data for a county in any state? Hint: add a state = argument, perhaps using default values
```
plot_county <-
    function(us_data, county = "Erie", state = "New York")
{
    ## your code here!
}
```