Week 2 The data frame
2.1 Day 8 (Monday) Zoom check-in
Logistics
Remember to use the QuaRantine Microsoft Team. Your Roswell credentials are required, and you must have been invited (by Adam.Kisailus at RoswellPark.org)
We’re thinking of having a ‘networking’ hour after Friday’s class (so 3pm and after) where we’ll break into smaller groups (if necessary) and provide an opportunity for people to turn on their video and audio so that we can increase the amount of intereaction. Likely the first networking hour will be a round of introductions / what you hope to get out of the course / etc., and maybe brief discussion of topics arising.
Review and troubleshoot (15 minutes)
Saving and loading objects
Scripts
The data frame (40 minutes)
Concept
Recall from Day 1:
Data frames are handy containers for experimental data.
Like a spreadsheet, a data frame has rows and columns
The columns of a data frame contain measurements describing each individual
- Each measurement could refer to a different type of object (numeric, string, etc.)
- Measurements could be physical observations on samples, e.g.,
height
,weight
,age
,minutes
an activity lasts, etc. - Measurements might also describe how the row is classified, e.g.,
activity
,is work?
,classification
,date
, etc.
The rows of a data frame represent a ‘tuple’ of measurements corresponding to an experimental observation, e.g.,
- Note: you must ensure units are consistent across tuples!
Rows and columns can be assigned names.
Create a simple data frame
heights <- c(72, 65, 68)
weights <- c(190, 130, 150)
ages <- c(44, 35, 37)
df <- data.frame(heights, weights, ages)
df
## heights weights ages
## 1 72 190 44
## 2 65 130 35
## 3 68 150 37
It’s possible to update the column names, and to provide row names…
named_df <- data.frame(heights, weights, ages)
colnames(named_df) <- c("hgt_inches", "wgt_lbs", "age_years")
rownames(named_df) <- c("John Doe", "Pat Jones", "Sara Grant")
named_df
## hgt_inches wgt_lbs age_years
## John Doe 72 190 44
## Pat Jones 65 130 35
## Sara Grant 68 150 37
…but it’s often better practice to name columns at time of creation, and to store all information as columns (rather than designating one column as a ‘special’ row name)
Here’s our first attempt
It’s unsatisfactory because by default R treats character vectors as
factor
. We’d like them to plain-old character vectors. To accomplish this, we add thestringsAsFactors = FALSE
argument
Adding and deleting rows
Adding rows
Add a row with
rbind()
more_people <- c("Bob Kane", "Kari Patra", "Sam Groe") more_heights <- c(61, 68, 70) more_weights <- c(101, 134, 175) more_ages <- c(13, 16, 24) more_df <- data.frame( person = more_people, hgt_inches = more_heights, wgt_lbs = more_weights, age_years = more_ages, stringsAsFactors = FALSE ) df_all <- rbind(df, more_df) df_all ## person hgt_inches wgt_lbs age_years ## 1 John Doe 72 190 44 ## 2 Pat Jones 65 130 35 ## 3 Sara Grant 68 150 37 ## 4 Bob Kane 61 101 13 ## 5 Kari Patra 68 134 16 ## 6 Sam Groe 70 175 24
R often has more than one way to perform an operation. We’ll see
add_rows()
later in the course.
Delete rows using a logical vector…
Create a logical or numeric index indicating the rows to be deleted
Subset the data frame with the logical vector indicating the rows we would like to keep
…or a numeric vector
Create a vector containing the rows to be deleted
Use a minus sign
-
to indicated that these rows should be dropped, rather than keptdf_all # referesh my memory about df contents .... ## person hgt_inches wgt_lbs age_years ## 1 John Doe 72 190 44 ## 2 Pat Jones 65 130 35 ## 3 Sara Grant 68 150 37 ## 4 Bob Kane 61 101 13 ## 5 Kari Patra 68 134 16 ## 6 Sam Groe 70 175 24 df_remaining <- df_all[-dropouts, ] df_remaining # items 2 and 3 are dropped! ## person hgt_inches wgt_lbs age_years ## 1 John Doe 72 190 44 ## 4 Bob Kane 61 101 13 ## 5 Kari Patra 68 134 16 ## 6 Sam Groe 70 175 24
Some useful data frame operations
Try these out on your simple data frames df
and named_df
:
str(df) # structure (NOT string!)(sorry Python programmers ;)
dim(df) # dimensions
View(df) # open tabular view of data frame
head(df) # first few rows
tail(df) # last few rows
names(df) # column names
colnames(df) # column names
rownames(df) # row names
Writing, reading, and spreadhseets
Saving a data.frame
We could save the
data.frame
as an R object, using the methods from quarantine day 7Often, better practice (e.g., to make it easy to share data with others in our lab) is to save data as a text file
A ‘csv’ file is one example
- A plain text file
- The first line contains column names
- Each line of the text file represents a row of the data frame
- Columns within a row are separated by a comma,
,
Example: save
df_all
to a temporary file locationnow, read the data back in from the temporary location
R and spreadsheets
A CSV file is a common way to move data from R to a spreadsheet, and vice versa. Following along with the example above, write
df_all
to a CSV file.Now open a spreadsheet application like Excel and navigate to the directory containing the file
… and open the file
To go from Excel to R, make sure your spreadsheet is a simple rectangular table of rows and columns, without ‘merged’ cells or fancy column formating, or… i.e., your spreadsheet should be as simple as the one we imported from R. Then save the file in CSV format
…and import it into R
An alternative way of working with data.frame()
with()
: column selection and computationwithin()
: update or add columnssubset()
: row and column subsetOur quarantine log, day 1
activity <- c("check e-mail", "breakfast", "conference call", "webinar", "walk") minutes <- c(20, 30, 60, 60, 60) is_work <- c(TRUE, FALSE, TRUE, TRUE, FALSE) levels <- c("connect", "exercise", "consult", "hobby", "essential") classification <- factor( c("connect", "essential", "connect", "consult", "exercise"), levels = levels ) dates <- rep("04-14-2020", length(activity)) date <- as.Date(dates, format = "%m-%d-%Y") log <- data.frame( activity, minutes, is_work, classification, date, stringsAsFactors = FALSE ) log ## activity minutes is_work classification date ## 1 check e-mail 20 TRUE connect 2020-04-14 ## 2 breakfast 30 FALSE essential 2020-04-14 ## 3 conference call 60 TRUE connect 2020-04-14 ## 4 webinar 60 TRUE consult 2020-04-14 ## 5 walk 60 FALSE exercise 2020-04-14
Summarization
- Use
with()
to simplify variable reference Create a new
data.frame()
containing the summary
Summarization by group
aggregate()
## minutes per day spent on each activity, from the quarantine_log aggregate(minutes ~ activity, log, sum) ## activity minutes ## 1 breakfast 30 ## 2 check e-mail 20 ## 3 conference call 60 ## 4 walk 60 ## 5 webinar 60 ## minutes per day spent on each clasasification aggregate(minutes ~ classification, log, sum) ## classification minutes ## 1 connect 80 ## 2 exercise 60 ## 3 consult 60 ## 4 essential 30 ## non-work activities per day aggregate(!is_work ~ date, log, sum) ## date !is_work ## 1 2020-04-14 2
This week’s activities (5 minutes)
Goal: retrieve and summarize COVID 19 cases in Erie county and nationally
2.2 Day 9: Creation and manipulation
Creation
Last week we created vectors summarizing our quarantine activities
activity <- c("check e-mail", "breakfast", "conference call", "webinar", "walk")
minutes <- c(20, 30, 60, 60, 60)
is_work <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
levels <- c("connect", "exercise", "consult", "hobby", "essential")
classification <- factor(
c("connect", "essential", "connect", "consult", "exercise"),
levels = levels
)
dates <- rep("04-14-2020", length(activity))
date <- as.Date(dates, format = "%m-%d-%Y")
Each of these vectors is the same length, and are related to one another in a specific way – the first element of activity
, ‘check e-mail’, is related to the first element of minutes
, ‘20’, and to is_work
, etc.
Use data.frame()
to construct an object containing each of these vectors
Each argument to
data.frame()
is a vector representing a columnThe
stringsAsFactors = FALSE
argument says that character vectors should NOT be automatically coerced to factorsactivities <- data.frame( activity, minutes, is_work, classification, date, stringsAsFactors = FALSE ) activities ## activity minutes is_work classification date ## 1 check e-mail 20 TRUE connect 2020-04-14 ## 2 breakfast 30 FALSE essential 2020-04-14 ## 3 conference call 60 TRUE connect 2020-04-14 ## 4 webinar 60 TRUE consult 2020-04-14 ## 5 walk 60 FALSE exercise 2020-04-14
We can query the object we’ve created for its
class()
,dim()
ensions, take a look at thehead()
ortail()
of the object, etc.names()
returns the column names.class(activities) ## [1] "data.frame" dim(activities) # number of rows and columns ## [1] 5 5 head(activities, 3) # first three rows ## activity minutes is_work classification date ## 1 check e-mail 20 TRUE connect 2020-04-14 ## 2 breakfast 30 FALSE essential 2020-04-14 ## 3 conference call 60 TRUE connect 2020-04-14 names(activities) ## [1] "activity" "minutes" "is_work" "classification" ## [5] "date"
Column selection
Use [
to select rows and columns
activities
is a two-dimensional objectSubset the data to contain the first and third rows and the first and fourth columns
Subset columns by name
Subset only by row or only by column by omiting the subscript index for that dimension
activities[c(1, 3), ] # all columns for rows 1 and 3 ## activity minutes is_work classification date ## 1 check e-mail 20 TRUE connect 2020-04-14 ## 3 conference call 60 TRUE connect 2020-04-14 activities[, c("activity", "minutes")] # all rows for columns 1 and 2 ## activity minutes ## 1 check e-mail 20 ## 2 breakfast 30 ## 3 conference call 60 ## 4 webinar 60 ## 5 walk 60
Be careful when selecting a single column!
By default, R returns a vector
Use
drop = FALSE
to return adata.frame
Use $
or [[
to select a column
Selection of individual columns as vectors is easy
An alternative, often used in scripts, is to use
[[
, which requires the name of a variable provided as a character vector
Column selection and subsetting are often combined, e.g., to create a data.frame
of work-related activities, or work-related activities lasting 60 minutes or longer
work_related_activities <- activities[ activities$is_work == TRUE, ]
work_related_activities
## activity minutes is_work classification date
## 1 check e-mail 20 TRUE connect 2020-04-14
## 3 conference call 60 TRUE connect 2020-04-14
## 4 webinar 60 TRUE consult 2020-04-14
row_idx <- activities$is_work & (activities$minutes >= 60)
activities[row_idx,]
## activity minutes is_work classification date
## 3 conference call 60 TRUE connect 2020-04-14
## 4 webinar 60 TRUE consult 2020-04-14
Adding or updating columns
Use $
or [
or [[
to add a new column,
activities$is_long_work <- activities$is_work & (activities$minutes >= 60)
activities
## activity minutes is_work classification date is_long_work
## 1 check e-mail 20 TRUE connect 2020-04-14 FALSE
## 2 breakfast 30 FALSE essential 2020-04-14 FALSE
## 3 conference call 60 TRUE connect 2020-04-14 TRUE
## 4 webinar 60 TRUE consult 2020-04-14 TRUE
## 5 walk 60 FALSE exercise 2020-04-14 FALSE
## ...another way of doing the same thing
activities[["is_long_work"]] <- activities$is_work & (activities$minutes >= 60)
## ...and another way
activities[,"is_long_work"] <- activities$is_work & (activities$minutes >= 60)
Columns can be updated in the same way
activities$activity <- toupper(activities$activity)
activities
## activity minutes is_work classification date is_long_work
## 1 CHECK E-MAIL 20 TRUE connect 2020-04-14 FALSE
## 2 BREAKFAST 30 FALSE essential 2020-04-14 FALSE
## 3 CONFERENCE CALL 60 TRUE connect 2020-04-14 TRUE
## 4 WEBINAR 60 TRUE consult 2020-04-14 TRUE
## 5 WALK 60 FALSE exercise 2020-04-14 FALSE
Reading and writing
Create a file path to store a ‘csv’ file. From day 7, the path could be temporary, chosen interactively, a relative path, or an absolute path
## could be any of these...
##
## interactive_file_path <- file.choose(new = TRUE)
## getwd()
## relative_file_path <- "my_activity.rds"
## absolute_file_path_on_macOS <- "/Users/ma38727/my_activity.rda"
##
## ... but we'll use
temporary_file_path <- tempfile(fileext = ".csv")
Use write.csv()
to save the data.frame to disk as a plain text file in ‘csv’ (comma-separated value) format. The row.names = FALSE
argument means that the row indexes are not saved to the file (row names are created when data is read in using read.csv()
).
If you wish, use RStudio File -> Open File to navigate to the location where you saved the file, and open it. You could also open the file in Excel or other spreadsheet. Conversely, you can take an Excel sheet and export it as a csv file for reading into R.
Use read.csv()
to import a plain text file formatted as csv
imported_activities <- read.csv(temporary_file_path, stringsAsFactors = FALSE)
imported_activities
## activity minutes is_work classification date is_long_work
## 1 CHECK E-MAIL 20 TRUE connect 2020-04-14 FALSE
## 2 BREAKFAST 30 FALSE essential 2020-04-14 FALSE
## 3 CONFERENCE CALL 60 TRUE connect 2020-04-14 TRUE
## 4 WEBINAR 60 TRUE consult 2020-04-14 TRUE
## 5 WALK 60 FALSE exercise 2020-04-14 FALSE
Note that some information has not survived the round-trip – the classification
and date
columns are plain character vectors.
class(imported_activities$classification)
## [1] "character"
class(imported_activities$date)
## [1] "character"
Update these to be a factor()
with specific levels, and a Date
.
`
levels <- c("connect", "exercise", "consult", "hobby", "essential")
imported_activities$classification <- factor(
imported_activities$classification,
levels = levels
)
imported_activities$date <- as.Date(imported_activities$date, format = "%Y-%m-%d")
imported_activities
## activity minutes is_work classification date is_long_work
## 1 CHECK E-MAIL 20 TRUE connect 2020-04-14 FALSE
## 2 BREAKFAST 30 FALSE essential 2020-04-14 FALSE
## 3 CONFERENCE CALL 60 TRUE connect 2020-04-14 TRUE
## 4 WEBINAR 60 TRUE consult 2020-04-14 TRUE
## 5 WALK 60 FALSE exercise 2020-04-14 FALSE
Reading from a remote file (!)
Visit the New York Times csv file daily tally of COVID-19 cases in all US counties.
Read the data into an R
data.frame
Explore the data
class(us) ## [1] "data.frame" dim(us) ## [1] 164886 6 head(us) ## date county state fips cases deaths ## 1 2020-01-21 Snohomish Washington 53061 1 0 ## 2 2020-01-22 Snohomish Washington 53061 1 0 ## 3 2020-01-23 Snohomish Washington 53061 1 0 ## 4 2020-01-24 Cook Illinois 17031 1 0 ## 5 2020-01-24 Snohomish Washington 53061 1 0 ## 6 2020-01-25 Orange California 6059 1 0
Subset the data to only New York state or Erie county
ny_state <- us[us$state == "New York",] dim(ny_state) ## [1] 3813 6 erie <- us[(us$state == "New York") & (us$county == "Erie"), ] erie ## date county state fips cases deaths ## 2569 2020-03-15 Erie New York 36029 3 0 ## 3028 2020-03-16 Erie New York 36029 6 0 ## 3544 2020-03-17 Erie New York 36029 7 0 ## 4141 2020-03-18 Erie New York 36029 7 0 ## 4870 2020-03-19 Erie New York 36029 28 0 ## 5717 2020-03-20 Erie New York 36029 31 0 ## 6711 2020-03-21 Erie New York 36029 38 0 ## 7805 2020-03-22 Erie New York 36029 54 0 ## 9003 2020-03-23 Erie New York 36029 87 0 ## 10310 2020-03-24 Erie New York 36029 107 0 ## 11741 2020-03-25 Erie New York 36029 122 0 ## 13345 2020-03-26 Erie New York 36029 134 2 ## 15081 2020-03-27 Erie New York 36029 219 6 ## 16912 2020-03-28 Erie New York 36029 354 6 ## 18839 2020-03-29 Erie New York 36029 380 6 ## 20878 2020-03-30 Erie New York 36029 443 8 ## 23003 2020-03-31 Erie New York 36029 438 8 ## 25193 2020-04-01 Erie New York 36029 553 12 ## 27445 2020-04-02 Erie New York 36029 734 19 ## 29765 2020-04-03 Erie New York 36029 802 22 ## 32150 2020-04-04 Erie New York 36029 945 26 ## 34579 2020-04-05 Erie New York 36029 1059 27 ## 37047 2020-04-06 Erie New York 36029 1163 30 ## 39558 2020-04-07 Erie New York 36029 1163 36 ## 42109 2020-04-08 Erie New York 36029 1205 38 ## 44685 2020-04-09 Erie New York 36029 1362 46 ## 47294 2020-04-10 Erie New York 36029 1409 58 ## 49946 2020-04-11 Erie New York 36029 1472 62 ## 52616 2020-04-12 Erie New York 36029 1571 75 ## 55297 2020-04-13 Erie New York 36029 1624 86 ## 57993 2020-04-14 Erie New York 36029 1668 99 ## 60705 2020-04-15 Erie New York 36029 1751 110 ## 63430 2020-04-16 Erie New York 36029 1850 115 ## 66171 2020-04-17 Erie New York 36029 1929 115 ## 68926 2020-04-18 Erie New York 36029 1997 115 ## 71690 2020-04-19 Erie New York 36029 2070 146 ## 74463 2020-04-20 Erie New York 36029 2109 153 ## 77241 2020-04-21 Erie New York 36029 2147 161 ## 80028 2020-04-22 Erie New York 36029 2233 174 ## 82826 2020-04-23 Erie New York 36029 2450 179 ## 85627 2020-04-24 Erie New York 36029 2603 184 ## 88436 2020-04-25 Erie New York 36029 2773 199 ## 91248 2020-04-26 Erie New York 36029 2954 205 ## 94071 2020-04-27 Erie New York 36029 3021 208 ## 96904 2020-04-28 Erie New York 36029 3089 216 ## 99747 2020-04-29 Erie New York 36029 3196 220 ## 102596 2020-04-30 Erie New York 36029 3319 227 ## 105454 2020-05-01 Erie New York 36029 3481 233 ## 108317 2020-05-02 Erie New York 36029 3598 243 ## 111186 2020-05-03 Erie New York 36029 3710 250 ## 114062 2020-05-04 Erie New York 36029 3802 254 ## 116938 2020-05-05 Erie New York 36029 3891 264 ## 119821 2020-05-06 Erie New York 36029 4008 338 ## 122719 2020-05-07 Erie New York 36029 4136 350 ## 125624 2020-05-08 Erie New York 36029 4255 356 ## 128534 2020-05-09 Erie New York 36029 4337 368 ## 131444 2020-05-10 Erie New York 36029 4453 376 ## 134355 2020-05-11 Erie New York 36029 4483 387 ## 137266 2020-05-12 Erie New York 36029 4530 395 ## 140183 2020-05-13 Erie New York 36029 4606 402 ## 143100 2020-05-14 Erie New York 36029 4671 411 ## 146023 2020-05-15 Erie New York 36029 4782 417 ## 148952 2020-05-16 Erie New York 36029 4867 428 ## 151883 2020-05-17 Erie New York 36029 4954 438 ## 154819 2020-05-18 Erie New York 36029 4993 444 ## 157757 2020-05-19 Erie New York 36029 5037 450 ## 160706 2020-05-20 Erie New York 36029 5131 455 ## 163659 2020-05-21 Erie New York 36029 5270 463
2.3 Day 10: subset()
, with()
, and within()
subset()
subset()
ing a data.frame
Read the New York Times csv file summarizing COVID cases in the US.
Create subsets, e.g., to include only New York state, or only Erie county
ny_state <- subset(us, state == "New York") dim(ny_state) ## [1] 3813 6 tail(ny_state) ## date county state fips cases deaths ## 163698 2020-05-21 Warren New York 36113 241 28 ## 163699 2020-05-21 Washington New York 36115 222 11 ## 163700 2020-05-21 Wayne New York 36117 103 1 ## 163701 2020-05-21 Westchester New York 36119 32672 1438 ## 163702 2020-05-21 Wyoming New York 36121 79 3 ## 163703 2020-05-21 Yates New York 36123 34 6 erie <- subset(us, (state == "New York") & county == "Erie") dim(erie) ## [1] 68 6 tail(erie) ## date county state fips cases deaths ## 148952 2020-05-16 Erie New York 36029 4867 428 ## 151883 2020-05-17 Erie New York 36029 4954 438 ## 154819 2020-05-18 Erie New York 36029 4993 444 ## 157757 2020-05-19 Erie New York 36029 5037 450 ## 160706 2020-05-20 Erie New York 36029 5131 455 ## 163659 2020-05-21 Erie New York 36029 5270 463
with()
Use with()
to simplify column references
Goal: calculate maximum number of cases in the Erie county data subset
First argument: a
data.frame
containing data to be manipulated –erie
Second argument: an expression to be evaluated, usually referencing columns in the data set –
max(cases)
E.g., Calculate the maximum number of cases in the
erie
subset
Second argument can be more complicated, using {}
to enclose several lines.
E.g., Calculate the number of new cases, and then reports the average number of new cases per day. We will use
diff()
diff()
calculates the difference between successive values of a vectorThe length of
diff(x)
is one less than the length ofx
new_cases
is thediff()
of successive values ofcases
, with an implicit initial value equal to 0.
within()
Adding and updating columns within()
a data.frame
First argument: a
data.frame
containing data to be updated –erie
Second argument: an expression of one or more variable assignments, the assignments create new columns in the
data.frame
.Example: add a
new_cases
columnerie_new_cases <- within(erie, { new_cases <- diff(c(0, cases)) }) head(erie_new_cases) ## date county state fips cases deaths new_cases ## 2569 2020-03-15 Erie New York 36029 3 0 3 ## 3028 2020-03-16 Erie New York 36029 6 0 3 ## 3544 2020-03-17 Erie New York 36029 7 0 1 ## 4141 2020-03-18 Erie New York 36029 7 0 0 ## 4870 2020-03-19 Erie New York 36029 28 0 21 ## 5717 2020-03-20 Erie New York 36029 31 0 3
2.4 Day 11: aggregate()
and an initial work flow
aggregate()
for summarizing columns by group
Goal: summarize maximum number of cases by county in New York state
Setup
Read and subset the New York Times data to contain only New York state data
aggregate()
First argument: a formula –
cases ~ county
Right-hand side: the variable to be used to subset (group) the data –
county
Left-hand side: the variable to be used in the aggregation function –
cases
Second argument: source of data –
ny_state
Third argument: the function to be applied to each subset of data –
max
Maximum number of cases by county:
Exploring the data summary
Subset to some interesting ‘counties’
head(max_cases_by_county) ## county cases ## 1 Albany 1700 ## 2 Allegany 44 ## 3 Broome 451 ## 4 Cattaraugus 71 ## 5 Cayuga 72 ## 6 Chautauqua 58 subset( max_cases_by_county, county %in% c("New York City", "Westchester", "Erie") ) ## county cases ## 14 Erie 5270 ## 29 New York City 200507 ## 57 Westchester 32672
Help: ?aggregate.formula
An initial work flow
Data input
From a remote location
url <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv" us <- read.csv(url, stringsAsFactors = FALSE)
class(us) ## [1] "data.frame" dim(us) ## [1] 164886 6 head(us) ## date county state fips cases deaths ## 1 2020-01-21 Snohomish Washington 53061 1 0 ## 2 2020-01-22 Snohomish Washington 53061 1 0 ## 3 2020-01-23 Snohomish Washington 53061 1 0 ## 4 2020-01-24 Cook Illinois 17031 1 0 ## 5 2020-01-24 Snohomish Washington 53061 1 0 ## 6 2020-01-25 Orange California 6059 1 0
Cleaning
date
is a plain-oldcharacter
vector, but should be aDate
.Update, method 1
us$date <- as.Date(us$date, format = "%Y-%m-%d") head(us) ## date county state fips cases deaths ## 1 2020-01-21 Snohomish Washington 53061 1 0 ## 2 2020-01-22 Snohomish Washington 53061 1 0 ## 3 2020-01-23 Snohomish Washington 53061 1 0 ## 4 2020-01-24 Cook Illinois 17031 1 0 ## 5 2020-01-24 Snohomish Washington 53061 1 0 ## 6 2020-01-25 Orange California 6059 1 0
Update, method 2
us <- within(us, { date = as.Date(date, format = "%Y-%m-%d") }) head(us) ## date county state fips cases deaths ## 1 2020-01-21 Snohomish Washington 53061 1 0 ## 2 2020-01-22 Snohomish Washington 53061 1 0 ## 3 2020-01-23 Snohomish Washington 53061 1 0 ## 4 2020-01-24 Cook Illinois 17031 1 0 ## 5 2020-01-24 Snohomish Washington 53061 1 0 ## 6 2020-01-25 Orange California 6059 1 0
Subset to only Erie county, New York state
Subset, method 1
Subset, method 2
Manipulation
Goal: calculate
new_cases
as the difference between succesive days, usingdiff()
Remember use of
diff()
Update, methods 1 & 2 (prepend a
0
when usingdiff()
, to get the intial number of new cases)
Summary: calculate maximum (total) number of cases per county in New York state
For Erie county, let’s see how to calculate the maximum (total) number of cases
Subset US data to New York state
Summarize each county in the state using
aggregate()
.First argument: summarize
cases
grouped bycounty
–cases ~ county
Second argument: data source –
ny_state
Third argument: function to apply to each subset –
max
subset()
to select counties
Summary: calculate maximum (total) number of cases per state
Use entire data set,
us
aggregate()
cases by county and state –cases ~ county + state
max_cases_by_county_state <- aggregate( cases ~ county + state, us, max ) dim(max_cases_by_county_state) ## [1] 2975 3 head(max_cases_by_county_state) ## county state cases ## 1 Autauga Alabama 147 ## 2 Baldwin Alabama 270 ## 3 Barbour Alabama 100 ## 4 Bibb Alabama 52 ## 5 Blount Alabama 48 ## 6 Bullock Alabama 71
aggregate()
a second time, usingmax_cases_by_county_state
and aggregtaing by stateExplore the data
head(max_cases_by_state) ## state cases ## 1 Alabama 1874 ## 2 Alaska 207 ## 3 Arizona 7835 ## 4 Arkansas 966 ## 5 California 42037 ## 6 Colorado 4948 subset( max_cases_by_state, state %in% c("California", "Illinois", "New York", "Washington") ) ## state cases ## 5 California 42037 ## 15 Illinois 67551 ## 34 New York 200507 ## 52 Washington 7647
2.5 Day 12 (Friday) Zoom check-in
Review and troubleshoot (20 minutes)
## retrieve and clean the current data set
url <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
us <- read.csv(url, stringsAsFactors = FALSE)
us <- within(us, {
date = as.Date(date, format = "%Y-%m-%d")
})
## subset
erie <- subset(us, (county == "Erie") & (state == "New York"))
## manipulate
erie <- within(erie, {
new_cases <- diff( c(0, cases) )
})
## record of cases to date
erie
## date county state fips cases deaths new_cases
## 2569 2020-03-15 Erie New York 36029 3 0 3
## 3028 2020-03-16 Erie New York 36029 6 0 3
## 3544 2020-03-17 Erie New York 36029 7 0 1
## 4141 2020-03-18 Erie New York 36029 7 0 0
## 4870 2020-03-19 Erie New York 36029 28 0 21
## 5717 2020-03-20 Erie New York 36029 31 0 3
## 6711 2020-03-21 Erie New York 36029 38 0 7
## 7805 2020-03-22 Erie New York 36029 54 0 16
## 9003 2020-03-23 Erie New York 36029 87 0 33
## 10310 2020-03-24 Erie New York 36029 107 0 20
## 11741 2020-03-25 Erie New York 36029 122 0 15
## 13345 2020-03-26 Erie New York 36029 134 2 12
## 15081 2020-03-27 Erie New York 36029 219 6 85
## 16912 2020-03-28 Erie New York 36029 354 6 135
## 18839 2020-03-29 Erie New York 36029 380 6 26
## 20878 2020-03-30 Erie New York 36029 443 8 63
## 23003 2020-03-31 Erie New York 36029 438 8 -5
## 25193 2020-04-01 Erie New York 36029 553 12 115
## 27445 2020-04-02 Erie New York 36029 734 19 181
## 29765 2020-04-03 Erie New York 36029 802 22 68
## 32150 2020-04-04 Erie New York 36029 945 26 143
## 34579 2020-04-05 Erie New York 36029 1059 27 114
## 37047 2020-04-06 Erie New York 36029 1163 30 104
## 39558 2020-04-07 Erie New York 36029 1163 36 0
## 42109 2020-04-08 Erie New York 36029 1205 38 42
## 44685 2020-04-09 Erie New York 36029 1362 46 157
## 47294 2020-04-10 Erie New York 36029 1409 58 47
## 49946 2020-04-11 Erie New York 36029 1472 62 63
## 52616 2020-04-12 Erie New York 36029 1571 75 99
## 55297 2020-04-13 Erie New York 36029 1624 86 53
## 57993 2020-04-14 Erie New York 36029 1668 99 44
## 60705 2020-04-15 Erie New York 36029 1751 110 83
## 63430 2020-04-16 Erie New York 36029 1850 115 99
## 66171 2020-04-17 Erie New York 36029 1929 115 79
## 68926 2020-04-18 Erie New York 36029 1997 115 68
## 71690 2020-04-19 Erie New York 36029 2070 146 73
## 74463 2020-04-20 Erie New York 36029 2109 153 39
## 77241 2020-04-21 Erie New York 36029 2147 161 38
## 80028 2020-04-22 Erie New York 36029 2233 174 86
## 82826 2020-04-23 Erie New York 36029 2450 179 217
## 85627 2020-04-24 Erie New York 36029 2603 184 153
## 88436 2020-04-25 Erie New York 36029 2773 199 170
## 91248 2020-04-26 Erie New York 36029 2954 205 181
## 94071 2020-04-27 Erie New York 36029 3021 208 67
## 96904 2020-04-28 Erie New York 36029 3089 216 68
## 99747 2020-04-29 Erie New York 36029 3196 220 107
## 102596 2020-04-30 Erie New York 36029 3319 227 123
## 105454 2020-05-01 Erie New York 36029 3481 233 162
## 108317 2020-05-02 Erie New York 36029 3598 243 117
## 111186 2020-05-03 Erie New York 36029 3710 250 112
## 114062 2020-05-04 Erie New York 36029 3802 254 92
## 116938 2020-05-05 Erie New York 36029 3891 264 89
## 119821 2020-05-06 Erie New York 36029 4008 338 117
## 122719 2020-05-07 Erie New York 36029 4136 350 128
## 125624 2020-05-08 Erie New York 36029 4255 356 119
## 128534 2020-05-09 Erie New York 36029 4337 368 82
## 131444 2020-05-10 Erie New York 36029 4453 376 116
## 134355 2020-05-11 Erie New York 36029 4483 387 30
## 137266 2020-05-12 Erie New York 36029 4530 395 47
## 140183 2020-05-13 Erie New York 36029 4606 402 76
## 143100 2020-05-14 Erie New York 36029 4671 411 65
## 146023 2020-05-15 Erie New York 36029 4782 417 111
## 148952 2020-05-16 Erie New York 36029 4867 428 85
## 151883 2020-05-17 Erie New York 36029 4954 438 87
## 154819 2020-05-18 Erie New York 36029 4993 444 39
## 157757 2020-05-19 Erie New York 36029 5037 450 44
## 160706 2020-05-20 Erie New York 36029 5131 455 94
## 163659 2020-05-21 Erie New York 36029 5270 463 139
## aggregate() cases in each county to find total (max) number
ny_state <- subset(us, state == "New York")
head( aggregate(cases ~ county, ny_state, max) )
## county cases
## 1 Albany 1700
## 2 Allegany 44
## 3 Broome 451
## 4 Cattaraugus 71
## 5 Cayuga 72
## 6 Chautauqua 58
User-defined functions
Basic structure:
my_function_name <- function(arg1, arg2, ...)
{
statements
return(object)
}
A conrete example:
# declare a function to convert temperatures
toFahrenheit <- function(celsius)
{
f <- (9/5) * celsius + 32
return(f)
}
# invoke the function
temps <- c(20:25)
toFahrenheit(temps)
## [1] 68.0 69.8 71.6 73.4 75.2 77.0
Functions can be loaded from a separate file using the source
command. Enter the temperature conversion function into an R script and save as myFunctions.R
.
Statistical functions in R
R has many built-in statistical functions. Some of the more commonly used are listed below:
mean() # average
median() # median (middle value of sorted data)
range() # max - min
var() # variance
sd() # standard deviation
summary() # prints a combination of useful measures
Plotting data
Review of Plot Types
- Pie chart
- Display proportions of different values for some variable
- Bar plot
- Display counts of values for categorical variables
- Histogram, density plot
- Display counts of values for a binned, numeric variable
- Scatter plot
- Display y vs. x
- Box plot
- Display distributions over different values of a variable
Plotting packages
3 Main Plotting Packages
- Base graphics, lattice, and ggplot2
ggplot2
- The “Cadillac” of plotting packages.
- Part of the “tidyverse”
- Beautiful plots
- To install:
install.packages('ggplot2')
Words of wisdom on using plotting packages
- A good approach is to learn by doing but don’t start from scratch
- Find an example that is similar in appearance to what you are trying to achieve - many R galleries are available on the net.
- When you find something you like, grab the code and modify it to use your own data.
- Fine tune things like labels and fonts at the end, after you are sure you like the way the data is being displayed.
Some example plots
Many of these are based on the Diamonds dataset. Others are based on the mtcars
dataset and this requires a bit of cleaning in preparation for the corresponding plots:
data("mtcars")
# convert gear to a factor-level object
mtcars$gear = factor(
mtcars$gear,
levels = c(3, 4, 5),
labels = c("3 gears", "4 gears", "5 gears")
)
# convert cyl to a factor-level object
mtcars$cyl = factor(
mtcars$cyl,
levels = c(4, 6, 8),
labels = c("4 cyl", "6 cyl", "8 cyl")
)
Scatterplot using base graphics
Scatterplot using ggplot
Scatterplot by category (using ggplot)
Scatterplot by category with regression lines (using ggplot2)
Bar Plot using Base Graphics
library(ggplot2) # for diamonds dataset data("diamonds") ideal_cut_colors = diamonds[diamonds$cut == "Ideal", "color"] # using base graphics barplot() function barplot(table(ideal_cut_colors), xlab = "color", ylab = "count", main = "Ideal cut diamonds by Color", col = "limegreen")
limegreen
is one of many cool color choices supported by R. See http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf for a nice 8-page compendium!
- Frequency histogram using base graphics
The
MASS
library provides datasets for Venables and Ripley’s MASS textbook
Density plot using base graphics
Density plot using ggplot2
The
qplot()
function is a “quick plot” wrapper forggplot()
.It uses an interface that is similar to the
plot()
function of the base graphics package.
Box-and-whisker plot using base graphics
Box-and-whisker plot using ggplot
2.6 Day 13: Basic visualization
Let’s get the current Erie county data, and create the new_cases
column
## retrieve and clean the current data set
url <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
us <- read.csv(url, stringsAsFactors = FALSE)
us <- within(us, {
date = as.Date(date, format = "%Y-%m-%d")
})
## get the Erie county subset
erie <- subset(us, (county == "Erie") & (state == "New York"))
## add the `new_cases` column
erie <- within(erie, {
new_cases <- diff( c(0, cases) )
})
Simple visualization
We’ll use the
plot()
function to create a visualization of the progression of COVID cases in Erie county.plot()
can be used with aformula
, similar to how we usedaggregate()
.The
formula
describes the independent (y-axis) variable as a function of the dependent (x-axis) variableFor our case, the formula will be
cases ~ date
, i.e., plot the number of cases on the y-axis, and date on the x-axis.As with
aggregate()
, we need to provide, in the second argument, thedata.frame
where the variables to be plotted can be found.Ok, here we go…
It might be maybe more informative to plot new cases (so that we can see more easily whether social distancing and other measures are having an effect on the spread of COVID cases. Using log-transformed new cases helps to convey the proportional increase
See
?plot.formula
for some options available when using the formula interface to plot. Additional arguments are described on the help page?help.default
.
2.7 Day 14: Functions
Yesterday we created a plot for Erie county. The steps to create this plot can be separated into two parts
Get the full data
Subset, update, and plot the data for county of interest
What if we were interested in a different county? We could repeat (cut-and-paste) step 2, updating and generalizing a little
Define a new variable to indicate the county we are interested in plotting
paste()
concatenates its arguments together into a single character vector. We use this to construct the title of the plotNow create and update a subset of the data for the county that we are interested in
… and finally plot the county data
Here is the generalization
county_of_interest <- "Westchester" main_title <- paste("New Cases,", county_of_interest, "County") county_data <- subset(us_data, (county == county_of_interest) & (state == "New York")) county_data <- within(county_data, { new_cases <- diff( c(0, cases) ) }) plot( new_cases ~ date, county_data, log = "y", main = main_title)
It would be tedious and error-prone to copy and paste this code for each county we were interested in.
A better approach is to write a function
that takes as inputs the us
data.frame, and the name of the county that we want to plot. Functions are easy to write
Create a variable to contain the function, use the keyword
function
and then the arguments you want to pass in.… then provide the ‘body’ of the function between curly braces
Normally, the last evaluated line of the code (the
plot()
statement in our example) is returned from the function and can be captured by a variable. In our specific case,plot()
creates the plot as a side effect, and the return value is actually the special symbolNULL
.Here’s the full definition
plot_county <- function(us_data, county_of_interest) { main_title <- paste("New Cases,", county_of_interest, "County") county_data <- subset(us_data, (county == county_of_interest) & (state == "New York")) county_data <- within(county_data, { new_cases <- diff( c(0, cases) ) }) plot( new_cases ~ date, county_data, log = "y", main = main_title) }
Run the code defining the function in the R console, then use it to plot different counties:
Hmm, come to think of it, we might want to write a simple function to get and clean the US data.
Get and clean the US data; we don’t need any arguments, and the return value (the last line of code evaluated) is the cleaned data
Verify that it is now just two lines to plot county-level data
How could you generalize
plot_county()
to plot county-level data for a county in any state? Hint: add astate =
argument, perhaps using default values