Week 1 Basics

1.1 Day 1 (Monday) Zoom orientation

1.1.1 Logistics (10 minutes)

Course material

Cadence

  • Monday and Friday group Zoom sessions – these will review and troubleshoot previous material, and outline goals for the next set of independent activities.
  • Daily independent activities – most of your learning will happen here!

Communicating

  • We’ll use Microsoft Teams (if most participants have access to the course)
  • Visit Microsoft Teams and sign in with your Roswell username (e.g., MA38727@RoswellPark.org) and the password you use to check email, etc. Join the ‘QuaRantine’ team.

1.1.2 Installing R and RStudio (25 minutes, Shawn)

What is R?

  • A programming language for statistical computing, data analysis and scientific graphics.

  • Open-source with a large (and growing) user community.

  • Currently in the top 10 most popular languages according to the TIOBE index.

What is RStudio?

  • RStudio provides an integrated editor and shell environment to make R programming easier. Some of the more useful features include:
    • Syntax highlighting and color coding
    • Easy switching between shell and editor
    • Dynamic help and docs

Installing R and RStudio

  • Two ways to “get” RStudio:

The preferred approach for this course is to try to install R and RStudio on your own computer

An ALTERNATIVE, if installing on your own computer does not work:

  • Do the following only if you are NOT ABLE TO INSTALL R and RStudio.
  • Visit rstudio.cloud. Click the ‘Get Started’ button, and create an account (I used my gmail account…). You should end up at a screen like the following.

  • Click on the ‘New Project’ button, to end up with a screen like the one below. Note the ‘Untitled Project’ at the top of the screen; click on it to name your project, e.g., ‘QuaRantine’.

Breakout Room

At this point you should have RStudio running either via your desktop installation or through rstudio.cloud. If not, please let us know via the chat window and we’ll invite you to a breakout room to troubleshoot your installation.

1.1.3 Basics of R (25 minutes)

R as a simple calculator

R Console Output

Enter this in the console:
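
A minimal sketch of the kind of expressions to try at the prompt; the numbers are arbitrary and chosen only for illustration.

```r
# Illustrative arithmetic; the numbers are arbitrary
1 + 2          # addition
7 / 2          # division
2 ^ 10         # exponentiation
(1 + 2) * 3    # parentheses control the order of operations
```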

Q: what’s the [1] all about in the output?

A: It’s the index of the first entry in each line.

This is maybe a better example:
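
A longer vector makes the role of the bracketed index clearer. The sketch below uses an arbitrary sequence (the name x and its values are assumptions for illustration). When the output wraps, each line begins with the index of its first element; where the lines break depends on the width of your console.

```r
# A vector long enough to wrap across more than one console line
x <- 101:130   # hypothetical values
x              # each printed line starts with [index of its first element]
```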

Displaying help in the R Console

? <command-name>

  • Some examples:

    ? cat
    ? print

Variables

Naming variables in R

  • A variable name can contain letters, numbers, and the dot . or underscore _ characters. Variables should start with a letter.

  • Try entering these in the console:

    y = 2

    try.this = 33.3

    oneMoreTime = "woohoo"

  • Now try these:

    2y = 2

    _z = 33.3

    function = "oops, my bad"

R is case sensitive (R != r)

Variable Assignment
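
As a rough sketch of assignment (the variable names and values are made up for illustration): the arrow <- is the conventional assignment operator, and = also works at the console, as in the naming examples above.

```r
x <- 5        # assign with the arrow operator
y = 10        # '=' also assigns at the top level
z <- x + y    # the right-hand side is evaluated, then stored in z
z             # typing a name displays its value
```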

R’s four basic ‘atomic’ data types

  • Numeric (includes integer, double, etc.)
    • 3.14, 1, 2600
  • Character (string)
    • "hey, I'm a string"
    • 'single quotes are ok too'
  • Logical
    • TRUE or FALSE (note all caps)
  • NA
    • not available (no known value)

Use class() to query the class of data:
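
For instance, a small sketch with arbitrary values:

```r
class(3.14)      # "numeric"
class(1L)        # "integer" (the L suffix requests an integer)
class("hello")   # "character"
class(TRUE)      # "logical"
class(NA)        # "logical" (a plain NA is logical by default)
```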

Use the as.*() family of functions (e.g., as.numeric(), as.character()) to coerce a variable to a specific data type
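
A few coercions, sketched with arbitrary values. Coercing text that does not look like a number gives NA along with a warning, which is silenced here with suppressWarnings().

```r
as.numeric("3.14")     # 3.14
as.integer(2.9)        # 2 (the fractional part is dropped)
as.character(2600)     # "2600"
as.logical("TRUE")     # TRUE
suppressWarnings(as.numeric("abc"))   # NA; R would normally warn here
```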

Using Logical Operators

Equivalence test (==):

Not equal test (!=):

less-than (<) and greater-than (>):

Logical Or (|):

Logical And (&):
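
The operators above can be tried with simple values; the numbers below are arbitrary.

```r
1 == 1          # TRUE:  equivalence
1 != 2          # TRUE:  not equal
1 < 2           # TRUE:  less-than
1 > 2           # FALSE: greater-than
TRUE | FALSE    # TRUE:  'or' is TRUE if either side is TRUE
TRUE & FALSE    # FALSE: 'and' is TRUE only if both sides are TRUE
```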

Objects and Vectors in R

Objects

  • R stores everything, variables included, in ‘objects’.

Vectors

  • ‘Vectors’ and ‘data frames’ are the bread and butter of R
  • Vectors consist of several elements of the same class
    • e.g. a vector of heart rates, one per patient

Data frames (data.frame)

  • Data frames are structures that can contain columns of various types
    • e.g. height, weight, age, heart rate, etc.
    • Handy containers for experimental data
    • Analogous to spreadsheet data
    • More on Data Frames throughout the week!

Working with Vectors

Creating a Vector

  • Use the c() function
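
A minimal sketch, assuming a made-up vector of heart rates (the name heart_rate and its values are hypothetical):

```r
heart_rate <- c(72, 85, 64, 90)   # hypothetical values, one per patient
heart_rate
class(heart_rate)                 # "numeric"

# All elements share one class; mixing types coerces to the most general one
mixed <- c(1, "a", TRUE)
class(mixed)                      # "character"
```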

Accessing vector data

  • Use numerical indexing
  • R uses 1-based indexing
    • 1st vector element has index of 1
    • 2nd has an index of 2
    • 3rd has an index of 3
    • and so on
  • R supports “slicing” (i.e. extracting multiple items)
  • Negative indices omit the corresponding elements (e.g., -1 drops the 1st element)
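
The indexing rules above can be sketched with an arbitrary vector (the name x and its values are illustrative only):

```r
x <- c(10, 20, 30, 40, 50)
x[1]         # 10: the first element has index 1
x[2:4]       # 20 30 40: a slice of consecutive elements
x[c(1, 5)]   # 10 50: any vector of indices works
x[-1]        # 20 30 40 50: everything except the first element
x[-(1:2)]    # 30 40 50: everything except the first two
```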

Some Useful Vector Operations

  • length(): number of elements
  • sum(): sum of all element values
  • unique(): distinct values
  • sort(): sort elements, omitting NAs
  • order(): indices of sorted elements, NAs are last
  • rev(): reverse the order
  • summary(): simple statistics
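
A short sketch applying these functions to an arbitrary vector that contains one NA (the values are made up):

```r
x <- c(3, 1, NA, 2, 3)
length(x)              # 5: NA counts as an element
sum(x, na.rm = TRUE)   # 9: the NA must be removed before summing
unique(x)              # 3 1 NA 2
sort(x)                # 1 2 3 3: the NA is dropped
order(x)               # 2 4 1 5 3: the index of the NA comes last
rev(x)                 # 3 2 NA 1 3
summary(x)             # min, quartiles, mean, max, and a count of NAs
```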

Handling Missing Data

  • First consider the reason(s) for the missing data
    • e.g. concentrations that are below detectable levels?
  • Sometimes NAs in data require special statistical methods
  • Other times we can safely discard / ignore NA entries
  • To remove NAs prior to a calculation:
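
Two common approaches, sketched with a hypothetical vector of concentrations (conc and its values are made up): pass na.rm = TRUE to the function, or subset with is.na().

```r
conc <- c(1.2, NA, 3.4, 0.8)   # hypothetical measurements with one missing value
mean(conc)                     # NA: the result is unknown if any value is unknown
mean(conc, na.rm = TRUE)       # mean of the three known values
conc_known <- conc[!is.na(conc)]   # keep only the non-missing entries
conc_known
```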

Wrapping up day 1

The goal for today was to rapidly cover some of the essential aspects of R programming. For the remainder of the week you’ll work at your own pace to get more of a hands-on deep dive into this material. If you run into trouble please don’t hesitate to ask for help via Teams (QuaRantine Team), Slack (QuaRantine Course), or email (Drs. Matott and Morgan), whichever works best for you!

1.2 Day 2: Vectors and variables

Our overall goal for the next few days is to use R to create a daily log of quarantine activities.

Our goal for today is to become familiar with R vectors. Along the way we’ll probably make data entry and other errors that will start to get us comfortable with R.

If you run into problems, reach out to the Slack channel for support!

The astronaut Scott Kelly said that to survive a year on the International Space Station he found it essential to

  • Follow a schedule – plan your day, and stick to the plan
  • Pace yourselves – you’ve got a long time to accomplish tasks, so don’t try to get everything done in the first week.
  • Go outside – if Scott can head out to space, we should be able to make it to the back yard or around the block!
  • Get a hobby – something not work related, and away from that evil little screen. Maybe it’s as simple as rediscovering the joy of reading.
  • Keep a journal
  • Take time to connect – on a human level, with people you work with and people you don’t!
  • Listen to experts – Scott talked about relying on the mission controllers; for us maybe that’s watching webinars or taking courses in new topics!
  • Wash your hands!

I wanted to emphasize ‘follow a schedule’ and ‘keep a journal’. How can R help? Well, I want to create a short record of how I spend today, day 2 of my quarantine.

My first goal is to create vectors describing things I plan to do today. Let’s start with some of these. To get up to speed, type the following into the R console, at the > prompt

Press the carriage return and remind yourself that R is a calculator, and knows how to work with numbers!

Now type an activity in your day, for instance I often start with

Now try assigning that to a variable, and displaying the variable, e.g.,

OK, likely you have several activities scheduled. Create a vector of a few of these by concatenating individual values

Assign these to a variable
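
A sketch of what this might look like. The activity labels below are assumptions that paraphrase the schedule described in these notes; use your own.

```r
# Concatenate individual values into a character vector and assign it
activity <- c("check e-mail", "breakfast", "conference call", "webinar", "walk")
activity           # display the vector
length(activity)   # 5
```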

Create another vector, but this time the vector should contain the minutes spent on each activity

So I spent 20 minutes checking email, 30 minutes having breakfast and things like that, I was in a conference call for 60 minutes, and then attended a webinar where I learned new stuff for another 60 minutes. Finally I went for a walk to clear my head and remember why I’m doing things.
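
In code, this could look like the sketch below. The first four values come from the description above; the duration of the walk is not stated, so the final 60 is an assumption.

```r
# Minutes per activity, in the same order as the activities
minutes <- c(20, 30, 60, 60, 60)   # last value (the walk) is assumed
minutes
```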

Apply some basic functions to the variables, e.g., use length() to demonstrate that you have recorded the minutes for each activity.

Use tail() to select the last two activities (or head() to select the first two…)
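
A sketch of both steps, assuming the illustrative activity and minutes vectors below (the labels and the final duration are assumptions):

```r
activity <- c("check e-mail", "breakfast", "conference call", "webinar", "walk")
minutes <- c(20, 30, 60, 60, 60)       # last value is assumed

length(activity) == length(minutes)    # TRUE: one duration per activity
tail(activity, 2)                      # "webinar" "walk"
head(activity, 2)                      # "check e-mail" "breakfast"
```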

R has other types of vectors. Create a logical vector that indicates whether each activity was a ‘work’ activity or something you did for your own survival. We’ll say that checking email is a work-related activity!
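
For the five illustrative activities used in these sketches (e-mail, breakfast, conference call, webinar, walk), one possible classification is shown below; which activities count as work is a judgment call.

```r
# TRUE for work-related activities, FALSE otherwise (same order as activity)
is_work <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
is_work
class(is_work)   # "logical"
```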

1.3 Day 3: factor(), Date(), and NA

Yesterday we learned about character, numeric, and logical vectors in R (you may need to revisit previous notes and re-create these variables)

Today we will learn about slightly more complicated vectors.

We created the logical vector is_work to classify each activity as either work-related or not. What if we had several different categories? For instance, we might want to classify the activities into categories inspired by astronaut Kelly’s guidance. Categories might include: connect with others; go outside and exercise; consult experts; get a hobby; and (my own category, I guess) perform essential functions like eating and sleeping. So the values of activity could be classified as

I want to emphasize a difference between the activity and classification variables. I want activity to be a character vector that could contain any description of an activity. But I want classification to be terms only from a limited set of possibilities. In R, I want classification to be a special type of vector called a factor, with the values of the vector restricted to a set of possible levels that I define. I create a factor by enumerating the possible levels that the factor can take on

And then tell R that the vector classification should be a factor with values taken from a particular set of levels
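
The steps above can be sketched as follows. The level names and the category chosen for each activity are assumptions made for illustration, as is the variable name activity_levels.

```r
# One category per activity (e-mail, breakfast, conference call, webinar, walk)
classification <- c("connect", "essential", "connect", "consult", "exercise")

# The complete set of allowed categories (hypothetical names)
activity_levels <- c("connect", "exercise", "consult", "hobby", "essential")

# Restrict classification to those levels
classification <- factor(classification, levels = activity_levels)
classification           # prints the values followed by a 'Levels:' line
levels(classification)   # includes "hobby" even though it is not yet used
```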

Notice that activity (a character vector) displays differently from classification (a factor)

Also, some of the levels (e.g., hobby) have not been part of our schedule yet, but the factor still ‘knows’ about the level.

Notice also what happens when I try to use a value (disconnect) that is not a level of a factor
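
A small sketch, assuming the same hypothetical levels as before:

```r
activity_levels <- c("connect", "exercise", "consult", "hobby", "essential")

# "disconnect" is not one of the levels, so it cannot be represented
x <- factor(c("connect", "disconnect"), levels = activity_levels)
x          # the second value is shown as <NA>
is.na(x)   # FALSE TRUE
```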

The value with the unknown level is displayed as NA, for ‘not available’. NA values can be present in any vector, whether character, numeric, logical, or factor.

This serves as an indication that the value is simply not available. Use NA rather than adopting some special code (e.g., ‘-99’) to indicate when a value is not available.

One other type of vector we will work with a lot is a vector of dates. All of my activities are for today, so I’ll start with a character vector with the same length as my activity vector, each element indicating the date in a consistent month-day-year format

Incidentally, I could do this more efficiently using the rep() function, which replicates its argument

And even better use length() to know for sure how many times I should replicate the character vector
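
Both variants are sketched below. The date string is a made-up example in month-day-year format, and the activity labels are illustrative.

```r
activity <- c("check e-mail", "breakfast", "conference call", "webinar", "walk")

dates <- rep("04-14-2020", 5)                  # repeat the string five times
dates <- rep("04-14-2020", length(activity))   # safer: one date per activity
dates
```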

dates is a character vector, but it has special meaning as a calendar date. R has a Date class that knows how to work with dates, for instance to calculate the number of days between two dates. We will coerce dates to an object of class Date using the function as.Date(). Here’s our first attempt…

… but this results in an error:

Error in charToDate(x) :
  character string is not in a standard unambiguous format

R doesn’t know the format (month-day-year) of the dates we provide. The solution is to add a second argument to as.Date(). The second argument is a character vector that describes the date format. The format we use is "%m-%d-%Y", which says that we provide the %month first, then a hyphen, then the %day, another hyphen, and finally the four-digit %Year.
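
With the format supplied, the coercion succeeds. The sketch below assumes a made-up date string in month-day-year format.

```r
dates <- rep("04-14-2020", 5)                        # hypothetical character dates
dates <- as.Date(dates, format = "%m-%d-%Y")         # describe the input format
dates                                                # printed as year-month-day
```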

Notice that the format has been standardized to year-month-day. Also notice that although the original value of dates and the return value from as.Date() look the same, they are actually of different class.

R will use the information about class to enable specialized calculation on dates, e.g., to sort them or to determine the number of days between different dates. So here’s our date vector as a Date object.
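
A sketch of the difference in class and of the date arithmetic it enables; the date values and the name dates_chr are assumptions for illustration.

```r
dates_chr <- rep("04-14-2020", 5)                     # character version
dates <- as.Date(dates_chr, format = "%m-%d-%Y")      # Date version

class(dates_chr)   # "character"
class(dates)       # "Date"

dates[1] + 7                         # one week later: "2020-04-21"
as.Date("2020-05-01") - dates[1]     # Time difference of 17 days
```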

OK, time for a walk! See you tomorrow!

1.4 Day 4: Working with variables

Remember that R can act as a simple calculator, and that one can create new variables by assignment

Let’s apply these ideas to our minutes vector from earlier in the week.

We can perform basic arithmetic on vectors. Suppose we wanted to increase the time of each activity by 5 minutes

or to increase the time of the first two activities by 5 minutes, and the last three activities by 10 minutes
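
Both calculations can be sketched as follows, assuming the illustrative minutes vector used earlier (its last value is an assumption):

```r
minutes <- c(20, 30, 60, 60, 60)

minutes + 5                       # 25 35 65 65 65: 5 is added to every element
minutes + c(5, 5, 10, 10, 10)     # 25 35 70 70 70: added element by element
```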

R has a very large number of functions that can be used on vectors. For instance, the average time spent on activities is

while the total amount of time is
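
For the illustrative minutes vector (last value assumed), these would be:

```r
minutes <- c(20, 30, 60, 60, 60)

mean(minutes)   # 46
sum(minutes)    # 230
```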

Explore other typical mathematical transformations, e.g., log(), log10(), sqrt() (square root), … Check out the help pages for each, e.g., ?log.

Explore the consequences of NA in a vector for functions like mean() and sum().

R is saying that, since there is an unknown (NA) value in the vector, it cannot possibly know what the mean is! Tell R to remove the missing values before performing the calculation by adding the na.rm = TRUE argument
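
A sketch with a hypothetical vector in which one duration is unknown (the name minutes_with_na is made up):

```r
minutes_with_na <- c(20, 30, NA, 60, 60)   # third value is not available

mean(minutes_with_na)                  # NA
sum(minutes_with_na)                   # NA
mean(minutes_with_na, na.rm = TRUE)    # 42.5, the mean of the four known values
sum(minutes_with_na, na.rm = TRUE)     # 170
```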

Check out the help page ?mean to find a description of the na.rm and other arguments.

It’s possible to perform logical operations on vectors, e.g., to ask which activities lasted 60 minutes or more

Here’s our activity vector

The elements of this vector are numbered from 1 to 5. We can create a new vector that is a subset of this vector using [ and an integer index, e.g., the second activity is

The index can actually be a vector, so we could choose the second and fourth activity as

In fact, we can use logical vectors for subsetting. Consider the activities that take sixty minutes or longer:
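
The three kinds of subsetting described above, sketched with the illustrative activity and minutes vectors (labels and the final duration are assumptions):

```r
activity <- c("check e-mail", "breakfast", "conference call", "webinar", "walk")
minutes <- c(20, 30, 60, 60, 60)

activity[2]               # "breakfast": a single integer index
activity[c(2, 4)]         # "breakfast" "webinar": a vector of indices
minutes >= 60             # FALSE FALSE TRUE TRUE TRUE
activity[minutes >= 60]   # elements where the logical vector is TRUE
```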

We had previously characterized the activities as ‘work’ or otherwise.

Use is_work to subset activity and identify the work-related activities

How many minutes were work-related?

What about not work related? ! negates logical vectors, so
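
A sketch of these steps, again using illustrative vectors (the labels, the final duration, and the work classification are assumptions):

```r
activity <- c("check e-mail", "breakfast", "conference call", "webinar", "walk")
minutes <- c(20, 30, 60, 60, 60)
is_work <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

activity[is_work]         # the work-related activities
sum(minutes[is_work])     # 140 minutes of work
!is_work                  # FALSE TRUE FALSE FALSE TRUE
sum(minutes[!is_work])    # 90 minutes not related to work
```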

Note that it doesn’t make sense to take the mean() of a character vector like activity, and R signals a warning and returns NA

Nonetheless, there are many functions that do work on character vectors, e.g., the number of letters in each element nchar(), or transformation to upper-case
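
For example, with the illustrative activity labels:

```r
activity <- c("check e-mail", "breakfast", "conference call", "webinar", "walk")

nchar(activity)     # 12 9 15 7 4: characters in each element, spaces included
toupper(activity)   # the same labels in upper case
```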

1.5 Day 5 (Friday) Zoom check-in

1.5.1 Logistics

  • Please join Microsoft Teams! Need help? Contact Adam.Kisailus at RoswellPark.org.

1.5.2 Review and troubleshoot (25 minutes; Martin)

Other fun topics

%in%: a binary operator
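
A brief sketch: %in% asks, for each element on the left, whether it occurs anywhere in the vector on the right. The activity labels are illustrative.

```r
activity <- c("check e-mail", "breakfast", "conference call", "webinar", "walk")

"walk" %in% activity              # TRUE
c("walk", "nap") %in% activity    # TRUE FALSE: one answer per left-hand element
```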

named vectors (see Annual Estimates… table from census.gov)
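
A minimal sketch of a named vector. It uses made-up activity durations rather than the census table, so the names and values here are purely illustrative.

```r
# Each element gets a name; elements can then be selected by name
minutes <- c(email = 20, breakfast = 30, walk = 60)

names(minutes)                 # "email" "breakfast" "walk"
minutes["walk"]                # the element named "walk"
minutes[c("email", "walk")]    # several elements by name
```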

1.5.3 Weekend activities (25 minutes; Shawn)

Writing R scripts

R scripts are convenient text files that we can use to save one or more lines of R syntax. Over the weekend you will get some experience working with R scripts. The example below will help you be a bit more prepared.

  • In RStudio, click File --> New File --> R Script to create a new script file and open it in the editor.

If you’ve followed the daily coding activities throughout the week, you should have some R code that keeps track of your daily activities.

  • If so, enter that code into your R script now.

  • Otherwise, feel free to use the code below. Look for a copy to clipboard icon in the top-right of the code block. To copy the code block to your R script:
    • Click on the copy to clipboard icon
    • Place your cursor in your R script
    • Click Edit --> Paste:

Recall the discussion of factors and levels in Day 3; the code below leverages this but adds another level named independent work.

  • If you’ve already got code to assign factors and levels to your daily activity, enter that code into your R script now.

  • Otherwise, feel free to use the code below via the copy to clipboard procedure outlined above:

On day 3 you also got some experience working with dates. The code below stamps our day 1 activity data with an appropriately formatted date.

  • If you’ve already got code to assign dates to your daily activity, enter that code into your R script now.

  • Otherwise, feel free to use the code below via the copy to clipboard procedure outlined above:

[OPTIONAL ADVANCED MATERIAL]

Earlier today Dr. Morgan touched on named vectors. We can leverage named vectors to create a more general mapping between activities and levels. The code for this is given below. Try it and compare the result to your manual mapping!
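
One way this could be sketched is shown below. The activity labels, the level names, and the mapping itself (activity_map) are assumptions for illustration, not the course's own code.

```r
activity <- c("check e-mail", "breakfast", "conference call", "webinar", "walk")
activity_levels <- c("connect", "exercise", "consult", "hobby",
                     "essential", "independent work")

# Hypothetical mapping: names are activities, values are their categories
activity_map <- c(
    "check e-mail"    = "independent work",
    "breakfast"       = "essential",
    "conference call" = "connect",
    "webinar"         = "consult",
    "walk"            = "exercise"
)

# Look up each activity by name, then convert the result to a factor
classification <- factor(unname(activity_map[activity]), levels = activity_levels)
classification
```

The advantage of the mapping is that it is written once and then applies to any activity vector that uses the same labels.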

Saving R scripts

If you’ve been following along you should now have an R script that contains a bunch of code for keeping track of your daily activity log. Let’s save this file:

  • In RStudio, place your cursor anywhere in the script file
  • click File --> Save (or press CTRL+S)
    • Name your file something like daily_activity.R.

Running R scripts

Now that we’ve created an R script you may be wondering “How do I run the code in the script?” There are actually a few ways to do this:

Option #1 (Run)

  • Highlight the first block of the code (e.g., the part where you recorded day 1 activity and maybe calculated amount of time worked).
  • Click the --> Run icon in the top-right portion of the script editor window.
    • This will run the highlighted block of code. The output will appear in the RStudio console window along with an echo of the code itself.

Option #2 (Source)

  • Click on the --> Source icon just to the right of the --> Run icon.
  • This will run the entire script.
  • Equivalent to calling source() with the path to your script in the console

  • Only the output generated by print() and cat() will appear in the RStudio console (i.e. the code in the script is not echoed to the console).

Option #3 (Source with Echo)

  • Click on the downward pointing arrowhead next to the Source button to open a dropdown menu
  • In the dropdown menu, select Source with Echo
  • This will run the entire script and the code in the script will be echoed to the RStudio console along with any output generated by print() and cat().
  • The echoed source and the normal output are not color-coded like they are when using the --> Run button.
  • Equivalent to running source() on the script with the argument echo = TRUE
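
The console equivalents of Options #2 and #3 can be sketched as follows. To keep the example self-contained it writes a tiny throwaway script to a temporary file; in practice you would pass the path to your own script, e.g., daily_activity.R.

```r
# Create a small throwaway script (a stand-in for your own script file)
script <- tempfile(fileext = ".R")
writeLines(c("x <- 1 + 1", "print(x)"), script)

source(script)                # Option #2: only output from print() / cat() appears
source(script, echo = TRUE)   # Option #3: each line of code is echoed as well
```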

Saving data

It can be useful to save objects created in an R script as a data file. These data files can be loaded or re-loaded into a new or existing R session.

For example, let’s suppose you had an R script that mined a trove of Twitter feeds for sentiment data related to government responses to COVID-19. Suppose you ran the script for several weeks and collected lots of valuable data into a bunch of vectors. Even though the R code is saved as a script file, the data that the script is collecting would be lost once the script stops running. Furthermore, due to the temporal nature of Twitter feeds, you wouldn’t be able to collect the same data by simply re-running the script. Luckily, R provides several routines for saving and loading objects. Placing the appropriate code in your R script will ensure that your data is preserved even after the script stops running.

Saving individual R objects

R supports storing a single R object as an .rds file. For example, the code below saves the day1_activity vector to an .rds file. The saveRDS() function is the workhorse in this case and the setwd(), getwd(), and file.path() commands allow us to conveniently specify a name and location for the data file:
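
A minimal sketch of saving one object. The contents of day1_activity and the file name are made up, and the sketch writes to R's temporary directory via tempdir() as a stand-in; to save into your working directory, use getwd() (after an optional setwd()) in its place.

```r
day1_activity <- c("check e-mail", "breakfast", "conference call")   # hypothetical data

data_dir <- tempdir()   # stand-in location; getwd() would give the working directory
rds_file <- file.path(data_dir, "day1_activity.rds")   # build the full file path

saveRDS(day1_activity, rds_file)   # write the single object to the file
file.exists(rds_file)              # TRUE
```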

Loading individual R objects

The complement to the saveRDS() function is the readRDS() function. It loads the R object stored in the specified file. In the example below a data file is loaded and stored as an object named day1_activity_loaded. Compare this object to the existing day1_activity object - they should be the same!
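
A self-contained sketch of the round trip; the data and the file location (R's temporary directory) are assumptions for illustration.

```r
day1_activity <- c("check e-mail", "breakfast", "conference call")   # hypothetical data
rds_file <- file.path(tempdir(), "day1_activity.rds")
saveRDS(day1_activity, rds_file)

# readRDS() returns the object, so assign it to whatever name you like
day1_activity_loaded <- readRDS(rds_file)
identical(day1_activity, day1_activity_loaded)   # TRUE
```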

Saving multiple R objects

The save() function will save one or more objects into a .Rdata file (these are also known as session files). The example below saves various day1 and related factor-level objects to an .Rdata file.
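
A sketch with a few made-up objects; the object names, their values, and the file name are assumptions, and the file is written to R's temporary directory.

```r
day1_activity <- c("check e-mail", "breakfast", "conference call")   # hypothetical
day1_minutes <- c(20, 30, 60)                                        # hypothetical
activity_levels <- c("connect", "exercise", "consult", "hobby", "essential")

rdata_file <- file.path(tempdir(), "day1.Rdata")

# List the objects to save, then name the file with the 'file' argument
save(day1_activity, day1_minutes, activity_levels, file = rdata_file)
file.exists(rdata_file)   # TRUE
```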

If you have many objects that you want to save, listing them all can be tedious. Fortunately, the ls() command provides a list of all objects in the current R session. The results of ls() can be passed along to the save() command and this will result in all objects being saved. An example of the required syntax is given below.
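
The syntax uses the list argument of save(), which accepts a character vector of object names. The objects and file name below are illustrative, and the file goes to R's temporary directory.

```r
day1_activity <- c("check e-mail", "breakfast")   # hypothetical
day1_minutes <- c(20, 30)                         # hypothetical

all_objects_file <- file.path(tempdir(), "all_objects.Rdata")

# ls() returns the names of all objects in the session as a character vector
save(list = ls(), file = all_objects_file)
file.exists(all_objects_file)   # TRUE
```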

Loading multiple R objects

The complement to the save() function is the load() function. This will load all objects stored in an .Rdata file into the current R session. Example syntax is given below:
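
A self-contained sketch: save two made-up objects, remove them from the session, and bring them back with load(). Object names, values, and the file name are assumptions.

```r
day1_activity <- c("check e-mail", "breakfast")   # hypothetical
day1_minutes <- c(20, 30)                         # hypothetical
rdata_file <- file.path(tempdir(), "day1_reload.Rdata")
save(day1_activity, day1_minutes, file = rdata_file)

rm(day1_activity, day1_minutes)   # simulate starting a fresh session

loaded <- load(rdata_file)   # restores the objects under their original names
loaded                       # load() returns the names of what it restored
day1_minutes                 # 20 30
```

Unlike readRDS(), load() does not let you choose new names: objects come back with the names they had when saved, replacing any existing objects of the same name.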

It is also possible to load an .Rdata file using the RStudio interface.

  • Click Session --> Load Workspace ...
  • A file browser dialog will open
  • Navigate to the .Rdata file and select it

Wrapping up day 5

Today we reviewed the concepts that you worked with throughout the week during your independent activity. We also worked through any problems or questions that came up during this time. Finally, we previewed the creation and use of R scripts and learned about saving and loading objects. Over the weekend you will gain some more experience with these topics.

1.6 Day 6: R scripts

Some of you may have already started saving your R commands as script files. As the material gets more complicated (and more interesting) everyone will want to start doing this. Here is an example to get you started:

  • Recall that we can create a script file in RStudio: click “File –> New File –> R Script” to create a new script file and open it in the editor

  • By convention, R scripts have a .R extension (e.g. my_script.R)
    • In RStudio, click into your untitled script and click “File –> Save”
    • Name your file something fun like my_first_script.R and save it
  • Use the # character for comments. Enter the following into your R Script file:
  • Enter each command on a separate line. It’s also possible to enter multiple (short!) commands on a single line, separated by a semi-colon ;
  • Use the “Run” button in RStudio to run the highlighted portion of an R script file. Try this on your simple R Script.

  • Alternatively, use “Run –> Run All” to run an entire script file.
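
A tiny script illustrating the points above about comments and semicolons; the variable names and values are arbitrary.

```r
# This whole line is a comment and is ignored by R
x <- 1          # a comment can also follow a command

# Two short commands on one line, separated by a semicolon
y <- 2; z <- 3

x + y + z       # 6
```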

For today’s exercise, create a script file that summarizes your quarantine activities over several days. Use comments, white space (blank lines and spaces), and variable names to summarize each day. Here’s what I’ve got…

Try concatenating these values, e.g.,
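
A sketch of what such a script might contain, including the concatenation step. All variable names and values below are made up for illustration; substitute your own log.

```r
## Day 1 (hypothetical entries)
day1_activity <- c("check e-mail", "breakfast", "conference call")
day1_minutes  <- c(20, 30, 60)

## Day 2 (hypothetical entries)
day2_activity <- c("webinar", "walk")
day2_minutes  <- c(60, 60)

## Combine the days into one log
activity <- c(day1_activity, day2_activity)
minutes  <- c(day1_minutes, day2_minutes)

length(activity)   # 5
sum(minutes)       # 230
```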

Save your script, quit R and RStudio, and restart R. Re-open and run the script to re-do your original work.

Think about how this makes your work reproducible from one day to the next, and how making your scientific work reproducible would be advantageous.

1.7 Day 7: Saving data

We’ve defined these variables

Individual variables can be saved to a file.

Use save() and load() to save and load several objects.

As an exercise…

See you in Zoom on Monday!