Week 3: Packages and the ‘tidyverse’

3.1 Day 15 (Monday) Zoom check-in

Review and troubleshoot (15 minutes)

Over the weekend, I wrote two functions. The first retrieves and ‘cleans’ the US data set.

The second plots the data for a particular county and state.

I lived in Seattle (King County, Washington) for a while, and that is where the first serious US outbreak occurred. Here’s the relevant data:
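Neither function was reproduced above. A sketch of what they might look like, assuming the New York Times county-level data (columns `date`, `county`, `state`, `cases`, `deaths`) and hypothetical names `us_covid_data()` and `plot_county()`:

```r
library(readr)
library(dplyr)
library(ggplot2)

## hypothetical reconstruction -- the course's actual functions were not shown
us_covid_data <- function(url) {
    read_csv(url) %>%
        mutate(county = factor(county), state = factor(state))
}

plot_county <- function(tbl, the_county, the_state) {
    tbl %>%
        filter(county == the_county, state == the_state) %>%
        ggplot(aes(x = date, y = cases)) +
        geom_line() +
        labs(title = paste0(the_county, " County, ", the_state))
}
```

Usage would then be something like `us_covid_data(url) %>% plot_county("King", "Washington")`.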

Packages (20 minutes)

Base R

  • R consists of ‘packages’ that implement different functionality. Each package contains functions that we can use, and perhaps data sets (like the mtcars data set from Friday’s presentation) and other resources.

  • R comes with several ‘base’ packages installed, and these are available in a new R session.

  • Discover the packages currently attached (and hence available for use) with the search() function. This shows that the ‘stats’, ‘graphics’, ‘grDevices’, ‘utils’, ‘datasets’, ‘methods’, and ‘base’ packages, among others, are available in our current R session.

  • When we create a variable (x, say), R creates a new symbol x in the .GlobalEnv location on the search path.

  • When we evaluate a function call like length(x):

    • R searches for the function length() along the search() path. It doesn’t find length() in the .GlobalEnv (because we didn’t define it there), or in the ‘stats’, ‘graphics’, … packages. Eventually, R finds the definition of length in the ‘base’ package.

    • R then looks for the definition of x, and finds it in the .GlobalEnv.

    • Finally, R applies the definition of length found in the base package to the value of x found in the .GlobalEnv.
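The steps above can be made concrete at the console (output shown as comments; your search() path may differ slightly):

```r
x <- c(1, 2, 3)   # creates the symbol `x` in the .GlobalEnv

## search() lists the environments R consults, in order
head(search())
## e.g., ".GlobalEnv" "package:stats" "package:graphics" ...

find("length")    # where is length() defined?
## [1] "package:base"

length(x)         # base::length() applied to the x found in .GlobalEnv
## [1] 3
```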

Contributed packages

  • R would be pretty limited if it could only do things that are defined in the base packages.

  • It is ‘easy’ to write a package, and to make the package available for others to use.

  • A major repository of contributed packages is CRAN – the Comprehensive R Archive Network. There are more than 15,000 packages in CRAN.

  • Many CRAN packages are arranged in task views that highlight the most useful packages.

Installing and attaching packages

  • There are too many packages for all of them to be distributed with R, so it is necessary to install the contributed packages that you find interesting.

  • Once a package is installed (you only need to install a package once), it can be ‘loaded’ and ‘attached’ to the search path using library().

  • As an exercise, try to attach the ‘readr’, ‘dplyr’, and ‘ggplot2’ packages.

  • If any of these fails with a message like

    it means that the package has not been installed (or that you have a typo in the name of the package!).

  • Install any package that failed when library() was called with

    Alternatively, use the RStudio interface to select (in the lower right panel, by default) the ‘Packages’ tab, ‘Install’ button.

  • One package may use functions from one or more other packages, so when you install, for instance, ‘dplyr’, you may actually install several packages (its dependencies).
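The install-then-attach pattern might be captured in a small helper like the following (a sketch; the helper name is my own, and requireNamespace() is used only to avoid re-installing a package that is already present):

```r
## Install (once) and attach each package in `pkgs`.
attach_packages <- function(pkgs) {
    for (pkg in pkgs) {
        if (!requireNamespace(pkg, quietly = TRUE))
            install.packages(pkg)            # also installs dependencies
        library(pkg, character.only = TRUE)  # attach to the search() path
    }
}

## attach_packages(c("readr", "dplyr", "ggplot2"))
```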

The ‘tidyverse’ of packages (20 minutes)

The ‘tidyverse’ of packages provides a very powerful paradigm for working with data.

  • The tidyverse is based on the idea that a first step in data analysis is to transform the data into a standard format. Subsequent steps can then be accomplished in a much more straightforward way, using a small set of functions.

  • Hadley Wickham’s ‘Tidy Data’ paper provides a kind of manifesto for what constitutes tidy data:

    1. Each variable forms a column.
    2. Each observation forms a row.
    3. Each type of observational unit forms a table.
  • We’ll look at the readr package for data input, and the dplyr package for essential data manipulation.

readr for fast data input
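readr::read_csv() is the tidyverse analogue of base R’s read.csv(): it is fast, it reports the column types it guessed, and it returns a tibble. A small self-contained illustration (writing a temporary file rather than using the course data URL):

```r
library(readr)

## write a tiny CSV so the example is self-contained
csv <- tempfile(fileext = ".csv")
writeLines(c(
    "date,county,cases",
    "2020-03-01,Erie,1",
    "2020-03-02,Erie,4"
), csv)

## column types are guessed: date as Date, county as character,
## cases as double; the result is a tibble
tbl <- read_csv(csv)
tbl
```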

dplyr for data manipulation

Other common verbs (see tomorrow’s quarantine)

  • mutate() (add or update) columns
  • summarize() one or more columns
  • group_by() one or more variables when performing computations. ungroup() removes the grouping.
  • arrange() rows based on values in particular column(s); desc() in descending order.
  • count() the number of times values occur
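The verbs compose naturally with the pipe. A sketch using the built-in mtcars data (rather than the course’s COVID data; the kml column is an invented example):

```r
library(dplyr)

mtcars %>%
    as_tibble() %>%
    mutate(kml = mpg * 0.425) %>%                 # mutate(): add a column
    group_by(cyl) %>%                             # group_by(): per-group computation
    summarize(n = n(), mean_kml = mean(kml)) %>%  # summarize(): one row per group
    arrange(desc(mean_kml))                       # arrange(): descending order
```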

Other ‘tidyverse’ packages

  • Packages adopting the ‘tidy’ approach to data representation and management are sometimes referred to as the tidyverse.

  • ggplot2 implements high-quality data visualization in a way consistent with tidy data representations.

  • The tidyr package implements functions that help to transform data to ‘tidy’ format; we’ll use pivot_longer() later in the week.

3.2 Day 16 Key tidyverse packages: readr and dplyr

Start a script for today. Work through the following commands, adding appropriate lines to your script as you go.
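As a sketch of the kind of commands involved (the file name, the helper name, and the column layout `date`, `state`, `cases`, `deaths` are assumptions; the course used a COVID data URL here):

```r
library(readr)
library(dplyr)

## hypothetical helper: read a CSV and show the most recent
## rows for one state
recent_state <- function(path, which_state, n = 5) {
    read_csv(path) %>%
        filter(state == which_state) %>%
        select(date, cases, deaths) %>%
        arrange(desc(date)) %>%
        head(n)
}

## recent_state("us-states.csv", "New York")
```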

3.3 Day 17 Visualization with ggplot2

ggplot2 essentials

The ‘gg’ in ggplot2

A first plot
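ggplot2 builds a plot from data, an aesthetic mapping aes(), and one or more geometries. A minimal sketch using the built-in mtcars data (the course’s first plot used the COVID data):

```r
library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg)) +  # data and aesthetic mapping
    geom_point() +                      # geometry: one point per row
    labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
```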

COVID-19 in Erie county

New cases

New cases and mortality

‘Long’ data and an alternative approach to plotting multiple curves
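The idea: instead of plotting two columns with two geom calls, pivot the wide table so each row is one (date, metric, count) observation, then map the new metric column to color. A sketch with made-up numbers:

```r
library(tidyr)
library(dplyr)
library(ggplot2)

wide <- tibble(
    date       = as.Date("2020-03-01") + 0:2,
    new_cases  = c(1, 4, 9),
    new_deaths = c(0, 1, 2)
)

## two value columns become one `metric` / `count` pair of columns
long <- pivot_longer(wide, -date, names_to = "metric", values_to = "count")

ggplot(long, aes(x = date, y = count, color = metric)) +
    geom_line()
```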

COVID-19 in New York State

We’ll explore ‘facet’ visualizations, which create a panel of related plots.

Setup

Visualization
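A facet sketch with made-up counts for two counties; facet_wrap() creates one panel per county:

```r
library(dplyr)
library(ggplot2)

df <- tibble(
    date   = rep(as.Date("2020-03-01") + 0:2, times = 2),
    county = rep(c("Erie", "Westchester"), each = 3),
    cases  = c(1, 2, 4, 10, 20, 40)
)

p <- ggplot(df, aes(x = date, y = cases)) +
    geom_line() +
    facet_wrap(vars(county), scales = "free_y")  # one panel per county
p
```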

COVID-19 nationally

Setup

3.4 Day 18 Worldwide COVID data

Setup

Source

‘Tidy’ data

Exploration

Visualization

It seems like it would be convenient to capture our data cleaning and visualization steps into separate functions that can be re-used, e.g., on different days or for different visualizations.

3.5 Day 19 (Friday) Zoom check-in

3.5.1 Logistics

  • Stick around after class to ask any questions.

  • Remember Microsoft Teams for questions during the week.

3.5.2 Review and troubleshoot (40 minutes)

Setup

Packages

  • install.packages() versus library()
  • Symbol resolution: dplyr::filter()
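Both stats and dplyr define a function named filter(); attaching dplyr masks stats::filter(), and the startup message says so. The :: operator names the package explicitly, so there is no ambiguity:

```r
library(dplyr)  # message: 'filter' is masked from 'package:stats'

## row filtering, as intended; stats::filter() is a different
## function (linear filtering of time series)
dplyr::filter(mtcars, cyl == 4)
```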

The tibble

Verbs

Cleaning: tidyr::pivot_longer()

Visualization

Global data

3.5.3 Weekend activities (15 minutes)

  • Explore global pandemic data
  • Critically reflect on data resources and interpretation

3.6 Day 20 Exploring the course of pandemic in different regions

Use the data and functions from quarantine day 18 to place the pandemic into quantitative perspective. Start by retrieving the current data.

Start with the United States

  • When did ‘stay at home’ orders come into effect? Did they appear to be effective?

  • When would the data suggest that the pandemic might be considered ‘under control’, and country-wide stay-at-home orders might be relaxed?

Explore other countries.

Where does your own exploration of the data take you?

3.7 Day 21 Critical evaluation

The work so far has focused on the mechanics of processing and visualizing the COVID-19 state and global data. Use today to reflect on interpretation of the data. For instance:

  • To what extent is the exponential increase in COVID-19 cases in the early stages of the pandemic simply due to increased availability of testing?

  • How does presentation of data influence interpretation? It might be interesting to visualize, e.g., US cases on a linear as well as a log scale, and to investigate different ways of ‘smoothing’ the data, e.g., the trailing 7-day average case number. For the latter, one could write a small function and apply it to new_cases prior to visualization; again, the result could be presented on a log-transformed or linear scale.
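The smoothing function was not reproduced here; one possible implementation uses stats::filter() with sides = 1 for a trailing (one-sided) moving average:

```r
## trailing n-day average; the first n - 1 values are NA
trailing_average <- function(x, n = 7) {
    as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))
}

trailing_average(1:10)
##  [1] NA NA NA NA NA NA  4  5  6  7
```

This could then be applied with, e.g., `mutate(avg = trailing_average(new_cases))` before plotting.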

  • Can you collect additional data, e.g., on population numbers from the US census, to explore the relationship between COVID impact and socioeconomic factors?

  • How has COVID response been influenced by social and political factors in different parts of the world? My (Martin’s) narratives were sketched out to some extent on Day 20. What are your narratives?

  • What opportunities are there to intentionally or unintentionally mis-represent or mis-interpret the seemingly ‘hard facts’ of COVID infection?