Data input and manipulation • BERD

Example data: BRFSS

We use data from the US Center for Disease Control’s Behavioral Risk Factor Surveillance System (BRFSS) annual survey. Check out the web page for a little more information. We are using a small subset of this data, including a random sample of 20000 observations from each of 1990 and 2010.

Data input

From Excel to CSV

Getting a dataset into R is a necessary and sometimes tedious process. Often data starts as an Excel spreadsheet…

Note that this is a straight-forward sheet – the first row provides a column name. Each subsequent row corresponds to a single observation. There are no mouse-over ‘comments’ or secondary tables or… This simple structure is a good starting point for moving data to R.

Our first step is to export the data from Excel to a ‘CSV’ file. We do this because (a) CSV files are easily read into R and (b) many other data sources can be transformed to CSV format. Chose the ‘File’ menu and then ‘Save’. On macOS I then chose to save the data as CSV with a dialog box that looked like…

This procedure would create a file ‘BRFSS-subset.csv’ on our computer.

We can’t start with an Excel file, so instead download the data we will work with using the following command. You might need to make sure that the folder ‘BRFSS-workshop’ exists at the file system location returned by getwd().

getwd()
url <- "https://raw.githubusercontent.com/mtmorgan/BERD/main/inst/extdata/BRFSS-subset.csv"
destination <- file.path("BRFSS-workshop", basename(url))
if (!file.exists(destination)) {
    dir.create("data")
    download.file(url, destination)
}

CSV to R

Read the data into R as a tibble using the readr package.

brfss <- readr::read_csv(destination)
#> Rows: 20000 Columns: 5
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): Sex
#> dbl (4): Age, Weight, Height, Year
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
brfss
#> # A tibble: 20,000 × 5
#>      Age Weight Sex    Height  Year
#>    <dbl>  <dbl> <chr>   <dbl> <dbl>
#>  1    31   49.0 Female   157.  1990
#>  2    57   81.6 Female   157.  1990
#>  3    43   80.3 Male     178.  1990
#>  4    72   70.3 Male     170.  1990
#>  5    31   49.9 Female   155.  1990
#>  6    58   54.4 Female   155.  1990
#>  7    45   69.9 Male     173.  1990
#>  8    37   68.0 Male     180.  1990
#>  9    33   65.8 Female   170.  1990
#> 10    75   70.8 Female   152.  1990
#> # ℹ 19,990 more rows

Explore the data using dplyr

Attach dplyr to our current R session.

library(dplyr)

Useful functions for data exploration

count() – count occurence of values in one or more columns
summarize() – more flexible summary of columns
group_by() – summarize or perform other operations by group

Functions for data manipulation

mutate() – update or add columns
filter() – remove rows based on column values
select() – remove, rearrange, or re-name columns

`count()`

Count of observed values for one…

brfss |>
    count(Year)
#> # A tibble: 2 × 2
#>    Year     n
#>   <dbl> <int>
#> 1  1990 10000
#> 2  2010 10000
brfss |>
    count(Sex)
#> # A tibble: 2 × 2
#>   Sex        n
#>   <chr>  <int>
#> 1 Female 12039
#> 2 Male    7961

…or several columns

brfss |>
    count(Year, Sex)
#> # A tibble: 4 × 3
#>    Year Sex        n
#>   <dbl> <chr>  <int>
#> 1  1990 Female  5718
#> 2  1990 Male    4282
#> 3  2010 Female  6321
#> 4  2010 Male    3679

`summarize()`

More flexible summary, e.g., number of observations n() or mean values of columns, removing NA, mean(, na.rm = TRUE).

brfss |>
    summarize(
        n = n(),
        ave_age = mean(Age, na.rm = TRUE),
        ave_wt = mean(Weight, na.rm = TRUE),
        ave_ht = mean(Height, na.rm = TRUE)
    )
#> # A tibble: 1 × 4
#>       n ave_age ave_wt ave_ht
#>   <int>   <dbl>  <dbl>  <dbl>
#> 1 20000    51.0   75.4   169.

`group_by()`

Often data sets contain groups that can be summarized separately, e.g., calculating mean values by Year and Sex.

brfss |>
    group_by(Year, Sex) |>
    summarize(
        n = n(),
        ave_age = mean(Age, na.rm = TRUE),
        ave_wt = mean(Weight, na.rm = TRUE),
        ave_ht = mean(Height, na.rm = TRUE)
    )
#> `summarise()` has grouped output by 'Year'. You can override using the
#> `.groups` argument.
#> # A tibble: 4 × 6
#> # Groups:   Year [2]
#>    Year Sex        n ave_age ave_wt ave_ht
#>   <dbl> <chr>  <int>   <dbl>  <dbl>  <dbl>
#> 1  1990 Female  5718    46.2   64.8   163.
#> 2  1990 Male    4282    43.9   81.2   178.
#> 3  2010 Female  6321    57.1   73.0   163.
#> 4  2010 Male    3679    56.2   88.8   178.

`mutate()`

Year and Sex should really be factors. Also, add a column for log-10 transformed weight.

brfss_clean <-
    brfss |>
    mutate(
        Year = factor(Year, levels = c("1990", "2010")),
        Sex = factor(Sex, levels = c("Female", "Male")),
        Log10Weight = log10(Weight)
    )
brfss_clean
#> # A tibble: 20,000 × 6
#>      Age Weight Sex    Height Year  Log10Weight
#>    <dbl>  <dbl> <fct>   <dbl> <fct>       <dbl>
#>  1    31   49.0 Female   157. 1990         1.69
#>  2    57   81.6 Female   157. 1990         1.91
#>  3    43   80.3 Male     178. 1990         1.90
#>  4    72   70.3 Male     170. 1990         1.85
#>  5    31   49.9 Female   155. 1990         1.70
#>  6    58   54.4 Female   155. 1990         1.74
#>  7    45   69.9 Male     173. 1990         1.84
#>  8    37   68.0 Male     180. 1990         1.83
#>  9    33   65.8 Female   170. 1990         1.82
#> 10    75   70.8 Female   152. 1990         1.85
#> # ℹ 19,990 more rows

`filter()`

Create a subset of the data that includes only Male samples, …

brfss_male <-
    brfss_clean |>
    filter(Sex == "Male")
brfss_male
#> # A tibble: 7,961 × 6
#>      Age Weight Sex   Height Year  Log10Weight
#>    <dbl>  <dbl> <fct>  <dbl> <fct>       <dbl>
#>  1    43   80.3 Male    178. 1990         1.90
#>  2    72   70.3 Male    170. 1990         1.85
#>  3    45   69.9 Male    173. 1990         1.84
#>  4    37   68.0 Male    180. 1990         1.83
#>  5    56   88.5 Male    180. 1990         1.95
#>  6    74   81.6 Male    183. 1990         1.91
#>  7    19   93.0 Male    183. 1990         1.97
#>  8    35   97.5 Male    193. 1990         1.99
#>  9    60   78.0 Male    170. 1990         1.89
#> 10    29   77.1 Male    175. 1990         1.89
#> # ℹ 7,951 more rows

…or only Female samples from 2010.

brfss_female_2010 <-
    brfss_clean |>
    filter(Sex == "Female", Year == "2010")
brfss_female_2010
#> # A tibble: 6,321 × 6
#>      Age Weight Sex    Height Year  Log10Weight
#>    <dbl>  <dbl> <fct>   <dbl> <fct>       <dbl>
#>  1    46   45.4 Female    165 2010         1.66
#>  2    49  145.  Female    183 2010         2.16
#>  3    53   54.4 Female    155 2010         1.74
#>  4    63   77.1 Female    170 2010         1.89
#>  5    53   82.1 Female    168 2010         1.91
#>  6    64   90.7 Female    163 2010         1.96
#>  7    49   77.1 Female    173 2010         1.89
#>  8    77   72.6 Female    168 2010         1.86
#>  9    70   79.4 Female    178 2010         1.90
#> 10    83   93.0 Female    157 2010         1.97
#> # ℹ 6,311 more rows

`select()`

Use select() to choose particular columns, or to re-order columns

brfss_male |>
    select(Year, Age, Weight, Log10Weight, Height)
#> # A tibble: 7,961 × 5
#>    Year    Age Weight Log10Weight Height
#>    <fct> <dbl>  <dbl>       <dbl>  <dbl>
#>  1 1990     43   80.3        1.90   178.
#>  2 1990     72   70.3        1.85   170.
#>  3 1990     45   69.9        1.84   173.
#>  4 1990     37   68.0        1.83   180.
#>  5 1990     56   88.5        1.95   180.
#>  6 1990     74   81.6        1.91   183.
#>  7 1990     19   93.0        1.97   183.
#>  8 1990     35   97.5        1.99   193.
#>  9 1990     60   78.0        1.89   170.
#> 10 1990     29   77.1        1.89   175.
#> # ℹ 7,951 more rows

Initial visualization

What is the relationship between Height and Weight in Female samples from 2010?

plot(Weight ~ Height, brfss_female_2010)

With Weight on a log scale?

plot(Weight ~ Height, brfss_female_2010, log = "y")

## another way, but maybe the Y-axis scale is harder to interpret?
## plot(Log10Weight ~ Height, brfss_female_2010)

Does it look like Males differ in weight between 1990 and 2010, plotted on a log scale?

brfss_male |>
    group_by(Year) |>
    summarize(
        n = n(),
        ave_wt = mean(Weight, na.rm = TRUE),
        ave_log10_wt = mean(Log10Weight, na.rm = TRUE)
    )
#> # A tibble: 2 × 4
#>   Year      n ave_wt ave_log10_wt
#>   <fct> <int>  <dbl>        <dbl>
#> 1 1990   4282   81.2         1.90
#> 2 2010   3679   88.8         1.94
plot(Weight ~ Year, brfss_male, log = "y")

Session information

sessionInfo()
#> R version 4.3.3 (2024-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.1.4
#> 
#> loaded via a namespace (and not attached):
#>  [1] bit_4.0.5         jsonlite_1.8.8    highr_0.10        compiler_4.3.3   
#>  [5] crayon_1.5.2      tidyselect_1.2.1  parallel_4.3.3    jquerylib_0.1.4  
#>  [9] systemfonts_1.0.6 textshaping_0.3.7 yaml_2.3.8        fastmap_1.1.1    
#> [13] readr_2.1.5       R6_2.5.1          generics_0.1.3    knitr_1.45       
#> [17] tibble_3.2.1      desc_1.4.3        bslib_0.6.2       pillar_1.9.0     
#> [21] tzdb_0.4.0        rlang_1.1.3       utf8_1.2.4        cachem_1.0.8     
#> [25] xfun_0.43         fs_1.6.3          sass_0.4.9        bit64_4.0.5      
#> [29] memoise_2.0.1     cli_3.6.2         withr_3.0.0       pkgdown_2.0.7    
#> [33] magrittr_2.0.3    digest_0.6.35     vroom_1.6.5       hms_1.1.3        
#> [37] lifecycle_1.0.4   vctrs_0.6.5       evaluate_0.23     glue_1.7.0       
#> [41] ragg_1.3.0        fansi_1.0.6       rmarkdown_2.26    purrr_1.0.2      
#> [45] tools_4.3.3       pkgconfig_2.0.3   htmltools_0.5.8