Statistical Analysis and Additional Visualization • BERD

Recap: BRFSS data input

Download (if necessary) and input the data from day 2 using readr

url <- "https://raw.githubusercontent.com/mtmorgan/BERD/main/inst/extdata/BRFSS-subset.csv"
destination <- file.path("data", basename(url))
if (!file.exists(destination)) {
    dir.create("data")
    download.file(url, destination)
}

brfss <- readr::read_csv(destination)

Attach the dplyr package and perform the same initial data transformations, and create subsets of data.

library(dplyr)

brfss_clean <-
    brfss |>
    mutate(
        Year = factor(Year, levels = c("1990", "2010")),
        Sex = factor(Sex, levels = c("Female", "Male")),
        Log10Weight = log10(Weight)
    )

brfss_male <-
    brfss_clean |>
    filter(Sex == "Male")

brfss_female_2010 <-
    brfss_clean |>
    filter(Sex == "Female", Year == "2010")

Statistical analysis

t-test

We saw that Males in 2010 seemed to be heavier than Males in 1990

brfss_male |>
    group_by(Year) |>
    summarize(
        n = n(),
        ave_wt = mean(Weight, na.rm = TRUE),
        ave_log10_wt = mean(Log10Weight, na.rm = TRUE)
    )
#> # A tibble: 2 × 4
#>   Year      n ave_wt ave_log10_wt
#>   <fct> <int>  <dbl>        <dbl>
#> 1 1990   4282   81.2         1.90
#> 2 2010   3679   88.8         1.94
plot(Weight ~ Year, brfss_male, log = "y")

But is this difference statistically significant? Answer this using a t-test to compare the mean of two groups.

Start by consulting the help page ?t.test with ‘Usage’ labeled “S3 method for class ‘formula’”. Use the Log10Weight column so that we are comparing log-transformed Weight.

t.test(Log10Weight ~ Year, brfss_male)
#> 
#>  Welch Two Sample t-test
#> 
#> data:  Log10Weight by Year
#> t = -20.202, df = 6985.4, p-value < 2.2e-16
#> alternative hypothesis: true difference in means between group 1990 and group 2010 is not equal to 0
#> 95 percent confidence interval:
#>  -0.03971348 -0.03268805
#> sample estimates:
#> mean in group 1990 mean in group 2010 
#>           1.903711           1.939912

The p-value displayed here is < 2.2e-16, so we are very confident that Males in 2010 are heavier than Males in 1990.

The ‘sample estimates’ agree with the values we calculated using group_by() and summarize(); this provides us with some confidence that we have not done something very wrong.

Note that t.test() actually returns an object that we can capture and manipulate, e.g., to obtain a vector of group mean estimates.

t_test <- t.test(Log10Weight ~ Year, brfss_male)
t_test$estimate
#> mean in group 1990 mean in group 2010 
#>           1.903711           1.939912

Linear regression

It seems like there was a positive relationship between Height and (log 10) Weight in Females in 2010.

plot(Weight ~ Height, brfss_female_2010, log = "y")

Fitting a linear regression in R is a two-step process

Use lm() to fit the linear model
Use summary() or anova() to assess the statistical significance of the fitted model.

Start by fitting the model, using notation like that used to plot the data; use Log10Weight so the fit is to the log-transformed data.

fit <- lm(Log10Weight ~ Height, brfss_female_2010)
fit
#> 
#> Call:
#> lm(formula = Log10Weight ~ Height, data = brfss_female_2010)
#> 
#> Coefficients:
#> (Intercept)       Height  
#>    1.114139     0.004515

The fit object doesn’t initially seem very useful – it tells us what we did, and summarizes the intercept (1.1141389) and slope (0.0045155), but does not provide an ANOVA table or other summary about whether the fit is statistically significant.

Use anova() to obtain an ANOVA table summarizing the fit.

anova(fit)
#> Analysis of Variance Table
#> 
#> Response: Log10Weight
#>             Df Sum Sq Mean Sq F value    Pr(>F)    
#> Height       1  6.330  6.3304  709.98 < 2.2e-16 ***
#> Residuals 5929 52.865  0.0089                      
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The table indicates that there is a highly significant relationship (Pr(>F) is < 2.2e-16) between Weight and Height. summary() provides the ANOVA table and additional information

summary(fit)
#> 
#> Call:
#> lm(formula = Log10Weight ~ Height, data = brfss_female_2010)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.28440 -0.06600 -0.01124  0.05810  0.48532 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 1.1141389  0.0276899   40.24   <2e-16 ***
#> Height      0.0045155  0.0001695   26.65   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.09443 on 5929 degrees of freedom
#>   (390 observations deleted due to missingness)
#> Multiple R-squared:  0.1069, Adjusted R-squared:  0.1068 
#> F-statistic:   710 on 1 and 5929 DF,  p-value: < 2.2e-16

e.g., showing that about 10% (‘Adjusted R-squared’) of the variation in log 10 Weight is explained by Height. There are likely many other factors contributing to Weight variation.

To visualize the linear regression, plot the data points and then use abline() to add the regression line; we also add a ‘main’ title to the plot.

plot(
    Log10Weight ~ Height, brfss_female_2010,
    main = "Log 10 Weight vs Height, BRFSS Female 2010"
)
abline(fit, col = "red", lwd = 4) # red color, 4x default line width

Additional analysis and visualization

Let’s use the ggplot2 package to perform additional analysis and visualization

library(ggplot2)

ggplot2 implements a ‘grammar of graphics’. The author of the package has provided a unique way to create plots – start with a ggplot() graph, and then add aes()thetics (what will be plotted) and geom()etries (how to plot the aesthetics), and additional layers to, e.g., transform the y-axis to a log scale.

ggplot(brfss_female_2010) +
    aes(x = Height, y = Weight) +
    geom_point() +
    scale_y_log10()

Add a linear regression using geom_smooth(method = "lm")

ggplot(brfss_female_2010) +
    aes(x = Height, y = Weight) +
    geom_point() +
    geom_smooth(method = "lm") +
    scale_y_log10()
#> `geom_smooth()` using formula = 'y ~ x'

Note that the fitted regression includes confidence bands, which can be very helpful when trying to interpret the fit.

Let’s not specify a ‘method’ for geom_smooth(), letting ggplot2 use it’s default method…

ggplot(brfss_female_2010) +
    aes(x = Height, y = Weight) +
    geom_point() +
    geom_smooth() +
    scale_y_log10()
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot2 has chosen to use a ‘generalized additive model’ that fits a smoothed curve. This makes fewer a priori assumptions about the relationship between Weight and Height, and in particular suggests that (a) the linear relationship is appropriate for the central part of the data, but not at the extremes of Height and (b) there is considerable uncertainty at the extremes of Height. This approach (making fewer assumptions about the relationship between variables) is very appropriate in early stages of exploratory analysis, where one is still trying to understand data without making additional or unnecessary assumptions.

Using AI

Session information

sessionInfo()
#> R version 4.3.3 (2024-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggplot2_3.5.0 dplyr_1.1.4  
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.9        utf8_1.2.4        generics_0.1.3    lattice_0.22-5   
#>  [5] hms_1.1.3         digest_0.6.35     magrittr_2.0.3    evaluate_0.23    
#>  [9] grid_4.3.3        fastmap_1.1.1     Matrix_1.6-5      jsonlite_1.8.8   
#> [13] mgcv_1.9-1        purrr_1.0.2       fansi_1.0.6       scales_1.3.0     
#> [17] textshaping_0.3.7 jquerylib_0.1.4   cli_3.6.2         rlang_1.1.3      
#> [21] crayon_1.5.2      splines_4.3.3     bit64_4.0.5       munsell_0.5.0    
#> [25] withr_3.0.0       cachem_1.0.8      yaml_2.3.8        tools_4.3.3      
#> [29] parallel_4.3.3    tzdb_0.4.0        memoise_2.0.1     colorspace_2.1-0 
#> [33] vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4   fs_1.6.3         
#> [37] bit_4.0.5         vroom_1.6.5       ragg_1.3.0        pkgconfig_2.0.3  
#> [41] desc_1.4.3        pkgdown_2.0.7     pillar_1.9.0      bslib_0.6.2      
#> [45] gtable_0.3.4      glue_1.7.0        systemfonts_1.0.6 xfun_0.43        
#> [49] tibble_3.2.1      tidyselect_1.2.1  highr_0.10        knitr_1.45       
#> [53] farver_2.1.1      nlme_3.1-164      htmltools_0.5.8   rmarkdown_2.26   
#> [57] labeling_0.4.3    readr_2.1.5       compiler_4.3.3