Register or log-in
choose the ‘ARTNet 2023’
Wait a minute or so
Click to open RStudio
In RStudio, choose ‘File’ / ‘Open File…’ / ‘vignettes/a_r.Rmd’
This article introduces R as an essential tool in
exploratory analysis of bioinformatic data. No previous experience with
R is required. We start with a very short introduction to
R, mentioning vectors, functions, and the
data.frame
for representing tabular data. We introduces
‘packages’ that extend base R functionality, and explore some
essential data management tasks, reading a ‘CSV’ (comma-separated value)
file, and visualizing data. We conclude in a short work-flow to read in
data and perform a Kaplan-Meier survival analysis, including generating
a graphical summary.
A simple calculator
1 + 1
#> [1] 2
‘Vectors’ as building blocks
c(1, 2, 3)
#> [1] 1 2 3
c("January", "February", "March")
#> [1] "January" "February" "March"
c(TRUE, FALSE)
#> [1] TRUE FALSE
Variables, missing values and ‘factors’
age <- c(27, NA, 32, 29)
gender <- factor(
c("Female", "Male", "Non-binary", NA),
levels = c("Female", "Male", "Non-binary")
)
Data structures to coordinate related vectors – the
data.frame
df <- data.frame(
age = c(27, NA, 32, 29),
gender = gender
)
df
#> age gender
#> 1 27 Female
#> 2 NA Male
#> 3 32 Non-binary
#> 4 29 <NA>
Key operations on data.frame
df[1:3, c("gender", "age")]
– subset on rows and
columnsdf[["age"]]
, df$age
– select columnsFunctions
rnorm(5) # 5 random normal deviates
#> [1] -0.83431802 0.05965308 -0.62848408 1.01184318 -0.45826876
x <- rnorm(100) # 100 random normal deviates
hist(x) # histogram, approximately normal
‘Vectorized’ operations, e.g., element-wise addition without an explicit ‘for’ loop
y <- x + rnorm(100)
plot(y ~ x)
fit <- lm(y ~ x)
fit # an R 'object' containing information about the
#>
#> Call:
#> lm(formula = y ~ x)
#>
#> Coefficients:
#> (Intercept) x
#> 0.2055 0.9410
# regression of y on x
abline(fit) # plot points and fitted regression line
anova(fit) # statistical summary of linear regression
#> Analysis of Variance Table
#>
#> Response: y
#> Df Sum Sq Mean Sq F value Pr(>F)
#> x 1 78.946 78.946 78.973 3.147e-14 ***
#> Residuals 98 97.966 1.000
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Write your own functions
hello <- function(who) {
paste("hello", who, "with", nchar(who), "letters in your name")
}
hello("Martin")
#> [1] "hello Martin with 6 letters in your name"
Iterate, usually with lapply()
although
for()
is available
Extend functionality of base R. Can be part of the ‘base’ distribution…
df <- data.frame(
x = rnorm(100),
y = x + rnorm(100)
)
… or a package contributed by users to the Comprehensive R Archive Network (CRAN), or to Bioconductor or other repositories. A particularly common suite of packages is the ‘tidyverse’. To use the dplyr contributed CRAN package, load the package
A dplyr ‘tibble’ is like a ‘data.frame’, but more user-friendly
tbl <- tibble(
x = rnorm(100),
y = x + rnorm(100)
)
tbl # e.g., only displays the first 10 rows
#> # A tibble: 100 × 2
#> x y
#> <dbl> <dbl>
#> 1 1.94 3.99
#> 2 0.0804 1.02
#> 3 -0.534 -0.707
#> 4 0.980 0.714
#> 5 -1.36 -2.13
#> 6 -1.54 -0.592
#> 7 0.0789 0.120
#> 8 -0.993 0.621
#> 9 0.0219 -0.0292
#> 10 0.405 2.54
#> # ℹ 90 more rows
The tidyverse makes use of ‘pipes’ |>
(the older
syntax is %>%
). A pipe takes the left-hand side and pass
through to the right-hand side. Key dplyr ‘verbs’ can be
piped together
tibble()
– representation of a data.frame
,
with better display of long and wide data frames. tribble()
constructs a tibble in a way that makes the relationship between data
across rows more transparent.glimpse()
– providing a quick look into the columns and
data in the tibble by transposing the tibble and display each ‘column’
on a single line.select()
– column selection.filter()
, slice()
– row selection.pull()
– extract a single column as a vector.mutate()
– column transformation.count()
– count occurrences in one or more
columns.arrange()
– order rows by values in one or more
columns.distinct()
– reduce a tibble to only unique rows.group_by()
– perform computations on groups defined by
one or several columns.summarize()
– calculate summary statistics for
groups.left_join()
, right_join()
,
inner_join()
– merge two tibbles based on shared columns,
preserving all rows in the first (left_join()
) or second
(right_join()
) or both (inner_join()
)
tibble.
tbl |>
## e.g., just rows with non-negative values of x and y
filter(x > 0, y > 0) |>
## add a column
mutate(distance_from_origin = sqrt(x^2 + y^2))
#> # A tibble: 39 × 3
#> x y distance_from_origin
#> <dbl> <dbl> <dbl>
#> 1 1.94 3.99 4.44
#> 2 0.0804 1.02 1.02
#> 3 0.980 0.714 1.21
#> 4 0.0789 0.120 0.144
#> 5 0.405 2.54 2.57
#> 6 0.800 1.27 1.50
#> 7 0.559 0.668 0.871
#> 8 0.206 1.34 1.36
#> 9 1.87 0.972 2.10
#> 10 0.165 0.424 0.455
#> # ℹ 29 more rows
A ‘classic’ built-in data set – Motor Trend ‘cars’ from 1974…
‘tidyverse’ eschews rownames, so make these a column. Use
group_by()
to summarize by group (cyl
).
n()
is a function from dplyr that returns the number of
records in a group.
mtcars_tbl <-
mtcars |>
as_tibble(rownames = "model") |>
mutate(cyl = factor(cyl))
mtcars_tbl
#> # A tibble: 32 × 12
#> model mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ℹ 22 more rows
mtcars_tbl |>
group_by(cyl) |>
summarize(
n = n(),
mean_mpg = mean(mpg, na.rm = TRUE),
var_mpg = var(mpg, na.rm = TRUE)
)
#> # A tibble: 3 × 4
#> cyl n mean_mpg var_mpg
#> <fct> <int> <dbl> <dbl>
#> 1 4 11 26.7 20.3
#> 2 6 7 19.7 2.11
#> 3 8 14 15.1 6.55
Let’s return to a basic tibble
Use the contributed package ggplot2 for visualization.
library(ggplot2)
ggplot(tbl) +
aes(x, y) + # use 'x' and 'y' columns for plotting...
geom_point() + # ...plot points...
geom_smooth(method = "lm") # ...linear regresion
Check out plotly, especially for interactive visualization (e.g., ‘tooltips’ when mousing over points, or dragging to subset and zoom in)
library(plotly)
plt <-
ggplot(mtcars_tbl) +
aes(x = cyl, y = mpg, text = model) +
geom_jitter(width = .25) +
geom_boxplot()
ggplotly(plt)
CRAN: Comprehensive R Archive Network. More than 18,000 packages. Some help from CRAN Task Views in identifying relevant packages.
Bioconductor: More than 2100 packages relevant to high-throughput genomic analysis. Vignettes are an important part of Bioconductor packages.
Install packages once per R installation, using
BiocManager::install(<package-name>)
(CRAN or
Bioconductor)
What about GitHub? Packages haven’t been checked by a formal system, so may have incomplete code, documentation, dependencies on other packages, etc. Authors may not yet be committed to long-term maintenance of their package.
Help pages, e.g., ?lm
Vignettes, e.g.,
Google, StackOverflow, etc…
ChatGPT – my three types of experiences
These notes are from Survival Analysis with R
We use data derived from the built-in ‘veterans’ dataset from the
‘Veterans’ Administration Lung Cancer study’; see ?veterans
for details.
## use `file.choose()` or similar for your own data sets
veteran_csv <- system.file(package = "ARTNet2023", "extdata/veteran.csv")
veteran <- readr::read_csv(veteran_csv)
#> Rows: 137 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): celltype
#> dbl (7): trt, time, status, karno, diagtime, age, prior
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
veteran
#> # A tibble: 137 × 8
#> trt celltype time status karno diagtime age prior
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 squamous 72 1 60 7 69 0
#> 2 1 squamous 411 1 70 5 64 10
#> 3 1 squamous 228 1 60 3 38 0
#> 4 1 squamous 126 1 60 9 63 10
#> 5 1 squamous 118 1 70 11 65 10
#> 6 1 squamous 10 1 20 5 49 0
#> 7 1 squamous 82 1 40 10 69 10
#> 8 1 squamous 110 1 80 29 68 0
#> 9 1 squamous 314 1 50 18 43 0
#> 10 1 squamous 100 0 70 6 70 0
#> # ℹ 127 more rows
Columns in the datasset are described as:
Some of the covariate can be summarized using ‘tidy’ functions, e.g.,
veteran |>
count(celltype, trt) |>
tidyr::pivot_wider(names_from = trt, values_from = n)
#> # A tibble: 4 × 3
#> celltype `1` `2`
#> <chr> <int> <int>
#> 1 adeno 9 18
#> 2 large 15 12
#> 3 smallcell 30 18
#> 4 squamous 15 20
For continuous variables like age
, it may be useful to
stratify the data, e.g., separating individuals as less than 60
(LT60
), or greater than or equal to 60 (GE60
).
For trt
, the encoding is a double
, but it
makes more sense for this to be a categorical variable
veteran <-
veteran |>
mutate(
age_group = factor(ifelse((age < 60), "LT60", "GE60")),
trt = factor(trt, labels = c("control", "treatment"))
)
The celltype
and trt
summary table is more
informative
veteran |>
count(celltype, trt) |>
tidyr::pivot_wider(names_from = trt, values_from = n)
#> # A tibble: 4 × 3
#> celltype control treatment
#> <chr> <int> <int>
#> 1 adeno 9 18
#> 2 large 15 12
#> 3 smallcell 30 18
#> 4 squamous 15 20
Fit a standard survival model to the entire dataset
It is useful to know the ‘class’ of km
class(km)
#> [1] "survfit"
And to get a summary of the data after different lengths of time…
summary(km, times = c(1, 30, 60, 90, 180, 360, 720))
#> Call: survfit(formula = Surv(time, status) ~ 1, data = veteran)
#>
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 1 137 2 0.985 0.0102 0.96552 1.0000
#> 30 97 39 0.700 0.0392 0.62774 0.7816
#> 60 73 22 0.538 0.0427 0.46070 0.6288
#> 90 62 10 0.464 0.0428 0.38731 0.5560
#> 180 27 30 0.222 0.0369 0.16066 0.3079
#> 360 10 15 0.090 0.0265 0.05061 0.1602
#> 720 2 8 0.018 0.0126 0.00459 0.0707
… but nothing communicates like a figure (for help with the
autoplot
function applied to survfit
object,
see ?autoplot.survfit
).
autoplot(km)
It is easy to fit Kaplan-Meier curves to groups, e.g,. treatment…
…or age group
Kaplan-Meier plots are useful for visualization, but is there
statistical support for an effect of treatment? Use the Cox proportional
hazards model as a basis for statistical understand. In the summary
table, trttest
indicates that the summary reports the
effect of test
compared to standard
as the
base line.
cox <- coxph(Surv(time, status) ~ trt, veteran)
summary(cox)
#> Call:
#> coxph(formula = Surv(time, status) ~ trt, data = veteran)
#>
#> n= 137, number of events= 128
#>
#> coef exp(coef) se(coef) z Pr(>|z|)
#> trttreatment 0.01774 1.01790 0.18066 0.098 0.922
#>
#> exp(coef) exp(-coef) lower .95 upper .95
#> trttreatment 1.018 0.9824 0.7144 1.45
#>
#> Concordance= 0.525 (se = 0.026 )
#> Likelihood ratio test= 0.01 on 1 df, p=0.9
#> Wald test = 0.01 on 1 df, p=0.9
#> Score (logrank) test = 0.01 on 1 df, p=0.9
It is straight-forward to test more complicated models, e.g., whether
age
is a statistically meaningful contribution to survival,
although interpretation even of this ‘simple’ model should probably be
discussed with a statistician.
cox <- coxph(Surv(time, status) ~ age + trt, veteran)
summary(cox)
#> Call:
#> coxph(formula = Surv(time, status) ~ age + trt, data = veteran)
#>
#> n= 137, number of events= 128
#>
#> coef exp(coef) se(coef) z Pr(>|z|)
#> age 0.007527 1.007556 0.009661 0.779 0.436
#> trttreatment -0.003654 0.996352 0.182514 -0.020 0.984
#>
#> exp(coef) exp(-coef) lower .95 upper .95
#> age 1.0076 0.9925 0.9887 1.027
#> trttreatment 0.9964 1.0037 0.6967 1.425
#>
#> Concordance= 0.514 (se = 0.029 )
#> Likelihood ratio test= 0.63 on 2 df, p=0.7
#> Wald test = 0.62 on 2 df, p=0.7
#> Score (logrank) test = 0.62 on 2 df, p=0.7
This document was produced with the following R software:
sessionInfo()
#> R version 4.3.0 (2023-04-21)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggfortify_0.4.16 survival_3.5-5 plotly_4.10.2 ggplot2_3.4.2
#> [5] dplyr_1.1.2
#>
#> loaded via a namespace (and not attached):
#> [1] gtable_0.3.3 xfun_0.39 bslib_0.5.0 htmlwidgets_1.6.2
#> [5] lattice_0.21-8 tzdb_0.4.0 vctrs_0.6.3 tools_4.3.0
#> [9] crosstalk_1.2.0 generics_0.1.3 parallel_4.3.0 tibble_3.2.1
#> [13] fansi_1.0.4 highr_0.10 pkgconfig_2.0.3 Matrix_1.5-4.1
#> [17] data.table_1.14.8 desc_1.4.2 lifecycle_1.0.3 compiler_4.3.0
#> [21] farver_2.1.1 stringr_1.5.0 textshaping_0.3.6 munsell_0.5.0
#> [25] htmltools_0.5.5 sass_0.4.6 yaml_2.3.7 lazyeval_0.2.2
#> [29] crayon_1.5.2 pillar_1.9.0 pkgdown_2.0.7 jquerylib_0.1.4
#> [33] tidyr_1.3.0 ellipsis_0.3.2 cachem_1.0.8 nlme_3.1-162
#> [37] tidyselect_1.2.0 digest_0.6.32 stringi_1.7.12 purrr_1.0.1
#> [41] labeling_0.4.2 splines_4.3.0 rprojroot_2.0.3 fastmap_1.1.1
#> [45] grid_4.3.0 colorspace_2.1-0 cli_3.6.1 magrittr_2.0.3
#> [49] utf8_1.2.3 readr_2.1.4 withr_2.5.0 scales_1.2.1
#> [53] bit64_4.0.5 rmarkdown_2.22 httr_1.4.6 bit_4.0.5
#> [57] gridExtra_2.3 ragg_1.2.5 hms_1.1.3 memoise_2.0.1
#> [61] evaluate_0.21 knitr_1.43 viridisLite_0.4.2 mgcv_1.8-42
#> [65] rlang_1.1.1 glue_1.6.2 vroom_1.6.3 jsonlite_1.8.5
#> [69] R6_2.5.1 systemfonts_1.0.4 fs_1.6.2