---
title: "ggbetweenstats"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{ggbetweenstats}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r}
#| label = "setup",
#| include = FALSE
source("../setup.R")
```
```{r}
#| label = "suggested_pkgs",
#| include = FALSE
pkgs <- c(
  "gapminder",
  "PMCMRplus"
)
successfully_loaded <- purrr::map_lgl(pkgs, requireNamespace, quietly = TRUE)
can_evaluate <- all(successfully_loaded)
if (can_evaluate) {
  purrr::walk(pkgs, library, character.only = TRUE)
} else {
  knitr::opts_chunk$set(eval = FALSE)
}
```
---
You can cite this package/vignette as:
```{r}
#| label = "citation",
#| echo = FALSE,
#| comment = ""
citation("ggstatsplot")
```
---
Lifecycle: [![lifecycle](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://lifecycle.r-lib.org/articles/stages.html)
The function `ggbetweenstats` is designed to facilitate **data exploration**,
and for making highly customizable **publication-ready plots**, with relevant
statistical details included in the plot itself if desired. We will see examples
of how to use this function in this vignette.

To begin with, here are some instances where you would want to use
`ggbetweenstats`:

- to check if a continuous variable differs across multiple groups/conditions
- to compare distributions visually

**Note**: This vignette uses the pipe operator (`%>%`). If you are not
familiar with this operator, here is a good explanation:
<http://r4ds.had.co.nz/pipes.html>

## Comparisons between groups with `ggbetweenstats`

To illustrate how this function can be used, we will use the `gapminder` dataset
throughout this vignette. This dataset provides values for life expectancy, GDP
per capita, and population, at 5-year intervals, from 1952 to 2007, for each of
142 countries (courtesy [Gapminder Foundation](https://www.gapminder.org/)).
Let's have a look at the data:

```{r}
#| label = "gapminder"
library(gapminder)
dplyr::glimpse(gapminder::gapminder)
```

**Note**: For the remainder of the vignette, we're going to exclude *Oceania*
from the analysis simply because there are so few observations (countries).

Suppose the first thing we want to inspect is the distribution of life
expectancy for the countries of a continent in 2007. We also want to know if the
mean differences in life expectancy between the continents are statistically
significant.

The simplest form of the function call is:

```{r}
#| label = "ggbetweenstats1",
#| fig.height = 6,
#| fig.width = 8
ggbetweenstats(
data = dplyr::filter(gapminder::gapminder, year == 2007, continent != "Oceania"),
x = continent,
y = lifeExp
)
```

**Note**:

- Based on the number of levels in the grouping variable, the function
  automatically decides whether an independent samples *t*-test (2 groups) or
  a one-way ANOVA (3 or more groups) is appropriate.
- The output of the function is a `ggplot` object, which means that it can be
  further modified with `{ggplot2}` functions.

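
Because the return value is an ordinary `ggplot` object, further components can be added with `+`. The following is a minimal sketch, assuming `{ggplot2}` and `{gapminder}` are installed; the object names `p` and `p_custom` and the label text are chosen here purely for illustration:

```{r}
#| label = "ggbetweenstats_modify",
#| fig.height = 6,
#| fig.width = 8
library(ggstatsplot)

## same data and call as in the simplest example above
p <- ggbetweenstats(
  data = dplyr::filter(gapminder::gapminder, year == 2007, continent != "Oceania"),
  x = continent,
  y = lifeExp
)

## add axis labels and tweak the theme with regular ggplot2 functions
p_custom <- p +
  ggplot2::labs(x = "Continent", y = "Life expectancy (years)") +
  ggplot2::theme(axis.title = ggplot2::element_text(face = "bold"))

p_custom
```

The statistical annotations are left untouched by these additions; only the axis titles change.
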
As can be seen from the plot, the function by default returns a Bayes Factor for
the test. If the null hypothesis can't be rejected with the null hypothesis
significance testing (NHST) approach, the Bayesian approach can help index
evidence in favor of the null hypothesis (i.e., $BF_{01}$).

By default, natural logarithms are shown because Bayes Factor values can
sometimes be pretty large. Having values on a logarithmic scale also makes it
easy to compare evidence in favor of the alternative ($BF_{10}$) versus the null
($BF_{01}$) hypothesis (since $\log_{e}(BF_{01}) = -\log_{e}(BF_{10})$).

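
To spell out the relationship between the two log Bayes Factors: they are reciprocals of each other on the original scale, so their logarithms differ only in sign. The value $\log_{e}(BF_{10}) = 2$ used below is an arbitrary illustration, not a result taken from the plot.

$$
BF_{01} = \frac{1}{BF_{10}} \quad \Rightarrow \quad \log_{e}(BF_{01}) = -\log_{e}(BF_{10})
$$

For instance, $\log_{e}(BF_{10}) = 2$ corresponds to $BF_{10} = e^{2} \approx 7.39$ and $BF_{01} = e^{-2} \approx 0.14$.
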
We can make the output much more aesthetically pleasing as well as informative
by making use of the many optional parameters in `ggbetweenstats`. We'll add a
title, a caption, and better `x` and `y` axis labels. We will also change the
overall theme and the color palette.

```{r}
#| label = "ggbetweenstats2",
#| fig.height = 6,
#| fig.width = 8
ggbetweenstats(
  data = dplyr::filter(gapminder, year == 2007, continent != "Oceania"),
  x = continent, ## grouping/independent variable
  y = lifeExp, ## dependent variable
  type = "robust", ## type of statistics
  xlab = "Continent", ## label for the x-axis
  ylab = "Life expectancy", ## label for the y-axis
  ggtheme = ggplot2::theme_gray(), ## a different theme
  package = "yarrr", ## package from which color palette is to be taken
  palette = "info2", ## choosing a different color palette
  title = "Comparison of life expectancy across continents (Year: 2007)",
  caption = "Source: Gapminder Foundation"
) + ## modifying the plot further
  ggplot2::scale_y_continuous(
    limits = c(35, 85),
    breaks = seq(from = 35, to = 85, by = 5)
  )
```
As can be appreciated from the effect size (partial eta squared) of 0.635, there
are large differences in the mean life expectancy across continents.
Importantly, this plot also helps us appreciate the distributions within any
given continent. For example, although Asian countries are doing much better
than African countries, on average, Afghanistan has a particularly grim average
for the Asian continent, possibly reflecting the war and the political turmoil.

So far we have used the default boxviolin plot with parametric and robust
tests, but we can also use other available options:

- The `type` (of test) argument also accepts the following abbreviations:
  `"p"` (for *parametric*), `"np"` (for *nonparametric*), `"r"` (for
  *robust*), `"bf"` (for *Bayes Factor*).
- The type of plot to be displayed can also be modified (box, violin, or
  boxviolin); the examples below do this by setting the width of the violin or
  box layer to 0 via `violin.args` or `boxplot.args`.
- The color palettes can be modified.

Let's use the `combine_plots` function to make one plot from four separate
plots that demonstrate all of these options. Let's compare life expectancy for
all countries for two years of available data, 1957 and 2007. We will generate
the plots one by one and then use `combine_plots` to merge them into one plot
with some common labeling. It is possible, but not necessarily recommended, to
make each plot have different colors or themes.

For example,

```{r}
#| label = "ggbetweenstats3",
#| fig.height = 10,
#| fig.width = 12
## selecting subset of the data
df_year <- dplyr::filter(gapminder::gapminder, year == 2007 | year == 1957)
p1 <- ggbetweenstats(
data = df_year,
x = year,
y = lifeExp,
xlab = "Year",
ylab = "Life expectancy",
# to remove violin plot
violin.args = list(width = 0),
type = "p",
conf.level = 0.99,
title = "Parametric test",
package = "ggsci",
palette = "nrc_npg"
)
p2 <- ggbetweenstats(
data = df_year,
x = year,
y = lifeExp,
xlab = "Year",
ylab = "Life expectancy",
# to remove box plot
boxplot.args = list(width = 0),
type = "np",
conf.level = 0.99,
title = "Non-parametric Test",
package = "ggsci",
palette = "uniform_startrek"
)
p3 <- ggbetweenstats(
data = df_year,
x = year,
y = lifeExp,
xlab = "Year",
ylab = "Life expectancy",
type = "r",
conf.level = 0.99,
title = "Robust Test",
tr = 0.005,
package = "wesanderson",
palette = "Royal2",
digits = 3
)
## Bayes Factor for parametric t-test, with the violin, box, and point layers hidden
p4 <- ggbetweenstats(
data = df_year,
x = year,
y = lifeExp,
xlab = "Year",
ylab = "Life expectancy",
type = "bayes",
violin.args = list(width = 0),
boxplot.args = list(width = 0),
point.args = list(alpha = 0),
title = "Bayesian Test",
package = "ggsci",
palette = "nrc_npg"
)
## combining the individual plots into a single plot
combine_plots(
list(p1, p2, p3, p4),
plotgrid.args = list(nrow = 2),
annotation.args = list(
title = "Comparison of life expectancy between 1957 and 2007",
caption = "Source: Gapminder Foundation"
)
)
```
## Grouped analysis with `grouped_ggbetweenstats`

What if we want to analyze both by continent and across years, combining our
two previous efforts?

`{ggstatsplot}` provides a special helper function for such instances:
`grouped_ggbetweenstats`. This is merely a wrapper function around
`combine_plots`. It applies `ggbetweenstats` across all **levels**
of a specified **grouping variable** and then combines the list of individual
plots into a single plot. Note that the grouping variable can be anything:
conditions in a given study, groups in a study sample, different studies, etc.

Let's focus on the same 4 continents for the following years: 1967, 1987, 2007.
Also, let's carry out pairwise comparisons to see if there are differences
between every pair of continents.

```{r}
#| label = "grouped1",
#| fig.height = 18,
#| fig.width = 8
## select part of the dataset and use it for plotting
gapminder::gapminder %>%
dplyr::filter(year %in% c(1967, 1987, 2007), continent != "Oceania") %>%
grouped_ggbetweenstats(
## arguments relevant for ggbetweenstats
x = continent,
y = lifeExp,
grouping.var = year,
xlab = "Continent",
ylab = "Life expectancy",
pairwise.display = "significant", ## display only significant pairwise comparisons
p.adjust.method = "fdr", ## adjust p-values for multiple tests using this method
# ggtheme = ggthemes::theme_tufte(),
package = "ggsci",
palette = "default_jco",
## arguments relevant for combine_plots
annotation.args = list(title = "Changes in life expectancy across continents (1967-2007)"),
plotgrid.args = list(nrow = 3)
)
```

As seen from the plot, although life expectancy has been improving steadily
across all continents as we go from 1967 to 2007, this improvement has not been
happening at the same rate for all continents. Additionally, irrespective of
which year we look at, we still find significant differences in life expectancy
across continents, which have been surprisingly consistent across four decades
(based on the observed effect sizes).

## Grouped analysis with `ggbetweenstats` + `{purrr}`

Although this grouping function provides a quick way to explore the data, it
leaves much to be desired. For example, the same type of plot and test is
applied for all years, but maybe we want to change this for different years, or
maybe we want to have different effect sizes for different years. This type of
customization for different levels of a grouping variable is not possible with
`grouped_ggbetweenstats`, but it can be easily achieved using the `{purrr}`
package.

See the associated vignette here:
<https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/purrr_examples.html>
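
As a rough illustration of the idea (not the approach taken in the linked vignette), the data can be split by year and a different test type passed to `ggbetweenstats` for each subset. This sketch assumes `{purrr}`, `{gapminder}`, and `{PMCMRplus}` are installed; the names `df_sub`, `df_list`, `test_types`, and `plot_list` are made up for this example:

```{r}
#| label = "purrr_sketch",
#| fig.height = 12,
#| fig.width = 8
library(ggstatsplot)

## keep two years and split into one data frame per year
df_sub <- dplyr::filter(
  gapminder::gapminder,
  year %in% c(1967, 2007),
  continent != "Oceania"
)
df_list <- split(df_sub, df_sub$year)

## one test type per year, in the same order as `df_list`
test_types <- c("parametric", "robust")

## build each plot with its own test type and title
plot_list <- purrr::map2(
  .x = df_list,
  .y = test_types,
  .f = function(df, test_type) {
    ggbetweenstats(
      data = df,
      x = continent,
      y = lifeExp,
      type = test_type,
      title = paste0("Year: ", df$year[1], " (", test_type, ")")
    )
  }
)

## merge the individual plots into a single figure
combine_plots(plot_list, plotgrid.args = list(nrow = 2))
```

Any other argument of `ggbetweenstats` can be varied per subset in the same way by adding it to the anonymous function.
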
## Within-subjects designs

For repeated measures designs, the `ggwithinstats()` function can be used:
<https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/ggwithinstats.html>
## Summary of graphics and tests

Details about the underlying functions used to create the graphics and the statistical tests carried out can be found in the function documentation:
<https://indrajeetpatil.github.io/ggstatsplot/reference/ggbetweenstats.html>
## Reporting
```{asis, file="../../man/md-fragments/reporting.md"}
```

For example, consider the following plot:

```{r}
#| label = "reporting",
#| fig.height = 6,
#| fig.width = 8
ggbetweenstats(ToothGrowth, supp, len)
```

The narrative context (assuming `type = "parametric"`) can complement this plot
either as a figure caption or in the main text:

> Welch's *t*-test revealed that, across 60 guinea pigs, although the tooth
> length was higher when the animal received vitamin C via orange juice as
> compared to via ascorbic acid, this effect was not statistically significant.
> The effect size $(g = 0.49)$ was medium, as per Cohen's (1988) conventions. The
> Bayes Factor for the same analysis revealed that the data were `r round(exp(0.18), 2)`
> times more probable under the alternative hypothesis as
> compared to the null hypothesis. This can be considered weak evidence
> (Jeffreys, 1961) in favor of the alternative hypothesis.

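
The inline expression in the narrative converts a log Bayes Factor back to the original scale. The arithmetic can be sketched as follows, assuming $\log_{e}(BF_{10}) = 0.18$ (the value used in the narrative above); the variable names are illustrative only:

```{r}
#| label = "bf_arithmetic"
## natural log of the Bayes Factor in favor of the alternative hypothesis
log_bf10 <- 0.18

## back-transform to the original scale
bf10 <- exp(log_bf10) ## evidence for the alternative over the null
bf01 <- exp(-log_bf10) ## evidence for the null over the alternative

round(bf10, 2)
round(bf01, 2)
```

The two values are reciprocals of each other, so either one can be reported as long as the direction of the comparison is stated.
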
A similar reporting style can be followed when the function performs a one-way
ANOVA instead of a *t*-test.

## Suggestions
If you find any bugs or have any suggestions/remarks, please file an issue on
`GitHub`: <https://github.com/IndrajeetPatil/ggstatsplot/issues>