-
Notifications
You must be signed in to change notification settings - Fork 25
/
Copy pathgroup_by.qmd
296 lines (206 loc) · 11.6 KB
/
group_by.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
---
output: html_document
editor_options:
chunk_output_type: console
---
# Grouping and summarizing
```{r echo=FALSE}
source("libs/Common.R")
```
```{r echo = FALSE}
pkg_ver(c("dplyr"))
```
----
In Chapter 9, you were introduced to the `summarise()` function, which aggregates an entire column using functions like `mean()` or `sum()`. This summarization can also be performed within groups allowing calculations to be done separately for each category of a grouping variable.
## Summarizing data by group
Let's first create a dataframe listing the average delay time in minutes, by day of the week and by quarter, for Logan airport's 2014 outbound flights.
```{r}
df <- data.frame(
Weekday = factor(rep(c("Mon", "Tues", "Wed", "Thurs", "Fri"), times = 4),
levels = c("Mon", "Tues", "Wed", "Thurs", "Fri")),
Quarter = paste0("Q", rep(1:4, each = 5)),
Delay = c(9.9, 4.9, 8.8, 12.2, 12.2, 5.4, 9.7, 11.1, 10.2, 8.1, 8.8, 7.9, 10.2,
9.2, 7.9, 6.9, 5.0, 9.3, 9.7, 5.6))
```
The goal will be to summarize the table by `Weekday` as shown in the following graphic.
<img src="img/Summarize_by_one_variable.png" style="width: 73%; height: auto;">
The data table has three variables: `Weekday`, `Quarter` and `Delay`. `Delay` is the value we will summarize which leaves us with one variable to *collapse*: `Quarter`. In doing so, we will compute the `Delay` statistics for all quarters associated with a unique `Weekday` value.
This workflow requires two operations: a grouping operation using the `group_by` function and a summary operation using the `summarise`/`summarize` function. Here, we'll compute two summary statistics: minimum delay time and maximum delay time.
```{r}
library(dplyr)
df %>%
group_by(Weekday) %>%
summarise(min_delay = min(Delay), max_delay = max(Delay))
```
Note that the weekday follows the chronological order as defined in the `Weekday` factor.
You'll also note that the output is a `tibble`. This data class is discussed at the end of this page.
### Grouping by multiple variables
You can group by more than one variable. For example, let's build another dataframe listing the average delay time in minutes, by quarter, by weekend/weekday and by inbound/outbound status for Logan airport's 2014 outbound flights.
```{r}
df2 <- data.frame(
Quarter = paste0("Q", rep(1:4, each = 4)),
Week = rep(c("Weekday", "Weekend"), each=2, times=4),
Direction = rep(c("Inbound", "Outbound"), times=8),
Delay = c(10.8, 9.7, 15.5, 10.4, 11.8, 8.9, 5.5,
3.3, 10.6, 8.8, 6.6, 5.2, 9.1, 7.3, 5.3, 4.4))
```
The goal will be to summarize the delay time by `Quarter` and by `Week` type as shown in the following graphic.
<img src="img/Summarize_by_two_variable.png" style="width: 73%; height: auto;">
This time, the data table has four variables. We are wanting to summarize by `Quater` and `Week` which leaves one variable, `Direction`, that needs to be collapsed.
```{r}
df2 %>%
group_by(Quarter, Week) %>%
summarise(min_delay = min(Delay), max_delay = max(Delay))
```
The following section demonstrates other grouping/summarizing operations on a larger dataset.
### Functions used in `summarise()`
The `summarise()` function typically uses functions that **aggregate** or **extract** meaningful summaries from grouped data.
#### **Numeric Aggregation Functions**
- `sum(x)`: Computes the sum of values in `x`.
- `mean(x)`: Computes the mean (average) of `x`.
- `median(x)`: Computes the median of `x`.
- `min(x)`: Finds the minimum value in `x`.
- `max(x)`: Finds the maximum value in `x`.
- `sd(x)`: Computes the standard deviation of `x`.
- `var(x)`: Computes the variance of `x`.
- `n()`: Counts the number of rows in each group.
- `n_distinct(x)`: Counts the number of unique values in `x`.
Note that some of these aggregation functions will return `NA` if at least one element in the input vector is `NA`. To have the function ignore the `NA` values, add the `na.rm = TRUE` argument. For example:
```{r}
z <- c(3, NA, 10, 4, 0)
mean(z) # returns NA
mean(z , na.rm = TRUE) # ignores NA
```
#### **Extracting Single Values from Groups**
- `first(x)`: Retrieves the first value in `x`.
- `last(x)`: Retrieves the last value in `x`.
- `unique(x)`: Returns unique values (useful if all values in a group are identical).
#### **Boolean and Logical Summaries**
- `any(x)`: Returns `TRUE` if any value in `x` is `TRUE`.
- `all(x)`: Returns `TRUE` if all values in `x` are `TRUE`.
### Example
```{r}
dat <- data.frame(group = c("A", "A", "B", "B", "B"),
x = c(10, 20, 30, 40, 50),
total = c(30, 30, 120, 120, 120),
stat = c(FALSE, FALSE, FALSE, TRUE, FALSE))
# Summarizing data
dat %>%
group_by(group) %>%
summarise(sum_x = sum(x),
total = unique(total), # Could have also used first() or last()
frac = sum_x / total, # Check that sum(x) == total
stat = any(stat))
```
NOTE: In this example, `total` represents the sum of values within each group, so it should not be summed again. Since the rows are being collapsed, we need to explicitly tell `summarise()` how to retain a single value for the total column. Using `unique(total)` ensures we keep the intended group total, assuming it remains consistent within each group. Alternatively, `first(total)` or `last(total)` could also be used if the values are identical across rows.
## Other ways to use `group_by`
The grouping operation does not always need to be followed by a summarization function. The `group_by()` function can also be used in conjunction with `mutate()` to perform calculations within each group while preserving the original number of rows.
For example, you might use `mutate()` after `group_by()` to create group-relative values, like percentages or rankings within each category, or apply conditional transformations where the operation depends on the group a row belongs to.
In the following example, the delay times are reported as the difference between delay time and the associate quarter's mean delay time.
<img src="img/group_by_mutate.png" style="width: 60%; height: auto;">
```{r}
df %>%
group_by(Quarter) %>%
mutate(Relative_delay = Delay - mean(Delay))
```
## Ungrouping a dataframe in `dplyr`
When using `group_by()` in dplyr, the behavior of `summarise()` depends on the number of grouping variables.
### Grouping by a Single Variable
If you group by one variable, `summarise()` automatically removes the grouping after aggregation. For example:
```{r}
mtcars %>% group_by(gear) %>% summarise(mpg = mean(mpg))
```
In this case, the resulting data frame is ungrouped as `summarise()` drops the only grouping variable (gear). This is confirmed by the absence of any group in the table header.
### Grouping by Multiple Variables
If you group by more than one variable, `summarise()` only removes the last grouping variable. For example
```{r}
mtcars %>% group_by(am, gear) %>% summarise(mpg = mean(mpg))
```
Here, the result remains grouped by `am` as only `gear` was removed from the grouping structure. You can confirm this by checking the table header or by using `group_vars()` on the dataframe.
### Grouping and `mutate()`
Note that this behavior does not apply to `mutate()`. When using `mutate()` on a grouped data frame, all grouping variables persist in the output.
```{r}
mtcars %>% group_by(am, gear) %>% mutate(mpg_scaled = mpg / mean(mpg)) %>% head()
```
In the output, the first few rows are displayed along with the header information, which confirms that the grouping variables (`am` and `gear`) remain intact.
### Removing al grouping
To ensure that the table is completely free of any grouping elements, use `ungroup`:
```{r}
mtcars %>% group_by(am, gear) %>% summarise(mpg = mean(mpg)) %>% ungroup()
```
Now, the resulting data frame has no remaining grouping structure.
## A working example
The data file *FAO_grains_NA.csv* will be used in this exercise. This dataset consists of grain yield and harvest year by North American country. The dataset was downloaded from http://faostat3.fao.org/ in June of 2014.
Run the following line to load the FAO data file into your current R session.
```{r}
dat <- read.csv("http://mgimond.github.io/ES218/Data/FAO_grains_NA.csv", header=TRUE)
```
Make sure to load the `dplyr` package before proceeding with the following examples.
```{r, message=FALSE}
library(dplyr)
```
### Summarizing by crop type
The `group_by` function will split any operations applied to the dataframe into groups defined by one or more columns. For example, if we wanted to get the minimum and maximum years from the `Year` column for which crop data are available *by crop type*, we would type the following:
```{r}
dat %>%
group_by(Crop) %>%
summarise(yr_min = min(Year), yr_max=max(Year))
```
### Count the number of records in each group
In this example, we are identifying the number of records by `Crop` type. There are two ways this can be accomplished:
```{r eval = FALSE}
dat %>%
filter(Information == "Yield (Hg/Ha)",
Year >= 2005 & Year <=2010,
Country=="United States of America") %>%
group_by(Crop) %>%
count()
```
Or,
```{r}
dat %>%
filter(Information == "Yield (Hg/Ha)",
Year >= 2005 & Year <=2010,
Country=="United States of America") %>%
group_by(Crop) %>%
summarise(Count = n())
```
The former uses the `count()` function and the latter uses the `summarise()` and `n()` functions.
### Summarize by mean yield and year range
Here's another example where *two* variables are summarized in a single pipe.
```{r}
dat.grp <- dat %>%
filter(Information == "Yield (Hg/Ha)",
Year >= 2005 & Year <=2010,
Country=="United States of America") %>%
group_by(Crop) %>%
summarise( Yield = mean(Value), `Number of Years` = max(Year) - min(Year))
dat.grp
```
### Normalizing each value in a group by the group median
In this example, we are subtracting each value in a group by that group's median. This can be useful in identifying which year yields are higher than or lower than the median yield value within each crop group. We will concern ourselves with US yields only and sort the output by crop type. We'll save the output dataframe as `dat2`.
```{r}
dat2 <- dat %>%
filter(Information == "Yield (Hg/Ha)",
Country == "United States of America") %>%
select(Crop, Year, Value) %>%
group_by(Crop) %>%
mutate(NormYield = Value - median(Value)) %>%
arrange(Crop)
```
Let's plot the normalized yields by year for `Barley` and add a `0` line representing the (normalized) central value.
```{r fig.width=5, fig.height=3,small.mar=TRUE}
plot( NormYield ~ Year, dat2[dat2$Crop == "Barley",] )
abline(h = 0, col="red")
```
The relative distribution of points does not change, but the values do (they are re-scaled) allowing us to compare values based on some localized (group) context. This technique will prove very useful later in the course when EDA topics are explored.
### `dplyr`'s output data structure
Some of `dplyr`'s functions such as `group_by`/`summarise` generate a **tibble** data table. For example, the `dat.grp` object created in the last chunk of code is associated with a `tb_df` (a tibble).
```{r}
class(dat.grp)
```
A *tibble* table will behave a little differently than a *data frame* table when printing to a screen or subsetting its elements. In most cases, a tibble rendering of the table will not pose a problem in a workflow, however, this format may prove problematic with some older functions. To remedy this, you can force the `dat.grp` object to a standalone `dataframe` as follows:
```{r}
dat.df <- as.data.frame(dat.grp)
class(dat.df)
```