09-30 08:20 阅读 65

R 数据处理（十）—— dplyr

6 summarise

最后一个重要的动词 summarise，它将所有信息汇总为一行

> summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
# A tibble: 1 x 1
  delay
  <dbl>
1  12.6

一般很少单独使用 summarise()，而是配合 group_by() 函数使用。这样就从对所有数据统计变成对每个分组进行统计，有助于直观了解组与组之间的差异。

当你在分组数据上使用 dplyr 的动词函数时，它们会自动应用的每个分组上。例如，我们将相同的代码应用于按日期分组的数据中，我们将获得每个日期的平均延迟。

> flights %>% group_by(year, month, day) 
    %>% summarise(delay=mean(dep_delay, na.rm = TRUE))
`summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
# A tibble: 365 x 4
# Groups:   year, month [12]
    year month   day delay
   <int> <int> <int> <dbl>
 1  2013     1     1 11.5 
 2  2013     1     2 13.9 
 3  2013     1     3 11.0 
 4  2013     1     4  8.95
 5  2013     1     5  5.73
 6  2013     1     6  7.15
 7  2013     1     7  5.42
 8  2013     1     8  2.55
 9  2013     1     9  2.28
10  2013     1    10  2.84
# … with 355 more rows

group_by() 和 summarise() 结合使用是 dplyr 最常用的工具之一。

6.1 用管道连接多个操作

比如，我们想要了解每个地方的距离和平均延迟之间的关系

> by_dest <- group_by(flights, dest)
> delay <- summarise(by_dest,
+    count = n(), # 计算每个分组的大小
+    dist = mean(distance, na.rm = TRUE),
+    delay = mean(arr_delay, na.rm = TRUE)
+ )
`summarise()` ungrouping output (override with `.groups` argument)
> delay <- filter(delay, count > 20, dest != "HNL")
> 
> ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
+   geom_point(aes(size = count), alpha = 1/3) +
+   geom_smooth(se = FALSE)
`geom_smooth()` using method = 'loess' and formula 'y ~ x'

image

我们分三步获取数据：

将航班根据目的地分组
计算距离、平均延误和航班数的汇总信息
过滤掉噪声点和檀香山机场，该机场距离下一个最近的机场几乎是其两倍。

编写这段代码是很让人沮丧的，因为你需要给中间结果指定变量，取名字又是一个纠结的问题。尽管你可能觉得取名字无关紧要，但是最好也是要见名知意的比较好。所以，这也会影响我们的开发效率（个人是很赞成这种说法的，哈哈）。

所以我们引入的管道操作符 %>%，让我们来修改一下上面的代码

> delays <- flights %>% 
+     group_by(dest) %>% 
+     summarise(
+         count = n(),
+         dist = mean(distance, na.rm = TRUE),
+         delay = mean(arr_delay, na.rm = TRUE)
+     ) %>% 
+     filter(count > 20, dest != "HNL")
`summarise()` ungrouping output (override with `.groups` argument)

这使我们的注意力集中在数据的转换上，而不是转换为什么东西，可以让代码变得更加容易阅读。

但是，你也看到了 ggplot2 并没有使用管道操作，因为 ggplot2 出现在管道操作之前，而它的下一代 ggvis 已经可以支持管道操作了，但是这个包还没完全成熟。

6.2 缺失值

你可能想知道我们上面使用的 na.rm 参数，如果我们不使用它会怎么样？

> flights %>% 
+     group_by(year, month, day) %>% 
+     summarise(mean = mean(dep_delay))
`summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
# A tibble: 365 x 4
# Groups:   year, month [12]
    year month   day  mean
   <int> <int> <int> <dbl>
 1  2013     1     1    NA
 2  2013     1     2    NA
 3  2013     1     3    NA
 4  2013     1     4    NA
 5  2013     1     5    NA
 6  2013     1     6    NA
 7  2013     1     7    NA
 8  2013     1     8    NA
 9  2013     1     9    NA
10  2013     1    10    NA
# … with 355 more rows

我们得到了一列缺失值，因为聚合函数通常遵循缺失值规则：

如果输入中包含任何缺失值，那么输出的结果也将是缺失值。幸运的是，所有聚合函数都有一个 na.rm 参数，在计算之前删除缺失值

> flights %>% 
+     group_by(year, month, day) %>% 
+     summarise(mean = mean(dep_delay, na.rm = TRUE))
`summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
# A tibble: 365 x 4
# Groups:   year, month [12]
    year month   day  mean
   <int> <int> <int> <dbl>
 1  2013     1     1 11.5 
 2  2013     1     2 13.9 
 3  2013     1     3 11.0 
 4  2013     1     4  8.95
 5  2013     1     5  5.73
 6  2013     1     6  7.15
 7  2013     1     7  5.42
 8  2013     1     8  2.55
 9  2013     1     9  2.28
10  2013     1    10  2.84
# … with 355 more rows

在这种情况下，缺失值表示取消的航班，我们也可以通过首先删除取消的航班来解决问题。

我们将保存这个数据集，以便在接下来的几个示例中重用它。

> not_cancelled <- flights %>% 
+   filter(!is.na(dep_delay), !is.na(arr_delay))
> 
> not_cancelled %>% 
+   group_by(year, month, day) %>% 
+   summarise(mean = mean(dep_delay))
`summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
# A tibble: 365 x 4
# Groups:   year, month [12]
    year month   day  mean
   <int> <int> <int> <dbl>
 1  2013     1     1 11.4 
 2  2013     1     2 13.7 
 3  2013     1     3 10.9 
 4  2013     1     4  8.97
 5  2013     1     5  5.73
 6  2013     1     6  7.15
 7  2013     1     7  5.42
 8  2013     1     8  2.56
 9  2013     1     9  2.30
10  2013     1    10  2.84
# … with 355 more rows

6.3 计数

无论何时进行任何聚合操作，都最好包含一个计数（n()），或计算非缺失值（sum(!is.na(x))），这样你就可以确认支持你的结论的数据基数。

例如，让我们看一下平均延迟最高的飞机（通过其尾号标识）

> delays <- not_cancelled %>% 
+     group_by(tailnum) %>% 
+     summarise(
+         delay = mean(arr_delay)
+     )
`summarise()` ungrouping output (override with `.groups` argument)
> #> `summarise()` ungrouping output (override with `.groups` argument)
> 
> ggplot(data = delays, mapping = aes(x = delay)) + 
+     geom_freqpoly(binwidth = 10)

image

从图上可以看到，有些飞机平均延误 5 小时（300 分钟）

实际上，这个事有些微妙。如果我们画一个航班数量与平均延误的散点图，我们可以得到更多的信息

delays <- not_cancelled %>% 
  group_by(tailnum) %>% 
  summarise(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )
#> `summarise()` ungrouping output (override with `.groups` argument)

ggplot(data = delays, mapping = aes(x = n, y = delay)) + 
  geom_point(alpha = 1/10, color='blue')

image

这一点也不奇怪，当航班很少时，平均延误的变化要大得多.

这个图的形状非常有特点：你会发现随着样本大小的增加，变化会减小。

在观察这类图时，可以先筛选出观察次数最少的组，这样你可以在最小的组中看到更多的模式和更少的极端变化情况。

下面，我们将向你展示将 ggplot2 集成到 dplyr 流中的简便方式

> delays %>% 
+     filter(n > 25) %>% 
+     ggplot(mapping = aes(x = n, y = delay)) + 
+     geom_point(alpha = 1/10, color='green')

image.png

这种模式还有一种常见的变体。让我们看看棒球击球手平均表现与他们击球次数的关系

在这里，我使用 Lahman 软件包中的数据来计算每个棒球大联盟运动员的击球平均值（命中次数/尝试次数）

当我们将击球手的技术（以平均击球数 ba 来衡量）与击球机会（以击球 ab 来衡量）来绘图时，您会看到两种模式

如上所述，当我们获得更多的数据点时，组中的变化会减小
技能(ba)和击球机会(ab)之间存在正相关关系。显然，这是因为球队控制着谁可以上场，他们肯定会挑选自己最好的球员。

batting <- as_tibble(Lahman::Batting)

batters <- batting %>% 
  group_by(playerID) %>% 
  summarise(
    ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
    ab = sum(AB, na.rm = TRUE)
  )

batters %>% 
  filter(ab > 100) %>% 
  ggplot(mapping = aes(x = ab, y = ba)) +
  geom_point(color='sienna') + 
  geom_smooth(se = FALSE)

image.png

6.4 汇总函数

尽管 means、counts 和 sum 三个汇总函数已经满足大多数要求了，但是 R 还提供了许多汇总函数

位置：
- mean：均值
- median：中位值

> not_cancelled %>% 
+   group_by(year, month, day) %>% 
+   summarise(
+     avg_delay1 = mean(arr_delay),
+     avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
+   )
`summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
# A tibble: 365 x 5
# Groups:   year, month [12]
    year month   day avg_delay1 avg_delay2
   <int> <int> <int>      <dbl>      <dbl>
 1  2013     1     1     12.7         32.5
 2  2013     1     2     12.7         32.0
 3  2013     1     3      5.73        27.7
 4  2013     1     4     -1.93        28.3
 5  2013     1     5     -1.53        22.6
 6  2013     1     6      4.24        24.4
 7  2013     1     7     -4.95        27.8
 8  2013     1     8     -3.23        20.8
 9  2013     1     9     -0.264       25.6
10  2013     1    10     -5.90        27.3
# … with 355 more rows

散度：
- sd(x)：标准差
- IQR(x)：四分位范围
- mad(x)：绝对中位差

> not_cancelled %>% 
+     group_by(dest) %>% 
+     summarise(distance_sd = sd(distance)) %>% 
+     arrange(desc(distance_sd))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 104 x 2
   dest  distance_sd
   <chr>       <dbl>
 1 EGE         10.5 
 2 SAN         10.4 
 3 SFO         10.2 
 4 HNL         10.0 
 5 SEA          9.98
 6 LAS          9.91
 7 PDX          9.87
 8 PHX          9.86
 9 LAX          9.66
10 IND          9.46
# … with 94 more rows

秩次：
- min：最小值
- max：最大值
- quantile：分位数

> not_cancelled %>% 
+     group_by(year, month, day) %>% 
+     summarise(
+         first = min(dep_time),
+         last = max(dep_time)
+     )
`summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
# A tibble: 365 x 5
# Groups:   year, month [12]
    year month   day first  last
   <int> <int> <int> <int> <int>
 1  2013     1     1   517  2356
 2  2013     1     2    42  2354
 3  2013     1     3    32  2349
 4  2013     1     4    25  2358
 5  2013     1     5    14  2357
 6  2013     1     6    16  2355
 7  2013     1     7    49  2359
 8  2013     1     8   454  2351
 9  2013     1     9     2  2252
10  2013     1    10     3  2320
# … with 355 more rows

位置:
- first(x)：第一个数
- nth(x, 2)：第二个数
- last(x)：最后一个数

> not_cancelled %>% 
+     group_by(year, month, day) %>% 
+     summarise(
+         first_dep = first(dep_time), 
+         last_dep = last(dep_time)
+     )
`summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
# A tibble: 365 x 5
# Groups:   year, month [12]
    year month   day first_dep last_dep
   <int> <int> <int>     <int>    <int>
 1  2013     1     1       517     2356
 2  2013     1     2        42     2354
 3  2013     1     3        32     2349
 4  2013     1     4        25     2358
 5  2013     1     5        14     2357
 6  2013     1     6        16     2355
 7  2013     1     7        49     2359
 8  2013     1     8       454     2351
 9  2013     1     9         2     2252
10  2013     1    10         3     2320
# … with 355 more rows

计数：
- n()：返回当前组的大小
- n_distinct(x)：计算不同值的数量
- sum(!is.na(x))：计算非缺失值的数量

> not_cancelled %>% 
+     group_by(dest) %>% 
+     summarise(carriers = n_distinct(carrier)) %>% 
+     arrange(desc(carriers))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 104 x 2
   dest  carriers
   <chr>    <int>
 1 ATL          7
 2 BOS          7
 3 CLT          7
 4 ORD          7
 5 TPA          7
 6 AUS          6
 7 DCA          6
 8 DTW          6
 9 IAD          6
10 MSP          6
# … with 94 more rows

简单统计到达的目的地的数目

> not_cancelled %>% 
+     count(dest)
# A tibble: 104 x 2
   dest      n
   <chr> <int>
 1 ABQ     254
 2 ACK     264
 3 ALB     418
 4 ANC       8
 5 ATL   16837
 6 AUS    2411
 7 AVL     261
 8 BDL     412
 9 BGR     358
10 BHM     269
# … with 94 more rows

通过添加权重系数，可以计算飞行的总英里数

> not_cancelled %>% 
+     count(tailnum, wt = distance)
# A tibble: 4,037 x 2
   tailnum      n
   <chr>    <dbl>
 1 D942DN    3418
 2 N0EGMQ  239143
 3 N10156  109664
 4 N102UW   25722
 5 N103US   24619
 6 N104UW   24616
 7 N10575  139903
 8 N105UW   23618
 9 N107US   21677
10 N108UW   32070
# … with 4,027 more rows

逻辑值：
当使用数值型的函数时，TRUE 会变成 1，FALSE 变为 0。可以方便的与 sum、mean 结合使用

> not_cancelled %>% 
+     group_by(year, month, day) %>% 
+     summarise(n_early = sum(dep_time < 500))
`summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
# A tibble: 365 x 4
# Groups:   year, month [12]
    year month   day n_early
   <int> <int> <int>   <int>
 1  2013     1     1       0
 2  2013     1     2       3
 3  2013     1     3       4
 4  2013     1     4       3
 5  2013     1     5       3
 6  2013     1     6       2
 7  2013     1     7       2
 8  2013     1     8       1
 9  2013     1     9       3
10  2013     1    10       3
# … with 355 more rows
# 计算航班延误超过 1 小时的比例
> not_cancelled %>% 
+     group_by(year, month, day) %>% 
+     summarise(hour_prop = mean(arr_delay > 60))
`summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
# A tibble: 365 x 4
# Groups:   year, month [12]
    year month   day hour_prop
   <int> <int> <int>     <dbl>
 1  2013     1     1    0.0722
 2  2013     1     2    0.0851
 3  2013     1     3    0.0567
 4  2013     1     4    0.0396
 5  2013     1     5    0.0349
 6  2013     1     6    0.0470
 7  2013     1     7    0.0333
 8  2013     1     8    0.0213
 9  2013     1     9    0.0202
10  2013     1    10    0.0183
# … with 355 more rows

6.5 多变量分组

通过对多变量分组，可以进行逐步汇总

> daily <- group_by(flights, year, month, day)
> (per_day   <- summarise(daily, flights = n()))
`summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
# A tibble: 365 x 4
# Groups:   year, month [12]
    year month   day flights
   <int> <int> <int>   <int>
 1  2013     1     1     842
 2  2013     1     2     943
 3  2013     1     3     914
 4  2013     1     4     915
 5  2013     1     5     720
 6  2013     1     6     832
 7  2013     1     7     933
 8  2013     1     8     899
 9  2013     1     9     902
10  2013     1    10     932
# … with 355 more rows

> (per_month <- summarise(per_day, flights = sum(flights)))
`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 12 x 3
# Groups:   year [1]
    year month flights
   <int> <int>   <int>
 1  2013     1   27004
 2  2013     2   24951
 3  2013     3   28834
 4  2013     4   28330
 5  2013     5   28796
 6  2013     6   28243
 7  2013     7   29425
 8  2013     8   29327
 9  2013     9   27574
10  2013    10   28889
11  2013    11   27268
12  2013    12   28135

> (per_year  <- summarise(per_month, flights = sum(flights)))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 1 x 2
   year flights
  <int>   <int>
1  2013  336776

6.6 取消分组

如果您想要删除分组并对未分组的数据操作，可以使用 ungroup

> daily %>% 
+     ungroup() %>%
+     summarise(flights = n())
# A tibble: 1 x 1
  flights
    <int>
1  336776

6.7 思考练习

至少用 5 种不同方式评估一组航班的延误情况。并考虑以下情形：

航班有 50% 的概率提前 15 分钟，50% 的概率晚点 15 分钟
航班总是晚点 10 分钟
航班有 50% 的概率提前 30 分钟，50% 的概率晚点 30 分钟
航班用 99% 的概率准点，迟到两小时的几率只有 1%

到达延迟和出发延迟，哪个更重要

想出另一种方法，得到与下面代码相同的结果（不使用 count()）

not_cancelled %>% count(dest)
not_cancelled %>% count(tailnum, wt = distance)

我们对航班取消的定义（is.na(dep_delay) | is.na(arr_delay)）并不是最好的，为什么？哪一列才是最重要的
看看每天取消的航班数量，是否有规律？取消航班的比例与平均延误是否有关？
哪家航空公司的延误最严重。挑战：你能分清坏机场和坏航空公司的影响吗？（提示：flights %>% group_by(carrier, dest) %>% summarise(n())）
count 的 sort 参数的用处是什么，怎么使用？

7. group + mutate/filter

分组与 summarise() 结合使用是最有用的，但是你也可以与 mutate() 和 filter() 结合使用

找到每个组中最差的成员

> flights %>% 
+     group_by(year, month, day) %>%
+     filter(rank(desc(arr_delay)) < 10)
# A tibble: 3,306 x 19
# Groups:   year, month, day [365]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int>
 1  2013     1     1      848           1835       853     1001           1950       851 MQ        3944
 2  2013     1     1     1815           1325       290     2120           1542       338 EV        4417
 3  2013     1     1     1842           1422       260     1958           1535       263 EV        4633
 4  2013     1     1     1942           1705       157     2124           1830       174 MQ        4410
 5  2013     1     1     2006           1630       216     2230           1848       222 EV        4644
 6  2013     1     1     2115           1700       255     2330           1920       250 9E        3347
 7  2013     1     1     2205           1720       285       46           2040       246 AA        1999
 8  2013     1     1     2312           2000       192       21           2110       191 EV        4312
 9  2013     1     1     2343           1724       379      314           1938       456 EV        4321
10  2013     1     2     1244            900       224     1431           1104       207 EV        4412
# … with 3,296 more rows, and 8 more variables: tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

查找所有大于阈值的组

> popular_dests <- flights %>% 
+     group_by(dest) %>% 
+     filter(n() > 365)
> popular_dests
# A tibble: 332,577 x 19
# Groups:   dest [77]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int>
 1  2013     1     1      517            515         2      830            819        11 UA        1545
 2  2013     1     1      533            529         4      850            830        20 UA        1714
 3  2013     1     1      542            540         2      923            850        33 AA        1141
 4  2013     1     1      544            545        -1     1004           1022       -18 B6         725
 5  2013     1     1      554            600        -6      812            837       -25 DL         461
 6  2013     1     1      554            558        -4      740            728        12 UA        1696
 7  2013     1     1      555            600        -5      913            854        19 B6         507
 8  2013     1     1      557            600        -3      709            723       -14 EV        5708
 9  2013     1     1      557            600        -3      838            846        -8 B6          79
10  2013     1     1      558            600        -2      753            745         8 AA         301
# … with 332,567 more rows, and 8 more variables: tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

标准化每组指标

> popular_dests %>% 
+     filter(arr_delay > 0) %>% 
+     mutate(prop_delay = arr_delay / sum(arr_delay)) %>% 
+     select(year:day, dest, arr_delay, prop_delay)
# A tibble: 131,106 x 6
# Groups:   dest [77]
    year month   day dest  arr_delay prop_delay
   <int> <int> <int> <chr>     <dbl>      <dbl>
 1  2013     1     1 IAH          11  0.000111 
 2  2013     1     1 IAH          20  0.000201 
 3  2013     1     1 MIA          33  0.000235 
 4  2013     1     1 ORD          12  0.0000424
 5  2013     1     1 FLL          19  0.0000938
 6  2013     1     1 ORD           8  0.0000283
 7  2013     1     1 LAX           7  0.0000344
 8  2013     1     1 DFW          31  0.000282 
 9  2013     1     1 ATL          12  0.0000400
10  2013     1     1 DTW          16  0.000116 
# … with 131,096 more rows

7.1 思考练习

哪架飞机（tailnum）的准点率最底？
如果要尽量避免延误，应该在一天中的什么时间乘坐飞机？
计算每个目的地的总延迟分钟数。计算每个航班其占目的地总延迟的比例。
使用 lag()，探索某一航班的延迟与其前一个航班的延迟相关性。
统计至少有两家航空公司飞过的目的地。利用目的地数量给航空公司排名。

感谢花花同学的上期参考答案：

http://note.youdao.com/s/NXHvKdvA

作者:名本无名

原文链接:https://www.jianshu.com/p/22db5ecc257a