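The mutate family in R
First, load the packages we need. mutate() lives in the dplyr package; here we load the whole tidyverse instead.
The tidyverse bundles a set of packages for data wrangling and plotting. Load it as follows: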
> library(tidyverse)
-- Attaching packages ------------------------ tidyverse 1.3.0 --
√ ggplot2 3.3.3     √ purrr   0.3.4
√ tibble  3.0.5     √ dplyr   1.0.3
√ tidyr   1.1.2     √ stringr 1.4.0
√ readr   1.4.0     √ forcats 0.5.1
-- Conflicts --------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
References
https://dplyr.tidyverse.org/reference/mutate_all.html
Textbook: R for Data Science (《R数据科学》)
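The mutate function
The main job of mutate() is to add columns to a data frame. By default it places new columns at the end of the data set (since dplyr 1.0.0 the .before and .after arguments can change that position). A new column can be used as soon as it is created, even later in the same mutate() call.
A simple example: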
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> # add a new column at the end
> mutate(iris, new_col = Petal.Length + Petal.Width) %>% head()
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_col
1          5.1         3.5          1.4         0.2  setosa     1.6
2          4.9         3.0          1.4         0.2  setosa     1.6
3          4.7         3.2          1.3         0.2  setosa     1.5
4          4.6         3.1          1.5         0.2  setosa     1.7
5          5.0         3.6          1.4         0.2  setosa     1.6
6          5.4         3.9          1.7         0.4  setosa     2.1
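PS: %>% is the pipe operator. It passes the value on its left to the function on its right, which avoids nested function calls and makes the code easier to read.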
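To illustrate the point that a new column can be used as soon as it is created, here is a minimal sketch. The data frame `small` and the column names `petal_length`, `petal_width`, `total` and `share` are made up for illustration.

```r
library(dplyr)

# Toy data (hypothetical values, not taken from iris)
small <- data.frame(
  petal_length = c(1.4, 1.3, 1.5),
  petal_width  = c(0.2, 0.2, 0.4)
)

result <- small %>%
  mutate(
    total = petal_length + petal_width,  # new column, appended after the existing ones
    share = petal_width / total          # refers to `total`, created one line earlier
  )

print(result)
```

Both new columns end up after the original ones, in the order they were defined.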
mutate() also has three derived (scoped) functions:
mutate_at(); mutate_if(); mutate_all()
The official site explains the three suffixes as follows:
_all: affects every variable
_at: affects variables selected with a character vector or vars()
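_if: affects variables selected with a predicate function
In other words, _all targets every column, _at targets specific columns, and _if targets columns that satisfy a condition. As of dplyr 1.0.0 these scoped variants are superseded by across(), whose equivalent calls are shown alongside the examples below.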
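The three suffixes can be compared side by side on a small data frame. This is only a sketch: the data frame `df` and its columns `id`, `x` and `y` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: one character column, two numeric columns
df <- data.frame(id = c("a", "b"), x = c(1, 2), y = c(10, 20),
                 stringsAsFactors = FALSE)

all_chr <- df %>% mutate_all(as.character)       # _all: every column
at_x    <- df %>% mutate_at(vars(x), ~ . * 2)    # _at: only the selected column x
if_num  <- df %>% mutate_if(is.numeric, ~ . * 2) # _if: columns where is.numeric() is TRUE

print(all_chr)
print(at_x)
print(if_num)
```

In `at_x` only `x` is doubled, while in `if_num` both `x` and `y` are doubled and `id` is left alone.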
Their signatures are:
mutate_all(.tbl, .funs, ...)
mutate_if(.tbl, .predicate, .funs, ...)
mutate_at(.tbl, .vars, .funs, ..., .cols = NULL)
Arguments: [image: screenshot of the argument descriptions, not reproduced here]
Let's walk through the examples given on the official site.
mutate_at
scale2 <- function(x, na.rm = FALSE) (x - mean(x, na.rm = na.rm)) / sd(x, na.rm)
# height and mass contain missing values and na.rm defaults to FALSE,
# so every scaled value comes back as NA
starwars %>% mutate_at(c("height", "mass"), scale2)
# A tibble: 87 x 14
   name    height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>    <dbl> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
 1 Luke S~     NA    NA blond      fair       blue            19   male  mascu~
 2 C-3PO       NA    NA NA         gold       yellow         112   none  mascu~
 3 R2-D2       NA    NA NA         white, bl~ red             33   none  mascu~
 4 Darth ~     NA    NA none       white      yellow          41.9 male  mascu~
 5 Leia O~     NA    NA brown      light      brown           19   fema~ femin~
 6 Owen L~     NA    NA brown, gr~ light      blue            52   male  mascu~
 7 Beru W~     NA    NA brown      light      blue            47   fema~ femin~
 8 R5-D4       NA    NA NA         white, red red             NA   none  mascu~
 9 Biggs ~     NA    NA black      light      brown           24   male  mascu~
10 Obi-Wa~     NA    NA auburn, w~ fair       blue-gray       57   male  mascu~
# ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>
Here scale2 is applied to the height and mass columns.
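The all-NA result above comes from how scale2 handles missing values, which can be checked on a plain vector without dplyr. The vector `v` below is a made-up example; scale2 is the definition from the official example.

```r
# scale2 as defined in the official example
scale2 <- function(x, na.rm = FALSE) (x - mean(x, na.rm = na.rm)) / sd(x, na.rm)

# Hypothetical input with one missing value
v <- c(1, 2, 3, NA)

default_result <- scale2(v)               # mean() and sd() both return NA
narm_result    <- scale2(v, na.rm = TRUE) # NA is skipped when computing mean and sd

print(default_result)
print(narm_result)
```

With the default `na.rm = FALSE`, a single NA in the input makes every output NA; with `na.rm = TRUE` only the missing position stays NA.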
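The following two commands are equivalent (note that mutate_at() needs the column names as a character vector or wrapped in vars(), not as bare names inside c()):
starwars %>% mutate_at(c("height", "mass"), scale2)
starwars %>% mutate(across(c("height", "mass"), scale2))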
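PS: across() lets a function "go across" the selected columns, that is, it applies one or more functions to several columns at once. Combined with mutate(), it does the same job as mutate_at().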
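The equivalence between mutate_at() and mutate(across()) can be checked directly. The sketch below uses a hypothetical data frame `df` and a hypothetical helper `center()`, not the starwars data.

```r
library(dplyr)

# Hypothetical toy data and helper function
df <- data.frame(name = c("a", "b", "c"),
                 height = c(150, 160, 170),
                 mass = c(50, 60, 70),
                 stringsAsFactors = FALSE)
center <- function(x) x - mean(x)

old_style <- df %>% mutate_at(c("height", "mass"), center)    # scoped variant
new_style <- df %>% mutate(across(c(height, mass), center))   # across() form

print(all.equal(old_style, new_style))
```

Both calls overwrite `height` and `mass` in place and leave `name` untouched.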
Inside mutate_at(), use vars() to select the columns and pass the function to apply either by name or as a ~ formula. Older code wraps the function in funs(), for example funs(log(.)), but funs() has been deprecated since dplyr 0.8.0.
For example:
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> # take the log of every column except Species
> mutate_at(iris, vars(-Species), ~ log(.)) %>% head()
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1     1.629241    1.252763    0.3364722  -1.6094379  setosa
2     1.589235    1.098612    0.3364722  -1.6094379  setosa
3     1.547563    1.163151    0.2623643  -1.6094379  setosa
4     1.526056    1.131402    0.4054651  -1.6094379  setosa
5     1.609438    1.280934    0.3364722  -1.6094379  setosa
6     1.686399    1.360977    0.5306283  -0.9162907  setosa
mutate_if
starwars %>% mutate_if(is.numeric, scale2, na.rm = TRUE)
# A tibble: 87 x 14
   name        height    mass hair_color  skin_color eye_color birth_year sex
   <chr>        <dbl>   <dbl> <chr>       <chr>      <chr>          <dbl> <chr>
 1 Luke Skyw~ -0.0678 -0.120  blond       fair       blue          -0.443 male
 2 C-3PO      -0.212  -0.132  NA          gold       yellow         0.158 none
 3 R2-D2      -2.25   -0.385  NA          white, bl~ red           -0.353 none
 4 Darth Vad~  0.795   0.228  none        white      yellow        -0.295 male
 5 Leia Orga~ -0.701  -0.285  brown       light      brown         -0.443 fema~
 6 Owen Lars   0.105   0.134  brown, grey light      blue          -0.230 male
 7 Beru Whit~ -0.269  -0.132  brown       light      blue          -0.262 fema~
 8 R5-D4      -2.22   -0.385  NA          white, red red           NA     none
 9 Biggs Dar~  0.249  -0.0786 black       light      brown         -0.411 male
10 Obi-Wan K~  0.220  -0.120  auburn, wh~ fair       blue-gray     -0.198 male
# ... with 77 more rows, and 6 more variables: gender <chr>, homeworld <chr>,
#   species <chr>, films <list>, vehicles <list>, starships <list>
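Likewise, these two lines do the same thing:
starwars %>% mutate_if(is.numeric, scale2, na.rm = TRUE)
starwars %>% mutate(across(where(is.numeric), scale2, na.rm = TRUE))
where() picks out the numeric columns and across() applies the function to exactly those columns, which reproduces the behaviour of mutate_if().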
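The where() selection can be tried on a small data frame. In this sketch the extra argument is passed through a ~ formula rather than through across()'s `...`; to my knowledge, newer dplyr releases (1.1.0 and later) deprecate passing arguments via `...`, so treat that version detail as an assumption. The data frame `df` and its columns are hypothetical.

```r
library(dplyr)

# scale2 as defined in the official example
scale2 <- function(x, na.rm = FALSE) (x - mean(x, na.rm = na.rm)) / sd(x, na.rm)

# Hypothetical toy data: one character column, two numeric columns
df <- data.frame(label = c("a", "b", "c", "d"),
                 x = c(1, 2, 3, NA),
                 y = c(10, 20, 30, 40),
                 stringsAsFactors = FALSE)

# where(is.numeric) selects x and y; the formula forwards na.rm = TRUE to scale2
scaled <- df %>%
  mutate(across(where(is.numeric), ~ scale2(.x, na.rm = TRUE)))

print(scaled)
```

The character column `label` is not selected by `where(is.numeric)` and so is returned unchanged.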
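To apply several functions to the same columns at once, wrap them in list(). When more than one function is supplied, the results go into new columns instead of overwriting the original columns as in the earlier examples.
For example: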
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> iris %>% mutate_if(is.numeric, list(scale2, log)) %>% head()
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_fn1
1          5.1         3.5          1.4         0.2  setosa       -0.8976739
2          4.9         3.0          1.4         0.2  setosa       -1.1392005
3          4.7         3.2          1.3         0.2  setosa       -1.3807271
4          4.6         3.1          1.5         0.2  setosa       -1.5014904
5          5.0         3.6          1.4         0.2  setosa       -1.0184372
6          5.4         3.9          1.7         0.4  setosa       -0.5353840
  Sepal.Width_fn1 Petal.Length_fn1 Petal.Width_fn1 Sepal.Length_fn2
1      1.01560199        -1.335752       -1.311052         1.629241
2     -0.13153881        -1.335752       -1.311052         1.589235
3      0.32731751        -1.392399       -1.311052         1.547563
4      0.09788935        -1.279104       -1.311052         1.526056
5      1.24503015        -1.335752       -1.311052         1.609438
6      1.93331463        -1.165809       -1.048667         1.686399
  Sepal.Width_fn2 Petal.Length_fn2 Petal.Width_fn2
1        1.252763        0.3364722      -1.6094379
2        1.098612        0.3364722      -1.6094379
3        1.163151        0.2623643      -1.6094379
4        1.131402        0.4054651      -1.6094379
5        1.280934        0.3364722      -1.6094379
6        1.360977        0.5306283      -0.9162907
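You can also name the functions in the list. Note that the new column names differ from those above: each is suffixed with the name you gave the function instead of fn1 or fn2.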
> iris %>% mutate_if(is.numeric, list(scale = scale2, log = log)) %>% head()
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_scale
1          5.1         3.5          1.4         0.2  setosa         -0.8976739
2          4.9         3.0          1.4         0.2  setosa         -1.1392005
3          4.7         3.2          1.3         0.2  setosa         -1.3807271
4          4.6         3.1          1.5         0.2  setosa         -1.5014904
5          5.0         3.6          1.4         0.2  setosa         -1.0184372
6          5.4         3.9          1.7         0.4  setosa         -0.5353840
  Sepal.Width_scale Petal.Length_scale Petal.Width_scale Sepal.Length_log
1        1.01560199          -1.335752         -1.311052         1.629241
2       -0.13153881          -1.335752         -1.311052         1.589235
3        0.32731751          -1.392399         -1.311052         1.547563
4        0.09788935          -1.279104         -1.311052         1.526056
5        1.24503015          -1.335752         -1.311052         1.609438
6        1.93331463          -1.165809         -1.048667         1.686399
  Sepal.Width_log Petal.Length_log Petal.Width_log
1        1.252763        0.3364722      -1.6094379
2        1.098612        0.3364722      -1.6094379
3        1.163151        0.2623643      -1.6094379
4        1.131402        0.4054651      -1.6094379
5        1.280934        0.3364722      -1.6094379
6        1.360977        0.5306283      -0.9162907
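The same named-list idea carries over to across(). The sketch below uses a hypothetical data frame `df` and hypothetical function names `log10` and `double` in the list. By default across() names the new columns as column name, underscore, function name.

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(label = c("a", "b", "c"), x = c(1, 10, 100),
                 stringsAsFactors = FALSE)

# A named list of functions: each name becomes the suffix of a new column
out <- df %>%
  mutate(across(where(is.numeric), list(log10 = log10, double = ~ .x * 2)))

print(names(out))
print(out)
```

The original `x` column is kept, and `x_log10` and `x_double` are added next to it.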
mutate_all
The reference page does not give many examples for mutate_all(), but going by its description it applies the function to every variable.
> a = matrix(rep(1:5, each = 10), 10) %>% as.data.frame()
> a
   V1 V2 V3 V4 V5
1   1  2  3  4  5
2   1  2  3  4  5
3   1  2  3  4  5
4   1  2  3  4  5
5   1  2  3  4  5
6   1  2  3  4  5
7   1  2  3  4  5
8   1  2  3  4  5
9   1  2  3  4  5
10  1  2  3  4  5
> # replace every value with its column sum
> # (older code writes funs(sum(.)); funs() is deprecated since dplyr 0.8.0)
> mutate_all(a, ~ sum(.))
   V1 V2 V3 V4 V5
1  10 20 30 40 50
2  10 20 30 40 50
3  10 20 30 40 50
4  10 20 30 40 50
5  10 20 30 40 50
6  10 20 30 40 50
7  10 20 30 40 50
8  10 20 30 40 50
9  10 20 30 40 50
10 10 20 30 40 50
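The output above keeps all ten rows because mutate-style verbs recycle a summary value down the column instead of collapsing the data. A minimal sketch of that difference, written with across(everything()) on a smaller, hypothetical data frame `a`:

```r
library(dplyr)

# Hypothetical toy data: columns V1, V2, V3 filled with 1, 2, 3
a <- as.data.frame(matrix(rep(1:3, each = 4), nrow = 4))

filled    <- a %>% mutate(across(everything(), sum))     # same shape as a, sums repeated
collapsed <- a %>% summarise(across(everything(), sum))  # a single row of sums

print(filled)
print(collapsed)
```

Use the summarise() form when one row per data set (or per group) is what you actually want.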
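One more note:
When supplying the functions to apply (the .funs argument), you can write your own function as in the examples, wrap several functions in list(), or use the formula shorthand ~ fun(.).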
starwars %>% mutate_at(c("height", "mass"), ~scale2(., na.rm = TRUE))
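As a sketch of the options for specifying the function, the three calls below should give the same result. The data frame `df` is hypothetical.

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(x = c(1, 4, 9), y = c(16, 25, 36))

by_name    <- df %>% mutate_all(sqrt)                  # an existing function, by name
by_anon    <- df %>% mutate_all(function(v) sqrt(v))   # an anonymous function
by_formula <- df %>% mutate_all(~ sqrt(.))             # formula shorthand, `.` is the column

print(by_formula)
```

The formula form is mostly useful when extra arguments have to be set, as in the `na.rm = TRUE` call above.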
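Summary
Whereas mutate() is mainly used to add new variables, its derived functions mainly apply a function to the data column by column. If you need to operate by row or by group instead, add group_by() or rowwise() before the call.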
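A minimal sketch of the row-wise and grouped cases, assuming dplyr 1.0.0 or later for c_across(). The data frame `df` and the columns `g`, `x`, `y`, `total` and `x_centered` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with a grouping column
df <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 3), y = c(10, 20, 30),
                 stringsAsFactors = FALSE)

# rowwise(): sum() sees one row at a time, so `total` is a per-row sum of x and y
row_totals <- df %>%
  rowwise() %>%
  mutate(total = sum(c_across(c(x, y)))) %>%
  ungroup()

# group_by(): mean(x) is computed within each group g
group_centered <- df %>%
  group_by(g) %>%
  mutate(x_centered = x - mean(x)) %>%
  ungroup()

print(row_totals)
print(group_centered)
```

Calling ungroup() at the end returns an ordinary tibble so that later steps are not silently row-wise or grouped.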
Author: 日月其除
Link: https://www.jianshu.com/p/86b30b81d2e0