84-预测分析-R语言实现-支持向量机
> library(pacman)
> p_load(dplyr, readr, caret)
1、导入数据
使用生物降解数据集,包含41个数值型变量,描述了1055种化学品的分子组成及属性。
建模的任务是根据这些属性来预测某个特定化学品是否会生物降解。
> bdf <- read_csv2("data_set/biodeg.csv", col_names = F)
> attr(bdf, "spec") <- NULL
> str(bdf)
## tibble [1,055 × 42] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ X1 : num [1:1055] 3919 417 3932 3 4236 ...
## $ X2 : chr [1:1055] "2.6909" "2.1144" "3.2512" "2.7098" ...
## $ X3 : num [1:1055] 0 0 0 0 0 0 1 0 0 0 ...
## $ X4 : num [1:1055] 0 0 0 0 0 0 0 0 0 0 ...
## $ X5 : num [1:1055] 0 0 0 0 0 0 0 0 0 1 ...
## $ X6 : num [1:1055] 0 0 0 0 0 0 0 0 0 0 ...
## $ X7 : num [1:1055] 0 0 0 0 0 0 0 0 2 2 ...
## $ X8 : num [1:1055] 314 308 267 20 294 286 111 316 444 412 ...
## $ X9 : num [1:1055] 2 1 2 0 2 2 0 3 2 0 ...
## $ X10: num [1:1055] 0 1 4 2 4 4 3 2 0 4 ...
## $ X11: num [1:1055] 0 0 0 0 0 0 0 0 0 3 ...
## $ X12: chr [1:1055] "0" "0" "0" "0" ...
## $ X13: num [1:1055] 3106 2461 3279 21 3449 ...
## $ X14: chr [1:1055] "2.55" "1.393" "2.585" "0.918" ...
## $ X15: num [1:1055] 9002 8723 911 6594 9528 ...
## $ X16: num [1:1055] 0 1 0 0 2 1 0 5 0 8 ...
## $ X17: chr [1:1055] "0.96" "0.989" "1.009" "1.108" ...
## $ X18: num [1:1055] 1142 1144 1152 1167 1147 ...
## $ X19: num [1:1055] 0 0 0 0 0 0 0 0 0 0 ...
## $ X20: num [1:1055] 0 0 0 0 0 0 0 0 0 1 ...
## $ X21: num [1:1055] 0 0 0 0 0 0 0 0 0 0 ...
## $ X22: chr [1:1055] "1.201" "1.104" "1.092" "1.024" ...
## $ X23: num [1:1055] 0 1 0 0 0 0 0 0 1 1 ...
## $ X24: num [1:1055] 0 0 0 0 0 0 0 0 0 0 ...
## $ X25: num [1:1055] 0 0 0 0 0 0 0 0 0 0 ...
## $ X26: num [1:1055] 0 0 0 0 0 0 0 0 0 0 ...
## $ X27: num [1:1055] 1932 2214 1942 1414 1985 ...
## $ X28: chr [1:1055] "0.011" "-0.204" "-0.008" "1.073" ...
## $ X29: num [1:1055] 0 0 0 0 0 0 0 0 0 0 ...
## $ X30: num [1:1055] 0 0 0 8361 10348 ...
## $ X31: chr [1:1055] "4.489" "1.542" "4.891" "1.333" ...
## $ X32: num [1:1055] 0 0 0 0 0 0 0 0 0 0 ...
## $ X33: num [1:1055] 0 0 0 0 0 0 0 0 0 1 ...
## $ X34: num [1:1055] 0 0 0 0 0 0 0 0 0 2 ...
## $ X35: num [1:1055] 0 0 1 1 0 0 1 0 0 1 ...
## $ X36: num [1:1055] 2949 3315 3076 3046 3351 ...
## $ X37: num [1:1055] 1591 1967 2417 5 2405 ...
## $ X38: num [1:1055] 0 0 0 0 0 0 0 0 0 1 ...
## $ X39: num [1:1055] 7253 7257 7601 669 8003 ...
## $ X40: num [1:1055] 0 0 0 0 0 0 0 0 0 0 ...
## $ X41: num [1:1055] 0 0 0 0 0 0 0 0 0 0 ...
## $ X42: chr [1:1055] "RB" "RB" "RB" "RB" ...
> table(bdf$X42)
##
## NRB RB
## 699 356
最后一列X42包含了输出变量,不可生物降解为NRB,可生物降解为RB,需要转换为因子型。其他数值型中有的也被识别为字符型,需要手动转换为数值型。
> bdf$X42 <- factor(bdf$X42, levels = c("RB", "NRB"), labels = c(0, 1))
>
> bdf <- mutate(bdf, across(where(is.character), as.numeric))
>
> str(bdf)
## tibble [1,055 × 42] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ X1 : num [1:1055] 3919 417 3932 3 4236 ...
## $ X2 : num [1:1055] 2.69 2.11 3.25 2.71 3.39 ...
## $ X3 : num [1:1055] 0 0 0 0 0 0 1 0 0 0 ...
## $ X4 : num [1:1055] 0 0 0 0 0 0 0 0 0 0 ...
## $ X5 : num [1:1055] 0 0 0 0 0 0 0 0 0 1 ...
## $ X6 : num [1:1055] 0 0 0 0 0 0 0 0 0 0 ...
## $ X7 : num [1:1055] 0 0 0 0 0 0 0 0 2 2 ...
## $ X8 : num [1:1055] 314 308 267 20 294 286 111 316 444 412 ...
## $ X9 : num [1:1055] 2 1 2 0 2 2 0 3 2 0 ...
## $ X10: num [1:1055] 0 1 4 2 4 4 3 2 0 4 ...
## $ X11: num [1:1055] 0 0 0 0 0 0 0 0 0 3 ...
## $ X12: num [1:1055] 0 0 0 0 -0.271 -0.275 0 -0.039 0 -1.29 ...
## $ X13: num [1:1055] 3106 2461 3279 21 3449 ...
## $ X14: num [1:1055] 2.55 1.393 2.585 0.918 2.753 ...
## $ X15: num [1:1055] 9002 8723 911 6594 9528 ...
## $ X16: num [1:1055] 0 1 0 0 2 1 0 5 0 8 ...
## $ X17: num [1:1055] 0.96 0.989 1.009 1.108 1.004 ...
## $ X18: num [1:1055] 1142 1144 1152 1167 1147 ...
## $ X19: num [1:1055] 0 0 0 0 0 0 0 0 0 0 ...
## $ X20: num [1:1055] 0 0 0 0 0 0 0 0 0 1 ...
## $ X21: num [1:1055] 0 0 0 0 0 0 0 0 0 0 ...
## $ X22: num [1:1055] 1.2 1.1 1.09 1.02 1.14 ...
## $ X23: num [1:1055] 0 1 0 0 0 0 0 0 1 1 ...
## $ X24: num [1:1055] 0 0 0 0 0 0 0 0 0 0 ...
## $ X25: num [1:1055] 0 0 0 0 0 0 0 0 0 0 ...
## $ X26: num [1:1055] 0 0 0 0 0 0 0 0 0 0 ...
## $ X27: num [1:1055] 1932 2214 1942 1414 1985 ...
## $ X28: num [1:1055] 0.011 -0.204 -0.008 1.073 -0.002 ...
## $ X29: num [1:1055] 0 0 0 0 0 0 0 0 0 0 ...
## $ X30: num [1:1055] 0 0 0 8361 10348 ...
## $ X31: num [1:1055] 4.49 1.54 4.89 1.33 5.59 ...
## $ X32: num [1:1055] 0 0 0 0 0 0 0 0 0 0 ...
## $ X33: num [1:1055] 0 0 0 0 0 0 0 0 0 1 ...
## $ X34: num [1:1055] 0 0 0 0 0 0 0 0 0 2 ...
## $ X35: num [1:1055] 0 0 1 1 0 0 1 0 0 1 ...
## $ X36: num [1:1055] 2949 3315 3076 3046 3351 ...
## $ X37: num [1:1055] 1591 1967 2417 5 2405 ...
## $ X38: num [1:1055] 0 0 0 0 0 0 0 0 0 1 ...
## $ X39: num [1:1055] 7253 7257 7601 669 8003 ...
## $ X40: num [1:1055] 0 0 0 0 0 0 0 0 0 0 ...
## $ X41: num [1:1055] 0 0 0 0 0 0 0 0 0 0 ...
## $ X42: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
> # 检查缺失值
> DataExplorer::profile_missing(bdf)
## # A tibble: 42 x 3
## feature num_missing pct_missing
## <fct> <int> <dbl>
## 1 X1 0 0
## 2 X2 0 0
## 3 X3 0 0
## 4 X4 0 0
## 5 X5 0 0
## 6 X6 0 0
## 7 X7 0 0
## 8 X8 0 0
## 9 X9 0 0
## 10 X10 0 0
## # … with 32 more rows
数据集不存在缺失值。
2、拆分训练集和测试集
> set.seed(123)
> ind <- createDataPartition(bdf$X42, p = 0.8, list = F)
> dtrain <- bdf[ind, ]
> dtest <- bdf[-ind, ]
>
> dim(dtrain)
## [1] 845 42
> dim(dtest)
## [1] 210 42
3、使用支持向量机模型
3.1 线性核
对特征进行归一化处理(让它们具有0均值和单位方差),消除量纲的影响,尤其是使用径向基核函数时。
> # 先不使用交叉验证
> ct <- trainControl(method = "none")
> set.seed(123)
> fit.linear <- train(X42 ~ ., method = "svmLinear", data = dtrain,
+ preProc = c("center", "scale"),
+ trControl = ct)
> fit.linear$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 297
##
## Objective Function Value : -270.7407
## Training error : 0.130178
Number of Support Vectors : 297,表示成为支持向量的数据点的数量为297个。
查看训练集的正确率。
> mean(predict(fit.linear, newdata = dtrain, type = "raw") == dtrain$X42)
## [1] 0.8698225
训练集上的准确率为86.98%,还不错。看看在测试集上的准确率。
> mean(predict(fit.linear, newdata = dtest, type = "raw") == dtest$X42)
## [1] 0.8857143
测试集上表现略好于训练集,为88.57%。
3.2 径向基核
> set.seed(123)
> fit.radial <- train(X42 ~ ., data = dtrain, method = "svmRadial",
+ preProc = c("center", "scale"),
+ trControl = ct)
> fit.radial$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 0.25
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0261890036312633
##
## Number of Support Vectors : 522
##
## Objective Function Value : -99.2956
## Training error : 0.156213
检查准确率:
> mean(predict(fit.radial, newdata = dtrain, type = "raw") == dtrain$X42)
## [1] 0.843787
> mean(predict(fit.radial, newdata = dtest, type = "raw") == dtest$X42)
## [1] 0.8666667
可以看到,径向基核函数的性能比线性核函数要差一些。
4、交叉验证
K折交叉验证:将训练数据拆分为K个同样大小且不会重叠的分区,然后,用K-1个分区训练模型,并把剩下的1个分区作为测试集。这样一共进行K次,每次留出一个分区作为测试集。最后,对K个不同测试集上获得的所有估计进行汇总,来计算在未知数据上的精确度的整体估计。
使用10折交叉验证,再次对比两种核函数模型的性能。
> ctrl <- trainControl(method = "cv", number = 10)
> fit.linear.cv10 <- train(X42 ~ ., method = "svmLinear", data = dtrain,
+ preProc = c("center", "scale"),
+ trControl = ctrl)
>
> set.seed(123)
> fit.radial.cv10 <- train(X42 ~ ., data = dtrain, method = "svmRadial",
+ preProc = c("center", "scale"),
+ trControl = ctrl)
> dotplot(resamples(list(linear = fit.linear.cv10, radial = fit.radial.cv10)))
使用10折交叉验证后,可以看到,径向基核函数的性能要优于线性核函数。
查看径向基核函数下,支持向量机模型的性能。
> # 最终模型参数
> fit.radial.cv10$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0261890036312633
##
## Number of Support Vectors : 459
##
## Objective Function Value : -294.5405
## Training error : 0.113609
> mean(predict(fit.radial.cv10, newdata = dtrain, type = "raw") == dtrain$X42)
## [1] 0.8863905
> mean(predict(fit.radial.cv10, newdata = dtest, type = "raw") == dtest$X42)
## [1] 0.8952381
ROC曲线:
> ROSE::roc.curve(predict(fit.radial.cv10, newdata = dtrain, type = "raw"),
+ dtrain$X42)
## Area under the curve (AUC): 0.876
作者:wonphen
原文链接:https://www.jianshu.com/p/9d20968d9032