11-08 01:06

Machine Learning Model Evaluation, from Basics to Practice [Part 2]

Splitting a Dataset with Cross-Validation

1 KFold: k-fold cross-validation

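The approach described above, splitting the dataset into k equal parts, is called k-fold cross-validation. Part 3, "Model evaluation with cross-validation", introduces the cross_val_score function, whose cv parameter specifies how the dataset is split. If you pass an integer k, KFold is used (for a classifier with binary or multiclass targets, scikit-learn uses StratifiedKFold instead).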

The sklearn implementation is shown below (although, as noted above, it is usually used through cross_val_score):

```python
from sklearn.model_selection import KFold
import numpy as np

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])

kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X):
    print('train_index', train_index, 'test_index', test_index)
    # Index into X and y to build the train and test sets for this fold
    train_X, train_y = X[train_index], y[train_index]
    test_X, test_y = X[test_index], y[test_index]
```

The n_splits parameter sets how many folds the data is split into.
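To make the link between the cv parameter and KFold concrete, here is a minimal sketch. The synthetic data from make_regression and the choice of LinearRegression are assumptions made purely for illustration. A regressor is used on purpose, so that an integer cv maps to plain KFold rather than StratifiedKFold.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, for illustration only
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)
model = LinearRegression()

# Passing an integer and passing an unshuffled KFold give the same folds
scores_int = cross_val_score(model, X, y, cv=5)
scores_kfold = cross_val_score(model, X, y, cv=KFold(n_splits=5))

print(scores_int)
print(np.allclose(scores_int, scores_kfold))  # True
```

Passing a KFold object explicitly becomes useful once you want options an integer cannot express, such as shuffle=True with a fixed random_state.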

2 RepeatedKFold: k-fold cross-validation repeated p times

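In practice, a single run of k-fold cross-validation is often not enough, so we repeat it several times with a different random split each time. The most typical setup is 10 repetitions of 10-fold cross-validation. RepeatedKFold lets you control the number of repetitions.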

```python
from sklearn.model_selection import RepeatedKFold
import numpy as np

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])

# 2 folds, repeated 2 times: 4 train/test splits in total
kf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=0)
for train_index, test_index in kf.split(X):
    print('train_index', train_index, 'test_index', test_index)
```

The n_repeats parameter sets how many times the k-fold procedure is repeated.
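As a quick check on how n_splits and n_repeats combine, the sketch below counts the splits and how often each sample ends up in a test set. The toy array and the values 5 and 3 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)

# Total number of train/test splits is n_splits * n_repeats
n_total = rkf.get_n_splits(X)

# Count how many times each sample is used for testing
test_counts = np.zeros(len(X), dtype=int)
for train_index, test_index in rkf.split(X):
    test_counts[test_index] += 1

print(n_total)      # 15
print(test_counts)  # every sample is tested once per repetition, so all 3
```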

3 LeaveOneOut: leave-one-out

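Leave-one-out is the special case of k-fold cross-validation where k = n (n being the number of samples in the dataset). In other words, each round holds out a single sample for validation. (I first misread this as holding out one fold, and wondered how that was any different from KFold.) Because the model has to be trained n times, this method is only practical when the number of samples is small.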

```python
from sklearn.model_selection import LeaveOneOut

X = [1, 2, 3, 4]

loo = LeaveOneOut()
# Each split holds out exactly one sample as the test set
for train_index, test_index in loo.split(X):
    print('train_index', train_index, 'test_index', test_index)
```
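The claim that leave-one-out is k-fold with k = n can be checked directly. This is a small sketch on the same four-element toy data, comparing the index splits from LeaveOneOut with those from an unshuffled KFold whose n_splits equals the number of samples.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.array([1, 2, 3, 4])

# Collect (train, test) index pairs from both splitters as plain lists
loo_splits = [(tr.tolist(), te.tolist()) for tr, te in LeaveOneOut().split(X)]
kf_splits = [(tr.tolist(), te.tolist()) for tr, te in KFold(n_splits=len(X)).split(X)]

print(loo_splits)
print(loo_splits == kf_splits)  # True
```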

4 LeavePOut: leave-p-out

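See the previous section: LeavePOut works the same way as LeaveOneOut, except that each test set holds p samples instead of one, and every possible combination of p samples is used as a test set once.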

```python
from sklearn.model_selection import LeavePOut

X = [1, 2, 3, 4]

lpo = LeavePOut(p=2)
# Each split holds out a different pair of samples as the test set
for train_index, test_index in lpo.split(X):
    print('train_index', train_index, 'test_index', test_index)
```
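Since leave-p-out enumerates every way of choosing p test samples, the number of splits is the binomial coefficient C(n, p). The sketch below confirms this on the toy data above, and the last line (an illustrative figure, not from the original post) shows how fast the count grows.

```python
from math import comb
from sklearn.model_selection import LeavePOut

X = [1, 2, 3, 4]
lpo = LeavePOut(p=2)

# Number of splits reported by the splitter vs. C(n, p)
n_splits = lpo.get_n_splits(X)
test_sets = [tuple(int(i) for i in test) for _, test in lpo.split(X)]

print(n_splits, comb(len(X), 2))  # 6 6
print(test_sets)

# Even a modest dataset gets expensive: 100 samples with p=2
print(comb(100, 2))  # 4950
```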

5 ShuffleSplit: random splitting

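ShuffleSplit randomly shuffles the data and then splits it into a training set and a test set. Another benefit is that the random_state seed lets you reproduce a given split; if it is not set, the split is different on every run.
You can think of this method as a randomized version of KFold, or as the train_test_split hold-out method run n_splits times. Unlike KFold, the test sets of different splits may overlap.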

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.random.randint(1, 100, 20).reshape(10, 2)
# 10 independent random splits, each holding out 25% of the samples
rs = ShuffleSplit(n_splits=10, test_size=0.25)

for train, test in rs.split(X):
    print(f'train: {train} , test: {test}')
```
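To illustrate the reproducibility point, here is a minimal sketch. The helper collect_splits and the seed value 42 are hypothetical, introduced only for this example. Two splitters built with the same random_state yield identical index splits.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20).reshape(10, 2)  # 10 toy samples

def collect_splits(seed):
    """Return the (train, test) index lists produced with a given seed."""
    rs = ShuffleSplit(n_splits=3, test_size=0.25, random_state=seed)
    return [(train.tolist(), test.tolist()) for train, test in rs.split(X)]

first_run = collect_splits(42)
second_run = collect_splits(42)

# Same seed, same splits
same_seed = first_run == second_run
print(same_seed)  # True

# Within one split, train and test never share a sample
disjoint = all(set(tr).isdisjoint(te) for tr, te in first_run)
print(disjoint)  # True
```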

6 Splitting methods for other special cases

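For classification data, the target classes may be unevenly distributed. In medical data, for example, far fewer people have cancer than do not. In that case the splitters to use are StratifiedKFold and StratifiedShuffleSplit.
Grouped data needs to be split differently. The main splitters are GroupKFold, LeaveOneGroupOut, LeavePGroupOut and GroupShuffleSplit.
For time-dependent data, there is TimeSeriesSplit.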
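The three cases above can be sketched with one splitter each. All of the data here is made up for illustration: the 8-to-4 class imbalance, the four hypothetical patients with three records each, and the assumption that row order is time order.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold, TimeSeriesSplit

X = np.arange(24).reshape(12, 2)  # 12 toy samples

# Imbalanced labels: 8 negatives, 4 positives
y = np.array([0] * 8 + [1] * 4)
skf = StratifiedKFold(n_splits=4)
class_counts = [np.bincount(y[test], minlength=2).tolist()
                for _, test in skf.split(X, y)]
print(class_counts)  # every test fold keeps the 2:1 class ratio

# Grouped data: 4 hypothetical patients, 3 records each
groups = np.repeat([0, 1, 2, 3], 3)
gkf = GroupKFold(n_splits=2)
group_overlap = [set(groups[train]) & set(groups[test])
                 for train, test in gkf.split(X, y, groups)]
print(group_overlap)  # empty sets: no patient appears on both sides

# Time-ordered data: training indices always precede test indices
tscv = TimeSeriesSplit(n_splits=3)
time_ordered = [bool(train.max() < test.min()) for train, test in tscv.split(X)]
print(time_ordered)
```

Each splitter guards against a different mistake: stratification keeps rare classes present in every fold, grouping keeps records from the same source from appearing in both train and test, and time-series splitting keeps the model from training on the future.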

Author: 在这里唱歌不一定都是神经病
Link: https://juejin.cn/post/7027809115606876191