ML with sklearn: The ShuffleSplit() and StratifiedShuffleSplit() functions in the sklearn library, explained



Contents

The ShuffleSplit() and StratifiedShuffleSplit() functions in the sklearn library

The ShuffleSplit() function

The StratifiedShuffleSplit() function


The ShuffleSplit() and StratifiedShuffleSplit() functions in the sklearn library

from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit
Both functions shuffle the dataset before splitting it. Splitting without shuffling can let ordering artifacts in the data leak into the folds, which encourages overfitting and weakens the model's ability to generalize. StratifiedShuffleSplit is a merge of StratifiedKFold and ShuffleSplit: it returns stratified randomized folds, made by preserving the percentage of samples for each class.
The samples are shuffled first, and then train/test pairs are carved out according to the configured parameters. n_splits controls how many independent train/test index sets are produced, and each stratified split preserves the same class proportions: if the first training set has a class ratio of 2:1, every subsequent split keeps that ratio.
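To make the contrast concrete, here is a small illustrative sketch (mine, not from the original post) that runs both splitters on an imbalanced label vector. ShuffleSplit lets the class mix of each test set drift from split to split, while StratifiedShuffleSplit keeps the 2:1 ratio in every test set:

import numpy as np
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

X = np.zeros((12, 2))            # features are irrelevant for this demo
y = np.array([0] * 8 + [1] * 4)  # 2:1 class imbalance

for name, splitter in [
    ("ShuffleSplit", ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)),
    ("StratifiedShuffleSplit",
     StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=0)),
]:
    print(name)
    for train_idx, test_idx in splitter.split(X, y):
        # np.bincount shows how many samples of each class landed in test
        print("  test class counts:", np.bincount(y[test_idx]))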


The ShuffleSplit() function

cv_split = ShuffleSplit(n_splits=6, train_size=0.7, test_size=0.2)
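Note that train_size and test_size need not sum to 1.0: in the call above, each of the 6 splits trains on 70% of the samples and tests on 20%, silently leaving the remaining 10% out of that split. A quick check (illustrative, not from the original post):

import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(100).reshape(50, 2)   # 50 samples
cv_split = ShuffleSplit(n_splits=6, train_size=0.7, test_size=0.2)
for train_idx, test_idx in cv_split.split(X):
    print(len(train_idx), len(test_idx))   # prints "35 10" six times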

class ShuffleSplit(BaseShuffleSplit):

    """Random permutation cross-validator.

    Yields indices to split data into training and test sets.

    Note: contrary to other cross-validation strategies, random splits
    do not guarantee that all folds will be different, although this is
    still very likely for sizeable datasets.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=10
        Number of re-shuffling & splitting iterations.

    test_size : float or int, default=None
        If float, should be between 0.0 and 1.0 and represent the
        proportion of the dataset to include in the test split. If int,
        represents the absolute number of test samples. If None, the
        value is set to the complement of the train size. If
        ``train_size`` is also None, it will be set to 0.1.

    train_size : float or int, default=None
        If float, should be between 0.0 and 1.0 and represent the
        proportion of the dataset to include in the train split. If int,
        represents the absolute number of train samples. If None, the
        value is automatically set to the complement of the test size.

    random_state : int or RandomState instance, default=None
        Controls the randomness of the training and testing indices
        produced. Pass an int for reproducible output across multiple
        function calls.
        See :term:`Glossary <random_state>`.

In plain terms: n_splits is the number of independent train/test pairs to generate; test_size sets the share (or absolute count) of samples on the test side of each pair, and train_size the share on the train side; random_state seeds the pseudo-random generator that shuffles the samples, so passing an int makes the splits reproducible across calls.
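These splitter objects plug directly into sklearn's model-selection utilities. A usage sketch (not from the original post), passing a ShuffleSplit instance as the cv argument of cross_val_score:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_iris(return_X_y=True)
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
# one fit/score per generated train/test pair, five in total
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean())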

    Examples

    --------

    >>> import numpy as np

    >>> from sklearn.model_selection import ShuffleSplit

    >>> X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [3, 4], [5, 6]])

    >>> y = np.array([1, 2, 1, 2, 1, 2])

    >>> rs = ShuffleSplit(n_splits=5, test_size=.25, random_state=0)

    >>> rs.get_n_splits(X)

    5

    >>> print(rs)

    ShuffleSplit(n_splits=5, random_state=0, test_size=0.25,  train_size=None)

    >>> for train_index, test_index in rs.split(X):

    ...     print("TRAIN:", train_index, "TEST:", test_index)

    TRAIN: [1 3 0 4] TEST: [5 2]

    TRAIN: [4 0 2 5] TEST: [1 3]

    TRAIN: [1 2 4 0] TEST: [3 5]

    TRAIN: [3 4 1 0] TEST: [5 2]

    TRAIN: [3 5 1 0] TEST: [2 4]

    >>> rs = ShuffleSplit(n_splits=5, train_size=0.5, test_size=.25, random_state=0)

    >>> for train_index, test_index in rs.split(X):

    ...     print("TRAIN:", train_index, "TEST:", test_index)

    TRAIN: [1 3 0] TEST: [5 2]

    TRAIN: [4 0 2] TEST: [1 3]

    TRAIN: [1 2 4] TEST: [3 5]

    TRAIN: [3 4 1] TEST: [5 2]

    TRAIN: [3 5 1] TEST: [2 4]

    """

 

    @_deprecate_positional_args

    def __init__(self, n_splits=10, *, test_size=None, train_size=None,

        random_state=None):

        super().__init__(n_splits=n_splits, test_size=test_size,  train_size=train_size, random_state=random_state)

        self._default_test_size = 0.1

    

    def _iter_indices(self, X, y=None, groups=None):

        n_samples = _num_samples(X)

        n_train, n_test = _validate_shuffle_split(

            n_samples, self.test_size, self.train_size,

            default_test_size=self._default_test_size)

        rng = check_random_state(self.random_state)

        for i in range(self.n_splits):

            # random partition

            permutation = rng.permutation(n_samples)

            ind_test = permutation[:n_test]

            ind_train = permutation[n_test:n_test + n_train]

            yield ind_train, ind_test
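The generator body above is the whole algorithm: each split draws a fresh permutation of the sample indices and slices it, test indices first, train indices next. A simplified standalone re-implementation (illustrative only; the real sklearn code also validates the requested sizes):

import numpy as np

def shuffle_split_indices(n_samples, n_train, n_test, n_splits, seed=0):
    # mirrors ShuffleSplit._iter_indices: fresh permutation per split
    rng = np.random.RandomState(seed)
    for _ in range(n_splits):
        permutation = rng.permutation(n_samples)
        yield permutation[n_test:n_test + n_train], permutation[:n_test]

for train_idx, test_idx in shuffle_split_indices(6, n_train=4, n_test=2, n_splits=3):
    print("TRAIN:", train_idx, "TEST:", test_idx)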

 


 

The StratifiedShuffleSplit() function

StratifiedShuffleSplit(n_splits=10, test_size=None, train_size=None, random_state=None)
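A common pattern (my example, not the original post's) is a single stratified hold-out split with n_splits=1, the shuffling counterpart of train_test_split(..., stratify=y):

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.random.rand(20, 3)
y = np.array([0] * 12 + [1] * 8)

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(sss.split(X, y))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
print(np.bincount(y_train), np.bincount(y_test))  # class ratios match: [9 6] [3 2]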

class StratifiedShuffleSplit(BaseShuffleSplit):

    """Stratified Shuffle Split cross-validator

    Provides train/test indices to split data in train/test sets.

    

    This cross-validation object is a merge of StratifiedKFold and ShuffleSplit, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class.

    

    Note: like the ShuffleSplit strategy, stratified random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.

    

    Read more in the :ref:`User Guide <cross_validation>`.

    

    Parameters

    ----------

    n_splits : int, default=10

    Number of re-shuffling & splitting iterations.

    

    test_size : float or int, default=None
        If float, should be between 0.0 and 1.0 and represent the
        proportion of the dataset to include in the test split. If int,
        represents the absolute number of test samples. If None, the
        value is set to the complement of the train size. If
        ``train_size`` is also None, it will be set to 0.1.

    train_size : float or int, default=None
        If float, should be between 0.0 and 1.0 and represent the
        proportion of the dataset to include in the train split. If int,
        represents the absolute number of train samples. If None, the
        value is automatically set to the complement of the test size.

    random_state : int or RandomState instance, default=None
        Controls the randomness of the training and testing indices
        produced. Pass an int for reproducible output across multiple
        function calls.
        See :term:`Glossary <random_state>`.

 

In plain terms: this cross-validator merges StratifiedKFold and ShuffleSplit, returning stratified randomized folds that preserve each class's share of the samples. As with ShuffleSplit, the random splits are not guaranteed to all be different, although that is very likely for sizeable datasets. The parameters mirror ShuffleSplit's: n_splits (number of re-shuffling and splitting iterations), test_size, train_size, and random_state.

    Examples

    --------

    >>> import numpy as np

    >>> from sklearn.model_selection import StratifiedShuffleSplit

    >>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])

    >>> y = np.array([0, 0, 0, 1, 1, 1])

    >>> sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)

    >>> sss.get_n_splits(X, y)

    5

    >>> print(sss)

    StratifiedShuffleSplit(n_splits=5, random_state=0, ...)

    >>> for train_index, test_index in sss.split(X, y):

    ...     print("TRAIN:", train_index, "TEST:", test_index)

    ...     X_train, X_test = X[train_index], X[test_index]

    ...     y_train, y_test = y[train_index], y[test_index]

    TRAIN: [5 2 3] TEST: [4 1 0]

    TRAIN: [5 1 4] TEST: [0 2 3]

    TRAIN: [5 0 2] TEST: [4 3 1]

    TRAIN: [4 1 0] TEST: [2 3 5]

    TRAIN: [0 5 1] TEST: [3 4 2]

 

    """

    @_deprecate_positional_args

    def __init__(self, n_splits=10, *, test_size=None, train_size=None,

        random_state=None):

        super().__init__(n_splits=n_splits, test_size=test_size,

         train_size=train_size, random_state=random_state)

        self._default_test_size = 0.1

    

    def _iter_indices(self, X, y, groups=None):

        n_samples = _num_samples(X)

        y = check_array(y, ensure_2d=False, dtype=None)

        n_train, n_test = _validate_shuffle_split(

            n_samples, self.test_size, self.train_size,

            default_test_size=self._default_test_size)

        if y.ndim == 2:
            # for multi-label y, map each distinct row to a string repr
            # using join because str(row) uses an ellipsis if len(row) > 1000
            y = np.array([' '.join(row.astype('str')) for row in y])

        classes, y_indices = np.unique(y, return_inverse=True)

        n_classes = classes.shape[0]

        class_counts = np.bincount(y_indices)

        if np.min(class_counts) < 2:

            raise ValueError("The least populated class in y has only 1"

                " member, which is too few. The minimum"

                " number of groups for any class cannot"

                " be less than 2.")

        if n_train < n_classes:

            raise ValueError(

                'The train_size = %d should be greater or '

                'equal to the number of classes = %d' %

                (n_train, n_classes))

        if n_test < n_classes:
            raise ValueError('The test_size = %d should be greater or '
                'equal to the number of classes = %d' %
                (n_test, n_classes))
        # Find the sorted list of instances for each class:
        # (np.unique above performs a sort, so code is O(n logn) already)
        class_indices = np.split(np.argsort(y_indices, kind='mergesort'),
            np.cumsum(class_counts)[:-1])

        rng = check_random_state(self.random_state)

        for _ in range(self.n_splits):
            # if there are ties in the class-counts, we want
            # to make sure to break them anew in each iteration
            n_i = _approximate_mode(class_counts, n_train, rng)
            class_counts_remaining = class_counts - n_i
            t_i = _approximate_mode(class_counts_remaining, n_test, rng)

            train = []

            test = []

            for i in range(n_classes):

                permutation = rng.permutation(class_counts[i])

                perm_indices_class_i = class_indices[i].take(permutation,

                    mode='clip')

                train.extend(perm_indices_class_i[:n_i[i]])

                test.extend(perm_indices_class_i[n_i[i]:n_i[i] + t_i[i]])

            

            train = rng.permutation(train)

            test = rng.permutation(test)

            yield train, test

    

    def split(self, X, y, groups=None):

        """Generate indices to split data into training and test set.

 

        Parameters

        ----------

        X : array-like of shape (n_samples, n_features)

            Training data, where n_samples is the number of samples

            and n_features is the number of features.

 

            Note that providing ``y`` is sufficient to generate the splits
            and hence ``np.zeros(n_samples)`` may be used as a placeholder
            for ``X`` instead of actual training data.

 

        y : array-like of shape (n_samples,) or (n_samples, n_labels)

            The target variable for supervised learning problems.

            Stratification is done based on the y labels.

 

        groups : object

            Always ignored, exists for compatibility.

 

        Yields

        ------

        train : ndarray

            The training set indices for that split.

 

        test : ndarray

            The testing set indices for that split.

 

        Notes

        -----

        Randomized CV splitters may return different results for each
        call of split. You can make the results identical by setting
        `random_state` to an integer.

        """

        y = check_array(y, ensure_2d=False, dtype=None)

        return super().split(X, y, groups)
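The note in the docstring above about passing a placeholder for X is worth demonstrating. Since stratification depends only on y, np.zeros(n_samples) can stand in for the feature matrix when all you need are the indices (a small sketch, not from the original post):

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X_placeholder = np.zeros(len(y))   # no real features needed for the indices

sss = StratifiedShuffleSplit(n_splits=2, test_size=0.5, random_state=0)
for train_idx, test_idx in sss.split(X_placeholder, y):
    print("TRAIN:", train_idx, "TEST:", test_idx)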

 


 