ML: A Detailed Guide to the Source Code Behind feature_importances_ in LGBMClassifier, XGBClassifier, and CatBoostClassifier



Contents

A Detailed Guide to the Source Code Behind feature_importances_ in LGBMClassifier, XGBClassifier, and CatBoostClassifier

LGBMClassifier

XGBClassifier

CatBoostClassifier


A Detailed Guide to the Source Code Behind feature_importances_ in LGBMClassifier, XGBClassifier, and CatBoostClassifier

LGBMClassifier

The LGBMClassifier.feature_importances_ property computes importance using the 'split' method (a usage sketch follows the source listings below).

LGBMC.feature_importances_

importance_type='split' by default

    @property
    def feature_importances_(self):
        """Get feature importances.

        Note
        ----
        Feature importance in the sklearn interface used to be normalized to 1;
        this is deprecated after version 2.0.4 and is now the same as
        ``Booster.feature_importance()``. The ``importance_type`` attribute is
        passed to that function to configure the type of importance values
        to be extracted.
        """
        if self._n_features is None:
            raise LGBMNotFittedError('No feature_importances found. Need to call fit beforehand.')
        return self.booster_.feature_importance(importance_type=self.importance_type)

    @property
    def booster_(self):
        """Get the underlying lightgbm Booster of this model."""
        if self._Booster is None:
            raise LGBMNotFittedError('No booster found. Need to call fit beforehand.')
        return self._Booster

    def num_feature(self):
        """Get number of features.

        Returns
        -------
        num_feature : int
            The number of features.
        """
        out_num_feature = ctypes.c_int(0)
        _safe_call(_LIB.LGBM_BoosterGetNumFeature(
            self.handle,
            ctypes.byref(out_num_feature)))
        return out_num_feature.value

self.booster_.feature_importance(importance_type=self.importance_type)

    def feature_importance(self, importance_type='split', iteration=None):
        """Get feature importances.

        Parameters
        ----------
        importance_type : string, optional (default="split")
            How the importance is calculated.
            If "split", result contains numbers of times the feature is used in a model.
            If "gain", result contains total gains of splits which use the feature.
        iteration : int or None, optional (default=None)
            Limit number of iterations in the feature importance calculation.
            If None, if the best iteration exists, it is used; otherwise, all trees are used.
            If <= 0, all trees are used (no limits).

        Returns
        -------
        result : numpy array
            Array with feature importances.
        """
        if iteration is None:
            iteration = self.best_iteration
        if importance_type == "split":
            importance_type_int = 0
        elif importance_type == "gain":
            importance_type_int = 1
        else:
            importance_type_int = -1
        result = np.zeros(self.num_feature(), dtype=np.float64)
        _safe_call(_LIB.LGBM_BoosterFeatureImportance(
            self.handle,
            ctypes.c_int(iteration),
            ctypes.c_int(importance_type_int),
            result.ctypes.data_as(ctypes.POINTER(ctypes.c_double))))
        if importance_type_int == 0:
            return result.astype(int)
        else:
            return result
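
To see these code paths in action, here is a minimal usage sketch (the synthetic dataset and hyperparameters are illustrative assumptions, not from the original post); it shows that the sklearn-style property simply forwards to the underlying Booster, and how to request gain-based importance instead:

    # Sketch: split vs. gain importance in LightGBM.
    # Assumes lightgbm and scikit-learn are installed; data is synthetic.
    import numpy as np
    from sklearn.datasets import make_classification
    from lightgbm import LGBMClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=42)

    clf = LGBMClassifier(n_estimators=50, random_state=42).fit(X, y)

    # Default importance_type='split': integer counts of how often each
    # feature is used in a split, taken straight from the underlying Booster.
    print(clf.feature_importances_)
    print(clf.booster_.feature_importance(importance_type='split'))

    # 'gain' sums the split gains of each feature instead of counting splits.
    print(clf.booster_.feature_importance(importance_type='gain'))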

XGBClassifier

The XGBClassifier.feature_importances_ property computes importance using the "weight" method (a usage sketch follows the source listing below).

XGBC.feature_importances_

importance_type="weight" by default; available types: gain, weight, cover, total_gain, total_cover

    @property
    def feature_importances_(self):
        """
        Feature importances property

        .. note:: Feature importance is defined only for tree boosters

            Feature importance is only defined when the decision tree model is chosen as base learner (`booster=gbtree`). It is not defined for other base learner types, such as linear learners (`booster=gblinear`).

        Returns
        -------
        feature_importances_ : array of shape ``[n_features]``

        """
        if getattr(self, 'booster', None) is not None and self.booster != 'gbtree':
            raise AttributeError('Feature importance is not defined for Booster type {}'
                                 .format(self.booster))
        b = self.get_booster()
        score = b.get_score(importance_type=self.importance_type)
        all_features = [score.get(f, 0.) for f in b.feature_names]
        all_features = np.array(all_features, dtype=np.float32)
        return all_features / all_features.sum()
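
As the quoted code shows, the property is just the raw get_score() result normalized to sum to 1. A small sketch to verify this (synthetic data; importance_type='weight' is passed explicitly here because the constructor default may differ across xgboost versions):

    # Sketch: feature_importances_ == get_score() normalized to sum to 1.
    # Assumes xgboost and scikit-learn are installed; data is synthetic.
    import numpy as np
    from sklearn.datasets import make_classification
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=500, n_features=8, random_state=0)
    clf = XGBClassifier(n_estimators=30, importance_type='weight').fit(X, y)

    booster = clf.get_booster()
    score = booster.get_score(importance_type='weight')  # dict: feature name -> split count
    # unnamed columns are called f0, f1, ...; features never used in a split are absent
    raw = np.array([score.get('f{0}'.format(i), 0.0) for i in range(X.shape[1])],
                   dtype=np.float32)

    print(np.allclose(clf.feature_importances_, raw / raw.sum()))  # True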
 

get_score

    def get_score(self, fmap='', importance_type='weight'):
        """Get feature importance of each feature.
        Importance type can be defined as:

        * 'weight': the number of times a feature is used to split the data across all trees.
        * 'gain': the average gain across all splits the feature is used in.
        * 'cover': the average coverage across all splits the feature is used in.
        * 'total_gain': the total gain across all splits the feature is used in.
        * 'total_cover': the total coverage across all splits the feature is used in.

        .. note:: Feature importance is defined only for tree boosters

            Feature importance is only defined when the decision tree model is chosen as base learner (`booster=gbtree`). It is not defined for other base learner types, such as linear learners (`booster=gblinear`).

        Parameters
        ----------
        fmap: str (optional)
           The name of feature map file.
        importance_type: str, default 'weight'
            One of the importance types defined above.
        """
        if getattr(self, 'booster', None) is not None and self.booster not in {'gbtree', 'dart'}:
            raise ValueError('Feature importance is not defined for Booster type {}'
                             .format(self.booster))

        allowed_importance_types = ['weight', 'gain', 'cover', 'total_gain', 'total_cover']
        if importance_type not in allowed_importance_types:
            msg = ("importance_type mismatch, got '{}', expected one of " +
                   repr(allowed_importance_types))
            raise ValueError(msg.format(importance_type))

        # if it's weight, then omap stores the number of missing values
        if importance_type == 'weight':
            # do a simpler tree dump to save time
            trees = self.get_dump(fmap, with_stats=False)

            fmap = {}
            for tree in trees:
                for line in tree.split('\n'):
                    # look for the opening square bracket
                    arr = line.split('[')
                    # if no opening bracket (leaf node), ignore this line
                    if len(arr) == 1:
                        continue

                    # extract feature name from string between []
                    fid = arr[1].split(']')[0].split('<')[0]

                    if fid not in fmap:
                        # if the feature hasn't been seen yet
                        fmap[fid] = 1
                    else:
                        fmap[fid] += 1

            return fmap

        else:
            average_over_splits = True
            if importance_type == 'total_gain':
                importance_type = 'gain'
                average_over_splits = False
            elif importance_type == 'total_cover':
                importance_type = 'cover'
                average_over_splits = False

            trees = self.get_dump(fmap, with_stats=True)

            importance_type += '='
            fmap = {}
            gmap = {}
            for tree in trees:
                for line in tree.split('\n'):
                    # look for the opening square bracket
                    arr = line.split('[')
                    # if no opening bracket (leaf node), ignore this line
                    if len(arr) == 1:
                        continue

                    # look for the closing bracket, extract only info within that bracket
                    fid = arr[1].split(']')

                    # extract gain or cover from string after closing bracket
                    g = float(fid[1].split(importance_type)[1].split(',')[0])

                    # extract feature name from string before closing bracket
                    fid = fid[0].split('<')[0]

                    if fid not in fmap:
                        # if the feature hasn't been seen yet
                        fmap[fid] = 1
                        gmap[fid] = g
                    else:
                        fmap[fid] += 1
                        gmap[fid] += g

            # calculate average value (gain/cover) for each feature
            if average_over_splits:
                for fid in gmap:
                    gmap[fid] = gmap[fid] / fmap[fid]

            return gmap
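
Reading the code above, 'gain' and 'cover' are averaged over the number of splits while the 'total_*' variants are not, so total_gain equals gain times weight for every feature. A short continuation of the sketch above (reusing the same hypothetical booster) makes the relationship explicit:

    # Sketch (continues the booster from the previous example): the averaged
    # and total variants are related by the split count per feature.
    weight = booster.get_score(importance_type='weight')
    gain = booster.get_score(importance_type='gain')
    total_gain = booster.get_score(importance_type='total_gain')

    for f in weight:
        # average gain per split times number of splits recovers the total gain
        assert abs(gain[f] * weight[f] - total_gain[f]) <= 1e-4 * max(1.0, total_gain[f])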

CatBoostClassifier

The CatBoostClassifier.feature_importances_ property selects the computation based on is_groupwise_metric(loss).

CatC.feature_importances_

    @property
    def feature_importances_(self):
        loss = self._object._get_loss_function_name()
        if loss and is_groupwise_metric(loss):
            return np.array(getattr(self, "_loss_value_change", None))
        else:
            return np.array(getattr(self, "_prediction_values_change", None))

Put simply, CatBoost compares the value of the metric (the loss function) obtained with the model when the feature is included against the model evaluated as if that feature were removed from all of the trees in the ensemble. The larger the difference, the more important the feature.
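
In the CatBoost API this "with vs. without the feature" comparison corresponds to the LossFunctionChange importance type, which needs a dataset to evaluate the loss on; for ordinary (non-ranking) losses the default is PredictionValuesChange. A minimal sketch (synthetic data and illustrative parameters, not from the original post):

    # Sketch: PredictionValuesChange (default) vs. LossFunctionChange in CatBoost.
    # Assumes catboost and scikit-learn are installed; data is synthetic.
    from sklearn.datasets import make_classification
    from catboost import CatBoostClassifier, Pool

    X, y = make_classification(n_samples=500, n_features=8, random_state=1)
    pool = Pool(X, y)

    clf = CatBoostClassifier(iterations=50, verbose=False).fit(pool)

    # For a plain classification loss, feature_importances_ returns
    # _prediction_values_change, i.e. PredictionValuesChange.
    print(clf.feature_importances_)
    print(clf.get_feature_importance(type='PredictionValuesChange'))

    # LossFunctionChange implements the "with vs. without the feature"
    # comparison described above and requires data to evaluate the loss.
    print(clf.get_feature_importance(pool, type='LossFunctionChange'))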
