Python programming study: a deep dive into the shap.datasets.adult() source and how its X, y outputs differ from X_display, y_display


20001909 2022-09-19 10:18:01 51799
Category: News

Contents

A deep dive into X, y and X_display, y_display in the shap.datasets.adult() source

Reading the source code

Understanding the source code

Comparing data and raw_data

X.shape

X_display.shape


A deep dive into X, y and X_display, y_display in the shap.datasets.adult() source

    import shap

    X, y = shap.datasets.adult()
    X_display, y_display = shap.datasets.adult(display=True)

Reading the source code

    # Excerpt from the shap datasets module. `cache` and `github_data_url`
    # are helpers defined elsewhere in that module.
    import numpy as np
    import pandas as pd

    def adult(display=False):
        """ Return the Adult census data in a nice package. """
        dtypes = [
            ("Age", "float32"), ("Workclass", "category"), ("fnlwgt", "float32"),
            ("Education", "category"), ("Education-Num", "float32"), ("Marital Status", "category"),
            ("Occupation", "category"), ("Relationship", "category"), ("Race", "category"),
            ("Sex", "category"), ("Capital Gain", "float32"), ("Capital Loss", "float32"),
            ("Hours per week", "float32"), ("Country", "category"), ("Target", "category")
        ]
        raw_data = pd.read_csv(
            cache(github_data_url + "adult.data"),
            names=[d[0] for d in dtypes],
            na_values="?",
            dtype=dict(dtypes)
        )
        data = raw_data.drop(["Education"], axis=1)  # redundant with Education-Num
        filt_dtypes = list(filter(lambda x: not (x[0] in ["Target", "Education"]), dtypes))
        data["Target"] = data["Target"] == " >50K"  # binary target: True if income > 50K
        rcode = {
            "Not-in-family": 0,
            "Unmarried": 1,
            "Other-relative": 2,
            "Own-child": 3,
            "Husband": 4,
            "Wife": 5
        }
        # Encode every categorical feature as integers (only in `data`, not in `raw_data`).
        for k, dtype in filt_dtypes:
            if dtype == "category":
                if k == "Relationship":
                    data[k] = np.array([rcode[v.strip()] for v in data[k]])
                else:
                    data[k] = data[k].cat.codes
        if display:
            # Human-readable features: the raw frame with labels left as they were read.
            return raw_data.drop(["Education", "Target", "fnlwgt"], axis=1), data["Target"].values
        # Model-ready features: the encoded frame.
        return data.drop(["Target", "fnlwgt"], axis=1), data["Target"].values
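To make the two return paths concrete without downloading anything, the steps above can be sketched on a few inline rows. This is a minimal sketch, not the library's code: `load_adult_like`, `SAMPLE_CSV`, `DTYPES` and `RCODE` are hypothetical names introduced here, and the sample rows are only illustrative values in the adult.data layout.

```python
import io

import numpy as np
import pandas as pd

# Column names and dtypes, copied from the source above.
DTYPES = [
    ("Age", "float32"), ("Workclass", "category"), ("fnlwgt", "float32"),
    ("Education", "category"), ("Education-Num", "float32"),
    ("Marital Status", "category"), ("Occupation", "category"),
    ("Relationship", "category"), ("Race", "category"), ("Sex", "category"),
    ("Capital Gain", "float32"), ("Capital Loss", "float32"),
    ("Hours per week", "float32"), ("Country", "category"),
    ("Target", "category"),
]

RCODE = {"Not-in-family": 0, "Unmarried": 1, "Other-relative": 2,
         "Own-child": 3, "Husband": 4, "Wife": 5}

# Illustrative rows (assumption: same layout as adult.data, with a space
# after every comma, which is why labels carry a leading space).
SAMPLE_CSV = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
    "52, Self-emp-inc, 287927, HS-grad, 9, Married-civ-spouse, "
    "Exec-managerial, Wife, White, Female, 15024, 0, 40, United-States, >50K\n"
)


def load_adult_like(csv_text, display=False):
    """Hypothetical stand-in for adult(): same steps, but reads inline text."""
    raw_data = pd.read_csv(
        io.StringIO(csv_text),
        names=[name for name, _ in DTYPES],
        na_values="?",
        dtype=dict(DTYPES),
    )
    data = raw_data.drop(columns=["Education"])
    data["Target"] = data["Target"] == " >50K"
    for name, dtype in DTYPES:
        if dtype != "category" or name in ("Target", "Education"):
            continue
        if name == "Relationship":
            # Fixed, hand-written mapping for this one column.
            data[name] = np.array([RCODE[v.strip()] for v in data[name]])
        else:
            # Integer position of each label within the column's categories.
            data[name] = data[name].cat.codes
    y = data["Target"].values
    if display:
        return raw_data.drop(columns=["Education", "Target", "fnlwgt"]), y
    return data.drop(columns=["Target", "fnlwgt"]), y


X, y = load_adult_like(SAMPLE_CSV)
X_display, y_display = load_adult_like(SAMPLE_CSV, display=True)

print(X.dtypes)          # numeric dtypes only
print(X_display.dtypes)  # float32 and category
```

Both calls produce frames with the same shape and column names; only the contents of the categorical columns differ, and the target array is the same in both.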

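Understanding the source code

Comparing data and raw_data

Conclusion
data: a new DataFrame derived from raw_data, which is the CSV as read in. Three columns are dropped in total (Education, Target and fnlwgt; the first red box in the original screenshot), the target is turned into a binary label by comparing it with " >50K" (the second red box), and the categorical features are converted to integer codes (the third red box). The result is a DataFrame in which every column is numeric, returned together with a boolean target array.
raw_data: the original data as read from the CSV. Only the same three columns are dropped; everything else is returned unchanged, so the categorical columns keep their string labels.
In both cases the second return value is data["Target"].values, so y and y_display are identical. The only difference lies between X and X_display.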
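The link between an encoded column in X and its labelled counterpart in X_display is the pandas categorical encoding. A small sketch of that round trip, using made-up labels rather than the real dataset (the variable names here are illustrative only):

```python
import pandas as pd

# Made-up labels with the leading space seen in adult.data; None is a missing value.
labels = [" State-gov", " Private", None, " Self-emp-inc"]
workclass = pd.Series(labels, dtype="category")

codes = workclass.cat.codes            # integer codes; a missing value becomes -1
categories = workclass.cat.categories  # the distinct labels, in sorted order

# Rebuild the labelled column from the codes; -1 maps back to NaN.
decoded = pd.Series(pd.Categorical.from_codes(codes, categories=categories))

print(codes.tolist())
print(list(categories))
print(decoded.tolist())
```

This applies to every categorical column except Relationship, which the source encodes with its own hand-written rcode dictionary instead of `cat.codes`.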

X.shape 

    X.shape: (32561, 12)

           age         workclass  ...  hours-per-week  native-country
    0       39         State-gov  ...              40   United-States
    1       50  Self-emp-not-inc  ...              13   United-States
    2       38           Private  ...              40   United-States
    3       53           Private  ...              40   United-States
    4       28           Private  ...              40            Cuba
    ...    ...               ...  ...             ...             ...
    32556   27           Private  ...              38   United-States
    32557   40           Private  ...              40   United-States
    32558   58           Private  ...              40   United-States
    32559   22           Private  ...              20   United-States
    32560   52      Self-emp-inc  ...              40   United-States

    [32561 rows x 12 columns]

Note that this listing shows string labels, as X_display does. Per the source above, the categorical columns of X itself (display=False) hold integer codes.
   age  workclass         education-num  marital-status         occupation         relationship   race   sex     capital-gain  capital-loss  hours-per-week  native-country
0  39   State-gov         13             Never-married          Adm-clerical       Not-in-family  White  Male    2174          0             40              United-States
1  50   Self-emp-not-inc  13             Married-civ-spouse     Exec-managerial    Husband        White  Male    0             0             13              United-States
2  38   Private           9              Divorced               Handlers-cleaners  Not-in-family  White  Male    0             0             40              United-States
3  53   Private           7              Married-civ-spouse     Handlers-cleaners  Husband        Black  Male    0             0             40              United-States
4  28   Private           13             Married-civ-spouse     Prof-specialty     Wife           Black  Female  0             0             40              Cuba
5  37   Private           14             Married-civ-spouse     Exec-managerial    Wife           White  Female  0             0             40              United-States
6  49   Private           5              Married-spouse-absent  Other-service      Not-in-family  Black  Female  0             0             16              Jamaica
7  52   Self-emp-not-inc  9              Married-civ-spouse     Exec-managerial    Husband        White  Male    0             0             45              United-States
8  31   Private           14             Never-married          Prof-specialty     Not-in-family  White  Female  14084         0             50              United-States
9  42   Private           13             Married-civ-spouse     Exec-managerial    Husband        White  Male    5178          0             40              United-States

X_display.shape 

    X_display.shape: (32561, 12)

           age         workclass  ...  hours-per-week  native-country
    0       39         State-gov  ...              40   United-States
    1       50  Self-emp-not-inc  ...              13   United-States
    2       38           Private  ...              40   United-States
    3       53           Private  ...              40   United-States
    4       28           Private  ...              40            Cuba
    ...    ...               ...  ...             ...             ...
    32556   27           Private  ...              38   United-States
    32557   40           Private  ...              40   United-States
    32558   58           Private  ...              40   United-States
    32559   22           Private  ...              20   United-States
    32560   52      Self-emp-inc  ...              40   United-States

    [32561 rows x 12 columns]
   age  workclass         education-num  marital-status         occupation         relationship   race   sex     capital-gain  capital-loss  hours-per-week  native-country
0  39   State-gov         13             Never-married          Adm-clerical       Not-in-family  White  Male    2174          0             40              United-States
1  50   Self-emp-not-inc  13             Married-civ-spouse     Exec-managerial    Husband        White  Male    0             0             13              United-States
2  38   Private           9              Divorced               Handlers-cleaners  Not-in-family  White  Male    0             0             40              United-States
3  53   Private           7              Married-civ-spouse     Handlers-cleaners  Husband        Black  Male    0             0             40              United-States
4  28   Private           13             Married-civ-spouse     Prof-specialty     Wife           Black  Female  0             0             40              Cuba
5  37   Private           14             Married-civ-spouse     Exec-managerial    Wife           White  Female  0             0             40              United-States
6  49   Private           5              Married-spouse-absent  Other-service      Not-in-family  Black  Female  0             0             16              Jamaica
7  52   Self-emp-not-inc  9              Married-civ-spouse     Exec-managerial    Husband        White  Male    0             0             45              United-States
8  31   Private           14             Never-married          Prof-specialty     Not-in-family  White  Female  14084         0             50              United-States
9  42   Private           13             Married-civ-spouse     Exec-managerial    Husband        White  Male    5178          0             40              United-States
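Since both frames report the same shape, (32561, 12), the shape alone cannot tell them apart; the column dtypes can. A minimal sketch of such a check, run here on two tiny hand-made frames rather than the real data (`summarize_difference` and the toy values are hypothetical):

```python
import pandas as pd


def summarize_difference(X, X_display):
    """Hypothetical helper: line up the per-column dtypes of the two frames."""
    summary = pd.DataFrame({
        "X": X.dtypes.astype(str),
        "X_display": X_display.dtypes.astype(str),
    })
    summary["same"] = summary["X"] == summary["X_display"]
    return summary


# Toy stand-ins for the encoded and the display frame.
toy_X = pd.DataFrame({
    "Age": pd.Series([39, 50], dtype="float32"),
    "Workclass": pd.Series([1, 0], dtype="int8"),
})
toy_X_display = pd.DataFrame({
    "Age": pd.Series([39, 50], dtype="float32"),
    "Workclass": pd.Series([" State-gov", " Private"], dtype="category"),
})

summary = summarize_difference(toy_X, toy_X_display)
print(toy_X.shape == toy_X_display.shape)
print(summary)
```

With the real dataset the same helper could be called as `summarize_difference(X, X_display)` after the two `shap.datasets.adult()` calls shown at the top of the article.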

Article link: https://www.xckfsq.com/news/show.html?id=1721