ML / FE (feature engineering): Automated feature generation/derivation with featuretools from a single CSV dataset (automatically split into two DataFrame tables)


米哈游招聘 2022-09-19 11:16:03 51446
Category: News


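Contents

Automated feature generation/derivation with featuretools from a single CSV dataset (automatically split into two DataFrame tables)

Design approach

1. Define the dataset

2. DFS design

Output

feature_matrix_cats_df.csv

feature_matrix_nums.csv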



Automated feature generation/derivation with featuretools from a single CSV dataset (automatically split into two DataFrame tables)

Design approach

1. Define the dataset

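import numpy as np
import pandas as pd

# Sample data. Missing values appear as np.nan, pd.NaT and '' (they print as NaN, NaT and blank).
# String values are kept in Chinese so they match the printed output:
# 男 = male, 女 = female; 打篮球 = basketball, 打羽毛球 = badminton, 打乒乓球 = table tennis.
contents = {
    "name":   ['Bob', 'LiSa', 'Mary', 'Alan'],
    "ID":     [1, 2, 3, 4],
    "age":    [np.nan, 28, 38, ''],
    "born":   [pd.NaT, pd.Timestamp("1990-01-01"), pd.Timestamp("1980-01-01"), ''],
    "sex":    ['男', '女', '女', '男'],
    "hobbey": ['打篮球', '打羽毛球', '打乒乓球', ''],
    "money":  [200.0, 240.0, 290.0, 300.0],
    "weight": [140.5, 120.8, 169.4, 155.6],
}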
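
The article does not show how the single table is split into two DataFrames, so the following is only a minimal sketch of one way it could be done, assuming the split is by dtype: numeric columns go to nums_df (keyed by name), everything else goes to cats_df (keyed by ID). The cleaning steps and the variable names key, row_id, num_cols and cat_cols are assumptions for illustration. The source's cats_df contains only three rows; that row selection is not reproduced here.

```python
import numpy as np
import pandas as pd

contents = {
    "name":   ['Bob', 'LiSa', 'Mary', 'Alan'],
    "ID":     [1, 2, 3, 4],
    "age":    [np.nan, 28, 38, ''],
    "born":   [pd.NaT, pd.Timestamp("1990-01-01"), pd.Timestamp("1980-01-01"), ''],
    "sex":    ['男', '女', '女', '男'],
    "hobbey": ['打篮球', '打羽毛球', '打乒乓球', ''],
    "money":  [200.0, 240.0, 290.0, 300.0],
    "weight": [140.5, 120.8, 169.4, 155.6],
}
df = pd.DataFrame(contents)

# Normalise the missing-value markers so each column gets a proper dtype.
df["age"] = pd.to_numeric(df["age"], errors="coerce")      # '' -> NaN
df["born"] = pd.to_datetime(df["born"], errors="coerce")   # '' -> NaT
df["hobbey"] = df["hobbey"].mask(df["hobbey"] == "")       # '' -> missing

key, row_id = "ID", "name"  # assumed join key and row identifier

# Numeric columns (excluding the key) form the numeric table.
num_cols = [c for c in df.select_dtypes(include="number").columns if c != key]
# Everything else (excluding the key and the row identifier) forms the categorical table.
cat_cols = [c for c in df.columns if c not in num_cols and c not in (key, row_id)]

nums_df = df[[row_id, key] + num_cols]
cats_df = df[[key] + cat_cols]

print(nums_df)
print(cats_df)
```

Both tables keep the ID column, which is what later lets them be related to each other.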

2. DFS design

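  • (1) Specify a dictionary containing all entities in the dataset.
  • (2) Specify how the entities are related: when two entities have a one-to-many relationship, the entity on the "one" side is called the parent entity.
  • (3) Run Deep Feature Synthesis (DFS): the minimal input to DFS is a set of entities, a set of relationships, and the "target_entity" for which features are computed. The output of DFS is a feature matrix and the corresponding list of feature definitions.
    For example, creating a feature matrix for each customer in the data yields dozens of new features describing customer behavior.
  • (4) Change the target entity: one reason DFS is so powerful is that it can create a feature matrix for any entity in the data, for example when we want to build features for sessions instead.
  • (5) Understand the feature output: in general, Featuretools refers to generated features by their feature names.
    To make features easier to understand, Featuretools provides two additional tools, featuretools.graph_feature() and featuretools.describe_feature(),
    which help explain what a feature is and the steps Featuretools took to generate it.
  • (6) Feature lineage graphs:
    a feature lineage graph visually walks through the feature generation process. Starting from the base data, it shows step by step the primitives applied and the intermediate features generated to create the final feature.
  • (7) Feature descriptions: Featuretools can also automatically generate English sentence descriptions of features. These descriptions help explain what a feature is, and can be further improved by including manually defined customizations.
    For details on customizing the automatically generated descriptions, see "Generating Feature Descriptions".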

Output

   name  ID  age        born sex hobbey  money  weight
0   Bob   1  NaN         NaT   男    打篮球  200.0   140.5
1  LiSa   2   28  1990-01-01   女   打羽毛球  240.0   120.8
2  Mary   3   38  1980-01-01   女   打乒乓球  290.0   169.4
3  Alan   4              NaT   男         300.0   155.6
-------------------------------------------
nums_df:----------------------------------
   name  ID   age  money  weight
0   Bob   1   NaN  200.0   140.5
1  LiSa   2  28.0  240.0   120.8
2  Mary   3  38.0  290.0   169.4
3  Alan   4   NaN  300.0   155.6
cats_df:----------------------------------
   ID hobbey sex        born
0   4    NaN   男         NaN
1   1    打篮球   男         NaN
2   2   打羽毛球   女  1990-01-01
---------------------------------DFS design:-----------------------------------
feature_matrix_nums
      ID   age  money  weight cats.hobbey cats.sex  cats.COUNT(nums)  \
name
Bob    1   NaN  200.0   140.5         打篮球        男               1.0
LiSa   2  28.0  240.0   120.8        打羽毛球        女               1.0
Mary   3  38.0  290.0   169.4         NaN      NaN               NaN
      cats.MAX(nums.age)  cats.MAX(nums.money)  cats.MAX(nums.weight)  \
name
Bob                  NaN                 200.0                  140.5
LiSa                28.0                 240.0                  120.8
Mary                 NaN                   NaN                    NaN
      cats.MEAN(nums.age)  cats.MEAN(nums.money)  cats.MEAN(nums.weight)  \
name
Bob                   NaN                  200.0                   140.5
LiSa                 28.0                  240.0                   120.8
Mary                  NaN                    NaN                     NaN
      cats.MIN(nums.age)  cats.MIN(nums.money)  cats.MIN(nums.weight)  \
name
Bob                  NaN                 200.0                  140.5
LiSa                28.0                 240.0                  120.8
Mary                 NaN                   NaN                    NaN
      cats.SKEW(nums.age)  cats.SKEW(nums.money)  cats.SKEW(nums.weight)  \
name
Bob                   NaN                    NaN                     NaN
LiSa                  NaN                    NaN                     NaN
Mary                  NaN                    NaN                     NaN
      cats.STD(nums.age)  cats.STD(nums.money)  cats.STD(nums.weight)  \
name
Bob                  NaN                   NaN                    NaN
LiSa                 NaN                   NaN                    NaN
Mary                 NaN                   NaN                    NaN
      cats.SUM(nums.age)  cats.SUM(nums.money)  cats.SUM(nums.weight)  \
name
Bob                  0.0                 200.0                  140.5
LiSa                28.0                 240.0                  120.8
Mary                 NaN                   NaN                    NaN
      cats.DAY(born)  cats.MONTH(born)  cats.WEEKDAY(born)  cats.YEAR(born)
name
Bob              NaN               NaN                 NaN              NaN
LiSa             1.0               1.0                 0.0           1990.0
Mary             NaN               NaN                 NaN              NaN
features_defs_nums: 29 [<Feature: ID>, <Feature: age>, <Feature: money>, <Feature: weight>, <Feature: cats.hobbey>, <Feature: cats.sex>, <Feature: cats.COUNT(nums)>, <Feature: cats.MAX(nums.age)>, <Feature: cats.MAX(nums.money)>, <Feature: cats.MAX(nums.weight)>, <Feature: cats.MEAN(nums.age)>, <Feature: cats.MEAN(nums.money)>, <Feature: cats.MEAN(nums.weight)>, <Feature: cats.MIN(nums.age)>, <Feature: cats.MIN(nums.money)>, <Feature: cats.MIN(nums.weight)>, <Feature: cats.SKEW(nums.age)>, <Feature: cats.SKEW(nums.money)>, <Feature: cats.SKEW(nums.weight)>, <Feature: cats.STD(nums.age)>, <Feature: cats.STD(nums.money)>, <Feature: cats.STD(nums.weight)>, <Feature: cats.SUM(nums.age)>, <Feature: cats.SUM(nums.money)>, <Feature: cats.SUM(nums.weight)>, <Feature: cats.DAY(born)>, <Feature: cats.MONTH(born)>, <Feature: cats.WEEKDAY(born)>, <Feature: cats.YEAR(born)>]
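
In feature_matrix_nums, every "cats." column is a feature of the parent table carried down to the child rows through ID. Mary (ID 3) has no matching row in cats_df, which is why all her "cats." values are NaN. The following is a rough pandas analogy for that lookup, not what Featuretools runs internally; the tables are hand-copied from the printed output and trimmed to a few columns.

```python
import pandas as pd

nums_df = pd.DataFrame({
    "name": ["Bob", "LiSa", "Mary", "Alan"],
    "ID": [1, 2, 3, 4],
    "money": [200.0, 240.0, 290.0, 300.0],
})
cats_df = pd.DataFrame({
    "ID": [4, 1, 2],
    "hobbey": [None, "打篮球", "打羽毛球"],
    "sex": ["男", "男", "女"],
})

# Index the parent by ID and prefix its columns the way DFS names them.
parent = cats_df.set_index("ID").add_prefix("cats.")

# Left join: every child row is kept; rows without a parent get NaN.
joined = nums_df.join(parent, on="ID").set_index("name")
print(joined)
```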
feature_matrix_cats_df
   hobbey sex  COUNT(nums)  MAX(nums.age)  MAX(nums.money)  MAX(nums.weight)  \
ID
4     NaN   男            1            NaN            300.0             155.6
1     打篮球   男            1            NaN            200.0             140.5
2    打羽毛球   女            1           28.0            240.0             120.8
    MEAN(nums.age)  MEAN(nums.money)  MEAN(nums.weight)  MIN(nums.age)  \
ID
4              NaN             300.0              155.6            NaN
1              NaN             200.0              140.5            NaN
2             28.0             240.0              120.8           28.0
    MIN(nums.money)  MIN(nums.weight)  SKEW(nums.age)  SKEW(nums.money)  \
ID
4             300.0             155.6             NaN               NaN
1             200.0             140.5             NaN               NaN
2             240.0             120.8             NaN               NaN
    SKEW(nums.weight)  STD(nums.age)  STD(nums.money)  STD(nums.weight)  \
ID
4                 NaN            NaN              NaN               NaN
1                 NaN            NaN              NaN               NaN
2                 NaN            NaN              NaN               NaN
    SUM(nums.age)  SUM(nums.money)  SUM(nums.weight)  DAY(born)  MONTH(born)  \
ID
4             0.0            300.0             155.6        NaN          NaN
1             0.0            200.0             140.5        NaN          NaN
2            28.0            240.0             120.8        1.0          1.0
    WEEKDAY(born)  YEAR(born)
ID
4             NaN         NaN
1             NaN         NaN
2             0.0      1990.0
features_defs_cats_df: 25 [<Feature: hobbey>, <Feature: sex>, <Feature: COUNT(nums)>, <Feature: MAX(nums.age)>, <Feature: MAX(nums.money)>, <Feature: MAX(nums.weight)>, <Feature: MEAN(nums.age)>, <Feature: MEAN(nums.money)>, <Feature: MEAN(nums.weight)>, <Feature: MIN(nums.age)>, <Feature: MIN(nums.money)>, <Feature: MIN(nums.weight)>, <Feature: SKEW(nums.age)>, <Feature: SKEW(nums.money)>, <Feature: SKEW(nums.weight)>, <Feature: STD(nums.age)>, <Feature: STD(nums.money)>, <Feature: STD(nums.weight)>, <Feature: SUM(nums.age)>, <Feature: SUM(nums.money)>, <Feature: SUM(nums.weight)>, <Feature: DAY(born)>, <Feature: MONTH(born)>, <Feature: WEEKDAY(born)>, <Feature: YEAR(born)>]
<Feature: SUM(nums.age)>
The sum of the "age" of all instances of "nums" for each "ID" in "cats".
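
The aggregation values above can be cross-checked without Featuretools. This is only a plain-pandas groupby sketch of what "for each ID" means, with nums_df hand-copied from the printed output. Because each ID matches exactly one row of nums, STD and SKEW come out as NaN, and the sum of a single missing age is 0.0 because pandas skips NaN when summing.

```python
import numpy as np
import pandas as pd

nums_df = pd.DataFrame({
    "name": ["Bob", "LiSa", "Mary", "Alan"],
    "ID": [1, 2, 3, 4],
    "age": [np.nan, 28.0, 38.0, np.nan],
    "money": [200.0, 240.0, 290.0, 300.0],
    "weight": [140.5, 120.8, 169.4, 155.6],
})

groups = nums_df.groupby("ID")

# COUNT(nums): number of child rows per ID.
count_nums = groups.size()

# MAX / MEAN / MIN / SKEW / STD / SUM of nums.age per ID.
age_stats = groups["age"].agg(["max", "mean", "min", "skew", "std", "sum"])

print(count_nums)
print(age_stats)
```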

feature_matrix_cats_df.csv

features_defs_cats_df: 25
[<Feature: hobbey>, <Feature: sex>, <Feature: COUNT(nums)>, <Feature: MAX(nums.age)>, <Feature: MAX(nums.money)>, <Feature: MAX(nums.weight)>, <Feature: MEAN(nums.age)>, <Feature: MEAN(nums.money)>, <Feature: MEAN(nums.weight)>, <Feature: MIN(nums.age)>, <Feature: MIN(nums.money)>, <Feature: MIN(nums.weight)>, <Feature: SKEW(nums.age)>, <Feature: SKEW(nums.money)>, <Feature: SKEW(nums.weight)>, <Feature: STD(nums.age)>, <Feature: STD(nums.money)>, <Feature: STD(nums.weight)>, <Feature: SUM(nums.age)>, <Feature: SUM(nums.money)>, <Feature: SUM(nums.weight)>, <Feature: DAY(born)>, <Feature: MONTH(born)>, <Feature: WEEKDAY(born)>, <Feature: YEAR(born)>]

ID,hobbey,sex,COUNT(nums),MAX(nums.age),MAX(nums.money),MAX(nums.weight),MEAN(nums.age),MEAN(nums.money),MEAN(nums.weight),MIN(nums.age),MIN(nums.money),MIN(nums.weight),SKEW(nums.age),SKEW(nums.money),SKEW(nums.weight),STD(nums.age),STD(nums.money),STD(nums.weight),SUM(nums.age),SUM(nums.money),SUM(nums.weight),DAY(born),MONTH(born),WEEKDAY(born),YEAR(born)
4,,,1,,300,155.6,,300,155.6,,300,155.6,,,,,,,0,300,155.6,,,,
1,打篮球,,1,,200,140.5,,200,140.5,,200,140.5,,,,,,,0,200,140.5,,,,
2,打羽毛球,,1,28,240,120.8,28,240,120.8,28,240,120.8,,,,,,,28,240,120.8,1,1,0,1990

Field descriptions

  1. <Feature: hobbey> : The "hobbey".
  2. <Feature: sex> : The "sex".
  3. <Feature: COUNT(nums)> : The number of all instances of "nums" for each "ID" in "cats".
  4. <Feature: MAX(nums.age)> : The maximum of the "age" of all instances of "nums" for each "ID" in "cats".
  5. <Feature: MAX(nums.money)> : The maximum of the "money" of all instances of "nums" for each "ID" in "cats".
  6. <Feature: MAX(nums.weight)> : The maximum of the "weight" of all instances of "nums" for each "ID" in "cats".
  7. <Feature: MEAN(nums.age)> : The average of the "age" of all instances of "nums" for each "ID" in "cats".
  8. <Feature: MEAN(nums.money)> : The average of the "money" of all instances of "nums" for each "ID" in "cats".
  9. <Feature: MEAN(nums.weight)> : The average of the "weight" of all instances of "nums" for each "ID" in "cats".
  10. <Feature: MIN(nums.age)> : The minimum of the "age" of all instances of "nums" for each "ID" in "cats".
  11. <Feature: MIN(nums.money)> : The minimum of the "money" of all instances of "nums" for each "ID" in "cats".
  12. <Feature: MIN(nums.weight)> : The minimum of the "weight" of all instances of "nums" for each "ID" in "cats".
  13. <Feature: SKEW(nums.age)> : The skewness of the "age" of all instances of "nums" for each "ID" in "cats".
  14. <Feature: SKEW(nums.money)> : The skewness of the "money" of all instances of "nums" for each "ID" in "cats".
  15. <Feature: SKEW(nums.weight)> : The skewness of the "weight" of all instances of "nums" for each "ID" in "cats".
  16. <Feature: STD(nums.age)> : The standard deviation of the "age" of all instances of "nums" for each "ID" in "cats".
  17. <Feature: STD(nums.money)> : The standard deviation of the "money" of all instances of "nums" for each "ID" in "cats".
  18. <Feature: STD(nums.weight)> : The standard deviation of the "weight" of all instances of "nums" for each "ID" in "cats".
  19. <Feature: SUM(nums.age)> : The sum of the "age" of all instances of "nums" for each "ID" in "cats".
  20. <Feature: SUM(nums.money)> : The sum of the "money" of all instances of "nums" for each "ID" in "cats".
  21. <Feature: SUM(nums.weight)> : The sum of the "weight" of all instances of "nums" for each "ID" in "cats".
  22. <Feature: DAY(born)> : The day of the month of the "born".
  23. <Feature: MONTH(born)> : The month of the "born".
  24. <Feature: WEEKDAY(born)> : The day of the week of the "born".
  25. <Feature: YEAR(born)> : The year of the "born".
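
The four date features at the end of the list are simple per-row transforms of "born". A minimal pandas sketch, assuming the born column from the printed cats_df, reproduces the values in the output: 1990-01-01 was a Monday, so the weekday is 0, and missing dates stay missing.

```python
import pandas as pd

# born column of cats_df, indexed by ID (hand-copied from the printed output).
born = pd.Series(
    [pd.NaT, pd.NaT, pd.Timestamp("1990-01-01")],
    index=pd.Index([4, 1, 2], name="ID"),
    name="born",
)

date_parts = pd.DataFrame({
    "DAY(born)": born.dt.day,          # day of the month
    "MONTH(born)": born.dt.month,      # month number
    "WEEKDAY(born)": born.dt.weekday,  # Monday = 0 ... Sunday = 6
    "YEAR(born)": born.dt.year,
})
print(date_parts)
```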

feature_matrix_nums.csv

features_defs_nums: 29
[<Feature: ID>, <Feature: age>, <Feature: money>, <Feature: weight>, <Feature: cats.hobbey>, <Feature: cats.sex>, <Feature: cats.COUNT(nums)>, <Feature: cats.MAX(nums.age)>, <Feature: cats.MAX(nums.money)>, <Feature: cats.MAX(nums.weight)>, <Feature: cats.MEAN(nums.age)>, <Feature: cats.MEAN(nums.money)>, <Feature: cats.MEAN(nums.weight)>, <Feature: cats.MIN(nums.age)>, <Feature: cats.MIN(nums.money)>, <Feature: cats.MIN(nums.weight)>, <Feature: cats.SKEW(nums.age)>, <Feature: cats.SKEW(nums.money)>, <Feature: cats.SKEW(nums.weight)>, <Feature: cats.STD(nums.age)>, <Feature: cats.STD(nums.money)>, <Feature: cats.STD(nums.weight)>, <Feature: cats.SUM(nums.age)>, <Feature: cats.SUM(nums.money)>, <Feature: cats.SUM(nums.weight)>, <Feature: cats.DAY(born)>, <Feature: cats.MONTH(born)>, <Feature: cats.WEEKDAY(born)>, <Feature: cats.YEAR(born)>]

name,ID,age,money,weight,cats.hobbey,cats.sex,cats.COUNT(nums),cats.MAX(nums.age),cats.MAX(nums.money),cats.MAX(nums.weight),cats.MEAN(nums.age),cats.MEAN(nums.money),cats.MEAN(nums.weight),cats.MIN(nums.age),cats.MIN(nums.money),cats.MIN(nums.weight),cats.SKEW(nums.age),cats.SKEW(nums.money)

Article link: https://www.xckfsq.com/news/show.html?id=2008