ML with xgboost: regression prediction on a dataset using the xgboost algorithm (feature selection and GridSearchCV)


小猪佩奇 · 2022-09-19 15:07:03 · 67078 views


Contents

Output

Implementation Code


Output

  ['EnterCOD', 'EnterBOD', 'EnterAD', 'EnterZL', 'EnterZD', 'EnterPH', 'EnterSS', 'M4', 'N4', 'O4', 'P4', 'Q4', 'R4']
     EnterCOD  EnterBOD  EnterAD  EnterZL  EnterZD  EnterPH  EnterSS      M4  \
  0     299.0       0.0     16.7     9.63     26.5        7    354.0  4609.0
  1     331.0       0.0     15.0     9.34     31.8        7    297.5  4834.0
  2     326.0       0.0     19.6    11.17     33.5        7    389.5  4928.0
  3     230.0       0.0     17.4     6.23     32.3        7    277.5  5073.0
  4     149.0       0.0     16.8     3.59     23.7        7    106.0  4856.0
         N4    O4    P4     Q4    R4
  0  2346.0  1.72  32.0  69.43  17.0
  1  2434.0  1.72  34.0  70.34  18.0
  2  2604.0  1.70  35.0  71.02  18.0
  3  2678.0  1.68  36.0  70.96  19.0
  4  2452.0  1.69  37.0  76.19  19.0
  mlss accuracy: 0.950752699205583
  Features: Index(['EnterCOD', 'EnterBOD', 'EnterAD', 'EnterZL', 'EnterZD', 'EnterPH',
         'EnterSS', 'M4', 'N4', 'O4', 'P4', 'Q4', 'R4'],
        dtype='object')
  Importance of each feature: [100.       21.307432 48.90534  37.218624 26.950356  2.081406
   31.82239  72.88005  49.49121  61.9334   19.071848 33.441257
   17.745914]
  mlss accuracy after selecting important features: 0.9485146037853682
  Important features: Index(['EnterCOD', 'M4', 'O4', 'N4', 'EnterAD', 'EnterZL', 'Q4', 'EnterSS',
         'EnterZD', 'EnterBOD', 'P4', 'R4'],
        dtype='object')
  Importance of each important feature: [100.       92.00673  75.79092  55.387436 36.038513 32.217636
   42.442307 28.243927 24.789852 12.685312 18.707016 19.150238]
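The importance values in the printout are xgboost's raw feature_importances_ rescaled so the strongest feature reads 100, and the "important features" pass simply drops the weakest one before retraining. A minimal numpy sketch of that rescaling and ranking (the raw values here are made up for illustration, not taken from the run above):

```python
import numpy as np

# Hypothetical raw importances, shaped like clf.feature_importances_
raw = np.array([0.40, 0.085, 0.196, 0.149, 0.108, 0.008])

# Rescale so the strongest feature is exactly 100, matching the printout style
pct = 100.0 * raw / raw.max()

# argsort is ascending; reverse for strongest-first, then drop the weakest feature
order = np.argsort(pct)[::-1]
top = order[:-1]
print(pct[order])
```

Indexing a column Index with `top` (as the implementation below does) then yields the reduced feature set for the second training pass.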

Implementation Code

  # ML with xgboost: regression prediction using feature selection and GridSearchCV
  import pandas as pd
  import numpy as np
  import matplotlib.pyplot as plt
  from sklearn import metrics
  import pickle
  from xgboost.sklearn import XGBRegressor
  from sklearn.preprocessing import StandardScaler
  from sklearn.model_selection import KFold, train_test_split, GridSearchCV, cross_val_score
  from sklearn.model_selection import TimeSeriesSplit
  # prep_water_data and delete_null_date are defined below, so the original
  # import from clean_data is unnecessary here

  def GDBTTrain(X, y):
      """Basic xgboost usage."""
      # test_size is the fraction of the data held out for testing
      train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=0)
      test_preds = pd.DataFrame({"label": test_y})
      clf = XGBRegressor(
          learning_rate=0.1,   # default is 0.3
          n_estimators=400,    # number of trees
          max_depth=8,
      )
      clf.fit(train_x, train_y)
      test_preds['y_pred'] = clf.predict(test_x)
      stdm = metrics.r2_score(test_preds['label'], test_preds['y_pred'])
      # GridSearchCV and cross_val_score give the same results
      scores = cross_val_score(clf, X, y, scoring='r2')
      print(scores)
      gs = GridSearchCV(clf, {}, cv=3, verbose=3).fit(X, y)
      return stdm, clf

  def XGTSearch(X, y):
      print("Parameter optimization")
      n_estimators = [50, 100, 200, 400]
      max_depth = [2, 4, 6, 8]
      learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
      param_grid = dict(max_depth=max_depth, n_estimators=n_estimators, learning_rate=learning_rate)
      # eval_metric is a constructor argument in recent xgboost versions
      # (fit_params in the GridSearchCV constructor is no longer supported by sklearn)
      xgb_model = XGBRegressor(eval_metric="rmse")
      kfold = TimeSeriesSplit(n_splits=2)  # pass the splitter itself as cv, not get_n_splits()
      grid_search = GridSearchCV(xgb_model, param_grid, verbose=1, cv=kfold)
      grid_result = grid_search.fit(X, y)
      # summarize results
      print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
      means = grid_result.cv_results_['mean_test_score']
      stds = grid_result.cv_results_['std_test_score']
      params = grid_result.cv_results_['params']
      for mean, stdev, param in zip(means, stds, params):
          print("%f (%f) with: %r" % (mean, stdev, param))
      return means, grid_result

  feature_string = 'EnterCOD EnterBOD EnterAD EnterZL EnterZD EnterPH EnterSS M4 N4 O4 P4 Q4 R4'  # selected features
  outputs_string = 'mlss mlvss sv30 OutCOD OutBOD OutAD OutZL OutZD OutPH OutSS'  # labels to predict
  feature = feature_string.split()
  outputs = outputs_string.split()
  print(feature)

  def prep_water_data(data, columns):
      """Replace placeholder strings and NaN with 0."""
      for c in columns:
          data[c] = [0 if ((x in ['Not Available', 'Not Mapped', 'NULL']) | (pd.isnull(x))) else x for x in data[c]]
      return data

  def delete_null_date(data, date_name):
      data = data[data[date_name].notnull()]  # drop rows whose date is missing
      return data

  data = pd.read_csv('water_a.csv', encoding="gb18030")
  data = prep_water_data(data, feature)
  print(data.iloc[:5][feature])

  def predict(data, out):
      data = delete_null_date(data, out)
      y = data[out].to_numpy()  # as_matrix() was removed from pandas; use to_numpy()
      X = data[feature]
      stdm, clf = GDBTTrain(X, y)
      print(out + ' accuracy:', stdm)
      feature_importance = clf.feature_importances_
      feature_importance = 100.0 * (feature_importance / feature_importance.max())
      print('Features:', X.columns)
      print('Importance of each feature:', feature_importance)
      sorted_idx = np.argsort(feature_importance)
      pos = np.arange(sorted_idx.shape[0])
      plt.barh(pos, feature_importance[sorted_idx], align='center')
      plt.yticks(pos, X.columns[sorted_idx])
      plt.xlabel('Importance')  # barh puts importance on the x axis
      plt.ylabel('Features')
      plt.title('Variable Importance')
      plt.show()
      # .............. retrain and predict again with only the high-importance features ..............
      X = data[X.columns[sorted_idx][::-1][:-1]]  # keep all but the least important feature
      stdm, clf = GDBTTrain(X, y)
      print(out + ' accuracy after selecting important features:', stdm)
      feature_importance = clf.feature_importances_
      feature_importance = 100.0 * (feature_importance / feature_importance.max())
      print('Important features:', X.columns)
      print('Importance of each important feature:', feature_importance)
      sorted_idx = np.argsort(feature_importance)
      pos = np.arange(sorted_idx.shape[0])
      plt.barh(pos, feature_importance[sorted_idx], align='center')
      plt.yticks(pos, X.columns[sorted_idx])
      plt.xlabel('Importance')
      plt.ylabel('Features')
      plt.title('Variable Importance (selected features)')
      plt.show()

  for out in outputs[:1]:
      predict(data, out)  # predict() returns nothing, so no assignment is needed
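The grid search in XGTSearch follows the standard sklearn pattern: build a param_grid dict, pass a CV splitter as cv, fit, and read cv_results_. A self-contained sketch of the same mechanics on synthetic data, using sklearn's DecisionTreeRegressor as a stand-in estimator so it runs even without xgboost installed (the data and parameter values here are illustrative only):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: y depends mostly on the first two columns
rng = np.random.RandomState(0)
X = rng.rand(120, 3)
y = 2.0 * X[:, 0] + X[:, 1] + 0.05 * rng.rand(120)

# One hyperparameter axis; xgboost grids just add more keys to this dict
param_grid = {"max_depth": [2, 4, 6]}

# Pass the splitter object itself as cv; TimeSeriesSplit keeps folds in time order
cv = TimeSeriesSplit(n_splits=3)
gs = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid,
                  scoring="r2", cv=cv)
gs.fit(X, y)

# Same summary loop as XGTSearch above
print("Best: %f using %s" % (gs.best_score_, gs.best_params_))
for mean, params in zip(gs.cv_results_["mean_test_score"], gs.cv_results_["params"]):
    print("%f with: %r" % (mean, params))
```

Swapping the estimator for XGBRegressor() and widening param_grid recovers the search performed in the article; the reporting code is identical.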


Original article: https://www.xckfsq.com/news/show.html?id=3176