DATAWHALE - Introduction to Data Mining Competitions - task4 - Model Fusion

程序员文章站 2022-07-14 23:08:31

Background

In this Datawhale team-study session we mainly study topics related to data competitions; task5 covers model fusion.

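Introduction to model fusion

Model fusion combines the predictions of several already-tuned models, using methods such as weighted averaging, stacking, or voting, in order to improve overall performance.

Before fusing models, none of the base learners should be too weak, and ideally their performance is similar. They should also be diverse: their predictions must not be too highly correlated. Only when both conditions hold can the combined learners be expected to outperform each individual base learner.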
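
The diversity requirement above can be checked before fusing anything. The following is a minimal sketch, assuming the base learners' predictions are already available as plain lists; the column names and the sample values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical predictions from three base learners on the same four samples.
preds = pd.DataFrame({
    "model_1": [1.2, 3.2, 2.1, 6.2],
    "model_2": [0.9, 3.1, 2.0, 5.9],
    "model_3": [1.1, 2.9, 2.2, 6.0],
})
y_true = pd.Series([1, 3, 2, 6])

# Pairwise Pearson correlation between the models' predictions.
# Good models all track the target, so these values are usually high.
pred_corr = preds.corr()

# Correlation between the models' errors (prediction minus truth).
# Low or negative values suggest the models make different mistakes,
# which is what fusion can exploit.
residual_corr = preds.sub(y_true, axis=0).corr()

print(pred_corr.round(3))
print(residual_corr.round(3))
```

Looking at the residual correlations as well as the raw ones is a design choice of this sketch: two accurate models are always highly correlated in their raw predictions, so the errors tend to say more about how much they can help each other.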

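Fusion methods

In the Datawhale study handbook, ML67 introduces the following three broad categories of methods:

1. Simple weighted fusion:

Regression (or classification probabilities): arithmetic mean fusion, geometric mean fusion;
Classification: voting;
Combined: rank averaging, log fusion.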

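2. stacking/blending:

Build a multi-layer model, and fit a further model on the predictions of the previous layer.

3. boosting/bagging (already used in xgboost, Adaboost, and GBDT):

Boosting methods built from many trees.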
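
Two of the simple fusion methods listed above, geometric mean fusion and rank averaging, can be sketched in a few lines. This is only an illustration under stated assumptions: the probabilities and the column names are hypothetical, and dividing the averaged ranks by the number of samples is just one common way to rescale them.

```python
import numpy as np
import pandas as pd

# Hypothetical predicted probabilities from three classifiers on four samples.
preds = pd.DataFrame({
    "model_a": [0.9, 0.2, 0.6, 0.4],
    "model_b": [0.8, 0.1, 0.7, 0.3],
    "model_c": [0.7, 0.3, 0.5, 0.6],
})

# Arithmetic mean fusion: plain row-wise average.
arithmetic_mean = preds.mean(axis=1)

# Geometric mean fusion: average in log space, then exponentiate.
# Requires strictly positive predictions.
geometric_mean = np.exp(np.log(preds).mean(axis=1))

# Rank averaging: replace each model's scores by their ranks within that
# model, average the ranks per sample, and rescale by the sample count.
ranks = preds.rank(axis=0)
rank_average = ranks.mean(axis=1) / len(preds)

print(pd.DataFrame({
    "arithmetic": arithmetic_mean,
    "geometric": geometric_mean,
    "rank_avg": rank_average,
}))
```

Rank averaging discards the scale of each model's scores and keeps only their ordering, which is why it is mostly used with ranking metrics such as AUC.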

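For stacking and blending, there is an illustrated blog post that explains them, at the following address:

An illustrated guide to stacking and blending

A Zhihu column on model fusion also covers many of the methods above:

[Machine Learning] An overview of model fusion methods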

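Code

Notebook link for the code: Jupyter Notebook code link

The code in this chapter was written by ML67. We start with simple weighting:

1. Simple weighted averaging

First, construct the predictions of a few base learners, together with the labels (true values).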

## Generate some simple sample data; test_prei holds the predictions of the i-th model
test_pre1 = [1.2, 3.2, 2.1, 6.2]
test_pre2 = [0.9, 3.1, 2.0, 5.9]
test_pre3 = [1.1, 2.9, 2.2, 6.0]

# y_test_true holds the true values (labels) of the test samples
y_test_true = [1, 3, 2, 6]

Next, define a weighted-average function for three base learners:

import pandas as pd

## Three base learners; w holds the weights and defaults to a plain average
def Weighted_method(test_pre1, test_pre2, test_pre3, w=(1/3, 1/3, 1/3)):
    Weighted_result = w[0]*pd.Series(test_pre1) + w[1]*pd.Series(test_pre2) + w[2]*pd.Series(test_pre3)
    return Weighted_result
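
To see what weighting buys, the fused result can be compared against each base learner with mean absolute error (MAE). The sketch below is self-contained and reuses the sample values from above; the helper names `mae` and `weighted_average` and the weights `[0.2, 0.4, 0.4]` are hypothetical choices made for illustration, not part of the original notebook.

```python
import pandas as pd

test_pre1 = [1.2, 3.2, 2.1, 6.2]
test_pre2 = [0.9, 3.1, 2.0, 5.9]
test_pre3 = [1.1, 2.9, 2.2, 6.0]
y_test_true = [1, 3, 2, 6]

def mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    return (pd.Series(y_true) - pd.Series(y_pred)).abs().mean()

def weighted_average(preds, weights):
    """Weighted sum of the prediction lists; weights should sum to 1."""
    return sum(w * pd.Series(p) for w, p in zip(weights, preds))

preds = [test_pre1, test_pre2, test_pre3]
base_maes = [mae(y_test_true, p) for p in preds]

# Equal weights versus hand-picked weights that favour the better models.
mae_equal = mae(y_test_true, weighted_average(preds, [1/3, 1/3, 1/3]))
mae_custom = mae(y_test_true, weighted_average(preds, [0.2, 0.4, 0.4]))

print("base learners:", base_maes)
print("equal weights:", mae_equal)
print("custom weights:", mae_custom)
```

In practice the weights are chosen on a validation set rather than on the test labels; tuning them on the data used for the final score would overstate the gain.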

This function accepts custom weights. Besides it, there are also mean fusion and median fusion, implemented as follows:

## Mean fusion: row-wise mean of the three predictions
def Mean_method(test_pre1, test_pre2, test_pre3):
    Mean_result = pd.concat([pd.Series(test_pre1),pd.Series(test_pre2),pd.Series(test_pre3)],axis=1).mean(axis=1)
    return Mean_result

## Median fusion: row-wise median of the three predictions
def Median_method(test_pre1,test_pre2,test_pre3):
    Median_result = pd.concat([pd.Series(test_pre1),pd.Series(test_pre2),pd.Series(test_pre3)],axis=1).median(axis=1)
    return Median_result
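
One reason to keep both functions around is that the median is much less sensitive to a single model going badly wrong. A minimal sketch, assuming a hypothetical outlier (60.0 in place of 6.0) in the third model's predictions:

```python
import pandas as pd

y_true = pd.Series([1, 3, 2, 6])

# Same sample predictions as before, except that the third model's last
# value is a hypothetical outlier (60.0 instead of 6.0).
preds = pd.concat([
    pd.Series([1.2, 3.2, 2.1, 6.2]),
    pd.Series([0.9, 3.1, 2.0, 5.9]),
    pd.Series([1.1, 2.9, 2.2, 60.0]),
], axis=1)

mean_result = preds.mean(axis=1)
median_result = preds.median(axis=1)

# Mean absolute error of each fused prediction.
mean_mae = (mean_result - y_true).abs().mean()
median_mae = (median_result - y_true).abs().mean()

print("mean fusion MAE:  ", mean_mae)
print("median fusion MAE:", median_mae)
```

The mean is dragged far from the truth on the last sample, while the median simply ignores the extreme value.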

Next, ML67 gives a stacking implementation for regression, which uses linear regression as the stacking layer:

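import pandas as pd
from sklearn import linear_model

# Use linear regression as the stacking layer by default.
# The default model is created inside the function so that separate calls do not share one fitted object.
def Stacking_method(train_reg1,train_reg2,train_reg3,y_train_true,test_pre1,test_pre2,test_pre3,model_L2=None):
    if model_L2 is None:
        model_L2 = linear_model.LinearRegression()
    # Fit the second-layer model on the base learners' training-set predictions
    model_L2.fit(pd.concat([pd.Series(train_reg1),pd.Series(train_reg2),pd.Series(train_reg3)],axis=1).values,y_train_true)
    Stacking_result = model_L2.predict(pd.concat([pd.Series(test_pre1),pd.Series(test_pre2),pd.Series(test_pre3)],axis=1).values)
    # Print the fitted linear-regression parameters
    print(model_L2.coef_, model_L2.intercept_)
    return Stacking_result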
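
The function above takes the base learners' training-set predictions as given. If those predictions come from models fitted on the same training rows, the stacking layer sees overly optimistic inputs, so they are commonly generated out-of-fold. The following is a minimal sketch of that idea using scikit-learn's `cross_val_predict`; the synthetic dataset, the two base models, and all variable names are assumptions made for illustration and are not taken from the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic regression data standing in for a real competition dataset.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical base learners for the first layer.
base_models = [
    Ridge(alpha=1.0),
    RandomForestRegressor(n_estimators=50, random_state=0),
]

# Out-of-fold predictions: each training row is predicted by a model
# that did not see that row during fitting.
train_meta = np.column_stack(
    [cross_val_predict(m, X_train, y_train, cv=5) for m in base_models]
)

# For the test set, refit each base learner on the full training set.
test_meta = np.column_stack(
    [m.fit(X_train, y_train).predict(X_test) for m in base_models]
)

# Second layer: linear regression on the first-layer predictions.
meta_model = LinearRegression().fit(train_meta, y_train)
stacked_pred = meta_model.predict(test_meta)

print("stacking weights:", meta_model.coef_, "intercept:", meta_model.intercept_)
```

The columns of `train_meta` and `test_meta` play the roles of `train_reg1..3` and `test_pre1..3` in `Stacking_method`.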

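The functions above implement model fusion for regression models. Below are some functions for classification models, starting with voting.

First, define classifiers using three different algorithms: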

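from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Define models using three different algorithms
clf1 = XGBClassifier(learning_rate=0.1, n_estimators=150, max_depth=3, min_child_weight=2, subsample=0.7, colsample_bytree=0.6,  objective='binary:logistic')

clf2 = RandomForestClassifier(n_estimators=1000, random_state=10)

clf3 = SVC(C=0.1)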

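Voting comes in two forms, hard voting and soft voting. Hard voting takes each model's predicted class as one vote, here with every vote counting equally (no weights), and the class with the most votes becomes the final prediction: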

# Hard voting
# x, y: the feature matrix and labels, assumed to have been loaded earlier
eclf = VotingClassifier(estimators=[('xgb', clf1), ('rf', clf2), ('svc', clf3)], voting='hard')
for clf, label in zip([clf1, clf2, clf3, eclf], ['XGBoost', 'Random Forest', 'SVM', 'Ensemble']):
    scores = cross_val_score(clf, x, y, cv=5, scoring='accuracy')  # cross_val_score fits the model internally
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

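Soft voting works on predicted class probabilities rather than predicted labels: it averages the probabilities from all models and predicts the class with the highest average. Weights can be assigned so that different models carry different importance. Here the model weights are set by hand: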

# Soft voting needs class probabilities, so SVC must be created with probability=True; otherwise an error is raised
clf3 = SVC(C=0.1, probability=True)

# Soft voting with hand-picked weights: XGBoost counts twice as much as the other two models
eclf = VotingClassifier(estimators=[('xgb', clf1), ('rf', clf2), ('svc', clf3)], voting='soft', weights=[2, 1, 1])

for clf, label in zip([clf1, clf2, clf3, eclf], ['XGBoost', 'Random Forest', 'SVM', 'Ensemble']):
    scores = cross_val_score(clf, x, y, cv=5, scoring='accuracy')  # cross_val_score fits the model internally
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
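
The difference between hard and soft voting is easiest to see on a single sample worked by hand. The sketch below does not use `VotingClassifier`; it reproduces the two rules with NumPy on hypothetical class probabilities, so every number in it is an assumption chosen to make the two rules disagree.

```python
import numpy as np

# Hypothetical class probabilities for one sample from three classifiers.
# Columns are classes 0 and 1; each row is one classifier.
proba = np.array([
    [0.45, 0.55],   # classifier 1 leans slightly towards class 1
    [0.40, 0.60],   # classifier 2 leans slightly towards class 1
    [0.90, 0.10],   # classifier 3 is very confident in class 0
])

# Hard voting: each classifier casts one vote for its most likely class.
hard_votes = proba.argmax(axis=1)
hard_winner = np.bincount(hard_votes, minlength=proba.shape[1]).argmax()

# Soft voting: average the probabilities, then pick the most likely class.
soft_winner = proba.mean(axis=0).argmax()

# Weighted soft voting: the first classifier counts twice as much.
weights = np.array([2, 1, 1])
weighted_proba = np.average(proba, axis=0, weights=weights)
weighted_soft_winner = weighted_proba.argmax()

print("hard voting picks class", hard_winner)
print("soft voting picks class", soft_winner)
print("weighted soft voting picks class", weighted_soft_winner)
```

Two lukewarm votes outnumber one confident vote under hard voting, while soft voting lets the confident classifier win, which is why soft voting is usually preferred when the models' probabilities are reasonably calibrated.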

Summary

Due to time constraints, the code section of this article currently shows only part of the results and will be updated further.
