Datawhale Data Mining for Beginners - Task 4: Modeling and Hyperparameter Tuning

程序员文章站 (Programmer Article Station), 2022-07-14 10:43:28

...

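I. Preface

Thanks to Datawhale for the study guide: https://tianchi.aliyun.com/notebook-ai/detail?spm=5176.12586969.1002.6.1cd81b43dZv7yn&postId=95460
The data below come mainly from the "Data Mining for Beginners: Used Car Transaction Price Prediction" competition: https://tianchi.aliyun.com/competition/entrance/231784/information

II. Learning objectives

Learn the commonly used machine learning models and master the workflow for building and tuning them.
Complete the corresponding check-in tasks.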

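III. Learning process

1. Introduction to the relevant principles and recommended reading

(1) Linear regression

Linear regression is a widely used regression technique and one of the simplest models in machine learning; it has many generalized forms.
In essence it is a linear combination of a set of features. In two-dimensional space it can be viewed as a straight line, and in three-dimensional space as a plane.
The most common form of linear regression is: f(x) = w'x + b
The vector x represents one sample {x1, x2, x3, …, xn}, where x1, x2, x3, … are the features of that sample.
w is a vector holding the weight of each feature.
b is a scalar equal to the prediction when all features are 0; it is the model's bias (intercept).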
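The formula above can be checked by hand with a few lines of plain Python. This is a minimal sketch: the function name `linear_predict` and the toy weights, bias, and sample are hypothetical values chosen for illustration, not taken from the competition data.

```python
def linear_predict(x, w, b):
    """Return f(x) = w'x + b for a single sample x."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    # Weighted sum of the features plus the bias term.
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b


# Hypothetical toy values, not taken from the competition data.
weights = [2.0, -1.0, 0.5]
bias = 3.0
sample = [1.0, 4.0, 2.0]

prediction = linear_predict(sample, weights, bias)         # 2 - 4 + 1 + 3
baseline = linear_predict([0.0, 0.0, 0.0], weights, bias)  # all features 0
```

With every feature set to 0 the prediction collapses to the bias, which matches the description of b above.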

Reference: https://zhuanlan.zhihu.com/p/49480391

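(2) Decision Tree

In decision analysis, a decision tree is a method that, given the known probabilities of various outcomes, builds a tree to compute the probability that the expected net present value is greater than or equal to zero, in order to evaluate project risk and judge feasibility. It is a graphical way of applying probability analysis directly.
Because the decision branches, when drawn, look much like the branches of a tree, it is called a decision tree.
In machine learning, a decision tree is a predictive model that represents a mapping between object attributes and object values.
It is a tree structure in which each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class (or, in a regression tree, a predicted value).
A decision tree contains:
1. Decision nodes: usually drawn as rectangles.
2. Chance nodes: usually drawn as circles.
3. End nodes: usually drawn as triangles.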

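(3) GBDT model

GBDT is an ensemble model: a gradient boosting (GB) algorithm that uses decision trees (CART) as base learners and builds them iteratively, one tree at a time.
"Boost" means to improve. Boosting algorithms are generally iterative: each new round of training aims to improve on the previous result.
The core of GBDT is that each tree learns the residual left by the sum of the predictions of all previous trees.
CART is a decision tree model. Compared with the ordinary ID3 and C4.5 trees,
the main characteristics of a CART tree are:
1. It is a binary tree.
2. The test at every node has only two outcomes, "yes" and "no".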
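The residual-fitting idea can be sketched without any library. The code below is an illustrative toy, not the GBDT implementation used later in this post: it assumes squared-error loss (so the residual is the negative gradient), a single numeric feature, and one-split trees (stumps); the names `fit_stump`, `fit_gbdt`, and the data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (stump) to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # the largest x would leave the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x <= threshold else right_value


def fit_gbdt(xs, ys, n_trees=20, learning_rate=0.5):
    """Start from the mean, then let every new stump fit the current residuals."""
    base = sum(ys) / len(ys)
    trees = []
    preds = [base] * len(ys)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)


def mse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Hypothetical toy data with three rough price levels.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 3.1, 3.0, 5.2, 5.1]
mean_y = sum(ys) / len(ys)

model = fit_gbdt(xs, ys)
mse_mean = mse(lambda x: mean_y, xs, ys)  # error of predicting the mean everywhere
mse_gbdt = mse(model, xs, ys)             # training error after boosting
```

Each round only has to correct what the previous rounds got wrong, which is why the training error of the boosted model ends up below that of the constant mean prediction.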

Reference: https://zhuanlan.zhihu.com/p/45145899

Reference: https://www.zhihu.com/topic/20066371/top-answers

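(4) XGBoost model

XGBoost is a scalable machine learning system for tree boosting.
The core idea of the XGBoost algorithm:
1. Keep adding trees, growing each tree by repeatedly splitting on features. Each added tree amounts to learning a new function f(x) that fits the residual of the previous prediction.
2. Once training has produced k trees, the score of a sample is predicted as follows: based on its features, the sample falls into one leaf node in each tree, and each leaf node carries a score.
3. Finally, the scores from all the trees are added up to give the predicted value of the sample.
As with GBDT, XGBoost accumulates the scores of multiple trees to obtain the final prediction (in every iteration, one tree is added to the existing ones to fit the residual between the current prediction and the true value).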
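Steps 2 and 3 can be illustrated with hand-written trees. This is only a sketch of the prediction step, not XGBoost's internal data structure: the nested-dictionary tree layout, the function names, and all thresholds and leaf scores are hypothetical (only the feature names `kilometer` and `power` come from the dataset used in this post).

```python
def leaf_score(tree, sample):
    """Walk one tree from the root until a leaf is reached and return its score."""
    node = tree
    while "leaf" not in node:
        go_left = sample[node["feature"]] < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]


def predict(trees, sample, base_score=0.0):
    """The prediction is the sum of the leaf scores from every tree."""
    return base_score + sum(leaf_score(tree, sample) for tree in trees)


# Two hypothetical trees with made-up thresholds and leaf scores.
trees = [
    {"feature": "kilometer", "threshold": 10.0,
     "left": {"leaf": 0.5}, "right": {"leaf": -0.25}},
    {"feature": "power", "threshold": 100.0,
     "left": {"leaf": -0.125},
     "right": {"feature": "kilometer", "threshold": 5.0,
               "left": {"leaf": 0.75}, "right": {"leaf": 0.25}}},
]

sample = {"kilometer": 8.0, "power": 150.0}
score = predict(trees, sample)  # 0.5 from the first tree + 0.25 from the second
```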

Reference: https://www.cnblogs.com/mantch/p/11164221.html

Reference: https://www.jianshu.com/p/a62f4dce3ce8

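(5) LightGBM model

LightGBM uses a leaf-wise growth strategy: each time it finds, among all current leaves, the one with the largest split gain (usually also the one holding the most data), splits it, and repeats.
LightGBM's sampling optimization (gradient-based one-side sampling, GOSS) keeps the samples with large gradients and randomly keeps a subset of the samples with small gradients, amplifying the contribution of the retained small-gradient samples when computing information gain.
- This sounds abstract, so let us walk through the procedure: first sort the samples by gradient magnitude and select the top fraction a of them; then randomly select a fraction b of the samples from the remaining small-gradient data; when computing the information gain, multiply the contribution of the selected small-gradient samples by (1 - a) / b. This avoids changing the data distribution too much.
- To me this feels like an apartment that originally housed ten people and felt too crowded, so six were evicted, but the remaining four have to cover the rent of those six as well.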
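The sampling procedure just described can be sketched in plain Python. This is a simplified illustration, not LightGBM's actual code: it assumes a and b are fractions of the full sample count, and the function name `goss_sample` and the toy gradients are hypothetical.

```python
import random


def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Return (index, weight) pairs chosen by gradient-based one-side sampling."""
    n = len(gradients)
    # Sort sample indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_n = round(a * n)
    rand_n = round(b * n)
    large = order[:top_n]                          # always kept, weight 1
    rng = random.Random(seed)
    small = rng.sample(order[top_n:], rand_n)      # random subset of the rest
    amplify = (1 - a) / b                          # compensates for the dropped samples
    return [(i, 1.0) for i in large] + [(i, amplify) for i in small]


# Hypothetical gradients for 100 samples.
gradients = [(i - 50) / 10 for i in range(100)]
picked = goss_sample(gradients, a=0.2, b=0.1)
total_weight = sum(weight for _, weight in picked)
```

Only 30 of the 100 samples are kept, yet their weights add up to roughly 100, which is the sense in which the original data distribution is preserved.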

Reference: https://zhuanlan.zhihu.com/p/89360721

Reference: https://www.biaodianfu.com/lightgbm.html

2. Reading the data

import pandas as pd 
import numpy as np 
import warnings 
warnings.filterwarnings('ignore')

warnings.filterwarnings(): filters warnings by inserting an entry into the list of warning filter rules.
- warnings.filterwarnings('ignore'): ignores matching warnings.

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.
    """
    start_mem = df.memory_usage().sum()
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')
                
    end_mem = df.memory_usage().sum()
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

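The reduce_mem_usage function reduces the memory occupied by the data by downcasting column data types. Note that df.memory_usage().sum() returns a size in bytes, so the figures printed below with an "MB" label are actually byte counts.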

sample_feature = reduce_mem_usage(pd.read_csv('data_for_tree.csv'))

Memory usage of dataframe is 62099624.00 MB
Memory usage after optimization is: 16520255.00 MB
Decreased by 73.4%

continuous_feature_names = [x for x in sample_feature.columns if x not in ['price','brand','model']]

3. Linear regression & five-fold cross-validation & simulating a real business scenario

sample_feature = sample_feature.dropna().replace('-', 0).reset_index(drop=True) 
sample_feature['notRepairedDamage'] = sample_feature['notRepairedDamage'].astype(np.float32) 
train = sample_feature[continuous_feature_names + ['price']]

train_X = train[continuous_feature_names] 
train_y = train['price']

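sample_feature.dropna(): drops rows with missing values.

sample_feature.dropna().replace('-', 0).reset_index(drop=True): after dropping rows with missing values, replaces every '-' with 0 and resets the index, discarding the old index column.

sample_feature['notRepairedDamage'].astype(np.float32): casts the column to float32.

(1) Simple modeling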

from sklearn.linear_model import LinearRegression
model = LinearRegression(normalize=True)

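LinearRegression(): ordinary least squares linear regression. With normalize=True, the regressors X are normalized before regression. Note that the normalize parameter was deprecated in scikit-learn 1.0 and removed in 1.2, so this call only works on older versions.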

model = model.fit(train_X, train_y)

model.fit(X, y, sample_weight=None): fits the linear model.
- X: array-like or sparse matrix --> training data
- y: array-like --> target values
- sample_weight=None: individual weights for each sample.

'intercept:'+ str(model.intercept_)
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)

[('v_6', 3367064.341641862),
 ('v_8', 700675.5609398965),
 ('v_9', 170630.27723222625),
 ('v_7', 32322.66193204228),
 ('v_12', 20473.670796959854),
 ('v_3', 17868.079541492534),
 ('v_11', 11474.938996713121),
 ('v_13', 11261.76456001184),
 ('v_10', 2683.920090588445),
 ('gearbox', 881.822503924815),
 ('fuelType', 363.9042507216377),
 ('bodyType', 189.60271012071905),
 ('city', 44.94975120523428),
 ('power', 28.55390161675822),
 ('brand_price_median', 0.5103728134078572),
 ('brand_price_std', 0.4503634709263408),
 ('brand_amount', 0.14881120395067576),
 ('brand_price_max', 0.0031910186703164602),
 ('SaleID', 5.355989919853865e-05),
 ('offerType', 4.058936610817909e-06),
 ('train', -2.3469328880310059e-07),
 ('seller', -1.482432708144188e-06),
 ('brand_price_sum', -2.1750068681879964e-05),
 ('name', -0.0002980012713079153),
 ('used_time', -0.002515894332888446),
 ('brand_price_average', -0.4049048451011004),
 ('brand_price_min', -2.2467753486895097),
 ('power_bin', -34.42064411732811),
 ('v_14', -274.7841180773582),
 ('kilometer', -372.89752666071536),
 ('notRepairedDamage', -495.1903844629893),
 ('v_0', -2045.054957354484),
 ('v_5', -11022.986240434542),
 ('v_4', -15121.731109853818),
 ('v_2', -26098.299920531143),
 ('v_1', -45556.18929728326)]

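sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True): zip pairs each name in continuous_feature_names with the corresponding value in model.coef_ as a tuple; dict turns the pairs into a dictionary, and .items() yields them again as (name, coefficient) pairs. sorted then returns a list ordered by the coefficient (x[1]) in descending order.

This shows the intercept and the weights (coef) of the trained linear regression model.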
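The zip / dict / sorted chain is easier to follow on a tiny input. The sketch below uses three feature names from the output above with rounded, purely illustrative coefficients; the second ranking by absolute value is an addition for comparison, not something the original code does.

```python
names = ["power", "kilometer", "v_0"]
coefs = [28.5, -372.9, -2045.1]  # rounded, illustrative values

# Same pattern as above: pair names with coefficients, sort by coefficient, descending.
ranked = sorted(dict(zip(names, coefs)).items(), key=lambda kv: kv[1], reverse=True)

# Variant: rank by magnitude, so large negative weights are not pushed to the bottom.
by_magnitude = sorted(zip(names, coefs), key=lambda kv: abs(kv[1]), reverse=True)
```

Sorting by signed value puts strongly negative weights last, so the magnitude variant may be the more useful view when the question is which features the model leans on most (assuming the features are on comparable scales).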

from matplotlib import pyplot as plt
subsample_index = np.random.randint(low=0, high=len(train_y), size=50)

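np.random.randint(low=0, high=len(train_y), size=50): returns random integers from low (inclusive) to high (exclusive). Here the lowest possible value is 0, the upper bound is the length of train_y, and 50 values are drawn.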

plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black') 
plt.scatter(train_X['v_9'][subsample_index], model.predict(train_X.loc[subsample_index]), color='blue')
plt.xlabel('v_9') 
plt.ylabel('price') 
plt.legend(['True Price','Predicted Price'],loc='upper right') 
print('The predicted price is obvious different from true price') 
plt.show()

The predicted price is obvious different from true price


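Plotting a scatter chart of feature v_9 against the label shows that the model's predictions (blue points) are distributed quite differently from the true labels (black points), and some predictions are below 0, which indicates that the model has problems.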

import seaborn as sns
print('It is clear to see the price shows a typical exponential distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.distplot(train_y)
plt.subplot(1,2,2)
sns.distplot(train_y[train_y < np.quantile(train_y, 0.9)])

It is clear to see the price shows a typical exponential distribution





<matplotlib.axes._subplots.AxesSubplot at 0x1e58ab3d1d0>


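The first plot uses all of train_y; the second plot uses only the values of train_y below its 90th percentile.

The plots show that the label (price) has a long-tailed distribution, which is unfavorable for modeling. The reason is that many models assume normally distributed error terms, and long-tailed data violate this assumption. Reference blog: https://blog.csdn.net/Noob_daniel/article/details/76087829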

train_y_ln = np.log(train_y + 1)

Here we apply a log(x + 1) transform to the label to bring it closer to a normal distribution.
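A quick round trip shows how the transform and its inverse fit together. This sketch uses the standard library's `math.log1p` and `math.expm1` on hypothetical prices; predictions made on the log(x + 1) scale have to be mapped back with exp(v) - 1, not with a plain exponential.

```python
import math

prices = [0.0, 500.0, 3000.0, 99999.0]  # hypothetical prices

log_prices = [math.log1p(p) for p in prices]     # log(x + 1), defined even at 0
restored = [math.expm1(v) for v in log_prices]   # exp(v) - 1 undoes the transform
off_by_one = [math.exp(v) for v in log_prices]   # plain exp is too large by exactly 1
```

The +1 inside the logarithm keeps a price of 0 finite (it maps to 0), which is the usual reason for choosing log(x + 1) over log(x).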

import seaborn as sns
print('The transformed price seems like normal distribution')
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.distplot(train_y_ln)
plt.subplot(1,2,2)
sns.distplot(train_y_ln[train_y_ln < np.quantile(train_y_ln, 0.9)])

The transformed price seems like normal distribution





<matplotlib.axes._subplots.AxesSubplot at 0x1e58acf9ac8>


model = model.fit(train_X, train_y_ln)

print('intercept:'+ str(model.intercept_))
sorted(dict(zip(continuous_feature_names, model.coef_)).items(), key=lambda x:x[1], reverse=True)

intercept:18.75074946557286





[('v_9', 8.052409900568154),
 ('v_5', 5.764236596653902),
 ('v_12', 1.6182081236781853),
 ('v_1', 1.479831058297011),
 ('v_11', 1.1669016563603853),
 ('v_13', 0.9404711296032395),
 ('v_7', 0.7137273083565033),
 ('v_3', 0.6837875771077901),
 ('v_0', 0.008500518010088588),
 ('power_bin', 0.008497969302894976),
 ('gearbox', 0.007922377278324315),
 ('fuelType', 0.006684769706823328),
 ('bodyType', 0.00452352009270419),
 ('power', 0.0007161894205356782),
 ('brand_price_min', 3.3343511147486766e-05),
 ('brand_amount', 2.8978797042770635e-06),
 ('brand_price_median', 1.2571172873034594e-06),
 ('brand_price_std', 6.659176363444686e-07),
 ('brand_price_max', 6.194956307514967e-07),
 ('brand_price_average', 5.999345965034972e-07),
 ('SaleID', 2.1194170039643388e-08),
 ('seller', 1.000444171950221e-10),
 ('train', -4.547473508864641e-13),
 ('offerType', -8.637357495899778e-11),
 ('brand_price_sum', -1.5126504215913738e-10),
 ('name', -7.015512588892976e-08),
 ('used_time', -4.122479372350753e-06),
 ('city', -0.002218782481041604),
 ('v_14', -0.004234223418112898),
 ('kilometer', -0.01383586622688241),
 ('notRepairedDamage', -0.2702794234984524),
 ('v_4', -0.8315701200993837),
 ('v_2', -0.9470842241612685),
 ('v_10', -1.6261466689777442),
 ('v_8', -40.343007487616696),
 ('v_6', -238.7903638550667)]

plt.scatter(train_X['v_9'][subsample_index], train_y[subsample_index], color='black')
plt.scatter(train_X['v_9'][subsample_index], np.expm1(model.predict(train_X.loc[subsample_index])), color='blue')  # expm1 inverts the log(x + 1) transform
plt.xlabel('v_9')
plt.ylabel('price')
plt.legend(['True Price','Predicted Price'],loc='upper right')
print('The predicted price seems normal after np.log transforming')
plt.show()

The predicted price seems normal after np.log transforming


Visualizing again, the predictions are now fairly close to the true values and no anomalies appear.

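(2) Five-fold cross-validation

When a training set is used to fit parameters, people usually split the full dataset into three parts (as with the MNIST handwritten digit dataset): a training set (train_set), a validation set (valid_set), and a test set (test_set). This is done deliberately to safeguard the quality of training. The test set is easy to understand: it is data that takes no part in training and is used only to observe test performance. The training and validation sets involve the ideas below.

In practice, a trained model usually fits the training set quite well (it is sensitive to initial conditions), but its fit to data outside the training set is often less satisfactory. So we usually do not train on all of the data; instead we hold out a part (which does not take part in training) to test the parameters learned from the training set and judge relatively objectively how well they fit unseen data. This idea is called cross-validation. In five-fold cross-validation the data are split into five parts, and each part in turn serves as the held-out set while the other four are used for training.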
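The fold rotation can be written out in a few lines. This is a minimal index-based sketch of k-fold splitting without shuffling; the function name `k_fold_indices` is hypothetical and it is not the splitter that scikit-learn uses internally.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, valid_idx) pairs; every sample is held out exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size


folds = list(k_fold_indices(12, k=5))
held_out = sorted(i for _, valid in folds for i in valid)
```

Averaging the five validation scores gives an estimate that does not depend on one particular choice of held-out rows.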

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error,  make_scorer

def log_transfer(func):
    def wrapper(y, yhat):
        result = func(np.log(y), np.nan_to_num(np.log(yhat)))
        return result
    return wrapper

scores = cross_val_score(model, X=train_X, y=train_y, verbose=1, cv = 5, scoring=make_scorer(log_transfer(mean_absolute_error)))

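cross_val_score(): evaluates a score by cross-validation. X is the feature data; y is the target to predict; verbose controls the verbosity; cv is the number of folds (or a cross-validation generator or iterable); scoring receives a scorer built with make_scorer, here wrapping mean_absolute_error (the mean absolute error regression loss) in log_transfer.

Mean absolute error: the average of the absolute differences between predicted and observed values. A scorer built with make_scorer calls the function as func(y_true, y_pred), so in wrapper y is the observed (true) value and yhat is the predicted value.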

print('AVG:', np.mean(scores))

AVG: 1.365802392031409

Five-fold cross-validation of the linear regression model on the features with the untransformed label.
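To see what the log_transfer decorator does to the metric, here is a dependency-free sketch. The local `mean_absolute_error` is a stand-in with the same argument order as the scikit-learn function, the numbers are hypothetical, and unlike the original wrapper it assumes all values are positive (there is no nan_to_num guard).

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Stand-in for sklearn.metrics.mean_absolute_error (same argument order)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def log_transfer(func):
    """Wrap a metric so it is evaluated on the log scale."""
    def wrapper(y_true, y_pred):
        # Assumes strictly positive inputs; a non-positive prediction would raise here.
        return func([math.log(t) for t in y_true], [math.log(p) for p in y_pred])
    return wrapper


y_true = [100.0, 1000.0]
y_pred = [110.0, 900.0]

mae_raw = mean_absolute_error(y_true, y_pred)                # (10 + 100) / 2
mae_log = log_transfer(mean_absolute_error)(y_true, y_pred)  # relative errors, about 0.1
```

On the raw scale the error on the expensive car dominates, while on the log scale both 10% misses count about equally, which is why the log-scale MAE is the more informative number for long-tailed prices.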

scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=1, cv = 5, scoring=make_scorer(mean_absolute_error))

print('AVG:', np.mean(scores))

AVG: 0.1932530183704742

Five-fold cross-validation of the linear regression model on the features with the log-transformed label.

scores = pd.DataFrame(scores.reshape(1,-1))
scores.columns = ['cv' + str(x) for x in range(1, 6)]
scores.index = ['MAE']
scores

	cv1	cv2	cv3	cv4	cv5
MAE	0.190792	0.193758	0.194132	0.191825	0.195758

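(3) Simulating a real business scenario

In practice, however, since we cannot foresee the future, five-fold cross-validation can give an unrealistic picture on time-dependent datasets. Using 2018 used-car prices to predict 2017 used-car prices is clearly unreasonable, so we can instead split the dataset in time order. In this example, the earliest 4/5 of the samples are used as the training set and the latest 1/5 as the validation set; the final result differs little from five-fold cross-validation.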

import datetime

sample_feature = sample_feature.reset_index(drop=True)

split_point = len(sample_feature) // 5 * 4

train = sample_feature.loc[:split_point].dropna()
val = sample_feature.loc[split_point:].dropna()

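sample_feature.loc[:split_point].dropna(): selects the rows from the start up to split_point and drops rows containing missing values. Because .loc slices by label and includes both endpoints, the row labelled split_point appears in both train and val; positional slicing with .iloc would avoid this overlap.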

train_X = train[continuous_feature_names]
train_y_ln = np.log(train['price'] + 1)
val_X = val[continuous_feature_names]
val_y_ln = np.log(val['price'] + 1)

model = model.fit(train_X, train_y_ln)

mean_absolute_error(val_y_ln, model.predict(val_X))

0.19577667270301036
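The time-ordered split above can be reduced to its essentials in plain Python. This sketch assumes the rows are already sorted from oldest to newest; the function name `time_ordered_split` and the toy rows are hypothetical. List slicing excludes the end position, so no row lands in both parts.

```python
def time_ordered_split(rows, train_fraction=0.8):
    """Split rows that are already sorted by time into an early and a late part."""
    split_point = int(len(rows) * train_fraction)
    return rows[:split_point], rows[split_point:]


# Hypothetical records, ordered from oldest to newest.
rows = [{"day": day, "price": 1000 + 10 * day} for day in range(10)]
train_rows, val_rows = time_ordered_split(rows)

latest_train_day = max(row["day"] for row in train_rows)
earliest_val_day = min(row["day"] for row in val_rows)
```

Every validation row is strictly later than every training row, which mirrors the real situation of predicting future prices from past ones.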

(4) Plotting the learning curve and validation curve

from sklearn.model_selection import learning_curve, validation_curve

? learning_curve

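Determines cross-validated training and test scores for different training set sizes.

A cross-validation generator splits the whole dataset k times into training and test data. Subsets of the training set with varying sizes are used to train the estimator, and a score is computed for each training subset size and for the test set. The scores are then averaged over all k runs for each training subset size.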

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,n_jobs=1, train_size=np.linspace(.1, 1.0, 5 )):  
    plt.figure()  
    plt.title(title)  
    if ylim is not None:  
        plt.ylim(*ylim)  
    plt.xlabel('Training example')  
    plt.ylabel('score')  
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_size, scoring = make_scorer(mean_absolute_error))  
    train_scores_mean = np.mean(train_scores, axis=1)  
    train_scores_std = np.std(train_scores, axis=1)  
    test_scores_mean = np.mean(test_scores, axis=1)  
    test_scores_std = np.std(test_scores, axis=1)  
    plt.grid()#区域  
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,  
                     train_scores_mean + train_scores_std, alpha=0.1,  
                     color="r")  
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,  
                     test_scores_mean + test_scores_std, alpha=0.1,  
                     color="g")  
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r',  
             label="Training score")  
    plt.plot(train_sizes, test_scores_mean,'o-',color="g",  
             label="Cross-validation score")  
    plt.legend(loc="best")  
    return plt

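plt.ylim(): gets or sets the y-axis limits of the current axes.

learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_size, scoring = make_scorer(mean_absolute_error)):
- estimator: an object that implements the "fit" and "predict" methods; it is cloned for each validation;
- X: the training data;
- y: the target relative to X for classification or regression; None for unsupervised learning;
- n_jobs: the number of jobs to run in parallel;
- train_sizes: the relative or absolute numbers of training examples used to generate the learning curve;
- scoring: a scorer, given as a callable object or function.

np.mean(): computes the mean (taking an m×n matrix as an example):
- axis not set: the mean of all m×n values, returned as a single number;
- axis=0: collapses the rows and averages each column, returning a 1×n result;
- axis=1: collapses the columns and averages each row, returning an m×1 result.

np.std(): computes the standard deviation (over all elements when axis is not set):
- axis=0: the standard deviation of each column;
- axis=1: the standard deviation of each row.

plt.fill_between(): fills the area between two horizontal curves (taking the first call as an example):
- train_sizes: the x coordinates of the curve nodes;
- train_scores_mean - train_scores_std: the y coordinates of the nodes of the first curve;
- train_scores_mean + train_scores_std: the y coordinates of the nodes of the second curve;
- alpha: the transparency.

plt.legend(loc="best"): places a legend on the axes.
- loc="best": chooses, among the nine predefined positions, the one that overlaps least with what has been drawn. This option can be quite slow for plots with large amounts of data.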

plot_learning_curve(LinearRegression(), 'Liner_model', train_X[:1000], train_y_ln[:1000], ylim=(0.0, 0.5), cv=5, n_jobs=1)

<module 'matplotlib.pyplot' from 'C:\\Users\\83769\\Anaconda3\\lib\\site-packages\\matplotlib\\pyplot.py'>


4. Comparing multiple models

train = sample_feature[continuous_feature_names + ['price']].dropna()

train_X = train[continuous_feature_names]
train_y = train['price']
train_y_ln = np.log(train_y + 1)

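(1) Linear models & embedded feature selection

Overfitting

Overfitting: simply put, the model has memorized how to reproduce the training data without learning the underlying rule or principle, so it becomes needlessly complex and makes errors on new data.

Two causes of overfitting:
1. The feature distributions of the training set and the test set differ.
2. The model is too complex for the amount of data available.

Three ways to address overfitting: collect more diverse samples, simplify the model, and use cross-validation.

Reference: https://www.zhihu.com/question/32246256/answer/55320482

Model complexity and generalization ability

If model complexity is too low (too few parameters), the space of trainable models is too small, it is hard to train an effective model, and underfitting occurs;
- Underfitting means the error is hard to reduce during training.

If model complexity is too high (many parameters), the space of trainable models is very large, it is easy to over-train on the input samples, and overfitting occurs.
- Overfitting means that, after training, the test error is much larger than the training error.

So controlling model complexity (the number of parameters) is one way to balance underfitting against overfitting.

If the model underfits, it cannot be trained sufficiently; for a neural network, the number of nodes per layer can be increased.

Reference: http://yangyingming.com/article/434/

An intuitive understanding of regularization

In machine learning an extra term is almost always added to the loss function. Two extra terms are common: L1 regularization and L2 regularization.

L1 and L2 regularization can be seen as constraints placed on certain parameters in the loss function.
- For linear regression models:
- a model with L1 regularization is called Lasso regression;
- a model with L2 regularization is called Ridge regression.

What the two terms are:
- the L1 regularization term is the sum of the absolute values of the elements of the weight vector w;
- the L2 regularization term is the square root of the sum of the squares of the elements of w, that is, the L2 norm (in practice, as in Ridge regression, the squared norm is usually what is added to the loss).

What the two terms do:
- L1 regularization can produce a sparse weight matrix, that is, a sparse model, which can be used for feature selection;
- L2 regularization helps prevent overfitting; to some extent L1 does so as well.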
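The different behaviour of the two penalties shows up even for single coefficients. The sketch below is a simplified one-coefficient setting, not how Lasso() or Ridge() are solved in scikit-learn: `soft_threshold` minimizes 0.5 * (v - w)^2 + lam * |w| and `ridge_shrink` minimizes 0.5 * (v - w)^2 + 0.5 * lam * w^2. All names and numbers are hypothetical.

```python
def l1_penalty(weights):
    """Sum of absolute values of the weights."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Sum of squared weights (the squared L2 norm)."""
    return sum(w * w for w in weights)


def soft_threshold(v, lam):
    """Minimiser of 0.5 * (v - w)**2 + lam * abs(w): shrink by lam, clip at 0."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0


def ridge_shrink(v, lam):
    """Minimiser of 0.5 * (v - w)**2 + 0.5 * lam * w**2: scale towards 0."""
    return v / (1.0 + lam)


unregularised = [3.0, -0.5, 0.25, -2.0]  # hypothetical least-squares weights
lam = 1.0

lasso_like = [soft_threshold(v, lam) for v in unregularised]
ridge_like = [ridge_shrink(v, lam) for v in unregularised]
```

The L1 step sets the two small weights exactly to zero (a sparse model), whereas the L2 step makes every weight smaller but leaves none at zero.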

Reference: https://blog.csdn.net/*_shi/article/details/52433975

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

models = [LinearRegression(),
          Ridge(),
          Lasso()]

result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')

LinearRegression is finished
Ridge is finished
Lasso is finished

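In filter and wrapper feature selection methods, the feature selection process is clearly separate from learner training. Embedded feature selection, by contrast, performs feature selection automatically during learner training. The most common embedded approaches are L1 and L2 regularization. Adding them to a linear regression model yields Lasso regression (L1) and Ridge regression (L2), respectively.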

result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result

	LinearRegression	Ridge	Lasso
cv1	0.190792	0.194832	0.383899
cv2	0.193758	0.197632	0.381894
cv3	0.194132	0.198123	0.384090
cv4	0.191825	0.195670	0.380526
cv5	0.195758	0.199676	0.383611

Comparison of the results of the three methods

model = LinearRegression().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)  # keyword arguments; recent seaborn versions reject positional x and y

intercept:18.750726309297328





<matplotlib.axes._subplots.AxesSubplot at 0x1e589be5780>


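L2 regularization tends to make the weights as small as possible during fitting, ending up with a model whose parameters are all fairly small. Models with small parameter values are generally considered simpler, adapt to different datasets, and avoid overfitting to some extent. Imagine a linear regression equation: if a parameter is very large, even a tiny shift in the data has a large effect on the result; if the parameters are small enough, a larger shift in the data has little effect. The technical way to put this is that the model is robust to perturbation.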

model = Ridge().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)  # keyword arguments; recent seaborn versions reject positional x and y

intercept:4.671709787661963





<matplotlib.axes._subplots.AxesSubplot at 0x1e58ab830b8>


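L1 regularization helps produce a sparse weight matrix and can therefore be used for feature selection. In the bar plot produced by the code below, the power and used_time features turn out to be very important.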

model = Lasso().fit(train_X, train_y_ln)
print('intercept:'+ str(model.intercept_))
sns.barplot(x=abs(model.coef_), y=continuous_feature_names)  # keyword arguments; recent seaborn versions reject positional x and y

intercept:8.67218477988307





<matplotlib.axes._subplots.AxesSubplot at 0x1e58b3fde48>


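Beyond this, when a decision tree selects split nodes by information entropy or the Gini index, the features chosen first for splitting are also the more important ones, which is likewise a form of feature selection. The feature importance metric in the XGBoost and LightGBM models (exposed as feature_importances_ in their scikit-learn wrappers) is computed on this basis.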
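How a split is scored with entropy or the Gini index can be shown on a toy classification example. This is an illustrative sketch with hypothetical labels and function names; the regression trees used in this post score splits by reduction in squared error instead, but the principle of preferring the split that leaves the purest children is the same.

```python
import math
from collections import Counter


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Information entropy in bits."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())


def weighted_impurity(groups, impurity):
    """Impurity after a split: child impurities weighted by child size."""
    total = sum(len(group) for group in groups)
    return sum(len(group) / total * impurity(group) for group in groups)


parent = ["cheap"] * 4 + ["dear"] * 4

# Hypothetical splits: feature A separates the classes, feature B does not.
split_a = [["cheap"] * 4, ["dear"] * 4]
split_b = [["cheap", "cheap", "dear", "dear"], ["cheap", "cheap", "dear", "dear"]]

gain_a = gini(parent) - weighted_impurity(split_a, gini)
gain_b = gini(parent) - weighted_impurity(split_b, gini)
```

Feature A removes all impurity while feature B removes none, so a tree would split on A first; importance scores accumulate this kind of gain per feature.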

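(2) Nonlinear models

Besides linear models, there are many commonly used nonlinear models, listed below; space is limited, so their principles are not explained one by one here. We selected some commonly used models and compared their performance with the linear model.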

from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from xgboost.sklearn import XGBRegressor
from lightgbm.sklearn import LGBMRegressor

models = [LinearRegression(),
          DecisionTreeRegressor(),
          RandomForestRegressor(),
          GradientBoostingRegressor(),
          MLPRegressor(solver='lbfgs', max_iter=100), 
          XGBRegressor(n_estimators = 100, objective='reg:squarederror'), 
          LGBMRegressor(n_estimators = 100)]

result = dict()
for model in models:
    model_name = str(model).split('(')[0]
    scores = cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error))
    result[model_name] = scores
    print(model_name + ' is finished')

LinearRegression is finished
DecisionTreeRegressor is finished
RandomForestRegressor is finished
GradientBoostingRegressor is finished
MLPRegressor is finished
XGBRegressor is finished
LGBMRegressor is finished

result = pd.DataFrame(result)
result.index = ['cv' + str(x) for x in range(1, 6)]
result

	LinearRegression	DecisionTreeRegressor	RandomForestRegressor	GradientBoostingRegressor	MLPRegressor	XGBRegressor	LGBMRegressor
cv1	0.190792	0.198405	0.142131	0.168897	2772.442908	0.142367	0.141542
cv2	0.193758	0.193682	0.143025	0.171816	1708.891820	0.140923	0.145501
cv3	0.194132	0.189418	0.141544	0.170888	311.359174	0.139393	0.143887
cv4	0.191825	0.190877	0.141012	0.169083	902.516489	0.137492	0.142497
cv5	0.195758	0.203953	0.146057	0.174088	399.349459	0.143732	0.144852

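The table shows that random forest, XGBoost, and LightGBM clearly outperform the other models. XGBoost has the lowest MAE in four of the five folds (LightGBM is lowest in cv1), while the errors of MLPRegressor are orders of magnitude larger under these settings.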

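5. Model tuning

Three commonly used tuning methods: greedy tuning, grid search, and Bayesian optimization.

## Candidate parameter values for LightGBM: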

objective = ['regression', 'regression_l1', 'mape', 'huber', 'fair']

num_leaves = [3,5,10,15,20,40, 55]
max_depth = [3,5,10,15,20,40, 55]
bagging_fraction = []
feature_fraction = []
drop_rate = []

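(1) Greedy tuning

Greedy algorithm

A greedy algorithm always makes the choice that looks best at the moment when solving a problem. In other words, it does not consider overall optimality; what it produces is only a locally optimal solution in some sense. Applied to tuning, this means optimizing one hyperparameter at a time and fixing its best value before moving on to the next.

Note that a greedy algorithm does not yield the globally optimal solution for every problem. The chosen greedy strategy must have no after-effect (that is, what happens after a given state does not affect earlier states and depends only on the current state).

Basic approach of a greedy algorithm:
1. Build a mathematical model to describe the problem.
2. Divide the problem into several subproblems.
3. Solve each subproblem to obtain its locally optimal solution.
4. Combine the locally optimal solutions of the subproblems into a solution to the original problem.

Prerequisite for using a greedy strategy: locally optimal choices lead to a globally optimal solution.

Implementation framework of a greedy algorithm:
- Start from some initial solution to the problem:
- while (a step can be taken towards the overall goal)
- {
- use a feasible decision to obtain one element of a feasible solution.
- }
- Combine all solution elements into a feasible solution to the problem;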
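The framework above maps onto hyperparameter tuning as follows. This is a minimal sketch: `toy_loss` is a hypothetical stand-in for a cross-validated MAE, and the function name `greedy_search` and the default values are assumptions made for illustration.

```python
def greedy_search(score_fn, search_space, defaults):
    """Tune one parameter at a time, freezing each choice before moving on."""
    best = dict(defaults)
    for name, candidates in search_space.items():
        # Evaluate every candidate for this parameter with the others held fixed.
        scored = {value: score_fn({**best, name: value}) for value in candidates}
        best[name] = min(scored, key=scored.get)  # lower loss is better
    return best, score_fn(best)


def toy_loss(params):
    """Hypothetical stand-in for a cross-validated MAE; lowest at (40, 10)."""
    return (abs(params["num_leaves"] - 40) / 100
            + abs(params["max_depth"] - 10) / 50
            + 0.13)


search_space = {
    "num_leaves": [3, 5, 10, 15, 20, 40, 55],
    "max_depth": [3, 5, 10, 15, 20, 40, 55],
}
defaults = {"num_leaves": 31, "max_depth": 3}

best_params, best_loss = greedy_search(toy_loss, search_space, defaults)
```

This needs 7 + 7 = 14 evaluations instead of the 49 a full grid would take. It finds the true optimum here only because the toy loss has no interaction between the two parameters; when parameters interact, the greedy result can be a local optimum.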

Reference: https://www.jianshu.com/p/ab89df9759c8

best_obj = dict()
for obj in objective:
    model = LGBMRegressor(objective=obj)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_obj[obj] = score
    
best_leaves = dict()
for leaves in num_leaves:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0], num_leaves=leaves)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_leaves[leaves] = score
    
best_depth = dict()
for depth in max_depth:
    model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0],
                          num_leaves=min(best_leaves.items(), key=lambda x:x[1])[0],
                          max_depth=depth)
    score = np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))
    best_depth[depth] = score

sns.lineplot(x=['0_initial','1_turning_obj','2_turning_leaves','3_turning_depth'], y=[0.143 ,min(best_obj.values()), min(best_leaves.values()), min(best_depth.values())])

<matplotlib.axes._subplots.AxesSubplot at 0x1e58b400ac8>


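(2) Grid Search tuning

Grid tuning

- Grid search: a tuning method.
- When a model does not perform well, this method can be used to adjust its parameters: it loops over every parameter combination, tries each one, and returns the combination with the best score.
- Every parameter value is combined with every other, so the loop is like traversing a grid.
- It takes a long time to run.

The problem: the original dataset is split into a training set and a test set, and the test set then serves two purposes: 1. tuning the parameters; 2. evaluating the model. This makes the score look better than the real performance.

The solution: split the dataset into three parts: 1. a training set (to train on); 2. a validation set (to tune parameters); 3. a test set (to test the model).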
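The exhaustive loop behind grid search can be sketched with itertools.product. This is an illustration of the idea only, not scikit-learn's GridSearchCV: `toy_loss` is a hypothetical stand-in for a validation loss, chosen so that the two parameters interact.

```python
from itertools import product


def grid_search(score_fn, grid):
    """Evaluate every combination in the grid and keep the one with the lowest loss."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    n_evaluated = 0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        n_evaluated += 1
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_evaluated


def toy_loss(params):
    """Hypothetical validation loss in which the two parameters interact."""
    mismatch = (params["num_leaves"] - 2 * params["max_depth"]) ** 2 / 1000
    return mismatch + abs(params["max_depth"] - 10) / 50


grid = {"num_leaves": [10, 20, 40], "max_depth": [5, 10, 20]}
best_params, best_score, n_evaluated = grid_search(toy_loss, grid)
```

The number of evaluations is the product of the list lengths (3 x 3 = 9 here), which is why grid search becomes slow as more parameters or candidate values are added.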

Reference: https://blog.csdn.net/weixin_43172660/article/details/83032029

from sklearn.model_selection import GridSearchCV

parameters = {'objective': objective , 'num_leaves': num_leaves, 'max_depth': max_depth}
model = LGBMRegressor()
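# No scoring is passed, so GridSearchCV ranks candidates by the estimator's default score (R^2 for regressors).
clf = GridSearchCV(model, parameters, cv=5)
# The search is fitted on the raw price (train_y), while the evaluation further down uses train_y_ln.
clf = clf.fit(train_X, train_y)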

clf.best_params_

{'max_depth': 15, 'num_leaves': 55, 'objective': 'regression'}

model = LGBMRegressor(objective='regression',
                          num_leaves=55,
                          max_depth=15)

np.mean(cross_val_score(model, X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)))

0.13754980533444577

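(3) Bayesian tuning

Bayesian tuning

Bayesian optimization builds a surrogate function (a probabilistic model) from past evaluations of the objective function and uses it to find the value that minimizes the objective.

It differs from random or grid search in that it takes earlier evaluation results into account when choosing the next set of hyperparameters to try, which saves a lot of wasted effort.

The four parts of Bayesian optimization:
1. Objective function: what we want to minimize. Here, the objective is the loss on the validation set of the model using the given hyperparameters.
2. Domain space: the value ranges of the hyperparameters to search.
3. Optimization algorithm: the method that constructs the surrogate function and chooses the next hyperparameter values to evaluate.
4. Result history: the stored results of objective evaluations, including the hyperparameters and the validation loss.

Reference: https://blog.csdn.net/linxid/article/details/81189154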

from bayes_opt import BayesianOptimization

def rf_cv(num_leaves, max_depth, subsample, min_child_samples):
    val = cross_val_score(
        LGBMRegressor(objective = 'regression_l1',
            num_leaves=int(num_leaves),
            max_depth=int(max_depth),
            subsample = subsample,
            min_child_samples = int(min_child_samples)
        ),
        X=train_X, y=train_y_ln, verbose=0, cv = 5, scoring=make_scorer(mean_absolute_error)
    ).mean()
    return 1 - val

rf_bo = BayesianOptimization(
    rf_cv,
    {
    'num_leaves': (2, 100),
    'max_depth': (2, 100),
    'subsample': (0.1, 1),
    'min_child_samples' : (2, 100)
    }
)

rf_bo.maximize()

1 - rf_bo.max['target']  # rf_bo.max is a dict with the keys 'target' and 'params'
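The optimizer above maximizes its target, which is why rf_cv returns 1 - MAE and the MAE is recovered afterwards as 1 - target. The sketch below illustrates only this sign convention: `maximize` is a plain random-search stand-in, NOT Bayesian optimization and not the bayes_opt API, and `toy_mae` is a hypothetical loss surface.

```python
import random


def maximize(target_fn, bounds, n_iter=200, seed=0):
    """Random-search stand-in for a maximiser; it does not build a surrogate model."""
    rng = random.Random(seed)
    best = {"target": float("-inf"), "params": None}
    for _ in range(n_iter):
        params = {name: rng.uniform(low, high) for name, (low, high) in bounds.items()}
        target = target_fn(**params)
        if target > best["target"]:
            best = {"target": target, "params": params}
    return best


def toy_mae(num_leaves, max_depth):
    """Hypothetical loss surface with its minimum of 0.13 at (60, 20)."""
    return 0.13 + ((num_leaves - 60) / 100) ** 2 + ((max_depth - 20) / 100) ** 2


def objective(num_leaves, max_depth):
    # Continuous proposals are cast to int, as in rf_cv; 1 - loss turns
    # "minimise the MAE" into "maximise the target".
    return 1 - toy_mae(int(num_leaves), int(max_depth))


best = maximize(objective, {"num_leaves": (2, 100), "max_depth": (2, 100)})
best_mae = 1 - best["target"]  # convert the maximised target back into a loss
```

Any monotonically decreasing transform of the loss would work as a target; 1 - MAE is simply convenient to invert.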

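IV. Afterword

Task4 Modeling and Tuning END.
By: 小雨姑娘
A data mining enthusiast who has placed in the top ranks of multiple competitions. The author's machine learning notes: https://zhuanlan.zhihu.com/mlbasic

About Datawhale:
Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, and gathers team members who share an open-source and exploratory spirit. With the vision "for the learner, growing together with learners", Datawhale encourages authentic self-expression, openness and inclusiveness, mutual trust and help, the courage to experiment, and a willingness to take responsibility. Datawhale also applies the open-source philosophy to open-source content, open-source learning, and open-source solutions, supporting the training and growth of talent and building connections between people, between people and knowledge, between people and companies, and between people and the future.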
