
Classification Algorithms - Random Forest

程序员文章站 2022-07-14 14:49:21




Introduction

Random forest is a supervised learning algorithm used for both classification and regression, though it is mainly used for classification problems. Just as a forest is made up of trees, and more trees make a more robust forest, the random forest algorithm builds decision trees on random samples of the data, gets a prediction from each tree, and selects the final answer by voting. It is an ensemble method that usually performs better than a single decision tree because averaging the trees' results reduces overfitting.

Working of Random Forest Algorithm

The random forest algorithm works in the following steps:

  • Step 1: Select random samples from the given dataset.

  • Step 2: Construct a decision tree for every sample, then get a prediction from every decision tree.

  • Step 3: Hold a vote over the predicted results.

  • Step 4: Select the most-voted prediction as the final prediction.

The following diagram illustrates how it works:

[Figure: working of the random forest algorithm]
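The four steps can be sketched by hand with plain decision trees. This is a minimal illustration, not the internals of any library: it assumes the samples in Step 1 are bootstrap samples (rows drawn with replacement), lets each split look at a random subset of features via `max_features="sqrt"`, and uses scikit-learn's bundled copy of iris so that it runs offline. The seed, the tree count and the helper name `majority_vote` are arbitrary choices made for this sketch.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled iris data, so no download is needed.
X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)  # fixed seed, only for repeatability

n_trees = 25  # illustrative value
trees = []
for _ in range(n_trees):
    # Step 1: draw a bootstrap sample (random row indices, with replacement).
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: fit one decision tree on that sample.
    tree = DecisionTreeClassifier(
        max_features="sqrt",
        random_state=int(rng.integers(1_000_000)),
    )
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Step 3: collect every tree's prediction; shape is (n_trees, n_samples).
all_preds = np.array([t.predict(X) for t in trees])


def majority_vote(column):
    """Return the class label that occurs most often in one column of votes."""
    return np.bincount(column, minlength=3).argmax()


# Step 4: the most-voted class per sample is the final prediction.
final_pred = np.apply_along_axis(majority_vote, 0, all_preds)
print("Training-set accuracy of the voted forest:", (final_pred == y).mean())
```

The accuracy printed here is measured on the training data, so it only shows that the voting mechanism works; it says nothing about how well the model generalizes.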

Implementation in Python

First, import the necessary Python packages:


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Next, set the URL of the iris dataset; pandas will download the file from this address:


path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

Next, define the column names for the dataset, since the file has no header row:


headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

Now, read the dataset into a pandas DataFrame:


dataset = pd.read_csv(path, names=headernames)
dataset.head()

   sepal-length  sepal-width  petal-length  petal-width        Class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
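If the download URL is unreachable, a similar DataFrame can be built from the copy of iris that ships with scikit-learn. This is a sketch of one possible fallback, not part of the original walkthrough; it reuses the same `headernames` and rebuilds the `Iris-` prefixed labels by hand. According to the scikit-learn dataset description, its copy corrects two data points relative to the UCI file, so a few values may differ slightly.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: use the bundled dataset instead of downloading iris.data.
iris = load_iris()
headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

# The four measurement columns, in the same order as the UCI file.
dataset = pd.DataFrame(iris.data, columns=headernames[:-1])

# Map integer targets (0, 1, 2) to the label strings used in the UCI file.
dataset['Class'] = ['Iris-' + str(iris.target_names[t]) for t in iris.target]

print(dataset.head())
print(dataset['Class'].value_counts())
```

The rest of the walkthrough should work unchanged on this `dataset`, because the column order and label strings match.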

For preprocessing, separate the four feature columns (X) from the class labels (y):


X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

Next, split the data into training and test sets. The following code uses 70% of the dataset for training and 30% for testing:


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
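Because the split above is random, every run produces different training and test sets, and therefore slightly different scores. A hedged variant is sketched below: `random_state` pins the shuffle and `stratify=y` keeps the class proportions equal in both parts. The seed value 42 is arbitrary, and the bundled iris data is used only so that the snippet runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: bundled iris data with integer labels 0, 1, 2.
X, y = load_iris(return_X_y=True)

# random_state makes the split repeatable; stratify keeps class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
print("Test samples per class:", np.bincount(y_test))
```

With 150 rows and three equally sized classes, this yields 105 training rows and 45 test rows, 15 per class in the test set.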

Next, train the model using the RandomForestClassifier class of sklearn, here with 50 trees:


from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=50)
classifier.fit(X_train, y_train)
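Once a forest is fitted, it exposes a few attributes that help inspect it. The sketch below is an optional extra, not a step of the original walkthrough: it assumes the bundled iris data, fixes `random_state=0` for repeatability, and turns on `oob_score=True` so that each tree is also evaluated on the rows left out of its bootstrap sample.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

# oob_score=True asks for an out-of-bag accuracy estimate during fitting.
clf = RandomForestClassifier(n_estimators=50, oob_score=True, random_state=0)
clf.fit(iris.data, iris.target)

# One fitted decision tree per estimator.
print("Number of trees:", len(clf.estimators_))

# Importances are normalized so that they sum to 1 across features.
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")

# Accuracy estimated only from rows each tree did not see while training.
print("Out-of-bag score:", clf.oob_score_)
```

The out-of-bag score gives a rough accuracy estimate without a separate test set, which can be handy on a dataset as small as iris.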

Finally, make predictions on the test set:


y_pred = classifier.predict(X_test)
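To see how strongly the trees agree on each prediction, `predict_proba` can be used alongside `predict`. As far as we know, scikit-learn's forest combines trees by averaging their class-probability estimates rather than counting hard votes, so the predicted class is the one with the highest averaged probability. The sketch below checks that relationship on its own data split; the seeds are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# One row per test sample, one column per class; each row sums to 1.
proba = clf.predict_proba(X_test)

# The class with the highest averaged probability is the predicted class.
pred_from_proba = clf.classes_[np.argmax(proba, axis=1)]
y_pred = clf.predict(X_test)

print("First three probability rows:")
print(proba[:3])
print("Matches predict():", bool((pred_from_proba == y_pred).all()))
```

Rows where the largest probability is well below 1 mark the samples on which the trees disagree most.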

Next, print the evaluation results:


from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Confusion matrix: rows are true classes, columns are predicted classes
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)

# Per-class precision, recall and F1-score
result1 = classification_report(y_test, y_pred)
print("Classification Report:")
print(result1)

# Overall fraction of correct predictions
result2 = accuracy_score(y_test, y_pred)
print("Accuracy:", result2)

Output


Confusion Matrix:
[[14  0  0]
 [ 0 18  1]
 [ 0  0 12]]
Classification Report:
               precision       recall     f1-score       support
Iris-setosa        1.00         1.00        1.00         14
Iris-versicolor    1.00         0.95        0.97         19
Iris-virginica     0.92         1.00        0.96         12
micro avg          0.98         0.98        0.98         45
macro avg          0.97         0.98        0.98         45
weighted avg       0.98         0.98        0.98         45

Accuracy: 0.9777777777777777
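The figures in the report follow directly from the confusion matrix. The short check below re-derives accuracy, precision and recall from the matrix printed above, reading rows as true classes and columns as predicted classes; it uses only NumPy and the numbers shown in the output.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([
    [14, 0, 0],
    [0, 18, 1],
    [0, 0, 12],
])

# Accuracy: correct predictions (the diagonal) over all 45 test samples.
accuracy = np.trace(cm) / cm.sum()

# Precision: correct predictions per class over everything predicted as it.
precision = np.diag(cm) / cm.sum(axis=0)

# Recall: correct predictions per class over all true members of the class.
recall = np.diag(cm) / cm.sum(axis=1)

print("Accuracy:", accuracy)
print("Precision per class:", np.round(precision, 2))
print("Recall per class:", np.round(recall, 2))
```

Accuracy is 44/45, the single misclassified Iris-versicolor sample lowers that class's recall to 18/19, and it lowers Iris-virginica's precision to 12/13, which matches the report. Since the train/test split is random, another run of the walkthrough will generally produce a different matrix.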

Pros and Cons of Random Forest

Pros

The following are the advantages of the random forest algorithm:

  • It reduces overfitting by averaging or combining the results of different decision trees.

  • Random forests work well on a wider range of data items than a single decision tree does.

  • A random forest has less variance than a single decision tree.

  • Random forests are flexible and often achieve high accuracy.

  • Feature scaling is not required. The algorithm maintains good accuracy even when given unscaled data.
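One way to look at the variance claim is to compare cross-validated scores of a single tree and a forest. This is a rough sketch under our own assumptions (bundled iris data, 5 folds, arbitrary seeds); on a dataset this small and easy, the difference between the two models may be slight or even absent.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation accuracy for each model.
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0), X, y, cv=5
)

# The mean shows typical accuracy; the standard deviation shows fold-to-fold spread.
print("Single tree:   mean=%.3f std=%.3f" % (tree_scores.mean(), tree_scores.std()))
print("Random forest: mean=%.3f std=%.3f" % (forest_scores.mean(), forest_scores.std()))
```

A smaller standard deviation across folds is one informal sign of lower variance; noisier or larger datasets usually make the gap easier to see.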

Cons

The following are the disadvantages of the random forest algorithm:

  • Complexity is the main disadvantage of random forests.

  • Constructing a random forest is harder and more time-consuming than constructing a single decision tree.

  • More computational resources are required to run the random forest algorithm.

  • It is less intuitive to interpret when the forest contains a large collection of decision trees.

  • Prediction with a random forest can be time-consuming compared with simpler algorithms, because every tree in the forest must be evaluated.
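The extra cost can be made concrete by counting how many tree nodes each model has to store and traverse. The sketch below is only an illustration on the bundled iris data with arbitrary seeds; it compares one decision tree with a 50-tree forest through the `tree_.node_count` attribute of fitted scikit-learn trees.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Nodes in the single tree versus nodes summed over every tree in the forest.
tree_nodes = tree.tree_.node_count
forest_nodes = sum(est.tree_.node_count for est in forest.estimators_)

print("Nodes in a single tree:", tree_nodes)
print("Nodes in the forest:", forest_nodes)
print("Ratio:", forest_nodes / tree_nodes)
```

Training and prediction cost grow roughly with the number of trees. In scikit-learn the trees can be built and queried in parallel by passing `n_jobs=-1` to `RandomForestClassifier`, which reduces wall-clock time but not the total work.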


Translated from: https://www.tutorialspoint.com/machine_learning_with_python/classification_algorithms_random_forest.htm