
How to Build and Train K-Nearest Neighbors and K-Means Clustering ML Models in Python


One of machine learning's most popular applications is in solving classification problems.

Classification problems are situations where you have a data set, and you want to classify observations from that data set into a specific category.

A famous example is a spam filter for email providers. Gmail uses supervised machine learning techniques to automatically place emails in your spam folder based on their content, subject line, and other features.

Two machine learning models perform much of the heavy lifting when it comes to classification problems:

  • K-nearest neighbors
  • K-means clustering

This tutorial will teach you how to code K-nearest neighbors and K-means clustering algorithms in Python.

K-Nearest Neighbors Models

The K-nearest neighbors algorithm is one of the world’s most popular machine learning models for solving classification problems.

A common exercise for students exploring machine learning is to apply the K nearest neighbors algorithm to a data set where the categories are not known. A real-life example of this would be if you needed to make predictions using machine learning on a data set of classified government information.

In this tutorial, you will learn to write your first K nearest neighbors machine learning algorithm in Python. We will be working with an anonymous data set similar to the situation described above.

The Data Set You Will Need in This Tutorial

The first thing you need to do is download the data set we will be using in this tutorial. I have uploaded the file to my website. You can access it by clicking here.

Now that you have downloaded the data set, you will want to move the file to the directory that you’ll be working in. After that, open a Jupyter Notebook and we can get started writing Python code!

The Libraries You Will Need in This Tutorial

To write a K nearest neighbors algorithm, we will take advantage of many open-source Python libraries including NumPy, pandas, and scikit-learn.

Begin your Python script by writing the following import statements:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Importing the Data Set Into Our Python Script

Our next step is to import the classified_data.csv file into our Python script. The pandas library makes it easy to import data into a pandas DataFrame.

Since the data set is stored in a csv file, we will be using the read_csv method to do this:

raw_data = pd.read_csv('classified_data.csv')

Printing this DataFrame inside of your Jupyter Notebook will give you a sense of what the data looks like:

[Image: the first few rows of the raw_data DataFrame]
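
For instance, a minimal way to reproduce that preview is pandas' head method, which shows the first five rows:

raw_data.head()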

You will notice that the DataFrame starts with an unnamed column whose values are equal to the DataFrame’s index. We can fix this by making a slight adjustment to the command that imported our data set into the Python script:

raw_data = pd.read_csv('classified_data.csv', index_col = 0)

Next, let’s take a look at the actual features that are contained in this data set. You can print a list of the data set’s column names with the following statement:

print(raw_data.columns)

This returns:

Index(['WTT', 'PTI', 'EQW', 'SBI', 'LQE', 'QWG', 'FDJ', 'PJF', 'HQE', 'NXJ',
       'TARGET CLASS'],
      dtype='object')

Since this is a classified data set, we have no idea what any of these columns means. For now, it is sufficient to recognize that every column is numerical in nature and thus well-suited for modelling with machine learning techniques.
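
If you want to verify this quickly, one option (a minimal sketch) is to print the column data types:

print(raw_data.dtypes)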

Standardizing the Data Set

Since the K nearest neighbors algorithm makes predictions about a data point by using the observations that are closest to it, the scale of the features within a data set matters a lot.

Because of this, machine learning practitioners typically standardize the data set, which means adjusting every x value so that they are roughly on the same scale.
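
Concretely, standardizing usually means replacing each value with its z-score: subtract the column's mean and divide by the column's standard deviation. Here is a rough sketch of the idea in plain pandas (note that pandas' std uses the sample standard deviation, while scikit-learn's StandardScaler below uses the population version, so the two results differ very slightly):

#Manual z-score standardization, for illustration only
features = raw_data.drop('TARGET CLASS', axis=1)
standardized_features = (features - features.mean()) / features.std()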

Fortunately, scikit-learn includes some excellent functionality to do this with very little headache.

To start, we will need to import the StandardScaler class from scikit-learn. Add the following command to your Python script to do this:

from sklearn.preprocessing import StandardScaler

This class behaves a lot like the LinearRegression and LogisticRegression classes that we used earlier in this course. We will want to create an instance of this class and then fit that instance on our data set.

First, let’s create an instance of the StandardScaler class named scaler with the following statement:

scaler = StandardScaler()

We can now train this instance on our data set using the fit method:

scaler.fit(raw_data.drop('TARGET CLASS', axis=1))

Now we can use the transform method to standardize all of the features in the data set so they are roughly the same scale. We’ll assign these scaled features to the variable named scaled_features:

scaled_features = scaler.transform(raw_data.drop('TARGET CLASS', axis=1))

This actually creates a NumPy array of all the features in the data set, and we want it to be a pandas DataFrame instead.

Fortunately, this is an easy fix. We’ll simply wrap the scaled_features variable in a pd.DataFrame method and assign this DataFrame to a new variable called scaled_data with an appropriate argument to specify the column names:

scaled_data = pd.DataFrame(scaled_features, columns = raw_data.drop('TARGET CLASS', axis=1).columns)

Now that we have imported our data set and standardized its features, we are ready to split the data set into training data and test data.

Splitting the Data Set Into Training Data and Test Data

We will use the train_test_split function from scikit-learn combined with list unpacking to create training data and test data from our classified data set.

First, you’ll need to import train_test_split from the model_selection module of scikit-learn with the following statement:

from sklearn.model_selection import train_test_split

Next, we will need to specify the x and y values that will be passed into this train_test_split function.

The x values will be the scaled_data DataFrame that we created previously. The y values will be the TARGET CLASS column of our original raw_data DataFrame.

You can create these variables with the following statements:

x = scaled_data
y = raw_data['TARGET CLASS']

Next, you’ll need to run the train_test_split function using these two arguments and a reasonable test_size. We will use a test_size of 30%, which gives the following parameters for the function:

x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x, y, test_size = 0.3)
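
Note that train_test_split shuffles the observations randomly, so your exact metrics below will differ slightly from mine. If you want a reproducible split, you can also pass a random_state argument (the seed value here is arbitrary):

x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x, y, test_size = 0.3, random_state = 42)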

Now that our data set has been split into training data and test data, we’re ready to start training our model!

Training a K Nearest Neighbors Model

Let’s start by importing the KNeighborsClassifier from scikit-learn:

from sklearn.neighbors import KNeighborsClassifier

Next, let’s create an instance of the KNeighborsClassifier class and assign it to a variable named model.

This class requires a parameter named n_neighbors, which is equal to the K value of the K nearest neighbors algorithm that you’re building. To start, let’s specify n_neighbors = 1:

model = KNeighborsClassifier(n_neighbors = 1)

Now we can train our K nearest neighbors model using the fit method and our x_training_data and y_training_data variables:

model.fit(x_training_data, y_training_data)

Now let’s make some predictions with our newly-trained K nearest neighbors algorithm!

Making Predictions With Our K Nearest Neighbors Algorithm

We can make predictions with our K nearest neighbors algorithm in the same way that we did with our linear regression and logistic regression models earlier in this course: by using the predict method and passing in our x_test_data variable.

More specifically, here’s how you can make predictions and assign them to a variable called predictions:

predictions = model.predict(x_test_data)

Let’s explore how accurate our predictions are in the next section of this tutorial.

Measuring the Accuracy of Our Model

We saw in our logistic regression tutorial that scikit-learn comes with built-in functions that make it easy to measure the performance of machine learning classification models.

Let’s import two of these functions (classification_report and confusion_matrix) into our report now:

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

Let’s work through each of these one-by-one, starting with the classification_report. You can generate the report with the following statement:

print(classification_report(y_test_data, predictions))

This generates:

              precision    recall  f1-score   support

           0       0.94      0.85      0.89       150
           1       0.86      0.95      0.90       150

    accuracy                           0.90       300
   macro avg       0.90      0.90      0.90       300
weighted avg       0.90      0.90      0.90       300

Similarly, you can generate a confusion matrix with the following statement:

print(confusion_matrix(y_test_data, predictions))

This generates:

[[141  12]
 [ 18 129]]

Looking at these performance metrics, our model is already fairly performant. In the confusion matrix, the diagonal entries (141 and 129) count correct predictions and the off-diagonal entries (12 and 18) count misclassifications, so roughly 90% of the test observations were classified correctly. The model can still be improved, though.

In the next section, we will see how we can improve the performance of our K nearest neighbors model by choosing a better value for K.

Choosing An Optimal K Value Using the Elbow Method

In this section, we will use the elbow method to choose an optimal value of K for our K nearest neighbors algorithm.

The elbow method involves iterating through different K values and selecting the value with the lowest error rate when applied to our test data.

To start, let’s create an empty list called error_rates. We will loop through different K values and append their error rates to this list.

error_rates = []

Next, we need to make a Python loop that iterates through the different values of K we’d like to test and executes the following functionality with each iteration:

  • Creates a new instance of the KNeighborsClassifier class from scikit-learn
  • Trains the new model using our training data
  • Makes predictions on our test data
  • Calculates the mean error rate of those predictions (the lower this is, the more accurate our model is)

Here is the code to do this for K values between 1 and 100:

for i in np.arange(1, 101):
    new_model = KNeighborsClassifier(n_neighbors = i)
    new_model.fit(x_training_data, y_training_data)
    new_predictions = new_model.predict(x_test_data)
    error_rates.append(np.mean(new_predictions != y_test_data))

Let’s visualize how our error rate changes with different K values using a quick matplotlib visualization:

plt.plot(error_rates)
[Plot: error rates plotted for each K value]

As you can see, our error rates tend to be minimized with a K value of approximately 50. This means that 50 is a suitable choice for K that balances both simplicity and predictive power.
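
One subtlety: plt.plot(error_rates) plots the error rates against their list index (0 through 99) rather than against K itself (1 through 100). Here is a small sketch that plots against the actual K values and labels the axes, reusing the error_rates list from above:

plt.plot(np.arange(1, 101), error_rates)
plt.xlabel('K')
plt.ylabel('Error rate')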

The Full Code For This Tutorial

You can view the full code for this tutorial in this GitHub repository. It is also pasted below for your reference:

#Common imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#Import the data set
raw_data = pd.read_csv('classified_data.csv', index_col = 0)

#Import standardization functions from scikit-learn
from sklearn.preprocessing import StandardScaler

#Standardize the data set
scaler = StandardScaler()
scaler.fit(raw_data.drop('TARGET CLASS', axis=1))
scaled_features = scaler.transform(raw_data.drop('TARGET CLASS', axis=1))
scaled_data = pd.DataFrame(scaled_features, columns = raw_data.drop('TARGET CLASS', axis=1).columns)

#Split the data set into training data and test data
from sklearn.model_selection import train_test_split
x = scaled_data
y = raw_data['TARGET CLASS']
x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x, y, test_size = 0.3)

#Train the model and make predictions
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors = 1)
model.fit(x_training_data, y_training_data)
predictions = model.predict(x_test_data)

#Performance measurement
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
print(classification_report(y_test_data, predictions))
print(confusion_matrix(y_test_data, predictions))

#Selecting an optimal K value
error_rates = []
for i in np.arange(1, 101):
    new_model = KNeighborsClassifier(n_neighbors = i)
    new_model.fit(x_training_data, y_training_data)
    new_predictions = new_model.predict(x_test_data)
    error_rates.append(np.mean(new_predictions != y_test_data))
plt.figure(figsize=(16,12))
plt.plot(error_rates)

K-Means Clustering Models

The K-means clustering algorithm is typically the first unsupervised machine learning model that students will learn.

It allows machine learning practitioners to create groups of data points within a data set with similar quantitative characteristics. It is useful for solving problems like creating customer segments or identifying localities in a city with high crime rates.

In this section, you will learn how to build your first K means clustering algorithm in Python.

The Data Set We Will Use In This Tutorial

In this tutorial, we will be using an artificial data set generated with scikit-learn.

Let’s import scikit-learn’s make_blobs function to create this artificial data. Open up a Jupyter Notebook and start your Python script with the following statement:

from sklearn.datasets import make_blobs

Now let’s use the make_blobs function to create some artificial data!

More specifically, here is how you could create a data set with 200 samples that has 2 features and 4 cluster centers. The standard deviation within each cluster will be set to 1.8.

raw_data = make_blobs(n_samples = 200, n_features = 2, centers = 4, cluster_std = 1.8)

If you print this raw_data object, you’ll notice that it is actually a Python tuple. The first element of this tuple is a NumPy array with 200 observations. Each observation contains 2 features (just like we specified with our make_blobs function!).
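
A quick sanity check of those shapes (a minimal sketch; the expected output assumes the parameters above):

print(type(raw_data))     # <class 'tuple'>
print(raw_data[0].shape)  # (200, 2) -- the feature matrix
print(raw_data[1].shape)  # (200,)   -- the true cluster label for each observation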

Now that our data has been created, we can move on to importing other important open-source libraries into our Python script.

The Imports We Will Use In This Tutorial

This tutorial will make use of a number of popular open-source Python libraries, including pandas, NumPy, and matplotlib. Let’s continue our Python script by adding the following imports:

import pandas as pd
import numpy as np
import seaborn
import matplotlib.pyplot as plt
%matplotlib inline

The first group of imports in this code block is for manipulating large data sets. The second group of imports is for creating data visualizations.

Let’s move on to visualizing our data set next.

Visualizing Our Data Set

In our make_blobs function, we specified for our data set to have 4 cluster centers. The best way to verify that this has been handled correctly is by creating some quick data visualizations.

To start, let’s use the following command to plot all of the rows in the first column of our data set against all of the rows in the second column of our data set:

plt.scatter(raw_data[0][:,0], raw_data[0][:,1])

[Plot: uncolored scatter plot of the two features]

Note: your data set will look different from mine since this is randomly-generated data.

This image seems to indicate that our data set has only three clusters. This is because two of the clusters are very close to each other.

To fix this, we need to reference the second element of our raw_data tuple, which is a NumPy array that contains the cluster to which each observation belongs.

If we color our data set using each observation’s cluster, the unique clusters will quickly become clear. Here is the code to do this:

plt.scatter(raw_data[0][:,0], raw_data[0][:,1], c=raw_data[1])
[Plot: scatter plot colored by true cluster label, showing four distinct clusters]

We can now see that our data set has four unique clusters. Let’s move on to building our K means cluster model in Python!

Building and Training Our K Means Clustering Model

The first step to building our K means clustering algorithm is importing it from scikit-learn. To do this, add the following command to your Python script:

from sklearn.cluster import KMeans

Next, let’s create an instance of this KMeans class with a parameter of n_clusters=4 and assign it to the variable model:

model = KMeans(n_clusters=4)

Now let’s train our model by invoking the fit method on it and passing in the first element of our raw_data tuple:

model.fit(raw_data[0])

In the next section, we’ll explore how to make predictions with this K means clustering model.

Before moving on, I wanted to point out one difference that you may have noticed between the process for building this K means clustering algorithm (which is an unsupervised machine learning algorithm) and the supervised machine learning algorithms we’ve worked with so far in this course.

Namely, we did not have to split the data set into training data and test data. This is an important difference - and in fact, you never need to make the train/test split on a data set when building unsupervised machine learning models!

Making Predictions With Our K Means Clustering Model

Machine learning practitioners generally use K means clustering algorithms to make two types of predictions:

  • Which cluster each data point belongs to
  • Where the center of each cluster is

It is easy to generate these predictions now that our model has been trained.

First, let’s predict which cluster each data point belongs to. To do this, access the labels_ attribute from our model object using the dot operator, like this:

model.labels_

This generates a NumPy array with a predicted cluster label for each data point. Since the data set is randomly generated, your labels will differ; the output looks something like this:

array([3, 2, 7, 0, 5, 1, 7, 7, 6, 1, 2, 4, 6, 7, 6, 4, 4, 3, 3, 6, 0, 0,
       6, 4, 5, 6, 0, 2, 6, 5, 4, 3, 4, 2, 6, 6, 6, 5, 6, 2, 1, 1, 3, 4,
       3, 5, 7, 1, 7, 5, 3, 6, 0, 3, 5, 5, 7, 1, 3, 1, 5, 7, 7, 0, 5, 7,
       3, 4, 0, 5, 6, 5, 1, 4, 6, 4, 5, 6, 7, 2, 2, 0, 4, 1, 1, 1, 6, 3,
       3, 7, 3, 6, 7, 7, 0, 3, 4, 3, 4, 0, 3, 5, 0, 3, 6, 4, 3, 3, 4, 6,
       1, 3, 0, 5, 4, 2, 7, 0, 2, 6, 4, 2, 1, 4, 7, 0, 3, 2, 6, 7, 5, 7,
       5, 4, 1, 7, 2, 4, 7, 7, 4, 6, 6, 3, 7, 6, 4, 5, 5, 5, 7, 0, 1, 1,
       0, 0, 2, 5, 0, 3, 2, 5, 1, 5, 6, 5, 1, 3, 5, 1, 2, 0, 4, 5, 6, 3,
       4, 4, 5, 6, 4, 4, 2, 1, 7, 4, 6, 6, 0, 6, 3, 5, 0, 5, 2, 4, 6, 0,
       1, 0], dtype=int32)

To see where the center of each cluster lies, access the cluster_centers_ attribute using the dot operator like this:

model.cluster_centers_

This generates a two-dimensional NumPy array that contains the coordinates of each cluster's center. Again, your exact coordinates will differ; the output looks something like this:

array([[ -8.06473328,  -0.42044783],
       [  0.15944397,  -9.4873621 ],
       [  1.49194628,   0.21216413],
       [-10.97238157,  -2.49017206],
       [  3.54673215,  -9.7433692 ],
       [ -3.41262049,   7.80784834],
       [  2.53980034,  -2.96376999],
       [ -0.4195847 ,   6.92561289]])

We’ll assess the accuracy of these predictions in the next section.

Visualizing the Accuracy of Our Model

The last thing we’ll do in this tutorial is visualize the accuracy of our model. You can use the following code to do this:

f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(10,6))
ax1.set_title('Our Model')
ax1.scatter(raw_data[0][:,0], raw_data[0][:,1], c=model.labels_)
ax2.set_title('Original Data')
ax2.scatter(raw_data[0][:,0], raw_data[0][:,1], c=raw_data[1])

This generates two different plots side-by-side where one plot shows the clusters according to the real data set and the other plot shows the clusters according to our model. Here is what the output looks like:

[Plot: side-by-side scatter plots titled 'Our Model' and 'Original Data']

Although the coloring between the two plots is different, you can see that our model did a fairly good job of predicting the clusters within our data set. You can also see that the model was not perfect - if you look at the data points along a cluster’s edge, you can see that it occasionally misclassified an observation from our data set.

There’s one last thing that needs to be mentioned about measuring our model’s predictions. In this example, we knew which cluster each observation belonged to because we actually generated this data set ourselves.

This is highly unusual. K means clustering is more often applied when the clusters aren’t known in advance. Instead, machine learning practitioners use K means clustering to find patterns that they don’t already know within a data set.
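
When the true clusters aren't known, one common way to choose the number of clusters is an elbow plot of the model's inertia_ attribute (the within-cluster sum of squared distances), analogous to the elbow method we used for K nearest neighbors earlier. A minimal sketch, reusing raw_data from above:

#Elbow plot for K-means: fit a model for each candidate number of clusters
inertias = []
for k in range(1, 11):
    inertias.append(KMeans(n_clusters = k).fit(raw_data[0]).inertia_)
plt.plot(range(1, 11), inertias)
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')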

The Full Code For This Tutorial

You can view the full code for this tutorial in this GitHub repository. It is also pasted below for your reference:

#Create artificial data set
from sklearn.datasets import make_blobs
raw_data = make_blobs(n_samples = 200, n_features = 2, centers = 4, cluster_std = 1.8)

#Data imports
import pandas as pd
import numpy as np

#Visualization imports
import seaborn
import matplotlib.pyplot as plt
%matplotlib inline

#Visualize the data
plt.scatter(raw_data[0][:,0], raw_data[0][:,1])
plt.scatter(raw_data[0][:,0], raw_data[0][:,1], c=raw_data[1])

#Build and train the model
from sklearn.cluster import KMeans
model = KMeans(n_clusters=4)
model.fit(raw_data[0])

#See the predictions
model.labels_
model.cluster_centers_

#Plot the predictions against the original data set
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(10,6))
ax1.set_title('Our Model')
ax1.scatter(raw_data[0][:,0], raw_data[0][:,1], c=model.labels_)
ax2.set_title('Original Data')
ax2.scatter(raw_data[0][:,0], raw_data[0][:,1], c=raw_data[1])

Final Thoughts

This tutorial taught you how to build K-nearest neighbors and K-means clustering machine learning models in Python.

If you're interested in learning more about machine learning, my book Pragmatic Machine Learning will teach you practical machine learning techniques by building 9 real projects. The book launches August 3rd. You can preorder it for 50% off using the link below:

Pragmatic Machine Learning
Machine learning is changing the world. But it’s always been hard to learn machine learning... until now. Pragmatic Machine Learning is a step-by-step guide that will teach you machine learning fundamentals through building 9 real-world projects. You’ll learn: Linear regression, Logistic regression,…

Here is a brief summary of what you learned about K-nearest neighbors models in Python:

  • How classified data sets are commonly used to teach students to solve their first K nearest neighbors problems
  • Why it’s important to standardize your data set when building K nearest neighbors models
  • How to split your data set into training data and test data using the train_test_split function
  • How to train your first K nearest neighbors model and make predictions with it
  • How to measure the performance of a K nearest neighbors model
  • How to use the elbow method to select an optimal value of K in a K nearest neighbors model

Similarly, here is a brief summary of what you learned about K-means clustering models in Python:

  • How to create artificial data in scikit-learn using the make_blobs function
  • How to build and train a K means clustering model using scikit-learn
  • That unsupervised machine learning techniques do not require you to split your data into training data and test data
  • How to visualize the performance of a K means clustering algorithm when you know the clusters in advance

Translated from: https://www.freecodecamp.org/news/how-to-build-and-train-k-nearest-neighbors-ml-models-in-python/