kNN in Practice 2

程序员文章站 2022-07-14 20:32:31


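Introduction

My friend Xiaomei has been using a dating site to find a suitable date. The site recommends too many people, so she wants to filter them: drop the people she doesn't like and the ones with only average charm, and date only the very charming ones.
Well, as a proper gentleman, it's time to do Xiaomei a favor!

Let's review our steps:

1. Collect data: Xiaomei gave us a text file.
2. Prepare data: parse the text file with Python, i.e. import the data.
3. Analyze data: draw a scatter plot, since so many people want to date her.
4. Train the algorithm: kNN is such a simple-minded algorithm that it needs no training.
5. Test the algorithm: test on the data Xiaomei provided and compute the error rate.
6. Use the algorithm: let Xiaomei enter a few features and decide whether the person is someone she would want to date.

From this example you will learn:

1. How to use kNN in practice
2. Why data normalization matters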

1. Collecting and importing the data

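#-*- coding: utf-8 -*-
import numpy as np
import matplotlib.font_manager
import matplotlib.pyplot as plt
import operator
myfont = matplotlib.font_manager.FontProperties(fname="Light.ttc")
# 2. Prepare data: Xiaomei put the data she needs in the text file datingTestSet2.txt, 1000 sample rows in total
# Each sample has three attributes:
# (1) frequent flier miles earned per year,
# (2) time spent playing video games,
# (3) ice cream consumed per week
# OK, let's start on the algorithm!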

def file2matrix(filename):
    # readlines() reads every line up to EOF and returns them as a list;
    # the with block closes the file afterwards
    with open(filename) as fr:
        arrayOLines = fr.readlines()
    # number of lines in the file
    numberOfLines = len(arrayOLines)
    returnMat = np.zeros((numberOfLines,3))
    classLabelVector = []
    index =0
    for line in arrayOLines:
        # strip() removes the given characters from both ends of a string
        # (whitespace by default), which also drops the trailing newline, e.g.
        # str2 = "   Runoob      "
        # print(str2.strip())
        # >>Runoob
        line = line.strip()
        # split the line into a list of fields
        listFromLine = line.split('\t')
        returnMat[index,:] = listFromLine[0:3]
        # store the label as an int; this matters
        classLabelVector.append(int(listFromLine[-1]))
        index +=1
    return returnMat,classLabelVector

# call the function
matrix,labels = file2matrix('datingTestSet2.txt')
print(matrix)
print(labels[0:20])

The output is:

>>[[4.0920000e+04 8.3269760e+00 9.5395200e-01]
 [1.4488000e+04 7.1534690e+00 1.6739040e+00]
 [2.6052000e+04 1.4418710e+00 8.0512400e-01]
 ...
 [2.6575000e+04 1.0650102e+01 8.6662700e-01]
 [4.8111000e+04 9.1345280e+00 7.2804500e-01]
 [4.3757000e+04 7.8826010e+00 1.3324460e+00]]

 >>[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3]
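As a side note, the same parsing can probably be done in a couple of lines with `np.loadtxt`. The sketch below is only an illustration: `load_dating_file` is a hypothetical helper name, and the temporary file holds just the first three rows and labels printed above, assuming the same tab-separated layout as datingTestSet2.txt.

```python
import os
import tempfile

import numpy as np

# First three rows and labels from the output above, tab-separated
# (assumed to match the layout of datingTestSet2.txt).
sample = (
    "40920\t8.326976\t0.953952\t3\n"
    "14488\t7.153469\t1.673904\t2\n"
    "26052\t1.441871\t0.805124\t1\n"
)

def load_dating_file(filename):
    """Hypothetical helper: parse the whole file in one np.loadtxt call."""
    data = np.loadtxt(filename, delimiter="\t", ndmin=2)
    features = data[:, :3]                      # the three attributes
    labels = data[:, -1].astype(int).tolist()   # last column as int labels
    return features, labels

# Write the sample rows to a temporary file so the example runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

features, labels_demo = load_dating_file(path)
os.remove(path)
print(features.shape, labels_demo)
```

The hand-written loop in `file2matrix` is still worth reading once, because it shows exactly what happens to each line.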

2. Analyzing the data: drawing a scatter plot

def plot_fig():
    type1_x = []
    type1_y = []
    type2_x = []
    type2_y = []
    type3_x = []
    type3_y = []
    # we plot two of the features: game time (column 1) and ice cream (column 2)
    for i in range(len(labels)):
        if labels[i] == 1:  # dislike
            type1_x.append(matrix[i][1])
            type1_y.append(matrix[i][2])
        if labels[i] == 2:  # average charm
            type2_x.append(matrix[i][1])
            type2_y.append(matrix[i][2])
        if labels[i] == 3:  # very charming
            type3_x.append(matrix[i][1])
            type3_y.append(matrix[i][2])

    plt.rcParams['font.sans-serif'] = ['SimHei']  # default font; only needed if the labels contain Chinese
    plt.rcParams['axes.unicode_minus'] = False  # keeps the minus sign '-' from rendering as a box

    # start plotting
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(type1_x, type1_y, s=20, alpha=0.8,color='#145b7d',label='Dislike')
    ax.scatter(type2_x, type2_y, s=20, alpha=0.8,color='#a7324a',label='Average charm')
    ax.scatter(type3_x, type3_y, s=20, alpha=0.8,color='#585eaa',label='Very charming')
    plt.xlabel('Time spent playing video games')
    plt.ylabel('Ice cream consumed per week')
    plt.legend(loc=1)
    plt.show()
# call the function
plot_fig()

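[Figure: scatter plot of video game time against weekly ice cream consumption for the three classes]

In the plot, the points are packed into one dense cluster.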

3. Normalizing the data

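We want to keep attributes with large values from dominating the result; in other words, Xiaomei considers the three features equally important. Here the mileage column is in the tens of thousands while the other two columns stay below about 20, so without rescaling the Euclidean distance is decided almost entirely by mileage. The fix is simple!
Rescale every feature to the range $[0, 1]$ or $[-1, 1]$. The following formula maps a value into $[0, 1]$: $newValue = (oldValue - min) / (max - min)$, where $min$ and $max$ are the smallest and largest values of that feature in the dataset. The formula should speak for itself.
OK, time to write the program!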

def autoNorm(dataSet):
    # axis=0: the minimum of each column (the first dimension of a multi-dimensional array);
    # for [[1,2],[3,4]] this compares 1 with 3 and 2 with 4
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    # tile repeats the vector m times so it has the same shape as the input matrix
    normDataSet = dataSet - np.tile(minVals,(m,1))
    normDataSet = normDataSet/np.tile(ranges,(m,1))

    return normDataSet,ranges,minVals

normMat,ranges,minVals = autoNorm(matrix)
print(normMat)

The output is:

[[0.44832535 0.39805139 0.56233353]
 [0.15873259 0.34195467 0.98724416]
 [0.28542943 0.06892523 0.47449629]
 ...
 [0.29115949 0.50910294 0.51079493]
 [0.52711097 0.43665451 0.4290048 ]
 [0.47940793 0.3768091  0.78571804]]
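To make the effect of normalization concrete, here is a minimal sketch that uses only the first three rows printed earlier. `min_max_normalize` is a hypothetical name for a broadcasting variant of `autoNorm` (NumPy broadcasting makes `np.tile` unnecessary). Because the minimum and maximum come from three rows instead of the full file, the normalized numbers differ from the output above.

```python
import numpy as np

# First three rows of the feature matrix printed earlier in this post.
X = np.array([
    [40920.0, 8.326976, 0.953952],
    [14488.0, 7.153469, 1.673904],
    [26052.0, 1.441871, 0.805124],
])

def min_max_normalize(data):
    """Hypothetical broadcasting variant of autoNorm (no np.tile)."""
    min_vals = data.min(axis=0)              # per-column minimum
    ranges = data.max(axis=0) - min_vals     # per-column max - min
    # (m, 3) minus (3,) broadcasts the row vector across all m rows.
    return (data - min_vals) / ranges, ranges, min_vals

norm, ranges_demo, min_demo = min_max_normalize(X)

# Share of the squared distance between rows 0 and 1 contributed by each feature.
raw_sq = (X[0] - X[1]) ** 2
norm_sq = (norm[0] - norm[1]) ** 2
raw_share = raw_sq / raw_sq.sum()
norm_share = norm_sq / norm_sq.sum()
print(raw_share)    # before normalization the mileage column dominates
print(norm_share)   # after normalization the other features contribute too
```

Like `autoNorm`, this sketch divides by `ranges` directly, so it assumes every column has at least two distinct values.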

4. The classifier

def classify0(inX,dataSet,labels,k):
    dataSetSize = dataSet.shape[0] # first dimension: the number of samples
    # Euclidean distance from inX to every sample
    diffMat = np.tile(inX,(dataSetSize,1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    # argsort() returns the indices that would sort the array in ascending order
    sortedDisIndicies = distances.argsort()
    # vote counts per label
    classCount ={}
    for i in range(k):
        # label of the i-th nearest sample
        voteIlabel = labels[sortedDisIndicies[i]]
        # dict.get(key, default) returns the current count for voteIlabel,
        # or 0 if the label has not been seen yet; the assignment then stores count + 1.
        # After the loop we might have, for example, {'A': 2, 'B': 1}
        classCount[voteIlabel] = classCount.get(voteIlabel,0) +1

    # sort the (label, count) pairs by count, largest first
    sortedClassCount = sorted(classCount.items(),\
                       key = operator.itemgetter(1),reverse = True)
    return sortedClassCount[0][0]
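For comparison, the same voting logic can be written more compactly. The sketch below is not the code used in the rest of this post: `knn_classify` is a hypothetical name, it relies on broadcasting and `collections.Counter`, and the four toy points are invented for illustration.

```python
from collections import Counter

import numpy as np

def knn_classify(in_x, data_set, labels, k):
    """Hypothetical compact variant of classify0."""
    # Broadcasting subtracts in_x from every row of data_set.
    distances = np.sqrt(((data_set - in_x) ** 2).sum(axis=1))
    # Indices of the k nearest samples.
    nearest = distances.argsort()[:k]
    # Count the labels of those neighbours and return the most frequent one.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data invented for this example: two points near (1, 1), two near (0, 0).
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
group_labels = ['A', 'A', 'B', 'B']

result_origin = knn_classify(np.array([0.0, 0.0]), group, group_labels, 3)
result_corner = knn_classify(np.array([1.0, 1.2]), group, group_labels, 3)
print(result_origin)  # B
print(result_corner)  # A
```

With k = 3 each query gets two votes from its own cluster and one from the other, so the majority label wins.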

5. Testing: hold out 10% of the data as the test set

def dataingClassTest():
    hoRatio =0.10
    datingDataMat,datingLabels = file2matrix('datingTestSet2.txt')
    normMat,ranges,minVals = autoNorm(datingDataMat)
    m= normMat.shape[0]
    numTestVecs = int(m*hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i,:],normMat[numTestVecs:m,:],\
                                     datingLabels[numTestVecs:m],3)
        print('the classifier came back with: %d,the real answer is :%d'\
              %(classifierResult,datingLabels[i]))
        if (classifierResult != datingLabels[i]):
            errorCount +=1.0
    print('the total error rate is :%f'%(errorCount/float(numTestVecs)))


dataingClassTest()

The output is:

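# the total error rate is :0.050000. That error is acceptable. Note, though, that kNN compares every query
# against all stored samples, so it suits small datasets; for something like image processing it would probably struggle.....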
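The test above holds out the first 10% of the rows, which is a fair sample only if the rows of the file are in random order. A more cautious variant shuffles the indices before splitting. The sketch below is an assumption-laden illustration: `holdout_error_rate` is a hypothetical helper, and the three clusters are synthetic data generated for the demo, not the dating data.

```python
import numpy as np

def holdout_error_rate(features, labels, k=3, ho_ratio=0.10, seed=0):
    """Hypothetical helper: shuffle, hold out ho_ratio of the rows, return the kNN error rate."""
    labels = np.asarray(labels)
    # Shuffle the row indices with a fixed seed so the split is reproducible.
    order = np.random.default_rng(seed).permutation(len(labels))
    num_test = int(len(labels) * ho_ratio)
    test_idx, train_idx = order[:num_test], order[num_test:]
    errors = 0
    for i in test_idx:
        # Distance from the test sample to every training sample.
        dists = np.sqrt(((features[train_idx] - features[i]) ** 2).sum(axis=1))
        nearest = train_idx[dists.argsort()[:k]]
        # Majority vote among the k nearest training labels.
        values, counts = np.unique(labels[nearest], return_counts=True)
        if values[counts.argmax()] != labels[i]:
            errors += 1
    return errors / num_test

# Synthetic, well-separated clusters invented for the demo (not the dating data).
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
demo_X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
demo_y = np.repeat([1, 2, 3], 50)

rate = holdout_error_rate(demo_X, demo_y)
print('the total error rate is :%f' % rate)
```

On the real data, the features would be normalized first, as `dataingClassTest` does with `autoNorm`.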

6. The dating-site prediction function

def classifyPerson():
    resultList = ['not at all','in small doses','in large doses']
    percentTats = float(input('percentage of time spent playing video games?'))
    ffMiles = float(input('frequent flier miles earned per year?'))
    iceCream = float(input('liters of ice cream consumed per year?'))
    datingDataMat,datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # same column order as the data file: miles, game time, ice cream
    inArr = np.array([ffMiles,percentTats,iceCream])
    # normalize the input with the training min and range before classifying
    classifierResult = classify0((inArr-minVals)/ranges,normMat,datingLabels,3)
    # labels are 1, 2, 3, so subtract 1 to index resultList
    print('You will probably like this person:',resultList[classifierResult-1])
classifyPerson()

The result:

percentage of time spent playing video games?10
frequent flier miles earned per year?2
liters of ice cream consumed per year?3
You will probably like this person: in small doses

The complete code:

#-*- coding: utf-8 -*-
import numpy as np
import matplotlib.font_manager
import matplotlib.pyplot as plt
import operator
myfont = matplotlib.font_manager.FontProperties(fname="Light.ttc")
# 1. Collect and import the data
def file2matrix(filename):
    # readlines() reads every line up to EOF and returns them as a list;
    # the with block closes the file afterwards
    with open(filename) as fr:
        arrayOLines = fr.readlines()
    # number of lines in the file
    numberOfLines = len(arrayOLines)
    returnMat = np.zeros((numberOfLines,3))
    classLabelVector = []
    index =0
    for line in arrayOLines:
        line = line.strip()
        # split the line into a list of fields
        listFromLine = line.split('\t')
        returnMat[index,:] = listFromLine[0:3]
        # store the label as an int; this matters
        classLabelVector.append(int(listFromLine[-1]))
        index +=1
    return returnMat,classLabelVector

# call the function
matrix,labels = file2matrix('datingTestSet2.txt')
print(matrix)
print(labels[0:20])
# 2. Analyze the data: draw a scatter plot
def plot_fig():
    type1_x = []
    type1_y = []
    type2_x = []
    type2_y = []
    type3_x = []
    type3_y = []
    # we plot two of the features: game time (column 1) and ice cream (column 2)
    for i in range(len(labels)):
        if labels[i] == 1:  # dislike
            type1_x.append(matrix[i][1])
            type1_y.append(matrix[i][2])
        if labels[i] == 2:  # average charm
            type2_x.append(matrix[i][1])
            type2_y.append(matrix[i][2])
        if labels[i] == 3:  # very charming
            type3_x.append(matrix[i][1])
            type3_y.append(matrix[i][2])

    plt.rcParams['font.sans-serif'] = ['SimHei']  # default font; only needed if the labels contain Chinese
    plt.rcParams['axes.unicode_minus'] = False  # keeps the minus sign '-' from rendering as a box

    # start plotting
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(type1_x, type1_y, s=20, alpha=0.8,color='#145b7d',label='Dislike')
    ax.scatter(type2_x, type2_y, s=20, alpha=0.8,color='#a7324a',label='Average charm')
    ax.scatter(type3_x, type3_y, s=20, alpha=0.8,color='#585eaa',label='Very charming')
    plt.xlabel('Time spent playing video games')
    plt.ylabel('Ice cream consumed per week')
    plt.legend(loc=1)
    plt.show()

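# 3. Normalize the data
def autoNorm(dataSet):
    # axis=0: the minimum of each column (the first dimension of a multi-dimensional array);
    # for [[1,2],[3,4]] this compares 1 with 3 and 2 with 4
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    # tile repeats the vector m times so it has the same shape as the input matrix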
    normDataSet = dataSet - np.tile(minVals,(m,1))
    normDataSet = normDataSet/np.tile(ranges,(m,1))

    return normDataSet,ranges,minVals

normMat,ranges,minVals = autoNorm(matrix)
print(normMat)

# 4. The classifier
def classify0(inX,dataSet,labels,k):
    dataSetSize = dataSet.shape[0] # first dimension: the number of samples
    diffMat = np.tile(inX,(dataSetSize,1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    # argsort() returns the indices that would sort the array in ascending order
    sortedDisIndicies = distances.argsort()
    # vote counts per label
    classCount ={}
    for i in range(k):
        # label of the i-th nearest sample
        voteIlabel = labels[sortedDisIndicies[i]]
        # dict.get(key, default) returns the current count for voteIlabel,
        # or 0 if the label has not been seen yet; the assignment then stores count + 1.
        # After the loop we might have, for example, {'A': 2, 'B': 1}
        classCount[voteIlabel] = classCount.get(voteIlabel,0) +1

    sortedClassCount = sorted(classCount.items(),\
                       key = operator.itemgetter(1),reverse = True)
    return sortedClassCount[0][0]


# 5. Testing: hold out 10% of the data as the test set.
# In practice this step is optional, but it is kept here to get a sense of the error.
def dataingClassTest():
    hoRatio =0.10
    datingDataMat,datingLabels = file2matrix('datingTestSet2.txt')
    normMat,ranges,minVals = autoNorm(datingDataMat)
    m= normMat.shape[0]
    numTestVecs = int(m*hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i,:],normMat[numTestVecs:m,:],\
                                     datingLabels[numTestVecs:m],3)
        print('the classifier came back with: %d,the real answer is :%d'\
              %(classifierResult,datingLabels[i]))
        if (classifierResult != datingLabels[i]):
            errorCount +=1.0
    print('the total error rate is :%f'%(errorCount/float(numTestVecs)))


dataingClassTest()
# the total error rate is :0.050000. That error is acceptable. Note, though, that kNN compares every query
# against all stored samples, so it suits small datasets; for something like image processing it would probably struggle.....

# 6. The dating-site prediction function
def classifyPerson():
    resultList = ['not at all','in small doses','in large doses']
    percentTats = float(input('percentage of time spent playing video games?'))
    ffMiles = float(input('frequent flier miles earned per year?'))
    iceCream = float(input('liters of ice cream consumed per year?'))
    datingDataMat,datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = np.array([ffMiles,percentTats,iceCream])
    classifierResult = classify0((inArr-minVals)/ranges,normMat,datingLabels,3)
    # labels are 1, 2, 3, so subtract 1 to index resultList
    print('You will probably like this person:',resultList[classifierResult-1])


classifyPerson()
plot_fig()
