
Machine Learning - Supervised Learning - Classification Algorithms: k-Nearest Neighbors (KNN) [Multi-class]

Posted: 2021-08-27 15:39:50


I. Introduction to the k-Nearest Neighbors Algorithm

1. The k-Nearest Neighbors (KNN) Concept

k-nearest neighbors: if, among the k samples most similar to a given sample in feature space (i.e., its nearest neighbors there), the majority belong to one class, then the sample is assigned to that class.

Similar samples should have close values for the same features.

The choice of k affects the result.

In short: which class you belong to is judged by your "neighbors".

How is the distance to a "neighbor" measured? In most cases, the Euclidean distance is used.

Distance formula for k-NN: the distance between two samples x and y is computed with the Euclidean distance d(x, y) = √((x₁ − y₁)² + (x₂ − y₂)² + … + (xₙ − yₙ)²). The data should be standardized beforehand.
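As a minimal sketch (toy vectors, NumPy only), the formula looks like this in code:

import numpy as np

# Euclidean distance between two feature vectors
def euclidean_distance(a, b):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

print(euclidean_distance([1, 2], [4, 6]))  # 5.0, since sqrt(3**2 + 4**2) = 5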

2. Movie Genre Analysis

Compute the distance between each movie and the movie to be predicted, then derive the answer from the nearest ones.

3. Summary of the KNN Algorithm Flow

1. Compute the distance between every point in the labeled dataset and the current point.
2. Sort the distances in increasing order.
3. Select the k points closest to the current point.
4. Count how often each class occurs among those k points.
5. Return the most frequent class among the k points as the predicted class of the current point.
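These five steps translate almost line for line into NumPy. A minimal sketch (the toy data mirrors the movie example below; the English labels 'action'/'romance' are stand-ins):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    # step 1: distance from every known point to the current point
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # steps 2-3: sort by increasing distance, keep the k closest points
    nearest = np.argsort(dists)[:k]
    # steps 4-5: count class frequencies among the k points, return the winner
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[36, 1], [43, 2], [0, 10], [59, 1], [1, 15], [2, 19]])
y_train = np.array(['action', 'action', 'romance', 'action', 'romance', 'romance'])
print(knn_predict(X_train, y_train, np.array([100, 3])))  # 'action'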

4. The k-NN "Classification" API

sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, algorithm='auto')

n_neighbors: int, optional (default = 5); the number of neighbors to use for kneighbors queries.

algorithm: {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional; the algorithm used to compute the nearest neighbors. 'ball_tree' uses a BallTree, 'kd_tree' uses a KDTree, 'brute' uses a brute-force search, and 'auto' tries to choose the most appropriate algorithm based on the values passed to fit. (The implementation affects speed, not the result.)
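A minimal usage sketch (toy one-dimensional data):

from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [2], [3]]  # toy 1-D features
y = [0, 0, 1, 1]          # class labels
knn = KNeighborsClassifier(n_neighbors=3, algorithm='auto')
knn.fit(X, y)
print(knn.predict([[1.1]]))  # [0]: two of the three nearest neighbors are labeled 0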

5. The k-NN "Regression" API

sklearn.neighbors.KNeighborsRegressor(n_neighbors=5, algorithm='auto')

KNN regression does not build an equation; it works more like averaging: find the k nearest neighbors and take the mean of their target values as the fitted value.
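A minimal sketch of that averaging behavior (made-up data):

from sklearn.neighbors import KNeighborsRegressor

X = [[0], [1], [2], [3], [4]]
y = [0.0, 1.0, 2.0, 3.0, 4.0]
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X, y)
# the 3 nearest neighbors of 1.0 are 0, 1 and 2, so the fitted value is the
# mean of their targets: (0.0 + 1.0 + 2.0) / 3 = 1.0
print(reg.predict([[1.0]]))  # [1.]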

II. KNN Case Studies

When using k-NN, the feature values should first be de-dimensionalized by normalization or standardization.
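Without scaling, a feature with a large numeric range dominates the distance. A quick illustration with made-up numbers:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
print(MinMaxScaler().fit_transform(X))    # normalization: each column scaled into [0, 1]
print(StandardScaler().fit_transform(X))  # standardization: each column to mean 0, std 1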

1. Movie Classification

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

if __name__ == '__main__':
    # 1. Read the data
    movies = pd.read_excel('I:/AI_Data/movies.xlsx', sheet_name=1)
    print("movies = \n", movies)
    # 2. Feature engineering: split off the feature values and the target
    x_data = movies.iloc[:, 1:3]  # column 0 (the movie title) is not a feature
    y_data = movies['分类情况']    # genre column: 动作片 = action, 爱情片 = romance
    print('features: x_data =\n', x_data)
    print('target: y_data =\n', y_data)
    # 3. Algorithm
    # 3.1 Instantiate a k-NN estimator
    knn = KNeighborsClassifier(n_neighbors=5)
    # 3.2 Feed the training data x_data, y_data to the k-NN estimator
    knn.fit(x_data, y_data)
    # 4. Use the model
    # 4.1 Build a test set of movie features
    #     (武打镜头 = fight scenes, 接吻镜头 = kissing scenes)
    x_test = pd.DataFrame({'武打镜头': [100, 67, 1], '接吻镜头': [3, 2, 10]})
    print('test movie features: x_test = \n', x_test)
    # 4.2 Predict the genre of each test movie
    y_test = knn.predict(x_test)
    print('predicted genres: y_test = ', y_test)

Output:

movies =
     电影名称  武打镜头  接吻镜头  分类情况
 0   大话西游      36      1    动作片
 1   杀破狼        43      2    动作片
 2   前任3          0     10    爱情片
 3   战狼2         59      1    动作片
 4   泰坦尼克号     1     15    爱情片
 5   星语心愿       2     19    爱情片
features: x_data =
     武打镜头  接吻镜头
 0      36      1
 1      43      2
 2       0     10
 3      59      1
 4       1     15
 5       2     19
target: y_data =
 0    动作片
 1    动作片
 2    爱情片
 3    动作片
 4    爱情片
 5    爱情片
 Name: 分类情况, dtype: object
test movie features: x_test =
     武打镜头  接吻镜头
 0     100      3
 1      67      2
 2       1     10
predicted genres: y_test =  ['动作片' '动作片' '爱情片']

2. Iris Classification

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

if __name__ == '__main__':
    # 1. Load the data
    iris = datasets.load_iris()
    print('iris =\n', iris)
    X_data = iris.data    # 4 columns: sepal length/width and petal length/width
    y_data = iris.target  # 0 = setosa, 1 = versicolor, 2 = virginica
    # 2. Feature engineering
    # 2.1 Standardize the features as a whole first; the train and test sets
    #     must not be standardized separately. (Note: standardizing this
    #     dataset lowered the model's accuracy, so it is commented out.)
    # std = StandardScaler()
    # X_data = std.fit_transform(X_data)
    # print('standardized features: X_data =\n', X_data)
    # 2.2 Shuffle the dataset
    index = np.arange(150)
    np.random.shuffle(index)
    print("index =\n", index)
    # 2.3 Split into train/test features and targets
    X_train = X_data[index[:120]]
    X_test = X_data[index[120:]]
    y_train = y_data[index[:120]]
    y_test = y_data[index[120:]]
    print('training features: X_train =\n', X_train)
    print('test features: X_test =\n', X_test)
    print('training targets: y_train =\n', y_train)
    print('test targets: y_test =\n', y_test)
    # 3. Algorithm
    # 3.1 Instantiate a k-NN estimator.
    #     p=1 uses the Manhattan distance, p=2 the Euclidean distance;
    #     n_jobs is the number of parallel jobs.
    #     As a rule of thumb, n_neighbors should not exceed the square root
    #     of the number of samples.
    knn = KNeighborsClassifier(n_neighbors=5, weights='distance', p=1, n_jobs=4)
    # 3.2 Feed the training data to the k-NN estimator
    knn.fit(X_train, y_train)
    # 4. Model evaluation
    # 4.1 Predict class probabilities for the test set
    y_proba = knn.predict_proba(X_test)
    print('predicted class probabilities: y_proba =\n', y_proba)
    y_proba_predict = y_proba.argmax(axis=1)
    print('classes derived from the probabilities: y_proba_predict =\n', y_proba_predict)
    # 4.2 Predict the test set directly with knn.predict()
    y_predict = knn.predict(X_test)
    print('knn.predict() predictions: y_predict =\n', y_predict)
    print('actual test classes: y_test =\n', y_test)
    # 4.3 Model accuracy
    predict_score = knn.score(X_test, y_test)
    print('knn model accuracy: predict_score =\n', predict_score)

Output:

iris =
 {'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        ...
        [5.9, 3. , 5.1, 1.8]]),
  'target': array([0, 0, 0, ..., 2, 2, 2]),
  'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'),
  'DESCR': 'Iris Plants Database\n... (dataset description: 150 instances, 50 per class; 4 numeric attributes; creator R.A. Fisher, 1988; a copy of the UCI ML iris dataset, http://archive.ics.uci.edu/ml/datasets/Iris) ...',
  'feature_names': ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']}
index =
 [147  89  67  57  26 130 113  87  48  38 ... 107 125 123  78 122  85]
(the X_train, X_test, y_train and y_test prints are abridged here)
predicted class probabilities: y_proba =
 [[0.   0.1  0.9 ]
  [0.   0.   1.  ]
  [0.   0.75 0.25]
  ...
  [0.   0.9  0.1 ]]
classes derived from the probabilities: y_proba_predict =
 [2 2 1 2 0 1 1 2 0 0 0 2 0 0 2 2 0 1 1 0 1 0 1 2 2 2 1 1 2 1]
knn.predict() predictions: y_predict =
 [2 2 1 2 0 1 1 2 0 0 0 2 0 0 2 2 0 1 1 0 1 0 1 2 2 2 1 1 2 1]
actual test classes: y_test =
 [2 2 1 2 0 1 1 2 0 0 0 2 0 0 2 2 0 1 1 0 1 0 2 2 2 2 2 1 2 1]
knn model accuracy: predict_score =
 0.9333333333333333

3. Iris Classification with a Plot, Version 1

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

if __name__ == '__main__':
    # 1. Load the data
    iris = datasets.load_iris()
    X_data = iris.data    # 4 features: sepal length/width, petal length/width
    y_data = iris.target  # 0 = setosa, 1 = versicolor, 2 = virginica
    # 2. Feature engineering
    # 2.1 Dimensionality reduction (on the ndarray): keep only the first 2 features
    X_data = X_data[:, :2]
    # 2.2 Scatter plot, colored by target value
    plt.scatter(x=X_data[:, 0], y=X_data[:, 1], c=y_data)
    # 2.3 Shuffle the dataset
    index = np.arange(150)
    np.random.shuffle(index)
    # 2.4 Split into train/test features and targets
    X_train = X_data[index[:120]]
    X_test = X_data[index[120:]]
    y_train = y_data[index[:120]]
    y_test = y_data[index[120:]]
    # 3. Algorithm
    # 3.1 Instantiate a k-NN estimator. p=1: Manhattan distance; p=2: Euclidean
    #     distance; n_jobs: number of parallel jobs. n_neighbors should not
    #     exceed the square root of the number of samples.
    knn = KNeighborsClassifier(n_neighbors=5, weights='distance', p=1, n_jobs=4)
    # 3.2 Feed the training data to the k-NN estimator
    knn.fit(X_train, y_train)
    # 4. Model evaluation
    y_proba = knn.predict_proba(X_test)       # class probabilities for the test set
    y_proba_predict = y_proba.argmax(axis=1)  # classes derived from the probabilities
    y_predict = knn.predict(X_test)           # direct knn.predict() predictions
    print('knn.predict() predictions: y_predict =\n', y_predict)
    print('actual test classes: y_test =\n', y_test)
    predict_score = knn.score(X_test, y_test)
    print('knn model accuracy: predict_score =\n', predict_score)
    # 5. Plotting
    # 5.1 Generate grid sampling points. The larger N and M, the sharper the
    #     plotted decision boundary; N, M = 500, 500 works well. N, M = 5, 5 is
    #     used here only to keep the intermediate data easy to inspect.
    N, M = 5, 5
    x1_min, x1_max = X_train[:, 0].min(), X_train[:, 0].max()  # range of column 0
    x2_min, x2_max = X_train[:, 1].min(), X_train[:, 1].max()  # range of column 1
    t1 = np.linspace(x1_min, x1_max, N)
    t2 = np.linspace(x2_min, x2_max, M)
    print('type(t1) =', type(t1), '----t1.shape =', t1.shape, "----t1 =\n", t1)
    print('type(t2) =', type(t2), '----t2.shape =', t2.shape, "----t2 =\n", t2)
    x1, x2 = np.meshgrid(t1, t2)  # np.meshgrid() builds the grid coordinate matrices
    print('grid sampling points x1: ----x1.shape =', x1.shape, "----x1 =\n", x1)
    print('grid sampling points x2: ----x2.shape =', x2.shape, "----x2 =\n", x2)
    x1_ravel = x1.ravel()  # flatten the 2-D array to 1-D
    x2_ravel = x2.ravel()  # flatten the 2-D array to 1-D
    print('x1 flattened to 1-D: ----x1_ravel.shape =', x1_ravel.shape, "----x1_ravel =\n", x1_ravel)
    print('x2 flattened to 1-D: ----x2_ravel.shape =', x2_ravel.shape, "----x2_ravel =\n", x2_ravel)
    # 5.2 Generate the plotting samples: stack x1_ravel and x2_ravel into test
    #     samples of shape (N*M, 2) (with N, M = 500 that is (250000, 2))
    x_test = np.stack((x1_ravel, x2_ravel), axis=1)
    print('np.stack() test points: ----x_test.shape =', x_test.shape, "----x_test =\n", x_test)
    # 5.3 Predict the class of every plotting sample
    y_hat = knn.predict(x_test)
    print('y_hat: ----y_hat.shape =', y_hat.shape, "----y_hat =\n", y_hat)
    y_hat = y_hat.reshape(x1.shape)  # back to the same shape as the grid
    print('y_hat: ----y_hat.shape =', y_hat.shape, "----y_hat =\n", y_hat)
    # 5.4 Plot setup
    cm_light = mpl.colors.ListedColormap(['#77E0A0', '#FF8080', '#A0A0FF'])
    cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
    mpl.rcParams['font.sans-serif'] = [u'simHei']
    mpl.rcParams['axes.unicode_minus'] = False
    plt.figure(facecolor='w')
    plt.xlabel(u'sepal length', fontsize=14)
    plt.ylabel(u'sepal width', fontsize=14)
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.grid()
    patchs = [mpatches.Patch(color='#77E0A0', label='Iris-setosa'),
              mpatches.Patch(color='#FF8080', label='Iris-versicolor'),
              mpatches.Patch(color='#A0A0FF', label='Iris-virginica')]
    plt.legend(handles=patchs, fancybox=True, framealpha=0.8)
    plt.title(u'k-NN three-class iris classification', fontsize=17)
    # 5.5 Draw
    plt.pcolormesh(x1, x2, y_hat, cmap=cm_light)  # predicted class of each grid point
    plt.scatter(X_data[:, 0], X_data[:, 1], edgecolors='k', s=30, cmap=cm_dark)  # the samples
    plt.show()

Output:

knn.predict() predictions: y_predict =
 [0 0 2 2 0 2 0 0 2 2 0 2 0 2 1 0 1 1 1 0 2 0 0 2 1 1 1 1 0 2]
actual test classes: y_test =
 [0 0 1 1 0 1 0 0 2 1 0 2 0 2 1 0 2 2 1 0 1 0 0 2 1 1 1 1 0 2]
knn model accuracy: predict_score =
 0.7666666666666667
type(t1) = <class 'numpy.ndarray'> ----t1.shape = (5,) ----t1 =
 [4.4   5.275 6.15  7.025 7.9  ]
type(t2) = <class 'numpy.ndarray'> ----t2.shape = (5,) ----t2 =
 [2.  2.6 3.2 3.8 4.4]
grid sampling points x1: ----x1.shape = (5, 5) ----x1 =
 [[4.4   5.275 6.15  7.025 7.9  ]
  [4.4   5.275 6.15  7.025 7.9  ]
  [4.4   5.275 6.15  7.025 7.9  ]
  [4.4   5.275 6.15  7.025 7.9  ]
  [4.4   5.275 6.15  7.025 7.9  ]]
grid sampling points x2: ----x2.shape = (5, 5) ----x2 =
 [[2.  2.  2.  2.  2. ]
  [2.6 2.6 2.6 2.6 2.6]
  [3.2 3.2 3.2 3.2 3.2]
  [3.8 3.8 3.8 3.8 3.8]
  [4.4 4.4 4.4 4.4 4.4]]
x1 flattened to 1-D: ----x1_ravel.shape = (25,) ----x1_ravel =
 [4.4 5.275 6.15 7.025 7.9 4.4 5.275 6.15 7.025 7.9 ... 4.4 5.275 6.15 7.025 7.9]
x2 flattened to 1-D: ----x2_ravel.shape = (25,) ----x2_ravel =
 [2. 2. 2. 2. 2. 2.6 2.6 2.6 2.6 2.6 ... 4.4 4.4 4.4 4.4 4.4]
np.stack() test points: ----x_test.shape = (25, 2) ----x_test =
 [[4.4   2.   ]
  [5.275 2.   ]
  [6.15  2.   ]
  ...
  [7.9   4.4  ]]
y_hat: ----y_hat.shape = (25,) ----y_hat =
 [1 1 1 1 2 0 1 2 2 2 0 0 1 1 2 0 0 0 2 2 0 0 0 2 2]
y_hat: ----y_hat.shape = (5, 5) ----y_hat =
 [[1 1 1 1 2]
  [0 1 2 2 2]
  [0 0 1 1 2]
  [0 0 0 2 2]
  [0 0 0 2 2]]

4. Iris Classification with a Plot, Version 2

import numpy as np
import matplotlib.pylab as pyb
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
from matplotlib.colors import ListedColormap

if __name__ == '__main__':
    # 1. Load the data: 4 attributes, a 4-dimensional feature space; 150 samples
    X, y = datasets.load_iris(return_X_y=True)
    print('X.shape =', X.shape)
    # 2. Feature engineering
    # Crude dimensionality reduction by slicing (information is lost)
    X = X[:, :2]
    print('X.shape =', X.shape)
    pyb.scatter(X[:, 0], X[:, 1], c=y)
    # 3. Algorithm: train KNN on all 150 sample points
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X, y)
    # 4. Plotting
    # Extract background test points with meshgrid; x axis 4 to 8, y axis 2 to
    # 4.5. (Num = 500 would give 500*500 test samples and a sharp boundary;
    # Num = 5 keeps the printed data small.)
    Num = 5
    x1 = np.linspace(4, 8, Num)
    y1 = np.linspace(2, 4.5, Num)
    X1, Y1 = np.meshgrid(x1, y1)
    print('x1 =\n', x1)
    print('y1 =\n', y1)
    print('X1 =\n', X1)
    print('Y1 =\n', Y1)
    # Flatten to 1-D and pair up into test points of shape (?, 2); equivalent to:
    # X1 = X1.reshape(-1, 1); Y1 = Y1.reshape(-1, 1)
    # X_test = np.concatenate([X1, Y1], axis=1)
    X_test = np.c_[X1.ravel(), Y1.ravel()]
    print('X_test.shape = ', X_test.shape)
    y_ = knn.predict(X_test)
    print('y_ =\n', y_)
    lc = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
    lc2 = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
    pyb.scatter(X_test[:, 0], X_test[:, 1], c=y_, cmap=lc)  # predicted grid points
    pyb.scatter(X[:, 0], X[:, 1], c=y, cmap=lc2)            # the samples
    pyb.contourf(X1, Y1, y_.reshape(Num, Num), cmap=lc)
    pyb.scatter(X[:, 0], X[:, 1], c=y, cmap=lc2)

Output:

X.shape = (150, 4)
X.shape = (150, 2)
x1 =
 [4. 5. 6. 7. 8.]
y1 =
 [2.    2.625 3.25  3.875 4.5  ]
X1 =
 [[4. 5. 6. 7. 8.]
  [4. 5. 6. 7. 8.]
  [4. 5. 6. 7. 8.]
  [4. 5. 6. 7. 8.]
  [4. 5. 6. 7. 8.]]
Y1 =
 [[2.    2.    2.    2.    2.   ]
  [2.625 2.625 2.625 2.625 2.625]
  [3.25  3.25  3.25  3.25  3.25 ]
  [3.875 3.875 3.875 3.875 3.875]
  [4.5   4.5   4.5   4.5   4.5  ]]
X_test.shape =  (25, 2)
y_ =
 [0 1 1 1 2 0 1 1 2 2 0 0 1 2 2 0 0 0 2 2 0 0 0 2 2]

5. Tumor Prediction with KNN

import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
# precision, recall, and their harmonic mean, the f1-score
from sklearn.metrics import classification_report

if __name__ == '__main__':
    # 1. Read the data
    # 1.1 Build the column labels
    columns = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size',
               'Uniformity of Cell Shape', 'Marginal Adhesion',
               'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
               'Normal Nucleoli', 'Mitoses', 'Class']
    # 1.2 Read the data
    data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data',
                       names=columns)
    print('data.shape = ', data.shape, '----type(data) = ', type(data), '----data.head() = \n', data.head())
    # 1.3 Drop the rows that contain '?'
    data = data.replace(to_replace='?', value=np.nan)
    data = data.dropna()
    # 2. Feature engineering
    # 2.1 Extract the feature values and the target
    x_data = data.iloc[:, 1:-1]
    y_data = data.iloc[:, -1]
    print('x_data.shape = ', x_data.shape, '----x_data.head() =\n', x_data.head())
    print('y_data.shape = ', y_data.shape, '----y_data.head() =\n', y_data.head())
    # 2.2 Normalize or standardize the feature values
    # # 2.2.1 Manual normalization
    # x_data_normal01 = (x_data - x_data.min()) / (x_data.max() - x_data.min())
    # # 2.2.2 Normalization with MinMaxScaler
    # mms = MinMaxScaler()
    # x_data_normal02 = mms.fit_transform(x_data)
    # # 2.2.3 Manual standardization
    # x_data_standard01 = (x_data - x_data.mean()) / x_data.std()
    # 2.2.4 Standardization with StandardScaler
    std = StandardScaler()
    x_data_standard02 = std.fit_transform(x_data)
    print('x_data_standard02.shape = ', x_data_standard02.shape, '----x_data_standard02 =\n', x_data_standard02)
    x_data = x_data_standard02
    # 3. Algorithm
    # 3.1 Split into train and test sets
    x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, train_size=0.8)
    # 3.2 Instantiate the knn estimator
    knn = KNeighborsClassifier()
    # 3.3 Feed the training features and targets to the knn estimator
    knn.fit(x_train, y_train)
    # 4. Model evaluation
    y_predict = knn.predict(x_test)
    print('y_predict =\n', y_predict)
    test_score = knn.score(x_test, y_test)
    print('test_score = ', test_score)
    # 5. Model tuning (grid search)
    params = {'n_neighbors': [i for i in range(1, 30)],
              'weights': ['uniform', 'distance'],
              'p': [1, 2]}
    gcv = GridSearchCV(knn, params, scoring='accuracy', cv=10)
    gcv.fit(x_train, y_train)
    y_predict_gcv = gcv.predict(x_test)
    print('y_predict_gcv =\n', y_predict_gcv)
    # Inspect the best model
    print('best_params_ = ', gcv.best_params_)
    print('best_estimator_ = ', gcv.best_estimator_)
    print('best_score_ = ', gcv.best_score_)
    # Accuracy check 1: accuracy_score()
    accuracy_score01 = accuracy_score(y_test, y_predict_gcv)
    print('accuracy_score01 = ', accuracy_score01)
    # Take the best model out and predict with it
    knn_best = gcv.best_estimator_
    y_predict_best = knn_best.predict(x_test)
    accuracy_score02 = accuracy_score(y_test, y_predict_best)
    print('accuracy_score02 = ', accuracy_score02)
    # Accuracy check 2: predicting with gcv directly gives the same result
    score_gcv = gcv.score(x_test, y_test)  # GridSearchCV.score()
    print('score_gcv = ', score_gcv)
    # 6. Cross table
    print('target size: y_test.shape =', y_test.shape)
    cros_tab = pd.crosstab(index=y_test, columns=y_predict_best,
                           rownames=['True'], colnames=['Predict'], margins=True)
    print('cros_tab =\n', cros_tab)
    # 7. Confusion matrix
    confu_matrix = confusion_matrix(y_test, y_predict_best)
    print('confusion matrix: confu_matrix =\n', confu_matrix)
    # 8. Model evaluation metrics
    print('y_test.value_counts() =\n', y_test.value_counts())
    class_report = classification_report(y_test, y_predict_best, target_names=['B', 'M'])
    print('class_report = \n', class_report)

Output:

data.shape =  (699, 11) ----type(data) =  <class 'pandas.core.frame.DataFrame'> ----data.head() =
    Sample code number  Clump Thickness  ...  Mitoses  Class
 0             1000025                5  ...        1      2
 1             1002945                5  ...        1      2
 2             1015425                3  ...        1      2
 3             1016277                6  ...        1      2
 4             1017023                4  ...        1      2
 [5 rows x 11 columns]
x_data.shape =  (683, 9) ----x_data.head() =
    Clump Thickness  Uniformity of Cell Size  ...  Normal Nucleoli  Mitoses
 0                5                        1  ...                1        1
 1                5                        4  ...                2        1
 2                3                        1  ...                1        1
 3                6                        8  ...                7        1
 4                4                        1  ...                1        1
 [5 rows x 9 columns]
y_data.shape =  (683,) ----y_data.head() =
 0    2
 1    2
 2    2
 3    2
 4    2
 Name: Class, dtype: int64
x_data_standard02.shape =  (683, 9) ----x_data_standard02 =
 [[ 0.19790469 -0.70221201 -0.74177362 ... -0.18182716 -0.61292736 -0.34839971]
  [ 0.19790469  0.27725185  0.26278299 ... -0.18182716 -0.28510482 -0.34839971]
  [-0.51164337 -0.70221201 -0.74177362 ... -0.18182716 -0.61292736 -0.34839971]
  ...
  [-0.15686934  1.58320366  1.6021918  ...  2.67776377  0.37054027 -0.34839971]]
y_predict =
 [4 4 4 2 2 2 4 4 2 2 2 4 4 2 4 4 4 4 4 2 ... 4 2 4 2 2 2 2 2]
test_score =  0.9635036496350365
y_predict_gcv =
 [4 4 4 2 2 2 4 4 2 2 2 4 4 2 4 4 4 4 4 2 ... 4 2 4 2 2 2 2 2]
best_params_ =  {'n_neighbors': 1, 'p': 1, 'weights': 'uniform'}
best_estimator_ =  KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                                        metric_params=None, n_jobs=1, n_neighbors=1, p=1,
                                        weights='uniform')
best_score_ =  0.978021978021978
accuracy_score01 =  0.948905109489051
accuracy_score02 =  0.948905109489051
score_gcv =  0.948905109489051
target size: y_test.shape = (137,)
cros_tab =
 Predict   2   4  All
 True
 2        79   3   82
 4         4  51   55
 All      83  54  137
confusion matrix: confu_matrix =
 [[79  3]
  [ 4 51]]
y_test.value_counts() =
 2    82
 4    55
 Name: Class, dtype: int64
class_report =
               precision    recall  f1-score   support
           B        0.95      0.96      0.96        82
           M        0.94      0.93      0.94        55
 avg / total        0.95      0.95      0.95       137

6. Predicting Facebook Check-in Locations

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

def knncls():
    """Predict users' check-in locations with k-NN"""
    # 1. Read the data (pandas)
    myDataFrame = pd.read_csv("I:/AI_Data/facebook-v-predicting-check-ins/train.csv")
    print('\nmyDataFrame.head(5) = \n', myDataFrame.head(5))
    # 2. Process the data (pandas)
    # 2.1 Narrow the data down and inspect it
    myDataFrame = myDataFrame.sort_values(by='row_id').query("x > 1.0 & x < 1.25 & y > 2.5 & y < 2.75")
    print('\nmyDataFrame.count() = \n', myDataFrame.count())
    # Handle the time data: convert the timestamps to a Series
    # (value format: yyyy-mm-dd hh:mm:ss)
    dateTimeSeries = pd.to_datetime(myDataFrame['time'], unit='s')
    print('\ntype(dateTimeSeries) = ', type(dateTimeSeries))
    print('\ndateTimeSeries.head(5) = \n', dateTimeSeries.head(5))
    # 2.2 Convert dateTimeSeries into a DatetimeIndex
    dateTimeIndexMap = pd.DatetimeIndex(dateTimeSeries)
    print('\ntype(dateTimeIndexMap) = ', type(dateTimeIndexMap))
    print('\ndateTimeIndexMap = \n', dateTimeIndexMap)
    # 2.3 Construct some features (adding a feature can raise or lower accuracy)
    # myDataFrame['year'] = dateTimeIndexMap.year  # the year is meaningless: future predictions never see this year again
    myDataFrame['month'] = dateTimeIndexMap.month
    myDataFrame['day'] = dateTimeIndexMap.day
    myDataFrame['hour'] = dateTimeIndexMap.hour
    # myDataFrame['minute'] = dateTimeIndexMap.minute
    # myDataFrame['weekday'] = dateTimeIndexMap.weekday
    # 2.4 Drop the raw timestamp feature
    myDataFrame = myDataFrame.drop(['time'], axis=1)
    print('\nmyDataFrame.head(5) = \n', myDataFrame.head(5))
    # 2.5 Drop target locations with fewer than n check-ins.
    # Group and count: after groupby, place_id becomes the index and every other
    # column holds the member count of its group.
    placeCountDataFrame = myDataFrame.groupby('place_id').count()
    print('\ntype(placeCountDataFrame) = \n', type(placeCountDataFrame))
    print('\nplaceCountDataFrame = \n', placeCountDataFrame.head(5))
    # Keep the groups with more than 3 members; reset_index turns the place_id
    # index back into an ordinary column and installs a fresh 0, 1, 2, ... index.
    placeCountDataFrame = placeCountDataFrame[placeCountDataFrame.row_id > 3].reset_index()
    print('\nplaceCountDataFrame = \n', placeCountDataFrame.head(5))
    # Keep only the samples in myDataFrame whose place_id survived the filter
    myDataFrame = myDataFrame[myDataFrame['place_id'].isin(placeCountDataFrame.place_id)]
    print('\nmyDataFrame.head(5) = \n', myDataFrame.head(5))
    # 2.6 Separate the feature values and the target
    y_Series = myDataFrame['place_id']                    # target
    x_DataFrame = myDataFrame.drop(['place_id'], axis=1)  # features
    # 2.7 Drop useless features to improve accuracy
    x_DataFrame = x_DataFrame.drop(['row_id'], axis=1)
    print('\nfeatures: x_DataFrame = \n', type(x_DataFrame), '\n', x_DataFrame.head(5))
    print('\ntarget: y_Series = \n', type(y_Series), '\n', y_Series.head(5))
    # 3. Feature engineering (scikit-learn)
    # 3.1 Preprocessing: standardize the feature values (the target needs no
    #     standardization) so that no single feature dominates the final
    #     result, which improves accuracy
    std = StandardScaler()
    x_DataFrame = std.fit_transform(x_DataFrame)
    print('\nstandardized x_DataFrame:\n', x_DataFrame)
    # 3.2 Split the data into train and test sets
    x_train_DataFrame, x_test_DataFrame, y_train_Series, y_test_Series = train_test_split(x_DataFrame, y_Series, test_size=0.25)
    print('\ntraining features x_train_DataFrame:\n', x_train_DataFrame)
    print('\ntest features x_test_DataFrame:\n', x_test_DataFrame)
    print('\ntraining targets y_train_Series:\n', y_train_Series.head(5))
    print('\ntest targets y_test_Series:\n', y_test_Series.head(5))
    # 4. Algorithm
    # 4.1 Instantiate a k-NN estimator object
    knn_estimator = KNeighborsClassifier(n_neighbors=5)
    # 4.2 Call fit to train
    knn_estimator.fit(x_train_DataFrame, y_train_Series)
    # 5. Model evaluation
    # 5.1 Predict the test set
    predictTestSeries = knn_estimator.predict(x_test_DataFrame)
    print('\npredicted targets:\n', predictTestSeries)
    print('\npredictions vs. ground truth:\n', predictTestSeries == y_test_Series)
    # 5.2 Accuracy on the test set (pass the test features and targets)
    predictScore = knn_estimator.score(x_test_DataFrame, y_test_Series)
    print('\naccuracy:\n', predictScore)

if __name__ == "__main__":
    knncls()

Output:

myDataFrame.head(5) =
    row_id       x       y  accuracy    time    place_id
 0       0  0.7941  9.0809        54  470702  8523065625
 1       1  5.9567  4.7968        13  186555  1757726713
 2       2  8.3078  7.0407        74  322648  1137537235
 3       3  7.3665  2.5165        65  704587  6567393236
 4       4  4.0961  1.1307        31  472130  7440663949
myDataFrame.count() =
 row_id      17710
 x           17710
 y           17710
 accuracy    17710
 time        17710
 place_id    17710
 dtype: int64
type(dateTimeSeries) =  <class 'pandas.core.series.Series'>
dateTimeSeries.head(5) =
 600     1970-01-01 18:09:40
 957     1970-01-10 02:11:10
 4345    1970-01-05 15:08:02
 4735    1970-01-06 23:03:03
 5580    1970-01-09 11:26:50
 Name: time, dtype: datetime64[ns]
type(dateTimeIndexMap) =  <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
dateTimeIndexMap =
 DatetimeIndex(['1970-01-01 18:09:40', '1970-01-10 02:11:10', '1970-01-05 15:08:02',
                ..., '1970-01-04 00:53:41', '1970-01-08 23:01:07'],
               dtype='datetime64[ns]', name='time', length=17710, freq=None)
myDataFrame.head(5) =
        row_id       x       y  accuracy    place_id  month  day  hour
 600       600  1.2214  2.7023        17  6683426742      1    1    18
 957       957  1.1832  2.6891        58  6683426742      1   10     2
 4345     4345  1.1935  2.6550        11  6889790653      1    5    15
 4735     4735  1.1452  2.6074        49  6822359752      1    6    23
 5580     5580  1.0089  2.7287        19  1527921905      1    9    11
(the placeCountDataFrame, filtered myDataFrame, standardized feature arrays and
train/test split prints are abridged)
predicted targets:
 [4932578245 1267801529 1435128522 ... 8048985799 1228935308 3312463746]
predictions vs. ground truth:
 18871651    False
 2439401     False
 19545960    False
 ...
 13893226     True
 Name: place_id, Length: 4230, dtype: bool
accuracy:
 0.49267139479905436

The adjustable factors that influence the prediction result include:

The n_neighbors argument in knn_estimator = KNeighborsClassifier(n_neighbors=5): different values of n_neighbors yield different results, and searching for the value that gives the best result is exactly what "tuning" means.

The features constructed in step 2.3: a constructed feature that carries no information about the target should be dropped, otherwise it hurts the model's accuracy.

III. Choosing the Value of k

If k is too small, the model is easily swayed by outliers and tends to overfit.

If k is too large, class-imbalance effects take over and the model tends to underfit.

Choosing a smaller k amounts to predicting from training instances in a smaller neighborhood. The approximation error of "learning" decreases, since only training instances close or similar to the input instance influence the prediction; the accompanying problem is that the estimation error of "learning" increases. In other words, decreasing k makes the overall model more complex and prone to overfitting.

Choosing a larger k amounts to predicting from training instances in a larger neighborhood. The advantage is a smaller estimation error of learning; the drawback is a larger approximation error, because training instances far from (dissimilar to) the input instance now also influence the prediction and can make it wrong. Increasing k makes the overall model simpler and prone to underfitting.

k = N (where N is the number of training samples) is completely unacceptable: whatever the input instance, the model simply predicts the class that is most frequent in the training instances. Such a model is far too simple and ignores a wealth of useful information in the training instances.

In practice, k is usually set to a fairly small value, chosen for example by cross-validation (simply put: splitting the training data into a training set and a validation set) to find the optimal k.
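A minimal sketch of choosing k by cross-validation on the iris data (the candidate range 1 to 11 is an arbitrary choice):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# mean 10-fold validation score for each candidate k
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10).mean()
          for k in range(1, 12)}
best_k = max(scores, key=scores.get)
print('best k =', best_k, 'score =', scores[best_k])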

Approximation error:

The training error on the existing training set; it focuses on the training set. If the approximation error is very small, overfitting may occur: the model predicts the existing training set very well but deviates badly on unseen test samples. The model itself is then not the closest to the best possible model.

Estimation error:

Roughly, the test error on the test set; it focuses on the test set. A small estimation error means good predictive power on unseen data, and the model itself is then close to the best possible model.

IV. Strengths and Weaknesses of k-NN

1. Strengths of k-NN:

Simple and effective.

No parameters to estimate.

"Parameters" here means the algorithm's internal parameters, not the arguments passed when instantiating it.

No training (no iterations).

"Training" here refers to an iterative training process: k-NN computes its result in one pass and needs no iterative training.

Retraining is cheap.

Well suited to samples whose class domains overlap.

KNN determines class membership mainly from a limited set of nearby samples rather than by discriminating between class domains, so for sample sets whose class domains cross or overlap heavily, KNN is a better fit than other methods.

Well suited to automatic classification of large sample sets.

The algorithm suits class domains with large sample sizes; class domains with few samples are more easily misclassified by it.

2. Weaknesses:

Lazy learning.

KNN is a lazy learning method (lazy learning: it does essentially no learning up front); eager learning algorithms are much faster.

Class scores are not normalized.

Unlike classifiers that produce probability scores.

The output is hard to interpret.

The output of a decision tree, for example, is far more interpretable.

Weak on imbalanced samples.

When the samples are imbalanced, e.g. one class has a very large sample size while the others are small, the k neighbors of a new input may be dominated by samples of the large class. The algorithm looks only at the "nearest" neighbors: when one class is very numerous, its samples may either not actually be close to the target sample or may crowd right around it; either way, sheer quantity should not be allowed to decide the result on its own. Weighting the votes (giving neighbors at smaller distances larger weights) improves on this, as sketched below.
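In scikit-learn, this distance weighting is a single argument (the same weights='distance' already used in the iris example above):

from sklearn.neighbors import KNeighborsClassifier

# weights='uniform' (the default): every neighbor's vote counts equally
# weights='distance': votes are weighted by the inverse of the distance, so a
# few close neighbors can outvote many far-away samples of a dominant class
knn = KNeighborsClassifier(n_neighbors=5, weights='distance')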

Computationally expensive, with high memory overhead.

A common remedy is to prune the known sample points in advance, removing samples that contribute little to classification.

A value of k must be specified, and a poor choice of k compromises classification accuracy.

V. When to Use k-NN

In practice, the k-NN algorithm is rarely used on its own.

For small-data scenarios (a few thousand to a few tens of thousands of samples), test which algorithm works better for the concrete scenario and business problem: with the same feature data, compare the algorithms one by one.
