Machine Learning, Lecture 3: Machine Learning Basics and Algorithms (K-Nearest Neighbors, Naive Bayes)

Date: 2019-12-21 01:27:10

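Contents

I. Machine learning basics
1. A few points to clarify
2. What determines the choice of algorithm
3. Categories of machine learning algorithms
4. Machine learning development workflow
II. Machine learning algorithms
1. sklearn datasets
(1) Splitting a dataset
(2) The sklearn dataset interface
(3) sklearn classification datasets
(4) sklearn regression datasets
2. K-nearest neighbors
2.1 The k-nearest neighbors algorithm (KNN)
2.2 KNN example: predicting check-in locations
3. Naive Bayes
3.1 The naive Bayes algorithm
3.2 Naive Bayes case study
3.3 Naive Bayes summary
3.4 Evaluating classification models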

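I. Machine learning basics

1. A few points to clarify

Algorithms are the core; data and computation are the foundation.

Know your role

Most complex model and algorithm design is done by algorithm engineers. What we do is:

> Analyze large amounts of data

> Analyze the specific business problem

> Apply common algorithms

> Do feature engineering, parameter tuning, and optimization

What we should do

1. Learn to analyze the problem: be clear about why a machine learning algorithm is being used and what task it should accomplish.

2. Master the basic idea behind each algorithm, and learn to pick a suitable algorithm for a given problem.

3. Learn to solve problems with existing libraries and frameworks.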

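2. What determines the choice of algorithm

What is the difference between these two groups of data?

Data types

Discrete data: data obtained by counting the individuals in each category, also called count data. All such values are integers; they cannot be subdivided, and their precision cannot be increased further.

Continuous data: the variable can take any value within a range, i.e. its values are continuous, for example length, time, or mass. Such values are usually not integers and have a fractional part.

How the data type is used

The type of the data is the basis on which machine learning treats different problems differently and chooses different models.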

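3. Categories of machine learning algorithms

Supervised learning

Classification: k-nearest neighbors, Bayesian classification, decision trees and random forests, logistic regression, neural networks

Regression: linear regression, ridge regression

Sequence labeling: hidden Markov models (not required)

Unsupervised learning

Clustering: k-means

Supervised learning

Supervised learning learns or builds a model from input data and uses that model to predict new results. The input data consists of feature values and target values. The output of the learned function can be a continuous value (regression) or one of a finite number of discrete values (classification).

Unsupervised learning

Unsupervised learning also learns or builds a model from input data and uses that model to infer new results. Here the input data consists of feature values only; there are no target values.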

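4. Machine learning development workflow

II. Machine learning algorithms

Algorithms

1. sklearn datasets and estimators

2. Classification: k-nearest neighbors

3. k-nearest neighbors example

4. Evaluating classification models

5. Classification: naive Bayes

6. Naive Bayes example

7. Model selection and tuning

8. Decision trees and random forests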

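1. sklearn datasets

(1) Splitting a dataset

A machine learning dataset is usually split into two parts:

Training data: used to train and build the model

Test data: used when validating the model, to evaluate whether it is effective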

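(2) The sklearn dataset interface

sklearn dataset-splitting API

sklearn.model_selection.train_test_split

Overview of the scikit-learn dataset API

sklearn.datasets: load and fetch popular datasets

datasets.load_*(): loads a small dataset that ships inside sklearn.datasets

datasets.fetch_*(data_home=None): fetches a large dataset that has to be downloaded from the network. The data_home parameter is the directory the dataset is downloaded to; the default is ~/scikit_learn_data/

Type returned by the loaders

Both load_* and fetch_* return a Bunch (a dictionary-like object). The original names it datasets.base.Bunch; in current scikit-learn releases the class is sklearn.utils.Bunch.

data: the feature array, a two-dimensional numpy.ndarray of shape [n_samples, n_features]

target: the label array, a one-dimensional numpy.ndarray of length n_samples

DESCR: description of the dataset

feature_names: feature names; not every dataset provides them (the news data, for example, has none)

target_names: label names; regression datasets do not have them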
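To make the "dictionary-like" return type concrete, here is a minimal sketch of an object that can be read both by key and by attribute. MiniBunch is a hypothetical stand-in written for illustration, not scikit-learn's own Bunch class, and the values are made up.

```python
class MiniBunch(dict):
    """Hypothetical stand-in: a dict whose keys are also readable as attributes."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)


# Made-up toy data laid out like a dataset object.
toy = MiniBunch(
    data=[[5.1, 3.5], [4.9, 3.0], [6.2, 3.4]],  # n_samples rows, n_features columns
    target=[0, 0, 1],                            # one label per sample
    target_names=["setosa", "virginica"],        # label index -> label name
)

n_samples = len(toy.data)
n_features = len(toy.data[0])
same_object = toy["target"] is toy.target        # key access and attribute access agree
last_label_name = toy.target_names[toy.target[2]]
```

The point of the design is convenience: code can write toy.data instead of toy["data"] while the object still behaves like a plain dictionary.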

(3) sklearn classification datasets

sklearn.datasets.load_iris() loads and returns the iris dataset

from sklearn.datasets import load_iris

# Load a small bundled dataset
li = load_iris()

# Feature values
print(li.data)
'''
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 ...
 [6.5 3.  5.2 2. ]
 [6.2 3.4 5.4 2.3]
 [5.9 3.  5.1 1.8]]
'''

# Target values
print(li.target)
'''
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
'''

# Description
print(li.DESCR)
'''
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
        - Iris-Setosa
        - Iris-Versicolour
        - Iris-Virginica

:Summary Statistics:

============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
'''

# Feature names
print(li.feature_names)
'''
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
'''

# Label names
print(li.target_names)
'''
['setosa' 'versicolor' 'virginica']
'''

sklearn.datasets.load_digits() loads and returns the handwritten digits dataset

from sklearn.datasets import load_digits

# Load a small bundled dataset
li = load_digits()

# Feature values
# print(li.data)
'''
[[ 0.  0.  5. ...  0.  0.  0.]
 [ 0.  0.  0. ... 10.  0.  0.]
 [ 0.  0.  0. ... 16.  9.  0.]
 ...
 [ 0.  0.  1. ...  6.  0.  0.]
 [ 0.  0.  2. ... 12.  0.  0.]
 [ 0.  0. 10. ... 12.  1.  0.]]
'''

# Target values
# print(li.target)
'''
[0 1 2 ... 8 9 8]
'''

# Description
# print(li.DESCR)
'''
.. _digits_dataset:

Optical recognition of handwritten digits dataset
--------------------------------------------------

**Data Set Characteristics:**

:Number of Instances: 1797
:Number of Attributes: 64
:Attribute Information: 8x8 image of integer pixels in the range 0..16.
:Missing Attribute Values: None
:Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
:Date: July; 1998
'''

# Feature names
# print(li.feature_names)
'''
['pixel_0_0', 'pixel_0_1', 'pixel_0_2', 'pixel_0_3', 'pixel_0_4', 'pixel_0_5', 'pixel_0_6', 'pixel_0_7',
 'pixel_1_0', 'pixel_1_1', 'pixel_1_2', 'pixel_1_3', 'pixel_1_4', 'pixel_1_5', 'pixel_1_6', 'pixel_1_7',
 'pixel_2_0', 'pixel_2_1', 'pixel_2_2', 'pixel_2_3', 'pixel_2_4', 'pixel_2_5', 'pixel_2_6', 'pixel_2_7',
 'pixel_3_0', 'pixel_3_1', 'pixel_3_2', 'pixel_3_3', 'pixel_3_4', 'pixel_3_5', 'pixel_3_6', 'pixel_3_7',
 'pixel_4_0', 'pixel_4_1', 'pixel_4_2', 'pixel_4_3', 'pixel_4_4', 'pixel_4_5', 'pixel_4_6', 'pixel_4_7',
 'pixel_5_0', 'pixel_5_1', 'pixel_5_2', 'pixel_5_3', 'pixel_5_4', 'pixel_5_5', 'pixel_5_6', 'pixel_5_7',
 'pixel_6_0', 'pixel_6_1', 'pixel_6_2', 'pixel_6_3', 'pixel_6_4', 'pixel_6_5', 'pixel_6_6', 'pixel_6_7',
 'pixel_7_0', 'pixel_7_1', 'pixel_7_2', 'pixel_7_3', 'pixel_7_4', 'pixel_7_5', 'pixel_7_6', 'pixel_7_7']
'''

# Label names
print(li.target_names)
'''
[0 1 2 3 4 5 6 7 8 9]
'''

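Splitting a dataset

sklearn.model_selection.train_test_split(*arrays, **options)

x: the feature values of the dataset

y: the label values of the dataset

test_size: size of the test set, usually a float (the fraction of samples placed in the test set)

random_state: random seed. Different seeds produce different random samples; the same seed always produces the same split.

return: training features, test features, training labels, test labels (samples are drawn at random by default)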
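The role of test_size and random_state can be sketched in plain Python. The function simple_split below is a hypothetical illustration of the idea (shuffle with a seeded generator, then cut); it is not how scikit-learn implements train_test_split, and the toy data is made up.

```python
import random


def simple_split(x, y, test_size=0.25, random_state=None):
    """Shuffle sample indices with a seeded RNG, then cut off the test part."""
    indices = list(range(len(x)))
    random.Random(random_state).shuffle(indices)
    n_test = int(len(x) * test_size + 0.5)  # round to the nearest whole sample
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [x[i] for i in train_idx]
    x_test = [x[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Made-up data: 8 samples with 2 features each, and a binary label.
x = [[i, i * 2] for i in range(8)]
y = [i % 2 for i in range(8)]

# The same seed gives the same split every time.
split_a = simple_split(x, y, test_size=0.25, random_state=42)
split_b = simple_split(x, y, test_size=0.25, random_state=42)
```

The return order mirrors the description above: training features, test features, training labels, test labels.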

Large datasets for classification

sklearn.datasets.fetch_20newsgroups(data_home=None, subset='train')

subset: 'train', 'test' or 'all' (optional); selects which part to load: the training set, the test set, or both.

# Split a dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_20newsgroups

li = load_iris()

# x_train, y_train: training set; x_test, y_test: test set
x_train, x_test, y_train, y_test = train_test_split(li.data, li.target, test_size=0.25)

# print("Training features and targets", x_train, y_train)
'''
Training features and targets [[6.7 3.3 5.7 2.1]
 [5.5 2.4 3.7 1. ]
 [5.4 3.9 1.7 0.4]
 ...
 [7.2 3.2 6.  1.8]
 [6.3 2.7 4.9 1.8]
 [4.9 3.1 1.5 0.1]] [2 1 0 1 0 2 1 0 0 1 1 2 0 0 2 1 0 1 0 0 0 2 1 0 1 2 1 0 0 1 2 2 1 0 1 1 0
 1 2 0 1 0 1 1 1 2 0 0 1 1 2 1 2 1 0 1 0 1 0 0 0 0 0 2 2 0 1 0 2 1 2 0 1 0
 2 1 2 2 2 2 0 2 1 2 0 2 2 1 2 2 0 1 2 1 1 2 2 1 2 2 0 2 1 0 2 2 1 1 0 2 2
 0]
'''

# print("Test features and targets", x_test, y_test)
'''
Test features and targets [[6.7 3.1 4.7 1.5]
 [6.3 3.4 5.6 2.4]
 [5.1 3.7 1.5 0.4]
 ...
 [6.6 3.  4.4 1.4]
 [5.9 3.2 4.8 1.8]
 [6.9 3.2 5.7 2.3]] [1 2 0 0 2 2 1 1 0 0 2 2 0 2 2 2 2 1 0 0 1 0 1 2 1 0 2 1 1 0 0 2 1 2 0 1 1
 2]
'''

# Downloads the 20 newsgroups data on first use
news = fetch_20newsgroups(subset='all')
print(news.data)

(4) sklearn regression datasets

sklearn.datasets.load_boston() loads and returns the Boston house prices dataset. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the code below only runs on older versions.

# Regression dataset
from sklearn.datasets import load_boston

lb = load_boston()

# Feature values
print(lb.data)
'''
[[6.3200e-03 1.8000e+01 2.3100e+00 ... 1.5300e+01 3.9690e+02 4.9800e+00]
 [2.7310e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9690e+02 9.1400e+00]
 [2.7290e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9283e+02 4.0300e+00]
 ...
 [6.0760e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 5.6400e+00]
 [1.0959e-01 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9345e+02 6.4800e+00]
 [4.7410e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 7.8800e+00]]
'''

# Target values (506 continuous values; only the first and last rows of the printout are shown)
print(lb.target)
'''
[24.  21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 15.  18.9 21.7 20.4
 ...
 22.  11.9]
'''

# Description
print(lb.DESCR)
'''
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**

:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
:Attribute Information (in order):
    - CRIM     per capita crime rate by town
    - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
    - INDUS    proportion of non-retail business acres per town
    - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    - NOX      nitric oxides concentration (parts per 10 million)
    - RM       average number of rooms per dwelling
    - AGE      proportion of owner-occupied units built prior to 1940
    - DIS      weighted distances to five Boston employment centres
    - RAD      index of accessibility to radial highways
    - TAX      full-value property-tax rate per $10,000
    - PTRATIO  pupil-teacher ratio by town
    - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    - LSTAT    % lower status of the population
    - MEDV     Median value of owner-occupied homes in $1000's

:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
'''

sklearn.datasets.load_diabetes() loads and returns the diabetes dataset

# Regression dataset
from sklearn.datasets import load_diabetes

lb = load_diabetes()

# Feature values
print(lb.data)
'''
[[ 0.03807591  0.05068012  0.06169621 ... -0.00259226  0.01990842 -0.01764613]
 [-0.00188202 -0.04464164 -0.05147406 ... -0.03949338 -0.06832974 -0.09220405]
 [ 0.08529891  0.05068012  0.04445121 ... -0.00259226  0.00286377 -0.02593034]
 ...
 [ 0.04170844  0.05068012 -0.01590626 ... -0.01107952 -0.04687948  0.01549073]
 [-0.04547248 -0.04464164  0.03906215 ...  0.02655962  0.04452837 -0.02593034]
 [-0.04547248 -0.04464164 -0.0730303  ... -0.03949338 -0.00421986  0.00306441]]
'''

# Target values (442 continuous values; only the first and last rows of the printout are shown)
print(lb.target)
'''
[151.  75. 141. 206. 135.  97. 138.  63. 110. 310. 101.  69. 179. 185.
 ...
  49.  64.  48. 178. 104. 132. 220.  57.]
'''

# Description
print(lb.DESCR)
'''
.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

:Number of Instances: 442
:Number of Attributes: First 10 columns are numeric predictive values
:Target: Column 11 is a quantitative measure of disease progression one year after baseline
:Attribute Information:
    - age     age in years
    - sex
    - bmi     body mass index
    - bp      average blood pressure
    - s1      tc, T-Cells (a type of white blood cells)
    - s2      ldl, low-density lipoproteins
    - s3      hdl, high-density lipoproteins
    - s4      tch, thyroid stimulating hormone
    - s5      ltg, lamotrigine
    - s6      glu, blood sugar level

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani () "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
'''

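Transformers and estimators

Recall the feature engineering steps from earlier:

1. Instantiate (what is instantiated is a transformer class, a Transformer)

2. Call fit_transform (for building a term-frequency matrix from documents, the two steps cannot always be called together; presumably because the vocabulary fitted on the training documents has to be reused for new documents)

fit_transform: takes the input data and transforms it directly (fit followed by transform in one call)

fit: takes the input data and learns from it (for example the mean and standard deviation) without transforming it

transform: performs the actual data transformation, using what fit learned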
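The split between fit, transform and fit_transform can be illustrated with a tiny standardizer written from scratch. MiniStandardScaler is a hypothetical single-feature class for illustration only, not scikit-learn's StandardScaler; the numbers are made up.

```python
import math


class MiniStandardScaler:
    """Hypothetical single-feature scaler illustrating fit / transform."""

    def fit(self, values):
        # Learn the statistics only; the data itself is not changed.
        self.mean_ = sum(values) / len(values)
        variance = sum((v - self.mean_) ** 2 for v in values) / len(values)
        self.std_ = math.sqrt(variance)
        return self

    def transform(self, values):
        # Apply the statistics that fit learned earlier.
        return [(v - self.mean_) / self.std_ for v in values]

    def fit_transform(self, values):
        # Convenience: fit, then transform the same data.
        return self.fit(values).transform(values)


train = [1.0, 2.0, 3.0]
test = [2.0, 4.0]

scaler = MiniStandardScaler()
train_scaled = scaler.fit_transform(train)  # statistics come from the training data
test_scaled = scaler.transform(test)        # the test data reuses those statistics
```

Calling transform (not fit_transform) on the test data is the same pattern the later examples in this article follow: the test set is scaled with the training set's mean and standard deviation.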

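Estimators

In sklearn, the estimator plays a central role. Classifiers and regressors are both estimators: a family of APIs that implement algorithms.

Estimators for classification

sklearn.neighbors: k-nearest neighbors

sklearn.naive_bayes: naive Bayes

sklearn.linear_model.LogisticRegression: logistic regression

Estimators for regression

sklearn.linear_model.LinearRegression: linear regression

sklearn.linear_model.Ridge: ridge regression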

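2. K-nearest neighbors

2.1 The k-nearest neighbors algorithm (KNN)

Definition: if most of the k samples most similar to a given sample (that is, its k nearest neighbors in feature space) belong to one class, then the sample is assigned to that class as well.

Origin: KNN is a classification algorithm first proposed by Cover and Hart.

Distance formula

The distance between two samples can be computed with the following formula, known as the Euclidean distance.

For example, for a(a1, a2, a3) and b(b1, b2, b3):

$$d(a, b) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + (a_3 - b_3)^2}$$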
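As a quick check of the Euclidean distance between two three-dimensional samples, here is a small sketch. The function name and the sample points are made up for illustration.

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared differences, coordinate by coordinate."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


a = (1.0, 2.0, 3.0)
b = (4.0, 6.0, 3.0)

# Differences are 3, 4 and 0, so the distance is sqrt(9 + 16 + 0).
d = euclidean_distance(a, b)
```

Because every coordinate contributes on its own scale, features with large numeric ranges dominate this distance, which is why the KNN example in this article standardizes the features first.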

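k-nearest neighbors API

sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, algorithm='auto')

> n_neighbors: int, optional (default = 5). The number of neighbors used by default for kneighbors queries. Question to consider: if k is 6 and two classes receive the same number of votes, what should happen? (An even k can produce ties.)

> algorithm: {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional. The algorithm used to compute the nearest neighbors: 'ball_tree' uses BallTree, 'kd_tree' uses KDTree, 'brute' uses a brute-force search, and 'auto' tries to decide the most suitable algorithm from the values passed to fit. (The choice of implementation affects efficiency.)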
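The voting rule, including what can happen with an even k, can be sketched in a few lines of plain Python. This is a brute-force illustration under stated assumptions, not scikit-learn's KNeighborsClassifier: the tie-breaking rule (the closest neighbor among the tied classes wins) is a choice made for this sketch, and the data points are made up.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to the query."""
    by_distance = sorted(
        ((euclidean(x, query), label) for x, label in zip(train_x, train_y)),
        key=lambda pair: pair[0],
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    votes = Counter(nearest_labels)
    top = max(votes.values())
    tied = {label for label, count in votes.items() if count == top}
    # Assumed tie-break for this sketch: the closest neighbor among tied classes wins.
    return next(label for label in nearest_labels if label in tied)


# Made-up data: cluster "A" near the origin, cluster "B" far away,
# plus one "B" outlier sitting inside cluster "A".
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (0.4, 0.4)]
train_y = ["A", "A", "A", "B", "B", "B", "B"]
query = (0.5, 0.5)

pred_k1 = knn_predict(train_x, train_y, query, k=1)  # decided by the outlier alone
pred_k2 = knn_predict(train_x, train_y, query, k=2)  # one vote each: a tie
pred_k3 = knn_predict(train_x, train_y, query, k=3)  # two "A" votes outweigh the outlier
```

With k=1 the single outlier decides the result, with k=2 the vote is tied and the tie-break rule matters, and with k=3 the majority of genuine neighbors wins.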

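2.2 KNN example: predicting check-in locations

File descriptions

train.csv, test.csv

row_id: ID of the check-in event

x, y: coordinates

accuracy: location accuracy

time: timestamp

place_id: ID of the business; this is the target you are predicting

sample_submission.csv: a sample submission file in the correct format, with random predictions

Data processing:

(1) Narrow the range of x and y (keep only a small region to reduce the amount of data)

(2) Convert the timestamp (year, month, day, hour, minute, second) into new features

(3) Drop places with fewer than a given number of check-ins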
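Steps (2) and (3) can be sketched with the standard library alone. The three (time, place_id) pairs below are taken from the sample rows printed later in this article; the helper name and the threshold of 1 are assumptions for illustration, and the timestamps are interpreted as seconds since 1970-01-01 UTC.

```python
from collections import Counter
from datetime import datetime, timezone

rows = [
    {"time": 65380, "place_id": 6683426742},
    {"time": 785470, "place_id": 6683426742},
    {"time": 400082, "place_id": 6889790653},
]


def time_features(seconds):
    """Turn a timestamp in seconds into day, hour and weekday (Monday = 0)."""
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return {"day": moment.day, "hour": moment.hour, "weekday": moment.weekday()}


features = [time_features(row["time"]) for row in rows]

# Step (3): keep only places with more than `min_checkins` check-ins.
min_checkins = 1  # hypothetical threshold for this tiny sample
counts = Counter(row["place_id"] for row in rows)
kept = [row for row in rows if counts[row["place_id"]] > min_checkins]
```

On the full dataset the same two steps are done with pandas (pd.to_datetime and groupby), as in the complete example further on.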

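Questions

1. How large should k be, and what effect does it have?

k too small: the prediction is easily affected by outliers

k too large: the neighborhood takes in many samples that are not really close, so the vote is skewed by the class proportions in the data

2. What about performance?

K-nearest neighbors

Pros and cons of k-nearest neighbors

Pros: simple, easy to understand, easy to implement; no parameters to estimate and no training phase

Cons: it is a lazy algorithm, so classifying a test sample is computationally expensive and memory-hungry; k must be specified, and a poorly chosen k gives no guarantee on classification accuracy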

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def nkkcls():
    """
    Predict user check-in locations with k-nearest neighbors
    :return: None
    """
    data = pd.read_csv('./train.csv')
    # print(data.head(10))
    '''
       row_id       x       y  accuracy    time    place_id
    0       0  0.7941  9.0809        54  470702  8523065625
    1       1  5.9567  4.7968        13  186555  1757726713
    2       2  8.3078  7.0407        74  322648  1137537235
    3       3  7.3665  2.5165        65  704587  6567393236
    4       4  4.0961  1.1307        31  472130  7440663949
    5       5  3.8099  1.9586        75  178065  6289802927
    6       6  6.3336  4.3720        13  666829  9931249544
    7       7  5.7409  6.7697        85  369002  5662813655
    8       8  4.3114  6.9410         3  166384  8471780938
    9       9  6.3414  0.0758        65  400060  1253803156
    '''

    # Shrink the data, like a WHERE condition in SQL
    data = data.query('x > 1.0 & x < 1.25 & y > 2.5 & y<2.75')
    # print(data)
    '''
          row_id       x       y  accuracy    time    place_id
    600      600  1.2214  2.7023        17   65380  6683426742
    957      957  1.1832  2.6891        58  785470  6683426742
    4345    4345  1.1935  2.6550        11  400082  6889790653
    4735    4735  1.1452  2.6074        49  514983  6822359752
    5580    5580  1.0089  2.7287        19  732410  1527921905
    ...
    [17710 rows x 6 columns]
    '''

    # Process the timestamp
    time_value = pd.to_datetime(data['time'], unit='s')
    # print(time_value)
    '''
    600        1970-01-01 18:09:40
    957        1970-01-10 02:11:10
    4345       1970-01-05 15:08:02
    4735       1970-01-06 23:03:03
    5580       1970-01-09 11:26:50
    ...
    Name: time, Length: 17710, dtype: datetime64[ns]
    '''

    # A DatetimeIndex exposes day, hour, weekday, ... as attributes
    time_value = pd.DatetimeIndex(time_value)

    data['day'] = time_value.day
    data['hour'] = time_value.hour
    data['weekday'] = time_value.weekday
    # print(data)  # [17710 rows x 9 columns]

    # The raw timestamp is no longer needed
    data = data.drop(['time'], axis=1)
    # print(data)  # [17710 rows x 8 columns]

    # Count check-ins per place
    place_count = data.groupby('place_id').count()
    # print(place_count)  # [805 rows x 7 columns]

    # Keep only places with more than 3 check-ins
    tf = place_count[place_count.row_id > 3].reset_index()
    # print(tf)  # [239 rows x 8 columns]

    data = data[data['place_id'].isin(tf.place_id)]
    # print(data)  # [16918 rows x 8 columns]

    # Feature values and target values for classification
    x = data.drop(['place_id'], axis=1)
    y = data['place_id']

    # Split the data
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

    # Feature engineering (standardization):
    # fit on the training set, then apply the same scaling to the test set
    std = StandardScaler()
    x_train = std.fit_transform(x_train)
    x_test = std.transform(x_test)

    # Run the algorithm
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(x_train, y_train)

    # Predictions (the two arrays recorded in the original come from two different runs)
    y_predict = knn.predict(x_test)
    print(y_predict)
    # [6424972551 2355236719 3952821602 ... 3992589015 6424972551 3533177779]
    # [6683426742 8258328058 6399991653 ... 8048985799 7914558846 1435128522]

    # Prediction accuracy (two values recorded in the original)
    score = knn.score(x_test, y_test)
    print(score)  # 0.02860545626477  # 0.4148936170212766


if __name__ == '__main__':
    nkkcls()

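3. Naive Bayes

3.1 The naive Bayes algorithm

Probability is defined as the likelihood that an event occurs

> A coin is tossed and lands heads up

> A given day is sunny

Questions

1. What is the probability of being liked by the dream girl? 4/7

2. What is the probability that the occupation is programmer and the body type is well-proportioned? Joint probability (treating the two attributes as independent): P(programmer, well-proportioned) = 3/7 * 4/7

3. Given that the dream girl likes the person, what is the probability that the occupation is programmer? 2/4

4. Given that the dream girl likes the person, what is the probability that the occupation is product manager and the weight is overweight? Conditional probability: P(product, overweight | liked) = P(product | liked) P(overweight | liked) = 2/4 * 1/4

Joint probability and conditional probability

Joint probability: the probability that several conditions all hold at the same time

Notation: P(A, B)

Conditional probability: the probability that event A occurs given that another event B has already occurred

Notation: P(A|B)

Property: P(A1, A2 | B) = P(A1|B) P(A2|B)

Note: this property holds only because A1 and A2 are taken to be independent of each other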
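The four questions can be checked by counting rows in a small table. The table the questions refer to is not included in this text, so the seven rows below are an assumption: they were chosen only so that the counts match the fractions quoted above (4/7, 3/7, 4/7, 2/4, 1/4). The helper names are hypothetical.

```python
from fractions import Fraction

# (occupation, body type, liked) - assumed rows, consistent with the quoted fractions
samples = [
    ("programmer", "overweight", False),
    ("product", "well-proportioned", True),
    ("programmer", "well-proportioned", True),
    ("programmer", "overweight", True),
    ("designer", "well-proportioned", False),
    ("designer", "overweight", False),
    ("product", "well-proportioned", True),
]


def prob(event, given=None):
    """P(event | given), estimated by counting rows of the table."""
    pool = [s for s in samples if given is None or given(s)]
    return Fraction(sum(1 for s in pool if event(s)), len(pool))


def liked(s):
    return s[2]


def is_product(s):
    return s[0] == "product"


def is_overweight(s):
    return s[1] == "overweight"


p_liked = prob(liked)
p_programmer_given_liked = prob(lambda s: s[0] == "programmer", given=liked)

# Naive (independence) estimate versus direct counting for question 4.
p_naive = prob(is_product, given=liked) * prob(is_overweight, given=liked)
p_counted = prob(lambda s: is_product(s) and is_overweight(s), given=liked)
```

In this assumed table the product of the two conditional probabilities is 1/8, while direct counting gives 0, which shows that the factorization is an assumption of independence rather than an identity.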

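Naive Bayes: Bayes' formula

$$P(C \mid W) = \frac{P(W \mid C)\,P(C)}{P(W)}$$

Here W stands for the features of the document being predicted (its words F1, F2, ...) and C for a document category.

The formula has three parts:

P(C): the probability of each document category (number of words in documents of that category / total number of words in all documents)

P(W|C): the probability of the features (the words that appear in the document being predicted) given the category

How to compute it: P(F1|C) = Ni/N

Ni is the number of times the word F1 occurs in all documents of category C

N is the total number of word occurrences in all documents of category C

P(F1, F2, ...): the probability of the words in the document being predicted

P(document category | document features)

P(tech | cloud computing, 5G) = P(cloud computing, 5G | tech) P(tech) / P(W)

P(programming | cloud computing, 5G) = P(cloud computing, 5G | programming) P(programming) / P(W)

Naive Bayes: a worked example

A document to be classified contains the words cinema, Alipay and cloud computing. What are the probabilities that it belongs to the tech category and to the entertainment category?

P(tech | cinema, Alipay, cloud computing) ∝ P(cinema, Alipay, cloud computing | tech) P(tech) = 8/100 * 20/100 * 63/100 * 100/221

P(entertainment | cinema, Alipay, cloud computing) ∝ P(cinema, Alipay, cloud computing | entertainment) P(entertainment) = 56/121 * 15/121 * 0/121 * 121/221 = 0

The denominator P(W) is the same for both categories, so it is left out when comparing them. The count of "cloud computing" in entertainment documents is 0; the original writes this factor as (0+1)/121, which anticipates Laplace smoothing.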
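The arithmetic of the worked example can be checked with exact fractions. This sketch takes the word counts quoted in the example at face value and uses the raw count of 0 for "cloud computing" in the entertainment category (no smoothing); the variable names are chosen for illustration.

```python
from fractions import Fraction as F

# P(cinema|tech) * P(Alipay|tech) * P(cloud computing|tech) * P(tech)
tech = F(8, 100) * F(20, 100) * F(63, 100) * F(100, 221)

# Same product for entertainment; the zero count wipes out the whole product.
entertainment = F(56, 121) * F(15, 121) * F(0, 121) * F(121, 221)

tech_as_float = float(tech)
predicted = "tech" if tech > entertainment else "entertainment"
```

Both values omit the common denominator P(W), so they are scores for comparison rather than normalized probabilities.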

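Laplace smoothing

Problem: in the example above the probability for entertainment comes out as 0, which is unreasonable. If many words in the word-frequency list have a count of 0, the result is very likely to be zero for every category.

Solution: a Laplace smoothing coefficient

$$P(F_1 \mid C) = \frac{N_i + \alpha}{N + \alpha m}$$

Here α is the specified smoothing coefficient (usually 1) and m is the number of distinct feature words counted in the training documents.

sklearn naive Bayes API

sklearn.naive_bayes.MultinomialNB

sklearn.naive_bayes.MultinomialNB(alpha=1.0): naive Bayes classifier

alpha: the Laplace smoothing coefficient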
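The effect of the smoothing coefficient can be sketched with the usual add-alpha form (Ni + alpha) / (N + alpha * m). The counts 56, 15, 0 and the total 121 come from the entertainment example above; the value m = 4 (number of distinct feature words) is an assumption made for this sketch, since the text does not state it.

```python
from fractions import Fraction


def smoothed(count, total, alpha=1, m=4):
    """Add-alpha estimate of P(word | category); m = 4 is an assumed vocabulary size."""
    return Fraction(count + alpha, total + alpha * m)


prior = Fraction(121, 221)  # P(entertainment), as in the worked example

# Without smoothing (alpha = 0) the zero count makes the whole score zero.
unsmoothed_score = (
    smoothed(56, 121, alpha=0) * smoothed(15, 121, alpha=0) * smoothed(0, 121, alpha=0) * prior
)

# With alpha = 1 every factor is strictly positive.
smoothed_score = smoothed(56, 121) * smoothed(15, 121) * smoothed(0, 121) * prior
```

A larger alpha pulls all word probabilities toward the uniform value 1/m, so it trades fidelity to the observed counts for robustness against unseen words.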

3.2 Naive Bayes case study

Classifying the 20 newsgroups data with sklearn

Steps

1. Load the 20 newsgroups data and split it

2. Extract feature words from the articles

3. Fit a naive Bayes estimator and predict

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split


def naive_bayes():
    """
    Text classification with naive Bayes
    :return: None
    """
    news = fetch_20newsgroups(subset='all')

    # Split the data
    x_train, x_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25)

    # Feature extraction: fit the vocabulary on the training texts only,
    # then reuse it for the test texts
    tf = TfidfVectorizer()
    x_train = tf.fit_transform(x_train)
    x_test = tf.transform(x_test)

    # Run the algorithm
    mlt = MultinomialNB(alpha=1.0)
    mlt.fit(x_train, y_train)
    y_predict = mlt.predict(x_test)
    print(y_predict)  # [17 9 0 ... 6 2 16]

    # Accuracy
    print(mlt.score(x_test, y_test))  # 0.855475383956
    return None


if __name__ == '__main__':
    naive_bayes()

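3.3 Naive Bayes summary

Pros:

1. The naive Bayes model is rooted in classical mathematical theory and has stable classification performance.

2. It is not very sensitive to missing data and the algorithm is fairly simple; it is commonly used for text classification.

3. High classification accuracy and fast.

Cons:

1. The model's own assumption can hurt prediction quality: it treats features as independent of each other, so it performs poorly when the features are in fact correlated.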

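3.4 Evaluating classification models

Example: judging whether a phone is good to use or not

estimator.score(): the most commonly used metric is accuracy, i.e. the percentage of predictions that are correct

Confusion matrix

In a classification task, the predicted condition and the true condition can combine in four different ways, which together form the confusion matrix (this also applies to multi-class problems)

Precision and recall

Precision: of the samples predicted as positive, the proportion that are truly positive (how accurate the positive predictions are)

Recall: of the samples that are truly positive, the proportion predicted as positive (how completely positives are found; the ability to pick out positive samples)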
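The two definitions translate directly into counts of true positives, false positives and false negatives. The sketch below is a plain-Python illustration with made-up labels; the function name is hypothetical and it is not the implementation used by sklearn.metrics.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for one class from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels: 4 true positives in the data, 3 positive predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
```

Here accuracy looks reasonable at 0.7 while recall is only 0.5, which is why accuracy alone can be misleading when the positive class is the one that matters.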

Classification evaluation API

sklearn.metrics.classification_report

sklearn.metrics.classification_report(y_true, y_pred, target_names=None)

• y_true: the true target values

• y_pred: the target values predicted by the estimator

• target_names: names of the target classes

return: precision and recall for each class

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


def naive_bayes():
    """
    Text classification with naive Bayes
    :return: None
    """
    news = fetch_20newsgroups(subset='all')

    # Split the data
    x_train, x_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25)

    # Feature extraction
    tf = TfidfVectorizer()
    x_train = tf.fit_transform(x_train)
    x_test = tf.transform(x_test)

    # Run the algorithm
    mlt = MultinomialNB(alpha=1.0)
    mlt.fit(x_train, y_train)
    y_predict = mlt.predict(x_test)
    print(y_predict)  # [17 9 0 ... 6 2 16]

    # Accuracy
    print(mlt.score(x_test, y_test))  # 0.855475383956

    # Precision and recall
    print('Precision and recall for each class:',
          classification_report(y_test, y_predict, target_names=news.target_names))
    return None


if __name__ == '__main__':
    naive_bayes()

'''
[ 0 17 16 ...  8 11  0]
0.8491086587436333
Precision and recall for each class:
                          precision    recall  f1-score   support

             alt.atheism       0.90      0.73      0.81       206
           comp.graphics       0.89      0.76      0.82       233
 comp.os.ms-windows.misc       0.84      0.83      0.83       235
comp.sys.ibm.pc.hardware       0.73      0.87      0.79       245
   comp.sys.mac.hardware       0.92      0.85      0.88       247
          comp.windows.x       0.93      0.84      0.89       249
            misc.forsale       0.91      0.72      0.80       242
               rec.autos       0.93      0.90      0.92       250
         rec.motorcycles       0.95      0.95      0.95       254
      rec.sport.baseball       0.94      0.96      0.95       252
        rec.sport.hockey       0.89      0.98      0.93       234
               sci.crypt       0.82      0.97      0.89       267
         sci.electronics       0.87      0.79      0.83       248
                 sci.med       0.97      0.89      0.93       247
               sci.space       0.85      0.97      0.90       220
  soc.religion.christian       0.54      0.97      0.69       254
      talk.politics.guns       0.78      0.96      0.86       240
   talk.politics.mideast       0.90      0.97      0.93       234
      talk.politics.misc       0.99      0.58      0.74       202
      talk.religion.misc       1.00      0.18      0.30       153

                accuracy                           0.85      4712
               macro avg       0.88      0.83      0.83      4712
            weighted avg       0.87      0.85      0.84      4712
'''

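Model selection and tuning

1. Cross-validation

2. Grid search

Cross-validation

Cross-validation: makes the evaluation of a model more accurate and trustworthy

The cross-validation procedure

Cross-validation: split the available data into a training part and a validation part. For example, divide the data into 5 parts and use one of them as the validation set. Then run 5 rounds of testing, each time with a different part as the validation set. This yields 5 sets of model results, and their average is taken as the final result. This is called 5-fold cross-validation.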
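The fold rotation can be sketched as an index generator. This is a minimal illustration without shuffling, written for this article rather than taken from scikit-learn; the function name is hypothetical and the per-fold scores are made-up numbers used only to show the final averaging step.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_indices, validation_indices) once per fold, without shuffling."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `remainder` folds take one extra sample each.
        stop = start + fold_size + (1 if fold < remainder else 0)
        validation = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, validation
        start = stop


folds = list(k_fold_indices(10, n_folds=5))

# Made-up validation scores, one per fold; the final result is their mean.
fold_scores = [0.80, 0.85, 0.75, 0.90, 0.70]
mean_score = sum(fold_scores) / len(fold_scores)
```

Every sample lands in the validation set exactly once, so the averaged score uses all of the data for validation without ever validating on a sample the model was trained on in that round.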

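Hyperparameter search: grid search

Many parameters have to be specified by hand (such as k in k-nearest neighbors); these are called hyperparameters. Tuning them by hand is tedious, so instead we preset several hyperparameter combinations for the model, evaluate each combination with cross-validation, and finally build the model with the best combination.

Hyperparameter search: grid search API

sklearn.model_selection.GridSearchCV

GridSearchCV

sklearn.model_selection.GridSearchCV(estimator, param_grid=None, cv=None)

Performs an exhaustive search over the specified parameter values of an estimator

estimator: the estimator object

param_grid: estimator parameters (dict), e.g. {"n_neighbors": [1, 3, 5]}

cv: the number of cross-validation folds

fit: takes the training data

score: accuracy

Analyzing the results:

best_score_: the best result measured during cross-validation

best_estimator_: the model with the best parameters

cv_results_: the validation-set scores for each parameter combination and each fold (training-set scores are included only when return_train_score=True)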
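What grid search does can be sketched as two nested loops: one over candidate values of k, one over cross-validation folds. The code below is a plain-Python illustration on made-up one-dimensional data with a single mislabeled point; the helper names are hypothetical, the folds are interleaved for simplicity, and none of this is scikit-learn's implementation.

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """1-D KNN: majority vote among the k closest training values."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cross_val_accuracy(x, y, k, n_folds):
    """Mean validation accuracy over n_folds interleaved folds."""
    scores = []
    for fold in range(n_folds):
        val_idx = [i for i in range(len(x)) if i % n_folds == fold]
        train_idx = [i for i in range(len(x)) if i % n_folds != fold]
        train_x = [x[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        hits = sum(knn_predict(train_x, train_y, x[i], k) == y[i] for i in val_idx)
        scores.append(hits / len(val_idx))
    return sum(scores) / n_folds


# Made-up data: class 0 at small values, class 1 at large values,
# with the point x = 3 deliberately mislabeled as class 1.
x = [1, 2, 3, 4, 5, 6, 11, 12, 13, 14, 15, 16]
y = [0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1]

param_grid = {"n_neighbors": [1, 3, 5]}

# Evaluate every candidate with 2-fold cross-validation and keep the best one.
results = {k: cross_val_accuracy(x, y, k, n_folds=2) for k in param_grid["n_neighbors"]}
best_k = max(results, key=results.get)
best_score = results[best_k]
```

The dictionary results plays the role of cv_results_, and best_k and best_score correspond to the best parameter and best_score_ described above.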

K-nearest neighbors grid search example

Rework the earlier k-nearest neighbors example to use grid search

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def nkkcls():
    """
    Predict user check-in locations with k-nearest neighbors
    :return: None
    """
    data = pd.read_csv('./train.csv')
    # print(data.head(10))

    # Shrink the data, like a WHERE condition in SQL (left commented out, as in the original)
    # data = data.query('x > 1.0 & x < 1.25 & y > 2.5 & y<2.75')
    # print(data)

    # Process the timestamp
    time_value = pd.to_datetime(data['time'], unit='s')
    time_value = pd.DatetimeIndex(time_value)
    data['day'] = time_value.day
    data['hour'] = time_value.hour
    data['weekday'] = time_value.weekday
    # print(data)

    data = data.drop(['time'], axis=1)

    # Keep only places with more than 3 check-ins
    place_count = data.groupby('place_id').count()
    # print(place_count)
    tf = place_count[place_count.row_id > 3].reset_index()
    data = data[data['place_id'].isin(tf.place_id)]

    # Feature values and target values for classification
    x = data.drop(['place_id'], axis=1)
    y = data['place_id']

    # Split the data
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

    # Feature engineering (standardization) on the training and test features
    std = StandardScaler()
    x_train = std.fit_transform(x_train)
    x_test = std.transform(x_test)

    # Run the algorithm
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(x_train, y_train)

    # Grid search; the keyword is lowercase cv (the original's CV=2 raises a TypeError)
    param = {"n_neighbors": [3, 5, 10]}
    gc = GridSearchCV(knn, param, cv=2)
    gc.fit(x_train, y_train)

    # Accuracy on the test set
    print(gc.score(x_test, y_test))
    print("Best result in cross-validation:", gc.best_score_)
    print("Best model selected:", gc.best_estimator_)
    print("Cross-validation results for each hyperparameter:", gc.cv_results_)


if __name__ == '__main__':
    nkkcls()
