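01_01 Python Machine Learning_Chapter 1 Study Notes_Loading Sample Data and Drawing a Scatter Plot

Time: 2024-06-11 02:51:28

Chapter 1 study notes_Loading sample data & drawing a scatter plot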

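01 Commonly used packages

Python can solve many kinds of problems, and the solutions rely on many different packages, which makes them hard to keep in mind.

To make them easier to remember, here is a plain-language description of what each package does.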

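    # The most basic package for scientific computing.
    # At its core it is a multidimensional array plus functions that operate on arrays;
    # nearly everything in machine learning builds on it.
    import numpy as np

    # Higher-level mathematical functions built on top of numpy.
    import scipy

    # Stores and presents data in tabular form and can consume numpy data directly.
    # Both numpy and pandas hold collections of data, but numpy focuses on multidimensional
    # arrays whose elements must all share one data type,
    # while pandas focuses on tables of data. Simply picture an Excel sheet.
    import pandas as pd

    # Plotting tool; it can plot data in numpy format directly.
    import matplotlib

    # Interactive Python.
    # The plain python prompt started from cmd gives no hints while you type;
    # IPython does.
    import IPython

    # ---------------
    # Easy to remember: packages named xxxxlearn are machine learning packages.

    # The basic machine learning library. It ships free sample datasets;
    # the example below uses data from this package.
    import sklearn

    # Plotting helpers for machine learning (used by the book's examples).
    import mglearn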
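The difference between numpy and pandas described in the comments can be seen in a few lines. This is only a minimal sketch; the values and the column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# A numpy array holds a single dtype: mixing numbers and text turns everything into strings.
mixed_array = np.array([5.1, 3.5, "setosa"])

# A pandas DataFrame keeps a separate dtype per column, like columns in a spreadsheet.
table = pd.DataFrame({
    "sepal length (cm)": [5.1, 4.9],
    "species": ["setosa", "setosa"],
})

print(mixed_array.dtype)  # a unicode string dtype
print(table.dtypes)       # float64 for the measurements; the text column's dtype depends on the pandas version
```

The numeric column stays numeric inside the table, which is why pandas is the more convenient container once labels or text sit next to the measurements.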

02 Starting from an example

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    import pandas as pd
    import mglearn
    import matplotlib.pyplot as pt

    iris_dataset = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        iris_dataset['data'], iris_dataset['target'], random_state=0)

    iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)
    pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
                               hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)
    pt.show()

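If this leaves you as baffled as it left me, we are at the same level, so let's try to work out what the program does.

First a short summary of its purpose, so that we can read the code with the right questions in mind.

What it does:

Use a sample dataset shipped with the machine learning package to create training data and test data
Build a data table from the training data
Check visually whether the samples in the table can be told apart (the code plots the training data; the test data is not used yet)
Display the result as a plot

Next, let's go through the program piece by piece: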

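02_001 About the imports

    # sklearn is the basic machine learning package. Its datasets module ships some free sample datasets;
    # load_iris loads the sample dataset used for modelling iris flowers.
    from sklearn.datasets import load_iris

    # train_test_split splits a dataset.
    # By default it splits one set of data into 75% for building the model and 25% for testing.
    from sklearn.model_selection import train_test_split

    # pandas provides plotting functions for its data.
    # Why not use matplotlib directly? Data usually needs some preparation before it is plotted,
    # and going through pandas as an intermediate layer takes care of most of that.
    import pandas as pd

    # pandas can plot, but plotting is not its main job, so it cannot deliver elaborate color schemes on its own.
    # mglearn is a helper package for plotting: it bundles color schemes and ready-made views of data.
    # In the spirit of not reinventing the wheel, it is a sensible choice when plotting.
    # Also note: mglearn only prepares what the plot needs (here the color map mglearn.cm3);
    # the drawing itself is not done by mglearn.
    import mglearn

    # The main interface for drawing figures.
    import matplotlib.pyplot as pt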

02_002 Using a sample dataset from the machine learning package to create training data and test data

Loading the sample data:

    # Load the iris dataset. The loaded object behaves much like a dict.
    >>> iris_dataset = load_iris()
    >>> print(type(iris_dataset))
    <class 'sklearn.utils.Bunch'>
    >>> iris_dataset.keys()
    dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

    # What each key of the dataset is for:

    # == 'data' ==
    # Holds the measurements. In this example it is a two-dimensional array.
    >>> iris_dataset.data
    array([[5.1, 3.5, 1.4, 0.2],
           [4.9, 3. , 1.4, 0.2],
           [4.7, 3.2, 1.3, 0.2],
           [4.6, 3.1, 1.5, 0.2],
           [5. , 3.6, 1.4, 0.2],
           ....])

    # Shape of the two-dimensional array:
    # the first dimension is the number of samples,
    # the second dimension is the number of feature values per sample.
    >>> iris_dataset.data.shape
    (150, 4)

    # == 'feature_names' ==
    # Holds the names of the features. There should be as many names as
    # there are values in the second dimension of data.
    >>> iris_dataset.feature_names
    ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

    # == 'target_names' ==
    # Taken together, the features of one sample lead to a conclusion (its class).
    # target_names holds the possible conclusions; in this example all samples fall into three classes.
    >>> iris_dataset.target_names
    array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

    # == 'target' ==
    # Holds the conclusion for every sample. To save memory the names are not written out;
    # each entry is the position of the class in target_names.
    >>> iris_dataset.target
    array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

    # == 'DESCR' ==
    # The descriptive documentation of the dataset, i.e. its readme.
    # It is easy to remember as the abbreviation of "description".

    # == 'filename' ==
    # Where the data is read from: the actual data file.
    # You can find this file on your own machine; it contains nothing but the raw measurements.
    >>> iris_dataset.filename
    'iris.csv'

    # == 'data_module' ==
    # Where the data comes from: the package that contains the data file.
    >>> iris_dataset.data_module
    'sklearn.datasets.data'
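Because target stores positions in target_names, the class name of every sample can be recovered by indexing one array with the other. A minimal sketch, assuming scikit-learn and numpy are installed:

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Fancy indexing: each integer label selects the matching name from target_names.
species = iris_dataset.target_names[iris_dataset.target]

# Count how many samples fall into each of the three classes.
counts = np.bincount(iris_dataset.target)

print(species[:3])  # class names of the first three samples
print(counts)       # one count per class
```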

Splitting the samples into data for building the model and data for testing:

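    >>> X_train, X_test, y_train, y_test = train_test_split(
    ...     iris_dataset['data'], iris_dataset['target'], random_state=0)

    # From the definition of train_test_split below you can see that 'data' and 'target'
    # are both passed in as arrays. train_test_split accepts any number of them
    # and returns the corresponding pieces for each.
    # random_state=0 needs a word of explanation: it is the random seed used for shuffling.
    # The value 0 has no special meaning; any fixed integer guarantees that, however often
    # the program runs, the training data and test data come out the same every time.
    #
    # def train_test_split(
    #     *arrays,
    #     test_size=None,
    #     train_size=None,
    #     random_state=None,
    #     shuffle=True,
    #     stratify=None,
    # ):

    # What train_test_split does: it splits every array passed in *arrays into two parts,
    # by default 75% and 25%.
    # In the example above iris_dataset['data'] is split into X_train (75%) and X_test (25%),
    # and iris_dataset['target'] likewise into y_train (75%) and y_test (25%).
    # The ratio can be changed with test_size: a float between 0 and 1 is a fraction
    # (an int is taken as an absolute number of samples).
    # For example, test_size=0.5 splits the data into two equal halves.
    # The default 75/25 split is a common rule of thumb rather than a proven optimum.

    # Size of the data before the split
    >>> iris_dataset.data.shape
    (150, 4)

    # Sizes after the split
    >>> X_train.shape
    (112, 4)
    >>> X_test.shape
    (38, 4)
    >>> y_train.shape
    (112,)
    >>> y_test.shape
    (38,)

    # Checking the split ratio
    >>> 112 / 150
    0.7466666666666667
    >>> 38 / 150
    0.25333333333333335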
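The two points that are easiest to misread, the fixed seed and the 112/38 sizes, can be checked directly. The sketch below splits a plain range of 150 numbers (a stand-in for the iris rows). As far as I can tell, the test share of 37.5 samples is rounded up, which would explain the 38.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 150
samples = np.arange(n_samples)  # stand-in for the 150 iris rows

# Two calls with the same seed shuffle identically and therefore give the same split.
train_a, test_a = train_test_split(samples, random_state=0)
train_b, test_b = train_test_split(samples, random_state=0)
same_split = np.array_equal(test_a, test_b)

# Default test share: 0.25 * 150 = 37.5, rounded up to 38 test samples.
expected_test = math.ceil(0.25 * n_samples)

# test_size=0.5 produces two halves of 75 samples each.
half_train, half_test = train_test_split(samples, test_size=0.5, random_state=0)

print(same_split, len(train_a), len(test_a), expected_test, len(half_train), len(half_test))
```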

02_003 Building a data table from the training data

    >>> iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)

    # DataFrame accepts many input formats, too many to cover here, so let's stick to this example.
    # First, the parameter documentation of the constructor:
    # Parameters
    # ----------
    # data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
    #     Dict can contain Series, arrays, constants, dataclass or list-like objects. If
    #     data is a dict, column order follows insertion-order. If a dict contains Series
    #     which have an index defined, it is aligned by its index.
    #     .. versionchanged:: 0.25.0
    #        If data is a list of dicts, column order follows insertion-order.
    # index : Index or array-like
    #     Index to use for resulting frame. Will default to RangeIndex if
    #     no indexing information part of input data and no index provided.
    # columns : Index or array-like
    #     Column labels to use for resulting frame when data does not have them,
    #     defaulting to RangeIndex(0, 1, 2, ..., n). If data contains column labels,
    #     will perform column selection instead.
    # dtype : dtype, default None
    #     Data type to force. Only a single dtype is allowed. If None, infer.
    # copy : bool or None, default None
    #     Copy data from inputs.
    #     For dict data, the default of None behaves like ``copy = True``. For DataFrame
    #     or 2d ndarray input, the default of None behaves like ``copy = False``.
    #     .. versionchanged:: 1.3.0
    # See Also
    # --------
    # DataFrame.from_records : Constructor from tuples, also record arrays.
    # DataFrame.from_dict : From dicts of Series, arrays, or dicts.
    # read_csv : Read a comma-separated values (csv) file into DataFrame.
    # read_table : Read general delimited file into DataFrame.
    # read_clipboard : Read text from clipboard into DataFrame.
    # ------------
    >>> iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)

    # Apart from the data itself, the example uses only one parameter: columns.
    # columns describes what each column of a row in X_train means.
    >>> X_train
    array([[4.6, 3.1, 1.5, 0.2],
           [5.9, 3. , 5.1, 1.8],
           [5.1, 2.5, 3. , 1.1],
           ......
    >>> iris_dataset.feature_names
    ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
    >>> X_train.shape
    (112, 4)

    # X_train holds 112 training samples, each with 4 features. By position they are
    # 0: 'sepal length (cm)', 1: 'sepal width (cm)', 2: 'petal length (cm)', 3: 'petal width (cm)'
    # Someone suggested picturing it as an Excel sheet, which I find very fitting.

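The resulting data table has one row per sample and one column per feature.

The first column (the index) is generated in addition and marks the position of each row; everything else comes from the arguments passed to the function.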
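To see that layout, it is enough to build a table from the first three rows of X_train printed above. This is a small sketch rather than the full 112-row table:

```python
import numpy as np
import pandas as pd

# The first three rows of X_train as printed above.
rows = np.array([
    [4.6, 3.1, 1.5, 0.2],
    [5.9, 3.0, 5.1, 1.8],
    [5.1, 2.5, 3.0, 1.1],
])
feature_names = ["sepal length (cm)", "sepal width (cm)",
                 "petal length (cm)", "petal width (cm)"]

small_frame = pd.DataFrame(rows, columns=feature_names)

# The leftmost column of the printout is the automatically generated index.
print(small_frame)
```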

02_004 Checking whether the samples in the data table can be told apart

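Each sample in the iris dataset has 4 features, but a flat chart struggles to show 3 or more features at once.

A bar chart, for example, has only a horizontal and a vertical axis, so it can show just 2 of the features of a sample.

That is clearly not what we want. The solution given in the book is to use scatter plots.

Roughly speaking, every iris sample is drawn as a point, with one scatter plot per pair of features, and all points that belong to the same class share one color.

This does solve the problem of displaying samples that have more than two features.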
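How many plots this takes for four features can be counted with the standard library alone. A small sketch:

```python
from itertools import combinations

feature_names = ["sepal length (cm)", "sepal width (cm)",
                 "petal length (cm)", "petal width (cm)"]

# Every unordered pair of distinct features gets its own scatter plot.
feature_pairs = list(combinations(feature_names, 2))

# The full matrix has one panel per (row, column) position, diagonal included.
panels_in_grid = len(feature_names) ** 2

print(len(feature_pairs), panels_in_grid)
```

Each pair shows up twice in the grid, once above and once below the diagonal with the axes swapped.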

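Drawing these scatter plots requires the function [pd.plotting.scatter_matrix], which takes its data as a DataFrame,

which is why the earlier step contains the code [pd.DataFrame(X_train, columns=iris_dataset.feature_names)].

In any programming language, each step prepares the ground for the code that follows. Skimming the whole program first is sometimes faster than getting stuck on a single fragment.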

The help text of scatter_matrix can be found in:

> .venv\Lib\site-packages\pandas\plotting\_misc.py

The example in the book is written like this:

    grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
                                     hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)

When I ran it, the figure from the book did not appear. Stepping through with the debugger, this is what showed up:

    >>> grr
    array([[<AxesSubplot:xlabel='sepal length (cm)', ylabel='sepal length (cm)'>,
            <AxesSubplot:xlabel='sepal width (cm)', ylabel='sepal length (cm)'>,
            <AxesSubplot:xlabel='petal length (cm)', ylabel='sepal length (cm)'>,
            <AxesSubplot:xlabel='petal width (cm)', ylabel='sepal length (cm)'>],
           [<AxesSubplot:xlabel='sepal length (cm)', ylabel='sepal width (cm)'>,
            <AxesSubplot:xlabel='sepal width (cm)', ylabel='sepal width (cm)'>,
            <AxesSubplot:xlabel='petal length (cm)', ylabel='sepal width (cm)'>,
            <AxesSubplot:xlabel='petal width (cm)', ylabel='sepal width (cm)'>],
           [<AxesSubplot:xlabel='sepal length (cm)', ylabel='petal length (cm)'>,
            <AxesSubplot:xlabel='sepal width (cm)', ylabel='petal length (cm)'>,
            <AxesSubplot:xlabel='petal length (cm)', ylabel='petal length (cm)'>,
            <AxesSubplot:xlabel='petal width (cm)', ylabel='petal length (cm)'>],
           [<AxesSubplot:xlabel='sepal length (cm)', ylabel='petal width (cm)'>,
            <AxesSubplot:xlabel='sepal width (cm)', ylabel='petal width (cm)'>,
            <AxesSubplot:xlabel='petal length (cm)', ylabel='petal width (cm)'>,
            <AxesSubplot:xlabel='petal width (cm)', ylabel='petal width (cm)'>]],
          dtype=object)

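The function's help text shows that it simply returns an array, namely the grid of axes the plots are drawn on. Inspecting grr therefore prints that array, not a picture. No wonder I saw no figure!

> Returns
> -------
> numpy.ndarray
>     A matrix of scatter plots.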
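The return value can be inspected without opening any window. The sketch below uses made-up random data with three hypothetical columns ("a", "b", "c") and assumes a non-interactive matplotlib backend; it is not the book's example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 30 random rows with three made-up feature columns.
rng = np.random.default_rng(0)
demo_frame = pd.DataFrame(rng.normal(size=(30, 3)), columns=["a", "b", "c"])

axes = pd.plotting.scatter_matrix(demo_frame)

# The return value is a grid of Axes objects, one per panel.
print(type(axes), axes.shape)

# Every panel belongs to the same figure, which already exists at this point.
figure = axes[0, 0].figure
same_figure = figure is axes[2, 2].figure
print(same_figure)

plt.close(figure)
```

So the drawing has already happened when the function returns; what is still missing is a call that puts the figure on screen.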

The interface of the scatter matrix function:

> def scatter_matrix(
>     frame,
>     alpha=0.5,
>     figsize=None,
>     ax=None,
>     grid=False,
>     diagonal="hist",
>     marker=".",
>     density_kwds=None,
>     hist_kwds=None,
>     range_padding=0.05,
>     **kwargs,
> ):

Since we are using it anyway, here are a few quick notes for the record:

pandas scatter_matrix help documentation

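    grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
                                     hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)

    # frame
    #     the data table
    # alpha
    #     transparency
    # figsize
    #     size of the final figure
    # marker
    #     style of the points in the scatter plots
    # hist_kwds
    #     a named parameter of the signature; options for the histograms
    # c, s, cmap
    #     not named in the signature, so they are collected by **kwargs. At this point it is not yet clear
    #     what they do, but according to the help text they are passed further down. Ignore them for now.
    # **kwds
    #     Options to pass to matplotlib scatter plotting method.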
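Which arguments land in **kwargs follows from the signature alone. The toy function below is a hypothetical stand-in with a shortened parameter list, not the pandas implementation; the strings passed in are placeholders for the real objects.

```python
def toy_scatter_matrix(frame, alpha=0.5, figsize=None, marker=".", hist_kwds=None, **kwargs):
    """Hypothetical stand-in that only reports where each argument ends up."""
    named = {"alpha": alpha, "figsize": figsize, "marker": marker, "hist_kwds": hist_kwds}
    return named, kwargs


named, forwarded = toy_scatter_matrix(
    "iris_dataframe", c="y_train", figsize=(15, 15), marker="o",
    hist_kwds={"bins": 20}, s=60, alpha=0.8, cmap="cm3")

print(sorted(forwarded))  # the names that were not in the signature
```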

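02_005 Displaying the result by plotting

The data is ready; all that remains is to display it:

    pt.show()

This one line displays the figure. Unlike most function calls, show is called without any arguments. So how does it draw, and how does it find the data to plot?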

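02_005_001 Starting the investigation

The display code is very simple and configures nothing, so the figure must have been set up in the earlier steps.

Let's first look back at the code of the previous step; this is where the investigation has to start.

    import pandas as pd

    grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
                                     hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)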

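Looking for the scatter_matrix function

The plotting package is easy to find, but nothing in it is named 'scatter_matrix', which suggests the function is wired up in a special way.

In a case like this, '__init__.py' is usually the place to look.

.venv\Lib\site-packages\pandas\plotting

As expected, it is made available there, which shows that 'scatter_matrix' is written in '_misc':

    from pandas.plotting._misc import (
        andrews_curves,
        autocorrelation_plot,
        bootstrap_plot,
        deregister as deregister_matplotlib_converters,
        lag_plot,
        parallel_coordinates,
        plot_params,
        radviz,
        register as register_matplotlib_converters,
        scatter_matrix,
        table,
    )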

_misc.scatter_matrix does the following:

It fetches the plotting backend, hands every argument it received to the backend's function of the same name, and returns the result.

    from pandas.plotting._core import _get_plot_backend

    def scatter_matrix(
        frame,
        alpha=0.5,
        figsize=None,
        ax=None,
        grid=False,
        diagonal="hist",
        marker=".",
        density_kwds=None,
        hist_kwds=None,
        range_padding=0.05,
        **kwargs,
    ):
        plot_backend = _get_plot_backend("matplotlib")
        return plot_backend.scatter_matrix(
            frame=frame,
            alpha=alpha,
            figsize=figsize,
            ax=ax,
            grid=grid,
            diagonal=diagonal,
            marker=marker,
            density_kwds=density_kwds,
            hist_kwds=hist_kwds,
            range_padding=range_padding,
            **kwargs,
        )

Checking what _get_plot_backend returns

    _backends: dict[str, types.ModuleType] = {}

    def _get_plot_backend(backend: str | None = None):
        backend = backend or get_option("plotting.backend")

        # _backends starts out as an empty dict, so on the first call this branch is skipped
        if backend in _backends:
            return _backends[backend]

        # As the definition below shows, _load_backend(backend) returns [pandas.plotting._matplotlib]
        module = _load_backend(backend)
        _backends[backend] = module
        return module

    def _load_backend(backend: str) -> types.ModuleType:
        from importlib.metadata import entry_points

        if backend == "matplotlib":
            # Because matplotlib is an optional dependency and first-party backend,
            # we need to attempt an import here to raise an ImportError if needed.
            try:
                module = importlib.import_module("pandas.plotting._matplotlib")
            except ImportError:
                raise ImportError(
                    "matplotlib is required for plotting when the "
                    'default backend "matplotlib" is selected.'
                ) from None
            return module
        # (excerpt: the handling of other backends is not shown here)

A short summary at this point:

According to the analysis so far, [_get_plot_backend("matplotlib")] returns the module

[pandas.plotting._matplotlib], so the function that actually runs is [scatter_matrix] in [pandas.plotting._matplotlib.misc].

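Analysing what scatter_matrix actually does

This part is kept brief. It is not the focus of our study and a rough idea of what it does is enough; a full analysis would amount to learning how to use the whole package, which costs too much.

The function body contains this line:

    ax.scatter(df[b][common], df[a][common], marker=marker, alpha=alpha, **kwds)

Here ax is a matplotlib Axes, so this is a call to Axes.scatter, the same method that matplotlib.pyplot.scatter calls on the current axes. Readers with the energy can dig into it themselves.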

02_005_002 Conclusion of the investigation

    import pandas as pd

    grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
                                     hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)

is equivalent to handing the arguments over to matplotlib's scatter function, and drawing is exactly what matplotlib is for.
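This also suggests an answer to how show() finds its data without arguments: pyplot keeps track of the figures that are open, so anything drawn earlier is already registered. A minimal sketch, assuming a non-interactive backend and made-up points:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

plt.close("all")                  # start with no open figures
fig, ax = plt.subplots()          # pyplot creates the figure and registers it
ax.scatter([1, 2, 3], [4, 5, 6])  # the drawing happens on the Axes, long before show()

open_figures = plt.get_fignums()  # the figure numbers pyplot currently knows about
print(open_figures)
```

With an interactive backend, calling show() at this point would display the registered figure.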

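02_005_003 Looking at the plotting parameters

Next, let's look at the interface of scatter to understand what the example code does.

scatter_matrix official documentation

    # c
    #     Short for color. Here it is an array of plain numbers; each distinct number is drawn in one color.
    #     y_train is the result part that belongs to X_train, the 75% training data from the split above.
    #     Because the iris_dataframe we pass in was built from X_train, y_train must be used
    #     so that every training sample is colored by its own result.
    # cmap
    #     Color map, i.e. the color scheme. Like a theme with its own palette; choose the one you want here.
    #     >>> import mglearn
    #     >>> mglearn.cm3
    #     <matplotlib.colors.ListedColormap object at 0x0000021B59E339A0>
    # marker
    #     the symbol used for each point in the scatter plots
    # s
    #     size of the symbols
    # hist_kwds={'bins': 20}
    #     options for the histograms; bins=20 divides each histogram into 20 bins (bars)
    #     In the resulting figure the panels on the diagonal are histograms, all others are scatter plots.
    grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
                                     hist_kwds={'bins': 20}, s=60, alpha=.8, cmap=mglearn.cm3)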
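How numeric labels become colors can be reproduced with matplotlib's color classes alone. In this sketch the three colors and the label list are made up; they stand in for mglearn.cm3 and y_train and are not their real values.

```python
from matplotlib.colors import ListedColormap, Normalize

# Hypothetical three-color map standing in for mglearn.cm3 (the colors are made up).
three_colors = ListedColormap(["blue", "red", "green"])

labels = [0, 1, 2, 1, 0]  # stand-in for y_train
norm = Normalize(vmin=min(labels), vmax=max(labels))

# Scale each label to the range 0..1, then look up its color as an RGBA tuple.
point_colors = [three_colors(norm(label)) for label in labels]

print(len(set(point_colors)))  # number of distinct colors actually used
```

Samples with the same label always receive the same color, which is what makes the classes visible in the scatter plots.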

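[Figure: the scatter matrix that is finally displayed]

My understanding:

On the diagonal, the horizontal and vertical axes are the same feature, so a scatter plot there would only show points along a straight line. A histogram of that feature is drawn instead.

How can we tell whether this figure is correct?

My understanding:

In the function call we set [c=y_train], and y_train contains only three distinct results. So if the scatter plots contain exactly 3 colors, the figure is OK.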
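The claim that y_train holds only three distinct results can be checked in code instead of by eye. A minimal sketch that repeats the split from the example:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0)

# The distinct label values decide how many colors can appear in the scatter plots.
distinct_labels = np.unique(y_train)
print(distinct_labels, len(y_train))
```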