Data Preprocessing Methods in Data Mining

Time: 2020-12-09 13:27:04

In the previous article, we discussed data exploration, which began our detailed journey into data mining. We covered data exploration, the statistical description of data, the concept of data visualization, and various data visualization techniques.

In this article we will discuss:

Need for Data Preprocessing

Data Cleaning Process

Data Integration Process

Data Reduction Process

Data Transformation Process

1) Need for Data Preprocessing

Data preprocessing refers to the set of techniques applied to databases to handle noisy, missing, and inconsistent data. The data preprocessing techniques used in data mining are data cleaning, data integration, data reduction, and data transformation.

The need for data preprocessing arises because real-world data, including much of the data stored in databases, is often incomplete and inconsistent, which can lead to inaccurate data mining results. To improve the quality of the data on which observation and analysis are performed, the data is treated with these four preprocessing steps. The better the quality of the data, the more accurate the observations and predictions.

Fig 1: Steps of Data Preprocessing

2) Data Cleaning Process

Data in the real world is usually incomplete, inconsistent, and noisy. The data cleaning process aims to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. Let us discuss the basic methods of data cleaning.

2.1. Missing Values

Suppose you are working with data such as sales and customer records, and you notice that values are missing for several attributes. Many computations cannot be carried out directly on data with missing values. Several methods address this problem; let us go through them one by one.

2.1.1. Ignore the tuple:

This method is usually chosen when the class label is missing. It is not effective when the percentage of missing values per attribute varies considerably.

2.1.2. Enter the missing value manually or fill it with a global constant:

When the database contains a large number of missing values, filling them in manually is not feasible because it is too time-consuming. Another method is to fill every missing value with the same global constant.

2.1.3. Fill the missing value with the attribute mean or with the most probable value:

Another option is to fill the missing value with the attribute mean. The most probable value can be estimated using regression, Bayesian inference, or a decision tree.
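The three simpler strategies above (ignoring the tuple, filling with a global constant, and filling with the attribute mean) can be sketched in a few lines of Python. This is only an illustrative sketch: the records, the attribute names, and the constant "Unknown" are assumptions made for the example, not data from the article.

```python
from statistics import mean

# Hypothetical customer records; None marks a missing value.
records = [
    {"customer": "A", "income": 52000, "segment": "retail"},
    {"customer": "B", "income": None, "segment": "retail"},
    {"customer": "C", "income": 61000, "segment": None},
    {"customer": "D", "income": 45000, "segment": "wholesale"},
]


def ignore_tuples(rows, attribute):
    """Drop every row whose value for `attribute` is missing."""
    return [row for row in rows if row[attribute] is not None]


def fill_with_constant(rows, attribute, constant="Unknown"):
    """Replace missing values of `attribute` with one global constant."""
    return [
        {**row, attribute: constant if row[attribute] is None else row[attribute]}
        for row in rows
    ]


def fill_with_mean(rows, attribute):
    """Replace missing numeric values with the mean of the observed values."""
    observed = [row[attribute] for row in rows if row[attribute] is not None]
    attribute_mean = mean(observed)
    return [
        {**row, attribute: attribute_mean if row[attribute] is None else row[attribute]}
        for row in rows
    ]
```

Each function returns new rows and leaves the input untouched, so the strategies can be compared side by side on the same data.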

2.2. Noisy Data

Noise is random error or variance in a measured variable. Given a numerical attribute, you can smooth the data to reduce the noise. Some data smoothing techniques are as follows.

2.2.1. Binning:

Binning sorts the values, partitions them into bins, and then smooths each value using the other values in its bin.

Smoothing by bin means: each value in a bin is replaced by the mean value of the bin.

Smoothing by bin median: each value in a bin is replaced by the median of that bin.

Smoothing by bin boundaries: the minimum and maximum values in a given bin are identified as the bin boundaries. Each value in the bin is then replaced by the closest boundary value.

Let us look at an example.

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

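Partition into equal-frequency bins of four values:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

Smoothing by bin means (rounded to the nearest dollar):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34

Smoothing by bin median:
- Bin 1: 8.5, 8.5, 8.5, 8.5
- Bin 2: 22.5, 22.5, 22.5, 22.5
- Bin 3: 28.5, 28.5, 28.5, 28.5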
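The binning example can be reproduced with a short Python sketch. It assumes equal-frequency bins of four values each, which is how the twelve sorted prices split into three bins. The function names are illustrative. Means and medians are computed exactly here, so Bin 2 has mean 22.75 and median 22.5, and Bin 3 has mean 29.25 and median 28.5; round them if whole-dollar values are wanted.

```python
from statistics import mean, median

# Sorted price data (in dollars) from the example.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]


def equal_frequency_bins(sorted_values, bin_size):
    """Split already-sorted values into consecutive bins of `bin_size` items."""
    return [
        sorted_values[start:start + bin_size]
        for start in range(0, len(sorted_values), bin_size)
    ]


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so the ends are min and max
        # Ties go to the lower boundary (an arbitrary choice for this sketch).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


bins = equal_frequency_bins(prices, 4)
```

Calling `smooth_by_boundaries(bins)` yields the boundary-smoothed bins listed in the example.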

2.2.2. Regression:

Regression smooths data by fitting it to a function. Linear regression finds the best-fitting straight line, so that the value of y can be predicted from a given value of x, whereas multiple linear regression predicts the value of a variable from the given values of two or more other variables.
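As a minimal sketch of smoothing by simple linear regression, the code below fits a least-squares line and replaces each observed y with the value on the line. The sample measurements are hypothetical, and the function names are chosen only for this illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def smooth_with_line(xs, ys):
    """Replace each observed y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy measurements scattered around y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
slope, intercept = fit_line(xs, ys)
```

For these sample points the fitted slope is about 1.97 and the intercept about 1.09, close to the underlying line the noise was added to.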

3) Data Integration Process

Data integration is a data preprocessing technique that combines data from multiple heterogeneous data sources into a coherent data store and provides a unified view of the data. These sources may include multiple data cubes, databases, or flat files.

3.1. Approaches

There are two major approaches to data integration: the "tight coupling" approach and the "loose coupling" approach.

Tight Coupling:

Here, a data warehouse is treated as an information retrieval component.

In this coupling, data from different sources is combined into one physical location through ETL (Extraction, Transformation, and Loading).

Loose Coupling:

Here, an interface is provided that takes the query from the user, transforms it into a form the source databases can understand, and then sends the query directly to the source databases to obtain the result. The data remains only in the actual source databases.
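The difference between the two approaches can be illustrated with a toy Python sketch. Everything in it is hypothetical: the two in-memory "sources", their field names, and the `Mediator` class are stand-ins invented for the example, not a real integration API. Tight coupling copies transformed data into one store ahead of time; loose coupling translates each query for the sources at query time and copies nothing.

```python
# Two hypothetical sources that name the customer key differently.
CRM_ROWS = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
]
BILLING_ROWS = [
    {"cust_number": 1, "amount_due": 120.0},
    {"cust_number": 3, "amount_due": 45.5},
]


def etl_to_warehouse(crm_rows, billing_rows):
    """Tight coupling: extract, transform, and load copies into one store."""
    warehouse = {}
    for row in crm_rows:
        warehouse.setdefault(row["customer_id"], {})["name"] = row["name"]
    for row in billing_rows:
        warehouse.setdefault(row["cust_number"], {})["amount_due"] = row["amount_due"]
    return warehouse


class Mediator:
    """Loose coupling: rewrite the query per source; data stays in the sources."""

    def __init__(self, crm_rows, billing_rows):
        self._crm_rows = crm_rows          # references, not copies
        self._billing_rows = billing_rows

    def lookup(self, customer_key):
        result = {}
        for row in self._crm_rows:         # query phrased in the CRM schema
            if row["customer_id"] == customer_key:
                result["name"] = row["name"]
        for row in self._billing_rows:     # same query in the billing schema
            if row["cust_number"] == customer_key:
                result["amount_due"] = row["amount_due"]
        return result
```

One consequence of the design shows up when a source changes: the mediator sees new rows immediately, while the warehouse only reflects them after the ETL step runs again.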

3.2. Issues in Data Integration

There are a number of issues to consider during data integration: schema integration, redundancy, and the detection and resolution of data value conflicts. These are explained briefly below.

3.2.1. Schema Integration:

Integrate metadata from different sources.

Matching up equivalent real-world entities from multiple sources is known as the entity identification problem.

For example, how can the data analyst or the computer be sure that customer id in one database and customer number in another refer to the same attribute?

3.2.2. Redundancy:

An attribute may be redundant if it can be derived from another attribute or set of attributes.

Inconsistencies in attribute naming can also cause redundancies in the resulting data set.

Some redundancies can be detected by correlation analysis.
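For numeric attributes, one common form of correlation analysis is the Pearson correlation coefficient. The sketch below is illustrative only: the sample attributes are hypothetical, and the 0.95 threshold in `is_redundant` is an arbitrary assumption, not a value the article prescribes.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    x_var = sum((x - x_mean) ** 2 for x in xs)
    y_var = sum((y - y_mean) ** 2 for y in ys)
    return cov / sqrt(x_var * y_var)


def is_redundant(xs, ys, threshold=0.95):
    """Flag a pair of attributes whose correlation magnitude exceeds the threshold."""
    return abs(pearson(xs, ys)) >= threshold


# Hypothetical attributes: price_with_tax is derived from price (8% tax).
price = [10.0, 20.0, 30.0, 40.0]
price_with_tax = [p * 1.08 for p in price]
units_sold = [7, 3, 9, 4]
```

Here `is_redundant(price, price_with_tax)` is true because one attribute is a linear function of the other, while `is_redundant(price, units_sold)` is false. A high correlation suggests redundancy but does not prove that one attribute is derived from the other.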

3.2.3. Detection and resolution of data value conflicts:

This is the third important issue in data integration. For the same real-world entity, attribute values from different sources may differ. An attribute in one system may also be recorded at a lower level of abstraction than the "same" attribute in another.
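One typical cause of such conflicts is a difference in units or scaling between sources. As a hedged sketch, suppose one source stores a weight in kilograms and another in pounds (an assumed scenario, not one taken from the article). Converting both to a common unit before comparing them separates true conflicts from mere representation differences. The tolerance value is an arbitrary choice for the illustration.

```python
# Conversion factors to kilograms; one pound is defined as 0.45359237 kg.
TO_KG = {"kg": 1.0, "lb": 0.45359237}


def to_kilograms(value, unit):
    """Convert a weight to kilograms."""
    return value * TO_KG[unit]


def values_agree(a, b, tolerance_kg=0.05):
    """Compare two (value, unit) pairs that describe the same entity.

    The tolerance absorbs rounding in the sources; 0.05 kg is an
    arbitrary assumption for this sketch.
    """
    return abs(to_kilograms(*a) - to_kilograms(*b)) <= tolerance_kg
```

With this helper, `values_agree((10.0, "kg"), (22.05, "lb"))` is true, since the two values differ only in unit, whereas `values_agree((10.0, "kg"), (10.0, "lb"))` is false and points to a real conflict that needs resolving.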

4) Data Reduction Process

Data warehouses usually store large amounts of data, so data mining operations can take a long time to process it. Data reduction techniques shrink the dataset while keeping the analytical results the same, or nearly the same. The following methods are commonly used for data reduction.

Data cube aggregation

Refers to a method where aggregation operations are applied to the data to construct a data cube, which helps in analyzing business trends and performance.
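A small roll-up sketch shows the idea: detailed facts are summed along chosen dimensions, and the smaller aggregated table is used for analysis instead. The sales facts and the `roll_up` helper are hypothetical, introduced only for illustration.

```python
from collections import defaultdict

# Hypothetical sales facts: (year, quarter, branch, amount).
quarterly_sales = [
    (2019, "Q1", "North", 200),
    (2019, "Q2", "North", 300),
    (2019, "Q1", "South", 150),
    (2019, "Q2", "South", 250),
    (2020, "Q1", "North", 220),
    (2020, "Q1", "South", 180),
]


def roll_up(facts, key_indexes):
    """Sum the amount (last field) over all facts sharing the chosen dimensions."""
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[i] for i in key_indexes)
        totals[key] += fact[-1]
    return dict(totals)


by_year = roll_up(quarterly_sales, [0])            # aggregate away quarter and branch
by_year_branch = roll_up(quarterly_sales, [0, 2])  # aggregate away quarter only
```

Six quarterly rows reduce to two yearly totals in `by_year`, which is enough for a year-over-year comparison and much smaller than the original detail.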

Attribute subset selection

Refers to a method where redundant or irrelevant attributes or dimensions are identified and removed.

Dimensionality reduction

Refers to a method where encoding techniques are used to minimize the size of the data set.

Numerosity reduction

Refers to a method where the data is replaced by a smaller representation of it, such as a model, a sample, or a histogram.
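A histogram is one simple example of a smaller representation: instead of storing every value, only a count per bucket is kept. The sketch below uses equal-width buckets over integer prices; the bucket width of 10 and the starting point of 1 are assumptions chosen for the illustration.

```python
from collections import Counter

# Twelve integer prices (in dollars) to be summarized.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]


def equal_width_histogram(values, width, start=1):
    """Count integer values per bucket [start + i*width, start + (i+1)*width - 1]."""
    counts = Counter((v - start) // width for v in values)
    return {
        (start + i * width, start + (i + 1) * width - 1): counts[i]
        for i in sorted(counts)
    }


histogram = equal_width_histogram(prices, width=10)
```

The twelve values reduce to four bucket counts. Individual values can no longer be recovered, which is the trade-off numerosity reduction accepts in exchange for size.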

Discretization and concept hierarchy generation

Refers to methods where raw data values for attributes are replaced by ranges or by higher-level concepts. Data discretization is a form of numerosity reduction that is useful for the automatic generation of concept hierarchies.
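As a minimal sketch of discretization, the code below maps a numeric age to a higher-level concept label. The cut points (30 and 60) and the labels are hypothetical choices for the illustration; in practice they would come from domain knowledge or from an automatic method.

```python
from bisect import bisect_right

# Hypothetical cut points and concept labels for an age attribute.
AGE_CUTS = [30, 60]
AGE_LABELS = ["young", "middle-aged", "senior"]


def discretize(value, cuts, labels):
    """Map a numeric value to the label of the interval it falls in.

    With cuts [c1, c2], values below c1 get labels[0], values in [c1, c2)
    get labels[1], and values of c2 or more get labels[2].
    """
    return labels[bisect_right(cuts, value)]


ages = [23, 30, 45, 60, 71]
age_groups = [discretize(age, AGE_CUTS, AGE_LABELS) for age in ages]
```

Replacing each raw age with one of three labels reduces the number of distinct values the mining step has to handle, and the labels form the first level of a concept hierarchy above the raw numbers.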