
Connecting to Apache Hive and Apache Pig using SSIS Hadoop components


In our previously published articles in this series, we talked about many of the SSIS Hadoop components added in SQL Server, such as the Hadoop connection manager, the Hadoop file system task, the HDFS file source, and the HDFS file destination.

In this article, we will talk about the Hadoop Hive and Hadoop Pig tasks. We will first give a brief overview of Apache Hive and Apache Pig, and then we will illustrate the related SSIS Hadoop components and their alternatives.

Apache Hive

Apache Hive is open-source data warehousing software developed by Facebook and built on top of Hadoop. It allows querying data using a SQL-like language called HiveQL, or using Apache Spark SQL. It can store data within a separate repository, and it allows building external tables on top of data stored outside Hive repositories.

In general, these technologies work best under Linux operating systems, but they can also be installed on Windows using the Cygwin tool (check the external links section for more information).

Apache Pig

Apache Pig is an open-source framework developed by Yahoo for writing and executing Hadoop MapReduce jobs. It is designed to facilitate writing MapReduce programs in a high-level language called PigLatin instead of complicated Java code, and it can also be extended with user-defined functions.

Apache Pig converts PigLatin scripts into MapReduce jobs through an optimizing wrapper layer, which reduces the need to tune scripts manually to improve their efficiency.

Similar to Apache Hive and other Hadoop software, this technology works better on Linux-based operating systems, but it can also be installed on Windows (check the external links section for more information).

WebHDFS and WebHCat services

As we mentioned in the first article in this series, there are two types of connections within the Hadoop connection manager: (1) WebHDFS, used for HDFS commands, and (2) WebHCat, used for Apache Hive and Pig tasks.

It is worth mentioning that WebHDFS and WebHCat are two REST APIs used to communicate with Hadoop components. These APIs allow us to execute Hadoop-related commands regardless of the operating system and of whether Hadoop can be accessed via shell commands. Note that WebHDFS is installed with Hadoop, while WebHCat is installed with Apache Hive.

To start the WebHCat API, you should run the following command:

$HIVE_HOME/hcatalog/sbin/webhcat_server.sh start

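Once the service is up, you can confirm that both REST APIs respond before configuring anything in SSIS. The following is a minimal sketch, not part of the original article, using Python and the requests library; the host names, ports (9870 for WebHDFS on Hadoop 3.x, 50111 for WebHCat), and the user name are assumptions that should be adapted to your cluster.

# Sketch: probe the two REST APIs used by the SSIS Hadoop components.
# All hosts, ports, and user names below are assumptions for a local single-node cluster.
import requests

NAMENODE = "http://localhost:9870"   # assumed WebHDFS endpoint (50070 on Hadoop 2.x)
WEBHCAT = "http://localhost:50111"   # assumed WebHCat (Templeton) endpoint
USER = "hadoop"                      # assumed Hadoop user name

# WebHDFS: list the HDFS root directory (equivalent to `hdfs dfs -ls /`)
r = requests.get(f"{NAMENODE}/webhdfs/v1/?op=LISTSTATUS&user.name={USER}")
print(r.json())

# WebHCat: check that the service used by the Hive and Pig tasks is running
r = requests.get(f"{WEBHCAT}/templeton/v1/status")
print(r.json())  # typically {"status": "ok", "version": "v1"}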

Connecting to WebHCat using the Hadoop connection manager

Connecting to WebHCat is very similar to connecting to WebHDFS, which was explained in the first article. Note that the default port is 50111.

To make sure that the connection is configured correctly, we can use the “Test connection” button.

Hadoop Hive Task

The Hadoop component related to Hive is called the “Hadoop Hive Task”. This component is designed to execute HiveQL statements, and it uses a WebHCat Hadoop connection to send the statement to the Apache Hive server.

This Hadoop component is very simple; as shown in the screenshot below, its editor contains only a few parameters to configure:

- Name: the task name
- Description: the task description
- HadoopConnection: we should select the related Hadoop connection manager
- SourceType: there are two choices:
  - DirectInput: write a HiveQL script manually
  - ScriptFile: use a script file stored within Hadoop
- InlineScript (available for the DirectInput source): the HiveQL statement we want to execute
- HadoopScriptFilePath (available for the ScriptFile source): the path of the script file in Hadoop
- TimeoutInMinutes: the command timeout in minutes; if zero is entered, the command runs asynchronously
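
Under the hood, the task submits the script through the WebHCat connection. The following Python sketch, which is not part of the original article, shows a roughly equivalent direct call to WebHCat's Hive resource; the host, user name, and status directory are assumptions.

# Sketch: submit a HiveQL statement to WebHCat, similar to what the
# Hadoop Hive Task does with a DirectInput source. Values are assumptions.
import requests

WEBHCAT = "http://localhost:50111/templeton/v1"
payload = {
    "user.name": "hadoop",                    # assumed Hadoop user
    "execute": "SHOW DATABASES;",             # the HiveQL statement to run
    "statusdir": "/tmp/webhcat_hive_output",  # HDFS directory for job output
}
resp = requests.post(f"{WEBHCAT}/hive", data=payload)
print(resp.json())  # WebHCat replies with the id of the scheduled job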

Example

To run an example, we used the following HiveQL statement to create a Hive table:

CREATE TABLE IF NOT EXISTS employee (eid int, name String, salary String, destination String) STORED AS TEXTFILE;

After executing the package, we started the Hive shell from a command prompt and executed the following command to show available tables within the default database:

SHOW TABLES;

The result shows that the employee table is created:

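If the Hive shell is not available on the client machine, a similar check can be made through WebHCat's DDL resource. This is a hedged sketch, not part of the original article; the host and user name are assumptions.

# Sketch: list the tables of the default database through WebHCat.
import requests

r = requests.get(
    "http://localhost:50111/templeton/v1/ddl/database/default/table",
    params={"user.name": "hadoop"},  # assumed Hadoop user
)
print(r.json())  # the returned table list should include "employee"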

Synchronous vs. Asynchronous commands

To illustrate the difference between synchronous and asynchronous commands, we ran the following experiment:

First, we set the “TimeoutInMinutes” property to 1440 (the default) and executed the package. As shown in the screenshot below, the immediate window keeps showing the execution information sent from the Hive server.

If we set the TimeoutInMinutes property to 0 and execute the package, the task reports that it completed successfully once a job is scheduled in the Hadoop cluster.
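
When a command runs asynchronously, the package does not wait for the result, so the job has to be monitored separately. The sketch below, which is not part of the original article, polls WebHCat's job resource for a hypothetical job id; the host, user name, and the exact fields of the response are assumptions and may vary between Hadoop versions.

# Sketch: poll the status of a job that was queued asynchronously.
import time
import requests

WEBHCAT = "http://localhost:50111/templeton/v1"
job_id = "job_1592584107908_0001"  # hypothetical id returned when the job was queued

while True:
    r = requests.get(f"{WEBHCAT}/jobs/{job_id}", params={"user.name": "hadoop"})
    state = r.json().get("status", {}).get("state")  # field names may vary by version
    print("Job state:", state)
    if state in ("SUCCEEDED", "FAILED", "KILLED"):
        break
    time.sleep(30)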

Hadoop Pig Task

The Hadoop component related to Apache Pig is called the “Hadoop Pig Task”. This component is almost the same as the Hadoop Hive Task, since it has the same properties and uses a WebHCat connection. The only difference is that it executes a PigLatin script rather than HiveQL.
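
As with the Hive task, the PigLatin script ends up being submitted through WebHCat. The following sketch, which is not part of the original article, sends a small PigLatin script to WebHCat's Pig resource; the host, user name, HDFS paths, and the script itself are assumptions for illustration.

# Sketch: submit a PigLatin script to WebHCat, similar to what the
# Hadoop Pig Task does. All values are assumptions.
import requests

WEBHCAT = "http://localhost:50111/templeton/v1"
pig_script = """
emp = LOAD '/user/hadoop/employee' USING PigStorage(',')
      AS (eid:int, name:chararray, salary:chararray, destination:chararray);
grouped = GROUP emp BY destination;
counts = FOREACH grouped GENERATE group, COUNT(emp);
DUMP counts;
"""
payload = {
    "user.name": "hadoop",
    "execute": pig_script,
    "statusdir": "/tmp/webhcat_pig_output",
}
resp = requests.post(f"{WEBHCAT}/pig", data=payload)
print(resp.json())  # returns the id of the scheduled job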

SSIS Hive Hadoop component alternative: Microsoft Hive ODBC driver

Besides the SSIS Hadoop components, there is another way to connect to the Apache Hive server from SSIS: the Microsoft Hive ODBC Driver. It allows creating an ODBC connection to Apache Hive, and it connects directly to the running Hive server (HiveServer1 or HiveServer2).

First, we should download the driver from the official Microsoft download link. Note that there are two drivers (32-bit and 64-bit). After downloading and installing the driver, we should add an ODBC data source following these steps:

Navigate to Control Panel > System and Security > Administrative Tools

Open the ODBC Data Sources (32-bit or 64-bit)

Figure 10 – ODBC Data Sources shortcuts

We should add a User or System DSN (note that a sample System DSN was created during installation)

Figure 11 – Sample Microsoft Hive DSN

After clicking on the Add button, we should select the Microsoft Hive ODBC driver from the drivers list

Figure 12 – Selecting Microsoft Hive ODBC driver

In the DSN setup dialog, we should configure the following parameters:

- Host: the Hive server host address
- Port: the Hive server port number
- Database: the database name

Figure 13 – Microsoft Hive ODBC DSN setup

We should test the connection before creating the ODBC DSN

Figure 14 – Testing connection

After creating the ODBC DSN, we should create a new ODBC connection manager in SSIS:

Figure 15 – Adding ODBC connection manager

Then, while configuring the connection manager, we should select the ODBC DSN we created:

Figure 16 – Using the created DSN in the connection manager

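Before using the connection manager in a package, the DSN can also be verified outside SSIS. This is a minimal sketch, not part of the original article, using the pyodbc package and assuming the sample DSN name created by the installer; adjust the DSN name if you created your own.

# Sketch: query Hive through the ODBC DSN created above.
import pyodbc

# Hive does not support transactions, so autocommit is enabled.
conn = pyodbc.connect("DSN=Sample Microsoft Hive DSN", autocommit=True)
cursor = conn.cursor()
cursor.execute("SHOW TABLES")
for row in cursor.fetchall():
    print(row)
conn.close()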

Advantages

Using the Microsoft Hive ODBC driver has many benefits:

- It can be used with earlier versions of SQL Server
- We can use Apache Hive as a source or destination within the data flow task
- It can be used for cloud-based and on-premises Hadoop clusters
- Many SSIS components can use ODBC connections (for example, the Execute SQL Task)

External Links

- Installing Hadoop 3.2.1 single node cluster on Windows 10 step-by-step guide
- Installing Apache Hive 3.1.2 on Windows 10 step-by-step guide
- Installing Apache Pig 0.17.0 on Windows 10 step-by-step guide

Conclusion

In this article, we talked about Apache Hive and Apache Pig. Then, we explained what the WebHDFS and WebHCat services are and illustrated the Hive- and Pig-related Hadoop components in SSIS. Finally, we showed how to use the Microsoft Hive ODBC driver as an alternative to the Hive Hadoop component.

Translated from: /connecting-to-apache-hive-and-apache-pig-using-ssis-hadoop-components/
