
Big Data Development Fundamentals and Project Practice (3): Hadoop Core and Ecosystem Tech Stack, Part 4. Hive DDL, DQL, and Data Operations


Table of Contents

1. HQL Operations: DDL Commands
(1) Database operations (2) CREATE TABLE syntax (3) Internal and external tables (4) Partitioned tables (5) Bucketed tables (6) Altering and dropping tables
5. HQL Operations: Data Manipulation
(1) Loading data with load (2) Inserting data with insert
6. HQL Operations: DQL Commands
(1) Simple queries (2) Simple clauses (3) The group by clause (4) Joins (5) The order by clause (6) sort by (7) distribute by and cluster by

1. HQL Operations: DDL Commands

The Hive data hierarchy works as follows: a database can contain multiple tables, a table can be divided into multiple partitions, and a partition can be further divided into multiple buckets.

DDL (Data Definition Language) covers operations such as CREATE, ALTER, and DROP; it is used to define or modify the structure and data types of database objects.

The complete DDL reference is in the official documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL.

(1) Database operations

Hive ships with a default database named default; if an HQL statement does not name a database, the default one is used.

Hive database naming rules:

database and table names are case-insensitive;

a name cannot start with a digit;

keywords are not allowed, and special characters are best avoided.

The full syntax for database operations is as follows:

-- Create a database
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
  [COMMENT database_comment]
  [LOCATION hdfs_path]
  [MANAGEDLOCATION hdfs_path]
  [WITH DBPROPERTIES (property_name=property_value, ...)];

-- List all databases
show databases;

-- Show database information
desc database database_name;
desc database extended database_name;
describe database extended database_name;

-- Switch to a database
use database_name;

-- Drop an empty database
drop database databasename;
-- If the database is not empty, force-drop it with cascade
drop database databasename cascade;

Example usage:

hive (default)> create database mydb;
OK
Time taken: 0.18 seconds
hive (default)> show databases;
OK
database_name
default
mydb
test1
Time taken: 0.049 seconds, Fetched: 3 row(s)
hive (default)> dfs -ls /user/hive/warehouse;
Found 3 items
drwxrwxrwx  - root supergroup  0 2021-09-21 14:39 /user/hive/warehouse/mydb.db
drwxrwxrwx  - root supergroup  0 2021-09-21 14:09 /user/hive/warehouse/s1
drwxrwxrwx  - root supergroup  0 2021-09-20 18:52 /user/hive/warehouse/test1.db
hive (default)> create database if not exists mydb;
OK
Time taken: 0.078 seconds
hive (default)> create database if not exists mydb2
              > comment 'this is my db2'
              > location '/user/hive/mydb2.db';
OK
Time taken: 0.14 seconds
hive (default)> show databases;
OK
database_name
default
mydb
mydb2
test1
Time taken: 0.024 seconds, Fetched: 4 row(s)
hive (default)> use mydb;
OK
Time taken: 0.041 seconds
hive (mydb)> desc database mydb2;
OK
db_name  comment         location                               owner_name  owner_type  parameters
mydb2    this is my db2  hdfs://node01:9000/user/hive/mydb2.db  root        USER
Time taken: 0.029 seconds, Fetched: 1 row(s)
hive (mydb)> desc database extended mydb2;
OK
db_name  comment         location                               owner_name  owner_type  parameters
mydb2    this is my db2  hdfs://node01:9000/user/hive/mydb2.db  root        USER
Time taken: 0.025 seconds, Fetched: 1 row(s)
hive (mydb)> drop database test1;
OK
Time taken: 0.338 seconds
hive (mydb)> show databases;
OK
database_name
default
mydb
mydb2
Time taken: 0.021 seconds, Fetched: 3 row(s)
hive (mydb)> create table t1(id int);
OK
Time taken: 0.164 seconds
hive (mydb)> drop database mydb2 cascade;
OK
Time taken: 0.083 seconds
hive (mydb)> show databases;
OK
database_name
default
mydb
Time taken: 0.049 seconds, Fetched: 2 row(s)
hive (mydb)> show tables;
OK
tab_name
t1
Time taken: 0.032 seconds, Fetched: 1 row(s)

As shown, a database can be created with an optional comment and a custom storage location;

a non-empty database cannot be dropped directly: cascade must be specified to force the drop.

(2) CREATE TABLE syntax

The syntax for creating tables in Hive is:

-- General form (optionally with AS select)
create [external] table [IF NOT EXISTS] table_name
  [(colName colType [comment 'comment'], ...)]
  [comment table_comment]
  [partitioned by (colName colType [comment col_comment], ...)]
  [clustered by (colName, colName, ...)
    [sorted by (col_name [ASC|DESC], ...)] into num_buckets buckets]
  [row format row_format]
  [stored as file_format]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]
  [AS select_statement];

-- LIKE form
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  LIKE existing_table_or_view_name
  [LOCATION hdfs_path];

The main clauses mean the following: EXTERNAL creates an external table; IF NOT EXISTS suppresses the error when the table already exists; COMMENT attaches a description; PARTITIONED BY declares partition columns; CLUSTERED BY ... INTO n BUCKETS declares bucketing; ROW FORMAT specifies delimiters or a SerDe; STORED AS sets the file format (e.g. TEXTFILE, ORC, PARQUET); LOCATION sets the HDFS path; TBLPROPERTIES attaches key/value metadata; AS select_statement fills the table from a query.

The ROW FORMAT clause specifies the SerDe at table-creation time, in the following form:

ROW FORMAT DELIMITED
  -- field separator
  [FIELDS TERMINATED BY char]
  -- collection element separator
  [COLLECTION ITEMS TERMINATED BY char]
  -- map key/value separator
  [MAP KEYS TERMINATED BY char]
  -- line separator
  [LINES TERMINATED BY char]
| SERDE serde_name
  [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]

SerDe is short for Serializer/Deserializer; Hive uses a SerDe to serialize and deserialize row objects;

if neither ROW FORMAT nor ROW FORMAT DELIMITED is specified, the default SerDe is used;

when creating a table you define its columns and, with them, a SerDe (explicit or default); Hive uses the SerDe to interpret the table's column data.
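As a quick illustration of the SERDE branch of the syntax, here is a minimal sketch that names a SerDe explicitly instead of relying on delimited defaults. It assumes Hive 0.14+ where org.apache.hadoop.hive.serde2.OpenCSVSerde ships with the distribution; the table name csv_demo is made up for the example, and note that OpenCSVSerde reads every column as string:

-- Hypothetical table using an explicit SerDe for quoted CSV input
CREATE TABLE csv_demo (
  id string,
  name string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",   -- field separator
  "quoteChar"     = "\""   -- quoting character around fields
)
STORED AS TEXTFILE;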

(3) Internal and external tables

The table type can be specified at creation time. There are two types: internal (managed) tables and external tables:

by default (without the external keyword) an internal table is created; creating an external table requires the external keyword;

dropping an internal table deletes both the table definition (metadata) and the data;

dropping an external table deletes only the table definition and keeps the data;

in production, external tables are used most of the time.

Before testing, prepare the data: vim /home/hadoop/data/t1.dat and enter the following:

2;zhangsan;book,TV,code;beijing:chaoyang,shagnhai:pudong
3;lishi;book,code;nanjing:jiangning,taiwan:taibei
4;wangwu;music,book;heilongjiang:haerbin

First, test an internal table:

hive (default)> use mydb;
OK
Time taken: 0.034 seconds
hive (mydb)> show tables;
OK
tab_name
Time taken: 0.026 seconds
-- create an internal table
hive (mydb)> create table t1(
           > id int,
           > name string,
           > hobby array<string>,
           > addr map<string, string>
           > )
           > row format delimited
           > fields terminated by ";"
           > collection items terminated by ","
           > map keys terminated by ":"
           > ;
OK
Time taken: 0.156 seconds
hive (mydb)> show tables;
OK
tab_name
t1
Time taken: 0.031 seconds, Fetched: 1 row(s)
-- show the table definition (brief)
hive (mydb)> desc t1;
OK
col_name  data_type  comment
id     int
name   string
hobby  array<string>
addr   map<string,string>
Time taken: 0.064 seconds, Fetched: 4 row(s)
-- show the table definition (detailed, nicely formatted)
hive (mydb)> desc formatted t1;
OK
col_name  data_type  comment
# col_name  data_type  comment
id     int
name   string
hobby  array<string>
addr   map<string,string>

# Detailed Table Information
Database:        mydb
Owner:           root
CreateTime:      Wed Sep 22 17:02:03 CST 2021
LastAccessTime:  UNKNOWN
Retention:       0
Location:        hdfs://node01:9000/user/hive/warehouse/mydb.db/t1
Table Type:      MANAGED_TABLE
Table Parameters:
    COLUMN_STATS_ACCURATE  {\"BASIC_STATS\":\"true\"}
    numFiles               0
    numRows                0
    rawDataSize            0
    totalSize              0
    transient_lastDdlTime  1632301323

# Storage Information
SerDe Library:   org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:     org.apache.hadoop.mapred.TextInputFormat
OutputFormat:    org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:      No
Num Buckets:     -1
Bucket Columns:  []
Sort Columns:    []
Storage Desc Params:
    colelction.delim      ,
    field.delim           ;
    mapkey.delim          :
    serialization.format  ;
Time taken: 0.089 seconds, Fetched: 36 row(s)
-- load data
hive (mydb)> load data local inpath "/home/hadoop/data/t1.dat" into table t1;
Loading data to table mydb.t1
OK
Time taken: 0.461 seconds
-- query the data
hive (mydb)> select * from t1;
OK
t1.id  t1.name   t1.hobby              t1.addr
2      zhangsan  ["book","TV","code"]  {"beijing":"chaoyang","shagnhai":"pudong"}
3      lishi     ["book","code"]       {"nanjing":"jiangning","taiwan":"taibei"}
4      wangwu    ["music","book"]      {"heilongjiang":"haerbin"}
Time taken: 0.169 seconds, Fetched: 3 row(s)
-- inspect the data file
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/t1;
Found 1 items
-rwxrwxrwx  3 root supergroup  148 2021-09-22 17:02 /user/hive/warehouse/mydb.db/t1/t1.dat
hive (mydb)> dfs -cat /user/hive/warehouse/mydb.db/t1/t1.dat;
2;zhangsan;book,TV,code;beijing:chaoyang,shagnhai:pudong
3;lishi;book,code;nanjing:jiangning,taiwan:taibei
4;wangwu;music,book;heilongjiang:haerbin
-- drop the table: definition and data are removed together
hive (mydb)> drop table t1;
OK
Time taken: 0.198 seconds
hive (mydb)> show tables;
OK
tab_name
Time taken: 0.025 seconds
-- the data file is gone as well
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/t1;
ls: `/user/hive/warehouse/mydb.db/t1': No such file or directory
Command -ls /user/hive/warehouse/mydb.db/t1 failed with exit code = 1
Query returned non-zero code: 1, cause: null
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/;

As shown, an internal table is created with type MANAGED_TABLE;

dropping it removes both the table definition and the data (the table's files on HDFS).

Now test an external table:

hive (mydb)> create external table t2(
           > id int,
           > name string,
           > hobby array<string>,
           > addr map<string, string>
           > )
           > row format delimited
           > fields terminated by ";"
           > collection items terminated by ","
           > map keys terminated by ":"
           > ;
OK
Time taken: 0.19 seconds
hive (mydb)> show tables;
OK
tab_name
t2
Time taken: 0.044 seconds, Fetched: 1 row(s)
hive (mydb)> desc formatted t2;
OK
col_name  data_type  comment
# col_name  data_type  comment
id     int
name   string
hobby  array<string>
addr   map<string,string>

# Detailed Table Information
Database:        mydb
Owner:           root
CreateTime:      Wed Sep 22 17:12:16 CST 2021
LastAccessTime:  UNKNOWN
Retention:       0
Location:        hdfs://node01:9000/user/hive/warehouse/mydb.db/t2
Table Type:      EXTERNAL_TABLE
Table Parameters:
    COLUMN_STATS_ACCURATE  {\"BASIC_STATS\":\"true\"}
    EXTERNAL               TRUE
    numFiles               0
    numRows                0
    rawDataSize            0
    totalSize              0
    transient_lastDdlTime  1632301936

# Storage Information
SerDe Library:   org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:     org.apache.hadoop.mapred.TextInputFormat
OutputFormat:    org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:      No
Num Buckets:     -1
Bucket Columns:  []
Sort Columns:    []
Storage Desc Params:
    colelction.delim      ,
    field.delim           ;
    mapkey.delim          :
    serialization.format  ;
Time taken: 0.075 seconds, Fetched: 37 row(s)
hive (mydb)> load data local inpath "/home/hadoop/data/t1.dat" into table t1;
FAILED: SemanticException [Error 10001]: Line 1:61 Table not found 't1'
hive (mydb)> load data local inpath "/home/hadoop/data/t1.dat" into table t2;
Loading data to table mydb.t2
OK
Time taken: 0.463 seconds
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/t2;
Found 1 items
-rwxrwxrwx  3 root supergroup  148 2021-09-22 17:15 /user/hive/warehouse/mydb.db/t2/t1.dat
hive (mydb)> dfs -cat /user/hive/warehouse/mydb.db/t2/t1.dat;
2;zhangsan;book,TV,code;beijing:chaoyang,shagnhai:pudong
3;lishi;book,code;nanjing:jiangning,taiwan:taibei
4;wangwu;music,book;heilongjiang:haerbin
hive (mydb)> select * from t2;
OK
t2.id  t2.name   t2.hobby              t2.addr
2      zhangsan  ["book","TV","code"]  {"beijing":"chaoyang","shagnhai":"pudong"}
3      lishi     ["book","code"]       {"nanjing":"jiangning","taiwan":"taibei"}
4      wangwu    ["music","book"]      {"heilongjiang":"haerbin"}
Time taken: 0.192 seconds, Fetched: 3 row(s)
hive (mydb)> drop table t2;
OK
Time taken: 0.239 seconds
hive (mydb)> show tables;
OK
tab_name
Time taken: 0.022 seconds
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/t2;
Found 1 items
-rwxrwxrwx  3 root supergroup  148 2021-09-22 17:15 /user/hive/warehouse/mydb.db/t2/t1.dat
hive (mydb)> dfs -cat /user/hive/warehouse/mydb.db/t2/t1.dat;
2;zhangsan;book,TV,code;beijing:chaoyang,shagnhai:pudong
3;lishi;book,code;nanjing:jiangning,taiwan:taibei
4;wangwu;music,book;heilongjiang:haerbin

As shown, the detailed information of an external table reports the type EXTERNAL_TABLE;

dropping the table removes only the definition and leaves the data in place.

Internal and external tables can also be converted into each other:

hive (mydb)> create table t1(
           > id int,
           > name string,
           > hobby array<string>,
           > addr map<string, string>
           > )
           > row format delimited
           > fields terminated by ";"
           > collection items terminated by ","
           > map keys terminated by ":"
           > ;
OK
Time taken: 0.111 seconds
hive (mydb)> desc formatted t1;
OK
col_name  data_type  comment
# col_name  data_type  comment
id     int
name   string
hobby  array<string>
addr   map<string,string>

# Detailed Table Information
Database:        mydb
Owner:           root
CreateTime:      Wed Sep 22 17:25:55 CST 2021
LastAccessTime:  UNKNOWN
Retention:       0
Location:        hdfs://node01:9000/user/hive/warehouse/mydb.db/t1
Table Type:      MANAGED_TABLE
Table Parameters:
    ...
Time taken: 0.075 seconds, Fetched: 36 row(s)
hive (mydb)> alter table t1 set tblproperties("EXTERNAL"="TRUE");
OK
Time taken: 0.176 seconds
hive (mydb)> desc formatted t1;
OK
col_name  data_type  comment
# col_name  data_type  comment
id     int
name   string
hobby  array<string>
addr   map<string,string>

# Detailed Table Information
Database:        mydb
Owner:           root
CreateTime:      Wed Sep 22 17:25:55 CST 2021
LastAccessTime:  UNKNOWN
Retention:       0
Location:        hdfs://node01:9000/user/hive/warehouse/mydb.db/t1
Table Type:      EXTERNAL_TABLE
Table Parameters:
    COLUMN_STATS_ACCURATE  {\"BASIC_STATS\":\"true\"}
    EXTERNAL               TRUE
    last_modified_by       root
    last_modified_time     1632302837
    numFiles               0
    numRows                0
    rawDataSize            0
    totalSize              0
    transient_lastDdlTime  1632302837

# Storage Information
...
Time taken: 0.062 seconds, Fetched: 39 row(s)
hive (mydb)> alter table t1 set tblproperties("EXTERNAL"="FALSE");
OK
Time taken: 0.111 seconds
hive (mydb)> desc formatted t1;
OK
col_name  data_type  comment
# col_name  data_type  comment
id     int
name   string
hobby  array<string>
addr   map<string,string>

# Detailed Table Information
Database:        mydb
Owner:           root
CreateTime:      Wed Sep 22 17:25:55 CST 2021
LastAccessTime:  UNKNOWN
Retention:       0
Location:        hdfs://node01:9000/user/hive/warehouse/mydb.db/t1
Table Type:      MANAGED_TABLE
Table Parameters:
    COLUMN_STATS_ACCURATE  {\"BASIC_STATS\":\"true\"}
    EXTERNAL               FALSE
    last_modified_by       root
    last_modified_time     1632302857
    numFiles               0
    numRows                0
    rawDataSize            0
    totalSize              0
    transient_lastDdlTime  1632302857

# Storage Information
...
Time taken: 0.072 seconds, Fetched: 39 row(s)
hive (mydb)> drop table t1;
OK
Time taken: 0.114 seconds
hive (mydb)> show tables;
OK
tab_name
Time taken: 0.033 seconds
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/;
Found 1 items
drwxrwxrwx  - root supergroup  0 2021-09-22 17:15 /user/hive/warehouse/mydb.db/t2

As shown, the table was converted between the two types.

In summary, use an external table whenever the data should survive the table being dropped; that is why external tables dominate in production.

(4) Partitioned tables

When executing a query, Hive normally scans the whole table. With large tables, a full scan is slow and inefficient.

Often a query only needs part of the data, so Hive offers partitioned tables: the table's data is stored in separate subdirectories, one per partition. A query that touches only some partitions can skip the rest, avoiding the full scan and improving query efficiency.

In practice, tables are usually partitioned by time, region, and similar attributes.
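A minimal sketch of the resulting partition pruning, using the dt-partitioned table t3 created below (the partition value matches the load statements in the session that follows): because the filter is on the partition column, only the matching subdirectory is read.

-- Only the dt='2021-09-22' subdirectory is scanned; other partitions are skipped
select id, name
from t3
where dt = '2021-09-22';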

First, create a partitioned table and load data into it:

-- create the table
hive (mydb)> create table t3(
           > id int,
           > name string,
           > hobby array<string>,
           > addr map<string, string>
           > )
           > partitioned by (dt string)
           > row format delimited
           > fields terminated by ";"
           > collection items terminated by ","
           > map keys terminated by ":"
           > ;
OK
Time taken: 0.702 seconds
hive (mydb)> desc formatted t3;
OK
col_name  data_type  comment
# col_name  data_type  comment
id     int
name   string
hobby  array<string>
addr   map<string,string>

# Partition Information
# col_name  data_type  comment
dt     string

# Detailed Table Information
Database:        mydb
Owner:           root
CreateTime:      Thu Sep 23 02:46:32 CST 2021
LastAccessTime:  UNKNOWN
Retention:       0
Location:        hdfs://node01:9000/user/hive/warehouse/mydb.db/t3
Table Type:      MANAGED_TABLE
Table Parameters:
    COLUMN_STATS_ACCURATE  {\"BASIC_STATS\":\"true\"}
    numFiles               0
    numPartitions          0
    numRows                0
    rawDataSize            0
    totalSize              0
    transient_lastDdlTime  1632336392

# Storage Information
SerDe Library:   org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:     org.apache.hadoop.mapred.TextInputFormat
OutputFormat:    org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:      No
Num Buckets:     -1
Bucket Columns:  []
Sort Columns:    []
Storage Desc Params:
    colelction.delim      ,
    field.delim           ;
    mapkey.delim          :
    serialization.format  ;
Time taken: 0.58 seconds, Fetched: 42 row(s)
-- load data
hive (mydb)> load data local inpath "/home/hadoop/data/t1.dat" into table t3 partition(dt="2021-09-22");
Loading data to table mydb.t3 partition (dt=2021-09-22)
OK
Time taken: 1.841 seconds
hive (mydb)> load data local inpath "/home/hadoop/data/t1.dat" into table t3 partition(dt="2021-09-23");
Loading data to table mydb.t3 partition (dt=2021-09-23)
OK
Time taken: 0.63 seconds
hive (mydb)> select * from t3;
OK
t3.id  t3.name   t3.hobby              t3.addr                                     t3.dt
2      zhangsan  ["book","TV","code"]  {"beijing":"chaoyang","shagnhai":"pudong"}  2021-09-22
3      lishi     ["book","code"]       {"nanjing":"jiangning","taiwan":"taibei"}   2021-09-22
4      wangwu    ["music","book"]      {"heilongjiang":"haerbin"}                  2021-09-22
2      zhangsan  ["book","TV","code"]  {"beijing":"chaoyang","shagnhai":"pudong"}  2021-09-23
3      lishi     ["book","code"]       {"nanjing":"jiangning","taiwan":"taibei"}   2021-09-23
4      wangwu    ["music","book"]      {"heilongjiang":"haerbin"}                  2021-09-23
Time taken: 3.712 seconds, Fetched: 6 row(s)
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/t3;
Found 2 items
drwxrwxrwx  - root supergroup  0 2021-09-23 02:47 /user/hive/warehouse/mydb.db/t3/dt=2021-09-22
drwxrwxrwx  - root supergroup  0 2021-09-23 02:47 /user/hive/warehouse/mydb.db/t3/dt=2021-09-23
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/t3/dt=2021-09-22;
Found 1 items
-rwxrwxrwx  3 root supergroup  148 2021-09-23 02:47 /user/hive/warehouse/mydb.db/t3/dt=2021-09-22/t1.dat
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/t3/dt=2021-09-23;
Found 1 items
-rwxrwxrwx  3 root supergroup  148 2021-09-23 02:47 /user/hive/warehouse/mydb.db/t3/dt=2021-09-23/t1.dat
hive (mydb)> show partitions t3;
OK
partition
dt=2021-09-22
dt=2021-09-23
Time taken: 0.132 seconds, Fetched: 2 row(s)
-- add partitions
hive (mydb)> alter table t3
           > add partition(dt="2021-09-24");
OK
Time taken: 0.274 seconds
hive (mydb)> alter table t3
           > add partition(dt="2021-09-25");
OK
Time taken: 0.186 seconds
hive (mydb)> show partitions t3;
OK
partition
dt=2021-09-22
dt=2021-09-23
dt=2021-09-24
dt=2021-09-25
Time taken: 0.107 seconds, Fetched: 4 row(s)

As shown, a table's partitioning appears in its detailed information;

queries also display each row's partition value, but the partition column is not stored in the data files; it can be regarded as a pseudo-column.

Next, add a partition with an explicit data path:

-- prepare the data
hive (mydb)> dfs -cp /user/hive/warehouse/mydb.db/t3/dt=2021-09-22 /user/hive/warehouse/mydb.db/t3/dt=2021-09-26;
hive (mydb)> dfs -cp /user/hive/warehouse/mydb.db/t3/dt=2021-09-22 /user/hive/warehouse/mydb.db/t3/dt=2021-09-27;
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/t3/;
Found 6 items
drwxrwxrwx  - root supergroup  0 2021-09-23 02:47 /user/hive/warehouse/mydb.db/t3/dt=2021-09-22
drwxrwxrwx  - root supergroup  0 2021-09-23 02:47 /user/hive/warehouse/mydb.db/t3/dt=2021-09-23
drwxrwxrwx  - root supergroup  0 2021-09-23 02:50 /user/hive/warehouse/mydb.db/t3/dt=2021-09-24
drwxrwxrwx  - root supergroup  0 2021-09-23 02:51 /user/hive/warehouse/mydb.db/t3/dt=2021-09-25
drwxr-xr-x  - root supergroup  0 2021-09-23 02:53 /user/hive/warehouse/mydb.db/t3/dt=2021-09-26
drwxr-xr-x  - root supergroup  0 2021-09-23 02:53 /user/hive/warehouse/mydb.db/t3/dt=2021-09-27
-- add a partition with an explicit path
hive (mydb)> alter table t3
           > add partition(dt="2021-09-26") location '/user/hive/warehouse/mydb.db/t3/dt=2021-09-26';
OK
Time taken: 0.138 seconds
hive (mydb)> select * from t3;
OK
t3.id  t3.name   t3.hobby              t3.addr                                     t3.dt
2      zhangsan  ["book","TV","code"]  {"beijing":"chaoyang","shagnhai":"pudong"}  2021-09-22
3      lishi     ["book","code"]       {"nanjing":"jiangning","taiwan":"taibei"}   2021-09-22
4      wangwu    ["music","book"]      {"heilongjiang":"haerbin"}                  2021-09-22
2      zhangsan  ["book","TV","code"]  {"beijing":"chaoyang","shagnhai":"pudong"}  2021-09-23
3      lishi     ["book","code"]       {"nanjing":"jiangning","taiwan":"taibei"}   2021-09-23
4      wangwu    ["music","book"]      {"heilongjiang":"haerbin"}                  2021-09-23
2      zhangsan  ["book","TV","code"]  {"beijing":"chaoyang","shagnhai":"pudong"}  2021-09-26
3      lishi     ["book","code"]       {"nanjing":"jiangning","taiwan":"taibei"}   2021-09-26
4      wangwu    ["music","book"]      {"heilongjiang":"haerbin"}                  2021-09-26
Time taken: 0.323 seconds, Fetched: 9 row(s)
-- change a partition's HDFS path
hive (mydb)> alter table t3 partition(dt="2021-09-26") set location "/user/hive/warehouse/mydb.db/t3/dt=2021-09-27";
OK
Time taken: 0.269 seconds
hive (mydb)> select * from t3;
OK
t3.id  t3.name   t3.hobby              t3.addr                                     t3.dt
2      zhangsan  ["book","TV","code"]  {"beijing":"chaoyang","shagnhai":"pudong"}  2021-09-22
3      lishi     ["book","code"]       {"nanjing":"jiangning","taiwan":"taibei"}   2021-09-22
4      wangwu    ["music","book"]      {"heilongjiang":"haerbin"}                  2021-09-22
2      zhangsan  ["book","TV","code"]  {"beijing":"chaoyang","shagnhai":"pudong"}  2021-09-23
3      lishi     ["book","code"]       {"nanjing":"jiangning","taiwan":"taibei"}   2021-09-23
4      wangwu    ["music","book"]      {"heilongjiang":"haerbin"}                  2021-09-23
2      zhangsan  ["book","TV","code"]  {"beijing":"chaoyang","shagnhai":"pudong"}  2021-09-26
3      lishi     ["book","code"]       {"nanjing":"jiangning","taiwan":"taibei"}   2021-09-26
4      wangwu    ["music","book"]      {"heilongjiang":"haerbin"}                  2021-09-26
Time taken: 0.263 seconds, Fetched: 9 row(s)
hive (mydb)> show partitions t3;
OK
partition
dt=2021-09-22
dt=2021-09-23
dt=2021-09-24
dt=2021-09-25
dt=2021-09-26
Time taken: 0.133 seconds, Fetched: 5 row(s)
-- drop partitions
hive (mydb)> alter table t3 drop partition(dt="2021-09-24");
Dropped the partition dt=2021-09-24
OK
Time taken: 0.57 seconds
hive (mydb)> alter table t3 drop partition(dt="2021-09-25"), partition(dt="2021-09-26");
Dropped the partition dt=2021-09-25
Dropped the partition dt=2021-09-26
OK
Time taken: 0.273 seconds
hive (mydb)> show partitions t3;
OK
partition
dt=2021-09-22
dt=2021-09-23
Time taken: 0.079 seconds, Fetched: 2 row(s)

To drop several partitions at once, separate them with commas.

(5) Bucketed tables

When a single partition or table still holds too much data and partitioning cannot split it finely enough, bucketing divides the data at a finer granularity: rows are distributed across multiple buckets (files) according to the value of a chosen column.

Hive assigns a row to a bucket by bucketColumn.hashCode % numBuckets, analogous to how MapReduce partitions keys during the shuffle (key.hashCode % numReduceTasks).
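The rule can be checked from HQL itself. A sketch using Hive's built-in hash() and pmod() functions on the course data loaded below; for an int column, hash(id) is simply id, and pmod keeps the remainder non-negative, which approximates Hive's internal (hashCode & Integer.MAX_VALUE) % numBuckets:

-- Predict which bucket (0, 1, or 2) of a 3-bucket table each row should land in
select id, pmod(hash(id), 3) as bucket_no
from course_common;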

Usage:

-- create the bucketed table
hive (mydb)> create table course(
           > id int,
           > name string,
           > score int
           > )
           > clustered by (id) into 3 buckets
           > row format delimited fields terminated by "\t";
OK
Time taken: 0.105 seconds
-- create an ordinary staging table
hive (mydb)> create table course_common(
           > id int,
           > name string,
           > score int
           > )
           > row format delimited fields terminated by "\t";
OK
Time taken: 0.133 seconds
-- load data into the ordinary table
hive (mydb)> load data local inpath "/home/hadoop/data/course.dat" into table course_common;
Loading data to table mydb.course_common
OK
Time taken: 0.55 seconds
hive (mydb)> select * from course_common;
OK
course_common.id  course_common.name  course_common.score
1  java    90
1  c       78
1  python  91
1  hadoop  80
2  java    75
2  c       76
2  python  80
2  hadoop  93
3  java    98
3  c       74
3  python  89
3  hadoop  91
5  java    93
6  c       76
7  python  87
8  hadoop  88
Time taken: 0.244 seconds, Fetched: 16 row(s)
-- load the bucketed table from the ordinary table
hive (mydb)> insert into table course select * from course_common;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_0923033603_53b34c32-c8b7-4d73-ac85-f8419ca26678
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 3
...
Stage-Stage-1: Map: 1  Reduce: 3  Cumulative CPU: 9.18 sec  HDFS Read: 16039 HDFS Write: 365 SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 180 msec
OK
course_common.id  course_common.name  course_common.score
Time taken: 52.252 seconds
hive (mydb)> desc formatted course;
OK
col_name  data_type  comment
# col_name  data_type  comment
id     int
name   string
score  int

# Detailed Table Information
Database:        mydb
Owner:           root
CreateTime:      Thu Sep 23 03:34:14 CST 2021
LastAccessTime:  UNKNOWN
Retention:       0
Location:        hdfs://node01:9000/user/hive/warehouse/mydb.db/course
Table Type:      MANAGED_TABLE
Table Parameters:
    COLUMN_STATS_ACCURATE  {\"BASIC_STATS\":\"true\"}
    numFiles               3
    numRows                16
    rawDataSize            148
    totalSize              164
    transient_lastDdlTime  1632339416

# Storage Information
SerDe Library:   org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:     org.apache.hadoop.mapred.TextInputFormat
OutputFormat:    org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:      No
Num Buckets:     3
Bucket Columns:  [id]
Sort Columns:    []
Storage Desc Params:
    field.delim           \t
    serialization.format  \t
Time taken: 0.096 seconds, Fetched: 33 row(s)
hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/course;
Found 3 items
-rwxrwxrwx  3 root supergroup  48 2021-09-23 03:36 /user/hive/warehouse/mydb.db/course/000000_0
-rwxrwxrwx  3 root supergroup  53 2021-09-23 03:36 /user/hive/warehouse/mydb.db/course/000001_0
-rwxrwxrwx  3 root supergroup  63 2021-09-23 03:36 /user/hive/warehouse/mydb.db/course/000002_0
-- inspect the bucket contents
hive (mydb)> dfs -cat /user/hive/warehouse/mydb.db/course/000000_0;
3  hadoop  91
3  python  89
3  c       74
3  java    98
6  c       76
hive (mydb)> dfs -cat /user/hive/warehouse/mydb.db/course/000001_0;
7  python  87
1  hadoop  80
1  python  91
1  c       78
1  java    90
hive (mydb)> dfs -cat /user/hive/warehouse/mydb.db/course/000002_0;
8  hadoop  88
5  java    93
2  python  80
2  c       76
2  java    75
2  hadoop  93

As shown, the bucketing information also appears in the table's detailed description;

data cannot be loaded into a bucketed table directly; it has to be inserted from an ordinary table with insert ... select ...;

the bucketing rule is bucketColumn.hashCode % numBuckets; with 3 buckets here, each id's hash is taken modulo 3 to select the bucket;

since Hive 2.x, bucketing is always enforced and the parameter hive.enforce.bucketing=true no longer needs to be set.
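On Hive 1.x, by contrast, the enforcement flag has to be switched on before the insert, roughly:

-- Hive 1.x only; from Hive 2.x bucketing is always enforced
set hive.enforce.bucketing = true;
insert into table course select * from course_common;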

(6) Altering and dropping tables

Hive's alter and drop operations work as follows:

-- rename a table (rename)
hive (mydb)> alter table course_common rename to course1;
OK
Time taken: 0.163 seconds
-- rename a column (change column)
hive (mydb)> alter table course1
           > change column id cid int;
OK
Time taken: 0.139 seconds
-- change a column's type (change column)
hive (mydb)> alter table course1
           > change column cid cid string;
OK
Time taken: 0.128 seconds
-- string cannot be converted back to int
hive (mydb)> alter table course1
           > change column cid cid int;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. The following columns have types incompatible with the existing columns in their respective positions :
cid
hive (mydb)> desc course1;
OK
col_name  data_type  comment
cid    string
name   string
score  int
Time taken: 0.16 seconds, Fetched: 3 row(s)
-- add a column (add columns)
hive (mydb)> alter table course1
           > add columns(common string);
OK
Time taken: 0.168 seconds
hive (mydb)> select * from course1;
OK
course1.cid  course1.name  course1.score  course1.common
1  java    90  NULL
1  c       78  NULL
1  python  91  NULL
1  hadoop  80  NULL
2  java    75  NULL
2  c       76  NULL
2  python  80  NULL
2  hadoop  93  NULL
3  java    98  NULL
3  c       74  NULL
3  python  89  NULL
3  hadoop  91  NULL
5  java    93  NULL
6  c       76  NULL
7  python  87  NULL
8  hadoop  88  NULL
Time taken: 1.531 seconds, Fetched: 16 row(s)
-- drop columns (replace columns)
hive (mydb)> alter table course1
           > replace columns(
           > cid string, cname string, cscore int);
OK
Time taken: 0.297 seconds
hive (mydb)> desc course1;
OK
col_name  data_type  comment
cid     string
cname   string
cscore  int
Time taken: 0.133 seconds, Fetched: 3 row(s)
-- drop the table
hive (mydb)> drop table course1;
OK
Time taken: 0.157 seconds
hive (mydb)> show tables;
OK
tab_name
course
t3
Time taken: 0.071 seconds, Fetched: 2 row(s)

Note that changing a column's type must satisfy Hive's type-conversion rules: int can be converted to string, but string cannot be converted back to int;

dropping columns with replace columns only removes them from the metadata; the data files on HDFS are untouched.

Hive DDL in summary:

the main objects operated on are databases and tables;

tables fall into two classes, internal (managed) and external;

the main commands are create, alter, and drop.

5. HQL Operations: Data Manipulation

(1) Loading data with load

There are four ways to import data:

loading data (load)

inserting data (insert)

creating a table from a query (as select)

importing with import

The basic syntax of load is:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE]
INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]

Its parts mean the following: with LOCAL, filepath refers to the local filesystem and the file is copied into the table; without LOCAL, filepath refers to HDFS and the file is moved into the table's directory; OVERWRITE first clears any existing data in the table (or partition); PARTITION loads the file into the given partition.

Now test it. Prepare a data file with vim /home/hadoop/data/sourceA.txt:

1,fish1,SZ
2,fish2,SH
3,fish3,HZ
4,fish4,QD
5,fish5,SR

Then upload it to HDFS:

[root@node03 ~]$ hdfs dfs -mkdir -p /user/hadoop/data/
[root@node03 ~]$ hdfs dfs -put /home/hadoop/data/sourceA.txt /user/hadoop/data/
# create a data directory for the Hive table
[root@node03 ~]$ hdfs dfs -mkdir /user/hive/tabB
[root@node03 ~]$ hdfs dfs -put /home/hadoop/data/sourceA.txt /user/hive/tabB

In Hive:

-- create the table
hive (mydb)> CREATE TABLE tabA (
           > id int
           > ,name string
           > ,area string
           > ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
OK
Time taken: 0.085 seconds
-- load a local file into Hive
hive (mydb)> load data local inpath '/home/hadoop/data/sourceA.txt'
           > into table tabA;
Loading data to table mydb.taba
OK
Time taken: 0.31 seconds
-- the local file still exists afterwards
hive (mydb)> select * from tabA;
OK
taba.id  taba.name  taba.area
1  fish1  SZ
2  fish2  SH
3  fish3  HZ
4  fish4  QD
5  fish5  SR
Time taken: 0.259 seconds, Fetched: 5 row(s)
-- load an HDFS file into Hive
hive (mydb)> load data inpath '/user/hadoop/data/sourceA.txt'
           > into table tabA;
Loading data to table mydb.taba
OK
Time taken: 0.449 seconds
-- the HDFS file no longer exists at its original path
hive (mydb)> select * from tabA;
OK
taba.id  taba.name  taba.area
1  fish1  SZ
2  fish2  SH
3  fish3  HZ
4  fish4  QD
5  fish5  SR
1  fish1  SZ
2  fish2  SH
3  fish3  HZ
4  fish4  QD
5  fish5  SR
Time taken: 0.25 seconds, Fetched: 10 row(s)
-- load with overwrite, replacing the table's existing data
hive (mydb)> load data local inpath '/home/hadoop/data/sourceA.txt'
           > overwrite into table tabA;
Loading data to table mydb.taba
OK
Time taken: 0.362 seconds
hive (mydb)> select * from tabA;
OK
taba.id  taba.name  taba.area
1  fish1  SZ
2  fish2  SH
3  fish3  HZ
4  fish4  QD
5  fish5  SR
Time taken: 0.148 seconds, Fetched: 5 row(s)
-- load data at table-creation time (via LOCATION)
hive (mydb)> CREATE TABLE tabB (
           > id int,
           > name string,
           > area string
           > ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
           > location '/user/hive/tabB';
OK
Time taken: 0.129 seconds
hive (mydb)> select * from tabB;
OK
tabb.id  tabb.name  tabb.area
1  fish1  SZ
2  fish2  SH
3  fish3  HZ
4  fish4  QD
5  fish5  SR
Time taken: 0.182 seconds, Fetched: 5 row(s)

Then check the local filesystem and HDFS:

[root@node03 ~]$ ls /home/hadoop/data/
course.dat  s1.dat  sourceA.txt  t1.dat
[root@node03 ~]$ hdfs dfs -ls /user/hadoop/data
[root@node03 ~]$

As shown, after loading from a local file the source file remains, while a file loaded from HDFS is moved and no longer exists at its original path;

loading with overwrite clears all of the table's existing data first.

(2) Inserting data with insert

Insert usage:

-- create a partitioned table
hive (mydb)> create table tabC(
           > id int, name string, area string)
           > partitioned by(month string);
OK
Time taken: 0.524 seconds
hive (mydb)> desc tabC;
OK
col_name  data_type  comment
id     int
name   string
area   string
month  string

# Partition Information
# col_name  data_type  comment
month  string
Time taken: 0.733 seconds, Fetched: 9 row(s)
-- insert a single row
hive (mydb)> insert into tabC
           > partition(month="09")
           > values(1, "Corley", "Beijing");
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
_col0  _col1  _col2
Time taken: 6.384 seconds
hive (mydb)> select * from tabC;
OK
tabc.id  tabc.name  tabc.area  tabc.month
1  Corley  Beijing  09
Time taken: 0.263 seconds, Fetched: 1 row(s)
-- insert several rows
hive (mydb)> insert into tabC
           > partition(month="09")
           > values(2, "Jack", "Tianjin"), (3, "Bob", "Shanghai");
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
_col0  _col1  _col2
Time taken: 2.749 seconds
hive (mydb)> select * from tabC;
OK
tabc.id  tabc.name  tabc.area  tabc.month
1  Corley  Beijing   09
2  Jack    Tianjin   09
3  Bob     Shanghai  09
Time taken: 0.362 seconds, Fetched: 3 row(s)
-- insert the result of a query
hive (mydb)> insert into tabC
           > partition(month="10")
           > select id, name, area from tabC;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
id  name  area
Time taken: 2.727 seconds
hive (mydb)> select * from tabC;
OK
tabc.id  tabc.name  tabc.area  tabc.month
1  Corley  Beijing   09
2  Jack    Tianjin   09
3  Bob     Shanghai  09
1  Corley  Beijing   10
2  Jack    Tianjin   10
3  Bob     Shanghai  10
Time taken: 0.235 seconds, Fetched: 6 row(s)
-- multi-table (multi-partition) insert
hive (mydb)> from tabC
           > insert overwrite table tabC partition(month="11")
           > select id, name ,area where month="09"
           > insert overwrite table tabC partition(month="12")
           > select id, name ,area where month="09" or month="10";
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
id  name  area
Time taken: 3.771 seconds
hive (mydb)> select * from tabC;
OK
tabc.id  tabc.name  tabc.area  tabc.month
1  Corley  Beijing   09
2  Jack    Tianjin   09
3  Bob     Shanghai  09
1  Corley  Beijing   10
2  Jack    Tianjin   10
3  Bob     Shanghai  10
1  Corley  Beijing   11
2  Jack    Tianjin   11
3  Bob     Shanghai  11
1  Corley  Beijing   12
2  Jack    Tianjin   12
3  Bob     Shanghai  12
1  Corley  Beijing   12
2  Jack    Tianjin   12
3  Bob     Shanghai  12
Time taken: 0.244 seconds, Fetched: 15 row(s)

insert supports three styles:

inserting literal rows, single or multiple;

inserting the result of a query;

multi-table (multi-partition) insert.

Data can also be inserted at table-creation time with as select:

-- create a table from a query result
hive (mydb)> create table if not exists tabD
           > as select * from tabC;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
...
Total MapReduce CPU Time Spent: 1 seconds 590 msec
OK
tabc.id  tabc.name  tabc.area  tabc.month
Time taken: 37.142 seconds
hive (mydb)> select * from tabD;
OK
tabd.id  tabd.name  tabd.area  tabd.month
1  Corley  Beijing   09
2  Jack    Tianjin   09
3  Bob     Shanghai  09
1  Corley  Beijing   10
2  Jack    Tianjin   10
3  Bob     Shanghai  10
1  Corley  Beijing   11
2  Jack    Tianjin   11
3  Bob     Shanghai  11
1  Corley  Beijing   12
2  Jack    Tianjin   12
3  Bob     Shanghai  12
1  Corley  Beijing   12
2  Jack    Tianjin   12
3  Bob     Shanghai  12
Time taken: 0.163 seconds, Fetched: 15 row(s)
hive (mydb)> desc tabD;
OK
col_name  data_type  comment
id     int
name   string
area   string
month  string
Time taken: 0.049 seconds, Fetched: 4 row(s)

When a table is created from a query result, partition information is not copied over; only the regular column data is, which is why tabD has no partitions.

First, export data with insert overwrite:

-- 1. export query results to the local filesystem
hive (mydb)> insert overwrite local directory '/home/hadoop/data/tabC'
           > select * from tabC;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
...
Total MapReduce CPU Time Spent: 1 seconds 740 msec
OK
tabc.id  tabc.name  tabc.area  tabc.month
Time taken: 33.941 seconds
-- 2. export query results to the local filesystem with an explicit format
hive (mydb)> insert overwrite local directory '/home/hadoop/data/tabC'
           > row format delimited fields terminated by ' '
           > select * from tabC;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
...
Total MapReduce CPU Time Spent: 1 seconds 610 msec
OK
tabc.id  tabc.name  tabc.area  tabc.month
Time taken: 28.068 seconds
-- 3. export query results to HDFS
hive (mydb)> insert overwrite directory '/user/hadoop/data/tabC'
           > row format delimited fields terminated by ' '
           > select * from tabC;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
...
Total MapReduce CPU Time Spent: 1 seconds 140 msec
OK
tabc.id  tabc.name  tabc.area  tabc.month
Time taken: 20.725 seconds

The three exports, inspected on the filesystem:

[root@node03 ~]$ ll -ht /home/hadoop/data/
total 16K
drwxr-xr-x 2 root root  43 Sep 24 22:25 tabC
-rw-r--r-- 1 root root  55 Sep 24 17:18 sourceA.txt
-rw-r--r-- 1 root root 164 Sep 23 03:32 course.dat
-rw-r--r-- 1 root root 148 Sep 22 16:58 t1.dat
-rw-r--r-- 1 root root  84 Sep 21 13:59 s1.dat
[root@node03 ~]$ cat -A /home/hadoop/data/tabC/000000_0
1^ACorley^ABeijing^A09$
2^AJack^ATianjin^A09$
3^ABob^AShanghai^A09$
1^ACorley^ABeijing^A10$
2^AJack^ATianjin^A10$
3^ABob^AShanghai^A10$
1^ACorley^ABeijing^A11$
2^AJack^ATianjin^A11$
3^ABob^AShanghai^A11$
1^ACorley^ABeijing^A12$
2^AJack^ATianjin^A12$
3^ABob^AShanghai^A12$
1^ACorley^ABeijing^A12$
2^AJack^ATianjin^A12$
3^ABob^AShanghai^A12$
[root@node03 ~]$ cat /home/hadoop/data/tabC/000000_0
1 Corley Beijing 09
2 Jack Tianjin 09
3 Bob Shanghai 09
1 Corley Beijing 10
2 Jack Tianjin 10
3 Bob Shanghai 10
1 Corley Beijing 11
2 Jack Tianjin 11
3 Bob Shanghai 11
1 Corley Beijing 12
2 Jack Tianjin 12
3 Bob Shanghai 12
1 Corley Beijing 12
2 Jack Tianjin 12
3 Bob Shanghai 12
[root@node03 ~]$ hdfs dfs -ls /user/hadoop/data
Found 1 items
drwxr-xr-x  - root supergroup  0 2021-09-24 22:29 /user/hadoop/data/tabC
[root@node03 ~]$ hdfs dfs -cat /user/hadoop/data/tabC/000000_0
1 Corley Beijing 09
2 Jack Tianjin 09
3 Bob Shanghai 09
1 Corley Beijing 10
2 Jack Tianjin 10
3 Bob Shanghai 10
1 Corley Beijing 11
2 Jack Tianjin 11
3 Bob Shanghai 11
1 Corley Beijing 12
2 Jack Tianjin 12
3 Bob Shanghai 12
1 Corley Beijing 12
2 Jack Tianjin 12
3 Bob Shanghai 12
[root@node03 ~]$

Next, export data to the local filesystem with the DFS command:

hive (mydb)> dfs -ls /user/hive/warehouse/mydb.db/tabc;
Found 4 items
drwxrwxrwx  - root supergroup  0 2021-09-24 21:53 /user/hive/warehouse/mydb.db/tabc/month=09
drwxrwxrwx  - root supergroup  0 2021-09-24 21:54 /user/hive/warehouse/mydb.db/tabc/month=10
drwxrwxrwx  - root supergroup  0 2021-09-24 21:59 /user/hive/warehouse/mydb.db/tabc/month=11
drwxrwxrwx  - root supergroup  0 2021-09-24 21:59 /user/hive/warehouse/mydb.db/tabc/month=12
hive (mydb)> dfs -get /user/hive/warehouse/mydb.db/tabc/month=09 /home/hadoop/data/tabC;

Check locally again:

[root@node03 ~]$ ll -ht /home/hadoop/data/
total 16K
drwxr-xr-x 3 root root  63 Sep 24 22:40 tabC
-rw-r--r-- 1 root root  55 Sep 24 17:18 sourceA.txt
-rw-r--r-- 1 root root 164 Sep 23 03:32 course.dat
-rw-r--r-- 1 root root 148 Sep 22 16:58 t1.dat
-rw-r--r-- 1 root root  84 Sep 21 13:59 s1.dat
[root@node03 ~]$ cat -A /home/hadoop/data/tabC/month\=09/000000_0
1^ACorley^ABeijing$

This approach is essentially a raw copy of the data files.

You can also run the hive command from the shell to export query results to a local file:

[root@node03 ~]$ hive -e "select * from mydb.tabC" > tabc.dat
which: no hbase in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/software/java/jdk1.8.0_231/bin:/opt/software/hadoop-2.9.2/bin:/opt/software/hadoop-2.9.2/sbin:/opt/software/hive-2.3.7/bin:/root/bin)
Logging initialized using configuration in file:/opt/software/hive-2.3.7/conf/hive-log4j2.properties Async: true
OK
Time taken: 11.372 seconds, Fetched: 15 row(s)
[root@node03 ~]$ cat tabc.dat
tabc.id  tabc.name  tabc.area  tabc.month
1  Corley  Beijing   09
2  Jack    Tianjin   09
3  Bob     Shanghai  09
1  Corley  Beijing   10
2  Jack    Tianjin   10
3  Bob     Shanghai  10
1  Corley  Beijing   11
2  Jack    Tianjin   11
3  Bob     Shanghai  11
1  Corley  Beijing   12
2  Jack    Tianjin   12
3  Bob     Shanghai  12
1  Corley  Beijing   12
2  Jack    Tianjin   12
3  Bob     Shanghai  12

This simply runs the query and redirects its output to a file.
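A small, hedged variation: the CLI's -S (silent) flag suppresses the log lines, and if your installation prints the header row (hive.cli.print.header=true, as in the session above), it can be switched off inline so the file contains only data:

# -S silences log output; the set statement drops the column-header line (assumes the hive CLI)
hive -S -e "set hive.cli.print.header=false; select * from mydb.tabC" > tabc.dat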

Hive also has a dedicated export command, which exports data to HDFS:

hive (mydb)> export table tabC to '/user/hadoop/data/tabC2';
Copying data from file:/tmp/root/88c65aff-1880-43e6-a9c0-452b1b0f63a8/hive_2021-09-24_22-50-40_489_1035737842386809043-1/-local-10000/_metadata
Copying file: file:/tmp/root/88c65aff-1880-43e6-a9c0-452b1b0f63a8/hive_2021-09-24_22-50-40_489_1035737842386809043-1/-local-10000/_metadata
Copying data from hdfs://node01:9000/user/hive/warehouse/mydb.db/tabc/month=09
Copying file: hdfs://node01:9000/user/hive/warehouse/mydb.db/tabc/month=09/000000_0
Copying file: hdfs://node01:9000/user/hive/warehouse/mydb.db/tabc/month=09/000000_0_copy_1
Copying data from hdfs://node01:9000/user/hive/warehouse/mydb.db/tabc/month=10
Copying file: hdfs://node01:9000/user/hive/warehouse/mydb.db/tabc/month=10/000000_0
Copying data from hdfs://node01:9000/user/hive/warehouse/mydb.db/tabc/month=11
Copying file: hdfs://node01:9000/user/hive/warehouse/mydb.db/tabc/month=11/000000_0
Copying data from hdfs://node01:9000/user/hive/warehouse/mydb.db/tabc/month=12
Copying file: hdfs://node01:9000/user/hive/warehouse/mydb.db/tabc/month=12/000000_0
OK
Time taken: 1.242 seconds

The exported directory on HDFS:

[root@node03 ~]$ hdfs dfs -ls /user/hadoop/data/tabC2
Found 5 items
-rwxr-xr-x  3 root supergroup  6086 2021-09-24 22:50 /user/hadoop/data/tabC2/_metadata
drwxr-xr-x  - root supergroup     0 2021-09-24 22:50 /user/hadoop/data/tabC2/month=09
drwxr-xr-x  - root supergroup     0 2021-09-24 22:50 /user/hadoop/data/tabC2/month=10
drwxr-xr-x  - root supergroup     0 2021-09-24 22:50 /user/hadoop/data/tabC2/month=11
drwxr-xr-x  - root supergroup     0 2021-09-24 22:50 /user/hadoop/data/tabC2/month=12

As shown, export writes not only the data but also the table's metadata (_metadata).

Data produced by export can be re-imported into a Hive table with import:

hive (mydb)> create table tabE like tabC;
OK
Time taken: 0.429 seconds
hive (mydb)> desc tabE;
OK
col_name  data_type  comment
id     int
name   string
area   string
month  string

# Partition Information
# col_name  data_type  comment
month  string
Time taken: 0.444 seconds, Fetched: 9 row(s)
hive (mydb)> select * from tabE;
OK
tabe.id  tabe.name  tabe.area  tabe.month
Time taken: 2.391 seconds
hive (mydb)> import table tabE from '/user/hadoop/data/tabC2';
Copying data from hdfs://node01:9000/user/hadoop/data/tabC2/month=09
Copying file: hdfs://node01:9000/user/hadoop/data/tabC2/month=09/000000_0
Copying file: hdfs://node01:9000/user/hadoop/data/tabC2/month=09/000000_0_copy_1
Copying data from hdfs://node01:9000/user/hadoop/data/tabC2/month=10
Copying file: hdfs://node01:9000/user/hadoop/data/tabC2/month=10/000000_0
Copying data from hdfs://node01:9000/user/hadoop/data/tabC2/month=11
Copying file: hdfs://node01:9000/user/hadoop/data/tabC2/month=11/000000_0
Copying data from hdfs://node01:9000/user/hadoop/data/tabC2/month=12
Copying file: hdfs://node01:9000/user/hadoop/data/tabC2/month=12/000000_0
Loading data to table mydb.tabe partition (month=09)
Loading data to table mydb.tabe partition (month=10)
Loading data to table mydb.tabe partition (month=11)
Loading data to table mydb.tabe partition (month=12)
OK
Time taken: 4.861 seconds
hive (mydb)> select * from tabE;
OK
tabe.id  tabe.name  tabe.area  tabe.month
1  Corley  Beijing   09
2  Jack    Tianjin   09
3  Bob     Shanghai  09
1  Corley  Beijing   10
2  Jack    Tianjin   10
3  Bob     Shanghai  10
1  Corley  Beijing   11
2  Jack    Tianjin   11
3  Bob     Shanghai  11
1  Corley  Beijing   12
2  Jack    Tianjin   12
3  Bob     Shanghai  12
1  Corley  Beijing   12
2  Jack    Tianjin   12
3  Bob     Shanghai  12
Time taken: 0.232 seconds, Fetched: 15 row(s)

To summarize: a table created with like tname has exactly the same structure as the original, while one created with create ... as select ... may differ, for example it does not carry partition information.
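Side by side, the two statements from the sessions above:

-- LIKE copies the complete definition, including partition columns (but no data)
create table tabE like tabC;
-- CTAS derives its columns from the query result; partition metadata is not carried over
create table tabD as select * from tabC;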

truncate truncates a table, i.e. empties its data:

hive (mydb)> truncate table tabE;
OK
Time taken: 0.833 seconds
hive (mydb)> select * from tabE;
OK
tabe.id  tabe.name  tabe.area  tabe.month
Time taken: 0.287 seconds
hive (mydb)> alter table tabE set tblproperties("EXTERNAL"="TRUE");
OK
Time taken: 0.167 seconds
hive (mydb)> truncate table tabE;
FAILED: SemanticException [Error 10146]: Cannot truncate non-managed table tabE.

Note that truncate only works on internal tables; applying it to an external table raises an error.

In summary:

Ways to import data:

load data

insert

create table … as select …

import table

Ways to export data:

insert overwrite … directory …

hdfs dfs -get

hive -e "select …" > file

export table …

Beyond these, external tools such as Sqoop and DataX can also be used to move data in and out of Hive.

6. HQL Operations: DQL Commands

DQL (Data Query Language) is the heart of HQL.

When writing SQL, keep the following in mind:

SQL is case-insensitive;

a statement can be written on one line (simple SQL) or across several lines (complex SQL);

keywords cannot be abbreviated or split across lines;

clauses should generally go on separate lines;

indentation improves readability.
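For instance, a short query over the emp table created below, laid out according to these conventions (the filter value is illustrative):

select   deptno,
         avg(sal) as avg_sal
from     emp
where    job <> 'PRESIDENT'
group by deptno;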

(1) Simple queries

First prepare the data file, vim /home/hadoop/data/emp.dat, with the following content:

7369,SMITH,CLERK,7902,-12-17,800,,20
7499,ALLEN,SALESMAN,7698,-02-20,1600,300,30
7521,WARD,SALESMAN,7698,-02-22,1250,500,30
7566,JONES,MANAGER,7839,-04-02,2975,,20
7654,MARTIN,SALESMAN,7698,-09-28,1250,1400,30
7698,BLAKE,MANAGER,7839,-05-01,2850,,30
7782,CLARK,MANAGER,7839,-06-09,2450,,10
7788,SCOTT,ANALYST,7566,-07-13,3000,,20
7839,KING,PRESIDENT,,-11-07,5000,,10
7844,TURNER,SALESMAN,7698,-09-08,1500,0,30
7876,ADAMS,CLERK,7788,-07-13,1100,,20
7900,JAMES,CLERK,7698,-12-03,950,,30
7902,FORD,ANALYST,7566,-12-03,3000,,20
7934,MILLER,CLERK,7782,-01-23,1300,,10

Then create the table and load the data:

hive (mydb)> CREATE TABLE emp(
           > empno int,
           > ename string,
           > job string,
           > mgr int,
           > hiredate DATE,
           > sal int,
           > comm int,
           > deptno int
           > )row format delimited fields terminated by ",";
OK
Time taken: 0.179 seconds
hive (mydb)> load data local inpath '/home/hadoop/data/emp.dat' into table emp;
Loading data to table mydb.emp
OK
Time taken: 0.712 seconds
hive (mydb)> select * from emp;
OK
emp.empno  emp.ename  emp.job  emp.mgr  emp.hiredate  emp.sal  emp.comm  emp.deptno
7369  SMITH   CLERK      7902  -12-17  800   NULL  20
7499  ALLEN   SALESMAN   7698  -02-20  1600  300   30
7521  WARD    SALESMAN   7698  -02-22  1250  500   30
7566  JONES   MANAGER    7839  -04-02  2975  NULL  20
7654  MARTIN  SALESMAN   7698  -09-28  1250  1400  30
7698  BLAKE   MANAGER    7839  -05-01  2850  NULL  30
7782  CLARK   MANAGER    7839  -06-09  2450  NULL  10
7788  SCOTT   ANALYST    7566  -07-13  3000  NULL  20
7839  KING    PRESIDENT  NULL  -11-07  5000  NULL  10
7844  TURNER  SALESMAN   7698  -09-08  1500  0     30
7876  ADAMS   CLERK      7788  -07-13  1100  NULL  20
7900  JAMES   CLERK      7698  -12-03  950   NULL  30
7902  FORD    ANALYST    7566  -12-03  3000  NULL  20
7934  MILLER  CLERK      7782  -01-23  1300  NULL  10
Time taken: 0.329 seconds, Fetched: 14 row(s)

Now run some simple queries:

-- query without a FROM clause
hive (mydb)> select 123 * 321;
OK
_c0
39483
Time taken: 0.139 seconds, Fetched: 1 row(s)
hive (mydb)> select current_date;
OK
_c0
2021-09-24
Time taken: 0.113 seconds, Fetched: 1 row(s)
-- column aliases
hive (mydb)> select 123 * 321 as pro;
OK
pro
39483
Time taken: 0.125 seconds, Fetched: 1 row(s)
hive (mydb)> select current_date curdate;
OK
curdate
2021-09-24
Time taken: 0.124 seconds, Fetched: 1 row(s)
-- full-table query
hive (mydb)> select * from emp;
OK
emp.empno  emp.ename  emp.job  emp.mgr  emp.hiredate  emp.sal  emp.comm  emp.deptno
7369  SMITH   CLERK      7902  -12-17  800   NULL  20
7499  ALLEN   SALESMAN   7698  -02-20  1600  300   30
7521  WARD    SALESMAN   7698  -02-22  1250  500   30
7566  JONES   MANAGER    7839  -04-02  2975  NULL  20
7654  MARTIN  SALESMAN   7698  -09-28  1250  1400  30
7698  BLAKE   MANAGER    7839  -05-01  2850  NULL  30
7782  CLARK   MANAGER    7839  -06-09  2450  NULL  10
7788  SCOTT   ANALYST    7566  -07-13  3000  NULL  20
7839  KING    PRESIDENT  NULL  -11-07  5000  NULL  10
7844  TURNER  SALESMAN   7698  -09-08  1500  0     30
7876  ADAMS   CLERK      7788  -07-13  1100  NULL  20
7900  JAMES   CLERK      7698  -12-03  950   NULL  30
7902  FORD    ANALYST    7566  -12-03  3000  NULL  20
7934  MILLER  CLERK      7782  -01-23  1300  NULL  10
Time taken: 0.277 seconds, Fetched: 14 row(s)
-- select specific columns
hive (mydb)> select ename, sal, comm from emp;
OK
ename  sal  comm
SMITH   800   NULL
ALLEN   1600  300
WARD    1250  500
JONES   2975  NULL
MARTIN  1250  1400
BLAKE   2850  NULL
CLARK   2450  NULL
SCOTT   3000  NULL
KING    5000  NULL
TURNER  1500  0
ADAMS   1100  NULL
JAMES   950   NULL
FORD    3000  NULL
MILLER  1300  NULL
Time taken: 0.172 seconds, Fetched: 14 row(s)
-- aggregate functions
hive (mydb)> select count(*) from emp;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
_c0
14
Time taken: 6.238 seconds, Fetched: 1 row(s)
hive (mydb)> select count(1) from emp;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
_c0
14
Time taken: 1.823 seconds, Fetched: 1 row(s)
hive (mydb)> select count(empno) from emp;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
_c0
14
Time taken: 1.697 seconds, Fetched: 1 row(s)
hive (mydb)> select count(comm) from emp;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
_c0
4
Time taken: 2.079 seconds, Fetched: 1 row(s)
hive (mydb)> select sum(sal) from emp;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
_c0
29025
Time taken: 1.746 seconds, Fetched: 1 row(s)
hive (mydb)> select max(sal) from emp;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
_c0
5000
Time taken: 1.737 seconds, Fetched: 1 row(s)
hive (mydb)> select min(sal) from emp;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
_c0
800
Time taken: 1.789 seconds, Fetched: 1 row(s)
hive (mydb)> select avg(sal) from emp;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
_c0
2073.214285714286
Time taken: 1.629 seconds, Fetched: 1 row(s)
-- limit the number of rows returned
hive (mydb)> select * from emp limit 3;
OK
emp.empno  emp.ename  emp.job  emp.mgr  emp.hiredate  emp.sal  emp.comm  emp.deptno
7369  SMITH  CLERK     7902  -12-17  800   NULL  20
7499  ALLEN  SALESMAN  7698  -02-20  1600  300   30
7521  WARD   SALESMAN  7698  -02-22  1250  500   30
Time taken: 0.338 seconds, Fetched: 3 row(s)

Note that count(column) does not count NULL values, so when counting rows, pass * (or a constant such as 1) rather than a specific column.

(2) Simple clauses

The WHERE clause immediately follows FROM and filters out rows that do not satisfy the condition;

column aliases cannot be used in the WHERE clause.

Basic WHERE usage:

hive (mydb)> select ename name, sal from emp;
OK
name  sal
SMITH   800
ALLEN   1600
WARD    1250
JONES   2975
MARTIN  1250
BLAKE   2850
CLARK   2450
SCOTT   3000
KING    5000
TURNER  1500
ADAMS   1100
JAMES   950
FORD    3000
MILLER  1300
Time taken: 0.145 seconds, Fetched: 14 row(s)
hive (mydb)> select ename name, sal from emp where length(ename)=5;
OK
name  sal
SMITH  800
ALLEN  1600
JONES  2975
BLAKE  2850
CLARK  2450
SCOTT  3000
ADAMS  1100
JAMES  950
Time taken: 0.183 seconds, Fetched: 8 row(s)
hive (mydb)> select ename name, sal from emp where length(name)=5;
FAILED: SemanticException [Error 10004]: Line 1:45 Invalid table alias or column reference 'name': (possible column names are: empno, ename, job, mgr, hiredate, sal, comm, deptno)
hive (mydb)> select ename, sal from emp where sal > 2000;
OK
ename  sal
JONES  2975
BLAKE  2850
CLARK  2450
SCOTT  3000
KING   5000
FORD   3000
Time taken: 0.333 seconds, Fetched: 6 row(s)

As shown, a column alias cannot be used as a WHERE condition.

WHERE clauses involve a good deal of comparison and logical operators.

Common comparison operators include =, ==, <=> (NULL-safe equality), <> and !=, <, <=, >, >=, between ... and ..., in (...), is null / is not null, like, and rlike (regular-expression match).

The full operator reference is in the official documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF.

Comparison operators in action:

hive (mydb)> select sal, comm, sal+comm from emp;
OK
sal  comm  _c2
800   NULL  NULL
1600  300   1900
1250  500   1750
...
3000  NULL  NULL
1300  NULL  NULL
Time taken: 0.17 seconds, Fetched: 14 row(s)
hive (mydb)> select * from emp where comm != NULL;
OK
emp.empno  emp.ename  emp.job  emp.mgr  emp.hiredate  emp.sal  emp.comm  emp.deptno
Time taken: 0.246 seconds
hive (mydb)> select * from emp where comm is not NULL;
OK
emp.empno  emp.ename  emp.job  emp.mgr  emp.hiredate  emp.sal  emp.comm  emp.deptno
7499  ALLEN   SALESMAN  7698  -02-20  1600  300   30
7521  WARD    SALESMAN  7698  -02-22  1250  500   30
7654  MARTIN  SALESMAN  7698  -09-28  1250  1400  30
7844  TURNER  SALESMAN  7698  -09-08  1500  0     30
Time taken: 0.193 seconds, Fetched: 4 row(s)
hive (mydb)> select * from emp where deptno in (20, 30);
OK
emp.empno  emp.ename  emp.job  emp.mgr  emp.hiredate  emp.sal  emp.comm  emp.deptno
7369  SMITH  CLERK     7902  -12-17  800   NULL  20
7499  ALLEN  SALESMAN  7698  -02-20  1600  300   30
...
7900  JAMES  CLERK     7698  -12-03  950   NULL  30
7902  FORD   ANALYST   7566  -12-03  3000  NULL  20
Time taken: 0.247 seconds, Fetched: 11 row(s)
hive (mydb)> select * from emp where ename like 'S%';
OK
emp.empno  emp.ename  emp.job  emp.mgr  emp.hiredate  emp.sal  emp.comm  emp.deptno
7369  SMITH  CLERK    7902  -12-17  800   NULL  20
7788  SCOTT  ANALYST  7566  -07-13  3000  NULL  20
Time taken: 0.39 seconds, Fetched: 2 row(s)
hive (mydb)> select * from emp where ename like '%S';
OK
emp.empno  emp.ename  emp.job  emp.mgr  emp.hiredate  emp.sal  emp.comm  emp.deptno
7566  JONES  MANAGER  7839  -04-02  2975  NULL  20
7876  ADAMS  CLERK    7788  -07-13  1100  NULL  20
7900  JAMES  CLERK    7698  -12-03  950   NULL  30
Time taken: 0.165 seconds, Fetched: 3 row(s)
hive (mydb)> select * from emp where ename like '%S%';
OK
emp.empno  emp.ename  emp.job  emp.mgr  emp.hiredate  emp.sal  emp.comm  emp.deptno
7369  SMITH  CLERK    7902  -12-17  800   NULL  20
7566  JONES  MANAGER  7839  -04-02  2975  NULL  20
7788  SCOTT  ANALYST  7566  -07-13  3000  NULL  20
7876  ADAMS  CLERK    7788  -07-13  1100  NULL  20
7900  JAMES  CLERK    7698  -12-03  950   NULL  30
Time taken: 0.098 seconds, Fetched: 5 row(s)
hive (mydb)> select * from emp where sal between 1000 and 2000;
OK
emp.empno  emp.ename  emp.job  emp.mgr  emp.hiredate  emp.sal  emp.comm  emp.deptno
7499  ALLEN   SALESMAN  7698  -02-20  1600  300   30
7521  WARD    SALESMAN  7698  -02-22  1250  500   30
7654  MARTIN  SALESMAN  7698  -09-28  1250  1400  30
7844  TURNER  SALESMAN  7698  -09-08  1500  0     30
7876  ADAMS   CLERK     7788  -07-13  1100  NULL  20
7934  MILLER  CLERK     7782  -01-23  1300  NULL  10
Time taken: 0.855 seconds, Fetched: 6 row(s)
hive (mydb)> select * from emp where ename like 'S%' or ename like '%S';
OK
emp.empno  emp.ename  emp.job  emp.mgr  emp.hiredate  emp.sal  emp.comm  emp.deptno
7369  SMITH  CLERK    7902  -12-17  800   NULL  20
7566  JONES  MANAGER  7839  -04-02  2975  NULL  20
7788  SCOTT  ANALYST  7566  -07-13  3000  NULL  20
7876  ADAMS  CLERK    7788  -07-13  1100  NULL  20
7900  JAMES  CLERK    7698  -12-03  950   NULL  30
Time taken: 0.151 seconds, Fetched: 5 row(s)
hive (mydb)> select * from emp where ename rlike '^S.*|.*S$';
OK
emp.empno  emp.ename  emp.job  emp.mgr  emp.hiredate  emp.sal  emp.comm  emp.deptno
7369  SMITH  CLERK    7902  -12-17  800   NULL  20
7566  JONES  MANAGER  7839  -04-02  2975  NULL  20
7788  SCOTT  ANALYST  7566  -07-13  3000  NULL  20
7876  ADAMS  CLERK    7788  -07-13  1100  NULL  20
7900  JAMES  CLERK    7698  -12-03  950   NULL  30
Time taken: 0.152 seconds, Fetched: 5 row(s)
hive (mydb)> select null=null;
OK
_c0
NULL
Time taken: 0.078 seconds, Fetched: 1 row(s)
hive (mydb)> select null==null;
OK
_c0
NULL
Time taken: 0.074 seconds, Fetched: 1 row(s)
hive (mydb)> select null<=>null;
OK
_c0
true
Time taken: 0.068 seconds, Fetched: 1 row(s)
hive (mydb)> select null is null;
OK
_c0
true
Time taken: 0.066 seconds, Fetched: 1 row(s)

As shown, when NULL takes part in an expression the result is generally NULL, and testing whether a column is (not) NULL must use is / is not rather than =, otherwise the result is wrong;

null<=>null and null is null behave the same, both returning true.
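Because NULL propagates through arithmetic, nullable columns are usually wrapped in nvl() (or coalesce()) before computing, as the salary queries in the order by section below do:

-- comm is NULL for most employees; nvl substitutes 0 so the addition stays numeric
select ename, sal + nvl(comm, 0) as total_pay from emp;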

The logical operators are and, or, and not.

(3) The group by clause

GROUP BY is usually combined with aggregate functions: rows are grouped by one or more columns, and an aggregate is computed per group.

Usage:

hive (mydb)> select avg(sal)
           > from emp
           > group by deptno;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
_c0
2916.6666666666665
2175.0
1566.6666666666667
Time taken: 2.428 seconds, Fetched: 3 row(s)
-- average salary per department
hive (mydb)> select deptno, avg(sal)
           > from emp
           > group by deptno;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
deptno  _c1
10  2916.6666666666665
20  2175.0
30  1566.6666666666667
Time taken: 1.921 seconds, Fetched: 3 row(s)
-- highest salary per department and job
hive (mydb)> select deptno, job, max(sal)
           > from emp
           > group by deptno, job;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
deptno  job  _c2
20  ANALYST    3000
10  CLERK      1300
20  CLERK      1100
30  CLERK      950
10  MANAGER    2450
20  MANAGER    2975
30  MANAGER    2850
10  PRESIDENT  5000
30  SALESMAN   1600
Time taken: 1.587 seconds, Fetched: 9 row(s)
hive (mydb)> select deptno, max(sal)
           > from emp
           > group by deptno;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
deptno  _c1
10  5000
20  3000
30  2850
Time taken: 1.719 seconds, Fetched: 3 row(s)
-- departments whose average salary exceeds 2000
hive (mydb)> select deptno, avg(sal) avgsal
           > from emp
           > group by deptno
           > having avgsal > 2000;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
deptno  avgsal
10  2916.6666666666665
20  2175.0
Time taken: 1.881 seconds, Fetched: 2 row(s)

Comparing where and having:

where filters the rows of the table; having filters the query result (after grouping);

where cannot contain aggregate functions; having can;

having is normally used only after group by. An example contrast follows this list.
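The valid statement below is taken from the session above; the rejected one shows why the aggregate must move to HAVING:

-- Rejected: an aggregate cannot appear in WHERE
-- select deptno, avg(sal) from emp where avg(sal) > 2000 group by deptno;
-- Valid: filter the groups with HAVING instead
select deptno, avg(sal) avgsal from emp group by deptno having avgsal > 2000;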

(4) Joins

Hive supports the usual SQL JOIN syntax, but by default only equi-joins; non-equi-joins are not supported.

Table aliases are common in JOIN statements: aliases keep the SQL short, and prefixing columns with the table name helps the parser resolve them efficiently.

Join queries fall into two broad categories, inner joins and outer joins; outer joins further divide into three kinds:

inner join: [inner] join

outer join: outer join

    - left outer join: left [outer] join, which keeps every row of the left table

    - right outer join: right [outer] join, which keeps every row of the right table

    - full outer join: full [outer] join, which keeps the rows of both tables


First prepare the data: vim /home/hadoop/data/u1.txt with:

1,a
2,b
3,c
4,d
5,e
6,f

and vim /home/hadoop/data/u2.txt with:

4,d
5,e
6,f
7,g
8,h
9,i

Create the tables and load the data:

hive (mydb)> create table if not exists u1(
           > id int,
           > name string)
           > row format delimited fields terminated by ',';
OK
Time taken: 0.823 seconds
hive (mydb)> create table if not exists u2(
           > id int,
           > name string)
           > row format delimited fields terminated by ',';
OK
Time taken: 0.143 seconds
hive (mydb)> load data local inpath '/home/hadoop/data/u1.txt' into table u1;
Loading data to table mydb.u1
OK
Time taken: 0.949 seconds
hive (mydb)> load data local inpath '/home/hadoop/data/u2.txt' into table u2;
Loading data to table mydb.u2
OK
Time taken: 0.773 seconds
hive (mydb)> select * from u1;
OK
u1.id  u1.name
1  a
2  b
3  c
4  d
5  e
6  f
Time taken: 1.587 seconds, Fetched: 6 row(s)
hive (mydb)> select * from u2;
OK
u2.id  u2.name
4  d
5  e
6  f
7  g
8  h
9  i
Time taken: 0.205 seconds, Fetched: 6 row(s)

Now test the four join types:

-- inner join
hive (mydb)> select * from u1 join u2 on u1.id = u2.id;
Automatically selecting local only mode for query
...
2021-09-25 01:40:36 Uploaded 1 File to: file:/tmp/root/8c73cb74-4529-401a-a664-93c1ea29f8d0/hive_2021-09-25_01-40-19_146_1199023122301087997-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile00--.hashtable (386 bytes)
2021-09-25 01:40:36 End of local task; Time Taken: 2.182 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2021-09-25 01:40:40,275 Stage-3 map = 100%, reduce = 0%
Ended Job = job_local250415828_0001
MapReduce Jobs Launched:
Stage-Stage-3: HDFS Read: 72 HDFS Write: 195 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
u1.id  u1.name  u2.id  u2.name
4  d  4  d
5  e  5  e
6  f  6  f
Time taken: 21.206 seconds, Fetched: 3 row(s)
-- left outer join
hive (mydb)> select * from u1 left join u2 on u1.id = u2.id;
Automatically selecting local only mode for query
...
2021-09-25 01:41:27 Uploaded 1 File to: file:/tmp/root/8c73cb74-4529-401a-a664-93c1ea29f8d0/hive_2021-09-25_01-41-11_852_942067958094048095-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile11--.hashtable (386 bytes)
2021-09-25 01:41:27 End of local task; Time Taken: 1.788 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2021-09-25 01:41:30,494 Stage-3 map = 100%, reduce = 0%
Ended Job = job_local794342130_0002
MapReduce Jobs Launched:
Stage-Stage-3: HDFS Read: 243 HDFS Write: 408 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
u1.id  u1.name  u2.id  u2.name
1  a  NULL  NULL
2  b  NULL  NULL
3  c  NULL  NULL
4  d  4  d
5  e  5  e
6  f  6  f
Time taken: 18.726 seconds, Fetched: 6 row(s)
-- right outer join
hive (mydb)> select * from u1 right join u2 on u1.id = u2.id;
Automatically selecting local only mode for query
...
2021-09-25 01:41:55 Uploaded 1 File to: file:/tmp/root/8c73cb74-4529-401a-a664-93c1ea29f8d0/hive_2021-09-25_01-41-41_317_899584510973126689-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile20--.hashtable (386 bytes)
2021-09-25 01:41:55 End of local task; Time Taken: 1.925 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2021-09-25 01:41:58,864 Stage-3 map = 100%, reduce = 0%
Ended Job = job_local644970176_0003
MapReduce Jobs Launched:
Stage-Stage-3: HDFS Read: 480 HDFS Write: 621 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
u1.id  u1.name  u2.id  u2.name
4  d  4  d
5  e  5  e
6  f  6  f
NULL  NULL  7  g
NULL  NULL  8  h
NULL  NULL  9  i
Time taken: 17.577 seconds, Fetched: 6 row(s)
-- full outer join
hive (mydb)> select * from u1 full join u2 on u1.id = u2.id;
Automatically selecting local only mode for query
...
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2021-09-25 01:42:16,604 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local905853098_0004
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 2199 HDFS Write: 2142 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
u1.id  u1.name  u2.id  u2.name
1  a  NULL  NULL
2  b  NULL  NULL
3  c  NULL  NULL
4  d  4  d
5  e  5  e
6  f  6  f
NULL  NULL  7  g
NULL  NULL  8  h
NULL  NULL  9  i
Time taken: 1.802 seconds, Fetched: 9 row(s)

Beyond two tables, multi-table joins are also possible;

joining n tables requires at least n-1 join conditions, e.g. joining four tables needs at least three.

For example, a multi-table join that finds each teacher's courses, the scores for those courses, and the students behind the scores:

select *
from techer t
left join course c on t.t_id = c.t_id
left join score s on s.c_id = c.c_id
left join student stu on s.s_id = stu.s_id;

Hive always evaluates joins from left to right and launches one MapReduce job per pair of join objects.

In the example above, a first MapReduce job joins table t with table c; a second job joins that output with table s; a third joins the second job's output with table stu, for three MapReduce jobs in total.

Each additional join therefore costs an extra MapReduce stage and its resources.
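The stage plan can be confirmed with EXPLAIN; a sketch against the same hypothetical tables (the exact output format varies by Hive version):

explain
select *
from techer t
left join course c on t.t_id = c.t_id
left join score s on s.c_id = c.c_id
left join student stu on s.s_id = stu.s_id;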

Hive can also produce Cartesian products. A Cartesian product arises when:

there is no join condition;

the join condition is invalid;

every row of every table is joined with every other row.

If tables A and B contain M and N rows respectively, their Cartesian product has M*N rows. By default, Hive refuses to run Cartesian products; the parameter hive.strict.checks.cartesian.product=false must be set to allow them.

Usage:

hive (mydb)> select * from u1, u2;
FAILED: SemanticException Cartesian products are disabled for safety reasons. If you know what you are doing, please set hive.strict.checks.cartesian.product to false and that hive.mapred.mode is not set to 'strict' to proceed. Note that if you may get errors or incorrect results if you make a mistake while using some of the unsafe features.
hive (mydb)> set hive.strict.checks.cartesian.product;
hive.strict.checks.cartesian.product=true
hive (mydb)> set hive.strict.checks.cartesian.product=false;
hive (mydb)> select * from u1, u2;
Warning: Map Join MAPJOIN[9][bigTable=?] in task 'Stage-3:MAPRED' is a cross product
Automatically selecting local only mode for query
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_0925015241_c38361bc-8bd1-4473-8e2c-ec9479516299
Total jobs = 1
2021-09-25 01:52:55 Starting to launch local task to process map join; maximum memory = 518979584
2021-09-25 01:52:57 Dump the side-table for tag: 0 with group count: 1 into file: file:/tmp/root/8c73cb74-4529-401a-a664-93c1ea29f8d0/hive_2021-09-25_01-52-41_759_493077905194279454-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile30--.hashtable
2021-09-25 01:52:57 Uploaded 1 File to: file:/tmp/root/8c73cb74-4529-401a-a664-93c1ea29f8d0/hive_2021-09-25_01-52-41_759_493077905194279454-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile30--.hashtable (320 bytes)
2021-09-25 01:52:57 End of local task; Time Taken: 1.549 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2021-09-25 01:53:00,570 Stage-3 map = 100%, reduce = 0%
Ended Job = job_local1556109485_0005
MapReduce Jobs Launched:
Stage-Stage-3: HDFS Read: 1044 HDFS Write: 1707 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
u1.id  u1.name  u2.id  u2.name
1  a  4  d
2  b  4  d
3  c  4  d
4  d  4  d
5  e  4  d
6  f  4  d
1  a  5  e
2  b  5  e
3  c  5  e
4  d  5  e
5  e  5  e
6  f  5  e
1  a  6  f
2  b  6  f
3  c  6  f
4  d  6  f
5  e  6  f
6  f  6  f
1  a  7  g
2  b  7  g
3  c  7  g
4  d  7  g
5  e  7  g
6  f  7  g
1  a  8  h
2  b  8  h
3  c  8  h
4  d  8  h
5  e  8  h
6  f  8  h
1  a  9  i
2  b  9  i
3  c  9  i
4  d  9  i
5  e  9  i
6  f  9  i
Time taken: 18.844 seconds, Fetched: 36 row(s)
hive (mydb)> select count(*) from u1, u2;
Warning: Map Join MAPJOIN[15][bigTable=?] in task 'Stage-2:MAPRED' is a cross product
Automatically selecting local only mode for query
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_0925015310_41ce7307-4ea7-41f3-8c40-6c1927d3feb7
Total jobs = 1
2021-09-25 01:53:25 Starting to launch local task to process map join; maximum memory = 518979584
2021-09-25 01:53:27 Dump the side-table for tag: 0 with group count: 1 into file: file:/tmp/root/8c73cb74-4529-401a-a664-93c1ea29f8d0/hive_2021-09-25_01-53-10_349_6581783439017176314-1/-local-10005/HashTable-Stage-2/MapJoin-mapfile40--.hashtable
2021-09-25 01:53:27 Uploaded 1 File to: file:/tmp/root/8c73cb74-4529-401a-a664-93c1ea29f8d0/hive_2021-09-25_01-53-10_349_6581783439017176314-1/-local-10005/HashTable-Stage-2/MapJoin-mapfile40--.hashtable (296 bytes)
2021-09-25 01:53:27 End of local task; Time Taken: 1.715 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2021-09-25 01:53:30,449 Stage-2 map = 100%, reduce = 100%
Ended Job = job_local480461869_0006
MapReduce Jobs Launched:
Stage-Stage-2: HDFS Read: 3750 HDFS Write: 3516 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
_c0
36
Time taken: 20.145 seconds, Fetched: 1 row(s)

(5) order by clause

Hive's order by clause differs somewhat from MySQL's.

The order by clause sorts the final result set and normally appears at the end of a select statement.

Sorting is ascending (ASC) by default; append DESC after a column name for descending order.

ORDER BY performs a global sort, so only one reduce task is used.

Usage:

hive (mydb)> select * from emp order by deptno;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7934 MILLER CLERK 7782 -01-23 1300 NULL 10
7839 KING PRESIDENT NULL -11-07 5000 NULL 10
7782 CLARK MANAGER 7839 -06-09 2450 NULL 10
7876 ADAMS CLERK 7788 -07-13 1100 NULL 20
7788 SCOTT ANALYST 7566 -07-13 3000 NULL 20
7369 SMITH CLERK 7902 -12-17 800 NULL 20
7566 JONES MANAGER 7839 -04-02 2975 NULL 20
7902 FORD ANALYST 7566 -12-03 3000 NULL 20
7844 TURNER SALESMAN 7698 -09-08 1500 0 30
7499 ALLEN SALESMAN 7698 -02-20 1600 300 30
7698 BLAKE MANAGER 7839 -05-01 2850 NULL 30
7654 MARTIN SALESMAN 7698 -09-28 1250 1400 30
7521 WARD SALESMAN 7698 -02-22 1250 500 30
7900 JAMES CLERK 7698 -12-03 950 NULL 30
Time taken: 7.211 seconds, Fetched: 14 row(s)
hive (mydb)> select empno, ename, job, mgr, sal+comm salsum, deptno
> from emp
> order by salsum desc;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
empno ename job mgr salsum deptno
7654 MARTIN SALESMAN 7698 2650 30
7499 ALLEN SALESMAN 7698 1900 30
7521 WARD SALESMAN 7698 1750 30
7844 TURNER SALESMAN 7698 1500 30
7934 MILLER CLERK 7782 NULL 10
7902 FORD ANALYST 7566 NULL 20
7900 JAMES CLERK 7698 NULL 30
7876 ADAMS CLERK 7788 NULL 20
7839 KING PRESIDENT NULL NULL 10
7788 SCOTT ANALYST 7566 NULL 20
7782 CLARK MANAGER 7839 NULL 10
7698 BLAKE MANAGER 7839 NULL 30
7566 JONES MANAGER 7839 NULL 20
7369 SMITH CLERK 7902 NULL 20
Time taken: 2.068 seconds, Fetched: 14 row(s)
hive (mydb)> select empno, ename, job, mgr, sal+nvl(comm, 0) salsum, deptno
> from emp
> order by salsum desc;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
empno ename job mgr salsum deptno
7839 KING PRESIDENT NULL 5000 10
7902 FORD ANALYST 7566 3000 20
7788 SCOTT ANALYST 7566 3000 20
7566 JONES MANAGER 7839 2975 20
7698 BLAKE MANAGER 7839 2850 30
7654 MARTIN SALESMAN 7698 2650 30
7782 CLARK MANAGER 7839 2450 10
7499 ALLEN SALESMAN 7698 1900 30
7521 WARD SALESMAN 7698 1750 30
7844 TURNER SALESMAN 7698 1500 30
7934 MILLER CLERK 7782 1300 10
7876 ADAMS CLERK 7788 1100 20
7900 JAMES CLERK 7698 950 30
7369 SMITH CLERK 7902 800 20
Time taken: 1.739 seconds, Fetched: 14 row(s)
hive (mydb)> select empno, ename, job, mgr, sal+nvl(comm, 0) salsum, deptno
> from emp
> order by deptno, salsum desc;
Automatically selecting local only mode for query
...
Total MapReduce CPU Time Spent: 0 msec
OK
empno ename job mgr salsum deptno
7839 KING PRESIDENT NULL 5000 10
7782 CLARK MANAGER 7839 2450 10
7934 MILLER CLERK 7782 1300 10
7788 SCOTT ANALYST 7566 3000 20
7902 FORD ANALYST 7566 3000 20
7566 JONES MANAGER 7839 2975 20
7876 ADAMS CLERK 7788 1100 20
7369 SMITH CLERK 7902 800 20
7698 BLAKE MANAGER 7839 2850 30
7654 MARTIN SALESMAN 7698 2650 30
7499 ALLEN SALESMAN 7698 1900 30
7521 WARD SALESMAN 7698 1750 30
7844 TURNER SALESMAN 7698 1500 30
7900 JAMES CLERK 7698 950 30
Time taken: 1.849 seconds, Fetched: 14 row(s)
hive (mydb)> select empno, ename, job, mgr, sal+nvl(comm, 0) salsum
> from emp
> order by deptno, salsum desc;
FAILED: SemanticException [Error 10004]: Line 3:9 Invalid table alias or column reference 'deptno': (possible column names are: empno, ename, job, mgr, salsum)
hive (mydb)>

Here the nvl function replaces a NULL field value with its second argument; it is commonly used when a nullable field takes part in arithmetic (sal+comm is NULL whenever comm is NULL, while sal+nvl(comm, 0) is not).

Also note that a column referenced in order by must appear in the select clause, otherwise the query cannot execute: the last statement above fails because order by references deptno while the select clause does not include it.
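The fix, as the fourth query above already demonstrates, is simply to include deptno in the select list so that the order by clause can resolve it:

select empno, ename, job, mgr, sal+nvl(comm, 0) salsum, deptno
from emp
order by deptno, salsum desc;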

(6) sort by

For large data sets, order by is inefficient.

In many business scenarios a globally ordered result is unnecessary and locally ordered data is enough; in such cases sort by can be used.

sort by produces one sorted output file per reducer: each reducer sorts its own data internally, yielding locally ordered results.

Usage:

-- set the number of reducers
hive (mydb)> set mapreduce.job.reduces;
mapreduce.job.reduces=-1
hive (mydb)> set mapreduce.job.reduces=2;
hive (mydb)> set mapreduce.job.reduces;
mapreduce.job.reduces=2

-- list employees by salary in descending order
hive (mydb)> select * from emp sort by sal desc;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
...
Total MapReduce CPU Time Spent: 7 seconds 860 msec
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7902 FORD ANALYST 7566 -12-03 3000 NULL 20
7788 SCOTT ANALYST 7566 -07-13 3000 NULL 20
7566 JONES MANAGER 7839 -04-02 2975 NULL 20
7844 TURNER SALESMAN 7698 -09-08 1500 0 30
7521 WARD SALESMAN 7698 -02-22 1250 500 30
7654 MARTIN SALESMAN 7698 -09-28 1250 1400 30
7876 ADAMS CLERK 7788 -07-13 1100 NULL 20
7900 JAMES CLERK 7698 -12-03 950 NULL 30
7369 SMITH CLERK 7902 -12-17 800 NULL 20
7839 KING PRESIDENT NULL -11-07 5000 NULL 10
7698 BLAKE MANAGER 7839 -05-01 2850 NULL 30
7782 CLARK MANAGER 7839 -06-09 2450 NULL 10
7499 ALLEN SALESMAN 7698 -02-20 1600 300 30
7934 MILLER CLERK 7782 -01-23 1300 NULL 10
Time taken: 42.303 seconds, Fetched: 14 row(s)

-- write the result (salary descending) to local files; two output files are produced, each internally sorted by salary in descending order
hive (mydb)> insert overwrite local directory '/home/hadoop/output/sortsal'
> select * from emp sort by sal desc;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
...
Total MapReduce CPU Time Spent: 9 seconds 50 msec
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
Time taken: 46.032 seconds
hive (mydb)>

After the job finishes, inspect the local directory:

[root@node03 ~]$ ll /home/hadoop/output/sortsal/
总用量 8
-rw-r--r-- 1 root root 411 9月 25 15:20 000000_0
-rw-r--r-- 1 root root 230 9月 25 15:20 000001_0
[root@node03 ~]$ cat -A /home/hadoop/output/sortsal/000000_0
7902^AFORD^AANALYST^A7566^A-12-03^A3000^A\N^A20$
7788^ASCOTT^AANALYST^A7566^A-07-13^A3000^A\N^A20$
7566^AJONES^AMANAGER^A7839^A-04-02^A2975^A\N^A20$
7844^ATURNER^ASALESMAN^A7698^A-09-08^A1500^A0^A30$
7521^AWARD^ASALESMAN^A7698^A-02-22^A1250^A500^A30$
7654^AMARTIN^ASALESMAN^A7698^A-09-28^A1250^A1400^A30$
7876^AADAMS^ACLERK^A7788^A-07-13^A1100^A\N^A20$
7900^AJAMES^ACLERK^A7698^A-12-03^A950^A\N^A30$
7369^ASMITH^ACLERK^A7902^A-12-17^A800^A\N^A20$
[root@node03 ~]$ cat -A /home/hadoop/output/sortsal/000001_0
7839^AKING^APRESIDENT^A\N^A-11-07^A5000^A\N^A10$
7698^ABLAKE^AMANAGER^A7839^A-05-01^A2850^A\N^A30$
7782^ACLARK^AMANAGER^A7839^A-06-09^A2450^A\N^A10$
7499^AALLEN^ASALESMAN^A7698^A-02-20^A1600^A300^A30$
7934^AMILLER^ACLERK^A7782^A-01-23^A1300^A\N^A10$
[root@node03 ~]$

As shown above, the default number of reducers (the value of mapreduce.job.reduces) is -1, which lets Hive compute the reducer count itself; with very small data it computes just one reducer, so to get multiple reducers the value must be set manually.

With more than one reducer, Hive can no longer run the query in local mode and falls back to a regular MapReduce job (note the much longer run times above).

Whether the result is printed to the console or written to files, the data is only locally ordered, i.e. ordered within each reducer's output.

(7) distribute by and cluster by

distribute by is used for partitioned sorting.

distribute by sends specific rows to specific reducers, which is convenient for subsequent aggregation and sorting.

distribute by is analogous to the partitioning step in MapReduce; it can be combined with sort by to make each partition's data ordered, and when combined, distribute by must be written before sort by.

Usage:

-- start 2 reduce tasks
hive (mydb)> set mapreduce.job.reduces=2;
-- partition by deptno first, then sort by sal+comm within each partition, write the result to files, and observe the output
hive (mydb)> insert overwrite local directory '/home/hadoop/output/distBy'
> select empno, ename, deptno, job, sal+nvl(comm, 0) salsum
> from emp
> distribute by deptno
> sort by salsum desc;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
...
Total MapReduce CPU Time Spent: 6 seconds 310 msec
OK
empno ename deptno job salsum
Time taken: 39.482 seconds
hive (mydb)> select distinct deptno from emp;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
...
Total MapReduce CPU Time Spent: 7 seconds 730 msec
OK
deptno
10
20
30
Time taken: 38.173 seconds, Fetched: 3 row(s)
-- start 3 reduce tasks so the data is spread over 3 partitions
hive (mydb)> set mapreduce.job.reduces=3;
hive (mydb)> insert overwrite local directory '/home/hadoop/output/distby'
> select empno, ename, deptno, job, sal+nvl(comm, 0) salsum
> from emp
> distribute by deptno
> sort by salsum desc;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
...
Total MapReduce CPU Time Spent: 10 seconds 890 msec
OK
empno ename deptno job salsum
Time taken: 36.433 seconds
hive (mydb)>

Inspect the local output:

# 2 reduce tasks
[root@node03 ~]$ ll /home/hadoop/output/distBy/
总用量 4
-rw-r--r-- 1 root root 374 9月 25 15:34 000000_0
-rw-r--r-- 1 root root 0 9月 25 15:34 000001_0
[root@node03 ~]$ cat -A /home/hadoop/output/distBy/000000_0
7839^AKING^A10^APRESIDENT^A5000$
7902^AFORD^A20^AANALYST^A3000$
7788^ASCOTT^A20^AANALYST^A3000$
7566^AJONES^A20^AMANAGER^A2975$
7698^ABLAKE^A30^AMANAGER^A2850$
7654^AMARTIN^A30^ASALESMAN^A2650$
7782^ACLARK^A10^AMANAGER^A2450$
7499^AALLEN^A30^ASALESMAN^A1900$
7521^AWARD^A30^ASALESMAN^A1750$
7844^ATURNER^A30^ASALESMAN^A1500$
7934^AMILLER^A10^ACLERK^A1300$
7876^AADAMS^A20^ACLERK^A1100$
7900^AJAMES^A30^ACLERK^A950$
7369^ASMITH^A20^ACLERK^A800$
[root@node03 ~]$ cat -A /home/hadoop/output/distBy/000001_0

# 3 reduce tasks
[root@node03 ~]$ ll /home/hadoop/output/distby/
总用量 12
-rw-r--r-- 1 root root 164 9月 25 15:42 000000_0
-rw-r--r-- 1 root root 81 9月 25 15:42 000001_0
-rw-r--r-- 1 root root 129 9月 25 15:42 000002_0
[root@node03 ~]$ cat -A /home/hadoop/output/distby/000000_0
7698^ABLAKE^A30^AMANAGER^A2850$
7654^AMARTIN^A30^ASALESMAN^A2650$
7499^AALLEN^A30^ASALESMAN^A1900$
7521^AWARD^A30^ASALESMAN^A1750$
7844^ATURNER^A30^ASALESMAN^A1500$
7900^AJAMES^A30^ACLERK^A950$
[root@node03 ~]$ cat -A /home/hadoop/output/distby/000001_0
7839^AKING^A10^APRESIDENT^A5000$
7782^ACLARK^A10^AMANAGER^A2450$
7934^AMILLER^A10^ACLERK^A1300$
[root@node03 ~]$ cat -A /home/hadoop/output/distby/000002_0
7788^ASCOTT^A20^AANALYST^A3000$
7902^AFORD^A20^AANALYST^A3000$
7566^AJONES^A20^AMANAGER^A2975$
7876^AADAMS^A20^ACLERK^A1100$
7369^ASMITH^A20^ACLERK^A800$
[root@node03 ~]$

As you can see, the partition rule is partitionField.hashCode % numPartitions. The deptno values are 10, 20 and 30, so with 2 partitions every computed partition number is 0 (10 % 2 = 20 % 2 = 30 % 2 = 0); that is why with 2 reducers all rows end up in the first output file 000000_0, while with 3 reducers the rows are spread across different files.
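You can check this mapping directly in HQL (a sketch; for Hive's integer types the hash code is just the value itself):

-- reducer index each deptno maps to with 2 and with 3 reducers
select distinct deptno,
       deptno % 2 as reducer_of_2,   -- 10, 20, 30 all map to 0
       deptno % 3 as reducer_of_3    -- 30 -> 0, 10 -> 1, 20 -> 2
from emp;

This matches the file listing above: with 3 reducers, deptno 30 lands in 000000_0, deptno 10 in 000001_0, and deptno 20 in 000002_0.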

When distribute by and sort by use the same column, cluster by can be used as a shorthand.

cluster by sorts in ascending order only; no ordering rule can be specified.

Usage:

hive (mydb)> select * from emp distribute by deptno sort by deptno;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
...
Total MapReduce CPU Time Spent: 10 seconds 130 msec
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7654 MARTIN SALESMAN 7698 -09-28 1250 1400 30
7900 JAMES CLERK 7698 -12-03 950 NULL 30
7698 BLAKE MANAGER 7839 -05-01 2850 NULL 30
7521 WARD SALESMAN 7698 -02-22 1250 500 30
7844 TURNER SALESMAN 7698 -09-08 1500 0 30
7499 ALLEN SALESMAN 7698 -02-20 1600 300 30
7934 MILLER CLERK 7782 -01-23 1300 NULL 10
7839 KING PRESIDENT NULL -11-07 5000 NULL 10
7782 CLARK MANAGER 7839 -06-09 2450 NULL 10
7788 SCOTT ANALYST 7566 -07-13 3000 NULL 20
7566 JONES MANAGER 7839 -04-02 2975 NULL 20
7876 ADAMS CLERK 7788 -07-13 1100 NULL 20
7902 FORD ANALYST 7566 -12-03 3000 NULL 20
7369 SMITH CLERK 7902 -12-17 800 NULL 20
Time taken: 43.018 seconds, Fetched: 14 row(s)
hive (mydb)> select * from emp cluster by deptno;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
...
Total MapReduce CPU Time Spent: 8 seconds 650 msec
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7654 MARTIN SALESMAN 7698 -09-28 1250 1400 30
7900 JAMES CLERK 7698 -12-03 950 NULL 30
7698 BLAKE MANAGER 7839 -05-01 2850 NULL 30
7521 WARD SALESMAN 7698 -02-22 1250 500 30
7844 TURNER SALESMAN 7698 -09-08 1500 0 30
7499 ALLEN SALESMAN 7698 -02-20 1600 300 30
7934 MILLER CLERK 7782 -01-23 1300 NULL 10
7839 KING PRESIDENT NULL -11-07 5000 NULL 10
7782 CLARK MANAGER 7839 -06-09 2450 NULL 10
7788 SCOTT ANALYST 7566 -07-13 3000 NULL 20
7566 JONES MANAGER 7839 -04-02 2975 NULL 20
7876 ADAMS CLERK 7788 -07-13 1100 NULL 20
7902 FORD ANALYST 7566 -12-03 3000 NULL 20
7369 SMITH CLERK 7902 -12-17 800 NULL 20
Time taken: 36.315 seconds, Fetched: 14 row(s)
hive (mydb)>

As you can see, the two statements return the same result; sorting by deptno alone, however, has little practical value here.

To summarize the sorting clauses (see the recap sketch after this list):

order by: performs a global sort; it is inefficient and should be used with caution in production

sort by: makes data locally ordered (ordered within each reducer)

distribute by: partitions data by the given expression; commonly combined with sort by to make each partition ordered

cluster by: a shorthand usable when distribute by and sort by share the same column
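As a compact recap, a sketch reusing the emp table from above:

select * from emp order by sal desc;  -- global order, single reducer
select * from emp sort by sal desc;  -- ordered within each reducer only
select * from emp distribute by deptno sort by sal desc;  -- rows grouped by deptno, then sorted within each reducer
select * from emp cluster by deptno;  -- same as distribute by deptno sort by deptno (ascending only)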
