Hive insert into specified fields: adding column storage to Hive

Time: 2020-05-14 22:26:10

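Adding column storage to Hive is probably my first fairly large experiment with Hive.

So far, INSERT into column tables is implemented.

Over this past week my understanding of Hive has deepened considerably.

Without further ado, here is an introduction to the column storage feature.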

1, Creating a column-store table

hive> create table sunwg(id int,name string,score int) organize by columns (id);

Time taken: 0.201 seconds

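Note: organize by columns marks sunwg as a column table, and the id that follows is the column table's primary key.

The primary key is what the sub-tables are joined on, so for now it has to be specified explicitly.

I did consider having the system generate a key automatically, which may be necessary in some scenarios, but I have not yet found a good way to produce a unique key.

I will come back to this once I have worked it out.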

hive> show tables;

sunwg

sunwg_cf_key

test01

Time taken: 0.063 seconds

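As you can see, the system now contains not only the table sunwg but also another table, sunwg_cf_key.

The table sunwg_cf_key is the column table's primary-key table.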

hive> describe extended sunwg;

id int

name string

score int

Detailed Table Information

Table(tableName:sunwg, dbName:default, owner:hjl, createTime:1300924357, lastAccessTime:0, retention:0,

sd:StorageDescriptor(cols:[FieldSchema(name:id, type:int, comment:null), FieldSchema(name:name, type:string,

comment:null), FieldSchema(name:score, type:int, comment:null)],

location:hdfs://hadoop00:9000/hjl/sunwg, inputFormat:org.apache.hadoop.mapred.TextInputFormat,

outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1,

serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,

parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}), partitionKeys:[],

parameters:{COLUMNS=TRUE, transient_lastDdlTime=1300924357}, viewOriginalText:null, viewExpandedText:null,

tableType:COLUMNS_TABLE)

Time taken: 0.054 seconds

At the end we can see that the type of table sunwg is COLUMNS_TABLE. That is the name I gave to column tables.

2, Adding the column family name to the column table

hive> alter table sunwg add columnfamily name (name);

Time taken: 0.107 seconds

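Note: columnfamily is the keyword for column families, the first name is the column family's name, and the name in parentheses is the column that the family contains.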

hive> show tables;

sunwg

sunwg_cf_key

sunwg_cf_name

test01

Time taken: 0.037 seconds

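A column family is backed by an actual table underneath, here named sunwg_cf_name. For now, the relationship between a column table and its column families is determined from the table names alone.

That works, but it is somewhat crude. If it proves necessary, I will consider storing this mapping in the metadata later.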
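The name-based lookup described above can be sketched in a few lines of Python. This is only an illustration: the "<table>_cf_<family>" pattern is inferred from the table names in the transcript, and the helper names family_table_name and families_of are hypothetical, not part of the actual patch.

```python
# Hypothetical helpers. The "<table>_cf_<family>" pattern is an assumption
# inferred from the names sunwg_cf_key and sunwg_cf_name in the transcript.
CF_MARKER = "_cf_"


def family_table_name(table, family):
    """Name of the backing table for one column family."""
    return f"{table}{CF_MARKER}{family}"


def families_of(table, all_tables):
    """Recover the column families of a column table from table names only."""
    prefix = table + CF_MARKER
    return sorted(t[len(prefix):] for t in all_tables if t.startswith(prefix))


tables = ["sunwg", "sunwg_cf_key", "sunwg_cf_name", "test01"]
print(family_table_name("sunwg", "name"))  # sunwg_cf_name
print(families_of("sunwg", tables))        # ['key', 'name']
```

The weakness is visible in the sketch: any ordinary table that happens to be named sunwg_cf_something would be mistaken for a column family, which is one reason to keep the mapping in metadata instead.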

hive> describe sunwg_cf_name;

id int

name string

Time taken: 0.058 seconds

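Although only the single field name was declared for sunwg_cf_name, the primary key id is added to the column family automatically, because id is that important.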

3, Adding the column family score to the column table

This time I also declare id in the column family.

hive> alter table sunwg add columnfamily score (id, score);

Time taken: 0.103 seconds

hive> describe sunwg_cf_score;

id int

score int

Time taken: 0.052 seconds

Note that no extra id column was produced: when building a column family, I check the declared columns first.

This avoids needless waste.
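A minimal Python sketch of that check, assuming the rule is simply "put the primary key in front unless it was already declared". The function name family_columns is hypothetical.

```python
def family_columns(primary_key, declared):
    """Columns of a column family's backing table.

    The primary key is added in front only when the declaration does not
    already contain it, so it is never duplicated.
    """
    if primary_key in declared:
        return list(declared)
    return [primary_key] + list(declared)


print(family_columns("id", ["name"]))         # ['id', 'name']
print(family_columns("id", ["id", "score"]))  # ['id', 'score']
```

Both results match the describe output shown for the two column families.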

4, Inserting data into the column table sunwg

hive> select * from test01;

100 tom 90

101 mary 80

102  70

103 kate 60

104 jone 50

Time taken: 0.071 seconds

hive> insert overwrite table sunwg select id,name,score from test01;

Loading data to table sunwg_cf_score

Loading data to table sunwg_cf_name

Loading data to table sunwg_cf_key

5 Rows loaded to sunwg_cf_key

5 Rows loaded to sunwg_cf_score

4 Rows loaded to sunwg_cf_name

Time taken: 38.924 seconds

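The INSERT into sunwg is converted into INSERTs into the three backing tables sunwg_cf_score, sunwg_cf_name and sunwg_cf_key.

Note that sunwg_cf_name gained only 4 rows, one fewer than the other tables.

The missing row is the one with id 102: its name value is empty, so there is no need to store it in that table.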

hive> select * from sunwg_cf_name;

104 jone

100 tom

101 mary

103 kate

Time taken: 0.082 seconds
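The row counts above can be reproduced with a small in-memory simulation in Python. It is a sketch under assumptions, not the actual implementation: the key table is assumed to hold only the id, and a row is assumed to be skipped for a family when all of that family's non-key values are empty. The function name split_rows is hypothetical.

```python
def split_rows(rows, primary_key, families):
    """Distribute full rows over column families.

    rows: list of dicts. families: {family_name: [non-key columns]}.
    Assumption: a row is skipped for a family when all of that family's
    values are empty (None or ""). The key table always receives the key.
    """
    out = {"key": []}
    for family in families:
        out[family] = []
    for row in rows:
        out["key"].append((row[primary_key],))
        for family, cols in families.items():
            values = [row[c] for c in cols]
            if all(v is None or v == "" for v in values):
                continue  # nothing worth storing for this family
            out[family].append((row[primary_key], *values))
    return out


rows = [
    {"id": 100, "name": "tom", "score": 90},
    {"id": 101, "name": "mary", "score": 80},
    {"id": 102, "name": None, "score": 70},
    {"id": 103, "name": "kate", "score": 60},
    {"id": 104, "name": "jone", "score": 50},
]
result = split_rows(rows, "id", {"name": ["name"], "score": ["score"]})
print({family: len(r) for family, r in result.items()})
# {'key': 5, 'name': 4, 'score': 5}
```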

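The principle is fairly simple: an INSERT into the table sunwg is turned into inserts into its column families.

In practice, though, this takes a fair amount of work, because it requires:

1: matching each column family's columns to the columns in the SELECT list

2: non-null checks on the data inserted into each column family

3: merging WHERE conditions

4: merging GROUP conditions

...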
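One way to picture items 1 to 3 is to generate a standard Hive multi-table insert (FROM ... INSERT OVERWRITE TABLE ... SELECT ...) from the original statement. The Python sketch below does that. It is not the rewrite performed inside the patched Hive, which the post does not show: the function rewrite_insert and its parameters are hypothetical, columns are matched to SELECT expressions by position, the non-null check uses IS NOT NULL only (so empty strings are not filtered), and GROUP merging is left out.

```python
def rewrite_insert(table, table_cols, select_exprs, source,
                   primary_key, families, where=None):
    """Build one multi-table INSERT covering the key table and all families."""
    # 1: pair the target columns with the SELECT expressions by position.
    expr = dict(zip(table_cols, select_exprs))
    lines = [f"FROM {source}"]
    for family, cols in [("key", [])] + sorted(families.items()):
        non_key = [c for c in cols if c != primary_key]
        selected = ", ".join(expr[c] for c in [primary_key] + non_key)
        conditions = [f"({where})"] if where else []
        if non_key:
            # 2: keep a row only if the family has something to store.
            checks = " OR ".join(f"{expr[c]} IS NOT NULL" for c in non_key)
            conditions.append(f"({checks})")
        line = f"INSERT OVERWRITE TABLE {table}_cf_{family} SELECT {selected}"
        if conditions:
            # 3: merge the original WHERE with the non-null check.
            line += " WHERE " + " AND ".join(conditions)
        lines.append(line)
    return "\n".join(lines)


statement = rewrite_insert(
    "sunwg", ["id", "name", "score"], ["id", "name", "score"],
    "test01", "id", {"name": ["name"], "score": ["id", "score"]},
)
print(statement)
```

For the example in this post, the generated statement reads from test01 once and writes sunwg_cf_key, sunwg_cf_name and sunwg_cf_score, with a `name IS NOT NULL` filter on the name family.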

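With that, inserting into column tables is basically implemented.

Operating on a single column family directly may affect the other column families.

For example, if new ids are added to sunwg_cf_name, then sunwg_cf_key needs to be updated as well.

Whether these updates should be done automatically by the system or invoked manually by the user is something I have not worked out yet, so I am setting it aside for now.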
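Whichever side ends up triggering it, the update itself amounts to finding ids that exist in a column family but not in the key table. A minimal Python sketch, with a hypothetical function name and made-up ids 105 and 106 standing in for newly added rows:

```python
def missing_keys(key_table_ids, family_ids):
    """Ids present in a column family but absent from the key table."""
    known = set(key_table_ids)
    return sorted(set(family_ids) - known)


key_ids = [100, 101, 102, 103, 104]         # contents of sunwg_cf_key
name_ids = [100, 101, 103, 104, 105, 106]   # sunwg_cf_name after new rows
print(missing_keys(key_ids, name_ids))      # [105, 106]
```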

Next, I will mainly work on implementing SELECT for column tables.
