
Hive GROUP BY optimization: Hive optimization schemes summarized from work

Posted: 2019-12-11 07:59:06


These optimizations are all solutions to bottlenecks we hit at the company, the distilled highlights. Of course, in the big-data era there is never just one way to solve a problem; wise minds will differ.

The slow-start parameter controls whether the reduce stage waits for the map stage to finish; the default is 0.05. Setting it to 1 lets the maps pull all of their data before any reduce starts, so a skewed map stage does not leave slots for both phases occupied at the same time. In other words, it is a memory-utilization problem.

set mapreduce.job.reduce.slowstart.completedmaps=1;

(Data applications) are like sausages: it is better not to see them being made. (Otto von Bismarck)

Job priority: set mapred.job.priority=VERY_HIGH;

stage-label.sql had far too many small files, each just a few KB; at worst a single job produced over 90,000 maps and 100 reduces.

First SQL:

Setting the reduce count had no effect, merging maps had no effect, and compression had no effect. Switching to bzip2 compression finally helped a bit, since bzip2 supports file splitting, but it is very hard on the CPU; after all, there are 1-2 TB of data per hour.

I decided to drop gzip and bzip2 and use DeflateCodec (only in stage-label.sql).

Solutions:

1. The script contains many SQL statements, so I optimized the SQL itself. A column alias cannot be used in a WHERE clause, and some statements wrapped a large-table query in an extra nested layer just to filter on a type field. I dropped the nesting and put the function-processed expression directly after WHERE, without an alias, as sketched below.
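A minimal sketch of that rewrite; big_table, type_col, other_col, and parse_type are hypothetical stand-ins for the real names:

-- Before: an extra nested layer exists only so the alias can be filtered on
select t, other_col from (
select parse_type(type_col) as t, other_col from big_table
) x
where t = 'target';

-- After: Hive disallows aliases in WHERE, so repeat the expression and drop the nesting
select parse_type(type_col) as t, other_col
from big_table
where parse_type(type_col) = 'target';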

2. Some statements produce no reduce stage. Add distribute by rand(1234) after the WHERE clause to force one, together with the mapred.reduce.tasks parameter (a combined example appears under the second SQL below).

3. Add the compression properties: set hive.exec.compress.output=true;

set mapred.output.compress=true;

set mapred.output.compression.type=BLOCK;

set mapred.output.compression.codec=org.apache.hadoop.io.compress.DeflateCodec;

set mapred.reduce.tasks=50;

65.5 GB of input landed in the target table as 24.5 GB, and 8.5 GB after DeflateCodec compression, a compression ratio of roughly 35%.

Second SQL:

This statement reads the previous statement's target table as its source and inserts into another table. The files are not split, and even though they are compressed they are not decompressed before the maps run, so only 50 maps were produced, and again there was no reduce stage.

It produced dozens of small (KB-sized) files; clearly, setting one global reduce count for the whole script is unreasonable.

Solutions:

1. In the getEc UDF: if the input is null, return an empty string; if it is the literal 'null', leave it alone (it could also return an empty string, but that is not optimized for now). Remove the where ec is null clause, drop the nesting, filter on the column directly in the WHERE clause, and change udf(column) != '' to column != ''. This saves UDF calls, a WHERE condition, and one level of nesting (one fewer pass over a large dataset).
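Roughly what that rewrite looks like; src_tab and other_col are hypothetical names, getEc is the UDF from the text:

-- Before: nested subquery plus an extra UDF call in the filter
select ec_clean, other_col from (
select getEc(ec) as ec_clean, other_col from src_tab
) t
where ec_clean != '';

-- After: filter on the raw column, keep the UDF only in the projection
select getEc(ec) as ec_clean, other_col
from src_tab
where ec != '';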

2. Force a reduce stage and set the reduce count again (overriding the reduce count set earlier): set mapred.reduce.tasks=1;

It works: the small files do get merged, so DeflateCodec output can evidently be merged too, but the reduce stage is slow. My guess is the cost comes from deserializing, merging, and then compressing again.

I then tested overriding only the reduce-count parameter, without the compression settings: the reduce stage is indeed much faster, since it only has to deserialize (the compression method is unchanged, i.e. the default, and throughput is still acceptable):

set mapreduce.reduce.tasks=1;

distribute by rand(1);
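Put together into a single statement, it looks roughly like this (target_tab, src_tab, and the column names are placeholders):

set mapreduce.reduce.tasks=1;
insert overwrite table target_tab
select col1, col2
from src_tab
distribute by rand(1); -- forces a reduce stage, which merges the small files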

Third SQL:

To be optimized:

use db;

insert into table tj_idfa

select '${date_desc}',count(distinct imei),

count(distinct idfa),count(distinct id_card),count(distinct tel_no),count(distinct user_name)

from (

select imei,idfa,id_card,tel_no,user_name

from z02_dw_tab_telphone_info_cde where pt_date='${date_desc}'

group by imei,imsi,idfa,id_card,tel_no,user_name) b;

Optimization 1:

use db;

create table tmp_z02_dw_tab_telphone_info_cde

as

select imei,idfa,id_card,tel_no,user_name

from z02_dw_tab_telphone_info_cde where pt_date='${date_desc}'

group by imei,imsi,idfa,id_card,tel_no,user_name;

insert into table tj_idfa

select '${date_desc}',

count(distinct case when imei is null or imei = 'null' or imei = 'NULL' then '999' else imei end),

count(distinct case when idfa is null or idfa = 'null' or idfa = 'NULL' then '999' else idfa end),

count(distinct case when id_card is null or id_card = 'null' or id_card = 'NULL' then '999' else id_card end),

count(distinct case when tel_no is null or tel_no = 'null' or tel_no = 'NULL' then '999' else tel_no end),

count(distinct case when user_name is null or user_name = 'null' or user_name = 'NULL' then '999' else user_name end)

from tmp_z02_dw_tab_telphone_info_cde;

Optimization 2:

use db;

create table tmp_z02_dw_tab_telphone_info_cde

as

select imei,idfa,id_card,tel_no,user_name

from z02_dw_tab_telphone_info_cde where pt_date='${date_desc}'

group by imei,imsi,idfa,id_card,tel_no,user_name;

from tmp_z02_dw_tab_telphone_info_cde

insert into table tj_idfa

select '${date_desc}',1+count(distinct case when imei is null or imei = 'null' or imei = 'NULL' then '999' else imei end),

'' as idfa,'' as id_card,'' as tel_no,'' as user_name where imei is not null

insert into table tj_idfa

select '${date_desc}','' as imei,1+count(distinct case when idfa is null or idfa = 'null' or idfa = 'NULL' then '999' else idfa end),

'' as id_card,'' as tel_no,'' as user_name where idfa is not null

insert into table tj_idfa

select '${date_desc}','' as imei,'' as idfa,1+count(distinct case when id_card is null or id_card = 'null' or id_card = 'NULL' then '999' else id_card end),

'' as tel_no,'' as user_name where id_card is not null

insert into table tj_idfa

select '${date_desc}','' as imei,'' as idfa,'' as id_card,1+count(distinct case when tel_no is null or tel_no = 'null' or tel_no = 'NULL' then '999' else tel_no end),

'' as user_name where tel_no is not null

insert into table tj_idfa

select '${date_desc}','' as imei,'' as idfa,'' as id_card,'' as tel_no,1+count(distinct case when user_name is null or user_name = 'null' or user_name = 'NULL' then '999' else user_name end)

where user_name is not null;

Switching to the Tez execution engine:

It has a few inconveniences: it cannot interact with the local Linux disk, so load data local inpath does not work, and add jar with a local jar does not work either; uploading the jar to HDFS fixes that.

Also, if some partition of a table you need has no data under it, Tez cannot be used either.

A good workaround:

When loading local data, set hive.execution.engine=mr; or use hdfs dfs -put followed by alter table, for example as below.
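A sketch of the engine switch around a local load (the table name and the local path are hypothetical):

set hive.execution.engine=mr; -- fall back to MR just for the local load
load data local inpath '/data/input/part-00000' into table tab partition (pt_date='${date_desc}');
set hive.execution.engine=tez; -- back to Tez for everything else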

Write a shell script to clear out partition directories that hold no data, or simply drop the Hive table partition when the directory has no files. (A handy trick: with the Tez engine, run select count(*) from tabname; and Tez will list exactly which partition directories contain no data; copy them out and hdfs dfs -rmr them.)

Automating that with a script is better still.

The benefit: when submitting jobs, initialization happens only once per session.

Running a statement in the CLI that goes a long time without returning will error out:

Error: org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe (state=08S01,code=0)

Disk storage solution:

set hive.exec.compress.output=true;

set mapred.output.compress=true;

set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec;

insert overwrite table compress_test partition(d) select a,b,c,'0517' from compress_test where d = 0517; -- the source data is uncompressed and the CPUs are fairly saturated, so DefaultCodec is the first choice

set mapred.child.java.opts = -Xmx1024m;

set mapred.reduce.tasks=500;

set hive.optimize.skewjoin = true;

set hive.auto.convert.join = true;

set hive.map.aggr=true;

set hive.groupby.skewindata=true;

Key points to watch: use indexes; when joining tables, push the data down (filter) at the source of the driving table; when count(distinct) is unavoidable, go through a subquery: select count(*) from (select column from tab group by column) t; avoid three levels of nesting, and if several levels are truly necessary, use temp tables; for big-to-small table joins, use mapjoin; analyze the data first, e.g. null rates: if the only 'empty' value a column takes is '', there is no need to pile on extra null filters.

Also: avoid joins on null keys; sort out source tables and data lineage; have shared data produced once by an upstream layer, or, when several inserts in one script share a source table, use multi group by; tune parameters; avoid order by for sorting; configure compression; keep small files and data skew under control; when a union all grows too complex, split it into two scripts, insert overwrite first and then insert into. Analyzing the data really matters: for instance, knowing url has many null values, to count UV use select count(distinct url)+1 from tab where url is not null.
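The count(distinct) rewrite from the list above, spelled out (tab and col are placeholders):

-- Skew-prone: a single reducer has to de-duplicate the whole column
select count(distinct col) from tab;

-- Safer: the group by spreads the de-duplication across reducers first
select count(*) from (select col from tab group by col) t;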

One more optimization worth mentioning: when parsing in a UDF it is worth using a map internally. Predicates like where column != '' and column != 'null' and column != 'NULL' and column is not null can be wrapped in a UDF, checkNull, that accepts multiple parameters and returns false if any of them equals one of those four values.
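In use it would look something like this; checkNull is the in-house UDF named above, so its exact signature is an assumption:

-- Before: the four-way null test written out in full
select * from tab
where col != '' and col != 'null' and col != 'NULL' and col is not null;

-- After: one variadic call, assumed to return false when any argument
-- is '', 'null', 'NULL', or SQL NULL
select * from tab where checkNull(col);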

set hive.groupby.skewindata=true;

set hive.hadoop.supports.splittable.combineinputformat=true; -- merge small files

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

set hive.map.aggr=true;

set hive.merge.mapfiles = true;

set hive.merge.mapredfiles = true;

set mapred.max.split.size=256000000;

set hive.merge.smallfiles.avgsize=32000000;

set hive.merge.size.per.task=67108864;

set hive.exec.reducers.bytes.per.reducer=67108864;

set mapred.job.reuse.jvm.num.tasks=-1;

set hive.exec.compress.output=true; -- default is false

set mapred.output.compress=true; -- default is false

set mapred.output.compression.type=BLOCK; -- default is RECORD

set mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;

set io.compression.codecs=org.apache.hadoop.io.compress.BZip2Codec; -- or DeflateCodec

set mapreduce.job.reduce.slowstart.completedmaps=1; -- in Hive, start reduces only after the maps finish, maximizing memory utilization

set hive.exec.parallel=true; -- whether stages may run in parallel, default is false

set hive.exec.parallel.thread.number=8; -- default is 8

Temp tables are best created as partitioned tables too, so that when an hourly job overruns, the next hour's job does not drop the table the previous hour created. With partitions, each run only has to drop the partition from two hours earlier; alternatively, put the hour into the temp table's name.

When dropping intermediate tables at the end, drop two: the current hour's and the previous hour's, in case the previous hour's run died before its intermediate table was cleaned up.
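A sketch of the partitioned temp table; tmp_job and the two-hours-ago value (computed by the calling script) are hypothetical:

create table if not exists tmp_job (col1 string, col2 string)
partitioned by (pt_hour string);

-- each run clears only the partition from two hours earlier
alter table tmp_job drop if exists partition (pt_hour='${two_hours_ago}');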

Multi-insert:

from dwd_ev_pub_ec_info_text

insert into table z03_total_count partition(pt_date='${date_desc}')

select 'Ec_汽车' as data_type,'ec' as ec_type,count(1) as totals_count

where pt_hour rlike '${date_desc}*' and indu_cd='car'

insert into table z03_total_count partition(pt_date='${date_desc}')

select 'Ec_金融' as data_type,'ec' as ec_type,count(1) as totals_count

where pt_hour rlike '${date_desc}*' and indu_cd='fin'

insert into table z03_total_count partition(pt_date='${date_desc}')

select 'Ec_家电' as data_type,'ec' as ec_type,count(1) as totals_count

where pt_hour rlike '${date_desc}*' and indu_cd='hea'

insert into table z03_total_count partition(pt_date='${date_desc}')

select 'Ec_房产' as data_type,'ec' as ec_type,count(1) as totals_count

where pt_hour rlike '${date_desc}*' and indu_cd='house'

insert into table z03_total_count partition(pt_date='${date_desc}')

select 'Ec_手机' as data_type,'ec' as ec_type,count(1) as totals_count

where pt_hour rlike '${date_desc}*' and indu_cd='mobile';

from dwd_ev_pub_act_inter_text

insert into table z03_total_count partition(pt_date='${date_desc}')

select 'notEc_汽车','notec' as ec_type,count(1) as totals_count

where pt_hour rlike '${date_desc}*' and indu_cd='car'

insert into table z03_total_count partition(pt_date='${date_desc}')

select 'notEc_金融','notec' as ec_type,count(1) as totals_count

where pt_hour rlike '${date_desc}*' and indu_cd='fin'

insert into table z03_total_count partition(pt_date='${date_desc}')

select 'notEc_家电','notec' as ec_type,count(1) as totals_count

where pt_hour rlike '${date_desc}*' and indu_cd='hea'

insert into table z03_total_count partition(pt_date='${date_desc}')

select 'notEc_房产','notec' as ec_type,count(1) as totals_count

where pt_hour rlike '${date_desc}*' and indu_cd='house'

insert into table z03_total_count partition(pt_date='${date_desc}')

select 'notEc_手机','notec' as ec_type,count(1) as totals_count

where pt_hour rlike '${date_desc}*' and indu_cd='mobile';
