04-25 01:56

Hive Performance Tuning (4): How to Solve Data Skew

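I. Non-splittable large files

II. Handling large numbers of identical keys

1. Data with many meaningless values

2. One key value far more frequent than the others

III. References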

I. Non-splittable large files

Codec                  BZip2   Gzip   Lz4   Snappy   Uncompressed
Time taken (ms)        17724   2448   550   351      -
Compressed size (MB)   16      19     28    33       166

These figures come from a data compression test run earlier.

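Gzip does not support splitting, so a Gzip file, however large, can only be read by a single map task. That one task does far more work than the others, which causes data skew on the map side. When large files must be compressed, a splittable codec such as BZip2 avoids this.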
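
To make the effect concrete, here is a minimal sketch of how many map tasks one file might get. The function name `num_map_splits` and the 128 MB split size are assumptions for illustration; the real split size depends on the cluster configuration.

```python
import math


def num_map_splits(file_size_mb: float, splittable: bool, split_size_mb: float = 128.0) -> int:
    """Approximate number of input splits (map tasks) for a single file.

    Hypothetical helper: assumes one split per `split_size_mb` of data
    for splittable files, and exactly one split otherwise.
    """
    if file_size_mb <= 0:
        return 0
    if not splittable:
        # A non-splittable file (e.g. Gzip) must be read start to end by one mapper.
        return 1
    return math.ceil(file_size_mb / split_size_mb)


# A 10 GB file: one mapper for Gzip, many mappers for a splittable codec.
gzip_maps = num_map_splits(10 * 1024, splittable=False)
bzip2_maps = num_map_splits(10 * 1024, splittable=True)
print(gzip_maps, bzip2_maps)
```

Under these assumptions the whole Gzip file lands on one mapper, while the splittable file is shared among many.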

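II. Handling large numbers of identical keys

1. Data with many meaningless values

When the data contains many nulls or other meaningless values, all of those rows share one key and are sent to the same reducer, so aggregations and table joins may suffer from data skew.

Aggregation:

First sample the data to find out which key values occur very often.

Then append a random number to those key values: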

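-- Salt the null keys in a subquery, then group by the salted key.
select salted_user_id, count(*) as cnt
from (select case when user_id is null then concat('null', rand()) else user_id end as salted_user_id from log) t
group by salted_user_id;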

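Afterwards the random suffix has to be stripped and the partial results merged, so that all null rows are counted under a single key again.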
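
The salt-then-unsalt procedure can be simulated in plain Python. This is only a sketch of the idea, not Hive's execution model: `salted_count` is a hypothetical helper, `None` stands in for SQL null, and it assumes no real user_id starts with the string "null".

```python
import random
from collections import Counter


def salted_count(user_ids, seed=0):
    """Count rows per user_id in two stages, salting null keys in stage 1."""
    rng = random.Random(seed)

    # Stage 1: replace each null key with 'null' + random number and count.
    # The salted keys are all different, so they no longer pile up in one group.
    stage1 = Counter()
    for uid in user_ids:
        key = f"null{rng.random()}" if uid is None else uid
        stage1[key] += 1

    # Stage 2: strip the random suffix and merge the partial counts.
    # Assumption: real user_ids never start with "null".
    final = Counter()
    for key, cnt in stage1.items():
        original = None if key.startswith("null") else key
        final[original] += cnt
    return stage1, final


rows = ["u1", None, "u2", None, None, "u1"]
stage1, final = salted_count(rows)
print(len(stage1), dict(final))
```

Stage 1 is the part that would run in parallel; stage 2 only has to merge a much smaller set of partial counts.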

Join:

Solution 1 (suited to inner joins): exclude these rows from the computation.

select * from log a join user b on a.user_id is not null and a.user_id = b.user_id;

Solution 2 (suited to left outer joins): give the null values new key values.

select * from log a left outer join user b on
case when a.user_id is null then concat('hive',rand()) else a.user_id end = b.user_id;
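
Why salting helps can be seen in a toy model of how join keys are spread over reducers. Everything here is an assumption for illustration: four reducers, a CRC32-based `reducer_for` function standing in for the real partitioner, and `None` standing in for SQL null.

```python
import random
import zlib
from collections import Counter

NUM_REDUCERS = 4  # hypothetical cluster size


def reducer_for(key: str) -> int:
    """Toy partitioner: a deterministic hash of the join key modulo the reducer count."""
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS


def reducer_load(keys, salt_nulls, seed=0):
    """Count how many rows each reducer receives."""
    rng = random.Random(seed)
    load = Counter()
    for key in keys:
        if key is None:
            # Unsalted nulls all share one key; salted nulls each get their own.
            key = f"hive{rng.random()}" if salt_nulls else "NULL"
        load[reducer_for(key)] += 1
    return load


keys = [None] * 1000 + [f"u{i}" for i in range(100)]
plain = reducer_load(keys, salt_nulls=False)
salted = reducer_load(keys, salt_nulls=True)
print(max(plain.values()), max(salted.values()))
```

The salted keys match no row in the user table, so in a left outer join those rows still come out with null columns on the right side, exactly as the original null keys would.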

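2. One key value far more frequent than the others

Aggregation:

Solution:

First sample the data to find out which key values occur very often.

During aggregation, append a random number to the hot key value: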

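-- 'much' stands for the hot key value found by sampling.
select salted_user_id, count(*) as cnt
from (select case when user_id = 'much' then concat('much', rand()) else user_id end as salted_user_id from log) t
group by salted_user_id;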

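Once the result is available, strip the random suffix and aggregate again to merge the partial results for the hot key.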
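
The full procedure (sample, salt, aggregate, unsalt) might look like the sketch below. The helper names, the 10% sample fraction, the 50% threshold and the "#" separator are all assumptions. It also uses a bounded salt (a bucket number) rather than a raw `rand()` value, a common variation that keeps the number of partial groups small; it assumes no key contains "#".

```python
import random
from collections import Counter


def find_hot_keys(keys, sample_fraction=0.1, threshold=0.5, seed=0):
    """Sample the keys and return those that make up at least `threshold` of the sample."""
    rng = random.Random(seed)
    sample = [k for k in keys if rng.random() < sample_fraction]
    if not sample:
        return []
    counts = Counter(sample)
    return [k for k, c in counts.items() if c / len(sample) >= threshold]


def two_stage_count(keys, hot_keys, buckets=8, seed=0):
    """Count rows per key, spreading each hot key over `buckets` salted keys first."""
    rng = random.Random(seed)
    partial = Counter()
    for k in keys:
        if k in hot_keys:
            k = f"{k}#{rng.randrange(buckets)}"  # salted key, e.g. 'much#3'
        partial[k] += 1
    final = Counter()
    for k, c in partial.items():
        final[k.split("#", 1)[0]] += c  # strip the salt and merge
    return partial, final


keys = ["much"] * 9000 + [f"u{i}" for i in range(1000)]
hot = find_hot_keys(keys)
partial, final = two_stage_count(keys, set(hot))
print(hot, max(partial.values()), final["much"])
```

After salting, the largest partial group is a fraction of the original hot group, while the final counts are unchanged.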

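Join:

Solution 1 (large table joined with a small table): use a map join to avoid the data skew caused by joining a small table to a large one. This method is used very often.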
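
The idea behind a map join is that the small table is loaded into memory on every mapper, so the large table is never shuffled by key and a hot key cannot overload a single reducer. A minimal Python sketch of that idea follows; `map_join` and the sample rows are hypothetical. In Hive itself this is usually triggered by the `MAPJOIN` hint or by automatic conversion (`hive.auto.convert.join`).

```python
def map_join(big_rows, small_rows, key="user_id"):
    """Inner join done 'map side': build an in-memory lookup from the
    small table, then stream the big table past it row by row."""
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    for row in big_rows:
        # No shuffle by key is needed: each big-table row is matched locally.
        for match in lookup.get(row[key], []):
            yield {**match, **row}


log = [
    {"user_id": "u1", "url": "/a"},
    {"user_id": "u1", "url": "/b"},
    {"user_id": "u9", "url": "/c"},  # no matching user, dropped by the inner join
]
users = [{"user_id": "u1", "name": "Ann"}]

joined = list(map_join(log, users))
print(joined)
```

This only works when the small table fits in memory, which is why it is the solution for the large-table-with-small-table case.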

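Solution 2 (two large tables): run two jobs. The first job joins the data that is not skewed with a regular join. The second job loads the rows for the skewed keys into cache and joins them with a map join. Finally, the results of the two joins are merged into one table.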
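
A rough simulation of this split-and-merge approach is shown below. The names `hash_join`, `skew_join` and `canonical` are hypothetical, and the same in-memory join function stands in for both the regular join of job 1 and the map join of job 2. The point is only that the merged output equals a plain join of the full tables.

```python
def hash_join(left, right, key):
    """Inner join: build a lookup on `right`, probe it with `left`."""
    lookup = {}
    for row in right:
        lookup.setdefault(row[key], []).append(row)
    return [{**r, **l} for l in left for r in lookup.get(l[key], [])]


def skew_join(a_rows, b_rows, hot_keys, key="user_id"):
    """Join the hot keys and the remaining keys in two separate jobs, then merge."""
    a_normal = [r for r in a_rows if r[key] not in hot_keys]
    b_normal = [r for r in b_rows if r[key] not in hot_keys]
    a_hot = [r for r in a_rows if r[key] in hot_keys]
    b_hot = [r for r in b_rows if r[key] in hot_keys]  # few rows: assumed to fit in cache

    job1 = hash_join(a_normal, b_normal, key)  # stands for the regular join
    job2 = hash_join(a_hot, b_hot, key)        # stands for the map join on hot keys
    return job1 + job2                         # merge the two results


def canonical(rows):
    """Order-independent form of a result set, for comparison."""
    return sorted(tuple(sorted(r.items())) for r in rows)


a = [{"user_id": "much", "url": f"/p{i}"} for i in range(5)] + [{"user_id": "u1", "url": "/x"}]
b = [
    {"user_id": "much", "name": "Big"},
    {"user_id": "u1", "name": "Ann"},
    {"user_id": "u2", "name": "Bob"},
]

split_result = skew_join(a, b, hot_keys={"much"})
plain_result = hash_join(a, b, "user_id")
print(canonical(split_result) == canonical(plain_result))
```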

III. References

Hive学习之路（十九）Hive的数据倾斜 (The Road to Learning Hive (19): Data Skew in Hive)

————————————————

Original link: https://blog.csdn.net/qq_38258720/article/details/115870102