背景:为满足业务方分析需求,数仓基础统计中存在大量多维度的指标计算。原有计算中根据不同维度拆分多个sql串行提交hive集群进行查询,多个任务在map阶段存在重复读取源数据等问题,因此引入grouping sets语法合并业务逻辑为一条sql。
一次map一次reduce完成一个基础指标的统计,减少集群资源浪费。现以安卓国内版天活跃启动统计任务为例进行测试。
1.sql1 按province,model维度统计source为dw的前后台启动活跃数据。
sql1 SELECT SOURCE, province, model, sum(front_start_pv), sum(if(front_start_pv>0,1,0)), sum(all_start_pv), sum(1) FROM (SELECT SOURCE, province, model, deviceId, sum(front_start_pv) AS front_start_pv, sum(all_start_pv) AS all_start_pv FROM dw_wps_android.dwd_app_start_d WHERE dt='2020-04-04' AND SOURCE='dw' GROUP BY SOURCE, province, model, deviceId) a GROUP BY SOURCE, province, model
2.sql2 按channel,province,model维度统计source为dw的前后台启动活跃数据。
sql2 SELECT SOURCE, channel, province, model, sum(front_start_pv), sum(if(front_start_pv>0,1,0)), sum(all_start_pv), sum(1) FROM (SELECT SOURCE, channel, province, model, deviceId, sum(front_start_pv) AS front_start_pv, sum(all_start_pv) AS all_start_pv FROM dw_wps_android.dwd_app_start_d WHERE dt='2020-04-04' AND SOURCE='dw' GROUP BY SOURCE, channel, province, model, deviceId) a GROUP BY SOURCE, channel, province, model
sql3 按app_version,province,model维度统计source为dw的前后台启动活跃数据。
sql3 SELECT SOURCE, app_version, province, model, sum(front_start_pv), sum(if(front_start_pv>0,1,0)), sum(all_start_pv), sum(1) FROM (SELECT SOURCE, app_version, province, model, deviceId, sum(front_start_pv) AS front_start_pv, sum(all_start_pv) AS all_start_pv FROM dw_wps_android.dwd_app_start_d WHERE dt='2020-04-04' AND SOURCE='dw' GROUP BY SOURCE, app_version, province, model, deviceId) a GROUP BY SOURCE, app_version, province, model
4.sql4 按app_version,channel,province,model维度统计source为dw的前后台启动活跃数据。
sql4 SELECT SOURCE, app_version, channel, province, model, sum(front_start_pv), sum(if(front_start_pv>0,1,0)), sum(all_start_pv), sum(1) FROM (SELECT SOURCE, app_version, channel, province, model, deviceId, sum(front_start_pv) AS front_start_pv, sum(all_start_pv) AS all_start_pv FROM dw_wps_android.dwd_app_start_d WHERE dt='2020-04-04' AND SOURCE='dw' GROUP BY SOURCE, app_version, channel, province, model, deviceId) a GROUP BY SOURCE, app_version, channel, province, model
5.sql5 按province,model维度统计source为所有平台的前后台启动活跃数据
sql5 SELECT 'all' AS SOURCE, province, model, sum(front_start_pv), sum(if(front_start_pv>0,1,0)), sum(all_start_pv), sum(1) FROM (SELECT 'all' AS SOURCE, province, model, deviceId, sum(front_start_pv) AS front_start_pv, sum(all_start_pv) AS all_start_pv FROM dw_wps_android.dwd_app_start_d WHERE dt='2020-04-04' GROUP BY 'all', province, model, deviceId) a GROUP BY 'all', province, model
6.sql6 按channel,province,model维度统计source为所有平台的前后台启动活跃数据
sql6 SELECT 'all' AS SOURCE, channel, province, model, sum(front_start_pv), sum(if(front_start_pv>0,1,0)), sum(all_start_pv), sum(1) FROM (SELECT 'all' AS SOURCE, channel, province, model, deviceId, sum(front_start_pv) AS front_start_pv, sum(all_start_pv) AS all_start_pv FROM dw_wps_android.dwd_app_start_d WHERE dt='2020-04-04' GROUP BY 'all', channel, province, model, deviceId) a GROUP BY 'all', channel, province, model
7.sql7 按app_version,province,model维度统计source为所有平台的前后台启动活跃数据
sql7 SELECT 'all' AS SOURCE, app_version, province, model, sum(front_start_pv), sum(if(front_start_pv>0,1,0)), sum(all_start_pv), sum(1) FROM (SELECT 'all' AS SOURCE, app_version, province, model, deviceId, sum(front_start_pv) AS front_start_pv, sum(all_start_pv) AS all_start_pv FROM dw_wps_android.dwd_app_start_d WHERE dt='2020-04-04' GROUP BY 'all', app_version, province, model, deviceId) a GROUP BY 'all', app_version, province, model
8.sql8 按app_version,channel,province,model维度统计source为所有平台的前后台启动活跃数据
sql8 SELECT 'all' AS SOURCE, app_version, channel, province, model, sum(front_start_pv), sum(if(front_start_pv>0,1,0)), sum(all_start_pv), sum(1) FROM (SELECT 'all' AS SOURCE, app_version, channel, province, model, deviceId, sum(front_start_pv) AS front_start_pv, sum(all_start_pv) AS all_start_pv FROM dw_wps_android.dwd_app_start_d WHERE dt='2020-04-04' GROUP BY 'all', app_version, channel, province, model, deviceId) a GROUP BY 'all', app_version, channel, province, model
9.sql9 合并source,app_version,channel,province,model维度组合统计前后台启动活跃数据
sql9 SELECT nvl(source,"all") as source, app_version, channel, province, model, sum(front_start_pv) as front_start_pv, sum(if(front_start_pv>0,1,0)) as front_start_uv, sum(all_start_pv) as all_start_pv, sum(1) as all_start_uv FROM (SELECT source, nvl(app_version,"dw_null") as app_version, nvl(channel,"dw_null") as channel, nvl(province,"dw_null") as province, nvl(model,"dw_null") as model, deviceId, sum(front_start_pv) as front_start_pv, sum(all_start_pv) as all_start_pv FROM dw_wps_android.dwd_app_start_d WHERE dt='2020-04-04' GROUP BY source, nvl(app_version,"dw_null"), nvl(channel,"dw_null"), nvl(province,"dw_null"), nvl(model,"dw_null"), deviceId) a GROUP BY source, app_version, channel, province, model GROUPING SETS( (source,province,model), (source,channel,province,model), (source,app_version,province,model), (source,app_version,channel,province,model), (province,model), (channel,province,model), (app_version,province,model), (app_version,channel,province,model) )HAVING source='dw' OR source is null
sql行号 | 数据行 | time(秒) |
---|---|---|
sql1 | 191618 | 54.289 |
sql2 | 349692 | 186.34 |
sql3 | 811334 | 48.341 |
sql4 | 1059390 | 50.373 |
sql5 | 197314 | 66.308 |
sql6 | 405496 | 68.312 |
sql7 | 1077795 | 152.358 |
sql8 | 1376569 | 182.347 |
总和 | 5469208 | 808.668 |
sql行号 | 数据行 | time(秒) |
---|---|---|
sql9 | 5469208 | 134.432 |
说明:sql都是都测试队列相同集群环境下测试的。都是在container复用的情况下进行测试,排除了第一个sql从0申请资源执行时间过长的因素。
结论:采用grouping set语法进行sql合并能有效减少hive任务执行时间,同时以上9个sql占用的集群资源基本相当,优化后会大量减少集群资源的消耗。之后会测试数仓批量任务执行优化前后的时间对比,进行更全面的对比。