Hive开启向量化模式也是hiveSQL优化方法中的一种,可以提升hive查询速率,也叫hive矢量化。
问题1:那么什么是hive向量化模式呢?
问题2:hive向量化什么情况下可以被使用,或者说它有哪些使用场景呢?
问题3:如何查看hive向量化使用的相关信息?
hive向量化模式是hive的一个特性,也叫hive矢量化,在没有引入向量化的执行模式之前,一般的查询操作一次只处理一行数据,在向量化查询执行时一次处理1024行的块来简化系统底层的操作,提高了数据处理的性能。
在底层,hive提供的向量模式,并不是重写了Mapper函数,而是通过实现inputformat接口,创建了VectorizedParquetInputFormat类,来构建一个批量输入的数组。
向量化模式开启的方式如下:
-- 开启hive向量化模式 set hive.vectorized.execution.enabled = true;
Hive向量化模式并不是可以直接使用,它对使用的计算引擎,使用数据的数据类型,以及使用的SQL函数都有一定的要求。
不同的计算引擎支持程度不一样:MR计算引擎仅支持Map阶段的向量化,Tez和Spark计算引擎可以支持Map阶段和Reduce阶段的向量化。
hive文件存储类型必须为ORC或者Parquet等列存储文件类型。
以上数据类型为向量化模式支持的数据类型,如果使用其他数据类型,例如array和map等,开启了向量化模式查询,查询操作将使用标准的方式单行执行,但不会报错。
算数表达式: +, -, *, /, % 逻辑关系:AND, OR, NOT 比较关系(过滤器): <, >, <=, >=, =, !=, BETWEEN, IN ( list-of-constants ) as filters 使用 AND, OR, NOT, <, >, <=, >=, =, != 等布尔值表达式(非过滤器) 空值校验:IS [NOT] NULL 所有的数学函数,例如 SIN, LOG等 字符串函数: SUBSTR, CONCAT, TRIM, LTRIM, RTRIM, LOWER, UPPER, LENGTH 类型转换:cast Hive UDF函数, 包括标准和通用的UDF函数 日期函数:YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, UNIX_TIMESTAMP IF条件表达式
以上函数表达式在运行时支持使用向量化模式。
查看hive向量化信息是前置的,可以通过执行计划命令explain vectorization查看向量化描述信息。当然,执行中,也可以通过日志了解向量化执行信息,但相对筛选关键信息比较复杂。
explain vectorization是在hive2.3.0版本之后发布的功能,可以查看map阶段和reduce阶段为什么没有开启矢量化模式,类似调试功能。
explain vectorization支持的语法:explain vectorization [only] [summary|operator|expression|detail]
接下来我们通过实例来查看以上命令的展示内容:
例1 关闭向量化模式的情况下,使用explain vectorization only。
-- 关闭向量化模式 set hive.vectorized.execution.enabled = false; explain vectorization only select age,count(0) as num from temp.user_info_all where ymd = '20230505' and age < 30 and nick like '%小%' group by age;
执行结果:
PLAN VECTORIZATION: enabled: false #标识向量化模式没有开启 enabledConditionsNotMet: [hive.vectorized.execution.enabled IS false] #未开启原因
如上,如果关闭向量化模式,输出结果中PLAN VECTORIZATION 这里可以看到该模式没有被开启,原因是没有满足enabledConditionsNotMet 指代的条件。
例2 开启向量化模式的情况下,使用explain vectorization only。
-- 开启向量化模式 set hive.vectorized.execution.enabled = true; explain vectorization only select age,count(0) as num from temp.user_info_all where ymd = '20230505' and age < 30 and nick like '%小%' group by age;
执行结果:
PLAN VECTORIZATION: enabled: true enabledConditionsMet: [hive.vectorized.execution.enabled IS true] STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Map Reduce Execution mode: vectorized Map Vectorization: enabled: true enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true groupByVectorOutput: true inputFileFormats: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat allNative: false usesVectorUDFAdaptor: false vectorized: true Reduce Vectorization: enabled: false enableConditionsMet: hive.vectorized.execution.reduce.enabled IS true enableConditionsNotMet: hive.execution.engine mr IN [tez, spark] IS false Stage: Stage-0 Fetch Operator
执行结果有三部分内容:
其中STAGE PLANS打印的并不是explain中map和reduce阶段的运行信息,而是这两个阶段使用向量化模式的信息。
对以上案例内容进行关键词解读:
以上整个过程在map阶段执行了向量化模式,在reduce阶段没有执行向量化模式,是因为上文提到的reduce阶段mr计算引擎不支持,需要tez或spark计算引擎。
可以执行以下命令:
-- 开启向量化模式 set hive.vectorized.execution.enabled = true; explain vectorization only summary select age,count(0) as num from temp.user_info_all where ymd = '20230505' and age < 30 and nick like '%小%' group by age;
会发现explain vectorization only
命令和explain vectorization only summary
命令执行结果完全一致。
后续其他命令也类似,explain vectorization
等同于explain vectorization summary
,summary参数是一个默认参数,可以忽略。
例3 使用explain vectorization命令查看hive向量化模式执行信息。
-- 开启向量化模式 set hive.vectorized.execution.enabled = true; explain vectorization select age,count(0) as num from temp.user_info_all where ymd = '20230505' and age < 30 and nick like '%小%' group by age;
其执行结果是explain和explain vectorization only两者相加执行结果:
PLAN VECTORIZATION: enabled: true enabledConditionsMet: [hive.vectorized.execution.enabled IS true] STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan alias: user_info_all Statistics: Num rows: 32634295 Data size: 783223080 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((age < 30) and (nick like '%小%')) (type: boolean) Statistics: Num rows: 5439049 Data size: 130537176 Basic stats: COMPLETE Column stats: NONE Select Operator ... #省略部分 # 向量化模式描述信息 Execution mode: vectorized Map Vectorization: enabled: true enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true groupByVectorOutput: true inputFileFormats: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat allNative: false usesVectorUDFAdaptor: false vectorized: true Reduce Vectorization: enabled: false enableConditionsMet: hive.vectorized.execution.reduce.enabled IS true enableConditionsNotMet: hive.execution.engine mr IN [tez, spark] IS false Reduce Operator Tree: Group By Operator aggregations: count(VALUE._col0) ... #省略部分 Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink
... 为省略了一部分信息。
使用explain vectorization operator可以查看显示执行计划过程中运算符的向量化信息和explain运行阶段信息。
简化版为explain vectorization only operator,加only相对前者少的部分为explain运行阶段信息,下同。explain运行阶段信息我们就不查询了,感兴趣小伙伴可以自行查询查看。
例4 简化版为explain vectorization only operator查看hiveSQL矢量化描述信息。
set hive.vectorized.execution.enabled = true; explain vectorization only operator select age,count(0) as num from temp.user_info_all where ymd = '20230505' and age < 30 and nick like '%小%' group by age;
执行结果:
PLAN VECTORIZATION: enabled: true enabledConditionsMet: [hive.vectorized.execution.enabled IS true] STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: # 表扫描的向量化信息 TableScan Vectorization: # 读表采用本地的向量化模式扫描 native: true # 过滤操作的向量化信息 Filter Vectorization: # 过滤操作的类 className: VectorFilterOperator # 过滤采用本地的向量化模式 native: true # 列筛选的向量化信息 Select Vectorization: className: VectorSelectOperator native: true # 聚合操作的向量化信息 Group By Vectorization: className: VectorGroupByOperator # 输出采用向量化输出 vectorOutput: true #非本地操作 native: false # reduce output向量化信息 Reduce Sink Vectorization: className: VectorReduceSinkOperator native: false # 已满足的Reduce Sink向量化条件 nativeConditionsMet: hive.vectorized.execution.reducesink.new.enabled IS true, Not ACID UPDATE or DELETE IS true, No buckets IS true, No TopN IS true, No DISTINCT columns IS true, BinarySortableSerDe for keys IS true, LazyBinarySerDe for values IS true # 不满足的Reduce Sink向量化条件 nativeConditionsNotMet: hive.execution.engine mr IN [tez, spark] IS false, Uniform Hash IS false # 向量化描述信息,同explain vectorization only,不作标注了。 Execution mode: vectorized Map Vectorization: enabled: true enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true groupByVectorOutput: true inputFileFormats: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat allNative: false usesVectorUDFAdaptor: false vectorized: true Reduce Vectorization: enabled: false enableConditionsMet: hive.vectorized.execution.reduce.enabled IS true enableConditionsNotMet: hive.execution.engine mr IN [tez, spark] IS false Reduce Operator Tree: Group By Vectorization: vectorOutput: false native: false Stage: Stage-0 Fetch Operator
以上内容关键词在代码块有行注释标注,可以看到explain vectorization only operator命令多了在explain执行计划过程中增加了具体每一个运算符(operator)步骤的是否向量化及具体信息。如果不满足向量化步骤,哪些条件满足,哪些条件不满足,也做了标注。
expression:补充显示表达式的向量化信息,例如谓词表达式。还包括 summary 和 operator 的所有信息。
例5 简化版explain vectorization only expression命令查看hiveSQL执行计划表达式的向量化信息。
set hive.vectorized.execution.enabled = true; explain vectorization only expression select age,count(0) as num from temp.user_info_all where ymd = '20230505' and age < 30 and nick like '%小%' group by age;
执行结果:
# 同explain vectorization PLAN VECTORIZATION: enabled: true enabledConditionsMet: [hive.vectorized.execution.enabled IS true] # 同explain vectorization STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan Vectorization: native: true # 表示表扫描后有25列。 projectedOutputColumns: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24] Filter Vectorization: className: VectorFilterOperator native: true # 表示谓词过滤少选有两列,以及过滤条件的内容。 predicateExpression: FilterExprAndExpr(children: FilterLongColLessLongScalar(col 11, val 30), FilterStringColLikeStringScalar(col 7, pattern %小%)) Select Vectorization: className: VectorSelectOperator native: true # 表示进行列筛选的具体列,这里是第12列,数组下标为11.如果为空[],则表示任何一个列。 projectedOutputColumns: [11] Group By Vectorization: # 表示使用VectorUDAFCount的方法进行count计数统计以及输出类型。 aggregators: VectorUDAFCount(ConstantVectorExpression(val 0) -> 25:int) -> bigint className: VectorGroupByOperator vectorOutput: true # 聚合列 keyExpressions: col 11 native: false # 输出为一个新的数组,只有一列 projectedOutputColumns: [0] Reduce Sink Vectorization: className: VectorReduceSinkOperator native: false nativeConditionsMet: hive.vectorized.execution.reducesink.new.enabled IS true, Not ACID UPDATE or DELETE IS true, No buckets IS true, No TopN IS true, No DISTINCT columns IS true, BinarySortableSerDe for keys IS true, LazyBinarySerDe for values IS true nativeConditionsNotMet: hive.execution.engine mr IN [tez, spark] IS false, Uniform Hash IS false # 向量化描述信息,同explain vectorization only,不作标注了。 Execution mode: vectorized Map Vectorization: enabled: true enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true groupByVectorOutput: true inputFileFormats: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat allNative: false usesVectorUDFAdaptor: false vectorized: true Reduce Vectorization: enabled: false enableConditionsMet: hive.vectorized.execution.reduce.enabled IS true enableConditionsNotMet: hive.execution.engine mr IN [tez, spark] IS false Reduce Operator Tree: Group By Vectorization: vectorOutput: false native: false projectedOutputColumns: null Stage: Stage-0 Fetch Operator
以上打印信息内容可以看出 explain vectorization only expression命令相对打印的信息是更细粒度到字段级别的信息了。基本上将操作的每一列是否使用向量化处理都打印了出来,这样我们可以很好的判断哪些字段类型是不支持向量化模式的。
explain vectorization only detail 查看最详细级别的向量化信息。它包括summary、operator、expression的所有信息。
例6 explain vectorization only detail 查看最详细级别的向量化信息。
set hive.vectorized.execution.enabled = true; explain vectorization only detail select age,count(0) as num from temp.user_info_all where ymd = '20230505' and age < 30 and nick like '%小%' group by age;
执行结果:
PLAN VECTORIZATION: enabled: true enabledConditionsMet: [hive.vectorized.execution.enabled IS true] STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 # 同explain vectorization only expression STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan Vectorization: native: true projectedOutputColumns: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24] Filter Vectorization: className: VectorFilterOperator native: true predicateExpression: FilterExprAndExpr(children: FilterLongColLessLongScalar(col 11, val 30), FilterStringColLikeStringScalar(col 7, pattern %小%)) Select Vectorization: className: VectorSelectOperator native: true projectedOutputColumns: [11] Group By Vectorization: aggregators: VectorUDAFCount(ConstantVectorExpression(val 0) -> 25:int) -> bigint className: VectorGroupByOperator vectorOutput: true keyExpressions: col 11 native: false projectedOutputColumns: [0] Reduce Sink Vectorization: className: VectorReduceSinkOperator native: false nativeConditionsMet: hive.vectorized.execution.reducesink.new.enabled IS true, Not ACID UPDATE or DELETE IS true, No buckets IS true, No TopN IS true, No DISTINCT columns IS true, BinarySortableSerDe for keys IS true, LazyBinarySerDe for values IS true nativeConditionsNotMet: hive.execution.engine mr IN [tez, spark] IS false, Uniform Hash IS false # 向量化描述信息这里做了更详细的描述 Execution mode: vectorized Map Vectorization: enabled: true enabledConditionsMet: hive.vectorized.use.vectorized.input.format IS true groupByVectorOutput: true inputFileFormats: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat allNative: false usesVectorUDFAdaptor: false vectorized: true rowBatchContext: dataColumnCount: 24 includeColumns: [7, 11] dataColumns: uid:bigint, reg_time:string, cc:string, client:string, if_new:int, last_location:string, platform_reg:string, nick:string, gender:int, birthday:string, constellation:string, age:bigint, description:string, is_realname:int, realname_date:string, last_active_day:string, is_active:int, user_status:int, user_ua:string, vst_cnt:bigint, vst_dur:bigint, is_vip:int, chat_uv:bigint, chat_cnt:bigint partitionColumnCount: 1 partitionColumns: ymd:string scratchColumnTypeNames: bigint Reduce Vectorization: enabled: false enableConditionsMet: hive.vectorized.execution.reduce.enabled IS true enableConditionsNotMet: hive.execution.engine mr IN [tez, spark] IS false Reduce Operator Tree: Group By Vectorization: vectorOutput: false native: false projectedOutputColumns: null Stage: Stage-0 Fetch Operator
通过以上内容可以看出 explain vectorization only detail打印的信息其中执行计划部分内容和explain vectorization only expression粒度一致,在向量化描述信息部分做了更细粒度的描述,到字段级别。
以上就是hive向量化explain vectorization相关参数的使用,其命令在我们使用向量化模式中进行验证支持的函数和数据类型逐步递进,可以根据需要使用。
而hive向量化模式可以极大程度的优化hive执行速度。
例7 执行优化速度比对。
-- 代码1 开启向量化模式 set hive.vectorized.execution.enabled = true; select age,count(0) as num from temp.user_info_all where ymd = '20230505' and age < 30 and nick like '%小%' group by age; -- 代码2 关闭向量化模式 set hive.vectorized.execution.enabled = false; select age,count(0) as num from temp.user_info_all where ymd = '20230505' and age < 30 and nick like '%小%' group by age;
执行结果:
# 代码1执行结果开启向量化模式 MapReduce Total cumulative CPU time: 1 minutes 1 seconds 740 msec Ended Job = job_1675664438694_13647623 MapReduce Jobs Launched: Stage-Stage-1: Map: 6 Reduce: 5 Cumulative CPU: 61.74 sec HDFS Read: 367242142 HDFS Write: 1272 SUCCESS Total MapReduce CPU Time Spent: 1 minutes 1 seconds 740 msec OK 15 23 ... # 省略数据 29 81849 Time taken: 41.322 seconds, Fetched: 31 row(s) # 代码2执行结果关闭向量化模式 MapReduce Total cumulative CPU time: 1 minutes 39 seconds 190 msec Ended Job = job_1675664438694_13647754 MapReduce Jobs Launched: Stage-Stage-1: Map: 6 Reduce: 5 Cumulative CPU: 99.19 sec HDFS Read: 367226626 HDFS Write: 1272 SUCCESS Total MapReduce CPU Time Spent: 1 minutes 39 seconds 190 msec OK 15 23 ... # 省略数据 29 81849 Time taken: 50.724 seconds, Fetched: 31 row(s)
以上结果可以看出,开启向量化模式执行结果查询耗时减少,虽然减少的不多,但在CPU使用上少了三分之一的资源。可见开启向量化模式不仅可以提高查询速度,还可以节省查询资源。
以上开启向量化模式为mr引擎测试结果,tez和spark还具有更优的执行表现。
下一期:Hive执行计划之只有map阶段SQL性能分析和解读
按例,欢迎点击此处关注我的个人公众号,交流更多知识。
后台回复关键字 hive,随机赠送一本鲁边备注版珍藏大数据书籍。