Based on 社区版源码。核心入口 JOIN::optimize
/** Optimizes one query block into a query execution plan (QEP.) This is the entry point to the query optimization phase. This phase applies both logical (equivalent) query rewrites, cost-based join optimization, and rule-based access path selection. Once an optimal plan is found, the member function creates/initializes all structures needed for query execution. The main optimization phases are outlined below: -# Logical transformations: - Outer to inner joins transformation. - Equality/constant propagation. - Partition pruning. - COUNT(*), MIN(), MAX() constant substitution in case of implicit grouping. - ORDER BY optimization. -# Perform cost-based optimization of table order and access path selection. See JOIN::make_join_plan() -# Post-join order optimization: - Create optimal table conditions from the where clause and the join conditions. - Inject outer-join guarding conditions. - Adjust data access methods after determining table condition (several times.) - Optimize ORDER BY/DISTINCT. -# Code generation - Set data access functions. - Try to optimize away sorting/distinct. - Setup temporary table usage for grouping and/or sorting. @retval 0 Success. @retval 1 Error, error code saved in member JOIN::error. */ int JOIN::optimize();
MySQL JOIN syntax: https://dev.mysql.com/doc/refman/8.0/en/join.html
Straight join: is similar to JOIN
, except that the left table is always read before the right table. This can be used for those (few) cases for which the join optimizer processes the tables in a suboptimal order. STRAIGHT_JOIN有两种用法:一种是加在JOIN处作为INNER JOIN的一种特殊类型hint该join的顺序;另一种是加在SELECT处使该select下的所有JOIN都强制为用户table的join顺序,从优化代码上看该用法与semi-join不可同时存在(Optimize_table_order::optimize_straight_join: DBUG_ASSERT(join->select_lex->sj_nests.is_empty())
)。
join order hint: Join-order hints affect the order in which the optimizer joins tables, including JOIN_FIXED_ORDER
, JOIN_ORDER
, JOIN_PREFIX
, JOIN_SUFFIX
各种JOIN类型:INNER JOIN, OUTER JOIN, SEMI JOIN, LEFT/RIGHT JOIN, etc.
Materialization(物化): Usually happens in subquery (sometimes known as semi-join). Materialization speeds up query execution by generating a subquery result as a temporary table, normally in memory.
Statistics (一般叫统计信息,有时也叫采样,但区分sample采样):从存储获取表的rowcount、min/max/sum/avg/keyrange等元信息,用于辅助plan优化。
table dependencies: A LEFT JOIN B : B depends on A and A's own dependencies。(待确认 DEPEND JOIN语义是否也是由table dependencies关系表示)
table access path: An access path may use either an index scan
, a table scan
, a range scan
or ref
access, known as join type in explain.
MySQL源码中JOIN对象的tables的存放layout(参考注释,单看变量名有歧义):
/** Before plan has been created, "tables" denote number of input tables in the query block and "primary_tables" is equal to "tables". After plan has been created (after JOIN::get_best_combination()), the JOIN_TAB objects are enumerated as follows: - "tables" gives the total number of allocated JOIN_TAB objects - "primary_tables" gives the number of input tables, including materialized temporary tables from semi-join operation. - "const_tables" are those tables among primary_tables that are detected to be constant. - "tmp_tables" is 0, 1 or 2 (more if windows) and counts the maximum possible number of intermediate tables in post-processing (ie sorting and duplicate removal). Later, tmp_tables will be adjusted to the correct number of intermediate tables, @see JOIN::make_tmp_tables_info. - The remaining tables (ie. tables - primary_tables - tmp_tables) are input tables to materialized semi-join operations. The tables are ordered as follows in the join_tab array: 1. const primary table 2. non-const primary tables 3. intermediate sort/group tables 4. possible holes in array 5. semi-joined tables used with materialization strategy */ uint tables; ///< Total number of tables in query block uint primary_tables; ///< Number of primary input tables in query block uint const_tables; ///< Number of primary tables deemed constant uint tmp_tables; ///< Number of temporary tables used by query
outer join -> inner join
if (select_lex->first_execution) { /** @todo This query block didn't transform itself in SELECT_LEX::prepare(), so belongs to a parent query block. That parent, or its parents, had to transform us - it has not; maybe it is itself in prepare() and evaluating the present query block as an Item_subselect. Such evaluation in prepare() is expected to be a rare case to be eliminated in the future ("SET x=(subq)" is one such case; because it locks tables before prepare()). */ if (select_lex->apply_local_transforms(thd, false)) DBUG_RETURN(error = 1); }
实际上这一步应该是在 prepare 的 resolver 阶段去做的,resolver 后 first_execution 会被置位 false。但是由于实现问题,有可能有些深层嵌套的子查询并没有被 prepare,相当于是一个临时 fix。
optimize_derived()
/* Run optimize phase for all derived tables/views used in this SELECT, including those in semi-joins. */ if (select_lex->materialized_derived_table_count) { for (TABLE_LIST *tl = select_lex->leaf_tables; tl; tl = tl->next_leaf) { if (tl->is_view_or_derived() && tl->optimize_derived(thd)) DBUG_RETURN(1); } }
递归 optimize derived table,包括但不仅限于各种子查询、CTE、table function等。如果能在优化器阶段判断这是一个constant table (if materializable table contains one or zero rows) ,则提前物化便于后面的优化判断。
optimize_cond()
a) applying transitivity to build multiple equality predicates (MEP): if x=y and y=z the MEP x=y=z is built. 对应 build_equal_items() b) apply constants where possible. If the value of x is known to be 42, x is replaced with a constant of value 42. By transitivity, this also applies to MEPs, so the MEP in a) will become 42=x=y=z. 对应 propagate_cond_constants() c) remove conditions that are always false or always true. 对应 remove_eq_conds()
优化 where_cond 和 having_cond 。这一块严格说其实也可以算Rewriter的范围。但MySQL和很多引擎不同的是,它在这里不会去对Item表达式树做改写,而是把可以优化掉的子树或变量标记为 constant,当执行阶段遇到constant的Item,直接去读它的cache值即可,不需要重新计算。
prune_table_partitions()
裁剪分区。对于分区表,如果能提前判断一些分区是不会被覆盖到的,那直接裁剪忽略掉。输入是join_cond (join tables) 或者where_cond (first table)。这一步在分布式计算引擎里是最常见的优化之一,因为分布式的存储往往会更依赖类似sharding的分区模式。
opt_sum_query()
利用统计信息行数、索引等去预先将能优化的 select fields 里的 agg 函数优化成常数。比如如果统计信息的行数是准确的 get_exact_record_count() ,则简单的 COUNT 可以被直接替换;如果 MAX/MIN 的列是索引的第一列,则可以直接 seek 索引来找到极值。
make_join_plan()
/** Calculate best possible join order and initialize the join structure. @return true if success, false if error. The JOIN object is populated with statistics about the query, and a plan with table order and access method selection is made. The list of tables to be optimized is taken from select_lex->leaf_tables. JOIN::where_cond is also used in the optimization. As a side-effect, JOIN::keyuse_array is populated with key_use information. Here is an overview of the logic of this function: */
Initialize JOIN data structures and setup basic dependencies between tables.
init_planner_arrays()
,初始化构建 JOIN_TAB 数组用于后续优化。Prepare table 之间的 join 依赖关系,包括 embed / outer join 等。
Update dependencies based on join information.
propagate_dependencies()
,基于outer join构建出table之间的依赖关系。在一些优化理论中,会把table之间的这种 join 依赖关系视为一种图结构。但边的构建会视实际实现而异,有依靠join关系构建的,有依靠 join cond 是否有索引构建的 等等。
Make key descriptions (update_ref_and_keys()).
update_ref_and_keys()
,根据 where cond / join cond 提前确定可能待选的keys。这会为后续决定 join order 提供支持。
Pull out semi-join tables based on table dependencies.
尝试子查询 semi join 上提。 SELECT * FROM t1 WHERE c1 IN (..t2)
--> SELECT * FROM t1 SEMI JOIN t2 WHERE c1=...
Extract tables with zero or one rows as const tables. Read contents of const tables, substitute columns from these tables with actual data. Also keep track of empty tables vs. one-row tables.
extract_const_tables()
, 如果统计信息表明该表是个 const table (0 or 1 row) ,则可以预读优化。
After const table extraction based on row count, more tables may have become functionally dependent. Extract these as const tables.
extract_func_dependent_tables()
Add new sargable predicates based on retrieved const values.
update_sargable_from_const()
。SARGABLE意为 Search ARGument ABLE ,是个计算引擎术语,原义表示SQL中可利用数据库自身索引优势对查询条件进行执行性能优化。SARGABLE_PARAM 对象是在update_ref_and_keys()
的 add_key_field()
中更新的,存下sargable的op(比如 =,>,< 等),然后在 update_sargable_from_const()
中结合 const_table 的优化,来优化 possible_keys 。
Calculate number of rows to be retrieved from each table.
estimate_rowcount()
,这里会去通过试读去估算出MySQL认为应该给当前table的row_count(以及io_cost 、 cpu_cost等),用于 cost_model 计算,cost结果会存在 TABLE->read_time 中。
Calculate cost of potential semi-join materializations.
optimize_semijoin_nests_for_materialization()
,对embeding、semijoin的子查询做局部的 join reorder。其实就是按照依赖顺序排个序。
Calculate best possible join order based on available statistics.
Optimize_table_order()
, 对非子查询的所有表做 join reorder。
Fill in remaining information for the generated join order.
make_join_select()
核心对象ha_statistics
。最主要的是records表示table rowcount。
class ha_statistics { ulonglong data_file_length; /* Length off data file */ ulonglong max_data_file_length; /* Length off data file */ ulonglong index_file_length; ulonglong max_index_file_length; ulonglong delete_length; /* Free bytes */ ulonglong auto_increment_value; /* The number of records in the table. 0 - means the table has exactly 0 rows other - if (table_flags() & HA_STATS_RECORDS_IS_EXACT) the value is the exact number of records in the table else it is an estimate */ ha_rows records; ha_rows deleted; /* Deleted records */ ulong mean_rec_length; /* physical reclength */ /* TODO: create_time should be retrieved from the new DD. Remove this. */ time_t create_time; /* When table was created */ ulong check_time; ulong update_time; uint block_size; /* index block size */ /* number of buffer bytes that native mrr implementation needs, */ uint mrr_length_per_rec; }
myrocks是在handler::info中更新stats的。而info在除了insert的写和部分查询场景会被调用以更新采样信息(调用处多达十余处)。
/** General method to gather info from handler ::info() is used to return information to the optimizer. SHOW also makes use of this data Another note, if your handler doesn't proved exact record count, you will probably want to have the following in your code: if (records < 2) records = 2; The reason is that the server will optimize for cases of only a single record. If in a table scan you don't know the number of records it will probably be better to set records to two so you can return as many records as you need. Along with records a few more variables you may wish to set are: records deleted data_file_length index_file_length delete_length check_time Take a look at the public variables in handler.h for more information. See also my_base.h for a full description. @param flag Specifies what info is requested */ virtual int info(uint flag) = 0; // 以下为可能的flag对应bit取值。 CONST除了初始化较少用;大部分情况下用VARIABLE,因为VARIABLE涉及的变量确实是较频繁更新的;ERRKEY在正常路径不会用到,用来报错查信息;AUTO专门针对自增值,自增值可从内存里table级别对象拿到。 /* Recalculate loads of constant variables. MyISAM also sets things directly on the table share object. Check whether this should be fixed since handlers should not change things directly on the table object. Monty comment: This should NOT be changed! It's the handlers responsibility to correct table->s->keys_xxxx information if keys have been disabled. The most important parameters set here is records per key on all indexes. block_size and primar key ref_length. For each index there is an array of rec_per_key. As an example if we have an index with three attributes a,b and c we will have an array of 3 rec_per_key. rec_per_key[0] is an estimate of number of records divided by number of unique values of the field a. rec_per_key[1] is an estimate of the number of records divided by the number of unique combinations of the fields a and b. rec_per_key[2] is an estimate of the number of records divided by the number of unique combinations of the fields a,b and c. Many handlers only set the value of rec_per_key when all fields are bound (rec_per_key[2] in the example above). If the handler doesn't support statistics, it should set all of the above to 0. update the 'constant' part of the info: handler::max_data_file_length, max_index_file_length, create_time sortkey, ref_length, block_size, data_file_name, index_file_name. handler::table->s->keys_in_use, keys_for_keyread, rec_per_key */ #define HA_STATUS_CONST 8 /* update the 'variable' part of the info: handler::records, deleted, data_file_length, index_file_length, check_time, mean_rec_length */ #define HA_STATUS_VARIABLE 16 /* This flag is used to get index number of the unique index that reported duplicate key. update handler::errkey and handler::dupp_ref see handler::get_dup_key() */ #define HA_STATUS_ERRKEY 32 /* update handler::auto_increment_value */ #define HA_STATUS_AUTO 64
基数 Cardinality:表示 distinct value的个数,比如性别字段取值只有男和女的话,则其基数为2。
选择度 Selectivity :the fraction of the rows will be selected. **Selectivity of index = Cardinality / (row_count) ** . 选择度越高越好,最高为 100%。像性别字段的例子,选择度一般就很低,作为索引列时区分度差,一个男性筛选条件下面的数据依然是处于较大量级。
代价 Cost : 常说的CBO即是通过估算代价来决定优化策略。不同数据库代价模型可能会不一样,常见的作为代价模型的输入包括但不限于 row_count ,cardinality,selectivity,io/cpu/mem cost, max/min/avg of columns等。
Partial Plan:MySQL里的一个概念,指 one part of the whole plan,至少是两个表的一个JOIN关系。
上表为MySQL默认的一些单元操作的cost值,这些值是用户可以配置修改的。
# MySQL cost model大致思路可以描述为一个递推关系。实际cost计算会比这个要复杂。 f(tbl_i) = prev_total_cost + tbl_i_cost -> f(tbl_i) = f(tbl_i-1) + row_evaluate_cost(prev_rows * cur_table_rows * selectivity) + tbl_i_extra_cost plan_cost = f(tbl_n)
Join 表示一个 query 的 join plan,同时也作为 plan 的 context 流转(因此在 para query 等一些优化实现中,并行查询里除了最上层的父查询有实际优化的价值外,Join 起的作用更像一个 context )。
best_positions存放最终优化的 table order 结果。
best_read 存放最终 cost
best_ref 存放输入的 table 序列,the optimizer optimizes best_ref 。
需注意,在一些单表简单查询下,上面的变量有可能为空。
make_join_plan
在 JOIN::optimize 里被调用,计算最佳的 join order 并构建 join plan。
Optimize_table_order
类负责实际的join reorder操作,入口方法为其惟一的 public 方法 choose_table_order
,在 make_join_plan
中被调用。 Optimize_table_order
依赖三个前提:
choose_table_order
Steps:
初始化 const_tables的 cost,如果全是 const_tables 则可以直接短路返回
如果是在一个 sjm(semi-join materialization) plan 优化过程中,则做一次排序将 semi-join (即子查询的query 提前预计算,可根据需求物化)
否则,非 STRAIGHT_JOIN 且 depend 无关的 tables 是按照 row_count 从小到大排序的
if (SELECT_STRAIGHT_JOIN option is set) reorder tables so dependent tables come after tables they depend on, otherwise keep tables in the order they were specified in the query else Apply heuristic: pre-sort all access plans with respect to the number of records accessed. Sort algo is merge-sort (tbl >= 5) or insert-sort (tbl < 5)
如果有 where_cond,需要把where_cond涉及的列 遍历设置到table->cond_set的bitmap中。
STRAIGHT_JOIN 的 tables 优化 optimize_straight_join
。 STRAIGHT_JOIN 相当于用户限定了 JOIN 的顺序,所以此处的优化工作如其注释所说:Select the best ways to access the tables in a query without reordering them.
非 STRAIGHT_JOIN 则使用 启发式贪心算法 greedy_search
进行 join reorder。
optimize_straight_join
:
DBUG_ASSERT(join->select_lex->sj_nests.is_empty());
与semi-join不兼容,只关注primary tables。best_access_path
计算其最优的access path,best_access_path
通俗的思路概括可参见上面table access path
explain文档中关于join types的介绍。set_prefix_join_cost计算当前表基于对应access path下的cost,并计入总的cost model。Cost计算如下:
m_row_evaluate_cost = 0.1 // default value /* Cost of accessing the table in course of the entire complete join execution, i.e. cost of one access method use (e.g. 'range' or 'ref' scan ) multiplied by estimated number of rows from tables earlier in the join sequence. */ read_cost = get_read_cost(table) void set_prefix_join_cost(uint idx, const Cost_model_server *cm) { if (idx == 0) { prefix_rowcount = rows_fetched; prefix_cost = read_cost + prefix_rowcount * m_row_evaluate_cost; } else { // this - 1 means last table prefix_rowcount = (this - 1)->prefix_rowcount * rows_fetched; prefix_cost = (this - 1)->prefix_cost + read_cost + prefix_rowcount * m_row_evaluate_cost; } // float filter_effect [0,1] means cond filters in executor may reduce rows. 1 means nothing filtered, 0 means all rows filtered and no rows left. It is used to calculate how many row combinations will be joined with the next table prefix_rowcount *= filter_effect; }
greedy_search
:
bool Optimize_table_order::best_extension_by_limited_search( table_map remaining_tables, uint idx, uint current_search_depth); procedure greedy_search input: remaining_tables output: partial_plan; { partial_plan = <>; do { (table, a) = best_extension_by_limited_search(partial_plan, remaining_tables, limit_search_depth); partial_plan = concat(partial_plan, (table, a)); remaining_tables = remaining_tables - table; } while (remaining_tables != {}) return pplan; } // 简单理解就是每一步找拓展出去join路径最佳的table,按顺序加进plan里面。 // 这种方案会很受选取的第一个表影响(因为第一个表没有join关系,只能依靠筛选之后的cardinality,一般都是小表),选小表作第一个表不一定是最优选择。一般Greedy的优化方案会把每个表都当第一个表去评估一次cost,然后从N个cost里选最小的作为最终plan。MySQL里只是返回找到的第一个完整的plan。
best_extension_by_limited_search
是一个启发式的搜索过程,search_depth即最大可搜索的深度。best_extension_by_limited_search
前半部分逻辑和optimize_straight_join
类似:
best_access_path
并计算cost。prune_level=PRUNE_BY_TIME_OR_ROWS
开启,则判断如果best_row_count和best_cost已经大于当前的rows和cost(注意新版本是and关系),且该表没有被后续的其他表依赖 (可以理解成该表是这个图路径上的最后一个节点,所以可以直接prune;但不一定是整个plan的最后一个节点),则将best_row_count和best_cost设为当前的。写的很绕的代码,结合整个循环看,大致就是每次找基于(rowcount, cost)二维最优的表,所谓的剪枝实际变成了类似加强贪心。