I will be updating this gist as I dig deeper into the optimization planner. My research will be strictily bound to SELECT qeury right now.
When running an explain query, you can find the output like ->
MariaDB [sample_db]> explain select * from sample_table;
+------+-------------+--------------+------+---------------+------+---------+------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+--------------+------+---------------+------+---------+------+------+-------+
| 1 | SIMPLE | sample_table | ALL | NULL | NULL | NULL | NULL | 4 | |
+------+-------------+--------------+------+---------------+------+---------+------+------+-------+
The entry point for any select query is defined in file sql_select.cc
/**
An entry point to single-unit select (a select without UNION).
@param thd thread handler
@param rref_pointer_array a reference to ref_pointer_array of
the top-level select_lex for this query
@param tables list of all tables used in this query.
The tables have been pre-opened.
@param fields list of items in SELECT list of the top-level
select
e.g. SELECT a, b, c FROM t1 will have Item_field
for a, b and c in this list.
@param conds top level item of an expression representing
WHERE clause of the top level select
@param og_num total number of ORDER BY and GROUP BY clauses
arguments
@param order linked list of ORDER BY agruments
@param group linked list of GROUP BY arguments
@param having top level item of HAVING expression
@param proc_param list of PROCEDUREs
@param select_options select options (BIG_RESULT, etc)
@param result an instance of result set handling class.
This object is responsible for send result
set rows to the client or inserting them
into a table.
@param select_lex the only SELECT_LEX of this query
@param unit top-level UNIT of this query
UNIT is an artificial object created by the
parser for every SELECT clause.
e.g.
SELECT * FROM t1 WHERE a1 IN (SELECT * FROM t2)
has 2 unions.
@retval
FALSE success
@retval
TRUE an error
*/
bool
mysql_select(THD *thd, TABLE_LIST *tables, List<Item> &fields, COND *conds,
uint og_num, ORDER *order, ORDER *group, Item *having,
ORDER *proc_param, ulonglong select_options, select_result *result,
SELECT_LEX_UNIT *unit, SELECT_LEX *select_lex)
On first invokation, it tries to setup linkage, determine if this table is derived table or not, creates a JOIN
datastruct (not sure why) and then prepare the query.
Once the query is prepared, it works on optimizing the said JOIN
structure.
The actual query optimization happens in greedy_search()
and static enum_best_search best_extension_by_limited_search()
where the former is a greedy search (sherlock...) and the latter is exhasutive search upto certain depth.
Will be digging deeper in the exhaustive search tomorrow.
Reading the QEP implementation was a rather humbling exercise and I admit, I can't really come up with something more elegant that what they already have (and I could grasp like 10% of it).
The overall idea about running a SELECT query is:
- Initializing a JOIN struct (or a UNION struct, depending on the query).
- Prepare the join (gather the required statistics like table, column and index stats, which inturn are generally pre-computed, you'll read more about them later in the blog.)
- We try and optimize this JOIN
- first layer of optimization (
JOIN::optimize_inner()
) will be performed which roughly does the following:- Merges all the derived tables into this layer of select.
- Transforms the IN predicates into subqueries.
- Converts subqueries to semi-joins wherever applicable. See convert_join_subqueries_to_semijoins() to understand what do I mean here.
- Creates a bitmap of tables used from the SELECT list. (They loooooove bitmaps.)
- Create an EXPLAIN query if it's not already existing (They ensure that the explain query is built anyway and kept till the JOIN has not dropped even if you haven't explicitly run a EXPLAIN query).
- Do some optimizations for nested joins & constant subueries
- Evaluate all the EQ conditions in WHERE and HAVING clause.
- Transform the IN queries into EQ queries and optimize all the conditions.
- ....... Did not understand what's the planner doing for quite a while.
- Call
make_join_statistics()
which calculates the best possible join structure. - FINALLY MARK STATE TO
OPTIMIZATION_PHASE_1_DONE
! (PS : By default it's not required, it'll set it toOPTIMIZATION_DONE
instead by default)
- Second layer of optimization is triggered if value is set to
OPTIMIZATION_PHASE_1_DONE
.
- first layer of optimization (
- We finally execute the planned JOIN.
However, after reading this internal blog about conditional selectivity (How many rows of a table do we estimate to return in a join / condition evalution?), my eyes darted towards the statistics elements of the engine with a new goal - messing up with histograms, which I'll explain tomorrow.
First, why are histograms even present in a database? Amazing question, thankfully mariadb has an answer here -> Histogram Based Statistics
And how do you play with them? They got that basics here -> DECODE_HISTOGRAM
Will be posting a separate thread about hacking and playing with histograms.
Will add something here if I find anything insightful and not present in docs already :)