Skip to content

Instantly share code, notes, and snippets.

@Integralist
Last active June 10, 2026 09:23
Show Gist options
  • Select an option

  • Save Integralist/51ae85eb6f522a6e8399cacefa385258 to your computer and use it in GitHub Desktop.

Select an option

Save Integralist/51ae85eb6f522a6e8399cacefa385258 to your computer and use it in GitHub Desktop.
MySQL Database: Core Fundamentals & Common Gotchas

Common Table Expressions

A Common Table Expression (CTE) in MySQL is a temporary result set that you can reference within a SELECT, INSERT, UPDATE, or DELETE statement.

Think of it as a temporary view that exists only for the duration of that single query. It allows you to write cleaner, more readable SQL by breaking down complex queries into smaller, modular building blocks.

CTEs were introduced in MySQL 8.0, so you will need that version or higher to use them.

The Basic Syntax

A CTE always starts with the WITH keyword, followed by the CTE name and the query that defines it.

WITH cte_name AS (
    -- Your subquery goes here
    SELECT column1, column2
    FROM table_name
    WHERE condition
)
-- Your main query that uses the CTE
SELECT * FROM cte_name;

Why Use a CTE? (CTE vs. Traditional Subqueries)

Before CTEs, if you wanted to use a temporary result set, you had to use nested subqueries or derived tables. This often led to "Pyramid of Doom" code that was incredibly hard to read.

Here is a quick comparison of how a CTE improves readability:

Old Way (Derived Table / Subquery)

SELECT department_id, AVG(salary)
FROM (
    SELECT department_id, salary 
    FROM employees 
    WHERE status = 'Active'
) AS active_employees
GROUP BY department_id;

New Way (With a CTE)

WITH active_employees AS (
    SELECT department_id, salary 
    FROM employees 
    WHERE status = 'Active'
)
SELECT department_id, AVG(salary)
FROM active_employees
GROUP BY department_id;

Why this is better: You read a CTE from top to bottom, exactly how your brain processes logic. With subqueries, you have to read from the inside out.

Types of CTEs

MySQL supports two types of Common Table Expressions:

1. Non-Recursive CTEs

These are simple CTEs that do not reference themselves. They are purely used to modularize code, join multiple temporary sets together, or simplify complex aggregations.

You can even chain multiple CTEs together by separating them with a comma:

WITH total_sales AS (
    SELECT region, SUM(amount) AS total FROM sales GROUP BY region
),
top_regions AS (
    SELECT region FROM total_sales WHERE total > 100000
)
SELECT * FROM top_regions;

2. Recursive CTEs

This is where CTEs get supercharged. A recursive CTE is a query that references its own name. It continuously loops through data until a specific condition is met.

This is incredibly useful for dealing with hierarchical or tree-structured data (like organizational charts, category/sub-category loops, or bill-of-materials).

A recursive CTE requires the RECURSIVE keyword and consists of two parts combined by a UNION:

  1. Anchor Member: The base query that starts the recursion.
  2. Recursive Member: The query that references the CTE and loops.

Example: Generating numbers from 1 to 5

WITH RECURSIVE number_sequence AS (
    -- Anchor member
    SELECT 1 AS n  
    UNION ALL
    -- Recursive member (loops until n = 5)
    SELECT n + 1 FROM number_sequence WHERE n < 5  
)
SELECT * FROM number_sequence;

Key Benefits of CTEs

  • Readability: Isolates complex logic so your main query is clean.
  • Reusability: You can reference the same CTE multiple times in your main query (e.g., joining a CTE to itself), whereas a subquery would have to be written out twice.
  • Recursion: Allows MySQL to easily traverse hierarchical data without needing complex stored procedures or application-level loops.
  • Contextual Substitution: Great substitution for temporary tables when you don't need indexing or overhead storage.

πŸš€ MySQL EXPLAIN: The Ultimate Performance Guide

Database optimization can feel like a dark art, but the EXPLAIN command is your "Matrix vision." It reveals how MySQL plans to execute your query before you run it.


1. The Starting Line: Slow Query Log

You can't fix what you can't find. Use the Slow Query Log to catch queries that exceed a specific time threshold.

-- Check current settings
SHOW VARIABLES LIKE 'slow_query_log';
SHOW VARIABLES LIKE 'long_query_time';

-- Set threshold to 0.5 seconds (Global)
SET GLOBAL long_query_time = 0.5;
SET GLOBAL slow_query_log = 'ON';

2. Reading the EXPLAIN Table

Focus on these five critical columns. They tell the story of your query's efficiency.

Column Importance High-Level Meaning
type ⭐⭐⭐ How MySQL finds the rows (The "Speed Limit").
possible_keys ⭐⭐ The Candidates: Every index MySQL could have used.
key ⭐⭐⭐ The Winner: The actual index MySQL decided to use.
rows ⭐⭐ Estimate of how many rows MySQL must scan.
Extra ⭐⭐⭐ Critical "under-the-hood" details (Sorts, Temps, Loops).

πŸ’‘ Candidates (possible_keys) vs. The Winner (key)

Think of this as the optimizer's narrowing-down process:

  • possible_keys: These are the "Shortlisted" indexes. If this is NULL, you have no indexes that match your WHERE or JOIN clauses.
  • key: This is the final choice. MySQL picks the index with the "lowest cost" (usually the one that eliminates the most rows the fastest).

Why would they differ? > If possible_keys shows an index but key is NULL, MySQL decided it’s actually faster to scan the whole table (type: ALL) than to use the indexβ€”this often happens on very small tables or if you are fetching >20% of the total rows.


3. The table Column: Decoding Special Names

If the table column looks like a weird tag, MySQL is working with internal data:

  • <derivedN>: A temporary table generated from a Subquery in your FROM clause. The N refers to the id line in the output where that subquery was defined.
  • <unionM,N>: The internal result of a UNION. MySQL is merging results from lines M and N and performing deduplication.
  • <subqueryN>: A subquery that has been optimized into a specialized lookup table for efficiency.

4. The type Hierarchy (Fastest to Slowest)

Your goal is to move up this list.

🏎️ The "Fast" Lane

  • system / const: MySQL finds exactly one row instantly. (Searching by Primary Key).
  • eq_ref: The "Golden Ticket" for joins. A 1-to-1 match using a Unique/Primary key.
  • ref: Using a standard index to find multiple rows.
    • Analogy: Finding all "Smiths" in a phone book.

🚲 The "Proceed with Caution" Lane

  • range: Using an index to find a slice of data (BETWEEN, IN(), >, <).
  • index: The Full Index Scan. MySQL reads the entire index file. Better than a table scan, but still a lot of reading.

🐌 The "Danger" Zone

  • ALL: The Full Table Scan. MySQL reads every single byte of the table from disk.
    • Fix: Add an index or refine your WHERE clause.

Tip

When is ALL okay?
If the table column shows <derivedN> or <unionM,N> and the "rows" count is very small (e.g., under 100), an ALL scan is perfectly fine. It just means MySQL is reading a small temporary result set it just created in memory.


5. The Extra Column: Red Flags & Green Lights

This column reveals the "extra effort" MySQL has to put in.

Flag Impact Meaning
Using index βœ… Good Covering Index: MySQL got all data from the index itself without touching table rows.
Using index condition βœ… Good Index Condition Pushdown (ICP): MySQL filtered rows at the storage engine level using the index before even reading the full table rows. (Saves I/O).
Using index for group-by βœ… Good "Loose Index Scan." MySQL jumped between unique index values (very fast).
Using where ⚠️ Neutral MySQL filters rows after fetching them. Watch out if type is ALL.
Using intersect(...) ⚠️ Neutral Index Merge: MySQL is using two separate indexes and finding the overlap. Better than a scan, but usually slower than a single Composite Index.
Using filesort 🚨 Bad MySQL had to manually sort data because the index order didn't match your ORDER BY.
Using temporary 🚨 Bad An internal table was built in memory to process the result (Common with UNION or DISTINCT).
Dependent Subquery πŸ’€ Very Bad The subquery runs once for every row in the outer query. This is a performance killer.

Note

ICP Explanation:
Normally, the process of fetching data has two steps:

  1. The Storage Engine (e.g., InnoDB) uses an index to find a row and hands the entire row back to the MySQL Server.
  2. The MySQL Server then looks at the row and says, "Does this actually match the rest of my WHERE clause?" Using index condition means MySQL "pushed" the filtering logic down to the Storage Engine. Instead of handing back every row that might match, the Storage Engine checks the conditions against the index fields first and only hands back the rows that actually match.

Example: You have an index on (last_name, birth_year).
Your query is: WHERE last_name LIKE 'Smi%' AND birth_year > 1990.\

Without ICP, MySQL finds all "Smi%" entries and reads the full row for every single one from the disk just to check the year. With Using index condition, MySQL checks the birth_year while it's still looking at the index. It only goes to the disk for the "Smi" people actually born after 1990.


6. Common "Index Killers"

Even with an index, these mistakes force a slow type: ALL scan:

  1. Leading Wildcards: LIKE '%smith' (Slow). LIKE 'smith%' (Fast).
  2. Function Wrappers: WHERE YEAR(date) = 2026 (Slow). Use WHERE date >= '2026-01-01' (Fast).
  3. Type Mismatch: Searching a VARCHAR column with a number (e.g., WHERE string_id = 123). MySQL has to convert every row to a number to compare them.

7. Pro-Tip: The Composite Index Rule

If you frequently query WHERE status = 'active' AND created_at > '...', a single index on (status, created_at) is vastly superior to two separate indexes.

  • The Leftmost Prefix Rule: If you have an index on (A, B, C), it can be used for queries on (A), (A, B), or (A, B, C). It cannot be used for a query only on (B) or (C).

Database Indexes: Core Fundamentals & Common Gotchas

A practical guide to how database indexes function under the hood, how they impact performance, and advanced index types.

1. Fundamentals & Costs

  • How They Work: Indexes utilize a sorted B-tree structure and binary search to drastically accelerate query read times.
  • The Trade-off: While they optimize reads, they introduce a read-fast / write-slow trade-off because every INSERT, UPDATE, or DELETE requires updating the index and incurs additional storage costs.

2. Common Index Gotchas

  • Left-Prefix Rule: For composite (multi-column) indexes, column ordering matters. The query must filter by the leftmost columns in the index definition to utilize it.

Note

The order of columns inside your WHERE clause does not matter because the query optimizer will reorder them to match the index; however, the necessary columns must be present.

  • Index Gaps (Partial Matching): If you filter by the first column and a later column, but skip a column in the middle of the index definition, you create an "index gap." The database will use the index to filter up until the missing column, but will be forced to perform a manual scan for any subsequent columns.
-- Composite index definition
CREATE INDEX idx_users_name_country_state ON users(last_name, country, state);

-- BAD: Does NOT use the index (skips the leftmost column entirely)
SELECT * FROM users WHERE country = 'UK';

-- GOOD: Uses the full index (Optimizer reorders the WHERE clause automatically)
SELECT * FROM users WHERE state = 'California' AND country = 'US' AND last_name = 'Smith';

-- PARTIAL / INDEX GAP: Uses the index for `last_name` only. 
-- Because `country` is missing, it cannot use the index to look up `state`.
-- The database must manually scan all 'Smith' rows to filter for 'California'.
SELECT * FROM users WHERE last_name = 'Smith' AND state = 'California';
  • Equality Before Range Rule: When designing a composite index, always place equality columns (=, IN) before range columns (>, <, BETWEEN, LIKE). A database can only evaluate one range condition per index lookup; any index columns specified to the right of a range condition are ignored.
-- BAD:

-- INEFFICIENT COMPOSITE INDEX DESIGN: Range column placed first
CREATE INDEX idx_users_created_country ON users(created_at, country);

-- POOR PERFORMANCE: The `>=` range condition halts index traversal. 
-- The database uses the index for the date range, but must manually 
-- scan every matched row to filter by country.
SELECT * FROM users WHERE country = 'UK' AND created_at >= '2026-01-01';

-- GOOD:

-- Composite index definition (Equality first, Range last)
CREATE INDEX idx_users_country_created ON users(country, created_at);

-- OPTIMAL: Uses both columns of the index because equality comes first.
SELECT * FROM users WHERE country = 'UK' AND created_at >= '2026-01-01';
  • Function-Wrapped Columns: Wrapping indexed columns in functions (e.g., WHERE LOWER(name) = 'pokemon') breaks standard index usage and triggers a full table scan.
    • Solution: Use functional (expression) indexes to index the output of the function directly.
-- BAD: Standard index on `name` is ignored here
SELECT * FROM users WHERE LOWER(name) = 'pokemon';

-- GOOD: Create an expression/functional index
CREATE INDEX idx_users_lower_name ON users(LOWER(name));

3. Advanced Indexing Techniques

  • Partial Indexes: Indexes that include a WHERE clause. Perfect for optimizing queries on low-cardinality data or specific filtered subsets, saving significant storage.
-- Index only active accounts (saves space if most users are inactive)
CREATE INDEX idx_active_users ON users(id) WHERE status = 'active';

-- This query utilizes the partial index
SELECT id FROM users WHERE status = 'active' AND id = 42;
  • Covering Indexes (Index Only Scan): Uses PostgreSQL's INCLUDE clause to append non-key payload data directly to the index. This allows the database to fetch the required data purely from the index without reading the underlying table heap.
    • Caveat: Due to MVCC (Multi-Version Concurrency Control), a table heap lookup may still happen unless a VACUUM has recently run to map the pages as 100% visible.
-- Index by email, but attach the username to the payload
CREATE INDEX idx_users_email_include_username ON users(email) INCLUDE (username);

-- Index Only Scan: The database doesn't need to look at the main table at all
SELECT email, username FROM users WHERE email = 'alex@example.com';

4. Diagnostics

  • Always use EXPLAIN and EXPLAIN ANALYZE to accurately verify whether your database engine is utilizing your indexes as intended.
-- Shows the execution plan and execution statistics
EXPLAIN ANALYZE 
SELECT * FROM users WHERE email = 'alex@example.com';

Virtual Columns

In MySQL, a Virtual Column (specifically known as a Generated Column) is a column whose value is automatically calculated from an expression or other columns in the same table, rather than being explicitly inserted or updated by a user.

Think of it like a formula in an Excel spreadsheet: you define the rule once, and the database ensures the data stays in sync.

Why is it useful?

1. Simplified Queries and DRY (Don't Repeat Yourself)

Instead of writing complex logic in every SELECT statement, you bake the logic into the table schema.

  • Example: If you have first_name and last_name, you can create a virtual full_name column. You'll never have to manually concatenate them in your code again.

2. Indexing the Un-indexable

This is the "killer feature" of Generated Columns. You cannot directly index the result of a function in a WHERE clause in older versions of MySQL.

  • The Problem: Querying WHERE JSON_EXTRACT(data, "$.id") = 10 is slow because it requires a full table scan.
  • The Solution: Create a Generated Column that extracts that JSON field and index that column. Now, your JSON lookups are lightning-fast.

3. Data Integrity and Consistency

Because the database handles the calculation, there is zero risk of the generated value getting "out of sync" with the source data. If the price or tax_rate changes, the total_price column updates automatically and atomically.

4. Handling Complex Constraints

You can use Generated Columns to enforce business logic. For example, you could create a generated column that returns TRUE if a complex set of conditions is met, and then place a UNIQUE index on it to prevent invalid data combinations.

The Two Types of Generated Columns

MySQL offers two "flavors" of these columns, and the distinction is mostly about storage space versus CPU performance:

Type Stored on Disk? Performance Impact
VIRTUAL No Calculated on-the-fly when the row is read. Uses no disk space but costs a bit of CPU during SELECT.
STORED Yes Calculated when the row is written/updated. Uses disk space but is faster to read (like a normal column).

VIRTUAL vs. STORED: When to Use Which?

When creating Generated Columns, the choice between VIRTUAL and STORED comes down to a fundamental architectural trade-off: Disk Space vs. CPU Performance.

Type Stored on Disk? Calculation Trigger Best For...
VIRTUAL ❌ No Calculated on-the-fly during a SELECT query. Simple calculations and saving storage space.
STORED Yes Calculated once during an INSERT or UPDATE. Complex logic, high-read tables, and sorting optimization.

πŸ’‘ When to choose VIRTUAL

Use VIRTUAL as your default choice for lightweight, straightforward operations.

  • To Save Disk Space: Because values are never physically written to the table heap, it uses zero additional storage.
  • Low-Cost Expressions: Perfect for simple string concatenations (e.g., CONCAT(first_name, ' ', last_name)) or basic math where the CPU overhead to compute the value on-the-fly is completely negligible.
  • Low-Read / High-Write Tables: If a table undergoes frequent writes but infrequent reads, VIRTUAL prevents write amplification because the database doesn't have to re-compute and re-write the generated data on every single update.

⚑ When to choose STORED

Use STORED when data retrieval speed is your absolute highest priority, or when dealing with complex data transformations.

  • Heavy / Expensive Calculations: If your expression utilizes resource-heavy functions, computing it on-the-fly for millions of rows during a SELECT query will cripple performance. STORED moves that CPU tax to the write phase so reads remain instant.
  • High-Read / Low-Write Tables: If the data is written once but queried constantly, paying a tiny penalty during the INSERT means every subsequent SELECT reads a pre-computed static value directly from disk.
  • Optimizing ORDER BY and Aggregations: If you frequently sort by the generated value (e.g., ORDER BY path_pattern_length DESC), a STORED column provides a static, physical property that the database can scan cleanly.

Note

While modern MySQL allows you to index VIRTUAL columns (which creates a hidden physical index structure anyway), utilizing a STORED column ensures that the computed data physically resides within the main table row, making it a highly reliable and bulletproof strategy for turning un-indexable function results into static, searchable data.

How it looks in SQL

Here is a breakdown of the syntax:

ALTER TABLE products -- Tell the database we are modifying the structure of the existing products table.
  ADD COLUMN discount_price DECIMAL(10,2) -- Create a new column named discount_price.
    GENERATED ALWAYS AS (original_price * 0.9) -- Don't let users input data here; calculate it using this rule instead.
    STORED; -- This defines how the data is kept.
  • VIRTUAL: The value is calculated "on the fly" only when you query the table.
  • STORED: The result of the calculation is written to disk whenever original_price is updated.

In routing systems (URL matching), you often want the "most specific" match to win. Since a longer pattern is usually more specific than a shorter one. The following example shows how an index can allow the database to instantly find the longest applicable pattern for a specific version_id and match_type.

You'll see we are using a Generated Column (path_pattern_length). This is necessary because of a fundamental rule in database indexing: You cannot directly index the result of a function unless you use a Generated Column (or a Functional Index); and by indexing this in descending order, we are trying to prioritize longer path patterns first.

This is useful for when you frequently run queries like ORDER BY path_pattern_length DESC LIMIT 10. By having the index already sorted in descending order allows the database to read the values directly from the "start" of the index without having to reverse-scan.

ALTER TABLE path_groups
  ADD COLUMN path_pattern_length INT
    GENERATED ALWAYS AS (LENGTH(path_pattern)) STORED;

CREATE INDEX idx_version_matchtype_length
  ON path_groups (version_id, match_type, path_pattern_length DESC);

It eliminates the need for a "Sort" operation in the execution plan if your query asks for the longest paths first.

It allows the database to find the "Maximum" value for that column (within a specific version/match type) in O(1) or O(log n) time.

If we simply created an index on path_pattern, the database would store the strings (e.g., /user/*, /admin/settings/logs). If we then ran a query asking for the length of those strings, the database would have to:

  1. Scan every single row.
  2. Calculate the length for each one on the fly.
  3. Sort them.

This is a "Full Table Scan," which is extremely slow as your data grows. By creating a generated column, you give the database a static value that it can physically put into an index.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment