The DecorrelatePredicateSubquery
rule in Apache DataFusion is responsible for rewriting correlated subqueries in WHERE
/HAVING
clauses (specifically IN
and EXISTS
predicates, including their negations) into semijoin or antijoin operations. This transforms a nested query into a flat join-based plan for execution. To achieve this, the rule employs a carefully orchestrated recursion strategy that handles subqueries within subqueries (nested subqueries) and coordinates with DataFusion’s optimizer driver to avoid duplicate traversals.
Top-Down Invocation: The rule is registered to run in a top-down manner. In its implementation of OptimizerRule
, it overrides apply_order
to return Some(ApplyOrder::TopDown)
(decorrelate_predicate_subquery.rs - source). This signals the optimizer to call the rule’s rewrite
method on a plan node before optimizing that node’s inputs. In practice, DataFusion’s optimizer will invoke DecorrelatePredicateSubquery::rewrite
on a Filter
node that may contain a subquery, and then let the rule handle the rest. (By contrast, many simpler rules use a bottom-up order, meaning children get optimized first. More on this below.)
Recursing into Subqueries: When rewrite
is called on a plan, the first thing this rule does is explicitly recurse into any subquery plans embedded in the plan’s expressions. It uses the LogicalPlan::map_subqueries
utility to find subquery expressions and apply a transformation closure to each one (decorrelate_predicate_subquery.rs - source). The closure invokes subquery.transform_down(|p| self.rewrite(p, config))
, which essentially calls the same rewrite
function (the rule itself) on the subquery’s plan p
in a top-down traversal (decorrelate_predicate_subquery.rs - source). In other words, the rule calls itself recursively on each inner subquery plan before proceeding. This ensures that if a subquery itself contains another subquery (or multiple layers of nesting), those get decorrelated as well. By the time the rule finishes this step, all nested subqueries should have been converted to joins (or left intact if not transformable) within their respective subplans.
// Inside DecorrelatePredicateSubquery::rewrite(...)
let plan = plan
.map_subqueries(|subquery| {
// Recursively apply this same rewrite rule to the subquery plan
subquery.transform_down(|p| self.rewrite(p, config))
})?
.data;
Explanation: The code above shows the rule obtaining a new plan
by mapping a closure over any subqueries. The closure uses transform_down
to apply self.rewrite
to each subplan node top-down (decorrelate_predicate_subquery.rs - source). This effectively pulls recursion into the rule for subqueries. Notably, DataFusion’s normal optimizer traversal would not descend into these subquery plans on its own (since a subquery is contained in an expression, not as a direct child in the logical plan tree). By doing this, the rule prevents subqueries from being “missed” during optimization.
Handling the Outer Plan (Filter Node): After recursively processing inner subqueries, the rule turns to the plan it was invoked on – typically a LogicalPlan::Filter
that has a predicate with one or more subquery expressions. If the plan node isn’t a Filter
, or the filter predicate contains no subquery, the rule does nothing and returns Transformed::no(plan)
(meaning “no change”) (decorrelate_predicate_subquery.rs - source). But assuming we have a filter with a correlated subquery, the rule proceeds to decorrelate it:
-
Identify Subquery Predicates: The filter’s predicate (which could be a conjunction of multiple conditions) is analyzed for subquery expressions. Using a helper, the rule splits the predicate into two lists:
with_subqueries
(the pieces of the predicate that include a subquery likeEXISTS(...)
orcol IN (subquery)
) andother_exprs
(the remaining conjuncts) (decorrelate_predicate_subquery.rs - source). This partitioning isolates the subquery conditions we need to transform. -
Iterate and Transform Subqueries: The rule then iterates over each subquery expression in
with_subqueries
(decorrelate_predicate_subquery.rs - source). For each, it distinguishes two cases:-
Top-Level Subquery Predicate: If the subquery expression is a stand-alone conjunct (not nested inside a larger boolean expression), it is treated as a top-level predicate. For example,
WHERE x IN (SELECT ...) AND ...
or justWHERE EXISTS(SELECT ...)
. In this case, the rule invokesbuild_join_top
, which constructs a new plan that joins the current outer plan (cur_input
) with the subquery plan, turning the subquery condition into a join filter. This typically produces a Semi Join (forEXISTS/IN
) or an Anti Join (forNOT EXISTS/NOT IN
). Ifbuild_join_top
succeeds, it returns a newplan
representing the join, and the rule updatescur_input
to this new plan (decorrelate_predicate_subquery.rs - source). If it fails or deems the subquery cannot be transformed (perhaps due to an unsupported pattern), it returnsNone
, and the rule in that case simply keeps the subquery as is by appending the original subquery condition back intoother_exprs
(decorrelate_predicate_subquery.rs - source). (In other words, if it can’t decorrelate that subquery, it will leave it in the filter condition.) -
Embedded Subquery Expression: If the subquery expression is part of a larger expression (for example,
WHERE score > (SELECT avg(...) ...)
or some complex Boolean combination likecol1 = 5 OR EXISTS(...)
), the rule uses a different approach. It calls a helper functionrewrite_inner_subqueries
to deal with this scenario (decorrelate_predicate_subquery.rs - source). This helper takes the current outer plan (cur_input
) and the compound expression containing the subquery, and rewrites the expression from the inside out. It traverses the expression tree (expr.transform(...)
) looking forExpr::Exists
orExpr::InSubquery
nodes (decorrelate_predicate_subquery.rs - source). Each time it finds a subquery:- It performs a “mark join” operation: it joins the outer plan with the subquery (similar to a left-semi join but specifically to produce a boolean marker). The join output will include a boolean column or expression indicating whether a match was found for each outer row. This might be a TRUE/FALSE flag or a non-null presence indicator for the subquery’s result.
- It then replaces the
EXISTS(...)
orIN (subquery)
expression with a new expression (exists_expr
) that references the join outcome (e.g. a column in the join that carries the true/false mark) (decorrelate_predicate_subquery.rs - source). In the code,mark_join
returns(plan, exists_expr)
whereplan
is the updatedcur_input
with the join added, andexists_expr
is the expression (usually a column reference or literal) to use in place of the subquery condition (decorrelate_predicate_subquery.rs - source). - The outer plan (
cur_input
) is updated in place with the new join (cur_input = plan
) as each subquery is processed. This means if there are multiple subquery expressions embedded in one larger predicate, they will be handled one by one, each possibly adding a new join. The final returned expression (expr_without_subqueries
) has noEXISTS
/IN
left – those have been replaced by the marker expressions that are now valid given the new joined plan. - If a subquery cannot be transformed (e.g.,
mark_join
returnsNone
for it), the helper will stop or simply leave that part unchanged (for instance, aNOT EXISTS
might be left as is if it can’t be converted, as seen in the code returningTransformed::no(not_exists(subquery))
for an unsupported case (decorrelate_predicate_subquery.rs - source)).
In summary,
rewrite_inner_subqueries
uses a recursive traversal of the expression tree to find subquery nodes and convert them by injecting joins. This is an inner recursion (over the expression’s structure) in addition to the outer recursion (over the plan nodes) that the rule performs. -
-
Reassemble the Filter: After each subquery expression from
with_subqueries
is handled (either converted into a join or left intact and moved toother_exprs
), the rule reconstructs the filter condition. All remaining predicates inother_exprs
(which now include any original non-subquery conditions plus any subquery conditions that could not be decorrelated) are combined back into a single expression via a logicalAND
(conjunction(other_exprs)
). If this combined expression is not empty (not all predicates were turned into joins), a newFilter
node is created on top of the latestcur_input
plan (decorrelate_predicate_subquery.rs - source). Ifother_exprs
is empty (meaning all subquery conditions were transformed into joins and there were no other filters), then no Filter node is needed – thecur_input
is already the semijoin/antijoin plan representing the filtered result. -
Marking the Plan as Transformed: Finally, the rule returns
Transformed::yes(cur_input)
to indicate that the plan was rewritten (decorrelate_predicate_subquery.rs - source). Thecur_input
at this point is the new logical plan where the subquery predicate has been replaced by one or more joins (plus a possibly residual filter). For example, anEXISTS
clause becomes a semi-join, or ana IN (subquery)
becomes a semi-join with a filter on equality ofa
and the subquery’s output. The original correlated condition is gone from the filter expression, thus decorrelated.
It’s worth noting that if a subquery was unsupported (left untransformed), the rule still returns Transformed::yes
because it may have applied other changes or at least rebuilt the Filter node. In degenerate cases this could mean no real change was made except reconstructing the same filter, which could cause the optimizer to think progress was made. DataFusion has guardrails (like a plan signature check) to detect when a rule isn’t actually making progress, to avoid infinite optimization loops. Indeed, in earlier versions, this rule had some limitations (e.g., it did not handle SORT
or LIMIT
inside subqueries, as noted in a bug report) and would bail out (decorrelate_where_in doesn't support Sort/Limit in Subquery. · Issue #6263 · apache/datafusion · GitHub) (decorrelate_where_in doesn't support Sort/Limit in Subquery. · Issue #6263 · apache/datafusion · GitHub). Such scenarios were addressed by improving the rule and by allowing the configuration to skip failing rules to prevent non-termination.
Optimizer vs. Rule Recursion: An important aspect of this design is how the recursion is divided between the optimizer framework and the rule itself. In DecorrelatePredicateSubquery
, the rule itself explicitly traverses into subquery plans (via map_subqueries
) and into expression trees (rewrite_inner_subqueries
). The optimizer’s role is to call rewrite
on the outer plan nodes in a top-down order; it does not automatically descend into subquery expressions. This means there is no double recursion happening. The rule handles the subquery internals, while the optimizer handles the outer structure. For example:
- The optimizer identifies a Filter node and invokes
rewrite
on it (because the rule declared interest in top-down fashion). - The rule sees the Filter has a subquery, and the rule takes care of optimizing that subquery’s plan by calling
self.rewrite
recursively on it. The optimizer won’t separately call the rule again for that subquery, because it’s not a direct child in the plan tree. - Thus, each subquery plan is processed exactly once by the rule, through this manual recursion. After that, when the subquery becomes a join input in the outer plan, it turns into a normal child in the plan tree and will be subject to other rules in subsequent optimizer passes (or even further applications of this rule if the subquery had its own subquery).
In summary, the DecorrelatePredicateSubquery
rule uses a mix of optimizer-driven recursion (top-down over plan nodes) and custom recursion (over subquery plans and expressions) to ensure no correlated subquery is left behind. This avoids duplicate work: the optimizer doesn’t traverse subqueries on its own, and the rule doesn’t redundantly traverse plan nodes that the optimizer is already handling. The approach is intricate but necessary for correctness. After this rule runs, subqueries in predicates have largely been converted into join operations, making the plan easier for DataFusion to execute and for other optimizer rules to further optimize (joins, filters, etc., can now be pushed down or reordered as needed).
Many other optimizer rules in DataFusion follow a more straightforward recursion pattern, thanks to the framework provided by the OptimizerRule
trait and the TreeNode utilities. In general, most rules let the optimizer coordinate the recursion rather than manually walking the plan tree themselves. They indicate their preferred traversal order (top-down or bottom-up) via apply_order
, and implement rewrite
to perform a local transformation on a given plan node. The optimizer will then apply that to every node in the plan as appropriate.
For example, a simple constant folding or expression simplification rule might be implemented as follows (as illustrated by a custom rule example in DataFusion’s repository):
impl OptimizerRule for MyOptimizerRule {
fn supports_rewrite(&self) -> bool { true }
fn apply_order(&self) -> Option<ApplyOrder> {
// Use bottom-up so children expressions are simplified before their parent
Some(ApplyOrder::BottomUp)
}
fn rewrite(&self, plan: LogicalPlan, _config: &dyn OptimizerConfig)
-> Result<Transformed<LogicalPlan>>
{
// Rewrite expressions in this plan node (but do not explicitly recurse into child plan nodes)
plan.map_expressions(|expr| self.simplify_expr(expr))
}
/* ... */
}
In this pattern, the rule declares apply_order
and trusts the optimizer to invoke rewrite
on every node (either bottom-up or top-down). The rule’s rewrite
method focuses only on the current node: it might call plan.map_expressions
(as above) to iterate through all expressions in that node and simplify them, or plan.map_children
to potentially swap or modify child plans. But it does not call self.rewrite
on the children explicitly. It doesn’t need to – the framework ensures that by the time a node is visited in bottom-up order, its children have already been optimized (each child was a node that had rewrite
called on it earlier in the traversal). This division of labor prevents double-processing. In the example above, we see a comment: “Ask the optimizer to handle the plan recursion. rewrite
will be called on each plan node.” (datafusion/datafusion-examples/examples/optimizer_rule.rs at main · apache/datafusion · GitHub). This is exactly how DataFusion’s optimizer is designed to work for rules using the rewrite API.
Concretely, the optimizer’s main loop will do something like: for each rule, traverse the plan and apply the rule’s rewrite logic to each node. If a rule is top-down, the optimizer applies it to a parent before children; if bottom-up, children before parent. The Transformed<LogicalPlan>
result tells the optimizer whether the rule made a change (yes
or no
), which is used to decide if another optimization pass might be needed.
No Double Traversal: Because of this contract, rules generally do not recursively call optimize
or rewrite
on their inputs – doing so would indeed duplicate work that the optimizer itself is performing. Instead, they perform a single-pass transformation at the node they’re given. For instance, the built-in rule PushDownFilter
(which moves filters down past projections and joins) is implemented to examine a Filter
node and, if possible, push it below an intervening operator. It will reconstruct a new plan structure (moving the filter) for that one subtree. It doesn’t descend into the tablescan or join on its own; after it returns, the optimizer will continue and eventually call the appropriate rules on those child nodes separately. Similarly, EliminateCrossJoin
will pattern-match on a Join
node and if it’s a cartesian product with a conjunctive filter, it may convert it to an inner join with a condition. It relies on the optimizer to have already handled the inputs of the join.
Example – Scalar Subquery Rule: A useful contrast to DecorrelatePredicateSubquery
is the ScalarSubqueryToJoin
rule. This rule handles scalar subqueries (subqueries that return a single value, used in contexts like WHERE col = (SELECT max(x) ...)
). ScalarSubqueryToJoin
also targets filter nodes and rewrites them into a left join plus post-join filter. However, this rule does not perform deep recursion into the subquery’s plan within its own code. Instead, it uses the standard approach: it checks if the filter predicate contains a scalar subquery and if so, does a one-step rewrite. Specifically, it will extract the subquery and turn it into a left outer join (because a scalar subquery is essentially like a single-row table) with the outer plan, adding any necessary equality conditions between outer columns and the subquery’s output (scalar_subquery_to_join.rs - source) (scalar_subquery_to_join.rs - source). The subquery plan is attached as a child of the join. At this point, the subquery has been lifted into the main plan as a child of a join – thus it’s no longer an Expr::InSubquery
but a normal LogicalPlan
subtree. The optimizer will then naturally continue optimizing that subplan (e.g., pushing down filters inside it, applying aggregations, etc.) in subsequent rule applications.
Notice that ScalarSubqueryToJoin
doesn’t call itself recursively on the subquery. If the subquery contained another subquery (a rare scenario for a scalar subquery, but hypothetically possible), that inner subquery would likely be handled by the DecorrelatePredicateSubquery
rule during the optimizer’s passes. In practice, DataFusion runs the predicate subquery decorrelation and scalar subquery rules in the same optimization stage, and they have to cooperate. The general strategy is to handle EXISTS/IN
via semi/anti join first (since those can introduce new filters that might then become scalar subqueries or vice versa), and handle scalar subqueries via left join. Coordination is done by ordering these rules appropriately in the optimizer’s rule list and by each rule robustly checking for only the pattern it can handle. For example, ScalarSubqueryToJoin
begins with a quick check if !contains_scalar_subquery(predicate) { return Transformed::no(...) }
(scalar_subquery_to_join.rs - source) – if the filter doesn’t have the pattern it cares about, it returns immediately (no recursive descent). This is an optimization to skip work and also a safety check to not accidentally transform things it shouldn’t.
Other Rules and Recursion: Outside of subquery handling, most rules either operate on the expressions within a single plan node or restructure the relationships between nodes, and they rely on the optimizer driver to manage the traversal order. Some examples:
- Expression simplifications (constant folding, algebraic simplifications) use bottom-up traversal. They often use
Expr::transform_up
orExpr::transform_down
internally, which is a recursion on the expression tree (not the plan tree). This is similar to howrewrite_inner_subqueries
works on an expression, but without affecting plan nodes. Because expression trees are generally shallow (a single predicate or projection expression), this doesn’t conflict with the plan recursion – it’s confined to one node’s expression. For instance, a rule might doexpr.transform_up(|sub_expr| { /* simplify sub_expr */ })
to simplify a complex condition into a simpler one. This happens entirely in the context of the current plan node. - Plan structure rules like
JoinReordering
orProjectionPushDown
often use top-down traversal. For instance, pushing a projection down requires looking at a node and deciding to move it below its child – you do this top-down so that you handle the highest projections first. These rules will call utilities likeplan.map_children(...)
or construct new plans manually, but again they do not recursively callrewrite
on those children; they simply rearrange them. After the rearrangement, the optimizer will later invoke other rules on the affected children as needed. - Common subexpression elimination scans the plan for repeated expressions. In older versions, it had to traverse the plan to find duplicates, but with the new framework it can leverage
LogicalPlan::visit
or similar to collect info, thenrewrite
to eliminate redundancy. It likely uses bottom-up order (so that by the time it processes a parent, it knows if children had identical subtrees).
The key point is that each rule either relies on the framework’s recursion or implements its own specialized recursion – but not both on the same aspect. If a rule is written to use the rewrite
API (which most are, as of recent DataFusion versions), it will not manually call optimize
on its inputs. Conversely, a few specialized rules that predate the apply_order
mechanism or handle unique constructs might internally traverse certain parts of the plan. The decorrelation rules are the prime examples of the latter, because subqueries required special handling.
Avoiding Conflicts: The design of DataFusion’s optimizer aims to avoid any situation where both the optimizer driver and an optimizer rule independently recurse over the same nodes, which could cause duplicate transformations or even infinite loops. By using the ApplyOrder
hints and the Transformed<LogicalPlan>
mechanism, the optimizer ensures that each rule sees a coherent view of the plan and that changes are propagated properly. As long as rules adhere to the pattern (not descending into children unless necessary), there is no double-processing. For example, if DecorrelatePredicateSubquery
did not use map_subqueries
, then the subquery plan might never be optimized until execution (bad). If it did use map_subqueries
and the optimizer also tried to recursively optimize subqueries as separate plans, the subquery might get optimized twice or the decorrelation might be attempted twice. Fortunately, DataFusion doesn’t do the latter – subqueries remain part of the outer plan’s expression until a rule like this one deals with them. After that, they become part of the main plan (as joins) and will not be seen as “subqueries” anymore.
One potential complexity is ensuring that once a subquery is decorrelated, subsequent applications of the same rule do not try to re-decorrelate it again. In DataFusion’s implementation, after a successful decorrelation, the subquery is literally gone from the filter expression, so the rule will find no subquery on a second run and will no-op. If a subquery was partially decorrelated (some conditions transformed, but one left intact as a subquery), the rule might run again in the next optimizer pass and attempt that remaining subquery. This iterative approach can be intentional – progressively simplifying a complex nested subquery scenario – but it must be controlled. The use of Transformed::yes
/no
helps here: if in an iteration nothing can be decorrelated, the rule should return Transformed::no
so that the optimizer knows a fixpoint is reached. The DataFusion optimizer will typically repeat all rules until a fixpoint (no rules make changes) or a max iteration count is hit. Thus, correct reporting is crucial. In our case, if DecorrelatePredicateSubquery
fails to decorrelate a subquery and leaves it untouched, ideally it would report Transformed::no
to avoid an endless loop. In the current code, it reports no
only when no subquery is found (decorrelate_predicate_subquery.rs - source); if a subquery is found but none can be transformed, it actually triggers an internal error as a sanity check (decorrelate_predicate_subquery.rs - source). In practice, that situation is rare, and improvements over time have extended support (for example, disjunctions and limits in subqueries were later supported via marker joins and other techniques).
Maintaining Correctness: Recursion decisions also affect correctness. A top-down rule like subquery decorrelation needs to run before certain other optimizations so that it has the outer query structure intact. If another rule modified the plan first (say pushed a filter down into the subquery or something), it could confuse the decorrelation logic or even violate its assumptions. That’s why the optimizer rule order matters. DataFusion’s optimizer pipeline runs a batch of rules in a defined sequence. Typically, subquery decorrelation runs early in the pipeline (right after the query is analyzed and validated), so that the rest of the optimizer can treat subqueries as joins. This minimizes potential conflict with, for example, predicate pushdown (which doesn’t know about subqueries). By the time PushDownFilter
runs, any EXISTS
or IN
should already be handled – either removed or turned into a normal filter on a join that PushDownFilter
can understand.
Performance Considerations: The approach of letting the rule handle subqueries recursively in one go can be efficient, but it does a lot of work in one rule invocation. The alternative would be to invoke the entire optimizer on each subquery separately (optimize the inner query as an independent plan, then somehow reintegrate it). That alternative could lead to repeated passes and more copying of plans. Instead, DecorrelatePredicateSubquery
uses in-place transformations (via the TreeNode APIs) to avoid unnecessary cloning. For instance, methods like map_subqueries
and transform_down
yield a Transformed
result that either reuses the existing node if no change, or produces a new node only when needed. This is meant to reduce overhead. Indeed, a recent refactoring in DataFusion was aimed at stopping the full cloning of LogicalPlans during optimization – rules now operate in a more incremental fashion (datafusion/datafusion-examples/examples/optimizer_rule.rs at main · apache/datafusion · GitHub) (datafusion/datafusion-examples/examples/optimizer_rule.rs at main · apache/datafusion · GitHub). This change improves performance and also reduces the likelihood of subtle bugs (since the plan structure is modified in a controlled way, rather than rebuilt from scratch each time).
There is a trade-off in complexity: the decorrelation rules are among the most complex optimizer rules in DataFusion. They must manage multiple inputs (outer and inner plans), maintain alias names for deduplicating columns, handle different types of subqueries (EXISTS
vs IN
vs scalar) and edge cases like NULL semantics for NOT IN
. This complexity can make the code harder to maintain. Comments in the source code and incremental enhancements (like adding support for disjunctions (Support decorrelate_subquery with disjunction · Issue #5368 · apache/datafusion · GitHub) or fixing case-sensitive identifiers) show that this area evolves as new corner cases are discovered. In contrast, many other rules are much simpler (often a few pattern matches and a plan reconstruction), precisely because the recursion and iteration logic is abstracted away by the framework.
Similar Patterns in Other Engines: It’s interesting to note that the approach DataFusion takes – turning subqueries into joins (sometimes called unnesting) – is common in query optimizers. The use of a mark join for handling EXISTS/NOT EXISTS
(where a boolean “marker” is used to indicate existence) is a known strategy (Support decorrelate_subquery with disjunction · Issue #5368 · apache/datafusion · GitHub). The recursion in decorrelation is inherently more complex than in, say, constant folding, because it’s essentially implementing a mini-optimizer for a subplan within an expression. DataFusion’s TreeNode framework provides just enough hooks (like apply_with_subqueries
or the map_subqueries
we saw) to make this feasible. Other rules don’t need those hooks – for example, a rule eliminating an unnecessary projection can just use plan.map_children()
or return a simplified LogicalPlan
without diving further.
Potential Issues and Resolutions: Coordination between rules is crucial. If two rules could apply to the same node, the optimizer needs to have a deterministic order or a mechanism to prevent them from fighting each other. In DataFusion, rules are applied in sequence, and usually each rule either does a distinct transformation or checks a condition to see if it should act. For instance, both DecorrelatePredicateSubquery
and ScalarSubqueryToJoin
operate on Filter
nodes with subqueries, but one targets EXISTS/IN
and the other targets scalar subqueries. They won’t step on each other’s toes because of how the predicates are classified. However, one rule’s action can enable another – e.g., decorrelating an IN
subquery might introduce a new expression that could be simplified by the expression simplifier later, or turning a subquery into a join might create an opportunity for the join reordering rule. DataFusion’s iterative optimizer will indeed run through the list of rules multiple times if needed to catch such opportunities. This is why having accurate Transformed::yes/no
signals and a robust fixpoint check matters: it allows multiple passes without infinite loops.
Finally, from a maintainability and design perspective, the recursion patterns in DataFusion’s optimizer have been evolving toward clarity: use the framework for typical cases, only break out custom recursion for special cases like subqueries. This separation makes most rules easier to reason about (they see one node and do one thing), and isolates complexity in the few rules that truly need it. The decorrelate_predicate_subquery
pass is an example of that complexity, handling recursion on two dimensions (plan and expression) in a single rule. The outcome of these patterns is a query optimizer that, while not as mature as those in full-fledged RDBMS, is capable of handling pretty complex SQL (including nested subqueries and CTEs) with reasonable efficiency and correctness. Each additional form of recursion (subqueries, recursive CTEs, etc.) tends to add one such specialized rule, but the majority of other optimizations remain orthogonal and consistently applied.
References:
- The implementation of
DecorrelatePredicateSubquery
showing recursive subquery handling and top-down application (decorrelate_predicate_subquery.rs - source) (decorrelate_predicate_subquery.rs - source) (decorrelate_predicate_subquery.rs - source) (decorrelate_predicate_subquery.rs - source) (decorrelate_predicate_subquery.rs - source). - Example of a custom optimizer rule using
apply_order
to delegate recursion to the framework (datafusion/datafusion-examples/examples/optimizer_rule.rs at main · apache/datafusion · GitHub). - The scalar subquery to join transformation for comparison (scalar_subquery_to_join.rs - source) (scalar_subquery_to_join.rs - source).
- Discussion of a bug and enhancement related to subquery decorrelation (limits in subqueries) (decorrelate_where_in doesn't support Sort/Limit in Subquery. · Issue #6263 · apache/datafusion · GitHub) which highlights the ongoing evolution of these rules.