Investigation: Flaky `test_search_delete_hdca_output` (Issue #21230)

Summary

func.array_agg(column, order_by=column) in SQLAlchemy silently drops the order_by keyword argument, generating array_agg(path_sig) instead of array_agg(path_sig ORDER BY path_sig). This means the HDCA signature arrays used for job search equivalence checking are unordered, causing non-deterministic comparison results on PostgreSQL.

Root Cause

In lib/galaxy/managers/jobs.py, the agg_expression method:

def agg_expression(self, column):
    if self.dialect_name == "sqlite":
        return func.group_concat(column)
    else:
        return func.array_agg(column, order_by=column)

The order_by=column kwarg is silently ignored by SQLAlchemy's func.array_agg. Verified:

>>> from sqlalchemy import func, String, column
>>> col = column('path_sig', String)
>>> expr = func.array_agg(col, order_by=col)
>>> print(expr.compile(compile_kwargs={'literal_binds': True}))
array_agg(path_sig)

No ORDER BY is generated. Both the reference and candidate signature arrays are returned in whatever scan order PostgreSQL happens to use, which varies depending on query plan, table statistics, and prior test state.

How We Found It

Step 1: Add debug logging to `__search` stages

Added logging after each stage of the job search pipeline to identify where candidates are lost:

Job search: 0 jobs match input datasets
Job search: 0 jobs remain after _filter_jobs
No equivalent jobs found (query returned 0 candidate rows)

Result: The SQL query itself returns 0 rows. The bug is in the HDCA signature matching (_build_stmt_for_hdca), not in parameter matching or output filtering.

Step 2: Add debug logging inside `_build_stmt_for_hdca`

Added logging after each CTE to trace the signature comparison:

HDCA search: v=75, collection_type=list, depth=0
HDCA search: reference dataset_ids=[251, 252, 253]
HDCA search: reference signature elements=['data0;251', 'data1;252', 'data2;253']
HDCA search: reference full signature=['data0;251', 'data1;252', 'data2;253']
HDCA search: candidate pre-filter HDCA ids=[75]
HDCA search: candidate signature elements=[(75, 'data2;253'), (75, 'data1;252'), (75, 'data0;251')]
HDCA search: candidate full signatures=[(75, ['data2;253', 'data1;252', 'data0;251'])]
HDCA search: equivalent HDCA ids=[]

Key finding: The reference and candidate are the same HDCA (id=75) with the same elements, but the aggregated arrays differ:

Reference: ['data0;251', 'data1;252', 'data2;253'] (happens to be alphabetical = insertion order)
Candidate: ['data2;253', 'data1;252', 'data0;251'] (reverse order — different scan path)

The == comparison fails because the arrays have different element ordering.

Step 3: Verify SQLAlchemy SQL generation

>>> from sqlalchemy import func, String, column
>>> col = column('path_sig', String)
>>> print(func.array_agg(col, order_by=col).compile(compile_kwargs={'literal_binds': True}))
array_agg(path_sig)  # NO ORDER BY!

The correct approach uses aggregate_order_by from the PostgreSQL dialect:

>>> from sqlalchemy.dialects.postgresql import aggregate_order_by
>>> print(func.array_agg(aggregate_order_by(col, col.asc())).compile(
...     dialect=postgresql.dialect(), compile_kwargs={'literal_binds': True}))
array_agg(path_sig ORDER BY path_sig ASC)  # Correct!

Fix

In lib/galaxy/managers/jobs.py, change agg_expression:

from sqlalchemy.dialects.postgresql import aggregate_order_by

def agg_expression(self, column):
    if self.dialect_name == "sqlite":
        return func.group_concat(column)
    else:
        return func.array_agg(aggregate_order_by(column, column.asc()))

Why It's Flaky (Not Always Failing)

Without ORDER BY, array_agg returns elements in whatever order PostgreSQL's query executor processes them. This depends on:

The query plan chosen (affected by table statistics, which change as tests run)
The physical order of rows (affected by prior inserts/deletes)
Join ordering decisions by the optimizer

When the reference and candidate CTEs happen to use the same scan order, the arrays match and the test passes. When they use different scan orders (different join paths, different CTE structures), the arrays differ and the test fails.

CI Run With Debug Output

Failed run with full diagnostic output: https://github.com/mvdbeek/galaxy/actions/runs/23689091785

The debug log lines showing the signature mismatch are in the "Run tests" step output for the Test (3.10, 1) job.

mvdbeek/issue_21230_investigation.md

Select an option

No results found

Select an option

No results found

Investigation: Flaky `test_search_delete_hdca_output` (Issue #21230)

Summary

Root Cause

How We Found It

Step 1: Add debug logging to `__search` stages

Step 2: Add debug logging inside `_build_stmt_for_hdca`

Step 3: Verify SQLAlchemy SQL generation

Fix

Why It's Flaky (Not Always Failing)

CI Run With Debug Output

mvdbeek/issue_21230_investigation.md

Investigation: Flaky test_search_delete_hdca_output (Issue #21230)

Summary

Root Cause

How We Found It

Step 1: Add debug logging to __search stages

Step 2: Add debug logging inside _build_stmt_for_hdca

Step 3: Verify SQLAlchemy SQL generation

Fix

Why It's Flaky (Not Always Failing)

CI Run With Debug Output

Investigation: Flaky `test_search_delete_hdca_output` (Issue #21230)

Step 1: Add debug logging to `__search` stages

Step 2: Add debug logging inside `_build_stmt_for_hdca`