My take: PR #163 adds something useful, but I would not merge the default retune as-is.
The valuable part is the config surface. Exposing CONSOLIDATION_CLUSTER_SIMILARITY_THRESHOLD and CONSOLIDATION_MIN_CLUSTER_SIZE is clearly right because embedding geometry varies by provider/model. That part removes the need to fork/subclass just to calibrate clustering. The PR does exactly that across config.py, runtime_helpers.py, runtime_bindings.py, and app.py in the files changed.
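For reference, the kind of config surface I mean looks roughly like the sketch below. This is not the PR's code: the function name `load_cluster_settings` and the validation rules are my assumptions; only the two env var names and the current hardcoded values (0.75 and 3) come from the PR and our source check.

```python
import os

# Fallbacks mirror the values currently hardcoded in consolidation.py.
DEFAULT_SIMILARITY_THRESHOLD = 0.75
DEFAULT_MIN_CLUSTER_SIZE = 3


def load_cluster_settings(env=os.environ):
    """Hypothetical helper: read clustering knobs from env with safe fallbacks."""
    threshold = float(
        env.get("CONSOLIDATION_CLUSTER_SIMILARITY_THRESHOLD", DEFAULT_SIMILARITY_THRESHOLD)
    )
    min_size = int(env.get("CONSOLIDATION_MIN_CLUSTER_SIZE", DEFAULT_MIN_CLUSTER_SIZE))

    # Assumed sanity checks; the PR may validate differently or not at all.
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"similarity threshold out of range: {threshold}")
    if min_size < 2:
        raise ValueError(f"min cluster size must be at least 2: {min_size}")
    return threshold, min_size
```

With no env vars set this returns the existing defaults, so merging the plumbing alone would not change behavior.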
The default changes are where I’d be more conservative:
- 0.75 -> 0.65: our local diagnostics do not show 0.75 as dead. On 8001, 0.75 still returned 5,878 sampled top-k neighbor hits; 0.65 returned 7,764. On 8011, 0.75 returned 4,112; 0.65 returned 6,617. So lowering to 0.65 broadens recall materially, but we have not proven those extra edges are good clusters.
- min_cluster_size 3 -> 2: good as an env var, questionable as a default. Pair clusters are useful for a future supersession pass, but current cluster consolidation only creates meta memories for size>=5, so this mostly expands candidate bookkeeping unless other code starts consuming pairs.
- 30d -> 7d: risky with the current exact clustering implementation. The algorithm is still pairwise O(n²). Running that weekly on 62.7k memories is a serious operational cost unless clustering is first rewritten around ANN/top-k candidate generation.
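
To put the O(n²) concern in numbers, here is a back-of-the-envelope count of exact pairwise comparisons per clustering run. It assumes every unordered pair is compared once (n(n-1)/2), which is my reading of "pairwise", not something I verified against the implementation. The memory counts are the two local instances plus the 62.7k figure above.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# 7,618 and 10,750 are our local instances; 62,700 is the larger deployment size.
for n in (7_618, 10_750, 62_700):
    print(f"{n:>6} memories -> {pairwise_comparisons(n):,} pair comparisons per run")
```

At 62.7k memories that is roughly 1.97 billion similarity computations per run, and moving from a 30-day to a 7-day interval runs it about four times as often.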
So my preferred version of PR #163:
- Merge the env vars.
- Keep existing defaults for now: 0.75, 3, 2592000.
- Document suggested experimental settings: 0.65, 2, 604800.
- Add a follow-up diagnostic/benchmark that reports cluster count, component-size histogram, sample quality, and estimated meta-memory writes at 0.65/0.70/0.75.
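
A minimal sketch of the component-size part of that follow-up diagnostic, assuming clusters are connected components over edges at or above the threshold. That assumption, the function name, and the `(i, j, similarity)` edge format are all mine; the real clustering in consolidation.py may group differently.

```python
from collections import Counter


def component_size_histogram(num_nodes, edges, threshold, min_cluster_size):
    """Return {component_size: count} for components with >= min_cluster_size nodes.

    edges is an iterable of (i, j, similarity) with node ids in range(num_nodes).
    """
    parent = list(range(num_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, similarity in edges:
        if similarity >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b

    sizes = Counter(find(node) for node in range(num_nodes))
    return Counter(size for size in sizes.values() if size >= min_cluster_size)


# Toy data only, to show the shape of the report.
toy_edges = [(0, 1, 0.90), (1, 2, 0.72), (3, 4, 0.66)]
for threshold in (0.65, 0.70, 0.75):
    print(threshold, dict(component_size_histogram(6, toy_edges, threshold, 2)))
```

Running the same function with `min_cluster_size` set to 3 and then 2 over real edges would show how many of the extra clusters are pairs, which is the number I would want before changing that default.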
PR #163 also does not address the bigger graph-quality issues we measured: legacy PARALLEL_CONTEXT zero similarity, sparse INVALIDATED_BY/PREFERS_OVER, the bad CONTRASTS_WITH heuristic, startup readiness, or O(n²) clustering. The PR description itself calls out the CONTRASTS_WITH issue as out of scope.
Net: merge the plumbing, don’t bless the new defaults yet. Our local report is the evidence base:
Raw artifacts: data/sweep_runs/20260501-173242-graph-diagnostics
| Label | Endpoint | Status | Memories | Vectors | Sync | Dimensions |
|---|---|---|---|---|---|---|
| full | http://localhost:8001 | healthy | 10750 | 10750 | synced | 1024 |
| cleaned | http://localhost:8011 | healthy | 7618 | 7618 | synced | 1024 |
| Label | Nodes | Edges | PRECEDED_BY | System edges | Authorable edges | Memory type | INVALIDATED_BY | PREFERS_OVER | Legacy discovered | Risks |
|---|---|---|---|---|---|---|---|---|---|---|
| full | 10750 | 116558 | 37.2% | 92.1% | 7.9% | 54.9% | 23 | 4 | 99.0% | high generic Memory type share, system-generated edges dominate the graph, sparse authorable edges, INVALIDATED_BY/PREFERS_OVER barely fire, legacy discovered relation types dominate discovered edges, legacy PARALLEL_CONTEXT similarities are all zero |
| cleaned | 7618 | 74281 | 35.0% | 90.5% | 9.5% | 55.3% | 23 | 4 | 99.4% | high generic Memory type share, system-generated edges dominate the graph, sparse authorable edges, INVALIDATED_BY/PREFERS_OVER barely fire, legacy discovered relation types dominate discovered edges, legacy PARALLEL_CONTEXT similarities are all zero |
| Claim | Local status | Evidence |
|---|---|---|
| PRECEDED_BY dominates at ~87% | differs locally | Full local graph PRECEDED_BY share is 37.2%. |
| INVALIDATED_BY/PREFERS_OVER barely fire | confirmed locally | Full local graph has INVALIDATED_BY=23, PREFERS_OVER=4. |
| parallel_context similarity=0.0 | mixed | full: legacy zero=100.0%; full: DISCOVERED nonzero=283; cleaned: legacy zero=100.0%; cleaned: DISCOVERED nonzero=124 |
| clustering defaults need source verification | checked | hardcoded_similarity_threshold, hardcoded_min_cluster_size, cluster_interval_default_30d, eager_scheduler_tick |

full (http://localhost:8001): 0.75 still returns 5878 sampled top-k neighbor hits; 0.65 returns 7764 (32.1% more). Top-1 median is 0.9754.

| Threshold | Sampled top-k neighbor hits |
|---|---|
| 0.55 | 9388 |
| 0.6 | 8843 |
| 0.65 | 7764 |
| 0.7 | 6818 |
| 0.75 | 5878 |
| 0.8 | 4963 |
| 0.85 | 3901 |

cleaned (http://localhost:8011): 0.75 still returns 4112 sampled top-k neighbor hits; 0.65 returns 6617 (60.9% more). Top-1 median is 0.8551.

| Threshold | Sampled top-k neighbor hits |
|---|---|
| 0.55 | 9104 |
| 0.6 | 8016 |
| 0.65 | 6617 |
| 0.7 | 5140 |
| 0.75 | 4112 |
| 0.8 | 3247 |
| 0.85 | 2458 |
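
The "% more" figures can be reproduced from the two sweep tables with a few lines. The numbers below are copied from the tables; the dict layout and helper name are just for illustration.

```python
# Sampled top-k neighbor hits per threshold, copied from the sweep tables.
SWEEPS = {
    "full": {0.55: 9388, 0.60: 8843, 0.65: 7764, 0.70: 6818, 0.75: 5878, 0.80: 4963, 0.85: 3901},
    "cleaned": {0.55: 9104, 0.60: 8016, 0.65: 6617, 0.70: 5140, 0.75: 4112, 0.80: 3247, 0.85: 2458},
}


def gain_over_baseline(hits, candidate, baseline=0.75):
    """Percent increase in hits when moving from baseline to candidate threshold."""
    return (hits[candidate] - hits[baseline]) / hits[baseline] * 100


for label, hits in SWEEPS.items():
    print(f"{label}: 0.65 vs 0.75 -> {gain_over_baseline(hits, 0.65):.1f}% more hits")
```

This only measures how many more neighbor hits a lower threshold admits. It says nothing about whether those extra hits form good clusters, which is the open question.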
| Hypothesis | Status | Evidence |
|---|---|---|
| hardcoded_similarity_threshold | confirmed | consolidation.py:157 self.similarity_threshold = 0.75 |
| hardcoded_min_cluster_size | confirmed | consolidation.py:156 self.min_cluster_size = 3 |
| cluster_interval_default_30d | confirmed | config.py:38 os.getenv("CONSOLIDATION_CLUSTER_INTERVAL_SECONDS", str(2592000)) |
| eager_scheduler_tick | confirmed | runtime_scheduler.py:100 run_consolidation_tick_fn() |
The diagnostics do not mutate AutoMem. They treat runtime fixes as follow-up PRs: readiness probe, configurable clustering, legacy edge normalization, and any supersession-discovery pass should remain separate from this measurement harness.