We ran Memoria against LongMemEval with the aligned official datasets and saw a clear split:
- LongMemEval oracle is strong
- LongMemEval_S degrades sharply in long-session settings
The current failure mode is not a one-off benchmark bug. It appears to be a structural retrieval issue in long histories:
- graph retrieval never contributes
- fallback hybrid retrieval operates on whole-session memory blobs
- broad semantic matches outrank the actual evidence sessions
- exact lexical cases can still pass, but long-session fact lookup is generally weak
Preserved report:
benchmarks/results/longmemeval-oracle-20260321b/full.report.json
Result:
- dataset: longmemeval-oracle-rustbench
- scenarios: 470
- overall: 96.35 (S)
Category breakdown:
- Temporal Reasoning: 91.55 over 127
- Knowledge Update: 94.98 over 72
- Multi-Session: 98.18 over 121
- Single-Session Assistant: 98.91 over 56
- Single-Session Preference: 100.0 over 30
- Single-Session User: 100.0 over 64
We started a full LongMemEval_S run on the aligned 470-scenario dataset, but stopped it because throughput was too slow for iteration:
- rough throughput observed: ~75-90s / scenario
- rough ETA: ~10-12h for the full run
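The ETA is straightforward arithmetic over the observed per-scenario cost; a quick sanity check (the helper is illustrative, not part of the harness):

```rust
/// Hours needed for a run at a given per-scenario cost.
fn eta_hours(seconds_per_scenario: f64, scenarios: f64) -> f64 {
    seconds_per_scenario * scenarios / 3600.0
}
// eta_hours(75.0, 470.0) ~= 9.8 h, eta_hours(90.0, 470.0) ~= 11.8 h,
// which matches the ~10-12h estimate above.
```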
To get faster signal, we generated a deterministic stratified subset:
- dataset: benchmarks/datasets/longmemeval_s_sample_60.json
- dataset_id: longmemeval-s-rustbench-sample-60
- scenarios: 60
Category mix:
- multi-session: 16
- temporal-reasoning: 16
- knowledge-update: 9
- single-session-user: 8
- single-session-assistant: 7
- single-session-preference: 4
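A deterministic stratified subset like this can be produced without an RNG by sorting scenario ids within each category and taking a fixed quota per category; a minimal sketch (function, types, and ids are illustrative, not the actual sampler):

```rust
use std::collections::BTreeMap;

// Deterministic stratified subset: sort ids within each category, then take
// a fixed per-category quota. Sorting makes the sample reproducible without
// any RNG. (Illustrative sketch; the real sampler may differ.)
fn stratified_sample(
    scenarios: &[(&str, &str)],     // (scenario_id, category)
    quotas: &BTreeMap<&str, usize>, // category -> how many to keep
) -> Vec<String> {
    let mut by_cat: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for &(id, cat) in scenarios {
        by_cat.entry(cat).or_default().push(id);
    }
    let mut out = Vec::new();
    for (cat, quota) in quotas {
        if let Some(ids) = by_cat.get_mut(cat) {
            ids.sort(); // stable order -> same subset on every run
            out.extend(ids.iter().take(*quota).map(|s| s.to_string()));
        }
    }
    out
}
```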
This sample run was also stopped early, at about 10 / 60 scenarios, but the live output was already strongly negative:
Observed scenario outputs:
- 06db6396 → 0.0 (D)
- 0a34ad58 → 39.2 (D)
- 0bb5a684 → 60.8 (C)
- 0db4c65d → 100.0 (S)
- 0f05491a → 39.2 (D)
- 1c549ce4 → 39.2 (D)
- 1d4e3b97 → 39.2 (D)
- 35a27287 → 39.2 (D)
- 37f165cf → 39.2 (D)
- 3f1e9474 → 39.2 (D)
This is enough to conclude LongMemEval_S is failing systematically early, not just on one corner case.
The retrieval service tries graph first, then hybrid/vector, then fulltext.
However, in the stopped LongMemEval_S sample DB snapshot:
- memory_graph_nodes = 0
- memory_graph_edges = 0
- mem_entities = 24124
- mem_memories = 470
So entity extraction is happening, but graph memory nodes are not populated, which means graph retrieval never contributes.
Observed from retrieval explain:
- graph_hit = false
- graph_candidates = 0
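With zero graph candidates, every query falls through to hybrid or fulltext. A minimal sketch of that fallback order (all names here are hypothetical stand-ins for Memoria's internals, not its actual API):

```rust
// Illustrative fallback chain: try graph, then hybrid/vector, then fulltext.
#[derive(Debug, PartialEq)]
enum Path { Graph, Hybrid, Fulltext }

struct Explain { path: Path, graph_hit: bool, vector_hit: bool }

fn retrieve(graph_candidates: usize, vector_candidates: usize) -> Explain {
    if graph_candidates > 0 {
        Explain { path: Path::Graph, graph_hit: true, vector_hit: false }
    } else if vector_candidates > 0 {
        // With zero graph nodes in the DB, every query lands here or below,
        // matching the graph_hit=false seen in the explain traces.
        Explain { path: Path::Hybrid, graph_hit: false, vector_hit: true }
    } else {
        Explain { path: Path::Fulltext, graph_hit: false, vector_hit: false }
    }
}
```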
Each seeded memory is a full session transcript, and LongMemEval_S scenarios often contain 42-52 session memories each.
That means retrieval is embedding and ranking coarse session-sized blobs rather than narrower facts/events.
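One way to narrow the granularity is to derive per-turn units from each transcript before embedding, so ranking operates on facts rather than whole sessions; a minimal sketch, assuming turns are separated by blank lines (the splitter and id scheme are illustrative):

```rust
// Hypothetical sketch: instead of one embedded blob per session, emit
// narrower (unit_id, text) chunks, one per turn-sized block, each tagged
// with its parent session so evidence can be traced back.
fn split_session(session_id: &str, transcript: &str) -> Vec<(String, String)> {
    transcript
        .split("\n\n") // assume blank lines separate turns
        .filter(|t| !t.trim().is_empty())
        .enumerate()
        .map(|(i, t)| (format!("{session_id}#{i}"), t.trim().to_string()))
        .collect()
}
```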
In failed cases, top results were often thematically related but clearly wrong.
Example: 37f165cf
- Query: What was the page count of the two novels I finished in January and March?
- Expected: 2 book-related evidence sessions
- Returned top results:
- meal planning / pasta
- immigration / Canada resettlement
- Chicago travel
Explain:
- path: hybrid
- graph_hit = false
- vector_hit = true
Example: 0a34ad58
- Query: I’m a bit anxious about getting around Tokyo. Do you have any helpful tips?
- Expected: Tokyo transit sessions tied to Suica usage
- Returned top results:
- showerhead repair
- salad recipe
- one generic Tokyo restaurant session
Again:
- path: hybrid
- graph_hit = false
Example: 0db4c65d
- Query mentions rare titles: "The Seven Husbands of Evelyn Hugo", "The Silent Patient"
- This case succeeded with:
- path: fulltext
- graph_hit = false
- vector_hit = false
So some long-session cases still pass, but mainly when exact lexical matching is strong enough to compensate.
In the failed hybrid explain traces:
temporal_score = 0.0
So temporal and multi-event questions in long sessions get no help from a temporal-aware retrieval layer; they succeed or fail purely on coarse vector/fulltext retrieval over long session blobs.
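If a temporal layer did participate, it could take the form of a blended ranking term that rewards candidates whose timestamp falls inside a window extracted from the query ("in January", "last March"); a hypothetical sketch (weights and names are assumptions, and in the current runs this term contributes 0.0):

```rust
// Hypothetical temporal-aware ranking term. Blend semantic similarity with
// how well a candidate's timestamp matches the query's time window.
fn blended_score(similarity: f32, candidate_ts: i64, window: Option<(i64, i64)>) -> f32 {
    let temporal = match window {
        Some((start, end)) if (start..=end).contains(&candidate_ts) => 1.0,
        Some(_) => 0.0, // outside the asked-about window
        None => 0.5,    // query carries no temporal constraint
    };
    // Weights are illustrative and would need tuning.
    0.7 * similarity + 0.3 * temporal
}
```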
Memoria currently stores and retrieves long histories at too coarse a granularity for LongMemEval_S.
What seems to be happening:
- capture stores session-sized memories
- graph path is not materially available in these runs
- hybrid retrieval ranks semantically broad sessions
- the benchmark needs precise evidence sessions, often for a specific fact inside a long conversation
This is why:
- oracle does well
- S fails early and broadly
oracle reduces the search burden to evidence sessions only.
S requires retrieval to survive long-history dilution, and the current design does not.
Next steps:
- Add a benchmark-friendly long-history capture/indexing layer
- do not only store full session blobs
- also store narrower fact/event units derived from each session
- Make graph retrieval real for this path
- entity extraction alone is not enough
- ensure graph node materialization/backfill participates during long-session benchmark runs
- Improve retrieval granularity before tuning ranking
- the main issue looks like coarse candidate generation, not just bad score weights
- Add a repeatable LongMemEval_S sample benchmark workflow
- keep deterministic subset sampling for fast iteration
- use full S only for milestone validation
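On the graph point above: since entities are already extracted (24124 rows) but graph nodes and edges are zero, one remediation is a backfill that materializes a node per distinct entity and co-occurrence edges per memory, so the graph path has candidates at query time. A sketch under that assumption (types and names are illustrative, not Memoria's schema):

```rust
use std::collections::HashMap;

// Hypothetical backfill: derive graph nodes and co-occurrence edges from
// already-extracted (memory_id, entity_name) rows.
fn backfill(entities: &[(u64, &str)]) -> (Vec<String>, Vec<(String, String)>) {
    let mut by_memory: HashMap<u64, Vec<&str>> = HashMap::new();
    for &(mem, name) in entities {
        by_memory.entry(mem).or_default().push(name);
    }
    // One node per distinct entity name.
    let mut nodes: Vec<String> = entities.iter().map(|&(_, n)| n.to_string()).collect();
    nodes.sort();
    nodes.dedup();
    // One edge per entity pair co-occurring in the same memory.
    let mut edges = Vec::new();
    for names in by_memory.values() {
        for i in 0..names.len() {
            for j in (i + 1)..names.len() {
                edges.push((names[i].to_string(), names[j].to_string()));
            }
        }
    }
    (nodes, edges)
}
```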
Aligned datasets:
- benchmarks/datasets/longmemeval_oracle.rustbench.json
- benchmarks/datasets/longmemeval_s.rustbench.json
- benchmarks/datasets/longmemeval_m.rustbench.json
Deterministic sample:
benchmarks/datasets/longmemeval_s_sample_60.json
Preserved oracle report:
benchmarks/results/longmemeval-oracle-20260321b/full.report.json