The Error
[SYCL-GRAPH] Exception when updating graph, Cannot update using a graph with a different topology. Mismatch found in the number of nodes.
Root Cause
SYCL command graphs can only be updated (via update()) if the new graph has exactly the same topology as the cached one: the same number of nodes, the same kernel types, and the same execution order. Parameter values (buffer pointers, scalar arguments) can differ, but the structure cannot.
The original implementation cached a single executable graph in sycl_ctx->exec_graph. When test-backend-ops ran tests with varying tensor dimensions (n=1, 2, 3, 4...), each test produced a graph with a different node count. The code:
- Tried to update() the cached graph with the new topology, which failed with an exception
- Caught the exception, then re-recorded and finalized a new graph
- Repeated this cycle for every mismatched topology
After ~9-10 graph recreations, the GPU crashed with UR_RESULT_ERROR_DEVICE_LOST due to driver resource exhaustion.
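The cycle described above can be illustrated with a toy model that uses no SYCL at all. This is only a sketch: `SingleSlotCache`, `run`, and `count_finalizations` are hypothetical names, and a topology is reduced to a plain integer id. It counts how often a single-slot cache would have to re-record and finalize a graph for a given sequence of topologies.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model (hypothetical, not the backend's code): one cached graph,
// identified only by an integer topology id.
struct SingleSlotCache {
    bool     has_graph      = false;
    uint64_t topology       = 0;  // topology id of the cached graph
    int      runs           = 0;  // number of graph executions requested
    int      finalize_count = 0;  // number of record + finalize cycles

    void run(uint64_t requested) {
        ++runs;
        if (has_graph && topology == requested) {
            return;  // same topology: update() would succeed
        }
        // Different topology: update() would throw, so the graph is
        // re-recorded and finalized, replacing the only cached entry.
        has_graph = true;
        topology  = requested;
        ++finalize_count;
        assert(finalize_count <= runs);  // at most one finalize per run
    }
};

// Feeds a sequence of topology ids through the single-slot cache and
// returns how many times a graph had to be finalized.
static int count_finalizations(const std::vector<uint64_t> & sequence) {
    SingleSlotCache cache;
    for (uint64_t topology : sequence) {
        cache.run(topology);
    }
    return cache.finalize_count;
}
```

In this model, any sequence in which consecutive topologies differ finalizes a new graph on every run, even when the same topologies come back later, which matches the repeated recreation described above.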
Fix: Hash-Based Multi-Entry Graph Cache
- Topology Hash Function (backend.cpp):
    static uint64_t compute_cgraph_hash(const ggml_cgraph * cgraph) {
        uint64_t hash = 0xcbf29ce484222325ULL; // FNV-1a
        // Hash: n_nodes, each node's op, type, dimensions, source tensor info
        ...
    }
- Cache Structure (common.hpp):
    std::map<uint64_t, std::unique_ptr<executable_graph>> graph_cache;
    static constexpr size_t MAX_GRAPH_CACHE_SIZE = 8;
- Lookup Logic:
  - Cache hit: Reuse the graph and call update(), which is expected to succeed because a matching hash indicates a matching topology (barring hash collisions)
  - Cache miss: Record a new graph, finalize it, and store it in the cache
  - Eviction: FIFO when the cache exceeds 8 entries
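The three pieces above can be sketched together in standard C++ without any SYCL or ggml dependency. This is a minimal illustration under stated assumptions, not the backend's actual code: `NodeDesc`, `ExecGraph`, `GraphCache`, `fnv1a_mix`, and `topology_hash` are hypothetical stand-ins, and the `std::deque` used to track insertion order is an assumption, since a `std::map` keyed by hash does not record insertion order on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <map>
#include <memory>
#include <vector>

// Hypothetical stand-in for a graph node. Only structural fields are hashed.
struct NodeDesc {
    int     op;     // operation id
    int     type;   // element type id
    int64_t ne[4];  // dimensions
    void *  data;   // buffer pointer: a parameter, deliberately not hashed
};

// One FNV-1a round per byte of a 64-bit value.
static uint64_t fnv1a_mix(uint64_t hash, uint64_t value) {
    for (int i = 0; i < 8; ++i) {
        hash ^= (value >> (8 * i)) & 0xffu;
        hash *= 0x100000001b3ULL;  // 64-bit FNV prime
    }
    return hash;
}

// Hashes node count plus each node's op, type, and dimensions.
static uint64_t topology_hash(const std::vector<NodeDesc> & nodes) {
    uint64_t hash = 0xcbf29ce484222325ULL;  // FNV-1a offset basis
    hash = fnv1a_mix(hash, nodes.size());
    for (const NodeDesc & n : nodes) {
        hash = fnv1a_mix(hash, static_cast<uint64_t>(n.op));
        hash = fnv1a_mix(hash, static_cast<uint64_t>(n.type));
        for (int64_t d : n.ne) {
            hash = fnv1a_mix(hash, static_cast<uint64_t>(d));
        }
    }
    return hash;
}

// Hypothetical stand-in for a finalized executable graph.
struct ExecGraph {
    int updates = 0;  // how many times this entry was reused
};

class GraphCache {
public:
    explicit GraphCache(size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Hit: reuse the entry (the real code would call update() here).
    // Miss: evict the oldest entry if full, then store a new graph.
    ExecGraph & get_or_record(uint64_t key, bool & hit) {
        auto it = entries_.find(key);
        hit = (it != entries_.end());
        if (hit) {
            it->second->updates++;
            return *it->second;
        }
        if (entries_.size() >= capacity_) {
            entries_.erase(order_.front());  // FIFO: drop oldest insertion
            order_.pop_front();
        }
        order_.push_back(key);
        auto inserted = entries_.emplace(key, std::make_unique<ExecGraph>());
        return *inserted.first->second;
    }

    size_t size() const { return entries_.size(); }
    bool contains(uint64_t key) const { return entries_.count(key) != 0; }

private:
    size_t capacity_;
    std::map<uint64_t, std::unique_ptr<ExecGraph>> entries_;
    std::deque<uint64_t> order_;  // keys in insertion order (assumption)
};
```

A caller would compute `topology_hash` for the incoming graph and pass it to `get_or_record`. Because buffer pointers are left out of the hash, two graphs that differ only in parameters map to the same entry, while a change in node count or dimensions produces a different key.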
Additional Fixes
- Disabled graphs for F16×F32 and BF16×F32 GEMM (tiled GEMM had incorrect type casts causing NaN)
- Only F32×F32 tiled GEMM is graph-compatible; others fall back to oneMKL