1/18/2026, 10:16:42 AM
I want to reframe the medical literature project a bit, allowing it to be generalized to other domains of knowledge. We are still building a graph, and a graph still consists of nodes (entities) and edges (relationships). We still have a collection of entities from previous ingestion processes. We add a new thing: entities may be "canonical", meaning they have been assigned canonical IDs (UMLS numbers or whatever), or they may be "provisional", meaning we don't know yet whether they should be assigned canonical IDs; for instance, an entity might be a mention of some trivial thing in just one paper.
Given a batch of papers to ingest, we proceed in two passes. In the first pass, we extract entities and assign canonical IDs where they make sense. In the second pass, we identify the edges (for medical, these edges are of three types: extraction, claims, and evidence). The first pass produces a JSON serialization of the collection of entities, including canonical IDs and synonyms. The second pass produces one JSON file per paper, including the paper's edges and any provisional entities unique to that paper.
This framework allows each knowledge domain (legal documents, academic CS papers) to define its own source of canonical IDs, its own schema, and its own list of edge types. Any interesting query optimizations (graph theory tricks, database quirks) can be shared across domains. Where possible, cross-domain canonical IDs are preferred (such as DBPedia URIs). Including a significant chunk of DBPedia is probably a very good idea, or at least the ability to pull in DBPedia entities as ingestion progresses.
Back in the 1980s, Doug Lenat tried to do something like this with the Cyc project, but there were no LLMs then, only grad students and postdocs, so that's what he used. They never got enough scale for the project to gain much traction. Perhaps now his idea is more feasible.
What to do about entity merging? If two canonical entities turn out to be the same thing, that's a messier operation than promoting a provisional entity. Maybe track usage counts to decide when to promote, and maintain confidence scores on canonical IDs to decide when to merge. Each entity has a semantic vector, and hopefully potential merges can be flagged by near matches in cosine distance.
1/18/2026, 10:17:20 AM
Claude's thought process
Synthesized scalable knowledge graph architecture across domains.
This is a really interesting evolution of the medical literature project. Will wants to generalize the architecture to work across different knowledge domains. Let me break down what he's proposing:
- Core architecture remains graph-based: Nodes are entities, edges are relationships
- Key innovation - entity status:
- Canonical entities: Have domain-specific IDs (UMLS for medical, DBPedia URIs cross-domain, etc.)
- Provisional entities: Mentions that haven't yet earned canonical status
- Two-pass ingestion:
- Pass 1: Extract entities, assign canonical IDs where appropriate
- Pass 2: Extract relationships/edges, create paper-specific files with edges and provisional entities
- Domain flexibility: Each domain defines:
- Its canonical ID source
- Its schema
- Its edge types
- Shared infrastructure: Query optimizations, graph algorithms can be reused
- Entity lifecycle management:
- Track usage counts to decide when to promote provisional → canonical
- Track confidence scores on canonical assignments
- Use semantic vectors for merge detection
- Merging canonical entities is "messier" than promotion
The Cyc reference is apt - this is trying to build comprehensive knowledge graphs but with LLMs doing the heavy lifting instead of human experts.
This is a much more sophisticated and scalable architecture than the original medical-only version. Key design challenges:
- Entity resolution: When does a provisional entity become canonical?
- Merge detection: How to identify duplicates among canonical entities?
- Confidence management: How to track and update confidence in canonical assignments?
- Cross-domain linking: When to use domain-specific vs universal IDs?
Will's background makes this interesting - he has deep experience with both the technical infrastructure (AWS, containerization, CI/CD) and semantic technologies. He's thinking about how to make this practical and maintainable.
I should engage with this architectural vision, help him think through:
- How entities transition between states and what triggers those transitions
- The JSON structure for both extraction passes and how to handle provisional vs canonical entities
- Strategies for detecting and resolving duplicate canonical entities using semantic similarity
- Scaling considerations and operational complexity as the graph grows
- Whether this fits into his current work or exists as a separate exploration
Given his technical depth, I can discuss implementation details and tradeoffs directly.
This is a really solid architectural evolution. You're essentially building a federated knowledge graph system where domains can plug in their own ontologies and extraction logic while sharing the underlying graph infrastructure and entity resolution machinery.
The two-pass approach is smart - it separates the "what exists" problem from the "how are things related" problem, and lets you defer committing to provisional entities until you see if they gain traction across papers.
A few thoughts on the entity lifecycle:
Promotion criteria could be fairly mechanical (a sketch follows this list):
- Usage count threshold (appears in N papers, or referenced by M canonical entities)
- Semantic clustering (multiple provisional entities converge semantically)
- Explicit domain rules (e.g., any chemical compound mentioned more than once gets promoted)
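A minimal sketch of how mechanical this could be; the Pydantic model and the thresholds are illustrative assumptions, not an existing kgraph API:

```python
from pydantic import BaseModel, ConfigDict

class ProvisionalStats(BaseModel):
    model_config = ConfigDict(frozen=True)
    paper_count: int           # distinct papers mentioning the entity
    canonical_references: int  # canonical entities with edges to it

def should_promote(stats: ProvisionalStats,
                   paper_threshold: int = 3,
                   reference_threshold: int = 2) -> bool:
    """Promote a provisional entity once it shows real traction."""
    return (stats.paper_count >= paper_threshold
            or stats.canonical_references >= reference_threshold)
```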
For merge detection, the semantic vector approach is good, but you probably want multiple signals (combined in the sketch after this list):
- Cosine similarity of entity embeddings
- Synonym overlap (if both have associated aliases)
- Co-occurrence patterns (entities that appear together in similar contexts)
- Domain-specific rules (e.g., UMLS already has a "broader/narrower" hierarchy)
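A sketch of blending the first two signals into a single merge score; the weights and threshold are tuning assumptions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def merge_score(emb_a: np.ndarray, emb_b: np.ndarray,
                synonyms_a: set[str], synonyms_b: set[str]) -> float:
    """Blend embedding similarity with synonym overlap (Jaccard)."""
    sim = cosine_similarity(emb_a, emb_b)
    union = synonyms_a | synonyms_b
    jaccard = len(synonyms_a & synonyms_b) / len(union) if union else 0.0
    return 0.7 * sim + 0.3 * jaccard  # weights are guesses to tune

# Pairs scoring above some threshold (say 0.85) get flagged as merge candidates.
```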
The messy merge operation - this is where the graph structure helps you. If you need to merge canonical entities A and B:
- Create new entity C (or keep A, deprecate B)
- Redirect all edges pointing to A or B to point to C
- Merge synonym lists, choose best canonical ID
- Update confidence scores based on combined evidence
- Mark old entities as deprecated with redirect pointers
This is easier if you version your graph snapshots - you can always reconstruct what happened.
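A sketch of the redirect step, assuming hypothetical graph helpers (edges_touching, rewrite_edge, update_entity, mark_deprecated) that a real store would provide:

```python
def merge_canonical_entities(graph, keep_id: str, deprecate_id: str) -> None:
    """Fold entity `deprecate_id` into `keep_id`, leaving a redirect behind."""
    # Redirect every edge that touches the deprecated entity.
    for edge in graph.edges_touching(deprecate_id):
        graph.rewrite_edge(edge, old=deprecate_id, new=keep_id)
    keep = graph.get(keep_id)
    dep = graph.get(deprecate_id)
    # Union the synonym lists onto the surviving entity.
    graph.update_entity(keep_id, synonyms=keep.synonyms | dep.synonyms)
    # Tombstone with a redirect pointer so old references still resolve.
    graph.mark_deprecated(deprecate_id, redirect_to=keep_id)
```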
Output structure - I'd suggest:
```
# Pass 1: entities.json
{
  "canonical_entities": {
    "domain_id": {
      "canonical_id": "UMLS:C0001175",
      "dbpedia_uri": "http://dbpedia.org/resource/...",
      "name": "...",
      "synonyms": [...],
      "embedding": [...],
      "confidence": 0.95,
      "first_seen": "2024-01-15",
      "usage_count": 42
    }
  }
}

# Pass 2: paper_12345.json
{
  "paper_id": "12345",
  "metadata": {...},
  "provisional_entities": {
    "local_id": {
      "name": "...",
      "context": "...",
      "embedding": [...],
      "candidate_canonical": ["UMLS:C0001175"],
      "confidence": 0.65
    }
  },
  "edges": [
    {
      "type": "extraction",  # or "claims", "evidence", etc.
      "source": "canonical:UMLS:C0001175",
      "target": "provisional:local_id",
      "attributes": {...}
    }
  ]
}
```

The DBPedia integration is clever - it gives you a universal backbone. You could have a preprocessing step that pulls relevant DBPedia subgraphs for each domain (medical entities, CS concepts, legal terms) and treats those as pre-seeded canonical entities.
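Pre-seeding could look something like this; a hedged sketch using the SPARQLWrapper library against the public DBpedia endpoint, with an illustrative query:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?disease ?label WHERE {
        ?disease a dbo:Disease ;
                 rdfs:label ?label .
        FILTER (lang(?label) = "en")
    }
    LIMIT 1000
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# Treat each returned URI as a pre-seeded canonical entity.
seeds = {
    row["disease"]["value"]: row["label"]["value"]
    for row in results["results"]["bindings"]
}
```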
The Cyc comparison is apt, but you have a huge advantage: LLMs can extract structured information at scale, and modern vector databases can handle the semantic similarity computations. Lenat's team was hand-curating everything.
One more thought: Consider a "review queue" workflow where low-confidence promotions or potential merges get flagged for human review. The system could rank these by confidence score and potential impact (highly-connected entities get priority review). This lets you bootstrap quality without blocking the pipeline.
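A sketch of that ranking; the ReviewItem fields are illustrative stand-ins:

```python
from pydantic import BaseModel, ConfigDict

class ReviewItem(BaseModel):
    model_config = ConfigDict(frozen=True)
    entity_id: str
    confidence: float  # confidence in the canonical assignment (0..1)
    degree: int        # number of edges touching the entity

def review_priority(item: ReviewItem) -> float:
    """Low confidence and high connectivity both push an item up the queue."""
    return (1.0 - item.confidence) * (1 + item.degree)

def build_review_queue(items: list[ReviewItem]) -> list[ReviewItem]:
    return sorted(items, key=review_priority, reverse=True)
```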
What I said
Let me take a breather by addressing a larger question. We've been picking nits
for over an hour in what should be a really straightforward application or
extension of kgraph. There have been a very large number of little things that
have to be gotten just right, which we would then ask a future extender to also
get just right. It seems the contract offered by kgraph must be too picky and
fastidious, don't you think? Are we failing in some best practice (I dunno,
maybe good use of interfaces?) to explain why this process is so fraught?
What ChatGPT said
I get why it feels like “death by a thousand papercuts.” But I don’t think the core problem is that kgraph is too picky so much as this: right now, the framework exposes too many “sharp edges” of its internal invariants to the extender.
There are two kinds of “picky”:
Good picky (worth it)
These are invariants that make the system reliable: required provenance, validated entity types, explicit semantics for duplicates.
Those constraints are best practice for KG pipelines. If you remove them, the system becomes easy to extend but hard to trust.
Bad picky (avoidable)
This is the stuff that made the Sherlock example feel fraught:
- manually setting bookkeeping fields (created_at, source, etc.)
- accidentally instantiating BaseEntity
- threading document_id through mention metadata
That’s not “best practice,” that’s “the framework doesn’t protect you from common mistakes.”
What’s actually going on
You’re running into a mismatch between:
- the invariants the framework needs to hold, and
- the level of abstraction at which extenders are asked to work.
When the extender has to manually do a dozen bookkeeping tasks correctly, the framework contract isn’t “too strict”; it’s under-designed at the boundary.
In other words: the problem isn’t interfaces per se; it’s that the interfaces are too low-level. They expose internal representation decisions rather than offering a safer, higher-level API.
The best-practice smell: “You made the user assemble a valid object graph manually”
This is the classic pitfall.
A good framework usually tries to ensure that any object you can construct is a valid one: the invariants live in the construction path, not in the user’s head.
Right now kgraph makes you manually construct Pydantic models that have a lot of invariants… and then the user becomes the one enforcing those invariants. That’s why it feels fraught.
How to make kgraph feel easy without losing correctness
1) Provide “builders” / factories (strongly recommended)
Instead of extenders instantiating entities directly, they call factories on an ingestion context:

```python
ctx.make_canonical_entity(type="character", canonical_id=..., name=..., source=...)
ctx.make_provisional_entity(type="character", mention=..., doc=...)
ctx.make_relationship(predicate=..., subj=..., obj=..., doc=..., evidence=...)
```

These factories would fill the bookkeeping fields (created_at, source, usage_count), attach provenance, pick the correct entity subclass, and validate before returning.
This one change removes like 70% of the papercuts.
2) Make “mention provenance” first-class, not metadata
If EntityMention had a document_id field, or a provenance object, then you don’t rely on a convention like mention.metadata["document_id"].
Conventions in metadata are where correctness goes to die.
3) Make aggregation semantics explicit
Right now storage decides whether duplicates overwrite. That forces extenders to “know storage behavior.”
Instead: make aggregation an explicit pipeline step with defined merge semantics, owned by the orchestrator.
Extenders should not need to care how the storage dedupes.
4) Provide a “domain registry / resolver” helper
A DomainSchema could expose:
- schema.new_entity(entity_type, **fields) (returns the correct subclass)
- schema.new_relationship(predicate, **fields)
So nobody ever accidentally instantiates BaseEntity.
5) Offer a “toy backend” that matches production semantics
Your in-memory storage currently has behavior that can differ from a DB-backed store (overwrite vs merge). If the example uses the toy backend, it should behave like the “real thing” to avoid surprise.
Are you failing in some best practice?
You’re not doing something “wrong” so much as you’re observing an architectural truth: the more invariants a framework enforces, the more it must own the construction of valid objects.
You are using interfaces. The missing best practice is a higher-level construction API on top of them, so the easy path is also the correct path.
What this means for the Sherlock example
Sherlock is actually doing you a favor: it’s acting as a canary for extension friction.
If a “toy domain” feels fragile, that’s evidence that the extension surface is too low-level, and any real domain will hit the same friction harder.
This is exactly what examples are for: finding where you need to add guardrails.
If you want, I can propose a concrete API sketch for IngestionContext / factory helpers (just a few functions) that would make Sherlock go from “lots of little gotchas” to “boringly straightforward,” while preserving all the invariants that make kgraph trustworthy.
What I said
I agree the Sherlock thing has been useful in exposing weaknesses in kgraph and
that's a good thing. Let's look at this factory idea (70% of the paper cuts).
Please say more about what they would look like, where they should go, all
that. BTW, please DO NOT USE DATACLASSES, use Pydantic models. Where possible,
it would be lovely if they are frozen because immutable data is happy data.
What ChatGPT said
Yep — factories are the right lever here, and you can do it without dataclasses, staying fully in Pydantic, and keeping things frozen/immutable.
The trick is: don’t just add “helper functions.” Add a typed “ingestion context” that’s the only way extenders create entities/relationships during ingestion. That context can enforce invariants, fill required fields, normalize IDs, attach provenance, and centralize aggregation.
Below is a concrete design that would remove most papercuts while preserving kgraph’s rigor.
What the factories should feel like to an extender
Instead of hand-constructing a BaseEntity subclass with a dozen fields, they call the context factories: ctx.make_canonical_entity(...) for canonical entities, ctx.make_provisional_entity(...) for unresolved mentions, and ctx.make_relationship(...) for edges.
They never touch created_at, source, usage_count, document IDs, etc. That all becomes framework-owned bookkeeping.
Where these should live
New module: kgraph/pipeline/builders.py (or kgraph/builders.py).
It should be pipeline-adjacent, because builders are about ingestion-time object construction, not storage.
Orchestrator uses them internally
IngestionOrchestrator creates a context per document ingestion and passes it to components (or uses it itself while wrapping existing interfaces).
Extenders opt into the “safe path”
You can introduce this without breaking the world: keep the existing interfaces (EntityExtractorInterface, etc.) and let the builders standardize how EntityMention / BaseEntity instances get constructed.
Core idea: a frozen “context” with two sub-builders
1) Provenance becomes first-class (Pydantic, frozen)
Instead of hiding this in mention.metadata["document_id"], define a model:
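A minimal sketch, assuming Pydantic v2; the fields beyond document_id are illustrative:

```python
from datetime import datetime
from pydantic import BaseModel, ConfigDict

class Provenance(BaseModel):
    model_config = ConfigDict(frozen=True)  # immutable data is happy data
    document_id: str
    span: tuple[int, int] | None = None     # character offsets of the mention
    extracted_at: datetime
```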
You can still store it in metadata for backwards compatibility, but internally, builders use it.
2) An ingestion clock / ids helper
(Or just a function; but making it explicit makes tests deterministic.)
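A sketch of the explicit version, with one frozen “now” per ingestion run:

```python
from datetime import datetime, timezone
from pydantic import BaseModel, ConfigDict

class IngestionClock(BaseModel):
    model_config = ConfigDict(frozen=True)
    now: datetime  # every entity created in this run shares one timestamp

    @classmethod
    def start(cls) -> "IngestionClock":
        return cls(now=datetime.now(timezone.utc))

# Tests can construct IngestionClock(now=some_fixed_datetime) directly.
```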
Entity factories
Why you need them
They solve: manual bookkeeping fields, wrong-subclass instantiation, and provenance-by-convention, all in one place.
What they should look like
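A minimal, self-contained sketch (Pydantic v2). The names follow this discussion (EntityBuilder, domain.entity_types, validate_entity), but the real kgraph signatures may differ:

```python
from datetime import datetime
from enum import Enum
from typing import Type
from pydantic import BaseModel, ConfigDict

class EntityStatus(str, Enum):
    CANONICAL = "canonical"
    PROVISIONAL = "provisional"

class BaseEntity(BaseModel):
    # frozen: invariants can't be broken after construction;
    # extra="forbid": misspelled attributes raise instead of vanishing.
    model_config = ConfigDict(frozen=True, extra="forbid")
    name: str
    status: EntityStatus
    usage_count: int
    created_at: datetime
    source: str

class DomainSchema(BaseModel):
    model_config = ConfigDict(frozen=True)
    entity_types: dict[str, Type[BaseEntity]]

    def validate_entity(self, entity: BaseEntity) -> None:
        """Domain checks beyond what Pydantic already enforced."""

class EntityBuilder(BaseModel):
    model_config = ConfigDict(frozen=True)
    domain: DomainSchema
    now: datetime      # from the IngestionClock
    document_id: str

    def make_canonical_entity(self, type: str, name: str,
                              source: str, **attributes) -> BaseEntity:
        cls = self.domain.entity_types[type]  # correct subclass, never BaseEntity
        entity = cls(
            name=name,
            status=EntityStatus.CANONICAL,
            usage_count=1,                    # counting semantics start sane
            created_at=self.now,
            source=source,
            **attributes,                     # typed domain fields (see below)
        )
        self.domain.validate_entity(entity)
        return entity

    def make_provisional_entity(self, type: str, name: str) -> BaseEntity:
        cls = self.domain.entity_types[type]
        entity = cls(
            name=name,
            status=EntityStatus.PROVISIONAL,
            usage_count=1,
            created_at=self.now,
            source=f"ingest:{self.document_id}",  # provenance, not convention
        )
        self.domain.validate_entity(entity)
        return entity
```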
Implementation rules inside the builder
- Look up the correct subclass from domain.entity_types[entity_type]
- Set:
  - status = CANONICAL or PROVISIONAL
  - usage_count = 1 for new entities (so your counting semantics are sane)
  - created_at = clock.now
  - source = "sherlock:curated" or provenance.document_id (or "ingest:<doc_id>")
- Merge attributes into metadata (unless you want typed domain attrs; see below)
- Validate using domain.validate_entity(entity)
- Return a frozen Pydantic model (your BaseEntity already is frozen)
Optional but very nice: typed domain attributes without sacrificing generic pipeline
If your domain entity has typed fields like role, you can support them without making extenders manually call the right class, by accepting attributes and applying them as constructor kwargs.
Example: in SherlockCharacter(role="detective"), role is a field, not metadata. If attributes include a key not on that model, Pydantic will raise (good).
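Continuing the sketch above, the Sherlock example with a typed field:

```python
class SherlockCharacter(BaseEntity):
    role: str  # a typed field, not a metadata convention

# The generic factory still works because **attributes flow through:
#   builder.make_canonical_entity(type="character", name="Sherlock Holmes",
#                                 source="sherlock:curated", role="detective")
# A typo like rolle="detective" raises a ValidationError (extra="forbid").
```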
Relationship factories + aggregation
Relationships are where you want to remove storage-dependent semantics.
Builder API
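Continuing the models above, a hedged sketch of the relationship builder; field names are assumptions:

```python
class BaseRelationship(BaseModel):
    model_config = ConfigDict(frozen=True)
    subject_id: str
    predicate: str
    object_id: str
    document_id: str
    evidence: tuple[str, ...] = ()  # supporting snippets, if any
    co_occurrence_count: int = 1

class RelationshipBuilder(BaseModel):
    model_config = ConfigDict(frozen=True)
    document_id: str

    def make_relationship(self, predicate: str, subj: str, obj: str,
                          evidence: tuple[str, ...] = ()) -> BaseRelationship:
        return BaseRelationship(
            subject_id=subj, predicate=predicate, object_id=obj,
            document_id=self.document_id,  # provenance stamped automatically
            evidence=evidence,
        )
```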
Aggregator inside orchestrator, not extenders
Instead of telling extenders “dedupe triples yourself,” have the orchestrator aggregate by (subject, predicate, object), maintaining a co_occurrence_count-style evidence accumulator and a last_updated timestamp.
That means storage can be dumb and overwrite safely, because the orchestrator only writes canonical aggregated relationships.
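A sketch of that aggregation step, using the BaseRelationship model above; model_copy is the Pydantic v2 way to derive an updated copy of a frozen model:

```python
def aggregate_relationships(
    rels: list[BaseRelationship],
) -> list[BaseRelationship]:
    """Collapse duplicate triples so storage can overwrite safely."""
    merged: dict[tuple[str, str, str], BaseRelationship] = {}
    for rel in rels:
        key = (rel.subject_id, rel.predicate, rel.object_id)
        prev = merged.get(key)
        if prev is None:
            merged[key] = rel
        else:
            # Frozen models: build an updated copy rather than mutating.
            merged[key] = prev.model_copy(update={
                "co_occurrence_count":
                    prev.co_occurrence_count + rel.co_occurrence_count,
                "evidence": prev.evidence + rel.evidence,
            })
    return list(merged.values())
```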
The ingestion context itself
The orchestrator constructs this per document, so everything has consistent “now” and consistent document IDs.
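Composing the builders above, the per-document context could look something like this sketch:

```python
from datetime import datetime, timezone

class IngestionContext(BaseModel):
    model_config = ConfigDict(frozen=True)
    entities: EntityBuilder
    relationships: RelationshipBuilder

def context_for(document_id: str, domain: DomainSchema) -> IngestionContext:
    """One context per document: consistent clock, consistent provenance."""
    now = datetime.now(timezone.utc)
    return IngestionContext(
        entities=EntityBuilder(domain=domain, now=now, document_id=document_id),
        relationships=RelationshipBuilder(document_id=document_id),
    )
```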
Enforcing pass invariants (this is where “sharp edges” go away)
You can make it almost impossible to do the wrong thing by providing two contexts:
- Pass1Context: entity stuff only
- Pass2Context: relationship stuff only (entities are resolved already)
Or more simply: separate builders, and only expose the appropriate one.
That alone kills “accidentally creating entities during relationship extraction.”
How this plugs into the existing interfaces
You can introduce new optional interfaces without breaking existing ones:
- EntityExtractorV2.extract(ctx: Pass1Context) -> list[EntityMention]
- EntityResolverV2.resolve(ctx: Pass1Context, mention: EntityMention, ...) -> BaseEntity
- RelationshipExtractorV2.extract(ctx: Pass2Context, entities: Sequence[BaseEntity]) -> list[BaseRelationship]
And the orchestrator can detect which methods exist via hasattr and choose the v2 path.
Even if you don’t change interfaces, the orchestrator can still use builders internally to normalize, validate, and stamp provenance on whatever the v1 components return.
But the biggest win comes when extenders themselves use the factories.
What changes for Sherlock specifically
Your resolver stops being “construct 12 fields correctly” and becomes:
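Something like this sketch, assuming the builders above and a CURATED_CHARACTERS lookup table for the toy domain (both illustrative):

```python
CURATED_CHARACTERS = {"sherlock holmes", "john watson", "irene adler"}  # assumed

def resolve(ctx: IngestionContext, mention) -> BaseEntity:
    # mention.text is an assumed EntityMention field.
    name = mention.text.strip()
    if name.lower() in CURATED_CHARACTERS:
        return ctx.entities.make_canonical_entity(
            type="character", name=name, source="sherlock:curated")
    # Unknown names stay provisional until they earn promotion.
    return ctx.entities.make_provisional_entity(type="character", name=name)
```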
No more abstract base mistakes, missing fields, or metadata conventions.
Why this removes ~70% of papercuts
Because you’ve moved the pain from “every extender must remember 20 rules” to “the framework constructs valid objects by default.”
This is the “pit of success” principle in practice.
If you want, next I can sketch the minimum viable patch to IngestionOrchestrator to introduce:
- IngestionClock + IngestionContext
- EntityBuilder + RelationshipBuilder
…without breaking the current interfaces. That’s the smallest change that makes Sherlock (and future domains) feel boring again.