@wware
Last active January 19, 2026 16:55
Generalizing literature graphs across knowledge domains

Prompt:

1/18/2026, 10:16:42 AM

I want to reframe the medical literature project a bit, allow it to be generalized to other domains of knowledge. We are still building a graph and a graph still consists of nodes (entities) and edges (relationships). We still have a collection of entities from previous ingestion processes. We add a new thing: entities may be "canonical", that is they have been assigned canonical IDs (UMLS numbers or whatever) or they may be "provisional", meaning that we don't know yet if they should be assigned canonical IDs, for instance an entity might be a mention of some trivial thing in just one paper.

Given a batch of papers to ingest, we proceed in two passes. First pass we extract entities and assign canonical IDs where they make sense. Second pass we identify the edges (for medical, these edges are of the three types, extraction, claims, and evidence). The first pass produces a JSON serialization of the collection of entities including canonical IDs and synonyms. The second pass produces one JSON file per paper, including the paper's edges and any provisional entities unique to that paper.

This framework allows each knowledge domain (legal documents, academic CS papers) to define its own source of canonical IDs, its own schema, its own list of edge types. Any interesting query optimizations (graph theory tricks, database quirks) can be shared across domains. Where possible, cross-domain canonical IDs are preferred (such as DBPedia URIs). Including a significant chunk of DBPedia is probably a very good idea, or at least being able to pull in DBPedia entities as the ingestion progresses.

Back in the 1980s, Doug Lenat tried to do something like this with the Cyc project, but there were no LLMs then, only grad students and postdocs, so that's what he used. They never got enough scale for the project to gain much traction. Perhaps now his idea is more feasible.

What to do about entity merging? If two canonical entities turn out to be the same thing, that's a messier operation than promoting a provisional entity. Maybe track usage counts to decide when to promote, and maintain confidence scores on canonical IDs to decide when to merge. Each entity has a semantic vector, and hopefully potential merges can be flagged by near matches of cosine distance.

Response:

1/18/2026, 10:17:20 AM

Claude's thought process

Synthesized scalable knowledge graph architecture across domains.

This is a really interesting evolution of the medical literature project. Will wants to generalize the architecture to work across different knowledge domains. Let me break down what he's proposing:

  1. Core architecture remains graph-based: Nodes are entities, edges are relationships
  2. Key innovation - entity status:
    • Canonical entities: Have domain-specific IDs (UMLS for medical, DBPedia URIs cross-domain, etc.)
    • Provisional entities: Mentions that haven't yet earned canonical status
  3. Two-pass ingestion:
    • Pass 1: Extract entities, assign canonical IDs where appropriate
    • Pass 2: Extract relationships/edges, create paper-specific files with edges and provisional entities
  4. Domain flexibility: Each domain defines:
    • Its canonical ID source
    • Its schema
    • Its edge types
  5. Shared infrastructure: Query optimizations, graph algorithms can be reused
  6. Entity lifecycle management:
    • Track usage counts to decide when to promote provisional → canonical
    • Track confidence scores on canonical assignments
    • Use semantic vectors for merge detection
    • Merging canonical entities is "messier" than promotion

The Cyc reference is apt - this is trying to build comprehensive knowledge graphs but with LLMs doing the heavy lifting instead of human experts.

This is a much more sophisticated and scalable architecture than the original medical-only version. Key design challenges:

  1. Entity resolution: When does a provisional entity become canonical?
  2. Merge detection: How to identify duplicates among canonical entities?
  3. Confidence management: How to track and update confidence in canonical assignments?
  4. Cross-domain linking: When to use domain-specific vs universal IDs?

Will's background makes this interesting - he has deep experience with both the technical infrastructure (AWS, containerization, CI/CD) and semantic technologies. He's thinking about how to make this practical and maintainable.

I should engage with this architectural vision, help him think through:

  • How entities transition between states and what triggers those transitions
  • The JSON structure for both extraction passes and how to handle provisional vs canonical entities
  • Strategies for detecting and resolving duplicate canonical entities using semantic similarity
  • Scaling considerations and operational complexity as the graph grows
  • Whether this fits into his current work or exists as a separate exploration

Given his technical depth, I can discuss implementation details and tradeoffs directly.

This is a really solid architectural evolution. You're essentially building a federated knowledge graph system where domains can plug in their own ontologies and extraction logic while sharing the underlying graph infrastructure and entity resolution machinery.

The two-pass approach is smart - it separates the "what exists" problem from the "how are things related" problem, and lets you defer committing to provisional entities until you see if they gain traction across papers.

A few thoughts on the entity lifecycle:

Promotion criteria could be fairly mechanical:

  • Usage count threshold (appears in N papers, or referenced by M canonical entities)
  • Semantic clustering (multiple provisional entities converge semantically)
  • Explicit domain rules (e.g., any chemical compound mentioned more than once gets promoted)
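A minimal sketch of how those promotion criteria could be checked mechanically. The class, field names, and thresholds here are all hypothetical stand-ins, not part of kgraph:

```python
from dataclasses import dataclass, field

@dataclass
class ProvisionalEntity:
    name: str
    usage_count: int = 0                              # papers this entity appears in
    referenced_by: set = field(default_factory=set)   # IDs of canonical entities that link to it
    entity_type: str = "generic"

def should_promote(
    entity: ProvisionalEntity,
    min_papers: int = 3,
    min_canonical_refs: int = 2,
    always_promote_types: frozenset = frozenset({"chemical_compound"}),
) -> bool:
    """Mechanical promotion check: any one criterion suffices."""
    # Explicit domain rule: e.g. chemical compounds mentioned more than once
    if entity.entity_type in always_promote_types and entity.usage_count > 1:
        return True
    # Usage-count threshold
    if entity.usage_count >= min_papers:
        return True
    # Referenced by enough canonical entities
    return len(entity.referenced_by) >= min_canonical_refs
```

In practice each domain would supply its own thresholds and `always_promote_types` set.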

For merge detection, the semantic vector approach is good, but you probably want multiple signals:

  • Cosine similarity of entity embeddings
  • Synonym overlap (if both have associated aliases)
  • Co-occurrence patterns (entities that appear together in similar contexts)
  • Domain-specific rules (e.g., UMLS already has a "broader/narrower" hierarchy)
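One way to combine those signals is a weighted score. This sketch uses just two of them, embedding cosine similarity and synonym (Jaccard) overlap, with made-up weights:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_score(
    emb_a: list[float], emb_b: list[float],
    syn_a: set[str], syn_b: set[str],
    w_embedding: float = 0.7, w_synonyms: float = 0.3,
) -> float:
    """Weighted combination of embedding similarity and synonym overlap."""
    union = syn_a | syn_b
    jaccard = len(syn_a & syn_b) / len(union) if union else 0.0
    return w_embedding * cosine(emb_a, emb_b) + w_synonyms * jaccard
```

Co-occurrence and domain-rule signals would slot in as additional weighted terms; the weights themselves would need tuning against a labeled set of known merges.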

The messy merge operation - this is where the graph structure helps you. If you need to merge canonical entities A and B:

  1. Create new entity C (or keep A, deprecate B)
  2. Redirect all edges pointing to A or B to point to C
  3. Merge synonym lists, choose best canonical ID
  4. Update confidence scores based on combined evidence
  5. Mark old entities as deprecated with redirect pointers

This is easier if you version your graph snapshots - you can always reconstruct what happened.
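The merge steps can be sketched against a toy in-memory graph; the dict layout is purely illustrative, not the actual storage interface:

```python
def merge_entities(graph: dict, keep_id: str, drop_id: str) -> None:
    """Merge canonical entity drop_id into keep_id.

    `graph` is a toy in-memory store:
      graph["entities"][id] -> {"synonyms": set, "deprecated": bool, "redirect": str | None}
      graph["edges"] -> list of {"source": id, "target": id, ...}
    """
    keep = graph["entities"][keep_id]
    drop = graph["entities"][drop_id]
    # Redirect all edges pointing at the dropped entity
    for edge in graph["edges"]:
        if edge["source"] == drop_id:
            edge["source"] = keep_id
        if edge["target"] == drop_id:
            edge["target"] = keep_id
    # Merge synonym lists
    keep["synonyms"] |= drop["synonyms"]
    # Deprecate with a redirect pointer, so stale references still resolve
    drop["deprecated"] = True
    drop["redirect"] = keep_id
```

Confidence-score recombination and canonical-ID selection are omitted here; they depend on the domain's ID authority.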

Output structure - I'd suggest:

# Pass 1: entities.json
{
  "canonical_entities": {
    "domain_id": {
      "canonical_id": "UMLS:C0001175",
      "dbpedia_uri": "http://dbpedia.org/resource/...",
      "name": "...",
      "synonyms": [...],
      "embedding": [...],
      "confidence": 0.95,
      "first_seen": "2024-01-15",
      "usage_count": 42
    }
  }
}

# Pass 2: paper_12345.json
{
  "paper_id": "12345",
  "metadata": {...},
  "provisional_entities": {
    "local_id": {
      "name": "...",
      "context": "...",
      "embedding": [...],
      "candidate_canonical": ["UMLS:C0001175"],
      "confidence": 0.65
    }
  },
  "edges": [
    {
      "type": "extraction",  # or "claims", "evidence", etc
      "source": "canonical:UMLS:C0001175",
      "target": "provisional:local_id",
      "attributes": {...}
    }
  ]
}

The DBPedia integration is clever - it gives you a universal backbone. You could have a preprocessing step that pulls relevant DBPedia subgraphs for each domain (medical entities, CS concepts, legal terms) and treats those as pre-seeded canonical entities.
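A hedged sketch of that pre-seeding step: construct the SPARQL text for pulling one DBpedia category's labeled resources. The category URI is a deployment choice, and this function only builds the query string; actually running it against DBpedia's public endpoint is left out:

```python
def dbpedia_seed_query(category_uri: str, limit: int = 500) -> str:
    """Build a SPARQL query pulling labeled resources in one category,
    to be pre-seeded as canonical entities."""
    return f"""
    PREFIX dct: <http://purl.org/dc/terms/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?resource ?label WHERE {{
      ?resource dct:subject <{category_uri}> ;
                rdfs:label ?label .
      FILTER (lang(?label) = "en")
    }} LIMIT {limit}
    """
```

Each row of the result would become a canonical entity whose ID is the DBpedia URI itself.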

The Cyc comparison is apt, but you have a huge advantage: LLMs can extract structured information at scale, and modern vector databases can handle the semantic similarity computations. Lenat's team was hand-curating everything.

One more thought: Consider a "review queue" workflow where low-confidence promotions or potential merges get flagged for human review. The system could rank these by confidence score and potential impact (highly-connected entities get priority review). This lets you bootstrap quality without blocking the pipeline.
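The ranking heuristic might look like this; the field names and priority formula are illustrative only:

```python
import heapq

def review_queue(candidates: list[dict], top_n: int = 20) -> list[dict]:
    """Rank flagged promotions/merges for human review: lower confidence
    and higher connectivity rank earlier (uncertain, high-impact first)."""
    def priority(c: dict) -> float:
        # Impact grows with edge count; urgency grows as confidence drops
        return (1.0 - c["confidence"]) * (1 + c["edge_count"])
    return heapq.nlargest(top_n, candidates, key=priority)
```

A highly connected entity with a middling confidence score thus jumps ahead of a near-certain, isolated one, which matches the "priority review" intent.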

Prompt:

Don't forget queries

Response:

1/18/2026, 1:35:54 PM

Claude's thought process

Architected query layer strategy bridging storage and user-facing API interfaces.

Will is asking about the query side of the knowledge graph - how users will actually interact with and retrieve information from the graph once it's built. He references med-lit-schema which has:

  • FastAPI server
  • GraphQL/GraphiQL interface
  • MkDocs documentation

This is about the "read" side of the system, whereas everything we've built so far is the "write" side (ingestion, entity resolution, etc.).

Let me think about what a query interface needs:

  1. Graph traversal queries (find paths, neighborhoods)
  2. Entity search and retrieval
  3. Relationship queries
  4. Semantic search (vector similarity)
  5. Aggregations and statistics
  6. Domain-specific queries

GraphQL is particularly well-suited for graph queries because:

  • Natural fit for graph data structures
  • Clients can specify exactly what they want
  • Type system matches well with domain schemas
  • Built-in introspection

The architecture should probably be:

  • Storage backend (what we have)
  • Query layer (new - translates high-level queries to storage operations)
  • API layer (FastAPI + GraphQL)
  • Documentation (MkDocs for API docs and domain schema docs)

Key considerations:

  • Domain-agnostic query layer (works for any DomainSchema)
  • GraphQL schema generation from DomainSchema
  • Efficient query execution (caching, batching)
  • Support for complex graph queries
  • Semantic search integration

Should discuss:

  • Query patterns needed

  • GraphQL schema generation

  • Query optimization

  • How this fits with the two-pass ingestion output

  • Cross-domain queries (using DBPedia as bridge)

Ah yes, the query side! This is where the graph becomes useful. Let's think through what a query layer could look like.

Query Patterns We Need to Support

Entity-centric:

  • Get entity by ID (canonical or provisional)
  • Search entities by name/synonym (fuzzy matching)
  • Find similar entities (semantic search via embeddings)
  • Get entity with all its relationships
  • Entity promotion candidates (for review queues)

Relationship-centric:

  • Get all relationships for an entity (inbound/outbound)
  • Filter by predicate type
  • Get relationship path between two entities
  • Find all entities related through a specific predicate chain

Graph traversal:

  • N-hop neighborhood (all entities within N relationships)
  • Path finding (shortest path, all paths between entities)
  • Subgraph extraction (all entities/relationships matching criteria)
  • Clustering/community detection
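The N-hop neighborhood query, for instance, reduces to a plain BFS once the graph is viewed as an adjacency list. This sketch is independent of any storage backend and ignores edge direction:

```python
from collections import deque

def n_hop_neighborhood(adj: dict[str, list[str]], start: str, hops: int) -> set[str]:
    """All entities within `hops` relationships of `start` (BFS)."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue  # don't expand past the hop limit
        for neighbor in adj.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen
```

A real query layer would page these results from storage rather than hold the whole adjacency map in memory, but the traversal logic is the same.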

Domain-specific:

  • Medical: "What drugs treat conditions similar to X?"
  • Legal: "What precedents cite this case?"
  • CS papers: "What papers build on this technique?"

Aggregations:

  • Count entities by type, status
  • Most-connected entities (graph centrality)
  • Usage statistics (which entities appear most often)
  • Merge candidate review queue

Architecture Layers

┌─────────────────────────────────────┐
│  API Layer (FastAPI + GraphQL)      │
│  - REST endpoints                   │
│  - GraphQL schema & resolvers       │
│  - GraphiQL UI                      │
└────────────┬────────────────────────┘
             │
┌────────────▼────────────────────────┐
│  Query Service Layer                │
│  - Domain-agnostic query builders   │
│  - Query optimization               │
│  - Result caching                   │
│  - Batch/dataloader support         │
└────────────┬────────────────────────┘
             │
┌────────────▼────────────────────────┐
│  Storage Layer (what we have)       │
│  - EntityStorage                    │
│  - RelationshipStorage              │
│  - DocumentStorage                  │
└─────────────────────────────────────┘

GraphQL Schema Generation

The big win here is auto-generating GraphQL schema from DomainSchema:

# Conceptual - not actual code
def generate_graphql_schema(domain: DomainSchema) -> str:
    """Generate GraphQL schema from domain definition."""
    
    # For each entity type, create:
    # - Type definition
    # - Query resolvers
    # - Relationship fields
    
    # Medical domain would generate:
    """
    type Drug {
      id: ID!
      name: String!
      synonyms: [String!]!
      dbpediaUri: String
      treats: [Disease!]!  # Relationships
      interactsWith: [Drug!]!
      usageCount: Int!
    }
    
    type Disease {
      id: ID!
      name: String!
      treatedBy: [Drug!]!
      causedBy: [Gene!]!
    }
    
    type Query {
      drug(id: ID!): Drug
      searchDrugs(name: String!, limit: Int = 10): [Drug!]!
      similarDrugs(id: ID!, threshold: Float = 0.8): [Drug!]!
      
      disease(id: ID!): Disease
      searchDiseases(name: String!): [Disease!]!
      
      # Generic queries
      entityById(id: ID!): Entity
      entitiesInRadius(id: ID!, hops: Int!): [Entity!]!
      pathBetween(from: ID!, to: ID!): Path
    }
    """

Query Service Interface

from typing import Literal

class QueryService:
    """Domain-agnostic query execution."""
    
    def __init__(
        self,
        domain: DomainSchema,
        entity_storage: EntityStorageInterface,
        relationship_storage: RelationshipStorageInterface,
    ):
        self.domain = domain
        self.entity_storage = entity_storage
        self.relationship_storage = relationship_storage
        self._cache = {}  # Query result cache
    
    async def get_entity_with_relationships(
        self,
        entity_id: str,
        predicates: list[str] | None = None,
        direction: Literal["inbound", "outbound", "both"] = "both",
    ) -> tuple[BaseEntity, list[BaseRelationship]]:
        """Get entity with its relationships."""
        
    async def find_path(
        self,
        source_id: str,
        target_id: str,
        max_hops: int = 5,
    ) -> list[list[tuple[BaseEntity, BaseRelationship]]]:
        """Find paths between entities."""
        
    async def get_neighborhood(
        self,
        entity_id: str,
        hops: int = 2,
        predicate_filter: list[str] | None = None,
    ) -> tuple[list[BaseEntity], list[BaseRelationship]]:
        """Get N-hop neighborhood subgraph."""
        
    async def semantic_search(
        self,
        query: str,
        entity_type: str | None = None,
        limit: int = 10,
    ) -> list[tuple[BaseEntity, float]]:
        """Semantic search via embeddings."""

FastAPI + GraphQL Stack

from fastapi import FastAPI
from strawberry.fastapi import GraphQLRouter
import strawberry

app = FastAPI(title="Knowledge Graph API")

# REST endpoints for simple queries
@app.get("/entities/{entity_id}")
async def get_entity(entity_id: str):
    """Get entity by ID."""
    
@app.get("/search/entities")
async def search_entities(q: str, type: str | None = None):
    """Search entities by name."""

# GraphQL
@strawberry.type
class Query:
    @strawberry.field
    async def entity(self, id: str) -> Entity:
        """Get entity by ID."""
        
    @strawberry.field
    async def search(self, query: str) -> list[Entity]:
        """Search entities."""

schema = strawberry.Schema(query=Query)
graphql_app = GraphQLRouter(schema)
app.include_router(graphql_app, prefix="/graphql")

# MkDocs for documentation
# - Domain schema reference
# - API documentation
# - Example queries
# - Query playground

Cool Features to Consider

1. Cross-domain queries via DBPedia:

query {
  # Medical domain
  drug(id: "aspirin") {
    name
    dbpediaUri
    # Jump to legal domain via DBPedia
    legalCases {
      title
      court
    }
  }
}

2. Temporal queries (if we track timestamps):

query {
  entitiesAddedAfter(date: "2025-01-01") {
    name
    createdAt
  }
}

3. Promotion/merge review queues:

query {
  promotionCandidates(minUsage: 5, minConfidence: 0.8) {
    id
    name
    usageCount
    confidence
    status
  }
  
  mergeCandidates(threshold: 0.95) {
    entity1 { name }
    entity2 { name }
    similarity
  }
}

4. Aggregations:

query {
  statistics {
    totalEntities
    byType {
      type
      count
      canonicalCount
      provisionalCount
    }
    topEntities(limit: 10) {
      name
      usageCount
    }
  }
}

Query Optimization Considerations

Caching:

  • Entity lookups (high cache hit rate expected)
  • Relationship traversals for popular entities
  • Embedding similarity results

Batching (DataLoader pattern):

  • Batch entity lookups when expanding relationships
  • Batch embedding searches

Indexing (storage layer):

  • Entity name prefix index (for autocomplete)
  • Relationship subject/object indexes (already have)
  • Embedding vector index (FAISS, pgvector)

Materialized views:

  • Pre-compute popular neighborhoods
  • Cache graph statistics

Structure

kgraph/
├── query/
│   ├── __init__.py
│   ├── service.py          # QueryService
│   ├── graphql/
│   │   ├── schema.py       # Schema generation
│   │   ├── resolvers.py    # Resolver implementations
│   │   └── types.py        # GraphQL type definitions
│   └── cache.py            # Query result caching
├── api/
│   ├── __init__.py
│   ├── main.py             # FastAPI app
│   ├── rest.py             # REST endpoints
│   └── deps.py             # Dependency injection
└── docs/
    └── api/                # MkDocs API documentation

What Makes This Cool

  1. Domain-agnostic: GraphQL schema auto-generated from DomainSchema
  2. Consistent interface: Same query patterns work across medical, legal, CS domains
  3. Graph-native: First-class support for traversals, paths, neighborhoods
  4. Semantic search: Leverage embeddings for "find similar" queries
  5. Review workflows: Built-in support for promotion/merge queues
  6. Cross-domain: DBPedia URIs enable queries across domains
  7. Developer experience: GraphiQL playground, MkDocs, type safety

Next Steps (When You're Ready)

  1. Start simple: REST endpoints for basic CRUD
  2. Add GraphQL: Schema generation from one domain (medical or legal)
  3. Graph queries: Implement traversal/path-finding
  4. Polish: Add caching, batching, documentation
  5. Cross-domain: Demonstrate DBPedia bridge queries

This would be a really compelling demo - ingest medical papers, then query "show me drugs similar to aspirin that treat inflammation" with a GraphiQL interface showing the graph visualization.

What aspects of this excite you most? The GraphQL schema generation? The graph traversal algorithms? The cross-domain queries?


Created: 1/18/2026 10:12:10
Updated: 1/18/2026 14:10:04
Exported: 1/18/2026 14:15:06
Link: https://claude.ai/chat/d5d81531-8075-4636-87b4-b40ec558f5a5

Prompt:

1/18/2026, 2:08:54 PM

Let's look in a different direction. Now that we've defined kgraph, let's think about the "vector subtraction" of "med-lit-schema minus kgraph", that is, how we would extend kgraph to get to where med-lit-schema is now. This should be an application of the "example" stuff we've already discussed.

Response:

1/18/2026, 2:10:04 PM

Claude's thought process

Deconstructed med-lit-schema to identify kgraph extensions needed.

This is a great question. Will wants to understand what's needed to get from the generic kgraph framework to a complete medical literature application. Let me think about what med-lit-schema provides that kgraph doesn't...

Looking at this, we need:

  1. Medical domain implementation (specific entity types, relationships)
  2. LLM-based pipeline components (parser, extractors, resolver)
  3. Actual canonical ID integration (UMLS, RxNorm, etc.)
  4. The three edge types (extraction, claims, evidence)
  5. PubMed/arxiv integration for document ingestion
  6. Query interface for the graph
  7. Promotion/merge review workflows

This would go in examples/medical/ as a complete implementation showing how to use kgraph for medical literature processing.

Let me structure this as a clear "what needs to be built" list with concrete file paths and implementations.

Perfect question. Let's define what needs to be in examples/medical/ to turn kgraph into a working medical literature system.

examples/medical/ - The Med-Lit Application Layer

1. Domain Definition (medical_domain.py)

from kgraph import BaseDocument, BaseEntity, BaseRelationship, DomainSchema, EntityStatus

class DrugEntity(BaseEntity):
    """Pharmaceutical drug."""
    rxnorm_id: str | None = None
    drug_class: str | None = None
    
    def get_entity_type(self) -> str:
        return "drug"
    
    def get_canonical_id_source(self) -> str | None:
        return "RxNorm" if self.status == EntityStatus.CANONICAL else None

class DiseaseEntity(BaseEntity):
    """Medical condition."""
    umls_id: str | None = None
    icd_code: str | None = None
    
    def get_entity_type(self) -> str:
        return "disease"
    
    def get_canonical_id_source(self) -> str | None:
        return "UMLS" if self.status == EntityStatus.CANONICAL else None

class GeneEntity(BaseEntity):
    """Gene or protein."""
    ncbi_gene_id: str | None = None
    uniprot_id: str | None = None
    
    def get_entity_type(self) -> str:
        return "gene"
    
    def get_canonical_id_source(self) -> str | None:
        return "NCBI_Gene" if self.status == EntityStatus.CANONICAL else None

# Three relationship types from your architecture
class ExtractionRelationship(BaseRelationship):
    """Direct co-mention in text."""
    context: str | None = None
    sentence: str | None = None
    
    def get_edge_type(self) -> str:
        return "extraction"

class ClaimRelationship(BaseRelationship):
    """Author's explicit claim."""
    claim_text: str | None = None
    strength: str | None = None  # "strong", "moderate", "weak"
    
    def get_edge_type(self) -> str:
        return "claim"

class EvidenceRelationship(BaseRelationship):
    """Supporting/contradicting evidence."""
    evidence_type: str | None = None  # "supports", "contradicts"
    study_design: str | None = None
    
    def get_edge_type(self) -> str:
        return "evidence"

class PubMedPaper(BaseDocument):
    """PubMed research paper."""
    pmid: str | None = None
    doi: str | None = None
    authors: list[str] = []
    journal: str | None = None
    pub_date: str | None = None
    
    def get_document_type(self) -> str:
        return "pubmed_article"
    
    def get_sections(self) -> list[tuple[str, str]]:
        # Parse abstract, methods, results, discussion
        # Real implementation would use section detection
        return [("abstract", self.metadata.get("abstract", ""))]

class MedicalDomain(DomainSchema):
    @property
    def name(self) -> str:
        return "medical"
    
    @property
    def entity_types(self) -> dict[str, type[BaseEntity]]:
        return {
            "drug": DrugEntity,
            "disease": DiseaseEntity,
            "gene": GeneEntity,
        }
    
    @property
    def relationship_types(self) -> dict[str, type[BaseRelationship]]:
        return {
            "extraction": ExtractionRelationship,
            "claim": ClaimRelationship,
            "evidence": EvidenceRelationship,
        }
    
    @property
    def document_types(self) -> dict[str, type[BaseDocument]]:
        return {"pubmed_article": PubMedPaper}
    
    @property
    def canonical_id_sources(self) -> dict[str, str]:
        return {
            "drug": "RxNorm",
            "disease": "UMLS",
            "gene": "NCBI_Gene",
        }
    
    def validate_entity(self, entity: BaseEntity) -> bool:
        return entity.get_entity_type() in self.entity_types
    
    def validate_relationship(self, rel: BaseRelationship) -> bool:
        return rel.predicate in self.relationship_types

2. LLM-Based Pipeline Components

llm_parser.py - Document Parser

class LLMPubMedParser(DocumentParserInterface):
    """Parse PubMed XML/text into structured documents."""
    
    def __init__(self, llm_client):
        self._llm = llm_client
    
    async def parse(self, raw_content: bytes, content_type: str, 
                   source_uri: str | None = None) -> PubMedPaper:
        # Extract title, abstract, sections using LLM
        # Parse metadata (authors, journal, date)
        # Return PubMedPaper
        pass

llm_entity_extractor.py - Entity Extraction

class LLMEntityExtractor(EntityExtractorInterface):
    """Extract medical entities using LLM with biomedical prompting."""
    
    def __init__(self, llm_client):
        self._llm = llm_client
    
    async def extract(self, document: BaseDocument) -> list[EntityMention]:
        # Prompt: "Extract drugs, diseases, genes from this abstract..."
        # Return EntityMention objects with entity_type set
        pass

umls_resolver.py - Entity Resolution with UMLS

class UMLSEntityResolver(EntityResolverInterface):
    """Resolve entities to UMLS/RxNorm/NCBI canonical IDs."""
    
    def __init__(self, umls_client, rxnorm_client, ncbi_client, 
                 embedding_generator):
        self._umls = umls_client
        self._rxnorm = rxnorm_client
        self._ncbi = ncbi_client
        self._embedder = embedding_generator
    
    async def resolve(self, mention: EntityMention, 
                     existing_storage: EntityStorageInterface
                     ) -> tuple[BaseEntity, float]:
        # 1. Check existing entities by name/synonym
        # 2. Query appropriate authority (UMLS/RxNorm/NCBI)
        # 3. If found, create canonical entity
        # 4. If not found, create provisional entity
        # 5. Generate embedding
        pass

llm_relationship_extractor.py - Three Edge Types

class MedicalRelationshipExtractor(RelationshipExtractorInterface):
    """Extract extraction/claim/evidence relationships."""
    
    def __init__(self, llm_client):
        self._llm = llm_client
    
    async def extract(self, document: BaseDocument, 
                     entities: Sequence[BaseEntity]
                     ) -> list[BaseRelationship]:
        # Pass 2a: Extraction edges (co-mentions)
        extractions = await self._extract_comentions(document, entities)
        
        # Pass 2b: Claim edges (author assertions)
        claims = await self._extract_claims(document, entities)
        
        # Pass 2c: Evidence edges (supporting/contradicting)
        evidence = await self._extract_evidence(document, entities)
        
        return extractions + claims + evidence
    
    async def _extract_comentions(self, doc, entities):
        # Find entities mentioned in same sentence
        # Create ExtractionRelationship
        pass
    
    async def _extract_claims(self, doc, entities):
        # Prompt LLM: "What does the author claim about relationships?"
        # Create ClaimRelationship
        pass
    
    async def _extract_evidence(self, doc, entities):
        # Look for evidence language (supports, contradicts)
        # Create EvidenceRelationship
        pass

3. External Service Clients

umls_client.py

class UMLSClient:
    """Interface to UMLS Metathesaurus API."""
    
    async def search(self, term: str) -> list[UMLSConcept]:
        # Query UMLS REST API
        # Return concept ID, preferred name, synonyms
        pass

rxnorm_client.py

class RxNormClient:
    """Interface to RxNorm drug database."""
    
    async def search(self, drug_name: str) -> RxNormConcept | None:
        # Query RxNorm API
        pass

pubmed_client.py

class PubMedClient:
    """Fetch papers from PubMed."""
    
    async def fetch_by_pmid(self, pmid: str) -> bytes:
        # Fetch XML from PubMed API
        pass
    
    async def search(self, query: str, limit: int = 100) -> list[str]:
        # Search for PMIDs matching query
        pass

4. Orchestration & Workflows

ingest_pipeline.py

class MedicalIngestionPipeline:
    """High-level pipeline for medical literature."""
    
    def __init__(self, orchestrator: IngestionOrchestrator):
        self._orchestrator = orchestrator
    
    async def ingest_pubmed_query(self, query: str, limit: int = 100):
        """Ingest papers matching a PubMed query."""
        # 1. Search PubMed
        # 2. Fetch papers
        # 3. Ingest through orchestrator
        # 4. Run promotion
        # 5. Export entities.json and per-paper JSONs
        pass
    
    async def review_promotions(self) -> list[BaseEntity]:
        """Get entities ready for promotion, sorted by confidence."""
        # Call orchestrator.run_promotion()
        # Return sorted list for human review
        pass
    
    async def review_merges(self, threshold: float = 0.95
                          ) -> list[tuple[BaseEntity, BaseEntity, float]]:
        """Find potential duplicate entities."""
        # Call orchestrator.find_merge_candidates()
        # Add synonym overlap detection
        # Return sorted by confidence
        pass

export.py

class MedicalGraphExporter:
    """Export graph in various formats."""
    
    async def export_entities_json(self, output_path: str):
        """Export entities.json with all canonical entities."""
        pass
    
    async def export_paper_json(self, document_id: str, output_path: str):
        """Export paper_{id}.json with edges and provisional entities."""
        pass
    
    async def export_neo4j(self, neo4j_driver):
        """Export to Neo4j graph database."""
        pass

5. Query Interface

query.py

class MedicalGraphQuery:
    """Query interface for medical knowledge graph."""
    
    def __init__(self, entity_storage, relationship_storage):
        self._entities = entity_storage
        self._relationships = relationship_storage
    
    async def find_drug_disease_relationships(
        self, drug_name: str, disease_name: str
    ) -> list[BaseRelationship]:
        """Find all relationships between a drug and disease."""
        pass
    
    async def find_papers_mentioning(
        self, entity_name: str
    ) -> list[BaseDocument]:
        """Find papers that mention an entity."""
        pass
    
    async def get_evidence_chain(
        self, subject: str, object: str, max_hops: int = 3
    ) -> list[list[BaseRelationship]]:
        """Find evidence chains between two entities."""
        pass

6. CLI & Demo Scripts

cli.py

import click

@click.group()
def cli():
    pass

@cli.command()
@click.argument('query')
@click.option('--limit', default=100)
def ingest(query: str, limit: int):
    """Ingest papers from PubMed."""
    # Run ingestion pipeline
    pass

@cli.command()
def promote():
    """Review and promote provisional entities."""
    pass

@cli.command()
def merge():
    """Review and merge duplicate entities."""
    pass

@cli.command()
@click.argument('drug')
@click.argument('disease')
def query_relationship(drug: str, disease: str):
    """Query relationships between entities."""
    pass

demo_workflow.py

async def demo_medical_workflow():
    """Complete workflow demo."""
    # 1. Set up storage and domain
    domain = MedicalDomain()
    entity_storage = PostgresEntityStorage(...)
    
    # 2. Configure pipeline with LLM components
    orchestrator = IngestionOrchestrator(
        domain=domain,
        parser=LLMPubMedParser(llm_client),
        entity_extractor=LLMEntityExtractor(llm_client),
        entity_resolver=UMLSEntityResolver(...),
        relationship_extractor=MedicalRelationshipExtractor(llm_client),
        embedding_generator=OpenAIEmbedding(...),
        entity_storage=entity_storage,
        relationship_storage=relationship_storage,
        document_storage=document_storage,
    )
    
    # 3. Ingest papers
    pipeline = MedicalIngestionPipeline(orchestrator)
    await pipeline.ingest_pubmed_query("aspirin AND cardiovascular", limit=50)
    
    # 4. Review promotions
    candidates = await pipeline.review_promotions()
    print(f"Found {len(candidates)} entities ready for promotion")
    
    # 5. Export graph
    exporter = MedicalGraphExporter(entity_storage, relationship_storage)
    await exporter.export_entities_json("output/entities.json")
    
    # 6. Query
    query = MedicalGraphQuery(entity_storage, relationship_storage)
    rels = await query.find_drug_disease_relationships("aspirin", "stroke")
    print(f"Found {len(rels)} relationships")

File Structure Summary

examples/medical/
├── README.md                    # Setup and usage instructions
├── requirements.txt             # Additional deps (anthropic, openai, etc.)
├── medical_domain.py            # Domain definition (6 entity types, 3 edge types)
├── llm_parser.py               # LLM-based document parser
├── llm_entity_extractor.py     # LLM entity extraction
├── llm_relationship_extractor.py  # Three edge type extraction
├── umls_resolver.py            # Entity resolution with canonical IDs
├── clients/
│   ├── umls_client.py          # UMLS Metathesaurus API
│   ├── rxnorm_client.py        # RxNorm drug database
│   ├── ncbi_client.py          # NCBI Gene database
│   └── pubmed_client.py        # PubMed paper fetching
├── ingest_pipeline.py          # High-level workflow orchestration
├── export.py                   # JSON/Neo4j export utilities
├── query.py                    # Graph query interface
├── cli.py                      # Command-line interface
├── demo_workflow.py            # Complete example workflow
└── tests/
    ├── test_medical_domain.py
    └── test_entity_resolution.py

The "Subtraction" Summary

kgraph provides: Core abstractions, storage interfaces, ingestion orchestration, entity lifecycle

examples/medical/ adds:

  1. Medical domain specifics - 6 entity types with canonical IDs
  2. Three relationship types - extraction, claim, evidence
  3. LLM-based extractors - Using Claude/GPT for entity/relationship extraction
  4. External authority integration - UMLS, RxNorm, NCBI clients
  5. PubMed ingestion - Fetching and parsing papers
  6. Workflow automation - Promotion review, merge review, batch processing
  7. Export utilities - JSON serialization in the format we discussed
  8. Query interface - Graph traversal and search

This is exactly the "application layer" on top of your framework. Once you build this, you can replicate it for examples/legal/ with different entity types (cases, statutes, courts) and relationship types (cites, overrules, interprets).

wware commented Jan 19, 2026

What I said

Let me take a breather by addressing a larger question. We've been picking nits
for over an hour in what should be a really straightforward application or
extension of kgraph. There have been a very large number of little things that
have to be gotten just right, which we would then ask a future extender to also
get just right. It seems the contract offered by kgraph must be too picky and
fastidious, don't you think? Are we failing in some best practice (I dunno,
maybe good use of interfaces?) to explain why this process is so fraught?

What ChatGPT said

I get why it feels like “death by a thousand papercuts.” But I don’t think the core problem is that kgraph is too picky so much as: right now the framework is exposing too many “sharp edges” of its internal invariants to the extender.

There are two kinds of “picky”:

Good picky (worth it)

These are invariants that make the system reliable:

  • Pass separation (entities first, relationships second): avoids cycles like “edges create nodes” and keeps provenance coherent.
  • Immutability of entities/relationships: makes indexing, dedupe, and caching sane.
  • Stable IDs + promotion: you can’t have referential integrity if IDs change without a migration story.

Those constraints are best practice for KG pipelines. If you remove them, the system becomes easy to extend but hard to trust.

Bad picky (avoidable)

This is the stuff that made the Sherlock example feel fraught:

  • having to manually remember “don’t create entities in relationship extraction”
  • having to manually populate required fields (created_at, source, etc.)
  • having to remember “don’t instantiate abstract BaseEntity”
  • relationship dedupe/aggregation quirks depending on storage backend
  • doc title normalization affecting story identity
  • needing to thread document_id through mention metadata

That’s not “best practice,” that’s “the framework doesn’t protect you from common mistakes.”


What’s actually going on

You’re running into a mismatch between:

  • kgraph’s conceptual model: “everything is explicit and immutable and provenance-aware”
  • the extender experience: “I just want to write an extractor/resolver and get a working graph”

When the extender has to manually do a dozen bookkeeping tasks correctly, the framework contract isn’t “too strict”—it’s under-designed at the boundary.

In other words: the problem isn’t interfaces per se; it’s that the interfaces are too low-level. They expose internal representation decisions rather than offering a safer, higher-level API.


The best-practice smell: “You made the user assemble a valid object graph manually”

This is the classic pitfall.

A good framework usually tries to ensure:

  • the “easy path” is correct by default
  • you can’t accidentally violate invariants without trying
  • the interface guides you into constructing valid objects

Right now kgraph makes you manually construct Pydantic models that have a lot of invariants… and then the user becomes the one enforcing those invariants.

That’s why it feels fraught.


How to make kgraph feel easy without losing correctness

1) Provide “builders” / factories (strongly recommended)

Instead of extenders instantiating entities directly:

  • ctx.make_canonical_entity(type="character", canonical_id=..., name=..., source=...)
  • ctx.make_provisional_entity(type="character", mention=..., doc=...)
  • ctx.make_relationship(predicate=..., subj=..., obj=..., doc=..., evidence=...)

These factories would:

  • fill required fields
  • normalize IDs
  • attach provenance automatically
  • forbid “entity creation in pass 2” by not exposing that API there

This one change removes like 70% of the papercuts.

2) Make “mention provenance” first-class, not metadata

If EntityMention had a document_id field, or a provenance object, then you don’t rely on convention like mention.metadata["document_id"].

Conventions in metadata are where correctness goes to die.

3) Make aggregation semantics explicit

Right now storage decides whether duplicates overwrite. That forces extenders to “know storage behavior.”

Instead:

  • relationship extractor should return “raw events”
  • orchestrator aggregates into canonical relationships before storage
  • storage stores canonical relationships

Extenders should not need to care how the storage dedupes.
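A minimal sketch of that split, with a raw event as a bare (subject, predicate, object, document_id, confidence) tuple; the names and shapes are illustrative:

```python
from collections import defaultdict

def aggregate_events(events):
    """Group raw extraction events by (subject, predicate, object),
    unioning source documents and keeping the max confidence."""
    grouped = defaultdict(lambda: {"source_documents": set(), "confidence": 0.0})
    for subj, pred, obj, doc_id, conf in events:
        agg = grouped[(subj, pred, obj)]
        agg["source_documents"].add(doc_id)
        agg["confidence"] = max(agg["confidence"], conf)
    return dict(grouped)
```

Storage then only ever sees one aggregated record per triple, so overwrite-on-write is safe regardless of backend.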

4) Provide a “domain registry / resolver” helper

A DomainSchema could expose:

  • schema.new_entity(entity_type, **fields) (returns correct subclass)
  • schema.new_relationship(predicate, **fields)

So nobody ever accidentally instantiates BaseEntity.
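A sketch of how such a registry method might look; the concrete classes and field set are illustrative stand-ins for kgraph's real types:

```python
from pydantic import BaseModel

class BaseEntity(BaseModel, frozen=True):
    entity_id: str
    name: str

class Character(BaseEntity, frozen=True):
    role: str = "unknown"

class DomainSchema(BaseModel, frozen=True):
    entity_types: dict[str, type[BaseEntity]]

    def new_entity(self, entity_type: str, **fields) -> BaseEntity:
        """Construct the registered concrete subclass for this type;
        an unknown type fails loudly instead of falling back to BaseEntity."""
        if entity_type not in self.entity_types:
            raise ValueError(f"unknown entity type: {entity_type!r}")
        return self.entity_types[entity_type](**fields)
```

The extender names a type string; the registry picks the subclass, so the abstract base is never constructed by hand.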

5) Offer a “toy backend” that matches production semantics

Your in-memory storage currently has behavior that can differ from a DB-backed store (overwrite vs merge). If the example uses the toy backend, it should behave like the “real thing” to avoid surprise.


Are you failing in some best practice?

You’re not doing something “wrong” so much as you’re observing an architectural truth:

Interfaces alone don’t make extension easy; safe defaults and “pit of success” APIs do.

You are using interfaces. The missing best practice is:

  • high-level abstraction at the extension boundary
  • construction helpers that enforce invariants
  • conventions replaced with typed fields

What this means for the Sherlock example

Sherlock is actually doing you a favor: it’s acting as a canary for extension friction.

If a “toy domain” feels fragile, that’s evidence that:

  • the core abstractions are fine, but
  • the extender ergonomics are not yet there

This is exactly what examples are for—finding where you need to add guardrails.


If you want, I can propose a concrete API sketch for IngestionContext / factory helpers (just a few functions) that would make Sherlock go from “lots of little gotchas” to “boringly straightforward,” while preserving all the invariants that make kgraph trustworthy.

What I said

I agree the Sherlock thing has been useful in exposing weaknesses in kgraph and
that's a good thing. Let's look at this factory idea (70% of the paper cuts).
Please say more about what they would look like, where they should go, all
that. BTW, please DO NOT USE DATACLASSES, use Pydantic models. Where possible,
it would be lovely if they are frozen because immutable data is happy data.

What ChatGPT said

Yep — factories are the right lever here, and you can do it without dataclasses, staying fully in Pydantic, and keeping things frozen/immutable.

The trick is: don’t just add “helper functions.” Add a typed “ingestion context” that’s the only way extenders create entities/relationships during ingestion. That context can enforce invariants, fill required fields, normalize IDs, attach provenance, and centralize aggregation.

Below is a concrete design that would remove most papercuts while preserving kgraph’s rigor.


What the factories should feel like to an extender

Instead of:

entity = SherlockCharacter(
  entity_id=..., status=..., created_at=..., source=..., usage_count=..., metadata=...
)

They do:

entity = ctx.entities.canonical(
    entity_type="character",
    entity_id="holmes:char:SherlockHolmes",
    name="Sherlock Holmes",
    synonyms=("Holmes", "Mr. Holmes"),
    confidence=0.99,
    attributes={"role": "detective"},
)

And relationships:

ctx.relationships.add(
    predicate="appears_in",
    subject=holmes,
    object=story,
    confidence=0.95,
    evidence=ctx.evidence.from_document(doc),
)

They never touch created_at, source, usage_count, document IDs, etc. That all becomes framework-owned bookkeeping.


Where these should live

New module: kgraph/pipeline/builders.py (or kgraph/builders.py)

It should be pipeline-adjacent, because builders are about ingestion-time object construction, not storage.

Orchestrator uses them internally

IngestionOrchestrator creates a context per document ingestion, passes it to components (or uses it itself while wrapping existing interfaces).

Extenders opt into the “safe path”

You can introduce this without breaking the world:

  • Keep existing interfaces (EntityExtractorInterface, etc.)
  • Add v2 interfaces that accept a context, or adapt by letting orchestrator call builders while still accepting raw EntityMention/BaseEntity returns.

Core idea: a frozen “context” with two sub-builders

1) Provenance becomes first-class (Pydantic, frozen)

Instead of hiding this in mention.metadata["document_id"], define a model:

class Provenance(BaseModel, frozen=True):
    document_id: str
    source_uri: str | None = None
    section: str | None = None
    start_offset: int | None = None
    end_offset: int | None = None

You can still store it in metadata for backwards compatibility, but internally, builders use it.

2) An ingestion clock / ids helper

class IngestionClock(BaseModel, frozen=True):
    now: datetime

(Or just a function; but making it explicit makes tests deterministic.)
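For example, a test can pin the clock so every created_at stamped during that run is reproducible, and freezing makes the pinned value tamper-proof:

```python
from datetime import datetime, timezone
from pydantic import BaseModel

class IngestionClock(BaseModel, frozen=True):
    now: datetime

# Pin "now" so every entity created in this test run gets the same,
# reproducible timestamp.
fixed = datetime(2026, 1, 19, 12, 0, tzinfo=timezone.utc)
clock = IngestionClock(now=fixed)
```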


Entity factories

Why you need them

They solve:

  • “don’t instantiate BaseEntity”
  • “fill required fields”
  • “make provisional vs canonical consistent”
  • “centralize ID normalization”
  • “enforce domain-registered types”
  • “avoid mixing domain-specific attrs into generic metadata ad hoc”

What they should look like

class EntityBuilder(BaseModel, frozen=True):
    domain: DomainSchema
    clock: IngestionClock

    def canonical(
        self,
        *,
        entity_type: str,
        entity_id: str,
        name: str,
        synonyms: tuple[str, ...] = (),
        confidence: float = 1.0,
        canonical_ids: dict[str, str] | None = None,
        embedding: tuple[float, ...] | None = None,
        provenance: Provenance | None = None,
        attributes: dict | None = None,
    ) -> BaseEntity:
        ...

    def provisional(
        self,
        *,
        entity_type: str,
        name: str,
        confidence: float,
        provenance: Provenance,
        synonyms: tuple[str, ...] = (),
        embedding: tuple[float, ...] | None = None,
        attributes: dict | None = None,
        entity_id: str | None = None,   # optional override
    ) -> BaseEntity:
        ...

Implementation rules inside the builder

  • Look up the correct subclass from domain.entity_types[entity_type]

  • Set:

    • status = CANONICAL or PROVISIONAL
    • usage_count = 1 for new entities (so your counting semantics are sane)
    • created_at = clock.now
    • source = "sherlock:curated" or provenance.document_id (or "ingest:<doc_id>")
  • Merge attributes into metadata (unless you want typed domain attrs — see below)

  • Validate using domain.validate_entity(entity)

  • Return frozen Pydantic model (your BaseEntity already is frozen)

Optional but very nice: typed domain attributes without sacrificing generic pipeline

If your domain entity has typed fields like role, you can support them without making extenders manually call the right class by allowing attributes and applying them as constructor kwargs.

Example:

  • SherlockCharacter(role="detective") is a field, not metadata.
  • The builder can do:
klass = domain.entity_types[entity_type]
entity = klass(**base_fields, **(attributes or {}))

If attributes include a key not on that model, Pydantic will raise (good), provided the model config sets extra="forbid"; Pydantic v2's default is to silently ignore unknown keys.
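A tiny runnable illustration of that failure mode; SherlockCharacter here is a stand-in, and the model opts into `extra="forbid"` so unknown keys are rejected rather than ignored:

```python
from pydantic import BaseModel, ValidationError

class SherlockCharacter(BaseModel, frozen=True, extra="forbid"):
    name: str
    role: str = "unknown"

# Valid attributes simply become typed, validated fields.
holmes = SherlockCharacter(name="Sherlock Holmes", role="detective")

# With extra="forbid", an attribute that is not a declared field
# raises at construction time instead of vanishing into metadata.
try:
    SherlockCharacter(name="Moriarty", hat_size=7)
    raised = False
except ValidationError:
    raised = True
```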


Relationship factories + aggregation

Relationships are where you want to remove storage-dependent semantics.

Builder API

class RelationshipEvidence(BaseModel, frozen=True):
    source_documents: tuple[str, ...]
    mentions: tuple[Provenance, ...] = ()
    notes: dict = Field(default_factory=dict)

class RelationshipBuilder(BaseModel, frozen=True):
    domain: DomainSchema
    clock: IngestionClock

    def create(
        self,
        *,
        predicate: str,
        subject_id: str,
        object_id: str,
        confidence: float,
        evidence: RelationshipEvidence,
        attributes: dict | None = None,
    ) -> BaseRelationship:
        ...

Aggregator inside orchestrator, not extenders

Instead of telling extenders “dedupe triples yourself,” have orchestrator aggregate by (subject, predicate, object):

  • union source_documents
  • maintain a co_occurrence_count style evidence accumulator
  • set last_updated
  • combine confidence by rule (max, mean, or noisy-or)

That means storage can be dumb and overwrite safely, because orchestrator only writes canonical aggregated relationships.
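Of those combination rules, noisy-or is the least self-explanatory; a minimal sketch (the function name is illustrative):

```python
def noisy_or(confidences: list[float]) -> float:
    """Treat each observation as independent evidence: the combined
    confidence is the probability that at least one observation holds."""
    p_none = 1.0
    for c in confidences:
        p_none *= (1.0 - c)
    return 1.0 - p_none
```

Unlike max, noisy-or grows as corroborating observations accumulate: two independent 0.5 sightings combine to 0.75.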


The ingestion context itself

class IngestionContext(BaseModel, frozen=True):
    domain: DomainSchema
    clock: IngestionClock
    document: BaseDocument

    entities: EntityBuilder
    relationships: RelationshipBuilder

    def provenance(
        self,
        *,
        start_offset: int | None = None,
        end_offset: int | None = None,
        section: str | None = None,
    ) -> Provenance:
        return Provenance(
            document_id=self.document.document_id,
            source_uri=self.document.source_uri,
            section=section,
            start_offset=start_offset,
            end_offset=end_offset,
        )

The orchestrator constructs this per document, so everything has consistent “now” and consistent document IDs.


Enforcing pass invariants (this is where “sharp edges” go away)

You can make it almost impossible to do the wrong thing by providing two contexts:

  • Pass1Context: entity stuff only
  • Pass2Context: relationships stuff only (entities are resolved already)
class Pass1Context(IngestionContext, frozen=True):
    relationships: None = None  # or omit entirely

class Pass2Context(IngestionContext, frozen=True):
    # could omit entity creation methods here
    pass

Or more simply: separate builders and only expose the appropriate one.

That alone kills “accidentally creating entities during relationship extraction.”


How this plugs into the existing interfaces

You can introduce new optional interfaces without breaking existing ones:

  • EntityExtractorV2.extract(ctx: Pass1Context) -> list[EntityMention]
  • EntityResolverV2.resolve(ctx: Pass1Context, mention: EntityMention, ...) -> BaseEntity
  • RelationshipExtractorV2.extract(ctx: Pass2Context, entities: Sequence[BaseEntity]) -> list[BaseRelationship]

And orchestrator can detect which methods exist via hasattr and choose the v2 path.
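The dispatch itself can stay a few lines in the orchestrator; a sketch of the pattern (extract_v2 and the toy extractors are illustrative, not existing kgraph names):

```python
def run_extractor(extractor, ctx, document):
    """Prefer the context-aware v2 entry point when the component
    provides one; otherwise fall back to the legacy signature."""
    if hasattr(extractor, "extract_v2"):
        return extractor.extract_v2(ctx)
    return extractor.extract(document)

class LegacyExtractor:
    def extract(self, document):
        return [f"legacy:{document}"]

class V2Extractor:
    def extract(self, document):
        return []
    def extract_v2(self, ctx):
        return [f"v2:{ctx}"]
```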

Even if you don’t change interfaces, orchestrator can still use builders internally to:

  • create story entities from doc metadata
  • set default provenance
  • aggregate relationships

But the biggest win comes when extenders themselves use the factories.


What changes for Sherlock specifically

Your resolver stops being “construct 12 fields correctly” and becomes:

if canonical_id:
    meta = lookup_curated(canonical_id)
    return ctx.entities.canonical(
        entity_type=mention.entity_type,
        entity_id=canonical_id,
        name=meta["name"],
        synonyms=tuple(meta["aliases"]),
        confidence=mention.confidence,
        attributes=...,
        provenance=ctx.provenance(start_offset=mention.start_offset, end_offset=mention.end_offset),
    ), mention.confidence

return ctx.entities.provisional(
    entity_type=mention.entity_type,
    name=mention.text,
    confidence=mention.confidence * 0.5,
    provenance=ctx.provenance(...),
), mention.confidence * 0.5

No more abstract base mistakes, missing fields, or metadata conventions.


Why this removes ~70% of papercuts

Because you’ve moved the pain from “every extender must remember 20 rules” to:

  • one place in kgraph that encodes invariants
  • one small, pleasant API surface for extenders

This is the “pit of success” principle in practice.


If you want, next I can sketch the minimum viable patch to IngestionOrchestrator to introduce:

  1. IngestionClock + IngestionContext
  2. EntityBuilder + RelationshipBuilder
  3. relationship aggregation in orchestrator (so in-memory storage overwrite is fine)

…without breaking the current interfaces. That’s the smallest change that makes Sherlock (and future domains) feel boring again.

wware commented Jan 19, 2026

What I said

Add a typed “ingestion context” that’s the only way extenders create entities/relationships during ingestion. That context can enforce invariants, fill required fields, normalize IDs, attach provenance, and centralize aggregation.

This sounds very appealing. My question about this is, is the ingestion context defined in kgraph proper, or in the extension (which sounds painful), or is it a base class in kgraph that the extender might add more fields to? For instance, medical literature has some very specific thoughts on provenance and what constitutes evidence and all that.

What ChatGPT said

Define it in kgraph proper.

Extenders should not be defining their own context shape just to use the framework — that would recreate the papercuts (“now you must wire a context correctly too”). The core ingestion context should be a stable, opinionated part of kgraph that:

  • carries the document + domain + clock
  • provides the entity/relationship builders
  • provides a minimum viable, typed provenance/evidence model
  • enforces pass invariants (pass1 vs pass2)

Then give domains a clean way to specialize evidence/provenance without subclassing the context.

Recommended pattern: generic context + domain-defined evidence type

In kgraph

  • IngestionContext (frozen Pydantic model)
  • Provenance (frozen Pydantic model)
  • Evidence base type (frozen Pydantic model)
  • builders accept evidence: Evidence and store it in relationship/entity metadata in a structured way

In the extension

  • Define MedicalEvidence(Evidence) / MedicalProvenance(Provenance) if needed
  • Provide a domain hook (on DomainSchema) that tells kgraph what evidence model to expect/validate

Then the relationship builder can do:

  • validate isinstance(evidence, domain.evidence_model)
  • store evidence.model_dump() into relationship.metadata["evidence"]

How medical would fit

Medical can define richer evidence like:

  • citation (PMID/DOI)
  • section (abstract/results)
  • study type (RCT / observational)
  • population, dosage, outcome measures
  • evidence grade (GRADE, etc.)
  • whether it’s extracted vs curated vs inferred

…and your builder/orchestrator can treat it as opaque-but-validated structured data.

So: kgraph owns the context, the domain owns the evidence schema, and the builder enforces + records it.
