Generalizing literature graphs across knowledge domains

Prompt:

1/18/2026, 10:16:42 AM

I want to reframe the medical literature project a bit, allow it to be generalized to other domains of knowledge. We are still building a graph and a graph still consists of nodes (entities) and edges (relationships). We still have a collection of entities from previous ingestion processes. We add a new thing: entities may be "canonical", that is they have been assigned canonical IDs (UMLS numbers or whatever) or they may be "provisional", meaning that we don't know yet if they should be assigned canonical IDs, for instance an entity might be a mention of some trivial thing in just one paper.
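
A minimal sketch of the canonical/provisional distinction described above, assuming an entity is canonical exactly when it carries a canonical ID. The names `Entity`, `EntityStatus`, and `promote`, and the ID string `EXAMPLE:0001`, are hypothetical illustrations rather than part of the project.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class EntityStatus(Enum):
    CANONICAL = "canonical"
    PROVISIONAL = "provisional"


@dataclass(frozen=True)
class Entity:
    """A graph node. All field names here are assumptions for illustration."""
    name: str
    entity_type: str
    # A canonical ID (a UMLS number or whatever the domain uses); None if unassigned.
    canonical_id: Optional[str] = None

    @property
    def status(self) -> EntityStatus:
        # An entity counts as canonical only once an ID has been assigned.
        if self.canonical_id:
            return EntityStatus.CANONICAL
        return EntityStatus.PROVISIONAL


def promote(entity: Entity, canonical_id: str) -> Entity:
    """Return a canonical copy of an entity; the original is left unchanged."""
    return Entity(entity.name, entity.entity_type, canonical_id)


mention = Entity(name="aspirin", entity_type="drug")
resolved = promote(mention, "EXAMPLE:0001")  # placeholder ID, not a real UMLS number
print(mention.status.value, resolved.status.value)
```

Deriving the status from the presence of an ID keeps the two from drifting apart; a design that needs a third state (for example "rejected") would store the status explicitly instead.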

Given a batch of papers to ingest, we proceed in two passes. First pass we extract entities and assign canonical IDs where they make sense. Second pass we identify the edges (for medical, these edges are of the three types, extraction, claims, and evidence). The first pass produces a JSON serialization of the collection of enti

Battle Plan: Medical Knowledge Graph Ingestion Optimization

This document outlines the strategic roadmap for integrating Large Language Models (LLMs) into the ingestion pipeline, specifically targeting Stage 4 (Claims Extraction) and Stage 5 (Evidence Aggregation).

1. LLM Infrastructure Strategy

The choice of model is critical for handling the specialized technical language and structured data extraction requirements of medical research papers.

Component Model Recommendation Rationale

Short answer: yes, this test suite makes a lot of sense, and it’s actually quite strong architecturally. Longer answer below, with praise and some concrete suggestions where you might tighten it.


What you’re doing well

1. Clear layering and intent

Your tests map very cleanly onto your system architecture:

@wware
wware / SBIR.md
Last active January 11, 2026 14:34

Great work on the pipeline refactoring! This is a solid Unix-style architecture with clean separation of concerns:

What you've built:

  • Modular stages - Each pipeline script is independent and can be run separately
  • Interface-based design - Storage, parsers, and embeddings all use ABC interfaces
  • Swappable backends - SQLite for dev/testing, PostgreSQL+pgvector for production
  • Clean data flow - Each stage reads/writes through well-defined interfaces
  • Comprehensive docs - README and TESTING guide are clear and helpful
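
As a rough illustration of the interface-based design and swappable backends listed above, here is a sketch in which a pipeline stage depends only on an abstract storage interface. The names `EntityStore`, `InMemoryStore`, `SQLiteStore`, and `ingest_stage`, and the table schema, are assumptions made for this example and are not taken from the actual pipeline.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple


class EntityStore(ABC):
    """Hypothetical storage interface; stages see only these two methods."""

    @abstractmethod
    def put(self, entity_id: str, name: str) -> None:
        """Store or overwrite an entity name under the given ID."""

    @abstractmethod
    def get(self, entity_id: str) -> Optional[str]:
        """Return the stored name, or None if the ID is unknown."""


class InMemoryStore(EntityStore):
    """Dictionary-backed store, handy for unit tests."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, entity_id: str, name: str) -> None:
        self._data[entity_id] = name

    def get(self, entity_id: str) -> Optional[str]:
        return self._data.get(entity_id)


class SQLiteStore(EntityStore):
    """SQLite-backed store; the schema is an assumption for this sketch."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS entities "
            "(id TEXT PRIMARY KEY, name TEXT NOT NULL)"
        )

    def put(self, entity_id: str, name: str) -> None:
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT OR REPLACE INTO entities (id, name) VALUES (?, ?)",
                (entity_id, name),
            )

    def get(self, entity_id: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT name FROM entities WHERE id = ?", (entity_id,)
        ).fetchone()
        return row[0] if row else None


def ingest_stage(store: EntityStore, records: List[Tuple[str, str]]) -> int:
    """A stand-in pipeline stage: writes records through the interface only."""
    for entity_id, name in records:
        store.put(entity_id, name)
    return len(records)
```

Because `ingest_stage` never names a concrete backend, the same call works with `InMemoryStore()` in tests and `SQLiteStore("dev.db")` in development; a PostgreSQL backend would be one more subclass.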

GPU-Accelerated AI Tinkering

Like me, you may get tired of paying subscription fees to use online LLMs. Especially when, later, you're told that you've reached the usage limit and you should "switch to another model" or some such nonsense. The temptation at that point is to run a model locally using Ollama, but your local machine probably doesn't have a GPU if you're not a gamer. Then you dream of picking up a cheap GPU box on eBay and running it locally, and that's not a bad idea but it takes time and money that you may not want to spend right now.

There is an alternative: services like Lambda Labs, RunPod, and others. Lambda Labs is what I got when I threw a dart at a dartboard, so I'll be using it here.

I'm using an LLM to translate medical papers into a graph database of entities and relationships. I set up GPU-accelerated paper ingestion using Lambda Labs, and got an enormous speedup over CPU-only. The quick turnaround made it practical to find and fix some bugs discovered during testing.

GPU

wware / 0_README.md
Last active December 30, 2025 18:40

Graph-RAG with Neo4j and MCP

A hands-on introduction to graph databases using Neo4j's classic movie dataset, accessible through both the Neo4j web interface and AI-powered natural language queries via Cursor IDE.

What Are Graph Databases?

Imagine you're organizing information about movies and actors. A traditional database stores these as separate tables:

Movies Table: Actors Table: Acted_In Table:
wware / .gitignore
Last active August 20, 2025 19:25
Learn jimmer
.gradle/
target/
# btw the java files go in src/main/java/com/example/*.java
wware / iid.md
Last active September 10, 2025 18:27

Immutable Interface Design (IID): Definition and Key Features

Immutable Interface Design (IID) is a proposed architectural pattern for Python development. The primary goals of this protocol are threefold:

  1. Early and Specific Design Documentation: IID emphasizes defining comprehensive interface specifications before writing implementation code. This is achieved through the use of Python's abc module for interface classes, abstract methods with strict type annotations, detailed docstrings for all components, and frozen Pydantic models for immutable data structures[^0_2][^0_10]. This approach creates a clear blueprint, ensuring design details are captured early in the development process[^0_3][^0_6]. Static validation with tools like mypy further enforces type consistency from the outset.
  2. Robust Guardrails for LLM Code Generation: The detailed and validated structure provided by IID serves as effective guardrails when using Large Language Models (LLMs) for code generation[^0_7][^0_11]
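
A minimal sketch of the first goal in the list above: the interface and its data type are written, typed, and documented before any implementation exists. To keep the example dependency-free it uses a frozen standard-library dataclass as a stand-in for the frozen Pydantic model the text describes; that substitution, and the names `PaperRecord`, `PaperParser`, and `FirstLineTitleParser`, are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class PaperRecord:
    """Immutable record (stand-in for a frozen Pydantic model).

    Attributes:
        paper_id: Caller-supplied identifier for the paper.
        title: Title extracted from the raw text; may be empty.
    """
    paper_id: str
    title: str


class PaperParser(ABC):
    """Interface specification, written before any implementation."""

    @abstractmethod
    def parse(self, paper_id: str, raw_text: str) -> PaperRecord:
        """Turn raw text into a PaperRecord.

        Args:
            paper_id: Identifier to attach to the record.
            raw_text: Full text of the paper.

        Returns:
            A new, immutable PaperRecord.
        """


class FirstLineTitleParser(PaperParser):
    """One possible implementation: treat the first line as the title."""

    def parse(self, paper_id: str, raw_text: str) -> PaperRecord:
        lines = raw_text.strip().splitlines()
        title = lines[0].strip() if lines else ""
        return PaperRecord(paper_id=paper_id, title=title)


record = FirstLineTitleParser().parse("p1", "A Study of Things\nBody text")
try:
    record.title = "changed"  # rejected: the record is frozen
except FrozenInstanceError:
    print("PaperRecord is immutable")
```

Running a static checker such as mypy over a module like this verifies that each implementation matches the annotated signature of the interface.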