@thomasdavis
Created April 24, 2025 15:06

Below is a practical recipe + Node.js scaffolding for turning your 250-page Markdown dump into three linguistics-aware assets:

  1. CLDF StructureTable – grammar-feature spreadsheet
  2. XIGT (JSON) – interlinear examples
  3. OntoLex-Lemon (JSON-LD) – lexicon entries

The same pattern works for any other schema; just swap the “output-format” system message.
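
To make the targets concrete, here is roughly one record per format. The field names follow the prompts used in the script below; the values are placeholder examples, not real data from the grammar.

CLDF StructureTable row (CSV):
ID,Parameter_ID,Language_ID,Value,Source
1,ergative-optional,gvn,…,"Patz §…"

XIGT-style item (simplified to the fields the script asks for):
{ "transcript": "…", "gloss": "…", "translation": "…", "source": "Patz §…" }

OntoLex-Lemon entry (one plausible shallow shape for an object inside "@graph"):
{ "@type": "ontolex:LexicalEntry", "canonicalForm": { "writtenRep": "…" }, "partOfSpeech": "…", "gloss": "…" }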


0 — Make the LLM’s job easy

For each prep step, why it matters and how to maximise accuracy:

  • Pre-chunk by heading. Why: 2–4 k-token windows keep context coherent. How: use the Markdown ## / ### structure you already have.
  • Pre-tag text. Why: regex annotations help the LLM “see” examples. How: mark interlinear blocks, e.g. KY: … / EN: ….
  • Provide an explicit schema. Why: LLMs hallucinate less when you quote field names literally. How: put the JSON Schema / CSV header in the system prompt.
  • Few-shot. Why: 1–2 hand-crafted gold examples anchor each format. How: hard-code them in the prompt.
  • Ask for only JSON/CSV. Why: post-parsing becomes trivial. How: wrap output with sentinel tokens: <START_JSON> … <END_JSON> (a helper sketch follows this list).
  • Validate + back-off. Why: it catches hallucinations. How: ajv for JSON Schema, papaparse for CSV; if invalid, retry with a “Fix the …” prompt.
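
For the sentinel-token row, a minimal helper could look like the sketch below; the function name is my own, not part of the script later in this post. It is useful when you cannot rely on a JSON response format and the model adds chatter around the payload.

// Minimal sketch: cut the payload out from between sentinel tokens before parsing.
function extractBetween(text, start = '<START_JSON>', end = '<END_JSON>') {
  const from = text.indexOf(start);
  const to = text.lastIndexOf(end);
  if (from === -1 || to === -1 || to < from) return null;   // sentinels missing or malformed
  return text.slice(from + start.length, to).trim();
}

// e.g. JSON.parse(extractBetween(llmOutput) ?? '{}')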

1 — Bootstrap the project

npm init -y
npm i openai unified remark-parse remark-stringify strip-markdown ajv papaparse dotenv
# .env
OPENAI_API_KEY=sk-……
MODEL=o3

2 — extract.js (minimal skeleton)

// extract.js
import fs from 'fs/promises';
import { unified } from 'unified';
import remarkParse from 'remark-parse';
import remarkStringify from 'remark-stringify';
import strip from 'strip-markdown';
import OpenAI from 'openai';
import Papa from 'papaparse';
import Ajv from 'ajv';
import dotenv from 'dotenv';
dotenv.config();

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

// ---------- load & chunk ----------
const md = await fs.readFile('grammar.md', 'utf8');
const tree = unified().use(remarkParse).parse(md);

// Split the tree into chunks, starting a new chunk at every heading of depth ≤ 3.
function chunkByHeading(root) {
  const chunks = [];
  let current = [];
  for (const node of root.children) {
    if (node.type === 'heading' && node.depth <= 3 && current.length) {
      chunks.push(current);
      current = [];
    }
    current.push(node);
  }
  if (current.length) chunks.push(current);
  return chunks;
}

// strip-markdown runs as a transformer (runSync); remark-stringify provides the compiler.
const toPlainText = unified().use(strip).use(remarkStringify);
const chunks = chunkByHeading(tree).map(nodes =>
  toPlainText.stringify(toPlainText.runSync({ type: 'root', children: nodes }))
);

// ---------- helper ----------
async function callLLM(system, user, { maxTokens = 2048, json = false } = {}) {
  const res = await openai.chat.completions.create({
    model: process.env.MODEL,
    temperature: 0,
    messages: [
      { role: 'system', content: system },
      { role: 'user', content: user }
    ],
    // Only request JSON mode for calls that actually want JSON back (the CSV call does not).
    ...(json ? { response_format: { type: 'json_object' } } : {}),
    max_tokens: maxTokens
  });
  return res.choices[0].message.content;
}

// ---------- schemas ----------
const cldfHeader = 'ID,Parameter_ID,Language_ID,Value,Source';
const ajv = new Ajv();
const xigtSchema = JSON.parse(
  await fs.readFile('./schemas/xigt.json', 'utf8')
);
const validateIGT = ajv.compile(xigtSchema);

// ---------- main loop ----------
const cldfRows = [];
const igtObjects = [];
const lexEntries = [];

for (const chunk of chunks) {
  // 1. Grammar features → CLDF
  const csvSystem = `
You are an expert linguist. Output a CSV with header:
${cldfHeader}
where each row captures ONE grammatical fact of Kuku Yalanji.
Use Parameter_ID = short kebab-case label (e.g. "ergative-optional"),
Language_ID = "gvn",
Source = "Patz §"+section heading if available.
Return ONLY CSV rows.
  `.trim();

  const csv = await callLLM(csvSystem, chunk, { maxTokens: 512 });
  cldfRows.push(csv.trim());

  // 2. Interlinear examples → XIGT
  if (/KY:/.test(chunk)) {
    const igtSys = `
Return a JSON object {"items": [ … ]} where each item is an XIGT-style record with
"transcript", "gloss", "translation", and "source".
  `;
    const raw = await callLLM(igtSys, chunk, { maxTokens: 1024, json: true });
    const obj = JSON.parse(raw);
    if (validateIGT(obj)) igtObjects.push(...obj.items);
  }

  // 3. Lexicon snippets → OntoLex
  // Placeholder: swap in a regex that matches how lemmas are marked in your dump
  // (after strip-markdown the **bold** markers are gone, so match on the text itself).
  if (/bolded lexeme pattern/.test(chunk)) {
    const ontoSys = `
Convert every bolded Kuku Yalanji lemma in the text to
OntoLex-Lemon JSON-LD objects; include English gloss and partOfSpeech.
Return {"@graph":[…]} only.
    `;
    const onto = JSON.parse(await callLLM(ontoSys, chunk, { maxTokens: 1024, json: true }));
    lexEntries.push(...onto['@graph']);
  }
}

// ---------- save ----------
// papaparse double-checks the accumulated rows before they are written out.
const csvText = [cldfHeader, ...cldfRows].join('\n');
const parsed = Papa.parse(csvText, { header: true });
if (parsed.errors.length) console.warn('CSV issues:', parsed.errors);
await fs.writeFile('structure.csv', csvText, 'utf8');
await fs.writeFile('examples.xigt.json', JSON.stringify({ items: igtObjects }, null, 2));
await fs.writeFile('kuku_lexicon.jsonld', JSON.stringify({ "@graph": lexEntries }, null, 2));

console.log('✓ extraction finished');

What the script does

  1. Chunks the Markdown by heading ≤ h3 (keeps each topic self-contained).
  2. Calls the LLM up to three times per chunk, once for each output schema that applies.
  3. Validates XIGT JSON with AJV; retries could be added if invalid (a repair-loop sketch follows this list).
  4. Appends CSV rows / JSON objects; writes three files.
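
If the validator rejects an object, the “Validate + back-off” idea from step 0 can be wired in as a small repair loop. This is only a sketch; it reuses callLLM and a compiled AJV validator such as validateIGT from the script above.

// Parse and validate the model output; on failure, send it back with a "fix it" prompt.
async function callWithRepair(system, chunk, validate, repairs = 2) {
  let raw = await callLLM(system, chunk, { maxTokens: 1024, json: true });
  for (let attempt = 0; ; attempt++) {
    try {
      const obj = JSON.parse(raw);
      if (validate(obj)) return obj;            // schema-valid: done
    } catch { /* not even parseable JSON */ }
    if (attempt >= repairs) return null;        // give up after `repairs` fix attempts
    raw = await callLLM(
      'The previous output was invalid JSON or violated the schema. Fix it and return ONLY the corrected JSON.',
      raw,
      { maxTokens: 1024, json: true }
    );
  }
}

Swapping callWithRepair(igtSys, chunk, validateIGT) into the XIGT branch would replace the bare JSON.parse / validateIGT pair.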

3 — Prompt design tips

  • Show the schema in the system message – not the user chunk.
  • Use low temperature (0–0.2).
  • Ask for deterministic delimiters ({"@graph":[ … ]} only).
  • Always test with 2 – 3 curated gold examples; one wrong heading can derail extraction.
  • Iterative refinement: run extraction once, spot-fix the worst headings, feed the corrected chunk back (“Here is the corrected text – regenerate”).

If accuracy is still shaky, do a two-step chain (sketched after the list):

  1. “Identify all potential lemma/gloss pairs, return as plain list.”
  2. Feed that list to a second prompt that serialises into OntoLex JSON.
    (The “extract then structure” pattern lowers hallucination rate.)
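
A rough sketch of that chain, reusing callLLM from extract.js; the prompt wording and the "lemma: gloss" line format are illustrative assumptions, not tested prompts.

// Step 1 extracts plain text (easy for the model, easy to eyeball);
// Step 2 serialises the already-extracted list into OntoLex-Lemon JSON-LD.
async function twoStepLexicon(chunk) {
  const pairs = await callLLM(
    'List every potential Kuku Yalanji lemma/gloss pair in the text, one "lemma: gloss" per line. Return ONLY the list.',
    chunk,
    { maxTokens: 512 }
  );
  const raw = await callLLM(
    'Convert each "lemma: gloss" line into an OntoLex-Lemon JSON-LD object with an English gloss and partOfSpeech. Return {"@graph":[...]} only.',
    pairs,
    { maxTokens: 1024, json: true }
  );
  return JSON.parse(raw)['@graph'] ?? [];
}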

4 — Loading into Postgres

psql kuku -c "\copy cldf_structure FROM 'structure.csv' CSV HEADER"
psql kuku -c "\copy xigt_json FROM program 'jq -c \".items[]\" examples.xigt.json'"

Vector embeddings (requires the pgvector extension; openai_embed() is a stand-in for your embedding function – a client-side Node alternative follows the SQL):

ALTER TABLE section ADD COLUMN embedding vector(1536);
UPDATE section
  SET embedding = openai_embed(md_text)
  WHERE embedding IS NULL;
CREATE INDEX ON section USING hnsw (embedding vector_cosine_ops);
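
If you would rather compute the vectors client-side than via a SQL function, a rough Node equivalent looks like this. It assumes npm i pg, an id column plus the md_text and embedding columns from the SQL above, and OpenAI's 1536-dimension text-embedding-3-small model.

// Fill section.embedding from Node: fetch rows without vectors, embed, write back.
import OpenAI from 'openai';
import pg from 'pg';

const openai = new OpenAI();
const db = new pg.Client({ connectionString: process.env.DATABASE_URL });
await db.connect();

const { rows } = await db.query('SELECT id, md_text FROM section WHERE embedding IS NULL');
for (const row of rows) {
  const res = await openai.embeddings.create({
    model: 'text-embedding-3-small',   // 1536 dims, matches vector(1536)
    input: row.md_text
  });
  const literal = `[${res.data[0].embedding.join(',')}]`;  // pgvector text format
  await db.query('UPDATE section SET embedding = $1 WHERE id = $2', [literal, row.id]);
}
await db.end();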

Where to go next

  • Write a validator bot – prompt o3: “Does this XIGT object violate the schema? Answer yes/no and fix.”
  • Integrate into a LangChain RAG – loader fetches nearest section.embedding, decorates translator prompts (query sketch after this list).
  • Publish the resulting files on GitHub with an open license so others can reuse your Kuku data.
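
The nearest-section lookup behind that loader is a single pgvector query. This sketch reuses the pg client from the embedding snippet above and assumes queryVec is the question's embedding in the same bracketed literal form.

// <=> is pgvector's cosine-distance operator, served by the vector_cosine_ops index above.
const { rows: nearest } = await db.query(
  'SELECT id, md_text FROM section ORDER BY embedding <=> $1 LIMIT 3',
  [queryVec]
);
// Paste nearest[i].md_text into the translator prompt as grounding context.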

This workflow keeps everything modern (Node 18+, Postgres 16, embeddings, JSON-LD) yet reversible — you can always hand-edit the CSV or JSON and reload.
