Below is a practical recipe + Node.js scaffolding for turning your 250-page Markdown dump into three linguistics-aware assets:
- CLDF StructureTable – grammar-feature spreadsheet
- XIGT (JSON) – interlinear examples
- OntoLex-Lemon (JSON-LD) – lexicon entries
The same pattern works for any other schema; just swap the “output-format” system message.
| Step | Why it matters | How to maximise accuracy |
|---|---|---|
| Pre-chunk by heading | 2–4 k-token windows keep context coherent | Use the Markdown `##` / `###` structure you already have. |
| Pre-tag text | Regex annotations help the LLM “see” examples | e.g. mark interlinear blocks: `KY: …` / `EN: …` |
| Provide an explicit schema | LLMs hallucinate less when you quote field names literally | Put the JSON Schema / CSV header in the system prompt. |
| Few-shot | Show 1–2 hand-crafted gold examples for each format | Hard-code them in the prompt. |
| Ask for only JSON/CSV | Post-parsing becomes trivial | Wrap output with sentinel tokens: `<START_JSON> … <END_JSON>`. |
| Validate + back off | Catch hallucinations | `ajv` for JSON Schema, `papaparse` for CSV; if invalid, retry with a “Fix the …” prompt. |
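For the pre-tagging step, the exact regexes depend on how examples are laid out in your Markdown; as a rough sketch, assuming vernacular lines are italicised and free translations are quoted (adjust both patterns to your own conventions):

```js
// Tag candidate interlinear blocks so the LLM can spot them reliably.
function preTag(markdown) {
  return markdown
    .replace(/^\*(.+)\*\s*$/gm, 'KY: $1')         // italicised vernacular line
    .replace(/^[‘'"](.+)[’'"]\s*$/gm, 'EN: $1');  // quoted free translation
}
```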
Project setup:

```bash
npm init -y
npm i openai remark remark-parse strip-markdown ajv papaparse dotenv
```

```
# .env
OPENAI_API_KEY=sk-……
MODEL=o3
```
```js
// extract.js
import fs from 'fs/promises';
import { remark } from 'remark';        // bundles remark-parse + remark-stringify
import strip from 'strip-markdown';
import OpenAI from 'openai';
import Papa from 'papaparse';
import Ajv from 'ajv';
import dotenv from 'dotenv';

dotenv.config();

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});
// ---------- load & chunk ----------
const md = await fs.readFile('grammar.md', 'utf8');
const processor = remark().use(strip);
const tree = processor.parse(md);

// Split the top-level mdast nodes into sections, starting a new section
// at every heading of depth <= 3.
function splitByHeading(root) {
  const sections = [];
  let current = [];
  for (const node of root.children) {
    if (node.type === 'heading' && node.depth <= 3) {
      if (current.length) sections.push(current);
      current = [];
    }
    current.push(node);
  }
  if (current.length) sections.push(current);
  return sections;
}

const chunks = splitByHeading(tree).map(nodes => {
  // strip-markdown runs as a transform; stringifying the result gives plain text
  const plain = processor.runSync({ type: 'root', children: nodes });
  return processor.stringify(plain).trim();
});
// ---------- helper ----------
async function callLLM(system, user, { maxTokens = 2048, json = true } = {}) {
  const res = await openai.chat.completions.create({
    model: process.env.MODEL,
    temperature: 0,
    messages: [
      { role: 'system', content: system },
      { role: 'user', content: user }
    ],
    // JSON mode only for calls that must return a JSON object;
    // the CSV call switches it off.
    ...(json ? { response_format: { type: 'json_object' } } : {}),
    max_tokens: maxTokens
  });
  return res.choices[0].message.content;
}
// ---------- schemas ----------
const cldfHeader = 'ID,Parameter_ID,Language_ID,Value,Source';
const ajv = new Ajv();
const xigtSchema = JSON.parse(
  await fs.readFile('./schemas/xigt.json', 'utf8')
);
const validateIGT = ajv.compile(xigtSchema);
// ---------- main loop ----------
const cldfRows = [];
const igtObjects = [];
const lexEntries = [];

for (const chunk of chunks) {
  // 1. Grammar features → CLDF
  const csvSystem = `
You are an expert linguist. Output a CSV with header:
${cldfHeader}
where each row captures ONE grammatical fact of Kuku Yalanji.
Use Parameter_ID = short kebab-case label (e.g. "ergative-optional"),
Language_ID = "gvn",
Source = "Patz §" + section heading if available.
Return ONLY CSV rows.
`.trim();

  const csv = await callLLM(csvSystem, chunk, { maxTokens: 512, json: false });
  const parsed = Papa.parse(csv.trim(), { skipEmptyLines: true });
  if (!parsed.errors.length) cldfRows.push(csv.trim());

  // 2. Interlinear examples → XIGT
  if (/KY:/.test(chunk)) {
    const igtSys = `
Return an object with an array named "items" where each item is XIGT JSON with
"transcript", "gloss", "translation", and "source".
`.trim();
    const raw = await callLLM(igtSys, chunk, { maxTokens: 1024 });
    try {
      const obj = JSON.parse(raw);
      if (validateIGT(obj)) igtObjects.push(...obj.items);
    } catch {
      console.warn('Skipping chunk: XIGT response was not valid JSON');
    }
  }

  // 3. Lexicon snippets → OntoLex
  // Placeholder test – replace with whatever marks lemmas in your chunks.
  if (/bolded lexeme pattern/.test(chunk)) {
    const ontoSys = `
Convert every bolded Kuku Yalanji lemma in the text to
OntoLex-Lemon JSON-LD objects; include English gloss and partOfSpeech.
Return {"@graph":[…]} only.
`.trim();
    const onto = JSON.parse(await callLLM(ontoSys, chunk, { maxTokens: 1024 }));
    lexEntries.push(...onto['@graph']);
  }
}
// ---------- save ----------
// The LLM already returns CSV text, so we just prepend the header.
await fs.writeFile('structure.csv', [cldfHeader, ...cldfRows].join('\n') + '\n', 'utf8');
await fs.writeFile('examples.xigt.json', JSON.stringify({ items: igtObjects }, null, 2));
await fs.writeFile('kuku_lexicon.jsonld', JSON.stringify({ "@graph": lexEntries }, null, 2));
console.log('✓ extraction finished');
```
What the script does
- Chunks the Markdown by heading ≤ h3 (keeps each topic self-contained).
- Calls the LLM up to three times per chunk, once for each output schema that applies.
- Validates CSV rows with Papaparse and XIGT JSON with AJV; a retry loop for invalid output is sketched after this list.
- Appends CSV rows / JSON objects; writes three files.
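The validation bullet only skips invalid output; a minimal sketch of the retry idea, reusing `callLLM` and `validateIGT` from the script above (`callWithRepair` is a hypothetical helper, not part of any library):

```js
// If the model's JSON fails schema validation, send the AJV errors back
// with a "Fix the …" instruction and try again a couple of times.
async function callWithRepair(system, user, validate, maxRetries = 2) {
  let raw = await callLLM(system, user, { maxTokens: 1024 });
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const obj = JSON.parse(raw);
      if (validate(obj)) return obj;
    } catch { /* fall through to the repair prompt */ }
    raw = await callLLM(
      system,
      `Fix the following output so it conforms to the schema.\n` +
      `Validation errors: ${JSON.stringify(validate.errors ?? 'not valid JSON')}\n\n${raw}`,
      { maxTokens: 1024 }
    );
  }
  return null; // give up; the caller can log and skip this chunk
}
```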
Tips for accuracy
- Show the schema in the system message – not in the user chunk.
- Use a low temperature (0–0.2).
- Ask for deterministic delimiters (`{"@graph":[…]}` only); see the sketch after this list.
- Always test with 2–3 curated gold examples; one wrong heading can derail extraction.
- Iterative refinement: run extraction once, spot-fix the worst headings, feed the corrected chunk back (“Here is corrected text – regenerate”).
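To make the few-shot and delimiter bullets concrete, here is one possible shape, with an invented gold row and a small helper that cuts the answer out of the response:

```js
// One hand-crafted gold example anchors the format; sentinel tokens make
// the answer trivial to slice out of the raw response.
const fewShotSystem = `
Output ONLY CSV rows between <START_CSV> and <END_CSV>, header:
ID,Parameter_ID,Language_ID,Value,Source
Example:
<START_CSV>
ex-1,ergative-optional,gvn,"Ergative marking is optional on pronouns",Patz §3.2
<END_CSV>
`.trim();

function extractBetween(text, start = '<START_CSV>', end = '<END_CSV>') {
  const match = text.match(new RegExp(`${start}([\\s\\S]*?)${end}`));
  return match ? match[1].trim() : null;
}
```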
If accuracy is still shaky, do a two-step chain:
- “Identify all potential lemma/gloss pairs, return as plain list.”
- Feed that list to a second prompt that serialises into OntoLex JSON.
(The “extract then structure” pattern lowers hallucination rate.)
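A rough sketch of that two-step chain, reusing the `callLLM` helper (both prompts and the "lemma = gloss" line format are illustrative only):

```js
// Step 1: extract as a plain list – no serialisation pressure yet.
const pairList = await callLLM(
  'List every Kuku Yalanji lemma and its English gloss, one "lemma = gloss" per line.',
  chunk,
  { maxTokens: 512, json: false }
);

// Step 2: serialise the (optionally hand-checked) list into OntoLex-Lemon.
const ontoRaw = await callLLM(
  'Convert each "lemma = gloss" line into an OntoLex-Lemon JSON-LD entry. Return {"@graph":[…]} only.',
  pairList,
  { maxTokens: 1024 }
);
const graph = JSON.parse(ontoRaw)['@graph'];
```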
psql kuku -c "\copy cldf_structure FROM 'structure.csv' CSV HEADER"
psql kuku -c "\copy xigt_json FROM program 'jq -c \".items[]\" examples.xigt.json'"
Vector embeddings:
```sql
ALTER TABLE section ADD COLUMN embedding vector(1536);

-- openai_embed() stands in for an embedding UDF (e.g. a pgai-style wrapper)
UPDATE section
SET embedding = openai_embed(md_text)
WHERE embedding IS NULL;

CREATE INDEX ON section USING hnsw (embedding vector_cosine_ops);
```
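If you would rather fill the column from Node instead of via a database-side function, a rough sketch reusing the `openai` client from extract.js (the model choice and the `section` table layout are assumptions):

```js
import pg from 'pg';

const client = new pg.Client({ connectionString: process.env.DATABASE_URL });
await client.connect();

const { rows } = await client.query(
  'SELECT id, md_text FROM section WHERE embedding IS NULL'
);
for (const row of rows) {
  const res = await openai.embeddings.create({
    model: 'text-embedding-3-small', // 1536 dimensions, matching vector(1536)
    input: row.md_text
  });
  // pgvector accepts an array literal such as "[0.1,0.2,…]" cast to vector
  await client.query(
    'UPDATE section SET embedding = $1::vector WHERE id = $2',
    [JSON.stringify(res.data[0].embedding), row.id]
  );
}
await client.end();
```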
Next steps
- Write a validator bot – prompt o3: “Does this XIGT object violate the schema? Answer yes/no and fix.”
- Integrate into a LangChain RAG – a loader fetches the nearest `section.embedding` rows and decorates translator prompts.
- Publish the resulting files on GitHub with an open license so others can reuse your Kuku Yalanji data.
This workflow keeps everything modern (Node 18+, Postgres 16, embeddings, JSON-LD) yet reversible — you can always hand-edit the CSV or JSON and reload.