@thomasdavis
Created April 24, 2025 15:06

Below is a practical recipe + Node.js scaffolding for turning your 250-page Markdown dump into three linguistics-aware assets:

  1. CLDF StructureTable – grammar-feature spreadsheet
  2. XIGT (JSON) – interlinear examples
  3. OntoLex-Lemon (JSON-LD) – lexicon entries

The same pattern works for any other schema; just swap the “output-format” system message.
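
To make the targets concrete, here is roughly one record per format. The field names follow the prompts used in the script below; the values are placeholder examples, not real data from the grammar.

CLDF StructureTable row (CSV):
ID,Parameter_ID,Language_ID,Value,Source
1,ergative-optional,gvn,…,"Patz §…"

XIGT-style item (simplified to the fields the script asks for):
{ "transcript": "…", "gloss": "…", "translation": "…", "source": "Patz §…" }

OntoLex-Lemon entry (one plausible shallow shape for an object inside "@graph"):
{ "@type": "ontolex:LexicalEntry", "canonicalForm": { "writtenRep": "…" }, "partOfSpeech": "…", "gloss": "…" }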


0 — Make the LLM’s job easy

For each prep step, why it matters and how to maximise accuracy:

  • Pre-chunk by heading. Why: 2–4 k-token windows keep context coherent. How: use the Markdown ## / ### structure you already have.
  • Pre-tag text. Why: regex annotations help the LLM “see” examples. How: mark interlinear blocks, e.g. KY: … / EN: ….
  • Provide an explicit schema. Why: LLMs hallucinate less when you quote field names literally. How: put the JSON Schema / CSV header in the system prompt.
  • Few-shot. Why: 1–2 hand-crafted gold examples anchor each format. How: hard-code them in the prompt.
  • Ask for only JSON/CSV. Why: post-parsing becomes trivial. How: wrap output with sentinel tokens: <START_JSON> … <END_JSON> (a helper sketch follows this list).
  • Validate + back-off. Why: it catches hallucinations. How: ajv for JSON Schema, papaparse for CSV; if invalid, retry with a “Fix the …” prompt.
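
For the sentinel-token row, a minimal helper could look like the sketch below; the function name is my own, not part of the script later in this post. It is useful when you cannot rely on a JSON response format and the model adds chatter around the payload.

// Minimal sketch: cut the payload out from between sentinel tokens before parsing.
function extractBetween(text, start = '<START_JSON>', end = '<END_JSON>') {
  const from = text.indexOf(start);
  const to = text.lastIndexOf(end);
  if (from === -1 || to === -1 || to < from) return null;   // sentinels missing or malformed
  return text.slice(from + start.length, to).trim();
}

// e.g. JSON.parse(extractBetween(llmOutput) ?? '{}')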

1 — Bootstrap the project

npm init -y
npm i openai unified remark-parse remark-stringify strip-markdown ajv papaparse dotenv
# .env
OPENAI_API_KEY=sk-……
MODEL=o3

2 — extract.js (minimal skeleton)

// extract.js
import fs from 'fs/promises';
import { unified } from 'unified';
import remarkParse from 'remark-parse';
import remarkStringify from 'remark-stringify';
import strip from 'strip-markdown';
import OpenAI from 'openai';
import Papa from 'papaparse';
import Ajv from 'ajv';
import dotenv from 'dotenv';
dotenv.config();

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

// ---------- load & chunk ----------
const md = await fs.readFile('grammar.md', 'utf8');
const tree = unified().use(remarkParse).parse(md);

// Split the tree into chunks, starting a new chunk at every heading of depth ≤ 3.
function chunkByHeading(root) {
  const chunks = [];
  let current = [];
  for (const node of root.children) {
    if (node.type === 'heading' && node.depth <= 3 && current.length) {
      chunks.push(current);
      current = [];
    }
    current.push(node);
  }
  if (current.length) chunks.push(current);
  return chunks;
}

// strip-markdown runs as a transformer (runSync); remark-stringify provides the compiler.
const toPlainText = unified().use(strip).use(remarkStringify);
const chunks = chunkByHeading(tree).map(nodes =>
  toPlainText.stringify(toPlainText.runSync({ type: 'root', children: nodes }))
);

// ---------- helper ----------
async function callLLM(system, user, { maxTokens = 2048, json = false } = {}) {
  const res = await openai.chat.completions.create({
    model: process.env.MODEL,
    temperature: 0,
    messages: [
      { role: 'system', content: system },
      { role: 'user', content: user }
    ],
    // Only request JSON mode for calls that actually want JSON back (the CSV call does not).
    ...(json ? { response_format: { type: 'json_object' } } : {}),
    max_tokens: maxTokens
  });
  return res.choices[0].message.content;
}

// ---------- schemas ----------
const cldfHeader = 'ID,Parameter_ID,Language_ID,Value,Source';
const ajv = new Ajv();
const xigtSchema = JSON.parse(
  await fs.readFile('./schemas/xigt.json', 'utf8')
);
const validateIGT = ajv.compile(xigtSchema);

// ---------- main loop ----------
const cldfRows = [];
const igtObjects = [];
const lexEntries = [];

for (const chunk of chunks) {
  // 1. Grammar features → CLDF
  const csvSystem = `
You are an expert linguist. Output a CSV with header:
${cldfHeader}
where each row captures ONE grammatical fact of Kuku Yalanji.
Use Parameter_ID = short kebab-case label (e.g. "ergative-optional"),
Language_ID = "gvn",
Source = "Patz §"+section heading if available.
Return ONLY CSV rows.
  `.trim();

  const csv = await callLLM(csvSystem, chunk, { maxTokens: 512 });
  cldfRows.push(csv.trim());

  // 2. Interlinear examples → XIGT
  if (/KY:/.test(chunk)) {
    const igtSys = `
Return a JSON object {"items": [ … ]} where each item is an XIGT-style record with
"transcript", "gloss", "translation", and "source".
  `;
    const raw = await callLLM(igtSys, chunk, { maxTokens: 1024, json: true });
    const obj = JSON.parse(raw);
    if (validateIGT(obj)) igtObjects.push(...obj.items);
  }

  // 3. Lexicon snippets → OntoLex
  // Placeholder: swap in a regex that matches how lemmas are marked in your dump
  // (after strip-markdown the **bold** markers are gone, so match on the text itself).
  if (/bolded lexeme pattern/.test(chunk)) {
    const ontoSys = `
Convert every bolded Kuku Yalanji lemma in the text to
OntoLex-Lemon JSON-LD objects; include English gloss and partOfSpeech.
Return {"@graph":[…]} only.
    `;
    const onto = JSON.parse(await callLLM(ontoSys, chunk, { maxTokens: 1024, json: true }));
    lexEntries.push(...onto['@graph']);
  }
}

// ---------- save ----------
// papaparse double-checks the accumulated rows before they are written out.
const csvText = [cldfHeader, ...cldfRows].join('\n');
const parsed = Papa.parse(csvText, { header: true });
if (parsed.errors.length) console.warn('CSV issues:', parsed.errors);
await fs.writeFile('structure.csv', csvText, 'utf8');
await fs.writeFile('examples.xigt.json', JSON.stringify({ items: igtObjects }, null, 2));
await fs.writeFile('kuku_lexicon.jsonld', JSON.stringify({ "@graph": lexEntries }, null, 2));

console.log('✓ extraction finished');

What the script does

  1. Chunks the Markdown by heading ≤ h3 (keeps each topic self-contained).
  2. Calls the LLM up to three times per chunk, once for each output schema that applies.
  3. Validates XIGT JSON with AJV; retries could be added if invalid (a repair-loop sketch follows this list).
  4. Appends CSV rows / JSON objects; writes three files.
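
If the validator rejects an object, the “Validate + back-off” idea from step 0 can be wired in as a small repair loop. This is only a sketch; it reuses callLLM and a compiled AJV validator such as validateIGT from the script above.

// Parse and validate the model output; on failure, send it back with a "fix it" prompt.
async function callWithRepair(system, chunk, validate, repairs = 2) {
  let raw = await callLLM(system, chunk, { maxTokens: 1024, json: true });
  for (let attempt = 0; ; attempt++) {
    try {
      const obj = JSON.parse(raw);
      if (validate(obj)) return obj;            // schema-valid: done
    } catch { /* not even parseable JSON */ }
    if (attempt >= repairs) return null;        // give up after `repairs` fix attempts
    raw = await callLLM(
      'The previous output was invalid JSON or violated the schema. Fix it and return ONLY the corrected JSON.',
      raw,
      { maxTokens: 1024, json: true }
    );
  }
}

Swapping callWithRepair(igtSys, chunk, validateIGT) into the XIGT branch would replace the bare JSON.parse / validateIGT pair.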

3 — Prompt design tips

  • Show the schema in the system message – not the user chunk.
  • Use low temperature (0–0.2).
  • Ask for deterministic delimiters ({"@graph":[ … ]} only).
  • Always test with 2 – 3 curated gold examples; one wrong heading can derail extraction.
  • Iterative refinement: run extraction once, spot-fix the worst headings, feed the corrected chunk back (“Here is the corrected text – regenerate”).

If accuracy is still shaky, do a two-step chain (sketched after the list):

  1. “Identify all potential lemma/gloss pairs, return as plain list.”
  2. Feed that list to a second prompt that serialises into OntoLex JSON.
    (The “extract then structure” pattern lowers hallucination rate.)
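
A rough sketch of that chain, reusing callLLM from extract.js; the prompt wording and the "lemma: gloss" line format are illustrative assumptions, not tested prompts.

// Step 1 extracts plain text (easy for the model, easy to eyeball);
// Step 2 serialises the already-extracted list into OntoLex-Lemon JSON-LD.
async function twoStepLexicon(chunk) {
  const pairs = await callLLM(
    'List every potential Kuku Yalanji lemma/gloss pair in the text, one "lemma: gloss" per line. Return ONLY the list.',
    chunk,
    { maxTokens: 512 }
  );
  const raw = await callLLM(
    'Convert each "lemma: gloss" line into an OntoLex-Lemon JSON-LD object with an English gloss and partOfSpeech. Return {"@graph":[...]} only.',
    pairs,
    { maxTokens: 1024, json: true }
  );
  return JSON.parse(raw)['@graph'] ?? [];
}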

4 — Loading into Postgres

psql kuku -c "\copy cldf_structure FROM 'structure.csv' CSV HEADER"
psql kuku -c "\copy xigt_json FROM program 'jq -c \".items[]\" examples.xigt.json'"

Vector embeddings (requires the pgvector extension; openai_embed() is a stand-in for your embedding function – a client-side Node alternative follows the SQL):

ALTER TABLE section ADD COLUMN embedding vector(1536);
UPDATE section
  SET embedding = openai_embed(md_text)
  WHERE embedding IS NULL;
CREATE INDEX ON section USING hnsw (embedding vector_cosine_ops);
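
If you would rather compute the vectors client-side than via a SQL function, a rough Node equivalent looks like this. It assumes npm i pg, an id column plus the md_text and embedding columns from the SQL above, and OpenAI's 1536-dimension text-embedding-3-small model.

// Fill section.embedding from Node: fetch rows without vectors, embed, write back.
import OpenAI from 'openai';
import pg from 'pg';

const openai = new OpenAI();
const db = new pg.Client({ connectionString: process.env.DATABASE_URL });
await db.connect();

const { rows } = await db.query('SELECT id, md_text FROM section WHERE embedding IS NULL');
for (const row of rows) {
  const res = await openai.embeddings.create({
    model: 'text-embedding-3-small',   // 1536 dims, matches vector(1536)
    input: row.md_text
  });
  const literal = `[${res.data[0].embedding.join(',')}]`;  // pgvector text format
  await db.query('UPDATE section SET embedding = $1 WHERE id = $2', [literal, row.id]);
}
await db.end();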

Where to go next

  • Write a validator bot – prompt o3: “Does this XIGT object violate the schema? Answer yes/no and fix.”
  • Integrate into a LangChain RAG – loader fetches nearest section.embedding, decorates translator prompts (query sketch after this list).
  • Publish the resulting files on GitHub with an open license so others can reuse your Kuku data.
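
The nearest-section lookup behind that loader is a single pgvector query. This sketch reuses the pg client from the embedding snippet above and assumes queryVec is the question's embedding in the same bracketed literal form.

// <=> is pgvector's cosine-distance operator, served by the vector_cosine_ops index above.
const { rows: nearest } = await db.query(
  'SELECT id, md_text FROM section ORDER BY embedding <=> $1 LIMIT 3',
  [queryVec]
);
// Paste nearest[i].md_text into the translator prompt as grounding context.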

This workflow keeps everything modern (Node 18+, Postgres 16, embeddings, JSON-LD) yet reversible — you can always hand-edit the CSV or JSON and reload.
