@cboulanger
Last active March 3, 2026 10:56
Evaluation results for LLM-based TEI annotation via tei-annotator
uv run scripts/evaluate_llm.py --max-items 10 --show-annotations --output-file .local/evaluate-llm.log --match-mode overlap --provider kisski
────────────────────────────────────────────────────────────────
Provider : Gemini 2.0 Flash
Gold file : tests/fixtures/blbl-examples.tei.xml
Records : 10 match-mode: overlap
GLiNER : disabled
────────────────────────────────────────────────────────────────
────────────────────────────────────────────────────────────
Gold: <author><orgName>Commission Inter-IREM Collège</orgName></author> &amp; <author><orgName>Commission Inter-IREM Statistiques et Probabilités</orgName></author>, (<date>2012</date>). <title level="a">Probabilités au collège : ne pas laisser l’enseignement des probabilités au hasard…</title>. Dans <title level="j">Brochure APMEP</title> n°<biblScope unit="volume">198</biblScope>.
Annotation: <orgName><author>Commission Inter-IREM Collège & Commission Inter-IREM Statistiques et Probabilités</author></orgName>, <date>(2012)</date>. <title>Probabilités au collège : ne pas laisser l’enseignement des probabilités au hasard….</title> Dans <title level="s">Brochure APMEP</title> <biblScope>n°198</biblScope>.
F1=0.857 missed=['orgName', 'author'] spurious=[]
────────────────────────────────────────────────────────────
Gold: <author><surname>Russell</surname>, <forename>D.A.</forename> and <forename>Michael</forename> <surname>Winterbottom</surname></author> <date>1989</date> [<date>1972</date>]. <title level="m">Classical Literary Criticism. Oxford World Classics</title>. <pubPlace>Oxford</pubPlace>: <publisher>Oxford UP</publisher>.
Annotation: <author><surname>Russell</surname>, <forename>D.A.</forename></author> and <author><forename>Michael</forename> <surname>Winterbottom</surname></author> <date>1989</date> <date>[1972]</date>. <title level="m">Classical Literary Criticism</title>. <title level="s"><pubPlace>Oxford</pubPlace> World Classics</title>. Oxford: <publisher><orgName>Oxford UP</orgName></publisher>.
F1=0.783 missed=['pubPlace'] spurious=['author', 'pubPlace', 'title', 'orgName']
────────────────────────────────────────────────────────────
Gold: <label>17.</label><author><surname>Creed</surname> <forename>PA</forename>, <surname>Hicks</surname> <forename>RE</forename>, <surname>Machin</surname> <forename>MA</forename></author>. <title level="a">Behavioural plasticity and mental health outcomes for long-term unemployed attending occupational training programmes</title>. <title level="j">J Occup Org Psychol</title>. <date>1998</date>;<biblScope unit="volume">71</biblScope>: <biblScope unit="page">171-91</biblScope>.
Annotation: <label>17.</label><author><surname>Creed</surname> <forename>PA</forename></author>, <author><surname>Hicks</surname> <forename>RE</forename></author>, <author><surname>Machin</surname> <forename>MA</forename></author>. <title level="a">Behavioural plasticity and mental health outcomes for long-term unemployed attending occupational training programmes</title>. <title level="j">J Occup Org Psychol</title>. <date>1998</date>;<biblScope unit="volume">71</biblScope>: <biblScope unit="page">171-91</biblScope>.
F1=0.857 missed=['author'] spurious=['author', 'author', 'author']
────────────────────────────────────────────────────────────
Gold: <label>25.</label> <author><surname>Spickett-Jones</surname>, <forename>J. G.</forename> &amp; <forename>T.-Y.</forename> <surname>Eng</surname></author> (<date>2006</date>). “<title level="a">SMEs and the Strategic Context for Communication</title>”’, <title level="j">Journal of Marketing Communications</title>, Vol. <biblScope unit="volume">12</biblScope>(<biblScope unit="issue">3</biblScope>), <biblScope unit="page">225 - 243</biblScope>.
Annotation: <label>25.</label> <author><surname>Spickett-Jones</surname>, <forename>J. G.</forename></author> & <author><forename>T.-Y.</forename> <surname>Eng</surname></author> <date>(2006)</date>. <title level="a">“SMEs and the Strategic Context for Communication”’</title>, <title level="j">Journal of Marketing Communications</title>, Vol. <biblScope unit="volume">12</biblScope>(<biblScope unit="issue">3</biblScope>), <biblScope unit="page">225 - 243</biblScope>.
F1=0.960 missed=[] spurious=['author']
────────────────────────────────────────────────────────────
Gold: <author><surname>Lillié</surname>, <forename>F.</forename></author>, <title level="m">Analyse tectonique de Gisement Claude</title> (<pubPlace>Cluff Lake, Saskatchewan</pubPlace>). <note type="report">Amok Internal Report</note>. <date>1982</date>.
Annotation: <author><surname>Lillié</surname>, <forename>F.</forename></author>, <title>Analyse tectonique de Gisement Claude (Cluff Lake, Saskatchewan)</title>. <note type="report">Amok Internal Report</note>. <date>1982</date>.
F1=0.923 missed=['pubPlace'] spurious=[]
Completed: 10/10 records
=== Overall — Gemini 2.0 Flash ===
Micro P=0.914 R=0.944 F1=0.929 (TP=85 FP=8 FN=5)
Macro P=0.882 R=0.888 F1=0.882
Per-element breakdown:
author P=0.643 R=0.818 F1=0.720 (TP=9 FP=5 FN=2)
biblScope P=1.000 R=1.000 F1=1.000 (TP=14 FP=0 FN=0)
date P=1.000 R=1.000 F1=1.000 (TP=11 FP=0 FN=0)
editor P=1.000 R=1.000 F1=1.000 (TP=1 FP=0 FN=0)
forename P=1.000 R=1.000 F1=1.000 (TP=13 FP=0 FN=0)
label P=1.000 R=1.000 F1=1.000 (TP=2 FP=0 FN=0)
note P=1.000 R=1.000 F1=1.000 (TP=1 FP=0 FN=0)
orgName P=0.500 R=0.500 F1=0.500 (TP=1 FP=1 FN=1)
pubPlace P=0.500 R=0.333 F1=0.400 (TP=1 FP=1 FN=2)
publisher P=1.000 R=1.000 F1=1.000 (TP=2 FP=0 FN=0)
surname P=1.000 R=1.000 F1=1.000 (TP=14 FP=0 FN=0)
title P=0.941 R=1.000 F1=0.970 (TP=16 FP=1 FN=0)
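The micro and macro rows above can be reproduced from the per-element counts. The following is a minimal sketch, not the code in `scripts/evaluate_llm.py`: the names `Counts`, `prf`, `micro`, `macro` and `GEMINI` are hypothetical, and it assumes that micro scores pool TP/FP/FN over all elements, that macro scores are the unweighted mean of the per-element scores, and that an empty denominator yields 0.0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    tp: int  # true positives
    fp: int  # false positives (spurious)
    fn: int  # false negatives (missed)


def prf(c: Counts) -> tuple[float, float, float]:
    """Precision, recall and F1 for one set of counts.

    Assumption: an empty denominator gives 0.0; the real evaluator
    may handle that case differently.
    """
    p = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    r = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


# Per-element counts copied from the Gemini 2.0 Flash breakdown above.
GEMINI = {
    "author": Counts(9, 5, 2),
    "biblScope": Counts(14, 0, 0),
    "date": Counts(11, 0, 0),
    "editor": Counts(1, 0, 0),
    "forename": Counts(13, 0, 0),
    "label": Counts(2, 0, 0),
    "note": Counts(1, 0, 0),
    "orgName": Counts(1, 1, 1),
    "pubPlace": Counts(1, 1, 2),
    "publisher": Counts(2, 0, 0),
    "surname": Counts(14, 0, 0),
    "title": Counts(16, 1, 0),
}


def micro(per_element: dict[str, Counts]) -> tuple[float, float, float]:
    """Pool the counts over all elements, then score once."""
    total = Counts(
        sum(c.tp for c in per_element.values()),
        sum(c.fp for c in per_element.values()),
        sum(c.fn for c in per_element.values()),
    )
    return prf(total)


def macro(per_element: dict[str, Counts]) -> tuple[float, float, float]:
    """Score each element separately, then take the unweighted mean."""
    scores = [prf(c) for c in per_element.values()]
    n = len(scores)
    return (
        sum(s[0] for s in scores) / n,
        sum(s[1] for s in scores) / n,
        sum(s[2] for s in scores) / n,
    )


if __name__ == "__main__":
    print("Micro P=%.3f R=%.3f F1=%.3f" % micro(GEMINI))  # 0.914 0.944 0.929
    print("Macro P=%.3f R=%.3f F1=%.3f" % macro(GEMINI))  # 0.882 0.888 0.882
```

Under these assumptions the pooled counts come to TP=85 FP=8 FN=5, matching the Micro row. The macro scores sit below the micro scores here because the low-count elements `orgName` and `pubPlace` weigh as much as `surname` or `title` in an unweighted mean.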
Lowest-F1 records (top 5):
# 4 F1=0.783 missed=['pubPlace'] spurious=['author', 'pubPlace', 'title', 'orgName']
"Russell, D.A. and Michael Winterbottom 1989 [1972]. Cla..."
# 2 F1=0.857 missed=['orgName', 'author'] spurious=[]
"Commission Inter-IREM Collège & Commission Inter-IREM S..."
# 7 F1=0.857 missed=['author'] spurious=['author', 'author', 'author']
"17.Creed PA, Hicks RE, Machin MA. Behavioural plasticit..."
# 9 F1=0.923 missed=['pubPlace'] spurious=[]
"Lillié, F., Analyse tectonique de Gisement Claude (Cluf..."
# 8 F1=0.960 missed=[] spurious=['author']
"25. Spickett-Jones, J. G. & T.-Y. Eng (2006). “SMEs and..."
────────────────────────────────────────────────────────────────
Provider : KISSKI / llama-3.3-70b-instruct
Gold file : tests/fixtures/blbl-examples.tei.xml
Records : 10 match-mode: overlap
GLiNER : disabled
────────────────────────────────────────────────────────────────
────────────────────────────────────────────────────────────
Gold: <author><orgName>Commission Inter-IREM Collège</orgName></author> &amp; <author><orgName>Commission Inter-IREM Statistiques et Probabilités</orgName></author>, (<date>2012</date>). <title level="a">Probabilités au collège : ne pas laisser l’enseignement des probabilités au hasard…</title>. Dans <title level="j">Brochure APMEP</title> n°<biblScope unit="volume">198</biblScope>.
Annotation: <author><orgName>Commission Inter-IREM Collège</orgName> & <orgName>Commission Inter-IREM Statistiques et Probabilités</orgName></author>, (<date>2012</date>). <title level="a">Probabilités au collège : ne pas laisser l’enseignement des probabilités au hasard….</title> Dans <title level="m">Brochure APMEP n°198</title>.
F1=0.857 missed=['author', 'biblScope'] spurious=[]
────────────────────────────────────────────────────────────
Gold: <author><surname>Russell</surname>, <forename>D.A.</forename> and <forename>Michael</forename> <surname>Winterbottom</surname></author> <date>1989</date> [<date>1972</date>]. <title level="m">Classical Literary Criticism. Oxford World Classics</title>. <pubPlace>Oxford</pubPlace>: <publisher>Oxford UP</publisher>.
Annotation: <author><surname>Russell</surname>, <forename>D.A.</forename> and <forename>Michael</forename> <surname>Winterbottom</surname></author> <date>1989 [1972]</date>. <title level="m">Classical Literary Criticism</title>. <title level="s">Oxford World Classics</title>. <pubPlace>Oxford</pubPlace>: <publisher>Oxford UP</publisher>.
F1=0.800 missed=['date', 'date'] spurious=['date', 'title']
────────────────────────────────────────────────────────────
Gold: <label>17.</label><author><surname>Creed</surname> <forename>PA</forename>, <surname>Hicks</surname> <forename>RE</forename>, <surname>Machin</surname> <forename>MA</forename></author>. <title level="a">Behavioural plasticity and mental health outcomes for long-term unemployed attending occupational training programmes</title>. <title level="j">J Occup Org Psychol</title>. <date>1998</date>;<biblScope unit="volume">71</biblScope>: <biblScope unit="page">171-91</biblScope>.
Annotation: <label>17</label>.<author><surname>Creed</surname> <forename>PA</forename></author>, <author><surname>Hicks</surname> <forename>RE</forename></author>, <author><surname>Machin</surname> <forename>MA</forename></author>. <title level="a">Behavioural plasticity and mental health outcomes for long-term unemployed attending occupational training programmes</title>. <title level="j">J Occup Org Psychol</title>. <date>1998</date>;<biblScope unit="volume">71</biblScope>: <biblScope unit="page">171-91</biblScope>.
F1=0.857 missed=['author'] spurious=['author', 'author', 'author']
────────────────────────────────────────────────────────────
Gold: <label>25.</label> <author><surname>Spickett-Jones</surname>, <forename>J. G.</forename> &amp; <forename>T.-Y.</forename> <surname>Eng</surname></author> (<date>2006</date>). “<title level="a">SMEs and the Strategic Context for Communication</title>”’, <title level="j">Journal of Marketing Communications</title>, Vol. <biblScope unit="volume">12</biblScope>(<biblScope unit="issue">3</biblScope>), <biblScope unit="page">225 - 243</biblScope>.
Annotation: <label>25</label>. <author><surname>Spickett-Jones</surname>, <forename>J. G.</forename> & <forename>T.-Y.</forename> <surname>Eng</surname></author> (<date>2006</date>). “SMEs and the Strategic Context for Communication”’, <title level="j">Journal of Marketing Communications</title>, <biblScope unit="volume">Vol. 12(3)</biblScope>, <biblScope unit="page">225 - 243</biblScope>.
F1=0.818 missed=['title', 'biblScope', 'biblScope'] spurious=['biblScope']
Completed: 10/10 records
=== Overall — KISSKI / llama-3.3-70b-instruct ===
Micro P=0.932 R=0.911 F1=0.921 (TP=82 FP=6 FN=8)
Macro P=0.959 R=0.947 F1=0.952
Per-element breakdown:
author P=0.750 R=0.818 F1=0.783 (TP=9 FP=3 FN=2)
biblScope P=0.917 R=0.786 F1=0.846 (TP=11 FP=1 FN=3)
date P=0.900 R=0.818 F1=0.857 (TP=9 FP=1 FN=2)
editor P=1.000 R=1.000 F1=1.000 (TP=1 FP=0 FN=0)
forename P=1.000 R=1.000 F1=1.000 (TP=13 FP=0 FN=0)
label P=1.000 R=1.000 F1=1.000 (TP=2 FP=0 FN=0)
note P=1.000 R=1.000 F1=1.000 (TP=1 FP=0 FN=0)
orgName P=1.000 R=1.000 F1=1.000 (TP=2 FP=0 FN=0)
pubPlace P=1.000 R=1.000 F1=1.000 (TP=3 FP=0 FN=0)
publisher P=1.000 R=1.000 F1=1.000 (TP=2 FP=0 FN=0)
surname P=1.000 R=1.000 F1=1.000 (TP=14 FP=0 FN=0)
title P=0.938 R=0.938 F1=0.938 (TP=15 FP=1 FN=1)
Lowest-F1 records (top 5):
# 4 F1=0.800 missed=['date', 'date'] spurious=['date', 'title']
"Russell, D.A. and Michael Winterbottom 1989 [1972]. Cla..."
# 8 F1=0.818 missed=['title', 'biblScope', 'biblScope'] spurious=['biblScope']
"25. Spickett-Jones, J. G. & T.-Y. Eng (2006). “SMEs and..."
# 2 F1=0.857 missed=['author', 'biblScope'] spurious=[]
"Commission Inter-IREM Collège & Commission Inter-IREM S..."
# 7 F1=0.857 missed=['author'] spurious=['author', 'author', 'author']
"17.Creed PA, Hicks RE, Machin MA. Behavioural plasticit..."