cboulanger/d80187144214f667fdaa18115f3ddaab
Last active: March 3, 2026 10:56
Evaluation results for LLM-based TEI annotation via tei-annotator
uv run scripts/evaluate_llm.py --max-items 10 --show-annotations --output-file .local/evaluate-llm.log --match-mode overlap --provider kisski
────────────────────────────────────────────────────────────────
Provider  : Gemini 2.0 Flash
Gold file : tests/fixtures/blbl-examples.tei.xml
Records   : 10   match-mode: overlap
GLiNER    : disabled
────────────────────────────────────────────────────────────────
────────────────────────────────────────────────────────────
Gold: <author><orgName>Commission Inter-IREM Collège</orgName></author> & <author><orgName>Commission Inter-IREM Statistiques et Probabilités</orgName></author>, (<date>2012</date>). <title level="a">Probabilités au collège : ne pas laisser l’enseignement des probabilités au hasard…</title>. Dans <title level="j">Brochure APMEP</title> n°<biblScope unit="volume">198</biblScope>.
Annotation: <orgName><author>Commission Inter-IREM Collège & Commission Inter-IREM Statistiques et Probabilités</author></orgName>, <date>(2012)</date>. <title>Probabilités au collège : ne pas laisser l’enseignement des probabilités au hasard….</title> Dans <title level="s">Brochure APMEP</title> <biblScope>n°198</biblScope>.
F1=0.857 missed=['orgName', 'author'] spurious=[]
────────────────────────────────────────────────────────────
Gold: <author><surname>Russell</surname>, <forename>D.A.</forename> and <forename>Michael</forename> <surname>Winterbottom</surname></author> <date>1989</date> [<date>1972</date>]. <title level="m">Classical Literary Criticism. Oxford World Classics</title>. <pubPlace>Oxford</pubPlace>: <publisher>Oxford UP</publisher>.
Annotation: <author><surname>Russell</surname>, <forename>D.A.</forename></author> and <author><forename>Michael</forename> <surname>Winterbottom</surname></author> <date>1989</date> <date>[1972]</date>. <title level="m">Classical Literary Criticism</title>. <title level="s"><pubPlace>Oxford</pubPlace> World Classics</title>. Oxford: <publisher><orgName>Oxford UP</orgName></publisher>.
F1=0.783 missed=['pubPlace'] spurious=['author', 'pubPlace', 'title', 'orgName']
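The per-record F1 values are consistent with the standard count-based formula F1 = 2·TP / (2·TP + FP + FN), taking FN as the number of `missed` entries, FP as the number of `spurious` entries, and TP as the gold elements minus the missed ones. A minimal sketch under that assumption (`record_f1` is a hypothetical helper, not part of tei-annotator), checked against the Russell record above: its gold annotation has 10 elements (`author`, two `surname`, two `forename`, two `date`, `title`, `pubPlace`, `publisher`), one is missed and four predictions are spurious, so F1 = 18/23:

```python
def record_f1(n_gold: int, missed: list[str], spurious: list[str]) -> float:
    """Span-level F1 for one record from its gold count and error lists.

    Assumes TP = gold elements that were not reported as missed.
    """
    tp = n_gold - len(missed)   # gold elements that found a match
    fp = len(spurious)          # predicted elements with no gold partner
    fn = len(missed)            # gold elements with no predicted partner
    denom = 2 * tp + fp + fn
    # An empty record (no gold, no predictions) is treated as perfect.
    return 2 * tp / denom if denom else 1.0


# Russell record, Gemini 2.0 Flash run: 10 gold elements
f1 = record_f1(10, missed=["pubPlace"],
               spurious=["author", "pubPlace", "title", "orgName"])
print(f"F1={f1:.3f}")  # F1=0.783
```

The same formula gives 24/28 ≈ 0.857 for the Creed record below (13 gold elements, one missed, three spurious).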
────────────────────────────────────────────────────────────
Gold: <label>17.</label><author><surname>Creed</surname> <forename>PA</forename>, <surname>Hicks</surname> <forename>RE</forename>, <surname>Machin</surname> <forename>MA</forename></author>. <title level="a">Behavioural plasticity and mental health outcomes for long-term unemployed attending occupational training programmes</title>. <title level="j">J Occup Org Psychol</title>. <date>1998</date>;<biblScope unit="volume">71</biblScope>: <biblScope unit="page">171-91</biblScope>.
Annotation: <label>17.</label><author><surname>Creed</surname> <forename>PA</forename></author>, <author><surname>Hicks</surname> <forename>RE</forename></author>, <author><surname>Machin</surname> <forename>MA</forename></author>. <title level="a">Behavioural plasticity and mental health outcomes for long-term unemployed attending occupational training programmes</title>. <title level="j">J Occup Org Psychol</title>. <date>1998</date>;<biblScope unit="volume">71</biblScope>: <biblScope unit="page">171-91</biblScope>.
F1=0.857 missed=['author'] spurious=['author', 'author', 'author']
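The log does not spell out what `match-mode: overlap` requires, but the Creed record shows that merely overlapping is not enough: each of the three predicted `<author>` spans overlaps the single gold `<author>`, yet the gold element is reported as missed and all three predictions as spurious. One rule that reproduces the missed and spurious lists of the records above is to pair elements with the same name (attributes ignored, since a `title level="s"` prediction matched a `title level="j"` gold element in the first record) whose character spans have an intersection-over-union of at least 0.5. The sketch below assumes exactly that; the 0.5 threshold, the greedy best-first pairing and the `Span` type are inferred or hypothetical, not tei-annotator's documented behaviour:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    tag: str     # TEI element name, e.g. "author" (attributes ignored)
    start: int   # character offset in the plain reference string
    end: int     # exclusive end offset


def iou(a: Span, b: Span) -> float:
    """Intersection-over-union of two character ranges."""
    inter = max(0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union else 0.0


def match(gold: list[Span], pred: list[Span], threshold: float = 0.5):
    """Greedily pair same-tag spans with IoU >= threshold, best pairs first.

    Returns (missed, spurious) as lists of element names.
    """
    candidates = sorted(
        ((iou(g, p), gi, pi)
         for gi, g in enumerate(gold)
         for pi, p in enumerate(pred)
         if g.tag == p.tag),
        reverse=True,
    )
    used_g, used_p = set(), set()
    for score, gi, pi in candidates:
        if score < threshold:
            break  # candidates are sorted, so all remaining scores are lower
        if gi not in used_g and pi not in used_p:
            used_g.add(gi)
            used_p.add(pi)
    missed = [g.tag for i, g in enumerate(gold) if i not in used_g]
    spurious = [p.tag for i, p in enumerate(pred) if i not in used_p]
    return missed, spurious


text = "17.Creed PA, Hicks RE, Machin MA. Behavioural plasticity"


def span(tag: str, s: str) -> Span:
    """Build a Span from the first occurrence of s in text."""
    start = text.index(s)
    return Span(tag, start, start + len(s))


gold = [span("author", "Creed PA, Hicks RE, Machin MA")]
pred = [span("author", "Creed PA"), span("author", "Hicks RE"),
        span("author", "Machin MA")]
print(match(gold, pred))  # (['author'], ['author', 'author', 'author'])
```

Under the same assumed rule, the first predicted author in the Spickett-Jones record ("Spickett-Jones, J. G.", 21 of the gold span's 33 characters, IoU ≈ 0.64) would match, which fits that record's `missed=[]`.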
────────────────────────────────────────────────────────────
Gold: <label>25.</label> <author><surname>Spickett-Jones</surname>, <forename>J. G.</forename> & <forename>T.-Y.</forename> <surname>Eng</surname></author> (<date>2006</date>). “<title level="a">SMEs and the Strategic Context for Communication</title>”’, <title level="j">Journal of Marketing Communications</title>, Vol. <biblScope unit="volume">12</biblScope>(<biblScope unit="issue">3</biblScope>), <biblScope unit="page">225 - 243</biblScope>.
Annotation: <label>25.</label> <author><surname>Spickett-Jones</surname>, <forename>J. G.</forename></author> & <author><forename>T.-Y.</forename> <surname>Eng</surname></author> <date>(2006)</date>. <title level="a">“SMEs and the Strategic Context for Communication”’</title>, <title level="j">Journal of Marketing Communications</title>, Vol. <biblScope unit="volume">12</biblScope>(<biblScope unit="issue">3</biblScope>), <biblScope unit="page">225 - 243</biblScope>.
F1=0.960 missed=[] spurious=['author']
────────────────────────────────────────────────────────────
Gold: <author><surname>Lillié</surname>, <forename>F.</forename></author>, <title level="m">Analyse tectonique de Gisement Claude</title> (<pubPlace>Cluff Lake, Saskatchewan</pubPlace>). <note type="report">Amok Internal Report</note>. <date>1982</date>.
Annotation: <author><surname>Lillié</surname>, <forename>F.</forename></author>, <title>Analyse tectonique de Gisement Claude (Cluff Lake, Saskatchewan)</title>. <note type="report">Amok Internal Report</note>. <date>1982</date>.
F1=0.923 missed=['pubPlace'] spurious=[]
Completed: 10/10 records
=== Overall — Gemini 2.0 Flash ===
Micro P=0.914 R=0.944 F1=0.929 (TP=85 FP=8 FN=5)
Macro P=0.882 R=0.888 F1=0.882
Per-element breakdown:
  author     P=0.643 R=0.818 F1=0.720 (TP=9 FP=5 FN=2)
  biblScope  P=1.000 R=1.000 F1=1.000 (TP=14 FP=0 FN=0)
  date       P=1.000 R=1.000 F1=1.000 (TP=11 FP=0 FN=0)
  editor     P=1.000 R=1.000 F1=1.000 (TP=1 FP=0 FN=0)
  forename   P=1.000 R=1.000 F1=1.000 (TP=13 FP=0 FN=0)
  label      P=1.000 R=1.000 F1=1.000 (TP=2 FP=0 FN=0)
  note       P=1.000 R=1.000 F1=1.000 (TP=1 FP=0 FN=0)
  orgName    P=0.500 R=0.500 F1=0.500 (TP=1 FP=1 FN=1)
  pubPlace   P=0.500 R=0.333 F1=0.400 (TP=1 FP=1 FN=2)
  publisher  P=1.000 R=1.000 F1=1.000 (TP=2 FP=0 FN=0)
  surname    P=1.000 R=1.000 F1=1.000 (TP=14 FP=0 FN=0)
  title      P=0.941 R=1.000 F1=0.970 (TP=16 FP=1 FN=0)
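The two summary lines aggregate differently. Micro-averaging pools TP, FP and FN over all element names before computing P, R and F1 once; macro-averaging computes the metrics per element name and takes their unweighted mean, so a rare element such as `orgName` (2 gold instances) weighs as much as `surname` (14). The sketch below recomputes both lines from the per-element counts above. It assumes macro F1 is the mean of the per-element F1 values, which reproduces the logged 0.882; the dictionary layout and function names are illustrative only:

```python
# Per-element (TP, FP, FN) counts copied from the Gemini 2.0 Flash breakdown
counts = {
    "author": (9, 5, 2), "biblScope": (14, 0, 0), "date": (11, 0, 0),
    "editor": (1, 0, 0), "forename": (13, 0, 0), "label": (2, 0, 0),
    "note": (1, 0, 0), "orgName": (1, 1, 1), "pubPlace": (1, 1, 2),
    "publisher": (2, 0, 0), "surname": (14, 0, 0), "title": (16, 1, 0),
}


def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return p, r, f1


def micro(counts: dict[str, tuple[int, int, int]]) -> tuple[float, float, float]:
    """Pool TP/FP/FN over all elements, then compute P/R/F1 once."""
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    return prf(tp, fp, fn)


def macro(counts: dict[str, tuple[int, int, int]]) -> tuple[float, float, float]:
    """Compute P/R/F1 per element, then take the unweighted mean of each."""
    per_element = [prf(*c) for c in counts.values()]
    n = len(per_element)
    p, r, f1 = (sum(m[i] for m in per_element) / n for i in range(3))
    return p, r, f1


for name, (p, r, f1) in (("Micro", micro(counts)), ("Macro", macro(counts))):
    print(f"{name} P={p:.3f} R={r:.3f} F1={f1:.3f}")
# Micro P=0.914 R=0.944 F1=0.929
# Macro P=0.882 R=0.888 F1=0.882
```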
Lowest-F1 records (top 5):
  # 4 F1=0.783 missed=['pubPlace'] spurious=['author', 'pubPlace', 'title', 'orgName']
    "Russell, D.A. and Michael Winterbottom 1989 [1972]. Cla..."
  # 2 F1=0.857 missed=['orgName', 'author'] spurious=[]
    "Commission Inter-IREM Collège & Commission Inter-IREM S..."
  # 7 F1=0.857 missed=['author'] spurious=['author', 'author', 'author']
    "17.Creed PA, Hicks RE, Machin MA. Behavioural plasticit..."
  # 9 F1=0.923 missed=['pubPlace'] spurious=[]
    "Lillié, F., Analyse tectonique de Gisement Claude (Cluf..."
  # 8 F1=0.960 missed=[] spurious=['author']
    "25. Spickett-Jones, J. G. & T.-Y. Eng (2006). “SMEs and..."
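The "Lowest-F1 records" list above reads as the records sorted by ascending F1, with ties (records 2 and 7, both at 0.857) broken by record number, and each reference shown as its first 55 characters followed by `...`. A sketch of that report under those assumptions; the `Record` type, the tie-break and the 55-character cut-off are inferred from the output, not taken from `evaluate_llm.py`:

```python
from dataclasses import dataclass


@dataclass
class Record:
    index: int   # 1-based position of the record in the gold file
    f1: float    # per-record F1
    text: str    # plain text of the reference (tags stripped)


def lowest_f1(records: list[Record], k: int = 5, width: int = 55) -> list[str]:
    """Format the k worst records: ascending F1, ties broken by index."""
    worst = sorted(records, key=lambda r: (r.f1, r.index))[:k]
    lines = []
    for r in worst:
        excerpt = r.text[:width] + ("..." if len(r.text) > width else "")
        lines.append(f"#{r.index:2d} F1={r.f1:.3f}")
        lines.append(f'    "{excerpt}"')
    return lines


# Three records from the Gemini 2.0 Flash run (reference strings shortened)
records = [
    Record(7, 0.857, "17.Creed PA, Hicks RE, Machin MA. Behavioural "
                     "plasticity and mental health outcomes"),
    Record(4, 0.783, "Russell, D.A. and Michael Winterbottom 1989 [1972]. "
                     "Classical Literary Criticism. Oxford World Classics."),
    Record(2, 0.857, "Commission Inter-IREM Collège & Commission Inter-IREM "
                     "Statistiques et Probabilités, (2012)."),
]
print("\n".join(lowest_f1(records)))
```

Sorting on the `(f1, index)` tuple keeps the output deterministic when several records share the same score, which is the behaviour the two 0.857 entries suggest.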
────────────────────────────────────────────────────────────────
Provider  : KISSKI / llama-3.3-70b-instruct
Gold file : tests/fixtures/blbl-examples.tei.xml
Records   : 10   match-mode: overlap
GLiNER    : disabled
────────────────────────────────────────────────────────────────
────────────────────────────────────────────────────────────
Gold: <author><orgName>Commission Inter-IREM Collège</orgName></author> & <author><orgName>Commission Inter-IREM Statistiques et Probabilités</orgName></author>, (<date>2012</date>). <title level="a">Probabilités au collège : ne pas laisser l’enseignement des probabilités au hasard…</title>. Dans <title level="j">Brochure APMEP</title> n°<biblScope unit="volume">198</biblScope>.
Annotation: <author><orgName>Commission Inter-IREM Collège</orgName> & <orgName>Commission Inter-IREM Statistiques et Probabilités</orgName></author>, (<date>2012</date>). <title level="a">Probabilités au collège : ne pas laisser l’enseignement des probabilités au hasard….</title> Dans <title level="m">Brochure APMEP n°198</title>.
F1=0.857 missed=['author', 'biblScope'] spurious=[]
────────────────────────────────────────────────────────────
Gold: <author><surname>Russell</surname>, <forename>D.A.</forename> and <forename>Michael</forename> <surname>Winterbottom</surname></author> <date>1989</date> [<date>1972</date>]. <title level="m">Classical Literary Criticism. Oxford World Classics</title>. <pubPlace>Oxford</pubPlace>: <publisher>Oxford UP</publisher>.
Annotation: <author><surname>Russell</surname>, <forename>D.A.</forename> and <forename>Michael</forename> <surname>Winterbottom</surname></author> <date>1989 [1972]</date>. <title level="m">Classical Literary Criticism</title>. <title level="s">Oxford World Classics</title>. <pubPlace>Oxford</pubPlace>: <publisher>Oxford UP</publisher>.
F1=0.800 missed=['date', 'date'] spurious=['date', 'title']
────────────────────────────────────────────────────────────
Gold: <label>17.</label><author><surname>Creed</surname> <forename>PA</forename>, <surname>Hicks</surname> <forename>RE</forename>, <surname>Machin</surname> <forename>MA</forename></author>. <title level="a">Behavioural plasticity and mental health outcomes for long-term unemployed attending occupational training programmes</title>. <title level="j">J Occup Org Psychol</title>. <date>1998</date>;<biblScope unit="volume">71</biblScope>: <biblScope unit="page">171-91</biblScope>.
Annotation: <label>17</label>.<author><surname>Creed</surname> <forename>PA</forename></author>, <author><surname>Hicks</surname> <forename>RE</forename></author>, <author><surname>Machin</surname> <forename>MA</forename></author>. <title level="a">Behavioural plasticity and mental health outcomes for long-term unemployed attending occupational training programmes</title>. <title level="j">J Occup Org Psychol</title>. <date>1998</date>;<biblScope unit="volume">71</biblScope>: <biblScope unit="page">171-91</biblScope>.
F1=0.857 missed=['author'] spurious=['author', 'author', 'author']
────────────────────────────────────────────────────────────
Gold: <label>25.</label> <author><surname>Spickett-Jones</surname>, <forename>J. G.</forename> & <forename>T.-Y.</forename> <surname>Eng</surname></author> (<date>2006</date>). “<title level="a">SMEs and the Strategic Context for Communication</title>”’, <title level="j">Journal of Marketing Communications</title>, Vol. <biblScope unit="volume">12</biblScope>(<biblScope unit="issue">3</biblScope>), <biblScope unit="page">225 - 243</biblScope>.
Annotation: <label>25</label>. <author><surname>Spickett-Jones</surname>, <forename>J. G.</forename> & <forename>T.-Y.</forename> <surname>Eng</surname></author> (<date>2006</date>). “SMEs and the Strategic Context for Communication”’, <title level="j">Journal of Marketing Communications</title>, <biblScope unit="volume">Vol. 12(3)</biblScope>, <biblScope unit="page">225 - 243</biblScope>.
F1=0.818 missed=['title', 'biblScope', 'biblScope'] spurious=['biblScope']
Completed: 10/10 records
=== Overall — KISSKI / llama-3.3-70b-instruct ===
Micro P=0.932 R=0.911 F1=0.921 (TP=82 FP=6 FN=8)
Macro P=0.959 R=0.947 F1=0.952
Per-element breakdown:
  author     P=0.750 R=0.818 F1=0.783 (TP=9 FP=3 FN=2)
  biblScope  P=0.917 R=0.786 F1=0.846 (TP=11 FP=1 FN=3)
  date       P=0.900 R=0.818 F1=0.857 (TP=9 FP=1 FN=2)
  editor     P=1.000 R=1.000 F1=1.000 (TP=1 FP=0 FN=0)
  forename   P=1.000 R=1.000 F1=1.000 (TP=13 FP=0 FN=0)
  label      P=1.000 R=1.000 F1=1.000 (TP=2 FP=0 FN=0)
  note       P=1.000 R=1.000 F1=1.000 (TP=1 FP=0 FN=0)
  orgName    P=1.000 R=1.000 F1=1.000 (TP=2 FP=0 FN=0)
  pubPlace   P=1.000 R=1.000 F1=1.000 (TP=3 FP=0 FN=0)
  publisher  P=1.000 R=1.000 F1=1.000 (TP=2 FP=0 FN=0)
  surname    P=1.000 R=1.000 F1=1.000 (TP=14 FP=0 FN=0)
  title      P=0.938 R=0.938 F1=0.938 (TP=15 FP=1 FN=1)
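The two runs rank differently depending on the average: Gemini 2.0 Flash has the higher micro F1 (0.929 vs 0.921), while llama-3.3-70b-instruct has the higher macro F1 (0.952 vs 0.882). Macro averaging gives every element name equal weight, so Gemini's errors on the rare `orgName` and `pubPlace` elements (2 and 3 gold instances) cost it heavily there, whereas Llama's misses fall on frequent elements (`biblScope`, `date`, `title`) whose errors micro averaging dilutes. A small sketch that recomputes per-element F1 for both runs from the logged counts; only elements whose counts differ between the runs are included, and the dictionary layout is illustrative:

```python
# (TP, FP, FN) per element, copied from the two per-element breakdowns;
# elements with identical counts in both runs are left out.
gemini = {"author": (9, 5, 2), "biblScope": (14, 0, 0), "date": (11, 0, 0),
          "orgName": (1, 1, 1), "pubPlace": (1, 1, 2), "title": (16, 1, 0)}
llama = {"author": (9, 3, 2), "biblScope": (11, 1, 3), "date": (9, 1, 2),
         "orgName": (2, 0, 0), "pubPlace": (3, 0, 0), "title": (15, 1, 1)}


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); 0.0 if there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


for element in gemini:
    g_f1, l_f1 = f1(*gemini[element]), f1(*llama[element])
    gold = gemini[element][0] + gemini[element][2]  # gold instances = TP + FN
    print(f"{element:<10} gold={gold:<3} gemini={g_f1:.3f} "
          f"llama={l_f1:.3f} diff={l_f1 - g_f1:+.3f}")
```

The `gold` column makes the weighting visible: the largest per-element gains for Llama are on elements with only 2 or 3 gold instances, while its losses are on elements with 11 to 16.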
Lowest-F1 records (top 5):
  # 4 F1=0.800 missed=['date', 'date'] spurious=['date', 'title']
    "Russell, D.A. and Michael Winterbottom 1989 [1972]. Cla..."
  # 8 F1=0.818 missed=['title', 'biblScope', 'biblScope'] spurious=['biblScope']
    "25. Spickett-Jones, J. G. & T.-Y. Eng (2006). “SMEs and..."
  # 2 F1=0.857 missed=['author', 'biblScope'] spurious=[]
    "Commission Inter-IREM Collège & Commission Inter-IREM S..."
  # 7 F1=0.857 missed=['author'] spurious=['author', 'author', 'author']
    "17.Creed PA, Hicks RE, Machin MA. Behavioural plasticit..."