Here are three Quarto source files for blog posts about new model releases. A new release of Gemini 2.5 Flash just came out. I'd like to compare Gemini 2.5 Flash, as the reference model, against o4-mini, Gemini 2.0 Flash, Gemini 2.5 Pro, GPT-4.1 nano, and Claude 3.7 Sonnet.
Gemini 2.5 Flash is a "small"er model, which makes it comparable to 2.0 Flash. It's also a thinking model, which makes it comparable to o4-mini. o4-mini, being both cheap and a thinking model, is probably the closest analogue.
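For reference, the chat connections will probably look something like the sketch below, following the pattern from the earlier posts. The ellmer constructor names and model ID strings here are from memory, so double-check them against the pasted posts and each provider's docs.

```r
library(ellmer)

# model IDs are best guesses; verify against provider docs
gemini_2_5_flash <- chat_google_gemini(model = "gemini-2.5-flash-preview-04-17")
gemini_2_0_flash <- chat_google_gemini(model = "gemini-2.0-flash")
gemini_2_5_pro <- chat_google_gemini(model = "gemini-2.5-pro-preview-05-06")
o4_mini <- chat_openai(model = "o4-mini-2025-04-16")
gpt_4_1_nano <- chat_openai(model = "gpt-4.1-nano")
claude_3_7_sonnet <- chat_anthropic(model = "claude-3-7-sonnet-latest")
```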
Try to write the source code implementing the eval exactly as I would. As with the o3 and o4-mini post, don't give much explanation of how the vitals package works or what's happening under the hood. When writing the code, do it exactly as you'd imagine I would: just pattern-match what's already there and don't introduce new code comments. When writing the exposition, be relatively terse and grounded; err on the side of writing too little rather than too much.
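The shape I have in mind is something like the sketch below, with one clone-and-eval per model. The Task arguments are from memory of the vitals API, so defer to the pasted posts wherever they differ.

```r
library(vitals)

# `are` is the "An R Eval" dataset that ships with vitals;
# argument names are from memory, so verify against the posts
tsk <- Task$new(
  dataset = are,
  solver = generate(),
  scorer = model_graded_qa(
    scorer_chat = claude_3_7_sonnet,
    partial_credit = TRUE
  ),
  epochs = 3
)

tsk_gemini_2_5_flash <- tsk$clone()
tsk_gemini_2_5_flash$eval(solver_chat = gemini_2_5_flash)
```

Repeat the clone-and-eval for each of the other comparison models.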
The "new Gemini 2.5 Pro update" is the newest blog post—refer to that most closely to pattern-match.
I've also included a screenshot of the pricing; use that screenshot and the previous posts to include the relevant pricing. The screenshot is from https://deepmind.google/models/gemini/flash/
If you link to the previous posts in the exposition (you will need to), those links are:

- https://www.simonpcouch.com/blog/2025-05-07-gemini-2-5-pro-new/
- https://www.simonpcouch.com/blog/2025-04-18-o3-o4-mini/
- https://www.simonpcouch.com/blog/2025-04-15-gpt-4-1/
Don't make any assumptions about how the models will actually compare in performance. Instead, write [TODO: ...] with a note about the type of observation that typically goes in that spot, e.g. [TODO: a sentence on how 2.5 Flash's performance compares to 2.0 Flash's given the price difference].
Include your response as an artifact.
[previous three .qmds pasted here]