We want to programmatically probe the game-specific knowledge of different instruction-trained LLMs. For that, we need a large and diverse question collection. Size is easy to measure, but diversity is trickier. To assess diversity, we can check how the questions are distributed across a common vocabulary of categories.

Depth categories (from Bloom’s Taxonomy):
- Knowledge (recall)
- Comprehension (understand)
- Application (solve a new problem)
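
As a rough illustration of checking that distribution, the sketch below counts how many questions fall into each depth category and summarizes the spread with a normalized entropy score. It assumes every question has already been tagged with one category. The function names, the entropy-based score, and the sample labels are hypothetical choices made for this example, not part of any existing tool.

```python
from collections import Counter
import math

# Depth categories from the list above (Bloom's Taxonomy levels).
DEPTH_CATEGORIES = ["Knowledge", "Comprehension", "Application"]


def category_distribution(labels, categories=DEPTH_CATEGORIES):
    """Return the fraction of questions in each category (0.0 if unused)."""
    counts = Counter(labels)
    unknown = set(counts) - set(categories)
    if unknown:
        raise ValueError(f"labels outside the vocabulary: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        return {c: 0.0 for c in categories}
    return {c: counts[c] / total for c in categories}


def normalized_entropy(distribution):
    """Shannon entropy scaled to [0, 1]; 1.0 means a perfectly even spread."""
    probs = [p for p in distribution.values() if p > 0]
    if len(distribution) < 2 or not probs:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(distribution))


# Hypothetical labels for a tiny question set (placeholder data only).
labels = ["Knowledge", "Knowledge", "Comprehension", "Application"]
dist = category_distribution(labels)
print(dist)
print(round(normalized_entropy(dist), 3))
```

A score near 1.0 suggests the questions are spread evenly across the categories, while a score near 0.0 suggests they cluster in one. Entropy is only one possible summary; inspecting the raw per-category fractions is often just as informative.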