LLMs excel at information-seeking and creative writing tasks. They have significantly improved in math and coding too. But how do they perform in logical reasoning?
To evaluate the logical reasoning abilities of LLMs, we have created a benchmark named ZebraLogic. Each example is a Logic Grid Puzzle, also known as a Zebra Puzzle. In each puzzle, we are given N houses (numbered 1 to N from left to right) and M features for each house. There are N distinct values for each feature, and each house must have a unique value for each feature. Given a list of clues, one should be able to deduce a unique correct assignment of values. The logic grid puzzle is a typical Constraint Satisfaction Problem (CSP) and is often used to test humans' logical reasoning abilities in exams such as the Law School Admission Test (LSAT).
- Leaderboard: https://huggingface.co/spaces/allenai/ZebraLogic
- Data: https://huggingface.co/datasets/allenai/ZebraLogicBench
- Code for evaluation: https://github.com/yuchenlin/ZeroEval
Here is an example of a 2x3 puzzle (2 houses x 3 features):
ZebraLogic Bench Example; id=[lgp-test-2x3-1]:
There are 2 houses, numbered 1 to 2 from left to right.
Each house is occupied by a different person.
Each house has a unique attribute for each of the following characteristics:
- Each person has a unique name: **Arnold, Eric**
- People own unique car models: **ford f150, tesla model 3**
- The people keep unique animals: **cat, horse**
**Clues**:
1. Eric is directly left of the person who owns a Tesla Model 3.
2. The person who keeps horses is in the first house.
Reasoning steps:
- From Clue 1, we know that Eric is to the left of someone, so he must be the owner of House 1, because House 2 is the rightmost house.
- Additionally, we know that the person in House 2 must be Arnold, and he owns a Tesla Model 3. Thus, Eric owns a Ford F150.
- From Clue 2, we know that Eric keeps horses in House 1, which means the other house must keep cats. Finally, we arrive at the unique solution to this puzzle.
The solution presented in table format:
Houses | Name | CarModel | Animal |
---|---|---|---|
1 | Eric | ford f150 | horse |
2 | Arnold | tesla model 3 | cat |
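
The uniqueness of this solution can be checked by brute force. The snippet below is a minimal sketch rather than the benchmark's own solver: it enumerates every assignment for the 2x3 example and keeps those that satisfy both clues. The variable names and the tuple-based grid layout are assumptions made for illustration.

```python
from itertools import permutations

names = ["Arnold", "Eric"]
cars = ["ford f150", "tesla model 3"]
animals = ["cat", "horse"]

def solve():
    """Return every assignment that satisfies both clues of the example."""
    solutions = []
    # Each permutation assigns one value per house; index 0 is House 1.
    for name in permutations(names):
        for car in permutations(cars):
            for animal in permutations(animals):
                # Clue 1: Eric is directly left of the Tesla Model 3 owner.
                clue1 = name.index("Eric") + 1 == car.index("tesla model 3")
                # Clue 2: the person who keeps horses is in the first house.
                clue2 = animal.index("horse") == 0
                if clue1 and clue2:
                    solutions.append(list(zip(name, car, animal)))
    return solutions

solutions = solve()
for house, row in enumerate(solutions[0], start=1):
    print(house, *row, sep=" | ")
```

Exactly one assignment survives, and it matches the table above. Exhaustive search like this is only practical for small grids, since the number of candidates grows as (N!)^M.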
We programmatically created 1,000 such puzzles, with sizes ranging from 2x2 to 6x6, and there are 40 puzzles for each size. We test large language models (LLMs) by providing a one-shot example with reasoning steps and the JSON-formatted solution. We instruct the LLMs to first output their reasoning and then present their answers in the same format as shown in the in-context example.
We have two major metrics: puzzle-level accuracy and cell-wise accuracy. For each puzzle of size NxM, there are NxM cells to fill in, and we compute the cell-wise accuracy as the proportion of correctly filled cells. A puzzle is counted as a puzzle-level success only when all cells are filled with correct values. Additionally, we divide the 1000 puzzles into two subsets: easy and hard puzzles, based on their sizes.
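
The two metrics can be sketched as follows. This is an illustrative implementation, not the code from the evaluation repository: the nested-dict grid layout, the `evaluate` function name, and the choice to pool cells across all puzzles when computing cell-wise accuracy are assumptions.

```python
def evaluate(predictions, solutions):
    """Return (puzzle-level accuracy, cell-wise accuracy) over a list of puzzles.

    Each grid is assumed to look like {"House 1": {"Name": "Eric", ...}, ...}.
    """
    solved = correct_cells = total_cells = 0
    for pred, gold in zip(predictions, solutions):
        # Count the cells whose predicted value matches the gold value.
        hits = sum(
            pred.get(house, {}).get(feature) == value
            for house, features in gold.items()
            for feature, value in features.items()
        )
        cells = sum(len(features) for features in gold.values())
        solved += hits == cells  # a puzzle counts only if every cell is right
        correct_cells += hits
        total_cells += cells
    return solved / len(solutions), correct_cells / total_cells

gold = {
    "House 1": {"Name": "Eric", "CarModel": "ford f150", "Animal": "horse"},
    "House 2": {"Name": "Arnold", "CarModel": "tesla model 3", "Animal": "cat"},
}
# A second prediction with the two animals swapped (2 of 6 cells wrong).
wrong = {
    "House 1": {**gold["House 1"], "Animal": "cat"},
    "House 2": {**gold["House 2"], "Animal": "horse"},
}
puzzle_acc, cell_acc = evaluate([gold, wrong], [gold, gold])
print(puzzle_acc, round(cell_acc, 3))
```

With one perfect prediction and one prediction that gets 4 of 6 cells right, puzzle-level accuracy is 0.5 while cell-wise accuracy is 10/12, which shows how the stricter metric hides partial progress.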
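For an NxM-size Zebra puzzle (N houses and M features), the probability of correctly guessing the assignment for each feature by random chance is $\frac{1}{N!}$, so the probability of guessing all cells correctly is $\left(\frac{1}{N!}\right)^M$. The base-10 logarithms of these probabilities are shown in the table below: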
N | M=2 | M=3 | M=4 | M=5 | M=6 |
---|---|---|---|---|---|
2 | -0.602060 | -0.903090 | -1.204120 | -1.505150 | -1.806180 |
3 | -1.556303 | -2.334454 | -3.112605 | -3.890756 | -4.668908 |
4 | -2.760422 | -4.140634 | -5.520845 | -6.901056 | -8.281267 |
5 | -4.158362 | -6.237544 | -8.316725 | -10.395906 | -12.475087 |
6 | -5.714665 | -8.571997 | -11.429330 | -14.286662 | -17.143995 |
We set a threshold for the logarithmic value at -3. Therefore, all puzzles smaller than 3x3 are considered easy, while those of size 3x3 and larger are considered hard.
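
The values in the table can be reproduced with a few lines of Python. Each feature is an independent permutation of N values, so a uniform random guess fills all cells correctly with probability (1/N!)^M, and its base-10 logarithm is -M * log10(N!). The function name below is an assumption chosen for illustration.

```python
import math

def log10_guess_probability(n_houses, n_features):
    """log10 of (1/N!)^M, the chance of guessing every cell of an NxM puzzle."""
    return -n_features * math.log10(math.factorial(n_houses))

# Print one row per N (2 to 6), with one column per M (2 to 6).
for n in range(2, 7):
    row = [f"{log10_guess_probability(n, m):.6f}" for m in range(2, 7)]
    print(n, *row, sep=" | ")
```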
Humans can solve the puzzles by strategically reasoning with the constraints presented in the clues, using deliberate thinking methods such as reductio ad absurdum and the process of elimination. However, LLMs are still weak in such logical reasoning tasks. The best LLM, Claude 3.5 Sonnet, can only solve 33.4% of all puzzles and just 12.4% of the hard puzzles. Smaller language models with 7 to 10 billion parameters struggle to solve hard puzzles and also exhibit low accuracy on easy puzzles.
Our results indicate that LLMs still lack several abilities commonly required for complex logical reasoning, including counterfactual thinking, reflective reasoning, structured memorization, and compositional generalization.
View all results on our leaderboard: https://huggingface.co/spaces/allenai/ZebraLogic
Recent research shows that greedy decoding usually leads to better performance in hard reasoning tasks. However, in our case, some models can degenerate when they generate reasoning steps (e.g., start to repeatedly decode the same sentences). Thus, we also use sampling with a 0.5 temperature for some models. A few models get higher accuracy when using sampling, but most models perform better with greedy decoding.
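
For readers unfamiliar with the two decoding modes, the toy sketch below shows the difference at the level of a single token choice. It is not the evaluation code and does not call any model: `next_token` and the example logits are hypothetical, and a temperature of 0 is used here as a convention for greedy decoding.

```python
import math
import random

def next_token(logits, temperature=0.0, rng=random):
    """Pick a token index greedily (temperature 0) or by temperature sampling."""
    if temperature == 0.0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=logits.__getitem__)
    # Temperature sampling: rescale the logits, then sample from the softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights)[0]

logits = [1.0, 3.0, 2.0]  # hypothetical scores for three candidate tokens
greedy_choice = next_token(logits)
sampled_choice = next_token(logits, temperature=0.5, rng=random.Random(0))
print(greedy_choice, sampled_choice)
```

Greedy decoding is deterministic, which is why a model can get stuck repeating the same sentence; a moderate temperature such as 0.5 still favors the top token but leaves room to escape such loops.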
Zebra puzzles can be synthetically generated by programs:
- We start by defining a set of features and their possible values (e.g., the feature `CarModel` might have values like `Tesla Model 3`, `Ford F150`, etc.).
- Next, we establish the clue types and their language templates, which include placeholders for values to be filled in. Each clue type is logically structured to describe a type of constraint that can involve multiple variables.
- To create a ZebraLogic example, we randomly assign values on a sampled grid as the solution. Then, we enumerate all possible clues that can describe the relation among variables.
- We then iteratively remove clues through weighted sampling, checking after each removal that the remaining clues still lead uniquely to the above solution.
- Finally, we represent the puzzle with prompting templates to form the inputs for the LLMs.
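
The generation steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual generator: it uses only three clue types, checks uniqueness by brute-force enumeration, and weights removal by clue type. The `FEATURES` and `WEIGHTS` values and all function names are hypothetical.

```python
import random
from itertools import permutations, product

# Hypothetical features, values, and removal weights, for illustration only.
FEATURES = {"Name": ["Arnold", "Eric", "Peter"], "Drink": ["tea", "coffee", "milk"]}
WEIGHTS = {"found_at": 3.0, "same_house": 2.0, "direct_left": 1.0}

def all_grids(features):
    """Yield every candidate grid as a {value: house_index} mapping."""
    for perms in product(*(permutations(values) for values in features.values())):
        yield {value: house for perm in perms for house, value in enumerate(perm)}

def enumerate_clues(solution):
    """List every true clue (of three simplified types) about the solution."""
    clues = [("found_at", value, house) for value, house in solution.items()]
    for a, b in permutations(solution, 2):
        if a < b and solution[a] == solution[b]:
            clues.append(("same_house", a, b))
        if solution[a] + 1 == solution[b]:
            clues.append(("direct_left", a, b))
    return clues

def holds(clue, grid):
    """Check whether a single clue is satisfied by a candidate grid."""
    kind, a, b = clue
    if kind == "found_at":
        return grid[a] == b
    if kind == "same_house":
        return grid[a] == grid[b]
    return grid[a] + 1 == grid[b]  # direct_left

def count_solutions(clues, features):
    """Count the grids that satisfy all clues (brute force)."""
    return sum(all(holds(c, g) for c in clues) for g in all_grids(features))

def generate_puzzle(features, weights, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly assign values to houses to form the solution.
    solution = {}
    for values in features.values():
        for house, value in enumerate(rng.sample(values, len(values))):
            solution[value] = house
    # Step 2: enumerate all clues, then try to remove each one once.
    clues = enumerate_clues(solution)
    candidates = list(clues)
    while candidates:
        # Weighted sampling: heavier clue types tend to be proposed earlier.
        pick = rng.choices(candidates, weights=[weights[c[0]] for c in candidates])[0]
        candidates.remove(pick)
        remaining = [c for c in clues if c != pick]
        if count_solutions(remaining, features) == 1:
            clues = remaining  # the clue was redundant, so drop it
    return solution, clues

solution, clues = generate_puzzle(FEATURES, WEIGHTS)
print(solution)
print(clues)
```

Because every clue is tested for removal once and the clue set only shrinks, the surviving clues are minimal: dropping any one of them would admit more than one solution. A practical generator would replace the brute-force check with a constraint solver, since enumeration scales as (N!)^M.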
The used clue types are as follows:
- `Found_At`: the tea drinker lives in House 3
- `Not_At`: the musician does not drink tea
- `Same_House`: the musician drinks tea
- `Direct_Left/Right`: the green house is directly to the left/right of the white house
- `Side_By_Side`: the coffee drinker and the tea drinker are next to each other
- `Left/Right_Of`: A is somewhere to the left/right of B
- `One/Two_between`: there is one house (or there are two houses) between A and B
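
Each clue type corresponds to a simple predicate over house positions. The sketch below is one possible encoding, inferred from the example sentences above rather than taken from the benchmark code. The `CLUE_CHECKS` dictionary, the split of paired types into separate entries, and the `pos` mapping from an attribute value to its 1-indexed house number are all assumptions.

```python
# Each check takes `pos` (attribute value -> house number) and two arguments.
CLUE_CHECKS = {
    "Found_At": lambda pos, a, house: pos[a] == house,
    "Not_At": lambda pos, a, b: pos[a] != pos[b],
    "Same_House": lambda pos, a, b: pos[a] == pos[b],
    "Direct_Left": lambda pos, a, b: pos[a] + 1 == pos[b],
    "Direct_Right": lambda pos, a, b: pos[a] - 1 == pos[b],
    "Side_By_Side": lambda pos, a, b: abs(pos[a] - pos[b]) == 1,
    "Left_Of": lambda pos, a, b: pos[a] < pos[b],
    "Right_Of": lambda pos, a, b: pos[a] > pos[b],
    "One_Between": lambda pos, a, b: abs(pos[a] - pos[b]) == 2,
    "Two_Between": lambda pos, a, b: abs(pos[a] - pos[b]) == 3,
}

# A hypothetical partial assignment for a 3-house puzzle.
pos = {"tea": 3, "coffee": 2, "musician": 1, "green": 2, "white": 3}

print(CLUE_CHECKS["Found_At"](pos, "tea", 3))
print(CLUE_CHECKS["Direct_Left"](pos, "green", "white"))
print(CLUE_CHECKS["One_Between"](pos, "musician", "tea"))
```

Encoding clues as predicates keeps the uniqueness check generic: a candidate grid is a solution exactly when every clue's predicate returns true.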
- More reasoning methods: We are interested in evaluating LLM agents (e.g., ReAct, Reflexion, SwiftSage). Also, we'll explore advanced prompting and fine-tuning methods like Tree of Thoughts, Flow of Reasoning, etc.
- More evaluation methods: We are considering trying multiple-choice format for faster evaluation.
- Fine-tuning with Logic Puzzles: Can fine-tuning with synthetic logical reasoning tasks improve the general abilities of LLMs?
- Analyze the internal reasoning mechanism of LLMs: How do LLMs reason correctly and incorrectly?
- More tasks: We'll add more types of logic puzzles that require a more diverse set of reasoning abilities in the evaluation.
@misc{zebralogic2024,
title={ZebraLogic: Benchmarking the Logical Reasoning Ability of Language Models},
author={Bill Yuchen Lin and Ronan Le Bras and Yejin Choi},
url={https://huggingface.co/spaces/allenai/ZebraLogic},
year={2024}
}
@article{dziri2024faith,
title={Faith and fate: Limits of transformers on compositionality},
author={Nouha Dziri and Ximing Lu and Melanie Sclar and Xiang Lorraine Li and Liwei Jiang and Bill Yuchen Lin and Peter West and Chandra Bhagavatula and Ronan Le Bras and Jena D. Hwang and Soumya Sanyal and Sean Welleck and Xiang Ren and Allyson Ettinger and Za{\"i}d Harchaoui and Yejin Choi},
journal={Advances in Neural Information Processing Systems},
volume={36},
year={2024}
}