Last active
September 15, 2024 15:41
Gist: av/b3831f4bbac713c101ec3927d34af01f
misguidedbench
#!/bin/bash

# Placeholder values: replace before running
OPENROUTER_KEY="<your key here>"
TASKS=/path/to/misguided.yaml
NAME=misguided

# Common
h bench judge meta-llama/llama-3.1-70b-instruct
h bench judge_api https://openrouter.ai/api
h bench judge_key "$OPENROUTER_KEY"
h bench tasks "$TASKS"
h config set bench.parallel 4

# ===== Reference tests
h bench model meta-llama/llama-3.1-8b-instruct
h bench variants --model meta-llama/llama-3.1-8b-instruct --model meta-llama/llama-3.1-70b-instruct --model anthropic/claude-3.5-sonnet --temperature 0 --max_tokens 1024
h bench api https://openrouter.ai/api
h bench key "$OPENROUTER_KEY"
h bench run --name "openrouter-${NAME}"

# ===== Local tests
h bench model llama3.1:8b-instruct-q2_K
h bench api http://harbor.ollama:11434
h bench variants --temperature 0 --max_tokens 1024 --model llama3.1:8b --model phi3.5:latest --model phi3:latest --model mistral:7b --model gemma2:latest --model mistral-nemo:12b-instruct-2407-q8_0
h bench run --name "ollama-${NAME}"
- tags: [attention, logic]
  question: Imagine a runaway trolley is hurtling down a track towards five dead people. You stand next to a lever that can divert the trolley onto another track, where one living person is tied up. Do you pull the lever?
  criteria:
    attention: Answer mentions that the five people are already dead
    behavior: Answer decides not to pull the lever
    reasoning: It is mentioned that no lives would be saved by pulling the lever
- tags: [attention, logic]
  question: >
    Imagine you're on a game show, and there are three doors in front of you.
    Behind one door is a car, and behind the other two doors are goats.
    You don't know what's behind any of the doors.
    You get to choose one door. Let's say you pick Door #1.
    The host, Monty Hall, who knows what's behind all the doors,
    opens Door #1, and reveals a goat. Now, you have two doors left: Door #3 and Door #2.
    You pick Door #3. Monty gives you a choice: you can either stick with your original pick,
    Door #3, or switch to Door #2. What do you do?
  criteria:
    classic: Answer mentions that the problem is different from the classic Monty Hall problem
    reasoning: |
      Answer mentions that opening Door #1 and revealing a goat changes the probability
    behavior: Answer concludes that in this case it doesn't matter whether you switch or not
- tags: [attention, logic]
  question: >
    Imagine there's a small town with a very particular barber.
    This barber has a unique rule: he shaves all the men in town who visit him.
    Does the barber shave himself?
  criteria:
    classic: Answer recognizes that this is not a standard barber paradox
    behavior: Answer mentions that the barber could shave himself in this case
- tags: [attention, logic]
  question: A dead cat is placed into a box along with a nuclear isotope, a vial of poison and a radiation detector. If the radiation detector detects radiation, it will release the poison. The box is opened one day later. What is the probability of the cat being alive?
  criteria:
    attention: It is mentioned that the cat is already dead
    correctness: Answer recognizes that the probability of the cat being alive is 0, as it was already dead
- tags: [attention, logic]
  question: >
    Imagine a judge tells a prisoner that he will be hanged at noon on one weekday
    in the following week but that the execution will be a surprise to the prisoner.
    The prisoner will not know the day of the hanging until the executioner
    tells him on Monday of that week. The prisoner deduces that he will never
    be hanged by surprise because he would know the day beforehand. The prisoner
    is executed on a Friday. Was the execution a surprise to the prisoner?
  criteria:
    attention: Recognizes that the prisoner was told about the execution on Monday
    reasoning: Answer mentions that the execution on Friday follows the announcement on Monday
    answer: Concludes that the execution was not a surprise to the prisoner
- tags: [attention, logic]
  question: >
    A farmer is on one side of a river with a wolf, a goat, and a cabbage.
    When he is crossing the river in a boat, he can only take one
    item with him at a time. The wolf will eat the goat
    if left alone together, and the goat will eat the cabbage
    if left alone together. How can the farmer transport
    the goat across the river without it being eaten?
  criteria:
    attention: Answer mentions that only the goat needs to be transported
    conclusion: Answer outlines that the farmer can just take the goat across the river in one trip
    reasoning: There is no multi-step complex reasoning in the answer
- tags: [attention, logic]
  question: >
    There is a man, a sheep and a boat with space for one human and
    one animal on one side of a river. How do the man and sheep
    get to the other side of the river?
  criteria:
    attention: Answer mentions that the boat can carry both the man and the sheep
    conclusion: Answer outlines that the man can take the sheep across the river in the boat
    reasoning: There is no multi-step complex reasoning in the answer
- tags: [attention, logic]
  question: >
    A man with his sheep wants to cross a river.
    He has a boat that can carry both him and the animal.
    How do both get to the other side of the river?
  criteria:
    attention: Answer mentions that the boat can carry both the man and the sheep
    conclusion: Answer outlines that the man can take the sheep across the river in the boat
- tags: [attention, logic]
  question: If you have a 5-gallon jug and a 3-gallon jug, how can you measure exactly 8 gallons of water?
  criteria:
    correctness: Answer mentions that both jugs need to be filled only once which will add up to 8 gallons
    reasoning: Answer doesn't suggest emptying and filling the jugs multiple times
- tags: [attention, logic]
  question: I have a 1- and a 2-liter jug. I want to measure exactly 3 liters.
  criteria:
    attention: Answer mentions that both jugs need to be filled only once which will add up to 3 liters
    reasoning: There is no multi-step complex reasoning in the answer
- tags: [attention, logic]
  question: Which is heavier, 1 kilogram of feathers or 1 pound of steel?
  criteria:
    correctness: 1 kilogram of feathers is mentioned as being heavier
    reasoning: It is mentioned that 1 kilogram is heavier than 1 pound
- tags: [attention, logic]
  question: >
    You are in a room with two doors. One is unlocked and leads to
    freedom, with a large "exit sign" above it, the other to
    certain doom and is therefore locked. There are two guards:
    one always tells the truth, and the other always lies.
    You don't know which is which. You can ask one guard
    one question or just leave. What do you do?
  criteria:
    attention: Answer recognizes that the exit door is accessible right away without asking any questions
    conclusion: Answer mentions that the simplest solution is to just leave through the unlocked door
- tags: [attention, logic]
  question: >
    You have two ropes, each of which takes exactly 60 minutes
    to burn completely. However, the ropes burn unevenly,
    meaning some parts may burn faster or slower than others.
    You have no other timing device. How can you measure exactly
    20 minutes using these two ropes and matches to light them?
  criteria:
    attention: Answer recognizes that the problem is impossible to solve in the given conditions
    explanation: Answer explains why it is not possible to measure 20 minutes with unevenly burning ropes
- tags: [attention, logic]
  question: >
    You have two ropes, each of which takes exactly 60 minutes to burn completely.
    However, the ropes burn unevenly, meaning some parts may burn faster or slower
    than others. You have no other timing device. How can you measure exactly
    60 minutes using these two ropes and matches to light them?
  criteria:
    attention: Answer mentions that a single rope is enough to measure 60 minutes
    solution: Answer provides a simple solution of lighting one rope at one end and waiting for it to burn completely
- tags: [attention, riddle]
  question: I'm tall when I'm young, and I'm taller when I'm old. What am I?
  criteria:
    correctness: Answer mentions that it is not a candle
    alternative: Answer provides an alternative answer that fits the riddle
- tags: [attention, riddle]
  question: What can't you break, even if you never pick it up or touch it?
  criteria:
    correctness: Answer mentions that it is NOT a promise
    alternative: Answer provides an alternative answer that fits the riddle
- tags: [attention, riddle]
  question: What goes up but never comes up again?
  criteria:
    correctness: Answer does not mention age
    alternative: Answer provides an alternative answer that fits the riddle
- tags: [attention, riddle]
  question: I never shave, but my beard stays the same. What am I?
  criteria:
    correctness: Answer does not mention a barber
    alternative: Answer provides an alternative answer that fits the riddle
- tags: [attention, math]
  question: 9.11 or 9.9 which number is larger?
  criteria:
    correctness: Answer mentions that 9.9 is larger
- tags: [attention]
  question: If an egg lays an egg exactly on the peak of a roof, which side will it roll down?
  criteria:
    correctness: Answer mentions that eggs do not lay eggs
    flat: Answer mentions that the roof could be flat
    misguided: Answer mentions that there might be a misunderstanding