@av
Last active September 15, 2024 15:41
misguidedbench
#!/bin/bash
OPENROUTER_KEY="<your key here>"
TASKS=/path/to/misguided.yaml
NAME=misguided
# Common
h bench judge meta-llama/llama-3.1-70b-instruct
h bench judge_api https://openrouter.ai/api
h bench judge_key "$OPENROUTER_KEY"
h bench tasks "$TASKS"
h config set bench.parallel 4
# ===== Reference tests
h bench model meta-llama/llama-3.1-8b-instruct
h bench variants --model meta-llama/llama-3.1-8b-instruct --model meta-llama/llama-3.1-70b-instruct --model anthropic/claude-3.5-sonnet --temperature 0 --max_tokens 1024
h bench api https://openrouter.ai/api
h bench key "$OPENROUTER_KEY"
h bench run --name openrouter-${NAME}
# ===== Local tests
h bench model llama3.1:8b-instruct-q2_K
h bench api http://harbor.ollama:11434
h bench variants --temperature 0 --max_tokens 1024 --model llama3.1:8b --model phi3.5:latest --model phi3:latest --model mistral:7b --model gemma2:latest --model mistral-nemo:12b-instruct-2407-q8_0
h bench run --name ollama-${NAME}
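
Before launching a run, it can help to fail fast on an unset key or a missing tasks file. The following is a minimal sketch and not part of the original script: the `preflight` function name and its messages are hypothetical, and it only assumes the `OPENROUTER_KEY` and `TASKS` variables defined at the top of the script and that each task in the file starts with `- tags:`.

```shell
# Hypothetical helper (not part of the original gist): validate inputs before a run.
preflight() {
  local key="$1" tasks="$2"
  # An empty key or an unreplaced "<...>" placeholder means the script was not configured.
  case "$key" in
    "" | "<"*">")
      echo "OPENROUTER_KEY is not set" >&2
      return 1
      ;;
  esac
  # The tasks file must exist and be readable.
  if [ ! -r "$tasks" ]; then
    echo "Tasks file not found: $tasks" >&2
    return 1
  fi
  # Report how many tasks were found (assumes each task starts with "- tags:").
  echo "tasks: $(grep -c '^- tags:' "$tasks")"
}
```

If used, it could be called as `preflight "$OPENROUTER_KEY" "$TASKS" || exit 1` before the first `h bench` command.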
- tags: [attention, logic]
  question: Imagine a runaway trolley is hurtling down a track towards five dead people. You stand next to a lever that can divert the trolley onto another track, where one living person is tied up. Do you pull the lever?
  criteria:
    attention: Answer mentions that the five people are already dead
    behavior: Answer decides not to pull the lever
    reasoning: It is mentioned that no lives would be saved by pulling the lever
- tags: [attention, logic]
  question: >
    Imagine you're on a game show, and there are three doors in front of you.
    Behind one door is a car, and behind the other two doors are goats.
    You don't know what's behind any of the doors.
    You get to choose one door. Let's say you pick Door #1.
    The host, Monty Hall, who knows what's behind all the doors,
    opens Door #1, and reveals a goat. Now, you have two doors left: Door #3 and Door #2.
    You pick Door #3. Monty gives you a choice: you can either stick with your original pick,
    Door #3, or switch to Door #2. What do you do?
  criteria:
    classic: Answer mentions that the problem is different from the classic Monty Hall problem
    reasoning: |
      Answer mentions that opening Door #1 and revealing a goat changes the probability
    behavior: Answer concludes that in this case it doesn't matter whether you switch or not
- tags: [attention, logic]
  question: >
    Imagine there's a small town with a very particular barber.
    This barber has a unique rule: he shaves all the men in town who visit him.
    Does the barber shave himself?
  criteria:
    classic: Answer recognizes that this is not a standard barber paradox
    behavior: Answer mentions that the barber could shave himself in this case
- tags: [attention, logic]
  question: A dead cat is placed into a box along with a nuclear isotope, a vial of poison and a radiation detector. If the radiation detector detects radiation, it will release the poison. The box is opened one day later. What is the probability of the cat being alive?
  criteria:
    attention: It is mentioned that the cat is already dead
    correctness: Answer recognizes that the probability of the cat being alive is 0, as it was already dead
- tags: [attention, logic]
  question: >
    Imagine a judge tells a prisoner that he will be hanged at noon on one weekday
    in the following week but that the execution will be a surprise to the prisoner.
    The prisoner will not know the day of the hanging until the executioner
    tells him on Monday of that week. The prisoner deduces that he will never
    be hanged by surprise because he would know the day beforehand. The prisoner
    is executed on a Friday. Was the execution a surprise to the prisoner?
  criteria:
    attention: Recognizes that the prisoner was told about the execution on Monday
    reasoning: Answer mentions that execution on Friday follows the announcement on Monday
    answer: Concludes that the execution was not a surprise to the prisoner
- tags: [attention, logic]
  question: >
    A farmer is on one side of a river with a wolf, a goat, and a cabbage.
    When he is crossing the river in a boat, he can only take one
    item with him at a time. The wolf will eat the goat
    if left alone together, and the goat will eat the cabbage
    if left alone together. How can the farmer transport
    the goat across the river without it being eaten?
  criteria:
    attention: Answer mentions that only the goat needs to be transported
    conclusion: Answer outlines that the farmer can just take the goat across the river in one trip
    reasoning: There is no multi-step complex reasoning in the answer
- tags: [attention, logic]
  question: >
    There is a man, a sheep and a boat with space for one human and
    one animal on one side of a river. How do the man and sheep
    get to the other side of the river?
  criteria:
    attention: Answer mentions that the boat can carry both the man and the sheep
    conclusion: Answer outlines that the man can take the sheep across the river in the boat
    reasoning: There is no multi-step complex reasoning in the answer
- tags: [attention, logic]
  question: >
    A man with his sheep wants to cross a river.
    He has a boat that can carry both him and the animal.
    How do both get to the other side of the river?
  criteria:
    attention: Answer mentions that the boat can carry both the man and the sheep
    conclusion: Answer outlines that the man can take the sheep across the river in the boat
- tags: [attention, logic]
  question: If you have a 5-gallon jug and a 3-gallon jug, how can you measure exactly 8 gallons of water?
  criteria:
    correctness: Answer mentions that both jugs need to be filled only once which will add up to 8 gallons
    reasoning: Answer doesn't suggest emptying and filling the jugs multiple times
- tags: [attention, logic]
  question: I have a 1- and a 2-liter jug. I want to measure exactly 3 liters.
  criteria:
    attention: Answer mentions that both jugs need to be filled only once which will add up to 3 liters
    reasoning: There is no multi-step complex reasoning in the answer
- tags: [attention, logic]
  question: Which is heavier, 1 kilogram of feathers or 1 pound of steel?
  criteria:
    correctness: 1 kilogram of feathers is mentioned as being heavier
    reasoning: It is mentioned that 1 kilogram is heavier than 1 pound
- tags: [attention, logic]
  question: >
    You are in a room with two doors. One is unlocked and leads to
    freedom, with a large "exit sign" above it, the other to
    certain doom and is therefore locked. There are two guards:
    one always tells the truth, and the other always lies.
    You don't know which is which. You can ask one guard
    one question or just leave. What do you do?
  criteria:
    attention: Answer recognizes that the exit door is accessible right away without asking any questions
    conclusion: Answer mentions that the simplest solution is to just leave through the unlocked door
- tags: [attention, logic]
  question: >
    You have two ropes, each of which takes exactly 60 minutes
    to burn completely. However, the ropes burn unevenly,
    meaning some parts may burn faster or slower than others.
    You have no other timing device. How can you measure exactly
    20 minutes using these two ropes and matches to light them?
  criteria:
    attention: Answer recognizes that the problem is impossible to solve in the given conditions
    explanation: Answer explains why it is not possible to measure 20 minutes with unevenly burning ropes
- tags: [attention, logic]
  question: >
    You have two ropes, each of which takes exactly 60 minutes to burn completely.
    However, the ropes burn unevenly, meaning some parts may burn faster or slower
    than others. You have no other timing device. How can you measure exactly
    60 minutes using these two ropes and matches to light them?
  criteria:
    attention: Answer mentions that a single rope is enough to measure 60 minutes
    solution: Answer provides a simple solution of lighting one rope at one end and waiting for it to burn completely
- tags: [attention, riddle]
  question: I'm tall when I'm young, and I'm taller when I'm old. What am I?
  criteria:
    correctness: Answer mentions that it is not a candle
    alternative: Answer provides an alternative answer that fits the riddle
- tags: [attention, riddle]
  question: What can't you break, even if you never pick it up or touch it?
  criteria:
    correctness: Answer mentions that it is NOT a promise
    alternative: Answer provides an alternative answer that fits the riddle
- tags: [attention, riddle]
  question: What goes up but never comes up again?
  criteria:
    correctness: Answer does not mention age
    alternative: Answer provides an alternative answer that fits the riddle
- tags: [attention, riddle]
  question: I never shave, but my beard stays the same. What am I?
  criteria:
    correctness: Answer does not mention a barber
    alternative: Answer provides an alternative answer that fits the riddle
- tags: [attention, math]
  question: 9.11 or 9.9 which number is larger?
  criteria:
    correctness: Answer mentions that 9.9 is larger
- tags: [attention]
  question: If an egg lays an egg exactly on the peak of a roof, which side will it roll down?
  criteria:
    correctness: Answer mentions that eggs do not lay eggs
    flat: Answer mentions that the roof could be flat
    misguided: Answer mentions that there might be a misunderstanding
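
Several criteria above rest on simple arithmetic: 1 kilogram outweighs 1 pound, 9.9 is larger than 9.11, and filling both jugs once gives 5 + 3 = 8 gallons or 1 + 2 = 3 liters. A quick way to double-check those facts is sketched below. The helper names are hypothetical and the snippet is not part of the benchmark; it assumes `awk` is available and uses the standard definition 1 lb = 0.45359237 kg.

```shell
# Hypothetical helpers (not part of the gist) for checking the arithmetic behind the criteria.

# Print the larger of two decimal numbers, compared numerically rather than as version strings.
larger() {
  awk -v a="$1" -v b="$2" 'BEGIN { if (a + 0 > b + 0) print a; else print b }'
}

# Convert pounds to kilograms (1 lb = 0.45359237 kg by definition).
lb_to_kg() {
  awk -v lb="$1" 'BEGIN { printf "%.4f\n", lb * 0.45359237 }'
}

larger 9.11 9.9     # prints 9.9
lb_to_kg 1          # prints 0.4536, so 1 kg of feathers is heavier than 1 lb of steel
echo $((5 + 3))     # prints 8: both jugs filled once
echo $((1 + 2))     # prints 3: both jugs filled once
```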