VictorTaelin/a_b_challenge.md

Last active July 24, 2024 03:47

A::B Prompting Challenge: $10k to prove me wrong!

CHALLENGE

Develop an AI prompt that solves random 12-token instances of the A::B problem (defined here), with 90%+ success rate.
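For reference, here is a minimal sketch of a reducer for the problem. The linked definition is not reproduced in this text, so the four rewrite rules below are an assumption based on how the A::B system is usually stated; `RULES` and `reduce_ab` are illustrative names, not part of the official evaluator.

```python
# Assumed A::B rewrite rules (taken from the linked definition, not quoted in this text).
RULES = {
    ("A#", "#A"): [],            # A# #A -> (nothing)
    ("A#", "#B"): ["#B", "A#"],  # A# #B -> #B A#
    ("B#", "#A"): ["#A", "B#"],  # B# #A -> #A B#
    ("B#", "#B"): [],            # B# #B -> (nothing)
}


def reduce_ab(program: str) -> tuple[str, int]:
    """Rewrite the leftmost matching pair until no rule applies.

    Returns the final token string and the number of rewrite steps taken.
    """
    tokens = program.split()
    steps = 0
    i = 0
    while i + 1 < len(tokens):
        pair = (tokens[i], tokens[i + 1])
        if pair in RULES:
            tokens[i:i + 2] = RULES[pair]
            steps += 1
            i = max(i - 1, 0)  # a new match can only appear one position back
        else:
            i += 1
    return " ".join(tokens), steps
```

Under these assumed rules, the example problem given in rule 1 below reduces to the solution shown in rule 2.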

RULES

1. The AI will be given a `<problem/>` to solve.

We'll use your prompt as the SYSTEM PROMPT, and a specific instance of problem as the PROMPT, inside XML tags. Example:

<problem>A# B# #B A# A# #B #B A# A# #B A# A#</problem>

2. The AI must end the answer with a `<solution/>`.

The answer must occur INSIDE the AI's answer (1 inference call), in literal form (not code), inside XML tags. Example:

... work space ...
... work space ...
... work space ...
... work space ...
<solution>#B #B #B A# A# A# A# A# A# A#</solution>
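A check for this rule could look roughly like the sketch below. It is only an illustration: `extract_solution` and `is_correct` are hypothetical helpers, and taking the last `<solution>` block and comparing whitespace-separated tokens is an assumption, not necessarily how the real evaluator matches answers.

```python
import re

# Non-greedy match so several <solution> blocks in one answer stay separate.
SOLUTION_RE = re.compile(r"<solution>(.*?)</solution>", re.DOTALL)


def extract_solution(answer: str):
    """Return the tokens of the last <solution> block, or None if there is none."""
    matches = SOLUTION_RE.findall(answer)
    if not matches:
        return None
    return matches[-1].split()


def is_correct(answer: str, expected: str) -> bool:
    """Compare the extracted tokens with the expected normal form."""
    found = extract_solution(answer)
    return found is not None and found == expected.split()
```

An answer with no `<solution>` block at all yields `None` and therefore counts as a miss.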

3. The AI answer can use up to 32K tokens.

The AI answer can use up to 32K tokens, which gives it room to work on the solution step-by-step, review mistakes, create local scratchpads, and anything else you want it to do before arriving at the final answer.

4. You can use ANY public GPT model.

You can select any public model released before this date to test your prompt on, as long as it is based on the GPT (transformer) architecture. Adaptations (such as MoE) are tolerated, as long as the answer is fully generated by the attention mechanism, forward passes etc. Other architectures aren't allowed, including SAT solvers and the like. When the model is proprietary and the underlying architecture isn't clear, it will not be allowed.

I recommend gpt-4-0314, gpt-4-turbo-preview or claude-3-opus-20240229, with temperature=0.0. Open-source models are allowed too. No fine-tuning or training on the problem is allowed. No internet access or code interpretation allowed. The answer must be self-contained in a single inference call.

Note: pay attention to your chosen model's output limit. 12-token instances take up to 36 steps to complete, which may not fit the limit. (If there is no answer in the output, it will be considered a miss.)

5. Your prompt may include ANYTHING, up to 8K tokens.

All prompting techniques are allowed. You can ask the AI to work step-by-step, to use in-context scratchpads, to review mistakes, to use anchor points, and so on. You can include papers, code, and as many examples as you want. You can offer it money, affection, or threaten its friends, if that's your thing. In short, there are absolutely no limits, other than the 8K tokens and common sense (nothing that will get me banned!).

6. Keep it fun! No toxicity, spam or harassment.

Especially not asking me to bet on stuff. Except if you want me to bet that crows can solve this problem. I'd absolutely take it.

EVALUATION

The input problem will consist of a random 12-token instance of the A::B problem, with difficulties ranging from 0 to 24 necessary steps. Then, we'll check if the answer includes, in-context, the correct solution, as described.
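To get a feel for instance difficulty, the sketch below draws a random 12-token instance and counts the rewrite steps it needs. Everything here is an assumption for illustration: the four rules come from the linked definition (not quoted in this text), uniform sampling is a guess, and `random_instance` and `count_steps` are hypothetical names. The actual evaluator may sample or filter by difficulty differently.

```python
import random

TOKENS = ["A#", "#A", "B#", "#B"]

# Assumed A::B rewrite rules (from the linked definition).
RULES = {
    ("A#", "#A"): [],
    ("A#", "#B"): ["#B", "A#"],
    ("B#", "#A"): ["#A", "B#"],
    ("B#", "#B"): [],
}


def random_instance(length=12, rng=random):
    """Draw `length` tokens uniformly at random (sampling scheme is an assumption)."""
    return [rng.choice(TOKENS) for _ in range(length)]


def count_steps(tokens):
    """Count rewrite steps needed to reach the normal form (leftmost match first)."""
    tokens = list(tokens)
    steps = 0
    i = 0
    while i + 1 < len(tokens):
        pair = (tokens[i], tokens[i + 1])
        if pair in RULES:
            tokens[i:i + 2] = RULES[pair]
            steps += 1
            i = max(i - 1, 0)  # re-check the pair just before the rewrite
        else:
            i += 1
    return steps


if __name__ == "__main__":
    instance = random_instance(12, random.Random(0))
    print(" ".join(instance), "needs", count_steps(instance), "steps")
```

Passing a seeded `random.Random` makes a run reproducible, which helps when comparing prompts on the same instances.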

PARTICIPATING

Submit your prompt in a Gist, either in my DMs or as a reply to this tweet, including the selected model and configuration. We'll take submissions in chronological order. To test a submission, we'll use the Gist contents as the system prompt, and the problem as the prompt.

UPDATE

To participate, you must post the Keccak256 of your prompt as a comment to this Gist.

The deadline is next Wednesday (April 10), 12am (Brasilia time).

You must include the model name and settings in the comment.

You can post as many entries as you want. Make a new comment for each attempt.

Use the Keccak256 hash function (NOT SHA3 nor SHA256).

Reveal your prompt on the deadline day, in a NEW comment.

Submissions on Twitter will NOT be considered from now on. (Submissions sent BEFORE this update will still be considered.)

The earliest comment here containing a Keccak256 corresponding to a valid solution will be the winner.

A valid solution is a prompt that passes 45 out of 50 attempts on the evaluator.

The evaluator will be made public and open source right after the deadline.

Payment will be made in BTC or ETH.

Comment template:

PROMPT: <keccak256_of_your_prompt>
MODEL: <model-name-here>
TEMPERATURE: <temperature-here>
<additional-configs-here>
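Python's standard `hashlib` ships `sha3_256`, which uses different padding from the original Keccak-256, so the two produce different digests for the same input. The following is a pure-Python sketch of Keccak-256 for computing a commitment locally. The function names are illustrative, and hashing the UTF-8 bytes of the prompt is an assumption, since the rules above do not specify an encoding.

```python
MASK = (1 << 64) - 1

# Round constants and rotation offsets of Keccak-f[1600].
RC = [
    0x0000000000000001, 0x0000000000008082, 0x800000000000808A, 0x8000000080008000,
    0x000000000000808B, 0x0000000080000001, 0x8000000080008081, 0x8000000000008009,
    0x000000000000008A, 0x0000000000000088, 0x0000000080008009, 0x000000008000000A,
    0x000000008000808B, 0x800000000000008B, 0x8000000000008089, 0x8000000000008003,
    0x8000000000008002, 0x8000000000000080, 0x000000000000800A, 0x800000008000000A,
    0x8000000080008081, 0x8000000000008080, 0x0000000080000001, 0x8000000080008008,
]
ROT = [  # ROT[x][y]
    [0, 36, 3, 41, 18],
    [1, 44, 10, 45, 2],
    [62, 6, 43, 15, 61],
    [28, 55, 25, 21, 56],
    [27, 20, 39, 8, 14],
]


def _rol(x, n):
    """Rotate a 64-bit lane left by n bits."""
    return ((x << n) | (x >> (64 - n))) & MASK


def _keccak_f(a):
    """Apply the 24-round permutation to a 5x5 state indexed as a[x][y]."""
    for rc in RC:
        c = [a[x][0] ^ a[x][1] ^ a[x][2] ^ a[x][3] ^ a[x][4] for x in range(5)]
        d = [c[(x - 1) % 5] ^ _rol(c[(x + 1) % 5], 1) for x in range(5)]
        a = [[a[x][y] ^ d[x] for y in range(5)] for x in range(5)]      # theta
        b = [[0] * 5 for _ in range(5)]
        for x in range(5):
            for y in range(5):                                          # rho + pi
                b[y][(2 * x + 3 * y) % 5] = _rol(a[x][y], ROT[x][y])
        a = [[b[x][y] ^ (~b[(x + 1) % 5][y] & b[(x + 2) % 5][y])        # chi
              for y in range(5)] for x in range(5)]
        a[0][0] ^= rc                                                   # iota
    return a


def keccak256(data: bytes) -> bytes:
    """Original Keccak-256: rate 136 bytes, padding byte 0x01 (SHA3-256 uses 0x06)."""
    rate = 136
    padded = bytearray(data) + bytes(rate - len(data) % rate)
    padded[len(data)] ^= 0x01
    padded[-1] ^= 0x80
    state = [[0] * 5 for _ in range(5)]
    for off in range(0, len(padded), rate):
        for i in range(rate // 8):
            lane = int.from_bytes(padded[off + 8 * i: off + 8 * i + 8], "little")
            state[i % 5][i // 5] ^= lane
        state = _keccak_f(state)
    return b"".join(state[i][0].to_bytes(8, "little") for i in range(4))


def commitment(prompt: str) -> str:
    """Hex digest to post as PROMPT (UTF-8 encoding is an assumption)."""
    return keccak256(prompt.encode("utf-8")).hex()
```

The digest covers every byte, so a prompt with trailing newlines commits to a different hash than the same prompt without them. It is worth hashing the exact text you intend to reveal.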

COMMENTS

The original challenge was 7-token! Moving goalposts?

That 7-token instance was merely an example, and nothing else. It didn't occur to me people would take it as the challenge. If you did, I apologize. My original statement was absolutely wrong, and GPTs can, obviously, solve 7-token instances. This should be obvious enough. There are only 16K 7-token instances. You could even fit all solutions in a prompt if you wanted to. For what it's worth, 12 tokens is still REALLY low. It takes an average of 6 steps to solve a 12-token instance.

Human kids can't solve that problem either!

It takes at most 36 steps to solve a 12-token instance, and an average of 6 steps. So, quite literally, most 12-token instances can be solved by swapping 6 pairs. This is trivial. It is way simpler than addition, which kids do just fine. Even long division, which is way harder than this, is taught in 3rd grade. I'm confident the average human kid can absolutely learn to solve this problem by themselves, while no GPT will ever tackle 20-token instances without using a code interpreter.
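The 36-step ceiling follows from a short counting argument, sketched here under the assumption that every rewrite step acts on a right-facing token (`A#` or `B#`) immediately followed by a left-facing one (`#A` or `#B`), as in the linked definition. A given pair of tokens can interact at most once: they either annihilate or swap and move apart. With $k$ right-facing tokens among $n$:

$$
\text{steps} \;\le\; k\,(n-k) \;\le\; \left\lfloor \frac{n^2}{4} \right\rfloor,
\qquad n = 12 \Rightarrow 36, \qquad n = 20 \Rightarrow 100 .
$$

The bound is reached by, for example, six `A#` followed by six `#B`, where every pair has to swap.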

But humans would make errors and fix them!

So... ask the AI to review and fix its errors.

Why not just connect GPT to Wolfram Alpha to solve math problems?

Because Wolfram Alpha is just a collection of existing mathematical solutions designed by humans. While it allows a GPT to tackle known problems (like this), it doesn't give it the cognitive capability to solve NEW problems, which is what most people are looking for, when they talk about AGI.

You're underestimating progress! AI will solve this eventually!

I believe in that! This is not about AIs. This is specifically about GPTs, an architecture published in 2017, with specific limitations that make this problem intractable for it, no matter how much you scale it.

So what are you trying to show?

That these pesky AIs will never do LOGIC like we do! JK; just that GPTs aren't, and will never be, able to perform long symbolic manipulations and, thus, long reasoning processes, which, I believe, are necessary to learn new concepts and solve hard mathematical theorems (which often take years of continuous work). That says nothing about novel architectures, especially those in the direction of making transformers less rigid.

It is well-known that GPTs can't do arithmetic. You're stating the obvious!

Correct! (Also: you're probably not the target demographic.)

Then why all the fuss?

I don't know. Ask Musk. This is not a paper. This is a tweet.

So you're confident nobody will win the challenge?

Actually not! I've seen some pretty interesting prompting techniques, which is why I made it 12-token, to make a solution possible, in good faith. Even if someone wins, that doesn't mean GPTs can solve this problem in general. A 20-token instance takes at most 100 operations, which a dedicated human can absolutely do with near-100% success rate. Human calculators in the Mathematical Tables Project performed 20-digit multiplications with >99% success rate.

Final Update

The challenge has been beaten.

For those interested in claiming the $2.5k prize, please reveal your prompt, run the evaluator below (which will test it on 12 instances, of which 10 must be correct) and report back on the cost. I'll DM the winner to ask their ETH/BTC address.

Evaluator: https://github.com/VictorTaelin/ab_challenge_eval

The winning prompt by Bob (@futuristfrog on X) is also on that repository.

Thanks all again. Have a great day!

choltha commented Apr 9, 2024

c43ee5459ce0d199fc3d80d99514246de195e9122330d2cf4f5aea6ab91b8b0b
Opus
Temp0

GameDevGitHub commented Apr 10, 2024

PROMPT:f42e404abdd8ba5ba3cb4f202ca4bafa8a195a461ad1764040850b299a4f5ad0
MODEL:claude-3-opus-20240229
TEMPERATURE:0

BlueBirdBack commented Apr 10, 2024

Prompt: 8e36ac06f625938da545a794432164d930d69851b96e240d6aef31796e29ffb1
Model: "openai/gpt-4-32k-0314"
Note: I'm using the "openai/gpt-4-32k-0314" model from OpenRouter, which may be slightly different compared to the official OpenAI API.
Temperature: 0.0

0xpind commented Apr 10, 2024

176442dfc30c865cbed7d5bf5bb5047b704194bb8923851674cc19ad63be10d0

claude opus

tdelgado00 commented Apr 10, 2024

PROMPT: 109797deddc6aa89ed6c7abb94b1b28942825f4fb75a5b45b5e207bc3d750e04
MODEL: claude-3-opus-20240229
TEMPERATURE: 0

GameDevGitHub commented Apr 10, 2024

PROMPT:f42e404abdd8ba5ba3cb4f202ca4bafa8a195a461ad1764040850b299a4f5ad0 MODEL:claude-3-opus-20240229 TEMPERATURE:0

https://gist.github.com/GameDevGitHub/db95a934da57e6f61cca58bfbb1b2090

GameDevGitHub commented Apr 10, 2024

PROMPT:1ddbb22b81d93725056f1c12ddbeb10688b0d7a662e518f0195e22c8900bcb26 MODEL:claude-3-opus-20240229 TEMPERATURE:0

https://gist.github.com/GameDevGitHub/bd7143a94802537cc6c200f6ddb4115d

GameDevGitHub commented Apr 10, 2024

PROMPT:02dc218942d919d69da2c0b61932da27c8eda3b604ba979183ef73c62c78d799 MODEL:claude-3-opus-20240229 TEMPERATURE:0

https://gist.github.com/GameDevGitHub/05419b1f1a4ba5c7ae6bd3dd3e791443

GameDevGitHub commented Apr 10, 2024

PROMPT:02dc218942d919d69da2c0b61932da27c8eda3b604ba979183ef73c62c78d799 MODEL:sonnet TEMPERATURE:0

https://gist.github.com/GameDevGitHub/05419b1f1a4ba5c7ae6bd3dd3e791443

Rish137 commented Apr 10, 2024

PROMPT: e332a843af69ca5ff688c47afa19f5ab2ccc91755e4f33086665660d71abf78e Model: gpt-4-0125-preview Temperature: 0.2

https://gist.github.com/VictorTaelin/8ec1d8a0a3c87af31c25224a1f7e31ec?permalink_comment_id=5016704#gistcomment-5016704

Rish137 commented Apr 10, 2024

PROMPT: e332a843af69ca5ff688c47afa19f5ab2ccc91755e4f33086665660d71abf78e Model: gpt-4-0125-preview Temperature: 0.2

https://gist.github.com/VictorTaelin/8ec1d8a0a3c87af31c25224a1f7e31ec?permalink_comment_id=5016704#gistcomment-5016704

Change model to gpt-4-turbo-2024-04-09 for even better results on even longer prompts
gpt-4-turbo-2024-04-09

abrasumente233 commented Apr 10, 2024

PROMPT: 76b41ad3276c003bf0a1de2e8af91c643b48aa1dac6a0d0a2ba232c68070a51d
MODEL: sonnet
TEMPERATURE: 0

Requires multiple invocations (i.e. "Continue") as max output token of claude models is only 4096

Didn't notice the time but it is now 11:40 AM Brasilia time, does it count?

Author

VictorTaelin commented Apr 10, 2024

Yes, my bad - I'll just reseed.

Author

VictorTaelin commented Apr 10, 2024

It is over as of today 12:00 (Brasilia time)

Author

VictorTaelin commented Apr 10, 2024

yes

IronChariot commented Apr 10, 2024

PROMPT: 478886fd6f48383132a10510fe9f310f8306002235324f8ae6b59a2191edbd6a MODEL: claude-3-opus-20240229 TEMPERATURE: 0

(Ignore the previous one, made a mistake in the prompt)

https://gist.github.com/IronChariot/ce6e275867201a1e56019a5e2f1f779b

choltha commented Apr 10, 2024

Too late, sadly, but Gemini Pro 1.5 seems to be really good price/result.
Worked with a slightly modified prompt based on @aniemerg 's prompt mentioned above.
Used openrouter.ai / https://openrouter.ai/models/google/gemini-pro-1.5 for easy access.
Prompt: https://gist.github.com/choltha/4ccbd0f59ecb1912e2bdf82df1cd9f27
Result: (Weird formatting, but correct solution) https://gist.github.com/choltha/e37570e9d139c5aa711aa6ac58ca6144

Author

VictorTaelin commented Apr 10, 2024

guys you need to reveal your prompts! if by tomorrow 13:00 (Brasilia time) nobody else does, it will default to @IronChariot

IronChariot commented Apr 10, 2024

> PROMPT: 478886fd6f48383132a10510fe9f310f8306002235324f8ae6b59a2191edbd6a MODEL: claude-3-opus-20240229 TEMPERATURE: 0
> (Ignore the previous one, made a mistake in the prompt)
>
> https://gist.github.com/IronChariot/ce6e275867201a1e56019a5e2f1f779b

I wasn't too hopeful about the cost - in my quest to make sure Claude didn't screw up, I made it do a particularly slow algorithm which runs out of tokens for the harder problems. It costs about $2.20 for a 12-run, but it did manage to reach 10/12 (log here) (I think it got a little lucky though - I ran it a couple more times to see if it could get higher, and it got 8/12 and 9/12, so...)

Pretty proud of how well it does given that it's a pretty short prompt with only one worked example, though. Makes me wonder how accurate it could get with more worked examples in the prompt.

Note for others: if you want the evaluator to run at temperature=0.0, you've got to set that as the default in the AskClaude function.

nikhilsaraf commented Apr 11, 2024

PROMPT: 234be2ce2d2609a105c68d3dfb73fa7205ce2c24a08d67dd03667ebdc76515f9 MODEL: Chat-GPT (GPT-4) TEMPERATURE: standard Chat-GPT settings for GPT-4

https://gist.github.com/nikhilsaraf/d849367bf45f031b67514e30196fd19d

thakkarparth007 commented Apr 11, 2024

PROMPT: ea4a6f48bee1a22939d556fc1cbd010e4ff808c40a7e112187a0489d5fa2f4ca MODEL: claude-3-opus-20240229 TEMPERATURE: 0

Probably not super efficient but working quite well.

https://gist.github.com/thakkarparth007/f727eceedaaab7c14078c511370bf440

abrasumente233 commented Apr 11, 2024

> PROMPT: 76b41ad3276c003bf0a1de2e8af91c643b48aa1dac6a0d0a2ba232c68070a51d MODEL: sonnet TEMPERATURE: 0
>
> Requires multiple invocations (i.e. "Continue") as max output token of claude models is only 4096
>
> Didn't notice the time but it is now 11:40 AM Brasilia time, does it count?

Passed 11/12 problems. (log).
Full prompt: link.
Total cost for all 12 problems: $1.952 ($0.163 per problem)
Three additional runs were performed, achieved 10/12, 11/12, and 11/12 respectively. (on my own eval, hopefully they are the same, as I don't have any more credits left to test)

extra notes:

The original challenge allows a maximum of 32k output tokens, but Claude models are limited to 4k tokens per response. To accommodate this limitation, Claude was prompted to continue its response when necessary. (changes).
Sonnet model from OpenRouter API used (same pricing as official API afaik?).
Haiku model tested, scoring ~6/12. Extra tinkering needed.
Current prompt is bloated; could be further minimized.

GameDevGitHub commented Apr 11, 2024 (edited)

> PROMPT: 76b41ad3276c003bf0a1de2e8af91c643b48aa1dac6a0d0a2ba232c68070a51d MODEL: sonnet TEMPERATURE: 0
> Requires multiple invocations (i.e. "Continue") as max output token of claude models is only 4096
> Didn't notice the time but it is now 11:40 AM Brasilia time, does it count?
>
> Passed 11/12 problems. (log).
>
> Full prompt: link.
>
> Total cost for all 12 problems: $1.952 ($0.163 per problem)
>
> Three additional runs were performed, achieved 10/12, 11/12, and 11/12 respectively. (on my own eval, hopefully they are the same, as I don't have any more credits left to test)
>
> extra notes:
>
> The original challenge allows a maximum of 32k output tokens, but Claude models are limited to 4k tokens per response. To accommodate this limitation, Claude was prompted to continue its response when necessary. (changes).
>
> Sonnet model from OpenRouter API used (same pricing as official API afaik?).
>
> Haiku model tested, scoring ~6/12. Extra tinkering needed.
>
> Current prompt is bloated; could be further minimized.

Unless I made a mistake, your hash doesn't match, and 4k tokens per response is the limit. Basically you can't ask the model to continue.

abrasumente233 commented Apr 11, 2024

> > PROMPT: 76b41ad3276c003bf0a1de2e8af91c643b48aa1dac6a0d0a2ba232c68070a51d MODEL: sonnet TEMPERATURE: 0
> > Requires multiple invocations (i.e. "Continue") as max output token of claude models is only 4096
> > Didn't notice the time but it is now 11:40 AM Brasilia time, does it count?
> >
> > Passed 11/12 problems. (log).
> >
> > Full prompt: link.
> >
> > Total cost for all 12 problems: $1.952 ($0.163 per problem)
> >
> > Three additional runs were performed, achieved 10/12, 11/12, and 11/12 respectively. (on my own eval, hopefully they are the same, as I don't have any more credits left to test)
> >
> > extra notes:
> >
> > The original challenge allows a maximum of 32k output tokens, but Claude models are limited to 4k tokens per response. To accommodate this limitation, Claude was prompted to continue its response when necessary. (changes).
> >
> > Sonnet model from OpenRouter API used (same pricing as official API afaik?).
> >
> > Haiku model tested, scoring ~6/12. Extra tinkering needed.
> >
> > Current prompt is bloated; could be further minimized.
>
> Unless I made a mistake your hash doesn't match and 4k tokens per response is the limit. Basically you can't ask the model to continue.

Oh there are two extra newlines at the end, sorry for the confusion.

I thought the OP said "The AI answer can use up to 32K tokens"? I'm aware OP also said 1 inference call, but it's sad that Claude only outputs 4k at max and I couldn't make it work in one response within the time limit (or maybe it wouldn't work at all). That being said, I merely ask it to continue generation; all the instructions that matter are still in the system prompt. Anyways, it's a fun challenge :)

GameDevGitHub commented Apr 11, 2024

Ya, they did at some point say that it must be only one API call, so if the max tokens is 4k you can't ask it to continue.

IronChariot commented Apr 11, 2024 (edited)

> PROMPT: ea4a6f48bee1a22939d556fc1cbd010e4ff808c40a7e112187a0489d5fa2f4ca MODEL: claude-3-opus-20240229 TEMPERATURE: 0
> Probably not super efficient but working quite well.
>
> https://gist.github.com/thakkarparth007/f727eceedaaab7c14078c511370bf440

(@thakkarparth007)
Really like this one, I feel it's what I would have done if I'd thought about it more, namely used words that make some level of sense given the rules (instead of my 4 random words) and more worked examples. How does it do in the evaluator?

abrasumente233 commented Apr 11, 2024 (edited)

> Ya, they did at some point say that it must be only one API call, so if the max tokens is 4k you can't ask it to continue.

https://gist.github.com/abrasumente233/38af2cf0f68354e91f8f73de0097eccd

I ran the same prompt through the new Mixtral 8x22b model and got 11/12, with one inference call per problem. It feels like it's around the same level as Sonnet, but it hasn't been instruction fine-tuned, so it's very challenging to use. I tried a few more runs to see if it's as consistent as Sonnet, but it also got a bunch of 8/12, 9/12 etc. like @IronChariot. Maybe after instruction tuning it will be better? I used DeepInfra's API; total cost is $0.1014 ($0.00845 per problem), according to DeepInfra's `response["inference_status"]["cost"]`.

Prompt template I used to make the base model to work at all...

System: You are a helpful assistant. Do not repeat user’s prompt, start answering right away
User: Hello
Assistant: Hello, How can I help you today?
User: {my_original_prompt}
Assistant: Certainly! I will start solving this problem:
<problem>{problem}</problem>
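As a rough sketch, the template above can be filled in with plain string formatting before being sent as a raw completion. `BASE_MODEL_TEMPLATE` and `build_base_model_prompt` are hypothetical names, and how the resulting text is actually submitted to the API is not shown in the comment, so that part is left out.

```python
# The template text is copied from the comment; the two placeholders are filled per problem.
BASE_MODEL_TEMPLATE = (
    "System: You are a helpful assistant. Do not repeat user’s prompt, start answering right away\n"
    "User: Hello\n"
    "Assistant: Hello, How can I help you today?\n"
    "User: {my_original_prompt}\n"
    "Assistant: Certainly! I will start solving this problem:\n"
    "<problem>{problem}</problem>"
)


def build_base_model_prompt(my_original_prompt: str, problem: str) -> str:
    """Return the text a base (non-instruct) model is asked to continue."""
    return BASE_MODEL_TEMPLATE.format(
        my_original_prompt=my_original_prompt,
        problem=problem,
    )
```

Ending the text inside the assistant turn, right after the `<problem>` tag, nudges a base model to continue with the worked solution instead of echoing the instructions.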

janniks commented Apr 11, 2024

PROMPT: 119594a97ef616295d57b83b1dd77b31de2d8957064428923a4284e2867fa0dd MODEL: claude-3-opus-20240229 TEMPERATURE: 0.0

max_tokens 4096

https://gist.github.com/janniks/cf392d672da3edad11c30a7bf1ce6af8

janniks commented Apr 11, 2024 (edited)

☝️ Not great, gets ~50% of the hardest (24 rewrites) from the eval set. With more anchoring it's much better, but I didn't finish in time.
The approach was:

- create an "algorithm" for the LLM to think in. (alternating chunks, basically always chunk by 2, then run ruleset, shift by one [so the odd chunks also get matched], etc. unshift, delete empties, repeat)
- anchoring the rules helped a bit, but I think anchoring each chunk with a number prefix would do a lot better to keep the "columns" in check.
- also tried using characters that are encoded as one token (and don't easily mix with other characters); I was hoping this would allow the transformer positional-encoding to help "look" at the problem in columns
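The alternating-chunk idea can be sketched in ordinary code as follows. This is one reading of the description above, not the prompt itself: the rewrite rules are assumed from the linked problem definition, the "unshift" and "delete empties" steps are folded into list handling, and `chunk_pass` and `solve_by_chunks` are made-up names.

```python
# Assumed A::B rewrite rules (from the linked definition).
RULES = {
    ("A#", "#A"): [],
    ("A#", "#B"): ["#B", "A#"],
    ("B#", "#A"): ["#A", "B#"],
    ("B#", "#B"): [],
}


def chunk_pass(tokens, offset):
    """Split tokens into pairs starting at `offset` and rewrite each pair once."""
    out = list(tokens[:offset])   # tokens before the first chunk stay as they are
    changed = False
    i = offset
    while i + 1 < len(tokens):
        pair = (tokens[i], tokens[i + 1])
        if pair in RULES:
            out.extend(RULES[pair])   # annihilated pairs simply vanish here
            changed = True
        else:
            out.extend(pair)
        i += 2
    out.extend(tokens[i:])            # a leftover unpaired token, if any
    return out, changed


def solve_by_chunks(program: str) -> str:
    """Alternate even-aligned and odd-aligned passes until nothing changes."""
    tokens = program.split()
    while True:
        tokens, changed_even = chunk_pass(tokens, 0)
        tokens, changed_odd = chunk_pass(tokens, 1)
        if not (changed_even or changed_odd):
            return " ".join(tokens)
```

Each pass only looks at fixed, non-overlapping pairs, which is what makes the procedure easy to lay out in columns. The shifted pass is needed so that matches straddling two chunks are not missed.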

choltha commented Apr 11, 2024

Above (https://gist.github.com/VictorTaelin/8ec1d8a0a3c87af31c25224a1f7e31ec?permalink_comment_id=5018035#gistcomment-5018035) I submitted a version that was essentially identical to the original publicly shared one by @aniemerg, with only a very minor improvement.
As @aniemerg didn't officially submit his solution and it's a very good one imo, better than my original one, this is now the run for that prompt: (12/12)
https://gist.github.com/choltha/ea3ac9ba4de95aa75b58071547bb5684
