- Commands run: https://github.com/aidando73/OpenHands/pull/1/files
- Ran on OpenHands + CodeAct 2.1 (commit ecff5c67fb7f1995556f0f36f5050f33dc0953d2)
- Ran on SWE-bench Lite (300 instances)
- Provider: OpenRouter (the model strings below use LiteLLM's `openrouter/` prefix; see the first sketch after this list)
- Results (the second sketch after this list recomputes these rates from resolved counts):
  - `openrouter/meta-llama/llama-3.3-70b-instruct`
    - Score: 0.047
    - Total cost: $19.60 USD
    - Reference: OpenHands result for Llama 3.1 70B is 0.08 [1]
  - `openrouter/meta-llama/llama-3.1-405b-instruct`
    - Score: 0.053
    - Total cost: ~$110 USD
    - Reference: official OpenHands result for 405B (CodeAct 1.9) is 0.14 [1]
- My scores are a lot lower than the official ones. This could be because I used OpenRouter (depending on which provider they route to, the max tokens differ, and potentially the implementations as well; the third sketch after this list shows how to check the advertised limits), and also because my evaluation stopped and restarted many times due to errors.
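For context on the model strings above: OpenHands drives models through LiteLLM, so an `openrouter/` prefix routes the request to OpenRouter. A minimal sketch of such a call, assuming an `OPENROUTER_API_KEY` in the environment (the config OpenHands builds internally may differ):

```python
# Minimal sketch: calling an OpenRouter-hosted model through LiteLLM,
# using the same model-string convention as the runs above.
# Assumes OPENROUTER_API_KEY is set in the environment.
import litellm

response = litellm.completion(
    model="openrouter/meta-llama/llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
```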
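Assuming the reported score is resolved instances over total, the rates above correspond to roughly 14/300 ≈ 0.047 and 16/300 ≈ 0.053. A small sketch that recomputes this against the dataset's actual size:

```python
# Sketch: recompute a SWE-bench Lite resolve rate from a resolved count.
# The test split of princeton-nlp/SWE-bench_Lite contains 300 instances.
from datasets import load_dataset

lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
total = len(lite)  # expected: 300

for resolved in (14, 16):  # counts implied by the 0.047 and 0.053 scores
    print(f"{resolved}/{total} = {resolved / total:.3f}")
```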
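On the max-tokens point: OpenRouter's public model listing advertises a context length per model, which is one way to sanity-check what a given model string provides (though limits for the individual providers behind the router can still vary). A sketch, assuming the public `/api/v1/models` endpoint and its `context_length` field:

```python
# Sketch: query OpenRouter's public model listing for the advertised
# context length of the two models used in these runs.
import requests

data = requests.get("https://openrouter.ai/api/v1/models", timeout=30).json()["data"]
wanted = {
    "meta-llama/llama-3.3-70b-instruct",
    "meta-llama/llama-3.1-405b-instruct",
}
for model in data:
    if model["id"] in wanted:
        print(model["id"], "context_length:", model.get("context_length"))
```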
I've uploaded all the outputs to Hugging Face: https://huggingface.co/datasets/aidando73/open-hands-swe-bench-evals/tree/main
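If it's useful, a sketch for pulling those outputs locally with `huggingface_hub` (assuming the repo stays a public dataset repo):

```python
# Sketch: download the uploaded evaluation outputs for local inspection.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="aidando73/open-hands-swe-bench-evals",
    repo_type="dataset",
)
print("Downloaded to:", local_dir)
```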