- Commands run: https://github.com/aidando73/OpenHands/pull/1/files
- Ran on OpenHands + CodeAct 2.1 (commit ecff5c67fb7f1995556f0f36f5050f33dc0953d2)
- Ran on SWE-bench Lite (300 instances)
- Provider: OpenRouter (the model strings below use LiteLLM's `openrouter/` prefix; see the first sketch after this list)
- Results (the second sketch after this list recomputes these rates from resolved counts):
  - `openrouter/meta-llama/llama-3.3-70b-instruct`
    - Score: 0.047
    - Total cost: $19.60 USD
    - Reference: OpenHands result for Llama 3.1 70B is 0.08 [1]
  - `openrouter/meta-llama/llama-3.1-405b-instruct`
    - Score: 0.053
    - Total cost: ~$110 USD
    - Reference: official OpenHands result for 405B (CodeAct 1.9) is 0.14 [1]
- My scores are a lot lower than the official ones. This could be because I used OpenRouter (depending on which provider they route to, the max tokens differ, and potentially the implementations as well; the third sketch after this list shows how to check the advertised limits), and also because my evaluation stopped and restarted many times due to errors.
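For context on the model strings above: OpenHands drives models through LiteLLM, so an `openrouter/` prefix routes the request to OpenRouter. A minimal sketch of such a call, assuming an `OPENROUTER_API_KEY` in the environment (the config OpenHands builds internally may differ):

```python
# Minimal sketch: calling an OpenRouter-hosted model through LiteLLM,
# using the same model-string convention as the runs above.
# Assumes OPENROUTER_API_KEY is set in the environment.
import litellm

response = litellm.completion(
    model="openrouter/meta-llama/llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
```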
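Assuming the reported score is resolved instances over total, the rates above correspond to roughly 14/300 ≈ 0.047 and 16/300 ≈ 0.053. A small sketch that recomputes this against the dataset's actual size:

```python
# Sketch: recompute a SWE-bench Lite resolve rate from a resolved count.
# The test split of princeton-nlp/SWE-bench_Lite contains 300 instances.
from datasets import load_dataset

lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
total = len(lite)  # expected: 300

for resolved in (14, 16):  # counts implied by the 0.047 and 0.053 scores
    print(f"{resolved}/{total} = {resolved / total:.3f}")
```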
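On the max-tokens point: OpenRouter's public model listing advertises a context length per model, which is one way to sanity-check what a given model string provides (though limits for the individual providers behind the router can still vary). A sketch, assuming the public `/api/v1/models` endpoint and its `context_length` field:

```python
# Sketch: query OpenRouter's public model listing for the advertised
# context length of the two models used in these runs.
import requests

data = requests.get("https://openrouter.ai/api/v1/models", timeout=30).json()["data"]
wanted = {
    "meta-llama/llama-3.3-70b-instruct",
    "meta-llama/llama-3.1-405b-instruct",
}
for model in data:
    if model["id"] in wanted:
        print(model["id"], "context_length:", model.get("context_length"))
```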
I've uploaded all the outputs to Hugging Face: https://huggingface.co/datasets/aidando73/open-hands-swe-bench-evals/tree/main
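If it's useful, a sketch for pulling those outputs locally with `huggingface_hub` (assuming the repo stays a public dataset repo):

```python
# Sketch: download the uploaded evaluation outputs for local inspection.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="aidando73/open-hands-swe-bench-evals",
    repo_type="dataset",
)
print("Downloaded to:", local_dir)
```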