@aidando73
Last active December 26, 2024 02:17
SWE-Bench-Lite CodeAct 2.1 - Llama 405B/3.3 70B

Llama 3.3 70B Instruct

  • Score: 0.047
  • model: "openrouter/meta-llama/llama-3.3-70b-instruct"
  • Total cost: $19.60 USD
  • Reference: OpenHands result for 3.1 70B is 0.08 [1]

Llama 3.1 405B Instruct

  • Score: 0.053
  • model: "openrouter/meta-llama/llama-3.1-405b-instruct"
  • Total cost: ~$110 USD
  • Official OpenHands result for 405B (CodeAct 1.9) is 0.14 [1]
    • Mine is considerably lower. Possible causes: I used OpenRouter, which routes requests to different providers with different max-token limits (and potentially different implementations), and my evaluation stopped and restarted many times due to errors.
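To put the resolve rates in absolute terms, here is a minimal sketch converting them to approximate instance counts, assuming the scores are resolve rates over the standard 300-instance SWE-Bench-Lite split:

```python
# Convert SWE-Bench-Lite resolve rates to approximate resolved-instance counts.
# Assumes the standard 300-instance Lite split and that "Score" is a resolve rate.
LITE_INSTANCES = 300

def resolved_count(score: float, total: int = LITE_INSTANCES) -> int:
    """Round score * total to the nearest whole instance."""
    return round(score * total)

for label, score in [("Llama 3.3 70B", 0.047), ("Llama 3.1 405B", 0.053)]:
    print(f"{label}: ~{resolved_count(score)} / {LITE_INSTANCES} resolved")
```

So the two runs differ by only a couple of resolved instances, while the official 0.14 for 405B would be roughly 42.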

Eval outputs

I've uploaded all the outputs to Hugging Face: https://huggingface.co/datasets/aidando73/open-hands-swe-bench-evals/tree/main
