## Bug: G-Eval assertions with LiteLLM provider do not trigger API calls

### Describe the bug

When using promptfoo with LiteLLM, G-Eval assertions configured to use a LiteLLM provider (referenced by its string ID, e.g., `litellm:gemini-pro`) do not appear to trigger any API calls to the LiteLLM server for the evaluation step.

The initial response-generation call to LiteLLM works correctly and is logged by the LiteLLM server. However, the subsequent G-Eval call is missing from the LiteLLM server logs. As a result, `gradingResult` is often null or shows an error like "No output", indicating that the G-Eval LLM was never invoked.

---
### To Reproduce

Steps to reproduce the behavior, including example promptfoo configurations:

1. **Set up the LiteLLM server:** Start a LiteLLM server. For example:

   ```bash
   litellm --model gemini/gemini-pro --api_base http://localhost:4000
   ```

2. **Configure the promptfoo test suite:** Use a `TestSuiteConfig` similar to the one provided below.
3. **Define a LiteLLM provider** in the main `providers` array (e.g., `id: 'litellm:gemini-pro'`).
4. In a prompt's `assert` block, use a `g-eval` type.
5. For the g-eval assertion, set the `provider` field to the string ID of the LiteLLM provider (e.g., `provider: 'litellm:gemini-pro'`).
6. **Run the evaluation:** Execute `promptfoo.evaluate()` with this configuration.
7. **Monitor the LiteLLM logs:** Observe the logs of the LiteLLM server (a direct-request sketch for confirming that the proxy logs requests at all follows this list).
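
As a sanity check for step 7, a direct request to the proxy shows that requests reaching LiteLLM do appear in its logs, which isolates the missing G-Eval call to the promptfoo side. This is a minimal sketch assuming the proxy listens on `http://localhost:4000` and exposes the OpenAI-compatible `/chat/completions` route; the Authorization header is only needed if the proxy is configured with a key.

```typescript
// Minimal direct request to the LiteLLM proxy (Node 18+, global fetch).
// Assumptions: proxy at http://localhost:4000, OpenAI-compatible /chat/completions
// route, optional proxy key in LITELLM_PROXY_API_KEY.
(async () => {
  const response = await fetch('http://localhost:4000/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      ...(process.env.LITELLM_PROXY_API_KEY
        ? { Authorization: `Bearer ${process.env.LITELLM_PROXY_API_KEY}` }
        : {}),
    },
    body: JSON.stringify({
      model: 'gemini-pro',
      messages: [{ role: 'user', content: 'ping' }],
    }),
  });
  // If this request shows up in the LiteLLM logs but the G-Eval call never does,
  // the gap is in promptfoo rather than in the proxy setup.
  console.log(response.status, JSON.stringify(await response.json(), null, 2));
})();
```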
---

### See error

- The LiteLLM server logs show the initial request for generating the response to the prompt.
- No subsequent API request for the G-Eval step is logged by LiteLLM.
- The promptfoo output shows the initial response, but the `gradingResult` for the G-Eval assertion is null or contains an error indicating no output from the G-Eval model.

---
### Relevant promptfoo configuration snippet from `geval_litellm_test.ts`

```typescript
// Assuming constants like LITELLM_API_BASE_URL, LITELLM_MODEL_NAME_FOR_RESPONSE,
// LITELLM_MODEL_NAME_FOR_GEVAL, LITELLM_PROXY_API_KEY are defined.
// For this example, let's assume LITELLM_MODEL_NAME_FOR_RESPONSE and
// LITELLM_MODEL_NAME_FOR_GEVAL both resolve to 'gemini-pro'.

const promptsForTestSuite: PromptConfig[] = testCases.map(tc => ({
  raw: tc.question,
  label: `Question Capitale: ${tc.id}`,
  assert: [
    {
      type: 'g-eval',
      value: tc.gEvalCriteria,
      threshold: tc.gEvalThreshold,
      provider: `litellm:${LITELLM_MODEL_NAME_FOR_GEVAL}`, // This resolves to 'litellm:gemini-pro'
    },
  ],
}));

const testSuiteConfig: TestSuiteConfig = {
  prompts: promptsForTestSuite,
  providers: [
    {
      id: `litellm:${LITELLM_MODEL_NAME_FOR_RESPONSE}`, // This resolves to 'litellm:gemini-pro'
      config: {
        apiBaseUrl: LITELLM_API_BASE_URL,
        apiKey: LITELLM_PROXY_API_KEY || undefined,
        temperature: 0.1,
        max_tokens: 8096,
      },
    },
  ],
  // ... (outputPath, etc.)
};

const evaluateOptions: EvaluateOptions = {
  maxConcurrency: 1,
  showProgressBar: true,
  cache: false, // Cache disabled to ensure G-Eval is always attempted
};

// ... (call to promptfoo.evaluate(testSuiteConfig as any, evaluateOptions))
```
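
For reference, the evaluation is invoked and the results inspected roughly as follows. This is a sketch: it reuses the `testSuiteConfig` and `evaluateOptions` objects above, and the result field names (`results`, `gradingResult`, `prompt.label`) follow promptfoo's Node API types as I understand them, so they may need adjusting for your promptfoo version.

```typescript
// Sketch of the evaluation call and of how the missing gradingResult shows up.
// Reuses testSuiteConfig / evaluateOptions from the snippet above; result field
// names are assumptions based on promptfoo's EvaluateSummary/EvaluateResult types.
import promptfoo from 'promptfoo';

async function runGevalRepro() {
  const summary = await promptfoo.evaluate(testSuiteConfig as any, evaluateOptions);
  for (const result of summary.results) {
    // Expected: a populated gradingResult produced by the G-Eval call to LiteLLM.
    // Observed: null or a "No output" error, with no second request in the proxy logs.
    console.log(result.prompt?.label, JSON.stringify(result.gradingResult, null, 2));
  }
}

runGevalRepro().catch((err) => {
  console.error('Evaluation failed:', err);
  process.exit(1);
});
```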
---

### Expected behavior

promptfoo should make two distinct calls to the LiteLLM server for each test case:

1. One call using the provider `litellm:${LITELLM_MODEL_NAME_FOR_RESPONSE}` to generate the initial answer to the prompt.
2. A second call using the provider `litellm:${LITELLM_MODEL_NAME_FOR_GEVAL}` (as specified in the `assert.provider` field) to perform the G-Eval.

The LiteLLM server logs should reflect both of these calls. The promptfoo results should include a populated `gradingResult` object from the G-Eval LLM.
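
A cross-check that may help narrow this down: per my reading of the promptfoo docs, the grading provider for model-graded assertions can also be set once via `defaultTest.options.provider` instead of per assertion. If that path produces a second request in the LiteLLM logs while the assertion-level `provider` string does not, the issue is specific to how the assertion-level provider reference is resolved. A minimal sketch, assuming `defaultTest` and `options.provider` keep their documented shapes:

```typescript
// Cross-check: point all model-graded assertions (including g-eval) at LiteLLM via
// defaultTest.options.provider. Field names are taken from the promptfoo docs and
// may differ across versions; this reuses testSuiteConfig from the snippet above.
const testSuiteConfigWithDefaultGrader: TestSuiteConfig = {
  ...testSuiteConfig,
  defaultTest: {
    options: {
      provider: `litellm:${LITELLM_MODEL_NAME_FOR_GEVAL}`, // resolves to 'litellm:gemini-pro'
    },
  },
};

// const summary = await promptfoo.evaluate(testSuiteConfigWithDefaultGrader as any, evaluateOptions);
```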
---

### Screenshots

(Screenshots of the LiteLLM logs showing only one request, or of the promptfoo output with the missing `gradingResult`, can be added here for visual confirmation.)

---
### System information

- **promptfoo version:** [run `promptfoo --version` or check package.json]
- **LiteLLM version:** [if known, e.g., 1.30.0]
- **Node.js version:** [run `node --version`, e.g., v20.12.2]
- **OS:** Windows

---
### Additional context

- The issue persists whether `LITELLM_MODEL_NAME_FOR_RESPONSE` and `LITELLM_MODEL_NAME_FOR_GEVAL` are the same or different models.
- The problem was also observed when using an inline provider object directly in the G-Eval assertion's `provider` field instead of a string ID reference (see the sketch after this list).
- The `cache` option in `evaluateOptions` is set to `false` to ensure G-Eval is attempted on every run.
- The LiteLLM server is confirmed to be operational and accessible for the initial response generation.
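
For completeness, the inline provider variant mentioned above looked roughly like this. It is a sketch of the attempted configuration rather than a working setup: it reuses the constants from the main snippet, and the `config` keys mirror the ones used in the top-level `providers` array.

```typescript
// Inline provider object inside the g-eval assertion, as tried inside the
// testCases.map(...) callback above; it shows the same behavior (no G-Eval
// request in the LiteLLM logs). Constants come from the main snippet.
const buildGevalAssert = (tc: { gEvalCriteria: string; gEvalThreshold: number }) => ({
  type: 'g-eval',
  value: tc.gEvalCriteria,
  threshold: tc.gEvalThreshold,
  provider: {
    id: `litellm:${LITELLM_MODEL_NAME_FOR_GEVAL}`,
    config: {
      apiBaseUrl: LITELLM_API_BASE_URL,
      apiKey: LITELLM_PROXY_API_KEY || undefined,
    },
  },
});
```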