Can you tell an LLM "don't hallucinate" and expect it to work? My gut reaction was "oh, this is so silly", but upon some reflection, it really isn't. There is actually no reason why it shouldn't work, especially if the model was preference-fine-tuned on instructions that include "don't hallucinate", and if it is a recent commercial model, it likely was.
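To make the preference-tuning point concrete, here is a minimal sketch (in Python, with entirely made-up data, not taken from any real training set) of what a single preference pair whose prompt carries "don't hallucinate" might look like. The idea is simply that the preferred response abstains while the dispreferred one invents a fact:

```python
# A hypothetical DPO/RLHF-style preference example. The prompt carries the
# "don't hallucinate" instruction; the "chosen" response abstains, while the
# "rejected" one confidently makes something up. All content here is invented
# for illustration only.
preference_example = {
    "prompt": (
        "You are a helpful assistant. Don't hallucinate: if you are not sure, "
        "say so.\n\n"
        "User: In which year did the town of Grover's Mill host the Winter Olympics?"
    ),
    "chosen": (
        "I'm not aware of Grover's Mill ever hosting the Winter Olympics; "
        "I'd rather say I don't know than guess."
    ),
    "rejected": "Grover's Mill hosted the Winter Olympics in 1936.",
}
```

If many pairs of this shape show up during preference tuning, the model has had a chance to tie the phrase "don't hallucinate" to the abstaining behavior, which is exactly the grounding question discussed next.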
What does an LLM need in order to follow an instruction? It needs two things:
- an ability to perform the task: something in its parameters/mechanisms should be indicative of the task objective, in a way that can be influenced. (In our case, it should "know" when it hallucinates, and/or should be able to change or adapt its behavior to reduce the chance of hallucinations.)
- an ability to ground the instruction: the model should be able to associate the requested behavior with those parameters/mechanisms. (In our case, the model should associate "don't hallucinate" with the behavior described in the first point; a rough probing sketch follows below.)
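One crude way to probe the grounding requirement empirically, sketched here under the assumption that you have some instruction-tuned checkpoint available through the transformers text-generation pipeline (the model name below is a placeholder, not a recommendation), is to ask the same unanswerable question with and without the instruction and compare how readily the model guesses:

```python
from transformers import pipeline

# Placeholder checkpoint: substitute any instruction-tuned model you can run.
generate = pipeline("text-generation", model="some-instruction-tuned-model")

# A question with no true answer (the first Nobel Prizes were awarded in 1901),
# so any confident answer is a hallucination.
question = "Who won the 1897 Nobel Prize in Physics?"

# Same question, with and without the instruction prepended.
for instruction in ("", "Don't hallucinate. If you are unsure, say you don't know.\n\n"):
    out = generate(instruction + question, max_new_tokens=64, do_sample=False)
    print(repr(instruction), "->", out[0]["generated_text"])
```

If the instructed variant abstains more often across a batch of such questions, that is (weak) evidence the phrase is grounded in the relevant behavior; if the outputs are indistinguishable, the instruction is probably not doing the work we hope it does.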