The training row is one (problem, reasoning_trace) pair where each step of the trace carries a label: positive, negative, or neutral. This data is used to train the PRM — not the policy directly. At RL time, the PRM scores each step the policy emits, giving a dense reward signal instead of one scalar at the end of the response.
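As a concrete illustration, one training row might look like the sketch below. The field names (`problem`, `steps`, `label`) and the step contents are assumptions made for illustration, not a fixed schema from this project:

```python
# Hypothetical sketch of a single PRM training row: one problem plus a
# reasoning trace whose steps each carry a positive / negative / neutral label.
training_row = {
    "problem": "Maize seedlings show interveinal chlorosis on sandy soil, pH 5.2. Diagnose and recommend.",
    "steps": [
        {"text": "Chlorosis appearing on young leaves points toward an immobile-nutrient deficiency.",
         "label": "positive"},
        {"text": "Sandy, low-pH soil makes zinc or manganese deficiency a plausible differential.",
         "label": "positive"},
        {"text": "Apply 10 kg/ha of foliar zinc sulfate immediately.",
         "label": "negative"},   # dose asserted without checking the registered range
        {"text": "A tissue test before any corrective spray would confirm the diagnosis.",
         "label": "neutral"},
    ],
}
```

Only the labels are consumed when training the PRM; the policy never sees this row directly.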
Why this matters for an ag advisor: most agricultural advice has no clean outcome checker (no unit test, no assert yield == 4.2). But the reasoning steps inside a diagnosis are individually checkable by an agronomist — "is this the right differential?", "is this dose in the registered range?", "does this plan ignore the soil report?". Per-step labels are easier to collect than full gold demonstrations and give a much denser RL signal than outcome-only rewards.
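To make the "denser signal" point concrete, here is a minimal sketch of how a trained PRM could turn each emitted step into its own reward during RL, next to an outcome-only baseline. The `score_step` interface and the checker are hypothetical, not this project's API:

```python
from typing import Callable, List

def dense_step_rewards(prm, problem: str, steps: List[str]) -> List[float]:
    """Score every reasoning step with the PRM: one reward per step.

    Sketch only: `prm` is assumed to expose score_step(problem, prefix, step)
    returning a scalar in [0, 1]; that interface is an assumption.
    """
    rewards: List[float] = []
    prefix: List[str] = []
    for step in steps:
        rewards.append(prm.score_step(problem, prefix, step))
        prefix.append(step)
    return rewards

def outcome_reward(checker: Callable[[str, List[str]], float],
                   problem: str, steps: List[str]) -> float:
    """Outcome-only baseline: a single scalar for the whole response,
    so every step shares the same credit assignment."""
    return checker(problem, steps)
```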
The method follows OpenAI's Let's Verify Step by Step (Lightman et al., 2023; PRM800K), with one delib