Generative Verifiers: Reward Modeling as Next-Token Prediction


Keywords: LLM reasoning, reward models, verifiers

TL;DR: We pose the problem of reward modeling as next-token prediction, allowing for chain-of-thought reasoning and majority voting in reward models.

Abstract: Verifiers or reward models are often used to enhance the reasoning performance of large language models (LLMs). A common approach is the Best-of-N method, where N candidate solutions generated by the LLM are ranked by a verifier, and the best one is selected. While LLM-based verifiers are typically trained as discriminative classifiers to score solutions, they do not utilize the text generation capabilities of pretrained LLMs. To overcome this limitation, we instead propose training verifiers using the ubiquitous next-token prediction objective, jointly on verification and solution generation. Compared to standard verifiers, such generative verifiers (GenRM) can benefit from several advantages of LLMs: they integrate seamlessly with instruction tuning, enable chain-of-thought reasoning, and can utilize additional test-time compute via majority voting for better verification. We demonstrate that GenRM outperforms discriminative verifiers, DPO verifiers, and LLM-as-a-Judge, resulting in a 16-40% improvement in the number of problems solved with Best-of-N on algorithmic and math reasoning tasks.


[Paper Review] Generative Verifiers: Reward Modeling as Next-Token Prediction

The paper titled "Generative Verifiers: Reward Modeling as Next-Token Prediction" introduces a novel approach to improve the reasoning performance of large language models (LLMs) by utilizing generative verifiers (GenRM) that leverage next-token prediction for verification tasks. This method contrasts with traditional discriminative reward models that typically rank solutions based on binary correctness scores, thereby not capitalizing on the LLM's generative capabilities.

Core Methodology Overview

1. Generative Verifiers (GenRM):

  • The authors propose that verifiers should be trained with a generative approach, specifically using next-token prediction. GenRM aims to assess whether candidates (solutions) generated by language models are correct by predicting tokens that signify correctness (e.g., 'Yes' or 'No').
  • Unlike conventional methods where correctness is output directly as a score, GenRM uses the probabilistic framework of LLMs: correctness is expressed through the model's output probabilities. Thus, when provided with a candidate solution, the model assesses correctness by generating a 'Yes' or 'No' token, and the verifier score is the probability assigned to 'Yes' (a minimal sketch of this scoring follows below).
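
Below is a minimal sketch of this scoring step, assuming a Hugging Face causal LM. The model name and prompt template are illustrative placeholders rather than the authors' exact setup; the verifier score for a candidate solution is simply the probability the model assigns to "Yes" as the next token.

```python
# Minimal sketch of GenRM-style scoring: the reward is the probability the
# verifier assigns to a "Yes" token after a verification prompt.
# MODEL_NAME and the prompt template are placeholders, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-2b-it"  # placeholder; any instruction-tuned causal LM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def genrm_score(question: str, solution: str) -> float:
    """Return p('Yes') as the verifier score for a candidate solution."""
    prompt = (
        f"Question: {question}\n"
        f"Proposed solution: {solution}\n"
        f"Is the solution correct? Answer Yes or No.\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]  # logits for the next token
    yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[-1]
    no_id = tokenizer(" No", add_special_tokens=False).input_ids[-1]
    probs = torch.softmax(next_token_logits[[yes_id, no_id]], dim=-1)  # renormalize over {Yes, No}
    return probs[0].item()

# Best-of-N selection: score all N candidate solutions and keep the highest-scoring one.
# best = max(candidates, key=lambda s: genrm_score(question, s))
```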

    Could teaching AI to grade like a human actually make it *better* at thinking?

    Overview

    • The paper discusses a novel approach called "Generative Verifiers" for learning reward models by treating reward modeling as a next-token prediction task.
    • The key idea is to train a language model with the ordinary next-token prediction objective, where the target tokens encode the desired reward signal, for example a 'Yes' or 'No' correctness judgment (see the sketch after this list).
    • This approach aims to overcome limitations of existing reward modeling techniques and enable more scalable and general reward learning.
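
To make this concrete, here is a minimal sketch of how verification data can be cast as ordinary next-token prediction for supervised fine-tuning. The prompt template and field names are assumptions for illustration, not the paper's released data format.

```python
# Minimal sketch: verification as next-token prediction for SFT.
# The prompt template and field names are illustrative placeholders.

def make_verification_example(question: str, solution: str, is_correct: bool) -> dict:
    """Build one SFT example whose target is simply the token ' Yes' or ' No'."""
    prompt = (
        f"Question: {question}\n"
        f"Proposed solution: {solution}\n"
        f"Is the solution correct? Answer Yes or No.\nAnswer:"
    )
    target = " Yes" if is_correct else " No"
    # A standard SFT pipeline computes cross-entropy only on the completion tokens
    # (prompt tokens are masked with label -100), so the verifier is trained with
    # the same objective used for solution generation.
    return {"prompt": prompt, "completion": target}

# The same mix can also include ordinary solution-generation examples
# (prompt = question, completion = reference solution), which is the joint
# "verification + generation" training the paper describes.
```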

    Plain English Explanation

    The paper presents a new way to train AI systems to understand and learn rewards. Traditionally, reward models have been trained using specialized techniques that can be complex and limited in their applicability. The authors propose a simpler approach that treats reward modeling as a language modeling task.

    The core idea is to train an AI model, similar to a predictive text system, to guess the next word in a sequence. However, instead of predicting the next word in a normal text, the model is trained to predict the "reward" signal - a measure of how good or desirable an action or outcome is.

    By framing reward modeling as next-token prediction, the verifier inherits the familiar strengths of language models: it can be instruction-tuned, reason step by step with chain-of-thought, and spend additional test-time compute on majority voting to produce more reliable correctness judgments.

    Reward Modeling

    Reward models are core to the modern approach to RLHF. They have been used extensively in reinforcement learning research as a proxy for environment rewards [1]. The practice is closely related to inverse reinforcement learning, where the problem is to approximate an agent's reward function given trajectories of behavior [2], and to other areas of deep reinforcement learning. Reward models were proposed, in their modern form, as a tool for studying the value alignment problem [3].

    The most common reward model predicts the probability that a piece of text would be preferred over another, based on the training comparisons. Later in this section we also compare these to Outcome Reward Models (ORMs), which predict the probability that a completion results in a correct answer, and Process Reward Models (PRMs), which assign a score to each step in a reasoning chain. When not otherwise indicated, the reward models mentioned are those predicting preference between texts.

    Training Reward Models

    There are two popular expressions for how to train a standard reward model for RLHF – they are numerically equivalent. The canonical implementation is derived from the Bradley-Terry model of pairwise preference.
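
As a concrete reference, below is a minimal sketch of the pairwise Bradley-Terry loss used to train such preference reward models: the model produces scalar scores for the preferred and dispreferred responses, and the loss is the negative log-sigmoid of their difference. Names and shapes are illustrative.

```python
# Minimal sketch of the Bradley-Terry pairwise loss for preference reward models:
# loss = -log sigmoid(r_chosen - r_rejected), averaged over the batch.
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """r_chosen, r_rejected: [batch] scalar rewards for preferred / dispreferred text."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Example with dummy scores (in practice these come from a reward head on an LM):
# loss = bradley_terry_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.5]))
```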

    Abstract

    Verifiers or reward models are often used to enhance the reasoning performance of large language models (LLMs). A common approach is the Best-of-N method, where N candidate solutions generated by the LLM are ranked by a verifier, and the best one is selected. While LLM-based verifiers are typically trained as discriminative classifiers to score solutions, they do not utilize the text generation capabilities of pretrained LLMs. To overcome this limitation, we instead propose training verifiers using the ubiquitous next-token prediction objective, jointly on verification and solution generation. Compared to standard verifiers, such generative verifiers (GenRM) can benefit from several advantages of LLMs: they integrate seamlessly with instruction tuning, enable chain-of-thought reasoning, and can utilize additional test-time compute via majority voting for better verification. We demonstrate that GenRM outperforms discriminative verifiers, DPO verifiers, and LLM-as-a-Judge, resulting in a 16-40% improvement in the number of problems solved with Best-of-N on algorithmic and math reasoning tasks. Furthermore, we find that training GenRM with synthetic verification rationales is sufficient to pick out subtle errors on math problems.
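
The abstract's point about spending additional test-time compute via majority voting can be illustrated with a short sketch: sample several verification rationales from the generative verifier and average the probability of "Yes" across them. The model name, prompt template, and sampling settings below are placeholders, not the paper's exact configuration.

```python
# Minimal sketch of a chain-of-thought generative verifier with majority voting:
# sample K verification critiques, read off p("Yes") after each, and average.
# MODEL_NAME, prompts, and sampling settings are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-2b-it"  # placeholder verifier checkpoint
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def p_yes(prompt: str) -> float:
    """Probability of ' Yes' vs ' No' as the next token after the prompt."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]
    yes = tok(" Yes", add_special_tokens=False).input_ids[-1]
    no = tok(" No", add_special_tokens=False).input_ids[-1]
    return torch.softmax(logits[[yes, no]], dim=-1)[0].item()

def genrm_cot_score(question: str, solution: str, num_votes: int = 8) -> float:
    """Average p('Yes') over several sampled chain-of-thought critiques."""
    base = (f"Question: {question}\nProposed solution: {solution}\n"
            f"Think step by step about whether the solution is correct.\nCritique:")
    scores = []
    for _ in range(num_votes):
        out = model.generate(**tok(base, return_tensors="pt"),
                             do_sample=True, temperature=0.7, max_new_tokens=256)
        critique = tok.decode(out[0], skip_special_tokens=True)  # prompt + sampled critique
        scores.append(p_yes(critique + "\nIs the solution correct? Answer:"))
    return sum(scores) / len(scores)  # averaged votes act as the final verifier score
```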