Reward Functions

This guide will help you understand what reward functions are, how they work, how we use them, and how you can write better ones.

Since this rabbit hole is deep, we will try to keep things as simple and actionable as possible. Also note that this is an area of active research and we are all still learning; we will do our best to share the latest insights and best practices. We recommend revisiting this page from time to time for the latest updates.

This is a difficult topic, but we are happy to help you! Please reach out to us via email at support@augento.ai.

What is a reward function?

Simply put, a reward function is a function that takes an action and returns a score. The score is then used to update the model.1
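In code, that can be as simple as the sketch below. The exact signature depends on your setup; here we assume the reward function receives an object whose completion field holds the model’s text output, as in the examples later in this guide:

def reward_function(completion) -> float:
    # Score the model's output: 1 if it contains what we want, 0 otherwise.
    # What counts as "correct" is entirely up to you.
    return 1.0 if "expected output" in completion.completion else 0.0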

Reward functions are the bread and butter of reinforcement learning.

Training a dog 🐶

Before we dive into technical details, let’s take a look at how humans train dogs, since it’s a similar process.

Imagine you want to train your dog to sit. The process, expressed as a flow chart, looks like this:

We can learn a lot from this process:

  • We didn’t tell the dog what “sit” means. The first time the dog sat, it did so by chance, not because it knew what “sit” means.
  • The dog learned the association between the prompt “sit”, the action of sitting, and the reward of a treat.
  • The dog will now sit more often because it knows it will get a treat.

Let’s translate this to reinforcement learning.

A robot dog

Now let’s imagine you want to build a robotic dog powered by an LLM, and you want to train this robot dog to sit.

The process stays the same, but now we have to translate it into LLM terms.

  1. Instead of taking the dog for a walk, we will run the LLM for you.
  2. Instead of talking to the dog, you prompt the LLM by writing “sit” to it.
  3. Instead of performing a physical action, the LLM will generate text, e.g. to sit the LLM could generate <action>sit</action>.
  4. Instead of looking at the dog, we will give you the text output of the LLM and you check whether the action was correct, i.e. whether <action>sit</action> appears in the text output.
  5. Instead of giving the dog a treat, you tell us that we should reward the LLM for the action or punish it because it was wrong.2

Concretely, the training loop looks like this:

  1. We take care of saving the prompts to the LLM (that is our interceptor service).
  2. We will replay the prompt to the LLM and save the output.
  3. We will send the output to your reward function.
  4. Your reward function checks if the action was correct (for example like so):

def reward_function(completion):
    if "<action>sit</action>" in completion.completion:
        return 1
    else:
        return 0

  5. We will use the output of the reward function to update the LLM.

Writing good reward functions

Build Stable Reward Functions

Because of the way our training works, it’s important that the reward function is stable, i.e. the reward for the same action should always be the same (or at least within reasonable bounds).

This is important because the reward function is used to update the model. If the reward function is not stable, the model will not learn.
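As a rule of thumb, prefer deterministic checks over anything that can change between runs. Here is a minimal sketch of a stable check versus an unstable one (the random “judge” below is only an illustration of instability):

import random

def stable_reward(completion):
    # Deterministic: the same completion always gets the same score.
    return 1 if "<action>sit</action>" in completion.completion else 0

def unstable_reward(completion):
    # Unstable: the score for the same completion changes between calls,
    # e.g. a noisy judge or a flaky external service. Avoid this.
    return 1 if random.random() > 0.5 else 0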

Align Rewards Directly with Task Objectives

Define the reward to reflect the exact success criteria of your domain task. The closer the reward is to “what you actually care about,” the better.

“The key is having a well-designed reward closely aligned with the goal; otherwise the model will not learn the desired behaviors.”3

For example:

  • For math problems: Reward the model only when it produces the correct answer
  • For coding tasks: Give positive reward for code that passes all unit tests

Avoid indirect metrics that merely correlate with success without guaranteeing it. For example, don’t reward the number of equations written in the hope that it encourages reasoning – the model might simply write many irrelevant equations.

In practice, design your reward function to be as simple as possible and directly reflect the success criteria of the task.

Here’s an example for a C code compilation task:

import os

def reward_function(completion):
    code = completion.completion
    with open("code.c", "w") as f:
        f.write(code)
    try:
        # os.system returns the exit status: 0 means the code compiled and ran successfully
        if os.system("gcc code.c -o a.out && ./a.out") == 0:
            return 1
        return 0
    finally:
        os.remove("code.c")

Or if you want to differentiate between successful compilation and execution:

import os

def reward_function(completion):
    code = completion.completion
    with open("code.c", "w") as f:
        # Write the generated code to a file
        f.write(code)
    reward = 0
    try:
        if os.system("gcc code.c -o a.out") == 0:
            # Compilation successful: partial reward
            reward = 1
            if os.system("./a.out") == 0:
                # Execution successful: full reward
                reward = 2
    finally:
        os.remove("code.c")
    return reward

Leverage Verifiable Ground-Truth Signals

Whenever possible, use verifiable rewards in your system - these are clear-cut checks that confirm whether the model succeeded.

In their simplest form, these rewards are binary:

  • 1 for correct outputs
  • 0 for incorrect ones

This approach has proven powerful in recent AI advances like DeepSeek-R1 and Tülu-3.

Why are verifiable rewards so effective?

  1. Perfect alignment with ground truth: The model learns to produce the known correct outcome.

  2. Easy design: Create these without machine learning expertise using simple pattern matching or output verification.

  3. Resistance to reward hacking4: It’s nearly impossible to game a hard verification check - an answer is either correct or it isn’t.

For example, with math word problems, you can have your model produce its final answer after a specific token (like “####”), then automatically compare that to the known solution.
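A minimal sketch of such a check, assuming the ground-truth answer is made available to the reward function (how it gets there depends on your setup):

def reward_function(completion, ground_truth="42"):
    text = completion.completion
    if "####" not in text:
        # No final answer in the expected format
        return 0
    # Take everything after the last "####" marker as the final answer
    answer = text.split("####")[-1].strip()
    return 1 if answer == ground_truth else 0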

Pro tip: Implement verifiable checks whenever your domain allows it - they work particularly well for math, coding, or any task with clearly defined correct outcomes.

Provide Shaping or Partial Rewards to Guide Learning

While the final answer should earn the biggest reward, smaller bonuses for progress can speed up learning.5

This approach, called “reward shaping,” acknowledges steps in the journey, not just the destination.

For example:

  • Instead of only rewarding full math problem solutions, give smaller rewards for correct intermediate steps
  • For programming, reward passing individual test cases before the entire suite

DeepSeek-R1 used this strategy to tackle complex reasoning problems that require multiple steps.
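For the coding case, one simple form of shaping is to give credit proportional to the fraction of unit tests that pass. Here is a sketch, assuming each test is a self-contained shell command that exits with 0 on success (the file name and test commands are placeholders):

import subprocess

def reward_function(completion, test_commands=("python -m pytest tests/test_a.py", "python -m pytest tests/test_b.py")):
    # Write the generated code to a file so the tests can run against it
    with open("solution.py", "w") as f:
        f.write(completion.completion)
    passed = 0
    for cmd in test_commands:
        # Each command exits with 0 if its test passes
        if subprocess.run(cmd, shell=True).returncode == 0:
            passed += 1
    # Partial credit: fraction of tests passed (1.0 means the whole suite passed)
    return passed / len(test_commands)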

Benefits of this approach:

  • Helps the model “think in stages”
  • Solves the “credit assignment problem” by clarifying which parts of the work are good
  • Prevents discouragement from total failure

Even simple format compliance can be worth rewarding. DeepSeekMath gave a small 0.1 reward just for providing an answer in the expected format.6
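A sketch of this kind of shaping (the 0.1 format bonus and the “#### <answer>” format are illustrative choices, not the exact DeepSeekMath setup):

import re

def reward_function(completion, ground_truth="42"):
    text = completion.completion
    reward = 0.0
    # Small bonus just for answering in the expected "#### <answer>" format
    match = re.search(r"####\s*(.+)", text)
    if match:
        reward += 0.1
        # The big reward is reserved for the correct final answer
        if match.group(1).strip() == ground_truth:
            reward += 1.0
    return reward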

Pro tip: Use small interim rewards to encourage exploration, but always save the biggest rewards for complete success to prevent the model from settling for “good enough.”

Summary

Writing effective reward functions requires four key principles:

  1. Build stable functions that consistently reward the same behaviors
  2. Align rewards directly with task objectives rather than proxy metrics
  3. Use verifiable ground-truth signals for clear success/failure feedback
  4. Consider partial rewards to guide the learning process

The best reward functions are simple, directly tied to success criteria, and provide unambiguous signals about correct behavior.

Footnotes

  1. In machine learning research, the term “policy” is used to refer to the model.

  2. In reinforcement learning, the term “punishment” is used to refer to the opposite of a reward. Please don’t punish your dog for not sitting.

  3. DeepSeek Explained 6

  4. Reward hacking refers to when a model finds ways to maximize its reward signal without actually achieving the intended goal. It’s like a student who finds loopholes in grading criteria rather than truly learning the material. For example, a model might learn to generate answers that superficially match expected patterns without solving the underlying problem. With verifiable rewards based on hard checks (like exact answer matching), this becomes much harder since the model can’t “trick” the verification process - it must actually produce the correct result.

  5. Comprehensive Overview of Reward Engineering and Shaping in Advancing Reinforcement Learning Applications

  6. “Reinforcement Learning from Verifiable Rewards”, Label Studio, discussing how verifiable rewards are simple checks that determine whether an output meets a predefined correctness criterion.
