Reward Function
Now that you have set up and connected your agent to Augento, you can start preparing for Reinforcement Learning.
For this you will need a reward function that, during training, grades how good or bad the model's outputs were.
For example, to teach a model to write correct code in a programming language, the reward function could simply return a binary reward based on whether the model's code output compiles.
from pl import compiler

async def reward(completion: Completion) -> float:
    try:
        compiler.compile(completion)
        return 1.0
    except Exception:
        return 0.0
Set Up a Reward Function Server
On the Augento Platform, a reward function is simply a REST endpoint with a POST route that takes the model's completion as input and returns a scalar reward as output.
You can host it on your own machines (preferable when the reward function needs to access your environment) or on a compute platform (we recommend fly.io).
To get you started quickly, we provide templates in Python and TypeScript:
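As an illustrative sketch (not one of the official templates; the route path and grading logic here are placeholders), a reward function server in Python could look like this, assuming Flask:

```python
# Minimal reward function server sketch using Flask.
# NOTE: illustrative only -- the official templates may differ,
# and the grading logic below is a placeholder.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/", methods=["POST"])
def reward():
    body = request.get_json()
    completion = body["completion"]  # the model output to grade
    # body["prompt_messages"] and body["extra_data"] are also available
    # for context when computing the reward.
    score = 1.0 if completion.strip() else 0.0  # placeholder grading
    return jsonify({"reward": score})

# Run locally with: app.run(port=8000)
```

The endpoint receives the request body described below and responds with a single scalar reward.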
Reward Function Endpoint
The interface of the POST route on the reward function server has to adhere to the following specification.
Request body
| Field | Description |
| --- | --- |
| prompt_messages | The previous conversation, in the exact same format as specified by the OpenAI API |
| completion | The completion output by the model during training, to be graded by the reward function server |
| extra_data | Additional data that adds context for the verification function |
{
  "prompt_messages": [
    {
      "role": "developer",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Hello!"
    }
  ],
  "completion": "Hello! I am a useful language model",
  "extra_data": {
    "prompt_id": "example-001",
    "notes": "Sample request for grading"
  }
}
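For reference, the request body can be modeled with a small dataclass on the server side (the class and field names here are illustrative, not part of the API):

```python
# Hypothetical model of the reward request body, matching the
# fields documented above.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class RewardRequest:
    prompt_messages: list[dict[str, str]]  # OpenAI-style chat messages
    completion: str                        # model output to grade
    extra_data: dict[str, Any] = field(default_factory=dict)

req = RewardRequest(
    prompt_messages=[{"role": "user", "content": "Hello!"}],
    completion="Hello! I am a useful language model",
)
```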
Returns
| Field | Description |
| --- | --- |
| reward | A scalar reward that grades the model's completion |
{
  "reward": 0.5
}
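Since the reward is a scalar, it is natural to serialize it as a JSON number rather than a string. A quick sketch in Python (the helper name is illustrative):

```python
import json

def make_reward_response(reward: float) -> str:
    # Serialize the reward as a JSON number so the platform
    # can parse it as a scalar.
    return json.dumps({"reward": float(reward)})

print(make_reward_response(0.5))  # prints {"reward": 0.5}
```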