The overall training process is a three-step feedback cycle between the human, the agent’s understanding of the goal, and the RL training. Our AI agent starts by acting randomly in the environment. Periodically, two video clips of its behavior are shown to a human, who decides which of the two comes closer to fulfilling the goal (in this case, a backflip). The AI gradually builds a model o
![Learning from human preferences](https://cdn-ak-scissors.b.st-hatena.com/image/square/d71211d09a37af7c757d16b1222f6a2ce81edc26/height=288;version=1;width=512/https%3A%2F%2Fimages.openai.com%2Fblob%2F745ba770-7a51-45b1-ab65-7b6a4398f385%2Flearning-from-human-preferences.png%3Ftrim%3D0%252C0%252C691%252C0%26width%3D1000%26quality%3D80)
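
To make the cycle more concrete, here is a minimal sketch of the comparison step only: a reward model is nudged toward whichever of two clips the human preferred. Everything in it is an illustrative assumption rather than the actual implementation: the reward model is linear, the preference likelihood is a logistic (Bradley-Terry style) function of the difference in predicted return, clips are random feature sequences instead of real agent behavior, and `simulated_human` is a hypothetical stand-in for a person watching videos. The RL step that would train the agent against the learned reward is omitted.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def clip_features(clip):
    """Sum each feature over the time steps of a clip."""
    return [sum(step[i] for step in clip) for i in range(len(clip[0]))]


def predicted_return(weights, clip):
    """Total predicted reward of a clip under a linear reward model (assumption)."""
    return sum(w * f for w, f in zip(weights, clip_features(clip)))


def update_reward_model(weights, clip_a, clip_b, human_prefers_a, lr=0.1):
    """One gradient-ascent step on the log-likelihood of a single comparison."""
    diff = predicted_return(weights, clip_a) - predicted_return(weights, clip_b)
    p_a = sigmoid(diff)  # model's probability that the human prefers clip_a
    label = 1.0 if human_prefers_a else 0.0
    fa, fb = clip_features(clip_a), clip_features(clip_b)
    return [w + lr * (label - p_a) * (a - b) for w, a, b in zip(weights, fa, fb)]


def random_clip(rng, steps=5, dims=2):
    """Stand-in for a video clip: a short sequence of feature vectors."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dims)] for _ in range(steps)]


def simulated_human(clip_a, clip_b, hidden_goal=(1.0, -0.5)):
    """Hypothetical judge: prefers the clip with the higher hidden 'true' return."""
    return predicted_return(hidden_goal, clip_a) > predicted_return(hidden_goal, clip_b)


def train_reward_model(num_comparisons=300, seed=0):
    """Run the compare-and-update part of the cycle on random clips."""
    rng = random.Random(seed)
    weights = [0.0, 0.0]
    for _ in range(num_comparisons):
        clip_a, clip_b = random_clip(rng), random_clip(rng)  # stand-in for agent behavior
        choice = simulated_human(clip_a, clip_b)             # human picks a clip
        weights = update_reward_model(weights, clip_a, clip_b, choice)
        # A full system would now run RL against the updated reward model.
    return weights


def agreement(weights, num_pairs=200, seed=1):
    """Fraction of fresh clip pairs where the model and the judge agree."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_pairs):
        a, b = random_clip(rng), random_clip(rng)
        model_prefers_a = predicted_return(weights, a) > predicted_return(weights, b)
        hits += model_prefers_a == simulated_human(a, b)
    return hits / num_pairs
```

Calling `agreement(train_reward_model())` gives a rough sense of how well the learned reward matches the simulated judge on clip pairs it has not seen. In this sketch the judge is noise-free and linear, which is far easier than real human feedback on video.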