The Power of Reinforcement Learning from Human Feedback

[Image: A human with Artificial General Intelligence (AGI)]

ChatGPT set the record for the fastest-growing consumer application in history, reaching 100 million monthly active users just two months after its launch. In a recent interview with Lex Fridman, OpenAI’s CEO Sam Altman explained the magic behind ChatGPT’s popularity. While the ChatGPT model was trained on a massive amount of data, he argued that it wasn’t the underlying model that mattered, but rather its usability that fueled its rapid adoption.

He attributed that usability to its chat interface and to a technique used to improve the model after it was trained, called Reinforcement Learning from Human Feedback (RLHF).

Creating an Engaging Experience

Putting the model behind an interface that enables humans to have a dialogue with it is what makes the model engaging to users. It creates an interactive and intuitive experience in which users feel like they are conversing with a human-like assistant, making it easier for them to obtain the information and assistance they want. Even though each interaction with ChatGPT is stateless, meaning the model does not actually remember anything between questions, the application in front of the model maintains context for each conversation thread, allowing ChatGPT to hold a conversation, answer follow-up questions, and admit to and correct previous mistakes.
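To make that statelessness concrete, here is a minimal sketch of how an application can keep the conversational context on the client side and resend it with every request. The `call_model` function and the message format are assumptions for illustration, not ChatGPT’s actual implementation.

```python
# Minimal sketch of client-side conversation state. The model itself is
# stateless, so the application resends the accumulated message history
# with every request. `call_model` is a hypothetical stand-in for a real
# chat-completion API call.

def call_model(messages: list[dict]) -> str:
    """Hypothetical request to a stateless chat model."""
    raise NotImplementedError("Replace with a real API call.")

class Conversation:
    def __init__(self, system_prompt: str):
        # The full thread lives in the application, not in the model.
        self.messages = [{"role": "system", "content": system_prompt}]

    def ask(self, user_input: str) -> str:
        self.messages.append({"role": "user", "content": user_input})
        reply = call_model(self.messages)  # entire history sent each time
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```

Because the whole history travels with each request, the model can refer back to earlier turns without storing anything itself.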

The Impact of Reinforcement Learning from Human Feedback

But perhaps the most critical aspect of its success is the application of Reinforcement Learning from Human Feedback (RLHF). This process combines reinforcement learning and human input to improve the performance and alignment of AI models. In this approach, a base model is first trained on a vast dataset, after which human feedback is collected to fine-tune the model’s responses. The feedback often involves comparing multiple model outputs and selecting the one that is more helpful or accurate according to human judgment. This feedback is then used to adjust the model through reinforcement learning, ultimately leading to a more useful, user-friendly, and aligned AI system that better caters to human needs and preferences.
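As a rough illustration of the comparison step, the sketch below trains a reward model on pairwise human preferences using a standard Bradley–Terry-style loss. The tiny linear reward model and the random feature vectors are assumptions made for brevity; production systems score full text with a language-model backbone.

```python
# Sketch: learning a reward model from pairwise human preferences.
# Assumes responses are already encoded as fixed-size feature vectors.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # maps a response encoding to a scalar reward

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy stand-in for labeled comparisons: `chosen` was preferred over `rejected`.
chosen = torch.randn(32, 128)
rejected = torch.randn(32, 128)

for step in range(100):
    # Bradley-Terry-style objective: push the preferred response's reward
    # above the rejected response's reward.
    loss = -torch.nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained reward model then scores new outputs, providing the reward
# signal for a reinforcement-learning step (e.g., PPO) that fine-tunes
# the language model toward responses humans prefer.
```

The key point is that humans only rank outputs; the reward model generalizes those rankings so reinforcement learning can optimize against them at scale.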

Without this process, the model would still perform well on evaluations, such as passing tests, since the knowledge is present; it just may not be easy for humans to use. Adding human guidance to a giant language model significantly improves its usability. This alignment with human expectations creates the impression that the AI understands users’ questions and genuinely attempts to help. Surprisingly, this level of alignment can be achieved with relatively little data and human supervision.

RLHF is Critical for Artificial General Intelligence

But the significance and impact of RLHF cannot be overstated, especially as we march towards Artificial General Intelligence (AGI). AGI is the ultimate form of AI, with the ability to understand, learn, and apply knowledge at a level equal to or beyond human capabilities. ChatGPT, like all current AI systems, is considered narrow or specialized AI, as it can only perform specific tasks, while AGI will adapt to new situations, solve unfamiliar problems, and transfer knowledge from one domain to another.

Applying RLHF to align AGI with human values is crucial for ensuring that AGI serves humanity’s best interests and operates ethically. RLHF will allow AGI to learn from human feedback, enabling it to better understand human preferences, ethical concerns, and context-specific nuances. Incorporating human values into AGI’s decision-making processes will enhance the likelihood that AGI will act in a manner beneficial and safe for humanity.

The Importance of Aligning AGI with Human Values

If AGI is not aligned with human values, there could be severe negative consequences. An AGI system that operates without regard for human values might prioritize objectives that are misaligned with human interests, potentially leading to unintended harmful actions. It could also exacerbate existing societal issues such as bias, inequality, and misinformation. Moreover, a misaligned AGI system could pose an existential risk if it were to act on goals that directly or indirectly threaten human civilization.

Aligning AGI with human values is a critical step in ensuring the safe and responsible development of AGI systems, and RLHF is one of the most promising approaches we have today to perform this alignment.