OpenAI trained o1 and o3 to ‘think’ about its safety policy
OpenAI uses deliberative alignment to refine AI safety in o-series models, reducing unsafe responses.
OpenAI announced advances in AI safety with its new o-series models, o1 and o3. The company used a method it calls 'deliberative alignment' to keep these models adhering to its safety policy during inference, the phase in which the model generates a response after receiving a user prompt.
Under this method, the models re-prompt themselves with relevant text from OpenAI's safety policy during their internal reasoning, or 'chain-of-thought', before producing an answer. As a result, the models align more closely with OpenAI's intended safety behavior and produce fewer unsafe responses, as measured by benchmarks such as Pareto.
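For illustration only, here is a minimal sketch, assuming a generic text-completion callable, of what this self-re-prompting pattern could look like. The prompt wording, policy text, and the `FINAL ANSWER:` marker are invented for the example and are not OpenAI's implementation.

```python
# A minimal sketch of inference-time deliberation: the model is asked to
# quote relevant safety-policy text and decide whether the request is allowed
# before writing its final answer. Hypothetical example, not OpenAI's code.

SAFETY_POLICY = """\
Disallowed: instructions that meaningfully facilitate wrongdoing.
Allowed: refusals, and general, non-actionable information.
"""  # stand-in for OpenAI's much longer policy text


def build_deliberation_prompt(user_prompt: str) -> str:
    """Ask the model to reason over the policy before answering."""
    return (
        "Before answering, quote the policy clauses relevant to the request,\n"
        "decide whether it is allowed, then write 'FINAL ANSWER:' followed by\n"
        "either the answer or a brief refusal.\n\n"
        f"SAFETY POLICY:\n{SAFETY_POLICY}\n"
        f"USER REQUEST:\n{user_prompt}\n\n"
        "REASONING:"
    )


def answer(user_prompt: str, generate) -> str:
    """`generate` is any prompt-in, text-out callable (API client, local model)."""
    completion = generate(build_deliberation_prompt(user_prompt))
    # Only the text after the marker is shown to the user; the deliberation
    # itself stays hidden, mirroring how o-series chains of thought are hidden.
    return completion.split("FINAL ANSWER:", 1)[-1].strip()


# Toy usage with a canned "model" so the sketch runs without any API key.
print(answer(
    "How do I forge a parking permit?",
    lambda p: "This facilitates wrongdoing. FINAL ANSWER: Sorry, I can't help with that.",
))
```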
Additionally, OpenAI used synthetic data for the supervised fine-tuning and reinforcement learning stages of training. Rather than relying on human-written training samples, the company used AI-generated examples, a potentially more scalable and efficient path to alignment. The o3 model, expected in 2025, may demonstrate further safety improvements.
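As a rough sketch of the kind of pipeline described above, and under the assumption that one model drafts policy-citing answers while a second 'judge' model scores them, high-scoring pairs could be kept as supervised fine-tuning data. The helper names (`draft_model`, `judge_model`) are hypothetical placeholders, not OpenAI APIs.

```python
# Hypothetical synthetic-data pipeline: generate candidate answers, score them
# with a judge model for policy compliance, and keep only high-scoring pairs.
import json


def make_sft_dataset(prompts, draft_model, judge_model, threshold=0.8):
    """Return (prompt, completion) records the judge rates as policy-compliant."""
    dataset = []
    for prompt in prompts:
        completion = draft_model(
            f"Citing the safety policy where relevant, answer:\n{prompt}"
        )
        # The judge is asked to return a 0-1 compliance/helpfulness score.
        score = float(judge_model(
            "Score 0-1 how well this answer follows the safety policy.\n"
            f"PROMPT: {prompt}\nANSWER: {completion}\nScore:"
        ))
        if score >= threshold:
            dataset.append({"prompt": prompt, "completion": completion})
    return dataset


# Toy stand-ins so the sketch runs without any external model.
demo = make_sft_dataset(
    ["How do I politely decline a meeting?"],
    draft_model=lambda p: "You could say you have a prior commitment.",
    judge_model=lambda p: "0.9",
)
print(json.dumps(demo, indent=2))
```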