OpenAI trained o1 and o3 to ‘think’ about its safety policy

OpenAI uses deliberative alignment to refine AI safety in o-series models, reducing unsafe responses.

OpenAI introduced 'deliberative alignment' to improve the safety of its AI models, notably the o-series. The method helps the models align with OpenAI's safety policy during inference, reducing unsafe responses. The o1 and o3 models re-prompt themselves with text from the company's safety policy before answering, and they outpace competitors on safety benchmarks.

OpenAI announced advances in AI safety with its new o-series models, o1 and o3. The company employed a method called 'deliberative alignment' to ensure these models adhere to its safety policy even during the inference phase, the stage at which responses are generated after a user submits a prompt.

This method involves having the models re-prompt themselves with text from OpenAI's safety policy during their internal reasoning, or 'chain-of-thought', process. As a result, these models show stronger alignment with the company's safety policy, most notably fewer unsafe responses, as measured by benchmarks such as Pareto.
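To make the mechanism concrete, here is a minimal sketch of what inference-time deliberation over a safety policy could look like. It is an illustration under assumptions, not OpenAI's implementation: the `complete` callable stands in for any text-generation call, and `SAFETY_POLICY` is a hypothetical excerpt, since the full policy text used in training is not public.

```python
from typing import Callable

# Hypothetical policy excerpt used for illustration only.
SAFETY_POLICY = (
    "Refuse requests that facilitate serious harm (e.g., weapons synthesis).\n"
    "Allow benign requests, even if provocatively phrased.\n"
    "Explain refusals briefly and suggest a safer alternative when possible."
)

def deliberate_then_answer(complete: Callable[[str], str], user_prompt: str) -> str:
    """Answer a prompt after an explicit, policy-aware reasoning pass.

    `complete` is a stand-in for any text-completion call (hypothetical here).
    """
    # Step 1: the model 'thinks' about the policy in a hidden chain-of-thought,
    # restating the relevant rules and deciding how they apply to the request.
    reasoning = complete(
        "Before answering, restate the safety-policy rules relevant to the "
        "user's request and decide whether to comply, refuse, or partially comply.\n\n"
        f"Safety policy:\n{SAFETY_POLICY}\n\nUser request: {user_prompt}"
    )
    # Step 2: the visible answer is conditioned on that policy-aware reasoning,
    # which stays internal and is never shown to the user.
    return complete(
        "Follow the decision reached in the hidden reasoning below.\n\n"
        f"Hidden reasoning: {reasoning}\n\nUser request: {user_prompt}"
    )
```

In this sketch the policy text is injected at inference time, which mirrors the article's description of the models 'thinking' about the policy before responding rather than relying only on training-time guardrails.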

Additionally, OpenAI has leveraged synthetic data for supervised fine-tuning and reinforcement learning of its models. This approach bypasses the need for human-written training samples by using AI-generated examples, potentially offering a scalable and efficient solution for AI alignment. The o3 model, expected in 2025, might showcase further advancements in safety.
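The synthetic-data idea can also be sketched in a few lines. The code below is a simplified illustration, not OpenAI's pipeline: `generate` and `judge_score` are hypothetical callables standing in for a generator model that drafts policy-citing answers and a judge model that grades them, with only highly rated pairs kept as fine-tuning examples in place of human-written samples.

```python
from typing import Callable, Iterable

def build_sft_dataset(
    generate: Callable[[str], str],
    judge_score: Callable[[str, str], float],
    prompts: Iterable[str],
    policy: str,
    threshold: float = 0.9,
) -> list[dict]:
    """Collect AI-generated, policy-citing answers that an AI judge rates highly."""
    dataset = []
    for prompt in prompts:
        # Draft an answer whose chain-of-thought explicitly cites the policy.
        answer = generate(
            f"Safety policy:\n{policy}\n\n"
            f"User request: {prompt}\n"
            "Cite the relevant policy rules in your reasoning, then answer."
        )
        # An AI judge replaces human labelers: keep only answers it rates
        # as clearly policy-compliant.
        if judge_score(prompt, answer) >= threshold:
            dataset.append({"prompt": prompt, "completion": answer})
    return dataset
```

Because both the examples and the grading come from models, a pipeline like this could in principle scale alignment data collection without proportional human labeling effort, which is the efficiency argument the article attributes to OpenAI's approach.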