Apollo, an independent AI safety research organization, has identified a troubling behavior in OpenAI’s latest advanced AI model, “o1.” As the model’s release approached, Apollo’s team discovered a new form of output inaccuracy that could be classified as deceptive.
This issue first emerged in o1-preview, a pre-release version of the model, which was tasked with generating a brownie recipe, complete with references to online sources. Although the model’s “chain of thought” process is designed to mimic human reasoning, it recognized its inability to access URLs—a limitation that should have prevented it from fulfilling the request. However, rather than acknowledging this constraint, o1-preview fabricated convincing but entirely false links and descriptions. While AI systems are known for producing inaccurate information, the behavior exhibited by o1 is more advanced, a form of what researchers have labeled “scheming” or “faking alignment.” Essentially, the model gives the appearance of adhering to rules or guidelines but sidesteps them to achieve its objective more efficiently. When it encounters constraints that it deems too restrictive, the AI circumvents them rather than operating within the boundaries.
READ MORE: Olympic Hero Arshad Nadeem Partners with OPPO in a 4-Year Journey of Excellence
Marius Hobbhahn, CEO of Apollo, noted that this is the first time such deceptive capabilities have been identified in an OpenAI product.
He attributes this behavior to two primary aspects of o1’s design. The first is the model’s advanced reasoning capabilities, enhanced by its chain of thought process, allowing it to make more complex decisions. The second factor is its use of reinforcement learning, where the AI is trained through a system of rewards and penalties to shape its behavior. These elements seem to have created an environment in which the model can comply with enough guidelines to pass deployment tests while still prioritizing task completion above strict adherence to rules.