Topic of the Month
AI's Dawn of Reason
OpenAI pulled the veil back on its latest AI model this month. It had long been rumored the company was working on a secret initiative, first called Q* and then Project Strawberry internally, to improve AI's reasoning abilities. The new o1 release—which scraps the GPT naming scheme in a product reset—is said to deliver on that promise.
In a blog post, the company wrote that o1 makes strides in reasoning-heavy areas where previous models, including its own GPT-4, have struggled. This includes marked improvement on benchmarks, which are often human exams given to AI, measuring o1's ability to answer questions in math, science, and coding, some at a PhD level.
OpenAI achieved its breakthrough by combining reinforcement learning—an AI approach that's yielded impressive results in game-playing—and chain-of-thought reasoning. The latter chops difficult problems into smaller, more manageable steps and follows them through to a solution. "Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses," OpenAI wrote in its blog post. "It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn't working."
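OpenAI hasn't published o1's training recipe, but the chain-of-thought idea itself is easy to see in a prompt. Below is a minimal sketch using the official openai Python client (the model name and prompts are illustrative, not o1's actual method): it contrasts asking a model for an answer directly with asking it to reason in explicit steps. o1's advance is learning, through reinforcement learning, to produce and refine that stepwise reasoning on its own rather than needing to be prompted for it.

```python
# Minimal sketch of chain-of-thought prompting, the technique o1 builds on.
# Assumes the official `openai` Python client (>= 1.0) and an OPENAI_API_KEY
# set in the environment. The model name and prompts are illustrative only.
from openai import OpenAI

client = OpenAI()

# A direct prompt asks only for the answer.
direct = "A train travels 120 miles in 1.5 hours. What is its average speed?"

# A chain-of-thought prompt asks the model to work through explicit steps,
# which is what o1 learns to do on its own via reinforcement learning.
stepwise = (
    "A train travels 120 miles in 1.5 hours. What is its average speed? "
    "Break the problem into smaller steps, solve each step in order, "
    "and then state the final answer."
)

for prompt in (direct, stepwise):
    response = client.chat.completions.create(
        model="gpt-4o",  # any chat model works here; o1 reasons internally
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)
    print("---")
```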
The model ranked in the top 11 percent in competitive programming on Codeforces, scored well enough to qualify for the Math Olympiad, a math competition for high school students, and outscored human PhDs on a benchmark measuring knowledge of advanced physics, biology, and chemistry. It also significantly outperformed GPT-4o in these areas. Notably, though, OpenAI wrote that it may not match its predecessor on tasks more strictly limited to language.
Exactly how much such benchmarks can tell us about AI's abilities, beyond showing how models compare to one another and to prior generations, is hotly debated. Critics say they fall short in areas like the quality of the tests themselves and contamination, that is, whether identical or similar questions, answers, and knowledge exist online and therefore in each model's training data. Further, if all models perform similarly on existing benchmarks, we'll need new ones. Fortunately, efforts are already afoot to make harder, more illustrative AI tests.
Still, the ability to perform multi-step reasoning has long been a goal in the industry, and o1 appears to be progress. Google DeepMind is also going after AI that can reason. Its AlphaGeometry system mashed together a large language model and a symbolic model, a more traditional, hard-coded approach, to match top high schoolers at geometry. DeepMind CEO Demis Hassabis has also said the company is looking to use reinforcement learning, its "bread and butter," to improve future models.
Crucially, o1 suggests AI can progress without relying solely on scaling, in which developers improve models by making them bigger. That said, this month also showed scaling will continue, as major players moved to secure cash and energy.