Tag Archives: AI benchmarks

AI companies want you to stop chatting with bots and start managing them

Despite the hype about these agents being co-workers, from our experience, these agents tend to work best if you think of them as tools that amplify existing skills, not as the autonomous co-workers the marketing language implies. They can produce impressive drafts fast but still require constant human course-correction. The Frontier launch came just three… Read More »

OpenAI releases GPT-5.2 after “code red” Google threat alert

In attempting to keep up with (or ahead of) the competition, model releases proceed at a steady clip: GPT-5.2 represents OpenAI’s third major model release since August. GPT-5 launched that month with a new routing system that toggles between instant-response and simulated reasoning modes, though users complained about responses that felt cold and clinical. November’s… Read More »

Anthropic’s Claude Haiku 4.5 matches May’s frontier model at fraction of cost

And speaking of cost, Haiku 4.5 is included for subscribers of the Claude web and app plans. Through the API (for developers), the small model is priced at $1 per million input tokens and $5 per million output tokens. That compares to Sonnet 4.5 at $3 per million input and $15 per million output tokens,… Read More »

Anthropic says its new AI model “maintained focus” for 30 hours on multistep tasks

Claude 4.5 is available everywhere today. Through the API, the model maintains the same pricing as Claude Sonnet 4, at $3 per million input tokens and $15 per million output tokens. Developers can access it through the Claude API using “claude-sonnet-4-5” as the model identifier. Other new features Some ancillary features of the Claude family… Read More »

OpenAI jumps gun on International Math Olympiad gold medal announcement

The early announcement has prompted Google DeepMind, which had prepared its own IMO results for the agreed-upon date, to move up its own IMO-related announcement to later today. Harmonic plans to share its results as originally scheduled on July 28. In response to the controversy, OpenAI research scientist Noam Brown posted on X, “We weren’t… Read More »

ChatGPT’s new AI agent can browse the web and create PowerPoint slideshows

On Thursday, OpenAI launched ChatGPT Agent, a new feature that lets the company’s AI assistant complete multi-step tasks by controlling its own web browser. The update merges capabilities from OpenAI’s earlier Operator tool and the Deep Research feature, allowing ChatGPT to navigate websites, run code, and create documents while users maintain control over the process.… Read More »

Musk’s Grok 4 launches one day after chatbot generated Hitler praise on X

Musk has also apparently used the Grok chatbots as an automated extension of his trolling habits, showing examples of Grok 3 producing “based” opinions that criticized the media in February. In May, Grok on X began repeatedly generating outputs about white genocide in South Africa, and most recently, we’ve seen the Grok Nazi output debacle.… Read More »

With the launch of o3-pro, let’s talk about what AI “reasoning” actually does

Why use o3-pro? Unlike general-purpose models like GPT-4o that prioritize speed, broad knowledge, and making users feel good about themselves, o3-pro uses a chain-of-thought simulated reasoning process to devote more output tokens toward working through complex problems, making it generally better for technical challenges that require deeper analysis. But it’s still not perfect. An OpenAI’s… Read More »

CMU research shows compression alone may unlock AI puzzle-solving abilities

This new research matters because it challenges the prevailing wisdom in AI development, which typically relies on massive pre-training datasets and computationally expensive models. While leading AI companies push toward ever-larger models trained on more extensive datasets, CompressARC suggests intelligence emerging from a fundamentally different principle. “CompressARC’s intelligence emerges not from pretraining, vast datasets, exhaustive… Read More »

New secret math benchmark stumps AI models and PhDs alike

Epoch AI allowed Fields Medal winners Terence Tao and Timothy Gowers to review portions of the benchmark. “These are extremely challenging,” Tao said in feedback provided to Epoch. “I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a… Read More »