The Cost Wall of GPT
Over the past week, a new focus has emerged in the discussion about GPT on X: not capability, but cost.
ARC-AGI: The Boundary of Intelligence
The performance of the most cutting-edge models on ARC-AGI-2:
| Model | ARC-AGI-2 Score |
|---|---|
| GPT-5.2 Pro | ~54% |
| GPT-5.2 Refine | ~73% |
| Human | 100% |
The gap between 54% and 73% is not an intelligence gap but a "refinement" gap: letting the model repeatedly check and revise its own answers. Each extra round of checking takes more computation, which means higher cost.
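A rough sketch of why refinement multiplies cost. The mechanism below (generate-and-check rounds) and all numbers in it are illustrative assumptions, not published details of any real model; the point is only that token usage, and therefore cost, grows roughly linearly with the number of self-checking rounds.

```python
# Illustrative sketch: "refinement" modeled as repeated generate-and-check passes.
# PRICE_PER_M_TOKENS and TOKENS_PER_PASS are hypothetical placeholders,
# not published figures for any real model.

PRICE_PER_M_TOKENS = 10.0   # hypothetical blended $ per million tokens
TOKENS_PER_PASS = 4_000     # hypothetical tokens for one answer plus one self-check

def refinement_cost(passes: int) -> float:
    """Dollar cost of solving one task with `passes` generate-and-check rounds."""
    total_tokens = passes * TOKENS_PER_PASS
    return total_tokens / 1_000_000 * PRICE_PER_M_TOKENS

one_shot = refinement_cost(1)   # answer once, no self-checking
refined = refinement_cost(8)    # eight rounds of self-checking
print(f"one-shot: ${one_shot:.2f}, 8x refined: ${refined:.2f}")
```

Whatever the placeholder prices, eight rounds cost eight times one round: the accuracy bought by refinement is paid for linearly in compute.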
The Real Cost of Agents
Annual cost of 24/7 enterprise-level agents (20 million input + 20 million output tokens per day):
| Model | Annual Cost |
|---|---|
| Palmyra X5 | ~$48K |
| GPT-5.2 Standard | ~$57K |
| Gemini 2.5 Pro | ~$82K |
| Claude Sonnet 4.5 | ~$131K |
| Claude Opus 4.6 | ~$219K |
| GPT-5.2 Pro | ~$690K |
GPT-5.2 Pro costs roughly 12 times as much as GPT-5.2 Standard. This is not a pricing strategy issue, but a cost structure issue.
"Before you deploy 100 AI agents, run the math." — @waseem_s
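Running that math is a few lines of arithmetic: daily token volume times per-million-token prices, times 365. The prices below are hypothetical placeholders, not any provider's actual rates; substitute real list prices to reproduce the figures in the table.

```python
# Annual agent cost from daily token volume and per-million-token pricing.
# The example prices are hypothetical placeholders; plug in real list prices.

DAYS_PER_YEAR = 365
INPUT_TOKENS_PER_DAY = 20_000_000
OUTPUT_TOKENS_PER_DAY = 20_000_000

def annual_cost(price_in_per_m: float, price_out_per_m: float) -> float:
    """Yearly cost in dollars for a 24/7 agent at the stated daily volume."""
    daily = (INPUT_TOKENS_PER_DAY / 1e6) * price_in_per_m \
          + (OUTPUT_TOKENS_PER_DAY / 1e6) * price_out_per_m
    return daily * DAYS_PER_YEAR

# Example: a model priced at a hypothetical $1.50 in / $6.00 out per million tokens
print(f"${annual_cost(1.50, 6.00):,.0f} per year")  # prints "$54,750 per year"
```

At those placeholder rates a single always-on agent lands in the tens of thousands of dollars a year; 100 such agents multiply every row of the table by 100.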
The New Turing Test
A simple question is becoming the new intelligence test:
"The car wash is 40 meters from my house. I want to wash my car. Should I walk or drive?"
Models that passed: GPT-5.2 Thinking, Opus 4.6, Gemini 3 Pro
Models that failed: GPT-5.2 Instant, GPT-4o, Haiku 4.5, Sonnet 4.5
Why is this test meaningful? Because it tests "common sense reasoning" rather than "knowledge retrieval." 40 meters is normally walking distance, but the goal is to wash the car, and the car cannot be washed while it sits at home: it has to be driven there. Models that pattern-match "short distance, so walk" miss the actual constraint.
History Doesn't Repeat, But It Rhymes
"Expert systems were born in the 1970s, flourished in the 1980s, and were widely regarded as the future of AI." — @ChombaBupe
GPT models were born in 2018, flourished in the 2020s, and are widely regarded as the future of AI.
The failure of expert systems was not because they weren't smart enough, but because the maintenance cost was too high and the scalability was too poor. When the knowledge base requires manual maintenance, scale is the enemy.
GPT faces a mirrored problem: the models are smart, but the cost of reasoning is too high. When every request requires a lot of computation, scale is also the enemy.
Next Steps
Multiple new models are expected to be released this week: Gemini 3.1 Pro, Claude Sonnet 5, GPT-5.3, DeepSeek V4, Qwen 3.5.
The competition is shifting from "who is smarter" to "who is cheaper." This is good news for users. For OpenAI? Not necessarily.