SOOHAK + Use-Case Filter: When is AI mature?

How mature is an AI model for open knowledge work? A good test: How often does it recognize that a task simply has no solution? Not a single tested frontier model gets past 50%.

SOOHAK had 64 mathematicians build 99 deliberately unsolvable tasks. Qwen3 refuses to give a solution in fewer than 3% of cases; Gemini 3 Pro holds the top score at just under half. Terence Tao separately tested frontier models on open Erdős problems and saw success rates of 1 to 2%. Caveat: the study tested Opus 4.5, not 4.7. The latest generation would probably score better, but I don't expect the underlying pattern to shift.

I've made this point before, but the latest study once again confirms the need for a realistic assessment. Where the result can be verified quickly (code compiles and can be tested, a table is correct, a formula computes correctly), AI works, and usually significantly better than a human in the same timeframe. There, AI is a real superpower. Where the result is open (market analysis, risk assessment, strategic recommendation, project management), I still find substantial flaws in the overwhelming majority of AI answers.

For open knowledge work, my honest assessment today is that current AI models are simply not yet at a level where they produce truly reliable results. Anyone who wants to use AI productively in such areas has to build in human validation as a mandatory step, or shelve the use case altogether, because AI's confidently worded wrong answers can cause costly errors further down the process.
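As a rough sketch, this use-case filter can be written as a small triage function. Everything here is an illustrative assumption rather than part of the SOOHAK study: the names `UseCase`, `Route`, and `route_use_case`, and the simplification that "quickly verifiable" and "reviewer available" are plain yes/no flags.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    """Possible outcomes of the filter (hypothetical labels)."""
    AUTOMATE = "automate"                  # result can be checked quickly
    HUMAN_VALIDATION = "human_validation"  # open result, human review is mandatory
    SHELVE = "shelve"                      # open result and nobody to review it


@dataclass
class UseCase:
    name: str
    quickly_verifiable: bool  # e.g. code that compiles and can be tested
    reviewer_available: bool  # a human can validate every output


def route_use_case(use_case: UseCase) -> Route:
    """Apply the filter: verifiable work first, then mandatory review, else shelve."""
    if use_case.quickly_verifiable:
        return Route.AUTOMATE
    if use_case.reviewer_available:
        return Route.HUMAN_VALIDATION
    return Route.SHELVE


if __name__ == "__main__":
    examples = [
        UseCase("generate unit-tested code", True, False),
        UseCase("market analysis", False, True),
        UseCase("unreviewed risk assessment", False, False),
    ]
    for example in examples:
        print(f"{example.name}: {route_use_case(example).value}")
```

In practice both flags are judgment calls rather than hard facts, so the sketch is best read as a checklist in code form, not as an automated gate.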

We are not where some headlines suggest. We are in a phase where the real, very large potential lies in repetitive tasks with clear result verification, while many other areas either don't work at all or produce completely wrong results. Anyone who expects more and rolls out AI too quickly in those areas risks very expensive mistakes.

Background: https://the-decoder.com/new-math-benchmark-reveals-ai-models-confidently-solve-problems-that-have-no-solution/