For example, SWE-bench Pro improved ~11% compared with Opus 4.6. Should one interpret that as 4.7 being able to solve more difficult problems, or as 11% fewer hallucinations?
Not related to this release, but is anyone aware of what's happening with Deepseek? The usual cascade of synced releases has been lacking this frontier lab whale for a while now.
> Not related to this release, but is anyone aware of what's happening with Deepseek?
Given that no-one is talking about DeepSeek, I assume it is coming this month.
They are still releasing research papers, and that is what really matters, not the .1-increment model releases that massage benchmarks or create hype.
They're either stuck/dead or they're sitting on something really fantastic that they only want to release once they've perfected it.
My realistic side suspects the former; my optimistic side hopes for the latter.
In the meantime, GLM 5.1 is actually really good.
https://deepinfra.com/zai-org/GLM-5.1
Are you seeing meaningful improvements in reasoning reliability, or mostly incremental quality changes compared to previous releases?