The current generation of reasoning models is genuinely impressive. They plan, they backtrack, they catch some of their own mistakes mid-stream. It is easy to read a clean trace and conclude the era of model errors is over.
It is not. The reliability gap on hard problems is smaller than it used to be, but the gap that remains is subtler: the model is more confident, the reasoning looks plausible, and the wrong answer takes longer to spot. In some ways that is worse than the old failure mode, where a bad answer at least tended to look bad.
Which is why verification is the part of the stack that is quietly maturing fastest. The pieces are familiar: tool-grounded checks, self-consistency across multiple samples, dedicated verifier models, structured output that can be parsed and validated, and human review on the paths where a mistake is expensive. None of these are flashy. All of them are the difference between a demo and a system you can put in front of a customer.
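To make two of those checks concrete, here is a minimal sketch that combines structured-output validation with self-consistency voting. It assumes each sample is a raw string the model was asked to return as JSON with an integer `answer` field. The function names, the `answer` field, and the 0.6 agreement threshold are all illustrative choices of mine, not a standard.

```python
import json
from collections import Counter


def parse_answer(raw):
    """Return the integer 'answer' field of a JSON object, or None if invalid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    value = data.get("answer")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        return None
    return value


def self_consistent_answer(samples, min_agreement=0.6):
    """Majority vote over samples; None when agreement is below the threshold.

    The threshold is an assumed value. Agreement is measured against all
    samples, so unparseable outputs count against the winning answer.
    """
    valid = [a for a in (parse_answer(s) for s in samples) if a is not None]
    if not valid:
        return None
    answer, count = Counter(valid).most_common(1)[0]
    if count / len(samples) < min_agreement:
        return None
    return answer
```

A `None` here is not a failure of the system. It is the signal to escalate: resample, call a tool, or hand the case to a person.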
The practical rule I use: never let a reasoning model be both the generator and the only judge of its output on anything that costs real money or real trust. Some other process — code, tool, model, person — needs to be able to say no.
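In code, the rule amounts to a gate that refuses to run without at least one independent check and lets any check veto the output. The sketch below is one way to express it, not a definitive implementation: `Verdict`, `run_with_veto`, and `arithmetic_check` are hypothetical names, and the toy task format (`"17 + 25"`) exists only to give the check something it can recompute without trusting the generator.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    accepted: bool
    reason: str


def run_with_veto(
    task: str,
    generate: Callable[[str], str],
    checks: List[Callable[[str, str], Verdict]],
) -> Tuple[Optional[str], str]:
    """Generate a candidate, then let every independent check veto it."""
    if not checks:
        # No independent judge means the generator would be judging itself.
        raise ValueError("at least one independent check is required")
    candidate = generate(task)
    for check in checks:
        verdict = check(task, candidate)
        if not verdict.accepted:
            return None, verdict.reason
    return candidate, "accepted"


def arithmetic_check(task: str, candidate: str) -> Verdict:
    """Toy tool-grounded check: recompute 'a + b' without the generator."""
    left, _, right = task.split()
    expected = str(int(left) + int(right))
    if candidate.strip() == expected:
        return Verdict(True, "matches recomputed sum")
    return Verdict(False, "expected " + expected + ", got " + candidate.strip())
```

With a stand-in generator, `run_with_veto("17 + 25", lambda t: "42", [arithmetic_check])` passes the candidate through, while a generator that returns `"43"` gets `None` back along with the reason. In a real system the checks would be a test suite, a verifier model, or a review queue, but the shape stays the same.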