Agent demos make for great videos because they show the happy path. The query is well-formed, the tools respond, the model strings together the right calls, the user smiles. Production is different. Production is mostly the failure path.
The interesting question is not "can the agent do the task" but "what happens when it cannot." A robust agentic workflow has explicit answers to: what does an error look like to the model, what state does it need to preserve to retry, when does it give up, and how does it tell a human something is wrong.
A pattern I keep coming back to is treating every tool call as a transaction with three outcomes (success, recoverable failure, terminal failure) and making all three visible to the model as first-class results. Hiding errors behind generic exceptions teaches the model that failure is invisible, and the agent will happily march off a cliff.
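One way to make the three outcomes concrete is a result type that the harness always returns instead of raising. What follows is a minimal sketch, not a reference implementation: `Outcome`, `ToolResult`, `call_tool`, and `render_for_model` are illustrative names, and the split between recoverable and terminal exceptions is an assumption you would tune for your own tools.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable


class Outcome(Enum):
    SUCCESS = "success"
    RECOVERABLE = "recoverable_failure"
    TERMINAL = "terminal_failure"


@dataclass
class ToolResult:
    outcome: Outcome
    payload: Any = None  # tool output on success
    error: str = ""      # what went wrong, in words the model can read
    hint: str = ""       # what the model may reasonably do next


# Assumption: transient network-style errors are worth retrying.
RECOVERABLE_ERRORS = (TimeoutError, ConnectionError)


def call_tool(tool: Callable[..., Any], **kwargs: Any) -> ToolResult:
    """Run a tool and always return a ToolResult; never raise to the caller."""
    try:
        return ToolResult(Outcome.SUCCESS, payload=tool(**kwargs))
    except RECOVERABLE_ERRORS as exc:
        return ToolResult(
            Outcome.RECOVERABLE,
            error=f"{type(exc).__name__}: {exc}",
            hint="Transient failure; retrying the same call may succeed.",
        )
    except Exception as exc:
        # Anything unclassified is treated as terminal rather than retried.
        return ToolResult(
            Outcome.TERMINAL,
            error=f"{type(exc).__name__}: {exc}",
            hint="Do not retry; report this to the user.",
        )


def render_for_model(result: ToolResult) -> str:
    """Serialize a result into the text the model sees as the tool's reply."""
    if result.outcome is Outcome.SUCCESS:
        return f"[{result.outcome.value}] {result.payload}"
    return f"[{result.outcome.value}] {result.error} | {result.hint}"
```

The design choice worth noting is that the failure kind travels in the reply itself, so the model reads "recoverable" or "terminal" rather than inferring it from a stack trace or from silence.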
The other half is bounded recovery. Without a step budget, a retry budget, and an escape hatch, agents drift into expensive loops trying to fix the unfixable. The harness is what enforces those bounds; the model on its own will not.
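Here is a rough sketch of what harness-enforced bounds could look like. It is simplified on purpose: a real agent asks the model for the next action at each turn, whereas this version walks a fixed list of step callables so the budget logic stays visible. `RecoverableError`, `Budget`, `RunReport`, and `run_bounded` are hypothetical names, and the default limits are arbitrary.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


class RecoverableError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


@dataclass
class Budget:
    max_steps: int = 10   # total tool calls allowed for the whole task
    max_retries: int = 2  # retries allowed for any single step


@dataclass
class RunReport:
    completed: bool
    steps_used: int
    escalation: Optional[str] = None  # message for a human when the run stops early
    outputs: List[str] = field(default_factory=list)


def run_bounded(steps: List[Callable[[], str]], budget: Budget) -> RunReport:
    """Run steps in order; stop and escalate as soon as any bound is hit."""
    calls = 0
    outputs: List[str] = []
    for index, step in enumerate(steps):
        retries = 0
        while True:
            if calls >= budget.max_steps:
                msg = f"step budget exhausted at step {index}"
                return RunReport(False, calls, msg, outputs)
            calls += 1  # every attempt, including retries, spends the step budget
            try:
                outputs.append(step())
                break
            except RecoverableError as exc:
                retries += 1
                if retries > budget.max_retries:
                    msg = f"retry budget exhausted at step {index}: {exc}"
                    return RunReport(False, calls, msg, outputs)
            except Exception as exc:
                # Terminal failure: no retry, hand off immediately.
                msg = f"terminal failure at step {index}: {exc}"
                return RunReport(False, calls, msg, outputs)
    return RunReport(True, calls, None, outputs)
```

The `escalation` string is the escape hatch: when it is set, the harness stops calling tools and surfaces the message to a person. With `max_retries=2`, a step that always raises `RecoverableError` is attempted three times in total before the run gives up.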