Agentic workflows are only as good as their weakest component + error analysis

When an agentic workflow fails, we need to look at its intermediate outputs.

In agentic systems, a trace is the complete set of outputs from all intermediate steps, while a span is the output from a single step. By examining traces on failure cases (not successes), you can see which component is underperforming. The methodology is straightforward: compare each span to what a human expert would produce, tally errors per component in a spreadsheet, and focus your improvement efforts where the errors concentrate.
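
A minimal sketch of that tally, assuming traces are stored as lists of spans and each span already carries a human expert's judgment. The `Trace`/`Span` shapes and the `judged_ok` field are hypothetical, not any particular tracing library's API:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Span:
    component: str   # e.g. "search_term_gen", "web_search", "summarizer"
    output: str
    judged_ok: bool  # human expert's verdict on this span's output

@dataclass
class Trace:
    spans: list[Span]
    success: bool    # end-to-end outcome of the whole run

def error_counts(traces: list[Trace]) -> Counter:
    """Tally judged errors per component, looking only at failed runs."""
    counts: Counter = Counter()
    for trace in traces:
        if trace.success:
            continue  # examine failure cases, not successes
        for span in trace.spans:
            if not span.judged_ok:
                counts[span.component] += 1
    return counts

# The component with the highest count is the bottleneck to fix first:
# error_counts(traces).most_common()
```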

This matters because agentic workflows follow the same principle as the Theory of Constraints: the system is only as good as its weakest component. If your research agent fails because web search returns poor results 45% of the time while search term generation only fails 5% of the time, optimizing the search term generator is pointless. Look at the traces, count the errors, fix the bottleneck.
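
A rough arithmetic check of why, assuming component failures are independent and any single failure sinks the run:

```python
# Success rates per component, from the example above.
search_term_ok = 0.95  # search term generation fails 5% of the time
web_search_ok = 0.55   # web search returns poor results 45% of the time

print(search_term_ok * web_search_ok)  # ~0.52 end-to-end success today
print(1.00 * web_search_ok)            # perfect term gen: still only ~0.55
print(search_term_ok * 0.90)           # fixing web search instead: ~0.86
```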

NOTE: instead of evaluating the workflow end-to-end (which can get expensive), it's also possible to create an eval for a single component and measure its performance in isolation while varying one thing at a time (prompt, tools, model, etc.).
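
A sketch of what such a single-component eval could look like, assuming a small labeled dataset and a crude word-overlap score. `generate_search_terms` is a hypothetical stand-in for the component under test; in practice it would call the model with the prompt variant being measured:

```python
def generate_search_terms(question: str, prompt: str) -> str:
    # Placeholder: a real implementation would call an LLM here.
    return f"{prompt} {question}".lower()

def score(output: str, expected: str) -> float:
    # Crude overlap metric; swap in whatever fits the component.
    out, exp = set(output.split()), set(expected.split())
    return len(out & exp) / len(exp) if exp else 0.0

def eval_component(dataset: list[tuple[str, str]], prompt: str) -> float:
    """Average score for one prompt variant over a fixed dataset."""
    return sum(score(generate_search_terms(q, prompt), exp)
               for q, exp in dataset) / len(dataset)

dataset = [("Who founded Acme Corp?", "acme corp founder"),
           ("Acme Corp revenue 2023", "acme corp 2023 revenue")]

# Hold the dataset fixed and change one variable (here, the prompt).
for prompt in ["search terms for:", "keywords:"]:
    print(prompt, round(eval_component(dataset, prompt), 2))
```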

See also: