Agents can be evaluated objectively or subjectively

This graph explains well how agents can be evaluated (image from Agentic AI course on DeepLearning.ai)

The way I like to think about this, is:

Can the evaluation condition be expressed mathematically? If so, we can evaluate with code
If it can't be expressed in code, it's definitely a task that requires non-deterministic thinking