"LLM as a judge" grading with a rubric gives more consistent results when evaluating Reflection (Agentic pattern)

Let's say we have an LLM producing code for a function in JavaScript. If we want to evaluate its readability, we could simply ask the LLM to judge it directly (zero-shot prompting), but we know that Reflection (Agentic Pattern) consistently outperforms direct generation on a variety of tasks.

It'd be better to add a reflection step. But how can we evaluate the output of an LLM working on "subjective tasks" like writing readable code?

"LLM as a judge" can help here, by providing a rubric that the LLM can use in the reflection prompt to judge the output.

In contrast, for "non-subjective tasks" we can use "objective evals" by writing a script that can track the LLM performance by building a dataset of ground truth examples.

"LLM as a judge" can also use external feedback to improve output even more. See Reflection (Agentic Pattern) can use external feedback from tools to improve output