Quick thoughts on evaluating agents

I recently came across several companies in the due diligence and report generation space. They all use a multitude of AI agents to plan and carry out the research, which made me wonder how these agent stacks are evaluated.

At a first approximation, an AI agent is nothing but an LLM + tool use + some control-flow logic, which may itself be steered by an LLM. When evaluating these agents, we are primarily concerned with two questions:
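To make that simplification concrete, here is a minimal sketch of such a loop. Everything in it is illustrative: `call_llm` is a hard-coded stand-in for a real model call, and the single `lookup` tool is a placeholder, not any particular framework's API.

```python
# Minimal agent loop: an LLM decides, tools execute, control flow routes.
# call_llm is a stub standing in for a real model call (assumption).

def call_llm(prompt: str) -> dict:
    # Stub policy: if we already have a tool observation, answer; else call a tool.
    if "Observation" in prompt:
        return {"action": "answer", "text": "Answer based on tool results."}
    return {"action": "use_tool", "tool": "lookup", "input": "revenue"}

# Toy tool registry; a real agent would have search, retrieval, calculators, etc.
TOOLS = {"lookup": lambda q: f"result for {q}"}

def run_agent(task: str, max_steps: int = 5) -> str:
    context = task
    for _ in range(max_steps):
        decision = call_llm(context)
        if decision["action"] == "answer":
            return decision["text"]
        # Tool use: execute the call and feed the observation back into context.
        observation = TOOLS[decision["tool"]](decision["input"])
        context += f"\nObservation: {observation}"
    return "step limit reached"
```

The point of the sketch is that two distinct things can go wrong: the final `answer` text, and the sequence of tool calls that produced it.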

Do they say the right thing? - and what does "right" mean for the final output?

Are they acting the right way? - and what does "right" mean for the steps taken along the way?
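The two questions above suggest scoring a run on two axes: the final answer and the action trajectory. A rough sketch of such a check, with all names and criteria purely illustrative (fact containment and tool whitelisting are just two of many possible checks):

```python
# Sketch of a two-axis agent eval: output correctness and behavior validity.
# expected_facts and allowed_tools are assumed inputs from a test case.

def evaluate_run(final_answer: str, tool_calls: list[str],
                 expected_facts: list[str], allowed_tools: set[str]) -> dict:
    # "Say the right thing": does the answer contain the facts we expect?
    answer_ok = all(f.lower() in final_answer.lower() for f in expected_facts)
    # "Act the right way": only permitted tools, no repeated identical calls.
    tools_ok = all(t in allowed_tools for t in tool_calls)
    no_repeats = len(tool_calls) == len(set(tool_calls))
    return {"answer_correct": answer_ok, "actions_valid": tools_ok and no_repeats}
```

For example, `evaluate_run("Revenue was $5M in 2023.", ["search", "extract"], ["$5M", "2023"], {"search", "extract"})` passes both checks, while an answer missing the expected facts or a trajectory that hammers the same tool twice fails the corresponding axis.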

Real-life evaluation considerations:
