Sidu Ponnappa
Apr 2, 2024
Building evals for applied AI tools that serve enterprise IT services workflows is a nontrivial problem.
The core issue is that evaluating an AI tool's performance in real-world workflows is inherently more complex than benchmarking the models themselves on established datasets.
Many factors contribute to this:
1. Complex workflows
Any real-world agent workflow will likely involve a sequence of steps, often leveraging different AI models and/or external APIs. If you wish to evaluate end-to-end performance, you need to understand how errors or biases in one step propagate and compound through the entire workflow.
Given that LLMs are black boxes, this is a hard problem. Modelling the interdependencies between workflow components is also a real challenge, thanks to the sheer number of moving parts.
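To make that concrete, here's a minimal sketch of an end-to-end eval harness (the step names and scorers are illustrative, not from any particular framework): score every intermediate step along with the final output, so you can see where errors enter the chain and how they compound.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class StepResult:
    name: str
    output: Any
    score: float  # 0.0-1.0, as judged by that step's scorer

@dataclass
class WorkflowEval:
    # Each step: (name, step function, scorer for that step's output).
    # A step function could be an LLM call, an external API call, or plain code.
    steps: list[tuple[str, Callable[[Any], Any], Callable[[Any], float]]]
    results: list[StepResult] = field(default_factory=list)

    def run(self, initial_input: Any) -> float:
        data = initial_input
        for name, step_fn, scorer in self.steps:
            data = step_fn(data)
            self.results.append(StepResult(name, data, scorer(data)))
        # The end-to-end score is the final step's score; the per-step
        # scores show where quality was lost along the way.
        return self.results[-1].score
```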
2. Cascading effects
Updates to any underlying component (a language model, an API endpoint, a programmatic control system) can have unpredictable effects on overall workflow performance.
And isolating the root cause of performance degradation is a pain, as the issue can manifest several steps removed from the actual point of failure.
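One pattern that helps, sketched below with hypothetical step names and scores: keep per-step eval scores from a baseline run and diff them against a run on the updated stack, so a drop in the end-to-end number can be traced back to the step that actually regressed.

```python
def diff_step_scores(baseline: dict[str, float],
                     candidate: dict[str, float],
                     tolerance: float = 0.05) -> list[str]:
    """Names of steps whose score dropped by more than `tolerance`."""
    return [
        step for step, base in baseline.items()
        if candidate.get(step, 0.0) < base - tolerance
    ]

# Example: the end-to-end number fell after a model upgrade; the per-step
# diff points at retrieval as the root cause, with drafting degrading downstream.
baseline  = {"classify_ticket": 0.92, "retrieve_kb": 0.88, "draft_reply": 0.81}
candidate = {"classify_ticket": 0.91, "retrieve_kb": 0.64, "draft_reply": 0.70}
print(diff_step_scores(baseline, candidate))  # ['retrieve_kb', 'draft_reply']
```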
3. Existing benchmarks don't help much
Existing AI benchmarks focus on specific tasks (e.g., text classification, question answering) but do not capture the complexity of end-to-end enterprise workflows. And traditional model-centric metrics like context size, accuracy, F1-score, or perplexity will not adequately capture the business impact or UX improvements of an enterprise AI agent.
So you need to define new evaluation metrics based on the specific enterprise use case, stakeholder requirements, and desired business outcomes. Metrics should consider factors like task completion rate, user satisfaction, operational efficiency, and cost savings — all very challenging to quantify and aggregate.
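For illustration only, here's what that might look like for one hypothetical use case, IT ticket resolution. The fields and units are assumptions on my part; the point is that the eval scores business outcomes rather than model outputs.

```python
from dataclasses import dataclass

@dataclass
class TicketOutcome:
    resolved_without_human: bool    # did the agent complete the task end to end?
    user_satisfaction: float        # e.g. post-resolution CSAT on a 1-5 scale
    minutes_saved_vs_manual: float  # handling time saved vs. the manual baseline

def task_completion_rate(outcomes: list[TicketOutcome]) -> float:
    return sum(o.resolved_without_human for o in outcomes) / len(outcomes)

def avg_user_satisfaction(outcomes: list[TicketOutcome]) -> float:
    return sum(o.user_satisfaction for o in outcomes) / len(outcomes)

def hours_saved(outcomes: list[TicketOutcome]) -> float:
    return sum(o.minutes_saved_vs_manual for o in outcomes) / 60
```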
4. Bad controls
Recreating a production enterprise IT workflow's complex data pipelines, API interactions, and environmental conditions in a testing environment is NOT EASY.
Factors like data volume, distribution shifts, API rate limits, and network latency all shape production performance, yet they are hard to simulate accurately.
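A crude sketch of one piece of this, with made-up class names and numbers: wrap the external dependency in a test double that injects latency and rate limiting, so the eval exercises how the workflow behaves under those conditions instead of assuming a perfectly behaved API.

```python
import random
import time

class FlakyAPIStub:
    """Test double that adds latency spikes and rate limiting to a handler."""

    def __init__(self, handler, rate_limit_per_min=60, p95_latency_s=1.5):
        self.handler = handler
        self.rate_limit_per_min = rate_limit_per_min
        self.p95_latency_s = p95_latency_s
        self.calls_this_minute = 0  # resetting the counter each minute is omitted for brevity

    def call(self, request):
        self.calls_this_minute += 1
        if self.calls_this_minute > self.rate_limit_per_min:
            # Does the workflow retry, back off, or silently drop the step?
            raise RuntimeError("429: rate limit exceeded")
        # Heavy-tailed latency: mostly fast, occasionally close to the p95.
        time.sleep(min(random.expovariate(3.0) * self.p95_latency_s, self.p95_latency_s))
        return self.handler(request)
```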
To win, applied AI startups will need to figure out how to create evals in a way that
1. faithfully captures the job to be done with that particular workflow — including all the business context around it
2. signals performance on the right set of metrics that enterprise clients would love to see
Simply marketing SOTA capabilities at the model level isn't going to be enough.