Sidu Ponnappa
Apr 2, 2024
Building evals for applied AI tools that serve enterprise IT services workflows is a nontrivial problem.
The core issue is that evaluating an AI tool's performance in real-world workflows is inherently more complex than benchmarking the models themselves on established datasets.
Many factors contribute to this:
1. Complex workflows
Any real-world agent workflow will likely involve a sequence of steps, often leveraging different AI models and/or external APIs. If you wish to evaluate end-to-end performance, you need to understand how errors or biases in one step propagate and compound through the entire workflow.
Given that LLMs are black boxes, this is a hard problem. Modelling the interdependencies between workflow components is also a real challenge, thanks to the sheer number of moving parts.
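To make that concrete, here's a minimal sketch of an end-to-end eval harness (the step names and scorers are illustrative, not from any particular framework): score every intermediate step along with the final output, so you can see where errors enter the chain and how they compound.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class StepResult:
    name: str
    output: Any
    score: float  # 0.0-1.0, as judged by that step's scorer

@dataclass
class WorkflowEval:
    # Each step: (name, step function, scorer for that step's output).
    # A step function could be an LLM call, an external API call, or plain code.
    steps: list[tuple[str, Callable[[Any], Any], Callable[[Any], float]]]
    results: list[StepResult] = field(default_factory=list)

    def run(self, initial_input: Any) -> float:
        data = initial_input
        for name, step_fn, scorer in self.steps:
            data = step_fn(data)
            self.results.append(StepResult(name, data, scorer(data)))
        # The end-to-end score is the final step's score; the per-step
        # scores show where quality was lost along the way.
        return self.results[-1].score
```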
2. Cascading effects
Updates to any underlying component (a language model, an API endpoint, a programmatic control system) can have unpredictable effects on overall workflow performance.
And isolating the root cause of performance degradation is a pain, as the issue can manifest several steps removed from the actual point of failure.
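One pattern that helps, sketched below with hypothetical step names and scores: keep per-step eval scores from a baseline run and diff them against a run on the updated stack, so a drop in the end-to-end number can be traced back to the step that actually regressed.

```python
def diff_step_scores(baseline: dict[str, float],
                     candidate: dict[str, float],
                     tolerance: float = 0.05) -> list[str]:
    """Names of steps whose score dropped by more than `tolerance`."""
    return [
        step for step, base in baseline.items()
        if candidate.get(step, 0.0) < base - tolerance
    ]

# Example: the end-to-end number fell after a model upgrade; the per-step
# diff points at retrieval as the root cause, with drafting degrading downstream.
baseline  = {"classify_ticket": 0.92, "retrieve_kb": 0.88, "draft_reply": 0.81}
candidate = {"classify_ticket": 0.91, "retrieve_kb": 0.64, "draft_reply": 0.70}
print(diff_step_scores(baseline, candidate))  # ['retrieve_kb', 'draft_reply']
```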
3. Existing benchmarks don't help much
Existing AI benchmarks focus on specific tasks (e.g., text classification, question answering) but do not capture the complexity of end-to-end enterprise workflows. And traditional model-centric metrics like context size, accuracy, F1-score, or perplexity will not adequately capture the business impact or UX improvements of an enterprise AI agent.
So you need to define new evaluation metrics based on the specific enterprise use case, stakeholder requirements, and desired business outcomes. Metrics should consider factors like task completion rate, user satisfaction, operational efficiency, and cost savings — all very challenging to quantify and aggregate.
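For illustration only, here's what that might look like for one hypothetical use case, IT ticket resolution. The fields and units are assumptions on my part; the point is that the eval scores business outcomes rather than model outputs.

```python
from dataclasses import dataclass

@dataclass
class TicketOutcome:
    resolved_without_human: bool    # did the agent complete the task end to end?
    user_satisfaction: float        # e.g. post-resolution CSAT on a 1-5 scale
    minutes_saved_vs_manual: float  # handling time saved vs. the manual baseline

def task_completion_rate(outcomes: list[TicketOutcome]) -> float:
    return sum(o.resolved_without_human for o in outcomes) / len(outcomes)

def avg_user_satisfaction(outcomes: list[TicketOutcome]) -> float:
    return sum(o.user_satisfaction for o in outcomes) / len(outcomes)

def hours_saved(outcomes: list[TicketOutcome]) -> float:
    return sum(o.minutes_saved_vs_manual for o in outcomes) / 60
```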
4. Bad controls
Recreating a production enterprise IT workflow's complex data pipelines, API interactions, and environmental conditions in a testing environment is NOT EASY.
Factors like data volume, distribution shifts, API rate limits, and network latency all shape production performance, yet they are hard to simulate accurately.
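A crude sketch of one piece of this, with made-up class names and numbers: wrap the external dependency in a test double that injects latency and rate limiting, so the eval exercises how the workflow behaves under those conditions instead of assuming a perfectly behaved API.

```python
import random
import time

class FlakyAPIStub:
    """Test double that adds latency spikes and rate limiting to a handler."""

    def __init__(self, handler, rate_limit_per_min=60, p95_latency_s=1.5):
        self.handler = handler
        self.rate_limit_per_min = rate_limit_per_min
        self.p95_latency_s = p95_latency_s
        self.calls_this_minute = 0  # resetting the counter each minute is omitted for brevity

    def call(self, request):
        self.calls_this_minute += 1
        if self.calls_this_minute > self.rate_limit_per_min:
            # Does the workflow retry, back off, or silently drop the step?
            raise RuntimeError("429: rate limit exceeded")
        # Heavy-tailed latency: mostly fast, occasionally close to the p95.
        time.sleep(min(random.expovariate(3.0) * self.p95_latency_s, self.p95_latency_s))
        return self.handler(request)
```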
To win, applied AI startups will need to figure out how to create evals in a way that
1. faithfully captures the job to be done with that particular workflow — including all the business context around it
2. signals performance on the right set of metrics that enterprise clients would love to see
Simply marketing SOTA capabilities at the model level isn't going to be enough.