Here's how to evaluate AI Agents for Enterprise use-cases

A problem evaluation framework we've developed over the past year for both GenAI founders and investors.

Sidu Ponnappa

Sep 20, 2024

Recently, I was asked to review an AI Agent company for potential investment. This led me to articulate a problem evaluation framework we've developed over the past year. I hope this proves useful for both founders of agentic startups and investors in these companies. As always, your mileage may vary. Caveat emptor.

The Core Framework

Good first-degree problem evaluation criteria tend to be simple. Here's what we've landed on for assessing problems as candidates for commercial AI Agents:

  1. Cycle time on agent capability development

  2. Objective, trusted verification of commercial viability of agent output

  3. Acceptable consequences of real-world verification of output

Let's break these down:

1. Cycle Time on Agent Capability Development

What's the cycle time per generate/verify/accept execution loop for the agent on real work - that is, actual commercially useful work in a real commercial context?

Ideally, you want this to be in seconds to minutes, not hours, weeks, or months. If you've picked a problem in the latter category, you should be raising tens or hundreds of millions in capital.

It's self-evident that improvements in capability should be measurable for the loop to be actionable. Everyone knows evals matter, so I won't belabor the point.

A common trap is to 'simulate' the Real Work for quick cycle times, only to hit a wall with actual Real Work because relevant complexity was eliminated in the simulation.
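The generate/verify/accept loop above can be sketched in a few lines. This is a hypothetical stand-in, not a real agent: `generate` and `verify` are placeholder functions, and the point is only to show where cycle time is measured per execution of the loop.

```python
import time

def generate(task):
    # Placeholder for the agent's generation step; a real agent would
    # call a model here. This stand-in just doubles the input.
    return task["input"] * 2

def verify(task, candidate):
    # Placeholder objective check against a known expected result.
    return candidate == task["expected"]

def capability_loop(tasks):
    """Run one generate/verify/accept pass, timing each cycle."""
    accepted, timings = [], []
    for task in tasks:
        start = time.perf_counter()
        candidate = generate(task)
        ok = verify(task, candidate)
        timings.append(time.perf_counter() - start)
        if ok:
            accepted.append(candidate)
    return accepted, timings

tasks = [{"input": 2, "expected": 4}, {"input": 3, "expected": 5}]
accepted, timings = capability_loop(tasks)
# Here each cycle completes in well under a second; for real work,
# the generate and verify steps dominate the cycle time.
```

The framework's first criterion is simply the wall-clock cost of one iteration of this loop on real work, not on a simulation of it.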

2. Objective Verification of Agent Output

Can a trusted third party objectively verify the agent's output, confirming its commercial viability without uncertainty?

For example, PRD generation requires subjective human evaluation, which slows down the capability loop and creates uncertainty across different reviewers or customers.

In contrast, coding use cases with objective verification by a compiler or static analyzer fit this criterion well.
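As a minimal illustration of what "objective verification" means here, a compiler-style check can act as the trusted verifier. The sketch below uses Python's built-in `compile()` purely as an example; a real pipeline would layer on tests, type checks, and static analysis before calling output commercially viable.

```python
def objectively_verifiable(source: str) -> bool:
    """Compiler-style check: does the agent's code output even parse?

    A deliberately minimal stand-in for a trusted third-party verifier.
    """
    try:
        compile(source, "<agent_output>", "exec")
        return True
    except SyntaxError:
        return False

# Well-formed agent output passes; malformed output fails deterministically,
# with no human judgment in the loop.
assert objectively_verifiable("def add(a, b):\n    return a + b")
assert not objectively_verifiable("def add(a, b) return a + b")
```

The contrast with PRD generation is exactly this determinism: two reviewers running the verifier get the same answer every time.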

3. Consequences of Real-World Verification

Since agentic output is non-deterministic, what are the consequences of a verification failure in the real world? Alternatively, is it economically feasible to create a perfect replica of the 'production environment' for verification?

For instance, with DevOps Agents dealing with complex custom prod setups, it's rare for customers to have a perfect production replica. Applying agent-developed changes to prod infra can have catastrophic consequences, often manifesting days or weeks later.

Footnotes

1. Questions like "How do you fine-tune?" or "How do you do RAG?" are secondary. They're relevant only insofar as they help answer the primary questions above.

2. The concept of real work is crucial to this framework. Real Work is "work that is commercially viable, done in a real-world work environment".

3. Problems rating high on all three criteria will likely be red-ocean markets. They're easy to start in, but building a moat is very hard.

4. Problems where access to real work is hard will likely have substantial legal complexity to navigate, often exceeding the technical complexity of developing the agent's capabilities.

5. Agentic use cases will manifest non-technical, second-order problems that are crucial to solve in parallel. Don't venture here if you're not ready to dive into areas like M&A strategy, contract law, or IP law.

6. The traditional divide between "business" and "technology" founders blurs in this space. Successful founders need to straddle both realms, dealing with them as a unified whole.

To close, while distribution remains king (a point so well-understood it hardly needs mentioning), these technical and operational considerations are crucial for evaluating the potential of AI Agent companies. The interplay between technical capability, real-world applicability, and business strategy will determine the winners in this field.