FAQ

How Do You Evaluate the Capability of an AI-Enabled Operator?

The Operator Model

Key Takeaways
  • The most reliable evaluation framework uses four auditable dimensions: velocity progression, dependency trajectory, quality under acceleration, and portfolio breadth.
  • All four should be git-verifiable, not self-reported.
  • Velocity progression, meaning compressing build times across successive projects, is the clearest signal.

Measure output, not credentials. The most reliable evaluation framework uses four auditable dimensions: velocity progression, dependency trajectory, quality under acceleration, and portfolio breadth. All four should be git-verifiable, not self-reported.

Velocity progression is the clearest signal. An operator whose capability is genuinely expanding will show compressing build times across successive projects. In the validated portfolio, early projects took 23 to 43 days to reach functional product; late projects took 4 to 9 days. That compression is not random — it is the signature of a compounding system where each build generates reusable infrastructure for the next (CS06). The metric to track: days to functional product, measured from first commit to deployment, plotted across the portfolio chronologically. A capable operator's curve bends downward.
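
One way to pull this number straight from git is sketched below in Python. It assumes each repository marks its first deployment with a tag matching a convention such as deploy-*; the tag glob, helper names, and repository paths are illustrative assumptions, not the portfolio's actual conventions.

    import subprocess

    DEPLOY_TAG_GLOB = "deploy-*"  # assumption: first deployment is marked with a tag

    def _git(repo, *args):
        return subprocess.run(
            ["git", "-C", repo, *args],
            capture_output=True, text=True, check=True,
        ).stdout.strip()

    def days_to_functional_product(repo):
        # Timestamp of the oldest commit in the repository.
        first_commit_ts = int(_git(repo, "log", "--reverse", "--format=%ct").splitlines()[0])
        # Earliest tag matching the deployment convention, by creation date.
        tags = _git(
            repo, "for-each-ref", "--sort=creatordate",
            "--format=%(creatordate:unix)", "refs/tags/" + DEPLOY_TAG_GLOB,
        ).splitlines()
        if not tags:
            raise ValueError(repo + ": no tag matching " + DEPLOY_TAG_GLOB)
        return (int(tags[0]) - first_commit_ts) / 86400  # seconds per day

    # Plotted oldest project to newest, the values should shrink
    # (e.g., 23-43 days down to 4-9 days in the validated portfolio).
    for repo in ["./project-early", "./project-late"]:  # hypothetical paths
        print(repo, round(days_to_functional_product(repo), 1), "days")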

Dependency trajectory reveals whether the operator is actually building capability or just coordinating it. Track the ratio of operator-authored commits versus external commits over time. In the validated case, operator direct contribution rose from 30% in October 2025 to 93% in January 2026, while external contractor spend fell from $6,486/month to $0 (CS06, CS09). The McKinsey Global Institute's 2023 research on AI in software development notes that the highest-impact AI adopters are those who internalize capability rather than outsourcing it to AI service providers. A capable operator's dependency curve moves in one direction: toward zero.
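
A minimal sketch of the same measurement, assuming the operator's commits can be identified by author email; the OPERATOR_EMAILS set and the repository path are placeholders to be replaced with the real identities.

    import subprocess
    from collections import defaultdict

    OPERATOR_EMAILS = {"operator@example.com"}  # assumption: the operator's git identities

    def monthly_operator_share(repo):
        lines = subprocess.run(
            ["git", "-C", repo, "log", "--no-merges",
             "--format=%ae|%ad", "--date=format:%Y-%m"],
            capture_output=True, text=True, check=True,
        ).stdout.splitlines()
        total = defaultdict(int)
        by_operator = defaultdict(int)
        for line in lines:
            email, month = line.split("|", 1)
            total[month] += 1
            if email in OPERATOR_EMAILS:
                by_operator[month] += 1
        # Fraction of commits authored by the operator, per calendar month.
        return {m: by_operator[m] / total[m] for m in sorted(total)}

    # The curve should rise toward 1.0 over time (e.g., 0.30 -> 0.93).
    print(monthly_operator_share("./portfolio-repo"))  # hypothetical path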

Quality under acceleration separates real capability from velocity theater. It is easy to ship fast if you do not care about defects. The relevant benchmark: the portfolio maintained a 12.1% product defect rate — half to one-fifth of the industry norm of 20% to 50% (Capers Jones, "Applied Software Measurement," 2008) — while output velocity increased 4.6x over the build period. Speed and quality moved together. If an operator's velocity increases but defect rates spike, that is not capability growth — it is technical debt accumulation.
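
A sketch of that check, assuming the evaluator can export shipped-product counts and defect counts per period; how a defect is defined and what counts as a spike (the 1.5x threshold below) are evaluator choices, not part of the cited benchmark.

    from dataclasses import dataclass

    @dataclass
    class Period:
        label: str
        shipped: int      # products (or releases) delivered in the period
        defective: int    # of those, how many shipped with a material defect

    def quality_under_acceleration(periods):
        prev_rate = None
        prev_shipped = None
        for p in periods:
            rate = p.defective / p.shipped
            spike = (
                prev_rate is not None
                and p.shipped > prev_shipped   # velocity rising...
                and rate > 1.5 * prev_rate     # ...while defects spike (threshold is arbitrary)
            )
            note = "  <- likely technical debt, not capability growth" if spike else ""
            print(f"{p.label}: {p.shipped} shipped, defect rate {rate:.1%}{note}")
            prev_rate, prev_shipped = rate, p.shipped

    # Hypothetical periods, for illustration only.
    quality_under_acceleration([
        Period("2025-H2", shipped=3, defective=1),
        Period("2026-H1", shipped=7, defective=1),
    ])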

Portfolio breadth tests whether the capability is narrow or generalizable. Shipping one product fast could be pattern-matching on a single domain. Shipping across 7 verticals, including insurance, e-commerce, legal services, consumer reporting, and an internal operations platform, demonstrates transferable architectural judgment (CS06). For PE evaluators, breadth matters because it predicts the operator's ability to execute across a portfolio of holdings, not just one.

The evaluation protocol: request git repository access, run the velocity and dependency metrics yourself, cross-reference contractor invoices against the timeline, and check defect rates against industry benchmarks. If the operator cannot produce this data, the capability claim is unverifiable. If they can, the numbers will tell you exactly where on the progression curve they sit.
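
A sketch of the invoice cross-check step, pairing monthly contractor spend with the operator commit share from the dependency metric above; the two rows shown simply restate the CS06/CS09 endpoints, and the intermediate months would come from the actual invoices and git history.

    def cross_check(monthly):
        # monthly: (month, external contractor spend in USD, operator commit share)
        for (m0, spend0, share0), (m1, spend1, share1) in zip(monthly, monthly[1:]):
            notes = []
            if spend1 > spend0:
                notes.append("external spend rising")
            if share1 < share0:
                notes.append("operator share falling")
            status = "; ".join(notes) or "consistent with capability internalization"
            print(f"{m0} -> {m1}: spend ${spend1:,.0f}, operator share {share1:.0%} ({status})")

    cross_check([
        ("2025-10", 6486.0, 0.30),
        ("2026-01", 0.0, 0.93),
    ])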


Related: CS06 — The Full Portfolio | CS09 — Zero to Builder

References

  1. McKinsey Global Institute (2023). "The Economic Potential of Generative AI." Research on AI adoption patterns and capability internalization versus outsourcing.
  2. Jones, C. (2008). Applied Software Measurement. Industry defect rate benchmarks and software quality measurement frameworks.
  3. Keating, M.G. (2026). "Case Study: The Full Portfolio." Stealth Labz.
  4. Keating, M.G. (2026). "Case Study: Zero to Builder." Stealth Labz.