How Braintrust uses AI agents, evals, and CI to ship better software

🎙️ “Evals are the modern version of a PRD”: Ankur Goyal (Braintrust) on encoding taste into systems that scale

Claire Vo

Jun 15, 2026

In this episode, I sit down with Ankur Goyal, founder and CEO of Braintrust, the AI evals and observability platform used by teams like Notion, Stripe, Vercel, and Zapier. This one is for the senior engineers, staff engineers, VPs of engineering, and CTOs in my audience. We get into how coding agents can take on deeply technical architecture and infrastructure work that no single human engineer could tackle before, and then we demystify evals so you can use them to make your AI products better without touching the implementation.

Listen or watch on YouTube, Spotify, or Apple Podcasts

What you’ll learn:

How Ankur uses Codex to run week-long benchmark experiments across database indexes, column store formats, and execution engines to speed up slow queries
Why he argues there’s no excuse to skip rigorous benchmarking now that agents can run benchmarks tirelessly
The “agent line” framework: how to decide which decisions, directions, and interactions you can hand off to an agent
How I think about the practical vs. theoretical quality of AI on hard technical problems, and why human attention decays on tedious work
Why evals are the modern version of a PRD, and how to encode “what good looks like” so a model can figure out the “how”
How to build a scoring function live and let an agent improve your prompt inside a safe playground
How Ankur turned his designer David’s taste into a repeatable eval so quality scales beyond one person
Why fixing your CI is the highest-leverage way to speed up engineering velocity