
79 minutes

Machine Learning | Data Science | Artificial Intelligence

SDS 857: How to Ensure AI Agents Are Accurate and Reliable, with Brooke Hopkins

Podcast Guest: Brooke Hopkins

Tuesday Jan 28, 2025

Subscribe on Website, Apple Podcasts, Spotify, Stitcher Radio or TuneIn


Brooke Hopkins speaks to Jon Krohn about technology’s new frontiers in AI agents, how these agents will impact society, work and our creative enterprises, and what this might mean for our data-driven future. You will learn how Coval, a simulation and evaluation platform for AI voice and chat agents, helps companies balance precision and scalability while making few concessions on the way.


Thanks to our Sponsors:



Interested in sponsoring a Super Data Science Podcast episode? Email natalie@superdatascience.com for sponsorship information.

About Brooke 
Brooke Hopkins is the founder of Coval, a simulation, evaluation and monitoring platform for voice and chat agents. Previously, she led the evaluation job infrastructure team at Waymo, building many of the developer tools for launching and running simulations. Seeing conversational AI agents face many of the same scaling problems as non-deterministic autonomous driving systems, she founded Coval to bring self-driving levels of reliability to these AI agents, enabling engineering teams to ship with speed and confidence.
 
Overview

Jon met Brooke at a startup competition run by the GenAI Collective. Out of several hundred startups, Brooke was one of just ten selected to present their work. Within two minutes, she revealed the purpose behind Coval, a simulation and evaluation platform for AI voice and chat agents. Presently, Coval simulates how an AI agent behaves across a far broader range of contexts than it has seen in practice, much as self-driving developers use simulation to take a car that works in one town and make it functional on regional and even international roads. In this podcast episode, she tells Jon that she plans to expand Coval’s functionality beyond voice and chat to any autonomous agent.

Brooke is especially excited about Coval’s accessibility. She wants to make it “obvious what to do next,” helping developers to identify and solve the problems in their systems. Brooke says this is essential, as low-code solutions have come to be ubiquitous for business owners without engineering backgrounds. 

Where Coval fits in this growing industry is that it lets its clients figure out the right options for their unique business problem, preferably combining precision with scalability. 

Jon and Brooke also spoke about the future of AI agents. For Brooke, it is essential that Coval helps its customers show that their agents are reliable beyond their demo cases. This means offering multiple custom metrics to handle complex scenarios, such as determining whether agents are following their intended workflows.

Finally, Jon asks Brooke for her thoughts about the future of AI agents and how they may impact society. She says that there is a real opportunity for agents to benefit our overall well-being, and she expects a rapid rate of adoption in the coming years. She ultimately feels positive about humanity’s future, saying that we have only become more creative in the face of advancing technology.

Listen to the episode to hear Brooke talk about her professional experience at Waymo and with Y Combinator, a detailed view of Coval’s workflows, and the meaning behind the name Coval! 

In this episode you will learn:
  • (07:49) What Coval does and how the platform works 
  • (21:16) Coval’s workflows 
  • (37:40) The future of AI agents 
  • (46:28) The metrics to evaluate performance 
  • (55:08) How close we are to achieving AI agent autonomy

Items mentioned in this podcast:
Jon Krohn: 00:00:05
This is episode number 857 with Brooke Hopkins, founder and CEO of Coval. Today’s episode is brought to you by ODSC, the Open Data Science Conference.

00:00:17
Welcome to the Super Data Science Podcast, the most listened to podcast in the data science industry. Each week we bring you fun and inspiring people and ideas, exploring the cutting edge of machine learning, AI, and related technologies that are transforming our world for the better. I'm your host, Jon Krohn. Thanks for joining me today. And now let's make the complex simple.

00:00:50
Welcome back to the SuperDataScience Podcast. Today I'm delighted to be joined by the dynamic AI entrepreneur, Brooke Hopkins. Brooke is founder and CEO of Coval, a Y Combinator-backed, San Francisco-based startup that provides a simulation and evaluation platform for AI agents. They also recently closed a $3.3 million fundraise that includes heavy-hitter venture capital firms like General Catalyst, MaC Venture Capital and Y Combinator. Previously, Brooke was a tech lead and senior software engineer at Waymo, where she worked on simulation and evaluation for Waymo's self-driving cars.

00:01:23
Before that, she was a software engineer at Google. She holds a degree in computer science and mathematics from New York University's Abu Dhabi campus. Despite Brooke's highly technical background, our conversation is largely conceptual and high level, allowing anyone who's interested in developing and deploying agentic AI applications to enjoy today's episode.

00:01:42
In today's episode, Brooke details how simulation and testing best practices inspired by autonomous vehicle development are being applied by her team at Coval to make AI agents useful and trustworthy in the real world. She talks about why voice agents are poised to be the next major platform shift after mobile, creating entirely new ways to interact with technology. She talks about how companies are using creative strategies like background overthinkers to make AI agents more robust. Those overthinkers are AI agents themselves, by the way, and she provides us with a glimpse of what the rise of AI agents means for the future of human work and creativity. Indeed, how agents will transform all of society. All right, you ready for this fascinating episode? Let's go.

00:02:30
Brooke, welcome to the SuperDataScience Podcast. I'm excited to have you here. Where are you calling in from today?

Brooke Hopkins: 00:02:36
Calling in from San Francisco.

Jon Krohn: 00:02:39
You and I met in San Francisco. I'm going to butcher the exact name of this event, but it was an event run by the GenAI Collective, I think is their official name. And it was a startup competition. From what I understand, a huge number of startups, over a hundred, maybe several hundred GenAI startups, applied to be part of this competition. You were one of 10 startups selected to present at this GenAI Collective event. So it was a cool thing where you had two minutes to demo the product, you weren't allowed to have a slide deck, and not only were you one of the 10 companies invited to do this, which was an extremely high barrier to clear in and of itself, but you won.

Brooke Hopkins: 00:03:24
Yeah, that was a really exciting day. We actually had also launched on Product Hunt that day, and this event was co-hosted with Product Hunt, and the CEO of Product Hunt was there as well as all the people from the GenAI Collective. And so it was a very exciting day to be there. We launched on Product Hunt, we got number one on Product Hunt and then went to this event. The energy was electric. It was really exciting.

Jon Krohn: 00:03:47
It was a really cool event and I was delighted that you took the time to speak to me afterward and that you're interested in being on the SuperDataScience Podcast. Thank you, Brooke.

00:03:57 Let's dig into why you won that day at the GenAI Collective event with your company Coval. So you previously led the evaluation job infrastructure team at Waymo, which is Google's self-driving car project. I'm personally a huge fan of Waymo. I love being in them. I feel so safe. When I'm driving myself after I've been riding around San Francisco in Waymos, I think to myself, drive like a Waymo, be patient.

Brooke Hopkins: 00:04:29
Totally. I think it's so fun to be in a Waymo because it feels so magical every time. How Waymo was able to transform the fear and uncertainty around self-driving cars to the other extreme, where now you get in a Waymo and feel safer than if you were in another ride-sharing service, really speaks, I think, to amazing technical talent and deployment as well.

Jon Krohn: 00:04:55
Being in a human driven car, it feels like you're in the Wild West. You're like, "What is this wildness?"

Brooke Hopkins: 00:05:00
Totally. And every once in a while you get the rogue driver who is a little bit crazy in some way. It makes you wish for Waymo.

Jon Krohn: 00:05:11
Absolutely, yeah. In episode 849, an annual episode we do where we predict data science trends for the coming year, my guest, as in those episodes for many years now, was Sadie St. Lawrence, and we did something new this year: in addition to making predictions for 2025, we also created some awards. So things like our biggest wow moment of the year, our biggest disappointment, what company we think made the biggest progress in AI in the past year, that kind of thing. And our wow moment of the year was being in a Waymo.

Brooke Hopkins: 00:05:48
Wow, that's amazing. I'm super glad to hear that. It definitely would be my wow moment of the year, probably for the last five years.

Jon Krohn: 00:05:56
Yeah, it's one of those things. It's my go-to example now. A question that I get from people, lay people, friends, family, is, "What's something that we need to know about AI?" And my go-to answer, since I was in a Waymo in the Northern Hemisphere summer of 2024, is: you can go to San Francisco, use an app like Uber, and have a car come and pick you up with nobody in it and drive you to wherever you'd like in the city, then drop you off. You feel safe. You can relay that to people and have them think about how it shows that there's going to be a huge amount of change in the coming years. It just hasn't proliferated around the world yet.

Brooke Hopkins: 00:06:45
A hundred percent. I think "the future is here, it's just not evenly distributed" is very real.

Jon Krohn: 00:06:50
It's just not evenly distributed. That's the quote I was looking for.

Brooke Hopkins: 00:06:54
Yeah. I think seeing how fast Waymo is deployed to new cities is also a really exciting part of this. Where to get from Mountain View to San Francisco took so long, it probably took... Waymo was started 10 years ago. We only deployed fully rider-only in San Francisco two years ago, and now we're already deploying in Los Angeles, in Phoenix. They're expanding to all sorts of new cities and the speed of deployment is just accelerating for each new city. I think that speaks to a lot of the developments both in how the model development works, but also in how simulation was able to aid, where you don't have to have that manual deployment in all these cities. You don't have to be running nearly as many driving logs because our simulations have become so accurate and you're able to scale them to a level that they previously didn't have confidence in.

Jon Krohn: 00:07:49
Yes, yes. And speaking of simulations at Waymo, you're the founder and CEO of Coval, which is a simulation and evaluation platform for AI agents. So you're starting with voice and chat assistants. Can you parse for us what this means? So we kind of have a sense, maybe you could even use Waymo as an analogy, because it's quite easy to visualize in that scenario: how did the simulation and evaluation work that you were doing at Waymo transfer over to what you're doing with the hottest topic in, I think, the world, although I'm obviously super biased as an AI person, AI agents? Applying that Waymo knowledge to now this super hot field of AI agents.

Brooke Hopkins: 00:08:36
Totally. Yeah. So what Coval is doing is we're building a simulation evaluation and monitoring platform for voice and chat agents, but eventually we want to do any autonomous agent. So an autonomous agent is an agent that's navigating the world and responding to the world. So think like a web browsing agent or a voice agent or a chat agent are all responding to what you say back to the agent. And so in the same way that when Waymo is driving from point A to point B, it needs to respond to a pedestrian crossing the street or it needs to respond to maybe some changing features on the road such as construction or a new road has been created.

00:09:15
There are parked cars, there are all these different changing environments. We're trying to take the same learnings from how we conquered that at Waymo in order to create really robust, scalable self-driving software and transfer that into how we can build really reliable, robust voice and chat agents or web agents that are able to navigate autonomous situations, while also balancing running tons of tests that are really expensive against having really high coverage.

00:09:46
And so these were a lot of the trade-offs that we made at Waymo as well as how do you run these really complex simulations on distributed systems? How do you do it at scale? How do you distill a lot of the complexity of this to something that's really simple for model and ML and AI engineers to be able to understand so that they can focus on the other hard parts they're working on? And then also just how do you measure this? What do the metrics look like? How do you interpret the results and how do you get signal from all of these massive amounts of data?

Jon Krohn: 00:10:17
I think we're going to dig into basically all of those topic areas right now. Before I dig into those kinds of things like balancing accuracy and scalability, compounding error issues that you get when you have a chain of AI agents, real-time monitoring, all these kinds of things that Coval offers, maybe you could illustrate for our listeners, and this can be maybe kind of a tricky thing to do in an audio only format. Actually, I had the pleasure of you providing me with a demo of your platform, sharing your screen and showing that to me last week. So I kind of have in my head some visuals of how the platform works, but maybe you could use a case study with a client or two. You don't need to necessarily name them by name if that's not something obviously you're authorized to do with a client. But just describing a situation that a client's in and how Coval is able to automate reliably their simulation and evaluation.

Brooke Hopkins: 00:11:16
Totally. So a common pattern that we see when you're developing agents is that in order to test these multi-step agent workflows, there are a couple of things that make it really hard. So one, to test this manually often takes a lot more time because instead of just putting in one input, such as clicking a button or getting an LLM response from a call, you have to go through multiple steps. And so with a phone call, this could take as long as a phone call takes, which might be minutes or even 15 minutes or longer. And then as well, you have to recreate all of these different contexts and states, and these are really hard to manage. Even if you are willing to put in the time to do this, you have to remember, okay, I went down this pathway but I haven't tested this pathway, and then can I remember what it was like when I tested it the first time?

00:12:04
And so similar to self-driving cars, to get from point A to point B, you have all these possible paths that you can go and some of them are right and some of them are wrong and some of them are hard to tell. And so really what you want to be doing is running all of the possible pathways or at least a representative subset of the pathways so that you can have high signal and high confidence in what you're testing. And then you want to see how often do certain types of events happen? So for example, how often am I seeing the agent get stuck? How often am I seeing the agent mispronounciate things? Mispronounce.

Jon Krohn: 00:12:51
That's perfect. I love that.

Brooke Hopkins: 00:12:52
Just like me, just like a human. So how often are you seeing transcription errors? How often are you seeing logic errors? All these types of things. You want to see how often they're getting the conversation wrong or right. A lot of our customers, for example, are building customer service agents. So this is an area we're seeing just really explode within voice agents. So you have a customer that's calling you and they want to book an appointment, so they want to book an appointment for tomorrow or next week, or they want to book an appointment for the next available time or Tuesday the 24th, and you should assume that that's in 2025, or all these different permutations of possible ways to book an appointment. And so what our customers will do is they simulate booking an appointment, and the prompt for this simulation would be book an appointment for some time in the future, or just book an appointment.

00:13:44
And then you can vary how deterministic or non-deterministic these simulations are with temperature and other things and then be able to map out all these different pathways. Or if you care more about I want to test booking an appointment for tomorrow because I've seen some errors with that case, I'll prompt it with book an appointment for tomorrow and then run that ten times or a hundred times, see how often it's failing. If I feel like sometimes it works and sometimes it doesn't, can I see how often it's not working?

00:14:14
Another thing that we can do is we can re-simulate from transcripts. So if you're a user, you go into your logs and you find examples where your voice agent is performing in a way that's an unexpected way. So you can go into your logs, find those examples and then re-simulate them. This is actually borrowed from self-driving as well. This is a really common developer workflow where we'll drive manual miles on the road or we'll drive supervised autonomous miles, take those logs from production and then re-simulate them through our simulation system. So this allows you to reproduce issues to a much finer granularity than if you use fully synthetic data.
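
To make the scenario-permutation and re-simulation ideas concrete, here is a minimal Python sketch of running one scenario many times and estimating a failure rate. None of this is Coval's actual API; simulate_user(), agent_reply(), and booking_confirmed() are hypothetical placeholders you would replace with your own simulated caller, the agent under test, and a real metric.

```python
# Minimal sketch (hypothetical, not Coval's API): simulate one scenario many
# times and count failures. To re-simulate from a production transcript, you
# would replay the logged user turns instead of calling simulate_user().

def simulate_user(scenario: str, history: list[str]) -> str:
    """Placeholder: an LLM would generate the simulated caller's next utterance."""
    canned = ["I'd like to book an appointment for tomorrow.",
              "Morning works best.",
              "Yes, that's everything. Thanks!"]
    return canned[min(len(history) // 2, len(canned) - 1)]

def agent_reply(history: list[str]) -> str:
    """Placeholder: the voice/chat agent under test would respond here."""
    return "Sure, I can book that appointment. What time works for you?"

def run_simulation(scenario: str, max_turns: int = 6) -> list[str]:
    history: list[str] = []
    for _ in range(max_turns):
        history.append("user: " + simulate_user(scenario, history))
        history.append("agent: " + agent_reply(history))
    return history

def booking_confirmed(transcript: list[str]) -> bool:
    """Toy pass/fail check; a real metric would be far richer."""
    return any("appointment" in turn for turn in transcript if turn.startswith("agent:"))

# Run the same scenario repeatedly to see how often it fails.
failures = sum(not booking_confirmed(run_simulation("book an appointment for tomorrow"))
               for _ in range(100))
print(f"failures out of 100 runs: {failures}")
```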

Jon Krohn: 00:14:56
To kind of give a visual of this and bring the analogy to life a little bit: I was recently in Austin, Texas and saw Waymos driving around with somebody in the passenger seat, or sorry, somebody in the driver's seat. And so that's an example. You were talking about expanding to new regions. And so in that scenario you have somebody maybe getting footage of something that's specific to Austin, Texas that would've been difficult to simulate from data in San Francisco or Mountain View. And now with Coval, a developer can be taking a chat experience and going through a specific flow that they think is really important, but then also simulating based on that to have more variability without all the effort.

Brooke Hopkins: 00:15:43
Totally. I think that's why Waymo is actually a great analogy to this or a place where we can draw a lot of learnings because in the same way that Waymo doesn't a hundred percent rely on their simulations, they use it to filter what should humans really look into, how can we move faster, how can we discover issues faster and how can we have much larger scale coverage than we ever would if we are doing only manual testing? But that doesn't mean that you don't have humans reviewing all of the performance or looking into specific issues. So in that same way, how can you use the manual driving time once you're really sure that that software is up to snuff and that the software is doing what we expect and then finding those really long tail cases or really just proving out the true reliability of Waymo versus I think previously in robotics, a lot of this would be done manually.

00:16:38
You would be manually testing all of these different scenarios and then trying to reproduce them each time. That's where voice AI is right now: people are going back and forth with their agents manually, that's what most companies are doing, and it's really painful. So what Coval does is... a lot of companies, a lot of engineers are going back and forth with their agents all day. Maybe they have a script in the best case where they're simulating a transcript, but we help those engineers who are going back and forth with their agents reduce that developer time and also run far more tests than they ever would be able to with manual testing alone.

Jon Krohn: 00:17:24
Nice. Yeah. So I realize that this is going to be tricky without visuals, but can you explain maybe even just at a high level how that happens in the platform? I know that a lot of design work has gone into building your platform effectively to strike that balance of making Coval both intuitive to use as maybe even a first time user while simultaneously offering the breadth of functionality that you've been describing the power users might want to have.

Brooke Hopkins: 00:17:53
Totally. I think this is an amazing challenge of developer tools, and we definitely have ideals, companies like Vercel or Linear, that take really complicated things and distill them into really simple products. I think really well-done developer tools take a really complicated thing and make it obvious what you should do next. Because I think at the end of the day, AI engineers have so many complex problems that they're solving all at once. Voice alone, voice and video streaming, has been a hard problem for over a decade. I think it's amazing how hard it still is to deal with audio bytes, video streaming, et cetera.

00:18:36
There's all of the complexities of prompting and dealing with models, building out your RAG infrastructure, building out traditional infrastructure, understanding your user, building out the right workflows. And so testing this is just one more piece that often isn't the core competency of these companies, nor should it be. So what we're trying to do is make it really simple and obvious so that they don't have to spend tons of time thinking through their eval strategy, figuring out how they should be evaluating these setups, figuring out how they can build out metrics or really complex systems to do this evaluation; instead, we streamline them through that process so it becomes overwhelmingly obvious what to do next and overwhelmingly obvious what the problems are in their system.

Jon Krohn: 00:19:23
Nice. Yeah, so that's the design challenge. How do you tackle it?

Brooke Hopkins: 00:19:27
Something that has been nice is that we built this so many times at Waymo, we built several iterations of it and saw a lot of the common design patterns that happen when you have complex configuration files for simulation. And so we've taken a lot of those learnings away from when we thought something was going to be really obvious or we thought something would stay simple for a long time, knowing how it might evolve over time. So for example, things like configuration files for the simulator, what types of arguments go in there? Where might we go in the future even if that's not what we have today? I think that's been really helpful for knowing how to modularize things so that you have small digestible components such as we have the simulator, we have metrics, we have analysis, but then also still making it so that you don't have to have a thousand different configuration pieces.

00:20:23
I think that's been... We've taken a lot of learnings from there. The other aspect is just making a lot of it really easy to use via our UI. So making it visually, I think a lot of this is a UX problem of how do you take large amounts of data and distill them for the users so that they can understand what they're simulating and what the results mean. Those two things, while really simple to say I think are a really hard problem that we spent a lot of time at Waymo solving of how can you tell the user what they're simulating and make it really clear? Often one of the failure modes is you just ran the wrong tests, right? Your data set wasn't representative of all the cases you're trying to test. Your configuration wasn't enabling the right modules, you weren't running the right setup in some way. And so for our agents, it's really important to figure out how can we distill this so it becomes really clear what they're simulating and what they're analyzing.

Jon Krohn: 00:21:21
Nice. So let's say I'm a client of yours and I have a customer service agent and I'm going into the Coval platform for the first time, blank slate. What do I do? Where do I go to start to make my life easier and start to have comprehensive testing and simulations going? Yeah, what's the flow like as a user?

Brooke Hopkins: 00:21:43
I'll talk through the whole developer lifecycle. So it's day one, you're building a voice agent. You go and you find a pretty easy-to-use platform to build a voice agent and you build an MVP. Then you can come into our platform and you can iterate on your prompt directly, so that you can see how that prompt plays out in just a super basic environment. Not even enabling voice necessarily, just: can I see how the conversation plays out with a single prompt? That's usually the first stage. And then you might make your agent a bit more complicated. You might add in some RAG, or you add multiple agents, or you add some flows to it.

00:22:23
And then what you can do is simulate, set up some simulated tests through our system. You'll create a test set. That test set might have a bunch of different scenarios like book an appointment, book an appointment for next week, call to issue a refund, complain about the recent experience you had on their airline, et cetera, and you'll run all of these scenarios through our simulator. Then you'll have a bunch of simulated conversations. That alone is super helpful because now you can look through lots of different... You can run a hundred simulated conversations at once and then digest the ones that maybe failed to complete or had a flag, a failed metric. Maybe it shows that the conversation was ended abruptly or the user wasn't able to achieve their goal or the appointment was not successfully booked. So then you can go in and manually review those and try and understand what's happening.

00:23:17
Here's where our users really iterate on both their evals and their system. So you might go in and realize these are the things that I'm trying to manually detect. I'll create a metric in order to detect those things, and then I realize that I'm interrupting the user, so I'll change some parameters so that I'm not interrupting the user as eagerly. Then I'll rerun it through simulation and be able to say, "Okay, is it now clear that my interruptions have decreased?" So once you have a good workflow there, you can automate those and then start monitoring how well your system's doing in production.
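
As a rough illustration of that iterate-and-rerun loop (not Coval's actual interface), the sketch below compares a single custom metric, interruption rate, across two batches of simulated conversations, one before and one after a parameter change. The data structure and field names are made up for the example.

```python
# Hypothetical sketch: compare an interruption-rate metric before and after a
# tuning change. Each conversation is a list of agent turns with a flag noting
# whether the agent cut the (simulated) user off.

def interruption_rate(conversations: list[list[dict]]) -> float:
    turns = [t for convo in conversations for t in convo if t["speaker"] == "agent"]
    return sum(t.get("interrupted_user", False) for t in turns) / max(len(turns), 1)

before = [[{"speaker": "agent", "text": "Hi, I--", "interrupted_user": True}],
          [{"speaker": "agent", "text": "How can I help?", "interrupted_user": False}]]
after = [[{"speaker": "agent", "text": "How can I help?", "interrupted_user": False}],
         [{"speaker": "agent", "text": "What day works for you?", "interrupted_user": False}]]

print(f"interruption rate before tuning: {interruption_rate(before):.0%}")
print(f"interruption rate after tuning:  {interruption_rate(after):.0%}")
```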

Jon Krohn: 00:23:57
Nice. I might be interrupting you, or you might have gotten through the developer life cycle, but I have a couple of questions for you. You mentioned that at the beginning, on day one, the developer selects an agent platform. Can you disclose preferences that you might have, or Coval might have, as kind of preferred agent providers? Is that something that you do or can provide guidance on in public?

Brooke Hopkins: 00:24:20
Yeah, so I think there really is not necessarily a right platform or a one size fits all. I'm saying that honestly because we've seen all sorts of different platforms work well. We also want to be the evaluation framework that's agnostic of which framework you're using because this allows you to really easily switch between platforms as everything is moving so fast in voice AI, and also as prices go up or as your requirements change or as your product evolves, different solutions might make sense for you at different times.

00:24:55
I don't think that we're biased towards one or the other. I think there's a bunch that make sense for different use cases, and there's a couple of axes on which I would make that decision. So for example, on the scale of low-code to more configurable, you have much more low-code solutions that are servicing business owners or anyone without an engineering background, where they can really set up a voice agent as easily as setting up an email newsletter or any other easy-to-configure system, but here you're going to have a lot less configurability.

00:25:30
Setting up function calling or RAG is going to be a lot more limited, whereas a higher configurability option, such as some of the open source voice orchestrators, those are going to give you a lot more control over function calling, over being able to add in different infrastructure and mix and match that with your own in-house built infrastructure. And so I think figuring out where on that spectrum you are, then also looking at which companies support the developer needs that you're looking for.

00:26:04
Some other considerations that we've seen are functional. So there's a couple of important things when you're building voice agents, such as instruction following, function calling, workflow following, conversational quality or how natural the voice sounds, and creativity. So if you're building an application that's talking to you as a friend, versus you're building an application that has to follow a very strict workflow in order to collect a certain amount of data for a patient intake, versus a voice application that is calling out to do a bunch of function calls such as updating records or booking appointments, or you're in a very high compliance industry where instruction following is really important, it really needs to do the things that you tell it to.

00:26:51
These are all different trade-offs, and I think different platforms excel at different things. So for example, the conversational and creativity, you might be looking at different models then. For example, if you really care about function calling and making sure that you can do really complex function calls that might not fit into the more opinionated platforms that require you to set everything up in the way that they determine. That being said, there's some platforms that allow you to set up workflows in these really beautiful ways and makes it really easy so you don't have to code this giant mess. Those are kind of the trade-offs that we make and we actually work with our customers to figure out what the right platform is for them. So if you're having these questions, reach out to me and I'm happy to even just bat around some ideas.

Jon Krohn: 00:27:42
Excited to announce, my friends, that the 10th annual ODSC East (Open Data Science Conference East), the one conference you don't want to miss in 2025, is returning to Boston from May 13th to 15th! And I'll be there leading a hands-on workshop on Agentic AI! Plus, you can kickstart your learning tomorrow! Your ODSC East pass includes the AI Builders Summit, running from January 15th to February 6th, where you can dive into LLMs, RAG, and AI Agents, no need to wait until May! No matter your skill level, ODSC East will help you gain the AI expertise to take your career to the next level. Don’t miss - the Early bird discount ends soon! Learn more at odsc.com/boston.

00:28:27
Nice. Yeah, so you were mentioning there, Brooke, how agentic AI platforms are designed to try to provide ease of use, and they might have graph visualizations perhaps to allow that to happen. That reminded me how during the demo that you gave me of Coval last week, you had a graph aspect of the platform that allowed users to create nodes and connections between those nodes to map out conversation flows, which you could imagine would be very helpful in, say, a customer service example where somebody comes in and you could have one flow where it's dealing with an issue that they're having or booking appointments. And then down the issue leg of this graph, you could have a whole bunch of common questions or flows that happen when somebody is encountering an issue, versus when somebody is looking to book an appointment, there's a completely different way that the conversation could go.

00:29:23
So that's an example of how Coval is trying to combine, or succeeding at combining, precision and scalability, because these are often conflicting goals. When you think about trying to have an AI agent work effectively, the most precise thing to do, though it would be extremely time-consuming, would be to create maybe thousands or tens of thousands of different conversation flows that cover the gamut of possibilities and really comprehensively cover all the possible scenarios that your users could go through, which might be impossible, but let's pretend that it's possible to do all that coverage. It's going to take thousands, tens of thousands of manually created conversational flows.

00:30:13
That's not very scalable. You make a change to your platform, you're offering some more flexibility, you open up to a different kind of customer base, any of those, it could be very small shifts and then all of a sudden, wow, we're going to need thousands more conversation flows to handle this new niche that we're covering and this new functionality that our AI agent has. So that's the far end of precision. On the very far, on the opposite end of the spectrum, to maximize scalability, you could have something like, hey, you could have a chat with some kind of conversational GenAI agent and say, "I'm creating a conversational agent that will work in this particular scenario, create a whole bunch of tests for me," and then you just use those without reviewing them. So yeah, so I think hopefully I've done a passable job of explaining this spectrum of scalability to precision and yeah, I'd love to hear your thoughts on that and how Coval addresses it.

Brooke Hopkins: 00:31:17
Yeah, I love that you're already realizing this because I think that it is a non-obvious piece of the puzzle: it's not only a question of can I run the right test and get signal out of it, but what should I even be running? How do I make these trade-offs of scalability and signal? So we made these trade-offs a lot at Waymo, basically always balancing cost and latency with signal. You can obviously always spend more to make things faster and run more scenarios, but this obviously comes at the cost of it being more expensive, of it taking longer in your developer iteration cycle. On the flip side, you could run no scenarios and it would be very fast and cheap, but you'll have no visibility into your system. And so this is something we actually work with our customers on, to figure out what is the right balance of how many scenarios they should be running at which points in their workflow.

00:32:15
So what does it make sense to run say on every PR that you submit or what makes sense to run every six hours or nightly? And then what types of sets should you be creating to run regression sets with or run before big releases? So I think this is a really big problem is figuring out not only once I know what to run and how to get signal from that, run the right metrics, then how do I scale that as I add more customers, as I add more use cases? I think here is also where that kind of developer experience that I mentioned is really important is being able to show what is the distribution of the data set that you're running? How many examples are you running? Are they all on the same topic or are they on different topics? How do they compare to what you're seeing in production? Are they pretty similar to those examples in production or are you running examples that are completely different than what we're seeing in production?

00:33:12
And so that's where we think it's really important to have this end-to-end workflow where you can go from monitoring, simulation, testing, then see how it's behaving in production and then be able to rerun those logs through simulation, or check: is what we're testing actually surfacing the issues we're seeing in production?
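
One way to picture the trade-off Brooke describes is a tiered set of test suites plus a quick check that the test set resembles production traffic. The sketch below is purely illustrative; the suite names, cadences, scenarios, and topic labels are hypothetical.

```python
# Illustrative tiering of simulation suites by cost and cadence (all names hypothetical).
suites = {
    "per_pr": {"scenarios": ["book appointment", "cancel appointment"], "runs_each": 3},
    "nightly": {"scenarios": ["book appointment", "cancel appointment", "issue refund",
                              "complain about a flight", "reschedule"], "runs_each": 20},
    "pre_release": {"scenarios": ["full regression set"], "runs_each": 100},
}
for name, suite in suites.items():
    print(name, "->", len(suite["scenarios"]), "scenarios x", suite["runs_each"], "runs")

# Quick sanity check that the test set covers the topics actually seen in production.
from collections import Counter

production_topics = ["booking", "booking", "refund", "complaint", "billing"]
tested_topics = ["booking", "refund", "complaint"]
untested = set(Counter(production_topics)) - set(tested_topics)
print("production topics with no test coverage:", untested or "none")
```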

Jon Krohn: 00:33:30
Great answer. I love that. Crystal clear. In addition to this complexity of accuracy, sorry, of precision versus scalability, another big issue that happens with AI systems, with agentic AI systems is that there can often be a lengthy cascade. You mentioned earlier this idea of tool calling. So you could have an AI agent that is kind of triaging the call that is figuring out, okay, based on the conversation so far, it seems like I'm going to need to call tool A. And then maybe later in the conversation they need to call tool B, or maybe tool A in order to do its job effectively needs to call on tool C.

00:34:23
That was a bit of a vague example, but it was to kind of illustrate that you could end up with this cascade of multiple agents in a sequence potentially making requests in parallel or having multiple things happen sequentially. Multiple calls happen sequentially, all in parallel, and AI agents are responsible for all that without a human in the loop. So in that kind of scenario, even a small error, especially early on, what if the triaging agent right at the beginning got it wrong, it called tool A, and it should have called tool D.

Brooke Hopkins: 00:35:00
Totally.

Jon Krohn: 00:35:01
So that can lead to a butterfly effect where one small error in an earlier step can lead to a massively wrong output later on in the conversation. So what strategies can we employ to mitigate this kind of butterfly effect and how do you ensure graceful failure when these errors do happen?

Brooke Hopkins: 00:35:24
Totally. I think this is one of the reasons why evaluation of agents and multi-step evaluations is so different than evaluations of LLMs or any call where you have some input and some output because not only do you have the non-determinism of a single call, you have the non-determinism and all of the possible pathways through these cascading failure points that just explode in terms of the possible pathways and the possible types of failures.

00:35:54
There's also an interesting case where it goes off track, but then the agent saves it. It realizes that it made a mistake. Which kind of leads me to, I think, a lot of the ways people have been solving this: self-healing agents. One interesting thing I've seen is having an agent in the background for voice, where you have a cheaper and faster low-latency model coming up with responses, but then you have an overthinker in the background that's looking at the whole conversation. Maybe it takes longer to make that call, but the latency is okay as long as it's in the background and can help prompt the agent to get it back on track by saying, you messed up this order or you forgot to ask something.
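
Here is a minimal sketch of that background "overthinker" pattern: a fast placeholder model answers in real time while a slower reviewer re-reads the whole conversation in a background thread and queues a correction. Both model calls are stubs, not any particular vendor's API.

```python
# Hypothetical sketch of a background "overthinker": a fast model keeps the
# conversation moving while a slower reviewer checks the full history
# asynchronously and can queue a correction. Both model calls are stubs.
import queue
import threading
import time

corrections: "queue.Queue[str]" = queue.Queue()

def fast_reply(history: list[str]) -> str:
    """Stub for the low-latency model that answers in real time."""
    return "Got it, one large pepperoni pizza coming up."

def overthinker_review(history: list[str]) -> None:
    """Stub for a slower model that re-reads the whole conversation."""
    time.sleep(0.5)  # pretend this is a slower, more thorough call
    if "no pepperoni" in " ".join(history).lower():
        corrections.put("Correction: the caller asked for NO pepperoni.")

history = ["user: I'd like a large pizza, no pepperoni please."]
history.append("agent: " + fast_reply(history))

# The review runs in the background so it never blocks the live response.
threading.Thread(target=overthinker_review, args=(list(history),), daemon=True).start()

time.sleep(1.0)  # ...later in the conversation
while not corrections.empty():
    history.append("agent (self-correction): " + corrections.get())

print("\n".join(history))
```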

00:36:42
You can also employ other strategies for graceful failures around having multiple redundancies in your systems. I think there's a lot to learn from aerospace and self-driving here as well, where I think self-driving has really mastered graceful failures: it has fallback mechanisms, there are ways that it can pull over or ask questions to rider assistance. There are all sorts of systems in place so that it's not just reliant on any one system within the voice agent or within the agent at play. I think how this translates to voice agents is: can the voice agent self-determine when the request is too complex for itself? This is already happening a lot today. I would say the majority of our customers already have the ability to transfer to a human when they determine that the task is too complex for them. But could you have-

Jon Krohn: 00:37:39
Yeah, that's something that didn't occur to me until now, of course.

Brooke Hopkins: 00:37:43
Exactly. So that's an example of a redundant system, but I think we can create redundancy in all these other ways. And something that's interesting, a counterpoint to people who say that voice agents will never be reliable enough, that they're inherently non-deterministic and very hard to corral into doing a task reliably: I think we've seen in infrastructure that this is not the case, because servers are inherently very unreliable, and yet we were able to create infrastructure in the cloud, on all sorts of unreliable systems at many, many layers, that theoretically should all compound into a massive error percentage. We've seen people create six nines of reliability for those systems. And that's through redundancy, that's through fallback mechanisms, that's through all sorts of other engineering techniques. And so I think we're going to see the same thing happen with agents, where you can create reliability out of unreliable systems.

Jon Krohn: 00:38:43
Classic Luddite thing: this will never be possible. And then I can point you, Luddite, in the direction of Waymos. It's the kind of thing people say, oh yeah, it's the same thing as nuclear fusion. It's like self-driving cars have kind of always been 20 years away, have been for decades, but now it's happening. It's the same kind of thing with voice agents, with more and more kinds of agentic systems. The server example that you gave was beautiful because, yeah, six nines of reliability, that is possible with agentic systems as LLMs themselves get better at not hallucinating on an individual call, but then also as these kinds of redundancies are built in like you've been discussing. And so it is inevitable. It is not impossible. It is inevitable. This is what is going to happen. And if you don't think agents are going to be able to handle a huge swath of complex tasks in the coming years, you're wrong.

Brooke Hopkins: 00:39:50
You heard it here first.

Jon Krohn: 00:39:51
Yeah, yeah, yeah. It's not even a risky thing for me to say that. That is what's happening. Just as there are going to be self-driving cars in more and more cities for more and more providers handling a broader range of tasks, it's going to happen. So yeah.

Brooke Hopkins: 00:40:11
I think there's an interesting parallel there of how some agents acting reliably can actually raise the tide for everyone, where if you have lots of good examples of agents being deployed in enterprises successfully, that's going to create an environment where more agents are able to take on larger and larger tasks. And I think we saw this with self-driving where, as you're able to carefully and safely scale out self-driving, it doesn't matter which company does it, that's going to make it a more favorable environment for any company to be able to develop self-driving. So I think something we want to do at Coval is also give companies the tools to be able to show their customers that this is an agent that is going to perform reliably, and you can trust that this is performing not just on the demo cases that I showed you, which may be smoke and mirrors, but is actually working for all of the cases that you're interested in.

00:41:14
And then you can go and explore those cases and have confidence that the agent will be behaving as you expect and then monitor that over time. And so something Coval wants to do is we want to be able to power enterprises to be able to understand how their agents are behaving in a world where these systems are so much more complex than just knowing if your web app that you use for accounting works. You just log in and it's working or it's not working. But with agents, there's just so much less visibility and I think that makes people inherently distrustful of the systems, even if they can produce so much value and the technology is already there.

Jon Krohn: 00:41:53
I just realized I've been butchering the pronunciation of your company this entire episode. You've been doing Coval like all of the above, and I've been doing Coval like Albert.

Brooke Hopkins: 00:42:05
Actually, this is funny. I think we don't have a consistent pronunciation, so you don't have to worry about it.

Jon Krohn: 00:42:12
Okay.

Brooke Hopkins: 00:42:14
Our name actually comes from... We are named after, or we named the company after, Sofia Kovalevskaya, who was the first female mathematician to get her PhD. And then also it's collaborative eval or conversational eval, and so it kind of has this double meaning.

Jon Krohn: 00:42:32
That's super cool. I love that. I'm going to have some info in the show notes on Sofia Kovalevskaya for people, so you can click through and read about her, probably her Wikipedia profile or something. I'll be sure to include that. That's really cool.

Brooke Hopkins: 00:42:46
Yeah, I'm sure I'm butchering her name, which I should really know, but there's a lot of consonants in there.

Jon Krohn: 00:42:50
Yeah, yeah, yeah. I'm sure you're doing better than I am. At least I should be pronouncing it the same way you do on air. I'll try to switch to Coval. Coval. Yes, that's what you could do.

Brooke Hopkins: 00:43:06
I think I honestly switched back and forth, so I wouldn't stress about it.

Jon Krohn: 00:43:12
Gotcha. I think the-

Brooke Hopkins: 00:43:13
Coval.

Jon Krohn: 00:43:13
And all this is... Yeah, I think the reason why-

Brooke Hopkins: 00:43:15
You can do Coval.

Jon Krohn: 00:43:16
Nice. Well, sweet. I think part of why that one seems so right to me is it's like eval. Coval, eval.

Brooke Hopkins: 00:43:22
Yes. That's what we try and say, like copy it after. That's how I'm trying to pronounce it. Pronounce. I can't say that word for some reason.

Jon Krohn: 00:43:32
Pronounce and mispronounce. Nice. All right, so anyway, back to the conversation flow. You were just giving a great answer for me on the butterfly effect. And yeah, crystal clear tools like Coval, Coval are going to be able to move us in the direction of having agents handling more and more kinds of tasks in more and more kinds of scenarios. They're going to be ubiquitous in the future. It is inevitable.

00:44:07
Something else that Coval offers is custom metrics. So there could be complex scenarios where standard metrics, just plain old accuracy, aren't useful. I mean, actually, that would be something. Scoring a conversation isn't like a math test where there's a correct answer, where you just get to some integer or some float and you're like, "Okay, that is the correct answer, nice work, algorithm." When you have an agent handling a complex task, there's an effectively infinite amount of variability, where there's an infinite number of ways that it could be right, not even including the infinite number of ways that it could also be wrong. So what kinds of metrics do you use to evaluate whether an agent is performing correctly, and then maybe building on that, what kinds of custom metrics might your clients need?

Brooke Hopkins: 00:45:11
I think you're exactly right that it's really hard to find the line between this is objectively a good conversation and this is objectively a failing conversation; rather, it's a spectrum. And so what we find works really well is layering metrics. So being able to run a whole suite of metrics and then looking at trends within those metrics. This allows you to make trade-offs as well. So maybe you're a little bit worse at instruction following, but you get the cases that you care about most a hundred percent correct. Because the distribution of how well you do on all these cases isn't like machine learning where you just care about getting 99% of examples right. Because if you're getting the single most often used case wrong, it doesn't matter if you get the other 99% right, because when someone tries to book an appointment, they fail.

00:45:59
And so we see that these patterns of what counts as correct are different than in other traditional software applications or machine learning applications or even robotics. The other piece of this is that by having a variety of metrics, you can create a whole picture of how the system is behaving. So for example, a short conversation isn't inherently bad, but a short conversation where the goal wasn't achieved and the steps that the agent was supposed to take were not executed, that's an objectively bad conversation. So you can filter down to potential true failures versus false positives or false failures, et cetera. You can basically figure out which ones are worth looking into by filtering on these metrics.
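
A tiny sketch of that layering-and-filtering idea follows; the metric names and thresholds are made up. Each simulated conversation gets several metric results, and only the combination that is genuinely suspicious gets routed to a human.

```python
# Sketch of layering metrics and filtering on combinations (illustrative names/thresholds).
conversations = [
    {"id": 1, "num_turns": 4,  "goal_achieved": False, "workflow_followed": False},
    {"id": 2, "num_turns": 3,  "goal_achieved": True,  "workflow_followed": True},
    {"id": 3, "num_turns": 22, "goal_achieved": True,  "workflow_followed": True},
]

# A short conversation is not bad on its own, but short + goal missed + workflow
# skipped is a strong signal that a human should review it.
needs_review = [c["id"] for c in conversations
                if c["num_turns"] < 6 and not c["goal_achieved"] and not c["workflow_followed"]]
print("conversations flagged for human review:", needs_review)
```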

00:46:53
So I think while we aim to provide automated metrics for things like did it follow the workflow, was the conversation successfully completed, were all the right functions called with the right arguments, there's also always going to be space, I think, for human review and really diving into those examples. And the question is how can you use that time most effectively? So it's not that you never look at all these examples, but you're looking at the most interesting examples.

Jon Krohn: 00:47:19
Did you know that the number one thing hiring managers look at are the projects you've completed? That's why building a strong portfolio in machine learning and AI is crucial to your success. At Super Data Science, you'll learn how to start your portfolio on platforms like Hugging Face and GitHub, filling it with diverse projects. In expert-led live labs, you'll complete an exciting new project every week. Plus, through community-driven projects, you'll tackle real-world, multi-week assignments while working in a team. Get hands-on experience with projects like retail demand forecasting, building an AI model from scratch, deploying your own LLM in the cloud and many more. Start your 14 day free trial today and build your portfolio with superdatascience.com.

00:48:00
Nice. Very cool. That's a great example of what to prioritize. Are you able to give concrete examples of metrics? What are the most common metrics for evaluating performance?

Brooke Hopkins: 00:48:12
Yeah, so we have a metric that allows you to determine if you're following a workflow. So for a given workflow described in JSON, which is pretty common in a lot of these different voice platforms, can you determine if you're following these steps outlined in that workflow and determine when you're not meeting those in the conversation. This is super useful I think especially for objective-oriented agents where they're trying to complete a task. Often if they miss a step in that workflow, it's a really good indicator that the task wasn't completed correctly. So for example, if you're booking an appointment, just to use a consistent example, if you're booking the appointment and it asks for the email and the day that they want to book the appointment for, but they forget to ask for the phone number, that task has been completed technically but hasn't been completed correctly because it missed this key step in the workflow.
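
A minimal sketch of a workflow-following check appears below, assuming the workflow steps are described in JSON as Brooke mentions. The keyword matching here is deliberately naive and purely illustrative; a production check would use an LLM judge rather than substring matching, and the step names are hypothetical.

```python
# Sketch of a workflow-following check: given steps described in JSON, flag any
# step that never shows up in the agent's side of the transcript.
import json

workflow_json = """
{
  "steps": [
    {"name": "ask_for_email", "keywords": ["email"]},
    {"name": "ask_for_date",  "keywords": ["day", "date"]},
    {"name": "ask_for_phone", "keywords": ["phone"]}
  ]
}
"""

transcript = [
    "agent: What email should I put the booking under?",
    "user: jane@example.com",
    "agent: Great, what day would you like to come in?",
    "user: Tomorrow afternoon.",
    "agent: You're all set, see you then!",
]

steps = json.loads(workflow_json)["steps"]
agent_text = " ".join(t.lower() for t in transcript if t.startswith("agent:"))
missed = [s["name"] for s in steps if not any(k in agent_text for k in s["keywords"])]
print("workflow steps never executed:", missed)  # -> ['ask_for_phone']
```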

00:49:08
Another interesting thing that we do: we also dynamically create these workflows in monitoring, so that you can see what workflows your agents are actually going through in production and see how often that matches your expectations, or where you're seeing new use cases or new patterns of user behavior. We also have metrics around function calling, so were the right arguments passed for these different tool calls, and that's all custom configurable. What's interesting here is that we try to make all of our metrics reference-free. So there are two types of metrics: reference-based and reference-free. Reference-based metrics are ones where you have an expected output, and you must curate that expected output with a golden data set and maintain it as your agent behavior changes. With reference-free metrics, we infer what the correct answer should be based on the context of the conversation.

00:50:12
I think for LLMs in general, reference-free evaluation is really helpful because of the non-deterministic nature, whereas traditional unit testing in software is all reference-based, right? It's easy to make some assertions about what an API call should look like, but even more so with voice and chat agents, the conversations can go so many different ways. And this changes when you change your prompt, when you change the models, when you change your infrastructure. So having reference-free metrics, or at least a strong subset, and test sets that rely on those, is really important for being able to iterate really quickly. So we try to create a reference-free evaluation for function calling. So we say, for example, if we're taking the order, can we confirm that the right function call was made based on what was described in the order from the user? Those two things should match, based on a prompt and a set of heuristics. So this gives you, the user, a lot more flexibility.
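
To illustrate the reference-free idea for function calls: rather than comparing against a curated golden call, you can ask a judge model whether the observed call is consistent with the conversation. The llm_judge() function below is a stub, not a real API, and the order example is invented.

```python
# Sketch of a reference-free check on a function call: ask a judge model whether
# the call matches what the user asked for, instead of comparing to a golden answer.
import json

def llm_judge(prompt: str) -> str:
    """Placeholder: call your judge model of choice here and return its text."""
    return "PASS"  # stubbed verdict so the sketch runs end to end

conversation = "user: I'd like two large veggie pizzas delivered to 12 Main St."
observed_call = {"name": "place_order", "args": {"item": "veggie pizza", "quantity": 2,
                                                 "address": "12 Main St"}}

prompt = (
    "Given this conversation:\n" + conversation + "\n\n"
    "And this function call made by the agent:\n" + json.dumps(observed_call) + "\n\n"
    "Answer PASS if the call's arguments match what the user asked for, otherwise FAIL, "
    "and briefly explain."
)
print("reference-free function-call check:", llm_judge(prompt))
```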

00:51:18
Those are just two examples. We've actually been building out a lot of metrics for new use cases, pulling them from all over the map: using off-the-shelf models, drawing inspiration from self-driving, asking can we measure, for example, the agent's performance against human performance? If it took the agent longer or shorter to perform a task, that's interesting intel. It's not necessarily good or bad when it stands alone, but if the agent takes significantly longer to perform a task and then ultimately doesn't complete it, or is repeating itself a lot, it's a good indication that your agent is going in circles.

Jon Krohn: 00:51:57
Nice. That was a great comprehensive answer. If I try to kind of recap back for you how we can effectively evaluate conversational agents, it would be to have lots of permutations of relevant conversations, which Coval makes easy. So you can have lots of different examples that you test over, and then you have a handful of metrics that you evaluate each of those scenarios on. And so, through scale, you end up being able to ensure robustness and you can then watch those changes over time. So you could say... Well, there's probably not a huge number of... I don't know, maybe there are. I was going to say there aren't a huge number of people doing agentic AI where they're training or fine-tuning their own LLMs for doing this, but let's just... I'm thinking of my experience training a deep learning model, where you can over time see how the training accuracy and the validation accuracy are trending.

00:53:06
You could imagine that same kind of thing here, where if you were training your own LLM to be handling some agentic task, you could then run your suite of examples and suite of metrics provided by Coval at some reasonable number of training steps. And you could be watching how your metric curves change over time. And you're like, "Okay, we're kind of plateauing across the board. We've probably trained the LLM enough." Similarly, you could compare multiple different LLM providers. Or, with your tech, you can actually also monitor in real time. So your customers can see how their agent's metrics are performing over time, in real time, to see if something's going off the rails. Maybe one of the tools that's required for fulfilling a common request in the agentic workflow is down, say AWS in Virginia has gone down. And so being able to monitor in real time allows your customers to fix things before they become even bigger issues.
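
A toy sketch of the real-time monitoring idea Jon describes: track a metric hourly and alert when it drops well below its recent baseline. The numbers and the 10-point threshold here are invented for illustration.

```python
# Sketch of alerting when a tracked metric falls sharply below its recent baseline.
from statistics import mean

hourly_success_rate = [0.94, 0.95, 0.93, 0.96, 0.94, 0.61]  # last point: a dependency outage?

baseline = mean(hourly_success_rate[:-1])
latest = hourly_success_rate[-1]

if latest < baseline - 0.10:  # alert if we drop 10+ points below the recent baseline
    print(f"ALERT: task-success rate fell to {latest:.0%} (baseline {baseline:.0%})")
else:
    print("metrics within normal range")
```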

Brooke Hopkins: 00:54:26
Exactly. I think every piece of this puzzle, as you shed light on, is really important. Some issues are much easier to detect in production monitoring. For example, AWS going down: you can obviously have recurring tests, but it's going to be overwhelmingly clear when you start to see it happening in monitoring, along with the really long-tail issues. Another example is being able to see new user trends, like unanswered questions from users. This is something else that we do: we can detect the unanswered questions within your transcripts and then help you either answer them by adding things to your knowledge base, adding those capabilities, or using UX to let your users know this isn't something we support. So just as much as it is covering the things that you know you should be doing, it's also understanding what the user behavior is or how your system is behaving unexpectedly.
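For illustration, a minimal sketch of unanswered-question detection over transcripts. The transcript structure and the `classify_pair` callable (for example, an LLM prompt that returns "answered" or "unanswered" for a question/response pair) are assumptions, not Coval's actual pipeline.

def find_unanswered_questions(transcripts, classify_pair):
    """Collect user questions the agent failed to answer, so they can be fed
    back into the knowledge base or flagged as unsupported."""
    unanswered = []
    for transcript in transcripts:  # each transcript: list of {"role", "text"} turns
        for i, turn in enumerate(transcript):
            if turn["role"] == "user" and turn["text"].rstrip().endswith("?"):
                reply = next((t["text"] for t in transcript[i + 1:]
                              if t["role"] == "agent"), "")
                if classify_pair(turn["text"], reply) == "unanswered":
                    unanswered.append(turn["text"])
    return unanswered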

00:55:26
And then, yeah, every layer will catch different issues and is an important part of that workflow: what do you simulate, versus what do you catch in monitoring, versus what do you test manually? We also have the capability to send things for review, so you can send these out to labeling teams, go through tons of examples, and then feed that back into your metrics and evaluations. This is really helpful for understanding the effectiveness of your metrics over time.
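One simple way to quantify that feedback loop, sketched here under the assumption that human reviewers and the automated metric both produce pass/fail labels for the same sampled conversations:

def metric_agreement(human_labels: list[bool], metric_labels: list[bool]) -> float:
    """Fraction of reviewed examples where the automated metric matched the
    human label; a low value suggests the metric needs tuning."""
    assert len(human_labels) == len(metric_labels)
    matches = sum(h == m for h, m in zip(human_labels, metric_labels))
    return matches / len(human_labels)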

00:56:03
But yeah, as you mentioned, I think the long-term vision of self-improving agents that get better over time based on the metrics you define is a really exciting goal. I think it's still too early to do this. We get a lot of questions about automated prompt optimization and automated agent optimization, but I think we're still so early in agents that having visibility into how these systems are improving ultimately produces better results than the time savings from having self-improving agents. That will change a lot, though. At this rate, who knows, maybe in the next few months.

Jon Krohn: 00:56:40
Right at the end there, with this idea of self-improving agents being in the loop and adapting their own prompts, that relates to a question that our researcher, Serg Masis, pulled out to ask you. In self-driving cars, and I guess in autonomous systems in general, according to what Serg has written here, level five autonomy refers to complete autonomy: a self-driving car that can operate in all conditions without a human behind the wheel.

00:57:11
So bringing that analogy over to these kinds of conversational agents, or agents more broadly, like the web-based agents that you'll be supporting in the future at Coval. I guess you kind of answered the question there, which is that at this time it seems like it would be premature to try to have a fully automated system without a human in the loop at all, and without, in some scenarios, the redundancy of a human operator being able to come in and help out. But it also sounds from your response like we could potentially be months away from meaningful change there, even if that kind of complete autonomy still seems years away.

Brooke Hopkins: 00:57:59
Yeah. And who knows what the timelines are, but I think there are two parts to autonomy here. One is how the agents are developed and how autonomous that development life cycle is. And then, on the flip side, how autonomous the agents are once they're released, within the task itself. I think the exciting parallel with self-driving is: can the agent figure things out on its own without having to be programmed? Many systems right now respond to the non-determinism and create reliability by making the steps of the agent even more clear, even more restricted, with more heuristics or programmatic logic to determine what the agent should do next.

00:58:54
The flip side of this is having an agent that's more autonomous, where you give it more context into what a good next step would be, so that when it encounters unexpected situations it's able to adapt better. A good example is the one I've been using, which is booking a calendar appointment. Say you have an agent with a very restrictive workflow: first you should say hello, then ask for their email, then ask for their phone number, then offer some dates. If the person says, "Hello, this is Brooke Hopkins, I'm calling to book an appointment for tomorrow," and maybe they give their email in the first message, now your agent isn't able to respond appropriately. Or maybe you ask it some questions about the company which it actually should be able to answer. You're trading off being able to adapt to new scenarios versus precision.
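To illustrate the trade-off, here is a toy contrast between a rigid scripted flow and a more adaptive slot-filling approach for the appointment-booking example; the slot names and the `extract_slots` callable (for example, an LLM call that pulls any volunteered details out of a message) are hypothetical.

REQUIRED_SLOTS = ["name", "email", "phone", "date"]

def rigid_next_step(state: dict) -> str:
    # Asks for fields in a fixed order, even if the caller already volunteered them.
    for slot in REQUIRED_SLOTS:
        if slot not in state:
            return f"ask_for_{slot}"
    return "offer_times"

def adaptive_next_step(state: dict, user_message: str, extract_slots) -> str:
    # Absorb whatever the caller volunteered, in any order, then ask only for
    # what is still missing.
    state.update(extract_slots(user_message))
    missing = [s for s in REQUIRED_SLOTS if s not in state]
    return f"ask_for_{missing[0]}" if missing else "offer_times"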

00:59:58
I think with self-driving, there's something to think through in how, for example, Waymo is able to adapt to construction sites versus relying on logs of roads it has already seen before. I think there will always be this trade-off, and I'm hoping that agents go more in the direction of true autonomy. Take function calling, for example. There's a lot of work around: can we call these five sets of functions with these arguments? Instead, could agents figure out what APIs exist on the internet, go read the documentation, and come up with the right API format themselves, with no function calling provided? So I think there's a lot to explore in terms of truly autonomous agents.
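A speculative sketch of that "read the docs and figure out the call" idea; `fetch_docs` and `ask_llm` are placeholder callables, and the JSON response format is an assumption rather than any particular provider's API.

import json

def draft_api_call(task: str, docs_url: str, fetch_docs, ask_llm) -> dict:
    """Ask a model to draft an HTTP request for a task by reading API docs,
    rather than choosing from a fixed list of predefined functions."""
    docs = fetch_docs(docs_url)
    prompt = (
        "Task: " + task + "\n"
        "API documentation:\n" + docs + "\n"
        "Return JSON with 'method', 'path', and 'body' for the request that "
        "accomplishes the task."
    )
    return json.loads(ask_llm(prompt))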

Jon Krohn: 01:00:47
Great. Yeah, so you're kind of giving us a glimpse into the immediate hurdles that autonomy faces and how we might be able to mitigate them. Assuming that we will be able to mitigate all of them, we're going to have more and more agentic systems. Are you able to try to see into the future? This is a tricky question, but decades from now, when you, who are relatively young, are at the end of your career, what do you think the state of the world might be like? How different might society be as a result of agentic AI systems, AI in general, maybe other exponential technologies like nuclear fusion? Is that something you spend time thinking about, or is this just a silly question?

Brooke Hopkins: 01:01:44
No, it's definitely something we spend time thinking about, especially where Coval will go in the not-so-distant future where agents are exceptionally capable, near human intelligence, and can execute really well on a given task. I think even in that near future, the vision of Coval is being able to manage and understand how these agents are behaving at scale, even if you have agents that behave exceptionally well. We still care about, for example, human performance at scale: call centers and large companies care about performance reviews. Being able to monitor and understand how agents are behaving is paramount, whether to guard against the worst-case sci-fi scenario of agents taking over the world or, in a less dramatic sense, just to have agents that we understand and are able to reason about, legislate, employ for the right use cases, and understand how they're impacting our users. All of these things are really important for the wellbeing of everyone.

01:02:52
And for an even more distant future, I think that things are never as bad or as good as they seem, and in the same way, things are neither as dramatic nor as linear as they seem. I'm sure the future 50 years from now, for our grandchildren, is going to be dramatically different, in the same way that a hundred years ago was very different from now. So I think there's definitely a rapid pace of adoption, but humans are so adaptable that even if you start to have agents that can do the majority of mundane tasks like spreadsheets and emails and communication, I really believe in human creativity. I think humans are exceptionally creative and will continue to build on top of those tools and become even more capable, in the same way that computers have not replaced us; we've only become more creative, more connected, and more global as a society.

Jon Krohn: 01:04:01
That was a great answer, Brooke, and probably on some of our evaluation metrics it was the correct answer, but the absolute correct answer would've been, Jon, you will be able to upload your brain and live forever. That was the correct answer that we were looking for, but-

Brooke Hopkins: 01:04:13
Right. That was just actually what I was getting to before you interrupted me. I was about to say that you'll be on a beach on Mars. Your brain will just be making money, more money than everyone else. Everyone will be making more money than everyone else through agents.

Jon Krohn: 01:04:29
Everyone will be the richest person on the planet or in the universe.

Brooke Hopkins: 01:04:33
Yes, exactly.

Jon Krohn: 01:04:34
Yeah. My apologies for interrupting you there with my evaluation metrics on that question. But another question, related to that great answer you really did have about where we're going in the future: why is it that Coval is so excited about voice agents in particular?

Brooke Hopkins: 01:04:56
Yeah. Well, the reason we started with voice agents is that, on one hand, compared to self-driving agents or web agents or all of these more complex agents, voice is a really great medium: you have one person talking to another, so it's a little bit more constrained. That lets us develop these more advanced metrics and workflows. We've also seen voice agents just taking off, exploding in a way that no other agent was, at least six months ago. An exploding space that builds on top of existing infrastructure is a great place to start. Companies are used to having call centers. They're used to having phone trees. So taking that one step further, to an automated voice agent that's even smarter than what they had before, is a much easier step than going from someone who spends all day every day thinking about a problem to saying, now an agent is going to handle this.

01:05:51
But beyond the practical reasons why voice is really interesting, I think people are underestimating how exciting voice is as a space, because we're not just replacing all of the phone calls that you would make otherwise. There are several other really exciting things about voice. Now you have this universal API between any two businesses or establishments, essentially a natural-language API, where you can say: these are all the things that my agent knows about my company and is willing to reveal about my company. So either via text or via voice, I can call you and ask about claims data, appointment availability, opening hours, or all sorts of other data. And conversely, it doesn't matter whether the other party has a traditional phone, and so likely wouldn't have an API, or whether they also have their own agents.

01:06:50
And so now you have this very flexible agent interaction, where these two systems can talk to each other without any sort of maintenance. And I think there are also going to be so many new types of voice experiences that come out of this. Steve Jobs talks about this in a talk I just watched: when they first had TV, they just pointed a camera at plays and put it on TV. In the same way, with computers they first put essentially static pages on the web and then discovered all these interactive capabilities, and with mobile they started with a website on the phone but then evolved all these mobile-native applications. I think voice is the first big platform since mobile, where every company is going to be expected to have a voice agent or a chat agent, and those chat and voice agents are going to be expected to have a lot more capability.

01:07:54
You'll be able to do anything through these natural experiences. So what does that look like beyond just the things I would call a business for? When I'm on a website and something is pretty verbose to type out, maybe it's better to explain it by voice, and then go in and enter a bunch of numbers into a form, which are super annoying to say out loud over the phone. Or are there ways you can interact with web applications more seamlessly via voice because you're driving around as a delivery driver, or you're a police officer or a truck driver, where you often aren't at your computer, but when you get back to it, you can go through all of your orders in the web browser? So I think we're just on the cusp of figuring out the role of these really advanced voice agents, and that's really exciting: what new experiences can we create with this new medium?

Jon Krohn: 01:08:46
That was a beautiful answer. Something else that occurs to me, related to this, is that in order to have a great voice conversation, you need a great world model in the model, say the large language model, that the agentic system is relying upon. It's really cool in that respect, in the same way that a model like Sora, creating a video based on some text prompt, needs to encode, in some abstract way within its embeddings, that a bullet flying through the air keeps going straight across all the frames in the video clip. It has this kind of physics understanding built into its embeddings somehow. Similarly, when you're having a conversation, especially a complex conversation, which agents in the future will definitely be able to handle, you need a really sophisticated world model, a great understanding of how the world works, in order for that conversation to go well. So it's cool in that respect, too.

Brooke Hopkins: 01:09:54
I think that is such a great concept: what are all of the accidental things that agents discover in the process? In the same way that you said models accidentally discover physics, what will agents accidentally discover?

Jon Krohn: 01:10:12
That's really interesting. Something else that occurred to me, and this is going to be my last big question for you before the wrap-up questions, because you've been very generous with your time today: Coval graduated from Y Combinator. As you were talking about refining what your business is doing, I mean, you may have gone into Y Combinator with all of that already figured out, but it occurs to me that Y Combinator is the kind of place where you really kick the tires on questions like which market you should go after first. So I'd love to hear what your experience was like with Y Combinator. Why did you choose to apply? What was the experience like going through it? Would you recommend it to our listeners? All those kinds of things.

Brooke Hopkins: 01:10:56
Yeah. Well, even though I pretty much had the idea before Y Combinator, I honed in specifically on voice during the program. We were doing more generalized applications before, but we really honed in on voice based on our first customer; still, we went into Y Combinator with that idea. The reason I did Y Combinator is that I'm a solo founder, and obviously I think it's easier to run a race with lots of other people. I also know the person I am: I'm a rare extroverted backend engineer, ML engineer, so I knew that I was going to personally really enjoy the program. I think one of your biggest challenges as a founder is not the idea or external factors or your luck; your own personal psyche is half the battle.

01:11:53
And so if you can find environments in which you thrive, where you're inspired and pushed and constantly pushing harder and harder, that's so important. I knew that being around a lot of other really smart, inspiring founders was going to be great for the company and for my own experience. But I've also found starting a company so much fun, and it's really great to be in a community of people who think the same way, who say this is the best job I've ever had, being able to build something from scratch. I think YC does a pretty good job of filtering for people who are really excited about their company.

Jon Krohn: 01:12:32
Yeah, I mean, it's been great to meet you in person in December, and then we spent time together with you showing your platform to me in a demo last week, and now we're recording together. Unusually for a founder, and maybe even more unusually for a solo founder, you don't convey the sense of having the weight of the world on your shoulders. It seems like this is the right fit. Obviously it's going to be challenging, obviously it's going to be a huge amount of work, but you seem like such a safe horse to back, because it seems like you have just the right personality to stay calm, to figure it out, and to enjoy the process. So it's really cool. I really appreciate you taking this time with me and with our listeners.

Brooke Hopkins: 01:13:28
Oh, well thank you so much. It's been a pleasure to be on this podcast.

Jon Krohn: 01:13:31
Yeah, and before I let you go, I do have two quick questions. The first: do you have a book recommendation for us?

Brooke Hopkins: 01:13:36
Yes, I have many book recommendations. I'll do one that's more personal and then one that's pretty related to my work, or has informed a lot of my work. On a personal note, I think Kim Stanley Robinson's books in general are great. I love The Ministry for the Future. I think he did an excellent job of painting a very realistic near-but-far-term version of the future. It's mostly about climate change and what our world looks like with climate change, but that builds on the question you asked about what our world looks like with agents. A hundred years into the future can often be really hard to imagine, and I think Kim Stanley Robinson does a beautiful job in all of his books of painting that future.

01:14:19
And on the side of what has informed a lot of my work, I really love the book Creativity, Inc., especially reading it in parallel with Bob Iger's biography and Steve Jobs' biography, looking at how these leaders were able to cultivate creativity within their organizations and what it means to really create a company that builds novel, beautiful products. It was really exciting to read how Pixar, which has built some of the most advanced technology and most novel films, goes through that process, and then to see it through the lens of Apple, which is obviously interlaced with Pixar through Steve Jobs's involvement in both and his close relationship with Bob Iger. All three of those books paint this image of how you build really large organizations that are creative and inspired.

Jon Krohn: 01:15:22
Nice. Great recommendations. I love those.

Brooke Hopkins: 01:15:26
And I would add, actually, that they're both creative and also very technically advanced, and those things can often be at odds, but Pixar and Apple, I think, are great examples of being incredible both technologically and from a design perspective.

Jon Krohn: 01:15:44
For sure. And it does sound like something you have prioritized at Coval in terms of your user experience as well, which is super cool. The very last question for you, which should be a layup: how can people follow you after this episode? It's been so great to learn from you throughout the episode. If people want to get in touch, you mentioned, for example, that they can reach out and ask you about what kind of agentic systems or platforms they might want to consider using for their scenario. How can people reach out to you or follow you after the episode?

Brooke Hopkins: 01:16:12
Yeah, definitely. You can always find me on LinkedIn; feel free to shoot me a message there. Also, when you sign up for Coval, you can always book time with me for a quick session going over your voice architecture, and even if you're not using Coval, feel free to book some time to talk through your voice agents with me. You can also see a less filtered version of me on X, or Twitter, whatever people call it these days. So yeah, we'll add my LinkedIn and Twitter to the show notes.

Jon Krohn: 01:16:46
Fantastic. Yeah, we will have those. Yeah. Brooke, thanks again for taking the time. I know that you and Coval are going to be such a great success. So yeah, it's been an honor to have you on the show in these early days and maybe we can catch up in a few years and see how the product and the world of agentic AI has been evolving since.

Brooke Hopkins: 01:17:07
Totally, or 50 years and see how your brain is doing on Mars.

Jon Krohn: 01:17:13
Exactly. We can do it in exponential increments, so we'll do it in like 3, 30, 300, 3,000.

Brooke Hopkins: 01:17:21
[inaudible 01:17:20] which we'll be living because of our longevity efforts.

Jon Krohn: 01:17:24
Exactly.

Brooke Hopkins: 01:17:25
Well, thank you so much, Jon. It's been awesome to chat through all of these really exciting concepts that I love nerding out about with someone as smart as yourself.

Jon Krohn: 01:17:33
Perfect. Thank you so much.

01:17:33
I really enjoyed having Brooke Hopkins on the show today. In today's episode, Brooke covered how Coval is building a simulation, evaluation and monitoring platform for AI agents, starting with voice and chat agents, applying lessons learned from Waymo's self-driving car testing. She also talked about how the Coval platform helps companies balance precision versus scalability by enabling comprehensive testing across many conversation flows while maintaining high signal quality.

01:18:05
She talked about how key conversational-agent evaluation strategies include reference-free metrics, workflow validation, function-call validation, and comparison to human performance benchmarks. She talked about how companies are building redundancy into AI agents through techniques like fallback mechanisms, self-healing capabilities, and human backup options. She talked about how the development of reliable AI agents will likely follow a similar path to cloud infrastructure, building robust systems from inherently unreliable components through redundancy and engineering. And we talked about how voice agents are taking off because they provide a universal natural-language API between businesses and with consumers.

01:18:44
As always, you can get all the show notes including the transcript for this episode, the video recording and materials mentioned on the show, the URLs for Brooke's social media profiles, as well as my own at SuperDataScience.com/857. And if you'd like to connect in real life as opposed to just online, I'll be giving the opening keynote at rvatech Data and AI Summit in Richmond, Virginia on March 19th. Tickets are really reasonable and there's a ton of great speakers, so this could be a great conference to check out, especially if you live anywhere in the Richmond area. It'd be awesome to meet you in person there.

01:19:19
Thanks of course to everyone on the Super Data Science podcast team, our podcast manager, Sonja Brajovic, our media editor, Mario Pombo, partnerships manager, Natalie Ziajski, our researcher, Serg Masis, our writers Dr. Zara Karschay and Sylvia Ogweng, and our founder Kirill Eremenko. Thanks to all of them for producing another fascinating episode for us today. For enabling that super team to create this free podcast for you, we're so grateful to our sponsors. You can support this show by checking out our sponsors' links, which are in the show notes. And if you yourself are interested in sponsoring an episode, you can get the details on how to do that by making your way to jonkrohn.com/podcast.

01:19:56
Otherwise, you can support us by sharing this episode with someone who might like it. Reviewing the episode on your favorite podcasting app or on YouTube, subscribing if you're not a subscriber, and something that I've only recently started saying is, you're also welcome to edit videos into shorts and post them on social media, YouTube, TikTok, whatever. Just refer to us, and we'd love for you to be doing that. But most importantly, I just hope you'll keep on tuning in. I'm so grateful to have you listening, and I hope I can continue to make episodes you love for years and years to come. Until next time, keep on rocking it out there, and I'm looking forward to enjoying another round of the SuperDataScience Podcast with you very soon.
