Jon Krohn: 00:00:00
This is episode number 689 with Amber Roberts and Xander Song of Arize AI. Today’s episode is brought to you by Posit, the Open-Source Data science company, by AWS Cloud Computing Services, and by Anaconda, the world’s most popular Python distribution.
00:00:21
Welcome to the SuperDataScience podcast, the most listened-to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today. And now let’s make the complex simple.
00:00:51
Welcome back to the SuperDataScience podcast. Today I’m joined by not one, but two guests, Amber Roberts and Xander Song. Both are fabulous and both work at Arize, a Machine Learning Observability platform that has raised over $60 million in venture capital. And just so you’re aware, Arize is spelled maybe not how you’d expect: it’s A-R-I-Z-E, or “zee” for you Americans out there. So, let me introduce both Amber and Xander. Amber serves as an ML Growth Lead at Arize, where she has also been an ML engineer. Prior to Arize, she worked as an AI/ML Product Manager at Splunk, and as the head of AI at Insight Data Science. She holds a master’s in Astrophysics from the Universidad de Chile in South America. Xander serves as a developer advocate at Arize specializing in their open-source projects. Prior to Arize, he spent three years as an ML engineer. He holds a bachelor’s in Mathematics from UC Santa Barbara as well as a BA in Philosophy from UC Berkeley.
00:01:51
Today’s episode will appeal primarily to technical folks like data scientists and ML engineers, but we made an effort to break down technical concepts so that it’s accessible to anyone who’d like to understand the major issues that AI systems can develop once they’re in production, as well as how to overcome these issues. In this episode, Amber and Xander detail the kinds of drift that can adversely impact a production AI system, with a particular focus on the issues that can affect large language models, also known as LLMs. They talk about what ML Observability is and how it builds upon ML Monitoring to automate the discovery and resolution of production AI issues. They talk about open-source ML Observability options, how frequently production models should be retrained, and how ML Observability relates to discovering model biases against particular demographic groups. All right, you ready for this important and exceptionally practical episode? Let’s go.
00:02:50
Amber and Xander, two guests. Double the fun. No doubt. Where are both of you calling in from?
Amber Roberts: 00:02:56
So, I am calling in from the Miami area in Florida.
Jon Krohn: 00:03:01
Nice.
Xander Song: 00:03:02
I’m calling in from the San Francisco Bay Area in California.
Jon Krohn: 00:03:07
Oh, nice. All right. Well, great to have you both on the show. I met Amber in person a couple of weeks ago at the time of filming, when I was at ODSC East. I did a couple of half-day trainings there. I did a half-day training on Deep Learning, like an intro to Deep Learning with PyTorch and TensorFlow. And then I also had a half-day training that was a huge amount of fun for me to deliver, a huge amount of work for me to deliver because I hadn’t done a talk on this content before, but it was on large language models, so natural language processing with LLMs, and specifically focusing on how we can be using both the commercial APIs like OpenAI’s GPT-4, as well as open-source models through Hugging Face, and then taking advantage of these kinds of tools together with PyTorch Lightning to efficiently train in order to have your own proprietary LLMs that are very powerful and kind of GPT-4 level quality.
00:04:09
So, I was talking about that stuff and I lamented over a lunch to Amber that I wish I had seen her presentation the day before, or at any time before I was presenting, because I had a slide on issues related to deploying LLMs. And I listed a whole bunch of problems that can happen in production. And my final bullet was there are various ML Observability platforms out there, and I didn’t list any, because I didn’t have any particular one that I thought was really good. I just said to the audience, go ahead and Google it. Oh, and by the way, all of that talk should now be live on YouTube. By the time that you hear this, listener, it isn’t at the time of recording, but I’m pretty confident it will be published on YouTube if you want to check it out, so you can hear me lament about that. And, yeah, so Amber did a talk specifically on Model Observability, Machine Learning Observability with LLMs.
Amber Roberts: 00:05:15
Yes. Yeah, that was a lot of fun, that conference. But yeah, I did a talk on kind of LLMs in production, so LLMOps, and where Arize focuses is the ML Observability component. So, ML Observability is the software that helps teams automatically monitor AI, understand how to fix it when it’s broken, and ultimately learn how to resolve the issues teams are facing in production. And LLMOps and LLM Observability are similar concepts to Machine Learning Observability, but focused on these large language models.
Jon Krohn: 00:05:58
Right. So, I guess there are unique challenges related to LLMs in particular. Maybe we can dig into that in just a little bit. In the meantime, yeah, you can let us know a bit more about why, whether it’s an LLM or not, that we have in production. Why is the Machine Learning Observability platform like Arize useful to a data scientist, a machine learning engineer, or a business that’s deploying a Machine Learning model?
Amber Roberts: 00:06:24
That’s a great question. And mostly it comes down to what you do after you develop this model that, you know, goes through all the tests in a Jupyter Notebook, it looks great, and then you put it in production. Because, I think I polled the audience during my talk, like, how many people have put a model in production on a Friday night and then come back on Monday. And it’s just, you don’t want to be doing that because these models inevitably drift. So, there’s issues with the baseline data versus the incoming data. There’s performance degradation issues. These models are always going to change over time, but it’s important to know how much they’re changing. Is that affecting customers? Is that affecting revenue? Are there fairness and bias issues thrown in there? Are there data quality issues? So, there’s a lot that comes into play in that post-production workflow.
00:07:17
And oftentimes what we see, Arize does a lot of surveying for our customers and for ML engineers in the space. And what we find is that 84% of teams say that it takes at least one week to detect and fix an issue in production. So, sometimes these issues could exist as long as six months, and it’s customers calling to complain like “Hey, I’m not happy with the recommendations I’m receiving. I don’t like this plan, I don’t like this, you know, this feedback that I’m getting.” That’s really the last thing you want, to find out there’s a problem because customers complain, or to find it out because something’s wrong with the revenue, something’s wrong with the model. So, just making sure you have the orchestration in place to prevent all these issues and to catch them early. It’s that time to value that really makes teams want to purchase an observability solution like Arize.
Jon Krohn: 00:08:08
Nice. That makes a lot of sense. So, you talked about drift there. I know that that in particular is something that is an issue with Machine Learning models in production, that that happens all the time. So, what is drift? Why should we monitor it?
Amber Roberts: 00:08:20
Another great question, Jon. And Arize does have two free courses. They’re virtual certification courses that can be done at your own pace, and drift is covered in both of those units. One is kind of an introduction to drift, and one is the advanced drift metrics that are used in production. And I would say it comes in two categories. I’ll let Xander tackle the second category. The first category is the structured use cases that we see. A lot of times you’re looking at distributions and how distributions change over time. So, you always have a reference or baseline distribution, and then you have a production distribution. So, if you have your baseline distribution, let’s say that’s your training, and then your production data is going to be your current distribution, you’re looking to see how that data’s changing over time. So, you can use a statistical method like PSI, which is a Population Stability Index.
00:09:14
You can use KL Divergence, you can use a Chi-Square test. There’s a lot of different statistical methods that you can use to see how much one distribution is changing from another distribution. And so, there’s methods around that. And then for drift, it’s important to set monitors. So, because you might have a, you know, maybe, you know, certain features like user ID, user IDs are going to drift, that’s totally fine. But if you have something like, say, number of states, you want 50 states if that’s what’s used in your training model, and you know, we have seen teams that, you know, maybe they get, oh, now there’s 60 states coming in, and that’s because we trained our model on uncapitalized data and now we’re getting capitalized data coming in. So, setting those monitors is going to be key. Xander, do you want to talk about how we measure unstructured drift in production?
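Before moving to the unstructured case, here is a minimal sketch of the PSI calculation Amber mentions, comparing a baseline (training) feature distribution against a production batch. This is not Arize’s implementation; the bin count and the 0.1/0.25 rule-of-thumb thresholds are illustrative assumptions.

```python
# Sketch of a Population Stability Index (PSI) check between a baseline
# (training) feature distribution and a production batch.
import numpy as np

def psi(baseline: np.ndarray, production: np.ndarray, n_bins: int = 10) -> float:
    # Bin both samples with edges derived from the baseline distribution.
    # (In this simple version, production values outside the baseline range fall out of the bins.)
    edges = np.histogram_bin_edges(baseline, bins=n_bins)
    base_counts, _ = np.histogram(baseline, bins=edges)
    prod_counts, _ = np.histogram(production, bins=edges)

    # Convert counts to proportions; a small epsilon avoids log(0) on empty bins.
    eps = 1e-6
    base_pct = base_counts / base_counts.sum() + eps
    prod_pct = prod_counts / prod_counts.sum() + eps

    # PSI sums (p - q) * ln(p / q) over bins -- a symmetrized relative of KL divergence.
    return float(np.sum((prod_pct - base_pct) * np.log(prod_pct / base_pct)))

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)   # baseline
prod_feature = rng.normal(loc=0.4, scale=1.2, size=10_000)    # drifted production data

score = psi(train_feature, prod_feature)
print(f"PSI = {score:.3f}")   # rule of thumb: > 0.1 watch closely, > 0.25 investigate
```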
Xander Song: 00:10:06
Yeah, sure. So, I can talk about the unstructured case. When you’re dealing with unstructured data, what you’re dealing with are embeddings, an embedding just being a vector representation of a piece of data. So, imagine you’re dealing with like an image classification problem. You’ve got this image classification model, it’s taking, you know, photographs and it’s trying to tell you what’s in the photograph. An embedding vector would just be this vector that’s basically encoding similar images nearby in the embedding space and dissimilar images far apart. So, you would want to see images of the same class nearby, dissimilar images far apart. And it’s not just, you know, what is in the image. These embedding vectors actually contain a lot of different kinds of information, information that you might not expect. Like is the image grainy, is it corrupted? Is the semantic content of the image changing? It can be really subtle things, actually, that are difficult for human beings to detect.
00:11:09
And basically what we do is we monitor the distribution of the embeddings training relative to production. So, is that distribution of embeddings in production, is it different from the distribution of embeddings that we saw in training? And, you know, if it’s helpful, I can, I can talk about exactly how we do that. But the basic idea would be are you getting basically new areas of your embedding space? Are you seeing new parts of the embedding space light up in production that weren’t represented in training? And for traditional models where you actually, you know, a traditional CV model where you’re actually training you know, you’re training it on some data set, you could really expect that the, the model is not going to perform well if it’s doing inferences on data, the likes of which it never was trained on. So, that’s in a nutshell, what we’re doing when we detect embedding drift for unstructured cases. And I would emphasize, right, like, that approach is actually very agnostic with respect to the kind of data you’re representing. I was giving CV as an example, but you could also, represent a piece of text as an embedding vector. You could represent well I, you know, a piece of audio as an embedding vector. So, it’s a pretty agnostic approach to detecting drift for unstructured use cases.
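As a rough illustration of what Xander describes, the sketch below compares a batch of production embedding vectors against a training baseline and flags production points that land in regions of the embedding space with no nearby training data. The centroid-distance measure, the nearest-neighbour heuristic, and the percentile cut-off are illustrative choices, not necessarily the metrics Arize computes.

```python
# Rough sketch: flag embedding drift by comparing a production batch of
# embedding vectors against a training baseline. Illustrative heuristics only.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
train_emb = rng.normal(size=(2_000, 768))             # baseline (training) embeddings
prod_emb = rng.normal(loc=0.3, size=(500, 768))       # production embeddings

# Global check: distance between the average embedding of each set.
centroid_distance = float(np.linalg.norm(train_emb.mean(axis=0) - prod_emb.mean(axis=0)))

# Local check: calibrate a "typical" nearest-neighbour distance on the training
# set itself (2nd neighbour, since a point's 1st neighbour is itself), then flag
# production embeddings farther than that from every training point -- the
# "new areas of the embedding space lighting up".
nn = NearestNeighbors(n_neighbors=2).fit(train_emb)
train_dists, _ = nn.kneighbors(train_emb)
threshold = np.percentile(train_dists[:, 1], 99)      # illustrative cut-off

prod_dists, _ = nn.kneighbors(prod_emb, n_neighbors=1)
novel = prod_dists[:, 0] > threshold

print(f"centroid distance: {centroid_distance:.3f}")
print(f"{int(novel.sum())} of {len(prod_emb)} production embeddings look unlike the training data")
```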
Jon Krohn: 00:12:27
Nice. So, with structured data, with quantitative numbers that you’d find in a table, we can use what Amber was describing, where we just have some baseline distribution that we should be expecting. And if the distribution of those numbers in the table starts to get away from the baseline distribution, then that could set off like an alarm, basically. Like, is that what happens? Like somebody gets a PagerDuty notification kind of thing?
Amber Roberts: 00:12:54
Yes. So, you can set your monitors and that’s where a lot of teams have difficulty. It’s not so much understanding how drift works, but understanding how to do it at scale. So, what Arize does is set monitors for every feature of every model, every model version, every model type. So, you’re able to handle it at scale. And so, teams can work in Arize accounts similar to how you can use role access in Google or AWS and work on different sets of models. And then, yeah, that goes to Slack, email, PagerDuty. Anytime a particular feature goes off on performance, drift, or data quality, that would be alerted to a team member.
Jon Krohn: 00:14:14
This episode is brought to you by Posit: the open-source data science company. Posit makes the best tools for data scientists who love open source. Period. No matter which language they prefer. Posit’s popular RStudio IDE and enterprise products, like Posit Workbench, Connect, and Package Manager, help individuals, teams, and organizations scale R & Python development easily and securely. Produce higher-quality analysis faster with great data science tools. Visit Posit.co—that’s P-O-S-I-T dot co—to learn more.
00:14:17
Nice. Makes perfect sense. And yeah, so, just kind of circling back to what Xander was saying there. So, with tabular data, just straight numbers that we can put into a distribution, we compare the baseline distribution to some production distribution. With unstructured data, whether it’s images like the computer vision example that you gave, Xander, or whether it’s natural language data, or audio waveforms, any of these unstructured data types, they can be converted into an embedding, which, yeah, is just a vector of some length that represents numerically kind of an abstraction of all of the various features. So, like you said, with computer vision, it could be: is the image grainy or not? Or is the semantic meaning that’s represented by this image different? And so, there would be some set of embeddings. So, similar to the way that we have a baseline distribution and we’re comparing that with the production distribution, we have this baseline set of embeddings, and if we start to see embeddings that are far away from what we were expecting, then that similarly sets off an alarm.
Xander Song: 00:15:25
Exactly. Exactly.
Jon Krohn: 00:15:26
Nice. All right. I think I’m following along. So, I know that there are like lots of different kinds of drift. So, I know there’s like data drift, model drift, others. Are you able to like break down for us what these different kinds of drift are?
Amber Roberts: 00:15:46
Yes. And I will say like the course gets into each type of drift, what it is, how to monitor it, and then what metrics you should use. So, you know, there’s definitely more details if people are interested in the course, but so, different types of drift that we see. We see well what’s known as kind of model drift or a drift in predictions. So, if something’s drifting in predictions, you don’t need the actuals, you don’t need those ground truths. But if you use something called concept drift, that’s a drift in essentially the outputs which means you do need that ground truth. And for teams that get the ground truth back without a really strong delay it’s better to use a performance metric. Because what teams want to use drift for is a proxy to performance. So, if they don’t get that performance back, they want to see like, is anything happening?
00:16:41
And so, a lot of that comes with covariate drift, feature drift, data drift, metadata drift. All those are essentially putting drift monitors on each feature, and you can set those for numeric and categorical features. So, the different types of drift we see are going to be around the features, the inputs, the outputs, and then there’s also something known as upstream drift, which is essentially like something going wrong in the upstream process. So, it’s more likely an engineering issue that’s coming around there. But yeah, I would say that the number one type of drift that teams care about is feature drift. Because some of these models have thousands of features, and if you’re using things like SHAP values, feature importance values, you might notice that maybe five features have the most important impact on decisions. And so, they really want to monitor the most important features that are leading to these decisions that are then impacting customers. So, feature drift is, I would say, one of the biggest ones. And PSI, which is a Population Stability Index, which is actually derived from KL Divergence, tends to be the most used metric because of the stability it has and a symmetric property that KL Divergence doesn’t have.
Jon Krohn: 00:18:09
Cool. Yeah. And so, KL Divergence, you mentioned a couple of times there. So, Kullback, I’m butchering the pronunciation, Kullback–Leibler Divergence, that people can look up in full later. But just, I think when we say it really quickly in a podcast, like KL, you’re probably like, is that a word? So, two letters capitalized, KL, hyphen in between them in many cases, KL Divergence. So, folks can look that up.
Amber Roberts: 00:18:37
And we have, like I said, I mentioned the course, but we have blog posts that really go into detail as well. Yeah, I have trouble pronouncing it because there, there’s js, there’s there, there’s a lot of different abbreviations and they’re not the easiest to say and they’re even harder to calculate sometimes. So, having like a go-to guide has been really helpful and has helped I think a lot of, a lot of our customers understand it because these are non-trivial things to calculate for large models.
Jon Krohn: 00:19:08
Okay, cool. So, I guess with a platform like Arize, does it come built-in with like these kinds of calculations? So, yeah.
Amber Roberts: 00:19:17
Yes. So, if you were to sign up for the Arize platform, which you can. Arize has a free version and you can upload two models into it. It’s based on inferences, so anyone can go and try it out today. We have automatic schema-detecting capabilities. You can upload your data from a CSV or cloud storage or API. So, a lot of different options for uploading your data. And then we’ll automatically set all your monitors around drift, performance, and data quality. You can use custom metrics. We have a lot of default metrics that tend to be the most useful for teams. And so, we have normally two types of customers. One type of customer that just wants everything done for them; they want to check on it maybe once a week and make sure that they’re getting alerted when things happen. And then we have another type of team that really wants to dive in to fully understand all the features of their model, see if they can improve, you know, performance by 1%. Like, what would it take to do that? But yeah, it can be as automated as you’d like, or you can make it as configurable as you want. Like, you can take away all the automated monitors and make everything customizable if that’s what you want.
Jon Krohn: 00:20:34
Nice. All right. I’m starting to get a full picture of the shape of this platform and how to use it. So, you know, we’re uploading CSV files so, that the platform can get a sense of what our data are like that allows it to create these kind of baseline distributions or baseline embeddings. And then we can be monitoring in production. And if some production issue happens, then there’s alarms going off, emails, texts, and PagerDuty alerts.
Amber Roberts: 00:21:05
[crosstalk 00:21:05] and we’re pointing the finger, we’re saying, this is where it’s happening in your data. And then what we’re also able to do with some of our unstructured capabilities is export the data as well. So, I think Xander’s going to be talking more about our open-source offering, where the capabilities are also what you get from the Arize platform for unstructured data, but with all the clustering and the unstructured data, you can actually lasso certain points and export them. And teams can use this for labeling, retraining, but essentially teams want to know where in the data the problem is and be able to see exactly how much that data is influencing the platform. So, you can isolate, you can filter on those, whether it’s metadata, whether it’s a certain class, whether it’s, you know, just a few IDs, or it’s a location, you could filter on each one of those and see how much it’s actually impacting performance. Like, is performance going down 2% when we isolate this data? So, it’s that isolation aspect that is the difference between ML Observability and ML Monitoring. Because ML Monitoring’s kind of like the tip of the iceberg. Like, you definitely need that, but if you want to actually avoid these models going down, you need to have that second step of saying, where is this problem and how do I solve it?
Jon Krohn: 00:22:31
Got it. Yeah. So, that is, I guess I wasn’t in my head aware that those two words weren’t synonyms, so, I might have thought that ML Monitoring and ML Observability were the same thing, but, okay. So, ML monitoring is like, somebody could have a, like we could have a screen in the office with like a bunch of charts of like performance over time or these distributions over time and how they’re changing. But that requires somebody to be like keeping an eye on it and you’re like “oh oh”. But with ML Observability, it’s automatically keeping an eye for you.
Amber Roberts: 00:23:09
Right, right. And when you mentioned those screens, it makes me think of like the stock market and stock exchange and you’d have everything going on, and unless you’re constantly looking at it all the time just having something in place like a stop limit. So, you know what, what’s happening when you know where your limits are, and so, you can set those for your models. Like anytime this goes down 2%, you know, it affects our KPIs, our profitability down 10 million dollars. And so, relaying those to business metrics you can have those in place.
Jon Krohn: 00:23:43
Yeah. It’s interesting that you said stop order that is like, are you saying that from like experience in financial markets or is that actually what you also, call it in your platform?
Amber Roberts: 00:23:50
No, that I just came up with. From, from how you were visualizing it for me.
Jon Krohn: 00:23:56
Yeah, yeah, no, exactly. I just, but yeah, that was perfect. And it was so, fluid that I was like, and hey, that’s actually just how they like configure it. But yeah, so, it’s like, yeah, how do you know that? I didn’t notice that you had a trading background from your biography or-
Amber Roberts: 00:24:09
I don’t.
Jon Krohn: 00:24:11
Yeah.
Amber Roberts: 00:24:12
Covid hobbies.
Jon Krohn: 00:24:14
But yeah, that makes perfect sense. So, with a lot of these trading platforms you can set it up so that, in real-time, if some asset hits a certain price, you’ll either automatically buy or sell. And that makes perfect sense here. It’s like, if some aspect of your model in production drifts so far that it hits that trigger, you know that change corresponds to this business impact, this amount of dollars lost. And so, at that point, we’ve got to be sure that we’re on top of it, and it’s worth that sequence of whoever’s PagerDuty going off in the middle of the night so that someone has to get up and fix it.
Amber Roberts: 00:25:11
Yeah. A lot of times it’ll trigger a retraining cycle.
Jon Krohn: 00:25:14
Nice. Which is actually one of the topics that I wanted to talk about next. Because when something goes wrong, when an alarm goes off, when a model is no longer performing as expected, fundamentally that’s the idea with drift, if we were to summarize it: whether it’s concept drift, feature drift, prediction drift, with all these different kinds of things, the issue is that for some reason or other, something about the model is no longer relevant to the real-world data that are coming through your platform. And so, then when this arises, we need to retrain our model, I guess, is the most common solution.
Amber Roberts: 00:25:53
Retraining is a very common solution, and I’ll actually let Xander chime in on this because he works a lot with community members that are leveraging Arize for their use cases, but retraining is very common. But a lot of times, you know, it’s, it’s a bigger conversation of what’s going on with the data. I never realized how strange it was to retrain a model on an arbitrary amount of time until I started working at an ML Observability company and then, you know, oh, we’re going to retrain every two weeks or every day or every six months. Like, you realize that really doesn’t make sense. It should be on a needs basis, but then you’re like [crosstalk 00:26:32]
Jon Krohn: 00:26:32
Interesting.
Amber Roberts: 00:26:32
With the community.
Jon Krohn: 00:26:34
Yeah. So, just really quickly, Xander, before you go, that is a really interesting point that you made and something that I have never thought of. So, for me it is exactly this: I’ve only ever thought of model retraining on a schedule. Indeed, in my talk at ODSC East in Boston, I specifically was like, yeah, you need to retrain at regular intervals, that’s the exact guidance that I provided on my monitoring ML models in production slide, which now I realize says monitoring ML models in production, and so, yeah, I hadn’t gone that step further and been thinking about observing. But so, I guess as part of this, with observing, you can be having these triggers automatically retrain your model, which makes more sense for so many reasons. Because my suggestion was this daily, weekly, monthly, whatever cadence on retraining the model, and that actually doesn’t make any sense unless you’re aware of how much drift is happening. Because if I’m doing it every day, but I only need to be doing it once a year, I’m wasting a crazy amount of resources.
Amber Roberts: 00:27:42
Yes.
Jon Krohn: 00:27:43
Whereas if I’m only doing it once a year, and I should be doing it every day my platform’s going to be terrible.
Amber Roberts: 00:27:50
Yes, exactly. If it isn’t broken, you know the saying goes, you don’t need to fix it. And then you do have teams on the other side that are afraid to retrain a very large model that cascades into maybe 10 other models that affect 10 other teams, in case they retrain it and something goes wrong. So, you have people on both ends: teams that are like, let’s retrain every time we get any new data, and then teams that are, you know, let’s be very, very cautious and not retrain unless it’s absolutely necessary. And Arize helps folks understand when is the right time to retrain.
Jon Krohn: 00:28:30
Awesome. All right. So, Xander, I didn’t let you speak earlier. You got some examples for us? Well,
Xander Song: 00:28:34
Yeah, I think just one thing I wanted to add on to what Amber’s already said, just around the difference between ML Monitoring and ML Observability. I usually try and break down ML Observability into this simple equation, which is it’s monitoring plus the ability to identify the root cause of the issue. So, just, just to know that something’s gone wrong is not always enough. Like you need to actually have some visibility into what’s causing the problem and figuring that out for machine learning systems can actually be like this really devilish, devilishly tricky problem to do. And so, I think that’s also, one of the distinctions I draw, draw there between those two concepts. Part of what we’re trying to do is not only just detect and give you a little, a little alarm bell, but also, to help you really quickly to give you these like opinionated workflows that are going to really help you quickly identify exactly what the issue is. So, that, that’s, that’s another differentiator I would say, like between monitoring versus ML Observability.
Jon Krohn: 00:30:19
This episode of SuperDataScience is brought to you by AWS Trainium and Inferentia, the ideal accelerators for generative AI. AWS Trainium and Inferentia chips are purpose-built by AWS to train and deploy large-scale models. Whether you are building with large language models or latent diffusion models, you no longer have to choose between optimizing performance or lowering costs. Learn more about how you can save up to 50% on training costs and up to 40% on inference costs with these high performance accelerators. We have all the links for getting started right away in the show notes. Awesome, now back to our show.
00:30:22
Cool. All right. Yeah, thanks for that additional insight, Xander on ML Monitoring versus Observability. And so, thus far we’ve been detailing the Arize commercial platform and how it can be useful. And you’ve even mentioned how people can be having up to two models uploaded for free, which is super cool so, people can check it out. And maybe for like, you know smaller companies or use cases that’ll be enough for them actually without needing to go to a commercial option. But you also, have a brand new open-source product, which is called Phoenix. So, tell us about Phoenix. Can it do everything that the Arize Enterprise product can do, for example?
Xander Song: 00:31:00
Yeah, so, Phoenix is bringing part of the functionality of the Arize Enterprise platform, which is the SaaS platform, into a Notebook environment. So, what it actually is, is an application that runs alongside your Jupyter Notebook, runs on the same server that’s running your Notebook server, whether that’s your local computer, whether that is your Colab server. And it actually gives you this interactive experience that’s more immediate than uploading data to a SaaS platform. And right now, at the moment, it’s really focused on unstructured data, on the unstructured offering that I talked about a moment ago. But I think in terms of the long-term scope, we’re really trying to be responsive to the community. What does the community want? What is the community hungry for? And right now the answer is they’re really hungry for LLMs. So, that’s really where the push is happening for us on the Phoenix front right now.
Jon Krohn: 00:32:01
Awesome. Yeah, so, I mean, tying back in perfectly to a lot of podcast episodes that we’ve had recently on the show. We’ve had a lot on generative AI on large language models. Obviously like my talk at ODSC East was focused on that and was a super popular one, and I knew it would be because this is just like what everyone’s talking about, whether you’re in data science or not, and whether, you know, to call it large language models or, you know, to call it generative AI. Everybody’s talking about ChatGPT, GPT-4, Midjourney and how these large language models are impacting the world and, you know, is my job safe? Or like, how is this going to change things in the future? What are the policy implications? What does this mean for misinformation? So, yeah, so, like one way or another in the data science world or not, these are major topics and so, it’s really cool that you have decided to go, to be kind of LLM first, unstructured data first with your Phoenix open source product. So, yeah, so, I mean, is this something like people just go to the GitHub repo and it’s like straightforward how to be using this with your LLMs?
Xander Song: 00:33:14
Yeah, yeah. So, you can just pip install it. It’s pip install arize-phoenix. You can check out our GitHub repo, you can check out our docs. We got a bunch of tutorial notebooks up. Yeah, so, it’s, you know, free to get started. Just put it in your Notebook and run it.
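For orientation, here is roughly what getting started can look like, based on the Phoenix documentation around the time of this episode. The class and parameter names (px.Schema, px.Dataset, px.EmbeddingColumnNames, launch_app) may have changed since, and the file and column names are hypothetical, so treat this as a sketch and check the current README and docs.

```python
# pip install arize-phoenix
# Sketch of launching Phoenix in a notebook to diff a production batch of
# inferences against a training batch. Files and column names are hypothetical.
import pandas as pd
import phoenix as px

train_df = pd.read_parquet("train_inferences.parquet")   # hypothetical file with an
prod_df = pd.read_parquet("prod_inferences.parquet")     # "embedding" vector column

schema = px.Schema(
    prediction_label_column_name="predicted_class",
    embedding_feature_column_names={
        "image_embedding": px.EmbeddingColumnNames(vector_column_name="embedding"),
    },
)

session = px.launch_app(
    primary=px.Dataset(dataframe=prod_df, schema=schema, name="production"),
    reference=px.Dataset(dataframe=train_df, schema=schema, name="training"),
)
print(session.url)  # open in a browser tab, or view the app inline in the notebook
```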
Jon Krohn: 00:33:31
Nice. And then, like, are there additional kinds of resource requirements? I guess typically, to be running an LLM, we’d need to have some pretty big infrastructure running, probably with at least one GPU. And then, so, I guess in addition to that, there are probably no additional infrastructure requirements? It’ll be relatively lightweight relative to the model?
Xander Song: 00:33:59
Yeah. There’s no [inaudible 00:34:00] requirements at all. You can run it along, you know, if you’re running a, in a Notebook environment that has like a GPU, you can run it there, but you don’t need a GPU in order to run the app. Of course, the app is just this pretty lightweight application, and it actually is going to be a UI that you can actually use either in a separate browser tab or literally you can open up the UI and use the application literally inside of the Jupyter Notebook.
Jon Krohn: 00:34:23
Gotcha, gotcha, gotcha. So, once you get Phoenix running, it’s similar to the way that you might have TensorBoard running while watching your machine learning model train. Yeah, TensorBoard comes originally from TensorFlow, but your model could be in PyTorch, it doesn’t matter. With TensorBoard, you can open up another browser tab, or again, you could have it running just there in the Notebook, and you can watch your model’s cost hopefully go down as your model trains. And so, similarly, you could have another tab open with Phoenix running. And so, you could be watching in real-time, is your model running in production, just have this extra tab open. And then, so, that sounds like ML Monitoring to me so far. But Phoenix also has built in, like, the observability component, so that, to go back to the financial analogy, if we hit that stop order price we’ll end up triggering alerts?
Xander Song: 00:35:20
Oh yeah. Oh yeah. So, one thing I’ll say is like Phoenix at the moment is not real-time. If the community, you know, if we find that the community wants that, we could, we could put that in, but at the moment, that’s not, it’s not a real-time application. That would be the SaaS platform, right? But imagine that you’ve, you’ve got some, you know, some inference data, right? And imagine that we’re in this situation where we want to detect drift in. Before I mentioned, you know, you have, you’re monitoring your embedding distribution, right? And we’re able to actually measure quantitatively how far has the production distribution shifted away, drifted away from the training distribution of embeddings. And so, you can imagine you’ve got like this graph showing you drift of your production data relative to your training data over time. Again, all this is the drift of the embedding distribution, right?
00:36:06
And then what we can do is, we can take those embeddings, and this is where I wish I could show the people who are listening, but imagine this: imagine you take the embeddings, which are these high-dimensional vectors, right? They could be a thousand dimensions or more, right? And you take those embeddings and you do some dimensionality reduction to view them in three dimensions. So, now you’re actually literally able to see your two embedding distributions in three dimensions, your production and your training distributions. And again, like I mentioned, if the production distribution has, sorry, has shifted away, has drifted away from that training distribution, again, you’re going to be seeing pockets of production data that you didn’t see during training. So, it’s really cool. You can literally see the exact data points in production where you didn’t have training data. And then what we do is we provide this… Yeah, go ahead. Go ahead.
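A small sketch of the dimensionality-reduction step Xander is describing: projecting high-dimensional embeddings down to three dimensions so the training and production point clouds can be inspected side by side. UMAP is one common choice here, not necessarily what Phoenix uses internally, and the parameters and data below are illustrative.

```python
# Sketch of projecting embeddings to 3D so training vs. production clouds can
# be compared visually. Requires `pip install umap-learn`; data is synthetic.
import numpy as np
import umap

rng = np.random.default_rng(2)
train_emb = rng.normal(size=(2_000, 768))
prod_emb = rng.normal(loc=0.5, size=(2_000, 768))

# Fit one reducer on both sets so they land in the same 3D space.
reducer = umap.UMAP(n_components=3, random_state=42)
coords = reducer.fit_transform(np.vstack([train_emb, prod_emb]))

train_3d = coords[: len(train_emb)]
prod_3d = coords[len(train_emb):]

# Plot train_3d and prod_3d in two colors (matplotlib, plotly, etc.):
# pockets of production points with no training points nearby are the
# drifted regions you would drill into.
print(train_3d.shape, prod_3d.shape)
```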
Jon Krohn: 00:37:02
But, but this is all Phoenix that you’re still describing, right?
Xander Song: 00:37:04
So, this is, this is Phoenix, but this, this is also, in the SaaS platform.
Jon Krohn: 00:37:08
But, so, this is going to sound like I must have missed something, but you said Phoenix doesn’t work on production, yet everything that you were just describing was production and embeddings and comparing those to training. So, like, when we’re watching Phoenix, when we have this extra tab open as we’re training, what are we watching? You were talking about it as comparing production embeddings versus training embeddings, but if we’re just training a model, what does production embeddings mean in that context?
Xander Song: 00:37:40
Yeah. Yeah. So, I, I think I was probably, I was just trying to say like, it’s not I was trying to clarify like Phoenix isn’t an app that you’re like logging real-time data to, if that makes sense. It’s a, it would be like, imagine you’ve got-
Amber Roberts: 00:37:53
Batches of production data. So, you could still have batches-
Jon Krohn: 00:37:59
Oh, I see. So, okay, okay. Okay. So, I completely misunderstood. So, I was thinking that Phoenix was like, okay, I have two tabs open. I’ve got my TensorBoard running, and maybe this is my mistake because I was just kind of running with this idea. But, like, I’m training a model in my Jupyter Notebook and I’ve got a tab open that I’m watching my loss functions on, and then I’ve got another one where I’m watching Phoenix. But that isn’t what you’d be doing. You’re taking batches of production data and ad hoc looking at them to check for drift. So, Phoenix allows you to do ad hoc ML monitoring to identify your own issues and to give you a better understanding of maybe where your model’s starting to fall down in production or where it has limitations and where you might want to be retraining or adjusting aspects of your model-
Xander Song: 00:38:55
That is, that is an accurate description. Yeah.
Amber Roberts: 00:38:57
Yeah. And Jon, you can also use, like, a validation set versus a training set. You can use it pre-production. But most folks are concerned with that post-production workflow. Even if the inferences are kind of like mock data, most of the time it’s like you get a little bit of data back, you know, you’re kind of A/B testing, you’re getting some information back just to see how well your model is doing. Right, we’re not just validating the model, we’re validating how well our model is doing with our customers.
Jon Krohn: 00:39:34
I gotcha. I gotcha. And so, yeah, you made a really interesting example there, Amber, that I hadn’t thought of, which is splitting our training and validation sets. So, I’m guessing, you can correct me where I’m wrong here, but this sounds to me like: when we are training our models and we want to make sure that our model works well on data that it hasn’t seen before, we have this validation data set that we set aside, but that validation data set is only useful if it matches our training data in terms of, obviously not being identical data points, but having the same kind of distribution in the case of structured numeric data, or the same kind of embeddings in the case of unstructured data. Yeah, I actually got some head nods, which you’d only see on our video version, and actually my head would’ve been on screen, so you couldn’t even see. But Xander and Amber both nodded their heads in agreement at the same time, so, okay. Okay, cool. Yeah, that’s another use case that I hadn’t thought of.
Xander Song: 00:40:37
And I would just add on to that, right? I think I started out with this idea of training versus production because I think that’s a lot of the time the easiest one to understand the very first time, but in general, Phoenix is this pretty general tool for being able to diff embedding distributions, right? So, as you mentioned, you could be diffing the distributions between training and validation data. You could be diffing the two distributions between a fine-tuned and a pre-trained model. We’re getting into, like, actually really interesting applications right now. So, actually what I’ve been working on the past couple weeks is diffing the distribution of embeddings for a context retrieval knowledge base versus the distribution of user queries, to understand, are users asking my LlamaIndex service or my LangChain semantic retrieval service, are they asking questions of my database that are answered in the database or that aren’t answered, right? So, it’s a pretty general idea. It’s really, you know, champion-challenger, anytime you would want to diff two embedding distributions, I would say.
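A minimal sketch of the knowledge-base-versus-queries comparison Xander mentions: embed the documents in the retrieval corpus and the incoming user queries, then flag queries whose closest document is still far away, a hint that the question may not be covered. The embeddings below are random stand-ins, and the similarity threshold is an illustrative assumption.

```python
# Sketch: do user queries fall near anything in the retrieval knowledge base?
# In practice both sets of embeddings would come from the same embedding model
# used by the retrieval service.
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(3)
kb_embeddings = l2_normalize(rng.normal(size=(10_000, 384)))   # knowledge-base chunks
query_embeddings = l2_normalize(rng.normal(size=(500, 384)))   # user queries

# Cosine similarity of each query to its closest knowledge-base chunk.
best_similarity = (query_embeddings @ kb_embeddings.T).max(axis=1)

uncovered = np.where(best_similarity < 0.75)[0]                # illustrative threshold
print(f"{len(uncovered)} of {len(query_embeddings)} queries have no close match in the knowledge base")
```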
Jon Krohn: 00:41:47
Nice. Great. Thank you for generalizing this concept as well as giving specific examples. Very cool. So, yeah, the canonical thing that we’re thinking of with ML Observability in general, because this is the thing that, you know, we’re worried about most in production, is comparing training data distributions versus production. But of course a tool like Phoenix, which is open-source, which we can use for comparing any kinds of embeddings, we can be using that for comparing our training data versus validation data, fine-tuned model versus not, yeah, particular user use cases, trying to see whether users are kind of using our platform in the way that we anticipated, the way that we trained it for. Super cool. All right. Yeah, I’m starting to see a huge amount of value here in Phoenix. Yeah. And so, I guess maybe, so, you mentioned how you developed Phoenix on the one hand because LLMs are so popular today.
00:42:46
But I also just had a bit of a brainwave, and you can correct me if I’m wrong on this, but it seems to me like it might also be particularly useful because comparing embeddings is so much more complicated than just comparing distributions. So, it also kind of seems like you’ve created a product that solves the more complex problem. You know, with respect to, we talked at the beginning of the episode about structured data, where we’re just comparing distributions, versus unstructured data, where we’re comparing embeddings. It seems to me like that latter problem is more complicated. And so, this Phoenix product, like, scratches that complex itch.
Xander Song: 00:43:26
I think that’s a, yeah, that’s a, that’s a good description. And then I think one, one last thing I would tack on there too is, is again, like the thing that we’re really aiming to do is not only detect like that something’s changed, but it’s really to immediately drill down into exactly what has changed. So, like, what that looks like in the product is we’re actually like pointing out the exact portions, the exact embeddings, the exact data points that are causing the drift or that are causing this change in the distribution and then surfacing those up to the user and telling you, this is what you need. This is the data that you need to look at in order to understand why you’re experiencing this drift issue. Or in order to understand why are the users who are asking questions of my you know, my semantic retrieval LangChain service, why, why are they not getting answered, right? That’s the idea.
Amber Roberts: 00:44:25
And the whole point of Arize’s products is to solve the pain points our customers are facing. So, the problems that customers have with using traditional drift methods aren’t so much being able to understand how they work, but being able to do it at scale, being able to do it with very high volume, being able to select which metric is best for that use case, and that’s what we help solve for more traditional drift metrics, like setting them up for scale. And then for unstructured, it’s: is this even possible? Like, a lot of teams are coming to us that have more traditional models and they’re like, I’m thinking about using these LLMs now. Like, I don’t want to be behind. How would that even look? Can I track that? Can I see how that’s doing in production? And so, those are just some of the ways we look at it, and some of what sets Arize apart.
Jon Krohn: 00:45:22
Nice. Very cool.
00:46:08
Did you know that Anaconda is the world’s most popular platform for developing and deploying secure Python solutions faster? Anaconda’s solutions enable practitioners and institutions around the world to securely harness the power of open source. And their cloud platform is a place where you can learn and share within the Python community. Master your Python skills with on-demand courses, cloud-hosted notebooks, webinars and so much more! See why over 35 million users trust Anaconda by heading to www.superdatascience.com/anaconda — you’ll find the page pre-populated with our special code “SDS” so you’ll get your first 30 days free. Yep, that’s 30 days of free Python training at www.superdatascience.com/anaconda
00:46:11
So, when we’re dealing with LLMs, the scale of these models can be very large, obviously, like billions of parameters. And then when we think about a user of your products, maybe having a very large LLM that also has a lot of users, they’re needing to scale up already big servers to, you know, very large numbers of these servers running, to be handling all the users. So, are these kinds of scale challenges something that the Arize team had to deal with as you decided that, you know, LLMs were something you wanted to focus on? Or is there something about the way that Arize was architected such that this scaling just kind of happened automatically?
Amber Roberts: 00:47:07
Xander, do you want to take a crack at that first?
Xander Song: 00:47:09
Yeah, I think I want to also maybe clarify one thing that is a common question we get in the beginning, which is: what data is Arize actually taking in? So, we’re not actually taking in the model. We’re not an inference platform, we’re not performing inferences, right? It’s our customers who are actually handling that responsibility, and our customers are logging to us basically all of the data around the inferences. In the case of an LLM, you’re going to have the actual embedding itself. So, you as the customer would, at inference time, grab that last hidden layer or, you know, however you want to construct that embedding, and you would log it to us, right? And then we’re taking in that embedding, and that is the information that the Arize platform is actually responsible for.
Jon Krohn: 00:48:05
Gotcha, gotcha. So, it sounds like the answer to my question is that yeah, Arize scales very easily. Like it’s yeah. Got it, got it.
Amber Roberts: 00:48:14
Oh, I’ll just, I was going to add on, like, we’re very much built for scale. That’s where we win a lot of bake-offs. Because anything can look good if you’re just experimenting with a very small set of data. But being able to scale it: some of our customers have billions of inferences daily, like ad tech, e-commerce; they have thousands of models. Each one of these models has thousands of features and metadata and SHAP values. And then with embeddings too, they can choose to sample it, they can choose to upload all the data, because with a lot of the unstructured use cases, they’re looking for trends, they’re looking for major aspects of the data and not just a single anomaly. They’re looking for patterns, they’re looking for kind of what clusters are emerging, what clusters are new. And so, for that, sometimes they will sample the data or use all of it, because as we know, these could be very, very large.
Jon Krohn: 00:49:20
Gotcha. Gotcha, gotcha. There’s a, you said, did you say bake off or big off?
Amber Roberts: 00:49:25
Bake off.
Jon Krohn: 00:49:25
A bake off.
Amber Roberts: 00:49:28
A bake off. Little accent coming out there. Big off. So-
Jon Krohn: 00:49:36
Oh, bake off. I thought it was actually, so, I assumed that you said bake off, but it would be kind of perfect if you said big off because it’s like, like very like, cause it was specifically to deal with like who can handle the scale well?
Amber Roberts: 00:49:47
Yeah, breaking scale [inaudible 00:49:49], it’s breaking the scale challenge.
Jon Krohn: 00:49:52
Who can make 10 times as much cake, a hundred times as much cake. Cool. All right, so, all right. I’ve, I’ve, I think I’ve got a pretty good grasp on the Phoenix product on Arize’s commercial offering and on these kinds of drift issues, model observability issues in general. A related problem particularly with unstructured data being used in production is bias. So, I know that Arize has a bias tracing tool. Can you dig into like why this is important? How it relates to like model explainability and how practitioners could be leveraging this to debug their models or maybe have models that are safer in production?
Amber Roberts: 00:50:48
Right. I’ll talk about the bias tracing tool and then, Xander, maybe you can give an unstructured example of where you can start seeing poor actors or bad data coming through unstructured models by isolating data. But with our bias tracing tool, so, when teams say, you know, do you have explainability, what they normally want is bias detection. Because explainability is great, SHAP values are great, knowing what features lead to model decisions is great, but it doesn’t tell you anything about what the final decision was, the impact it has for that user. So, looking at bias tracing and looking at it on different levels. So, with bias you’re going to want to look at parity. So, is my model making better decisions for this base group as compared to this sensitive group? Because your model could just be, you know, poor at making decisions regardless of the group.
00:51:57
So, then, you know, your model’s just bad, but it’s not biased. And then sometimes your model’s performance overall is really good, but not when you compare, maybe, a certain demographic, a certain cohort to a base cohort, or a cohort that you have more data for, because you might have a minority and a majority class. So, by doing this parity comparison and using something like a recall parity or false positive rate parity, seeing the decisions you’re making between groups is really key. And we actually have more information on that in the course and blog posts about how to measure and detect fairness in production. Because a lot of times it comes down to, there’s a lot of things that actually end up causing bias in production. Sometimes you say, oh, I’m removing all the data that relates to a protected class.
00:52:52
But you could have proxies, you could have just not enough data on certain groups. You could have class imbalance issues. You could have certain biases in the data itself. And if you’re training on a certain set, you know, just being able to isolate where these biases are taking place. Because for parity scores, say you’re looking at the decisions you’re making for a loan for women versus men, you want that parity score to be as close to 1 as possible. Because if it’s close to 1, your false positive rate parity is going to be good for one group and good for another group. But if you see that parity score below 0.8 or above 1.25, you’re outside that range, you’re outside of the Four-Fifths rule, which is, you know, like I said, this is all new stuff, but that is implemented in Congress as a way to measure equal opportunity and, you know, whether teams are being essentially biased for jobs.
00:54:04
So, that’s one way we measure it. So, if you’re way off that parity of 1, if it’s like a parity of 0.1 or a parity of 6 or 7, looking more into the data is going to be key. And that, I know that’s a little bit of a, of a run on, it’s just a very big topic. And it’s not just measuring drift and it’s not just measuring performance. Like there are special metrics in place around bias tracing.
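A minimal sketch of the recall-parity check and four-fifths band Amber describes, comparing a sensitive group against a base group. The groups, labels, and exact metric choice are hypothetical; Arize’s own parity calculations may differ.

```python
# Sketch: recall parity between a sensitive group and a base group, checked
# against the four-fifths band (0.8 to 1.25). Data and group names are made up.
import pandas as pd
from sklearn.metrics import recall_score

df = pd.DataFrame({
    "group":  ["women"] * 4 + ["men"] * 4,
    "y_true": [1, 1, 0, 1, 1, 1, 0, 1],   # actual outcomes (e.g. loan repaid)
    "y_pred": [1, 0, 0, 0, 1, 1, 0, 1],   # model decisions
})

def group_recall(frame: pd.DataFrame, group: str) -> float:
    g = frame[frame["group"] == group]
    return recall_score(g["y_true"], g["y_pred"])

parity = group_recall(df, "women") / group_recall(df, "men")
print(f"recall parity = {parity:.2f}")
if not 0.8 <= parity <= 1.25:
    print("Outside the four-fifths range -- investigate this cohort's data.")
```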
Jon Krohn: 00:54:31
Right, right, right, right. In addition to those special metrics, is it the case that these kinds of distribution monitoring or embedding monitoring, so, like, even going back to the conversation that we were having earlier where I had this light bulb go off around what you were saying, where we could be using Phoenix for comparing training versus validation, and then you went into other examples of fine-tuned model versus not. Similarly, in addition to those special metrics that you just mentioned, Amber, can we also additionally be comparing the embeddings of, like, a sensitive group versus not, or am I-
Xander Song: 00:55:11
Yeah, sure. So, one, one thing that you can do, I mean, there’s, there’s a lot that you can do here. One thing that you can do is easily, imagine that we’re dealing with like text output from a model. And you want to know is the is the text output offensive or biased? Or is it, you know, is it actually producing some kind of bias against a certain kind of prompt, for example, maybe it’s a prompt about that has some kind of gender component to it. Is it producing output that’s different based on gender? Right? Maybe that’s a concrete example we could look at, right? You’re dealing with some kind of chatbot, right? Your input prompt has some kind of gender feature or gender part component of the prompt and you’re looking at the responses. And if you actually embed the responses, what you can do is see if there’s certain clusters literally like clusters of your embeddings, the embedding space, right, like clusters of your responses that are offensive or biased or have some other kind of like negative outcome there. And you can literally like look at those data points and color them by the gender of the input prompt, right? And visually see in the embedding space what’s going on, right? So, that’s, that’s one idea. Like the embeddings because they contain this really rich information are going to be very useful for understanding that kind of, that kind of situation and use case.
Jon Krohn: 00:56:47
Nice. Super cool. Yeah. All right. Nice. So, we’ve got a clear idea of, I think a lot of the key offerings that Arize has now, and this has given me and presumably also our listeners, a huge amount of context around why ML Observability is important, as well as how it relates to adjacent, adjacent issues, like what we were just talking about with bias and model explainability. So, you guys have a lot of high-profile customers, companies like Uber, Spotify, eBay, Etsy, are you able to share without like, obviously giving away anything proprietary about you or your clients that you shouldn’t be sharing on air, is there something, are there aspects of like your relationship with these big clients that you can, yeah, you can dig into some case studies of how you were able to help with your solutions?
Amber Roberts: 00:57:37
Yes. That’s a great question, Jon. We see a lot of teams coming to us either because something went wrong in production, something went really bad. You know, models were down for a while. These are models that can control websites, revenue, profitability, forecasting. They control how much product you’re going to buy. And when those models go down, or they drift, or they decrease in performance, it has a cascading effect. So, there’s a lot of teams that come to us because something went wrong and they didn’t catch it. You know, it’s at least a week to detect and fix an issue for most teams, and most of the time it’s longer than that. Or we have teams that really want to be preventative. So, some cases we see, going back to that retraining discussion we were having – when to retrain your model?
00:58:38
And not everyone agrees at a company when they should be retraining their model and how they should be retraining their model. So, having ML Observability as a tool to justify retraining is going to be important. Some teams are using a lot of resources retraining their model more frequently than needed, and it’s not always the best case to just automatically retrain your model on data that you haven’t really analyzed. And so, being able to see the data, see what’s going on and say, “Hey, we don’t have this.” Especially in large language models, you can see these new clusters forming. You can see, like Xander was saying, prompts and responses. You know, so, being able to retrain models based on what you’re seeing and making a justification for that. Machine learning engineers can use our dashboards, they can use our visualizations, put those in kind of a PDF file and send that to certain managers. And we have seen that help teams a lot on creating a specialized training cadence.
00:59:46
Another thing is deprecated models, essentially the models that you put out to pasture. You know, which models are actually making a difference and which aren't? You can see those performances, you can track them over time, and that helps a lot of teams. The more models teams have, the more complicated their process is, and that seems to be where Arize really shines: for teams that have models that are three or four years old, a lot of times they don't have the original members of the team who built them. So, it helps them maintain those models without knowing the source code and without having to change a lot of things up. You can tell whether a model is working or not just by having it in Arize.
01:00:41
Knowing when models are failing, and version control. That's a big thing that Arize offers. When you're coding, you want to version-control your code; the teams we work with often version-control their models, and then they can compare versions of their models. Just like we were saying, you can test production against training, you can test validation sets, you can test any two datasets against each other. You can also test model versions against each other and see which ones are performing better. So, a lot of times what we offer is a tool for the machine learning engineer to assist with their job and their workflow: obviously to catch issues ahead of time, but also so they can make justifications and tie those back to business profitability and business metrics.
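As a loose illustration of that version-versus-version comparison, one could score two candidate model versions on the same held-out set before promoting either; the metrics, labels, and predictions below are placeholders, not a prescribed Arize workflow:

# A hedged sketch of comparing two model versions on the same validation data.
from sklearn.metrics import accuracy_score, f1_score

def compare_versions(y_true, preds_v1, preds_v2):
    """Return a small metrics report for two model versions."""
    report = {}
    for name, preds in [("v1", preds_v1), ("v2", preds_v2)]:
        report[name] = {
            "accuracy": accuracy_score(y_true, preds),
            "macro_f1": f1_score(y_true, preds, average="macro"),
        }
    return report

# Toy labels and predictions; in practice these come from your validation set
# and from each candidate version's predictions on it.
y_true = [1, 0, 1, 1, 0, 1]
preds_v1 = [1, 0, 0, 1, 0, 1]
preds_v2 = [1, 1, 1, 1, 0, 1]
print(compare_versions(y_true, preds_v1, preds_v2))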
01:01:30
Another big part of our survey was teams saying that business executives have a hard time quantifying the return on investment they're getting from AI. One of our customers said they spent 10 billion on AI solutions and they don't really know how it works or if it's even helping the company. All these companies will remain nameless, but we are seeing this happen: they don't want to not implement the latest technology, but they don't really understand if it's working or if it's actually improving their profitability for the cost involved in implementing it. And then, like I said, the other big part is scale. Spotify recently gave a talk on how they are trying to manage their massive number of embeddings in production with Arize. As you can guess, Spotify has all kinds of embeddings: audio, text, search and retrieval. There are a lot of algorithms going on and a lot of personalization and real-time aspects happening. So, the integration of all those models, the embeddings, the scale, I think those are the most important aspects when thinking of an ML Observability solution. Because trying to do all that in production, while trying to build new models and do everything else a machine learning engineer is supposed to be doing, can be very difficult. So, having something in place where you set it up and you can check on it, but you know if it's doing fine, is really key for a lot of teams.
Jon Krohn: 01:03:07
Awesome. Yeah, those were great examples, even the ones where you couldn't name the company specifically. That really colored how useful a platform like this is, especially at the scale that these companies are dealing with. And it's super cool that you were able to give that Spotify example in particular. It's crazy to think how many embeddings they must have. There's something like a hundred thousand tracks uploaded to Spotify a day, and apparently a lot of that's actually AI-generated.
Amber Roberts: 01:03:34
It's AI feeding, AI tracking, it's pretty crazy.
Jon Krohn: 01:03:40
Nice. All right, so we've talked a lot about Arize and Machine Learning Observability in this episode, but I haven't let the audience really get to know either of you as individuals. So, I've got a last topic here for them to learn a little bit about you. Amber, you studied Astrophysics; Xander, you studied Math; and you both had different roles before you got into what you're doing today. So, maybe give us a little bit of a taste of how you transitioned into what you're doing today. Xander, you're a developer advocate, which is a kind of role I think we're seeing more and more in startups: working directly with developers, answering their questions, developing community, coming onto podcasts. We see these developer advocate roles in cool, fast-growing startups like Arize more and more. Amber, your role I had never heard of before: ML Growth Lead, Machine Learning Growth Lead. So, you're going to have to dig into exactly what that means. But for each of you in turn, maybe starting with Xander, let us know how you became a developer advocate and why a listener out there might be a perfect fit for a developer advocate role themselves.
Xander Song: 01:05:06
Yeah, this is a good question. So, let me backtrack a bit, because this is actually my very first time doing this kind of role. I've been doing dev advocacy for about eight months, and before I joined Arize I actually did not know what a dev advocate was. So, for anybody out there who doesn't know what it is, I think of it as two components. Part of it is evangelism. As you said, going on podcasts. I spoke to a really well-respected dev advocate in the field, and he basically told me you need to live and breathe the product, you have to convey the passion for what you're building, what the team is building, to the audience and convey the need, right?
01:05:53
So, evangelize the product, right? That's part of it. And then part of it is being a pair of boots on the ground that is in touch with the community, being the first person a user of an open-source product is going to talk to, and being a point of contact for the community. So, if those kinds of things appeal to you, it's definitely a good career to consider. And in terms of how I got into it, my background is that I was previously working as a Machine Learning engineer at an early-stage company, and we actually died. It was a smaller startup than Arize. I worked there for about two years. We really fought super hard, and we died, we couldn't raise.
01:06:42
And when I took a moment to understand why we died, really, the thing for me was that we were pretty heads-down building stuff, and it turns out that the stuff we built didn't have a market, right? It didn't have a strong market. We didn't achieve product-market fit. And that really became this moment for me where I wanted to evaluate, career-wise, how I could prevent that from happening again, because it was very painful. And for me the answer was, "Oh, I just need to be really engaged in the community and really in tune with the community." For me, part of it is just being a conduit between the company and the community as a whole.
Jon Krohn: 01:07:31
Very cool, nice explanation there. And yeah, you've certainly got an ear to the ground now and can keep an eye on the trends. It also seems like you've identified a company that has great product-market fit, so.
Xander Song: 01:07:46
Yeah, yeah.
Jon Krohn: 01:07:47
Nice. Amber, the floor is yours.
Amber Roberts: 01:07:52
Awesome. And I will just say my role changes all the time. I went from academia to industry in 2018, and since then I've been an AI Program Director, an AI Product Manager, Head of AI, an AI Sales Engineer, a Machine Learning Engineer, and an ML Growth Lead. And I think that's one of my favorite parts of being in tech: you can try out these roles, see what you like and what you don't like, and I'm always curious about different roles and what folks are doing. And that's really what led me to growth, because I think it's one thing to be heads-down building the product, which I really do enjoy, but what Xander said about finding that product-market fit and helping folks solve problems is really where I want to be. I want to solve those pain points we see for customers, and the growth aspect, being part of a growth team at a startup, is incredibly important for word of mouth, for funding, for getting people's hands on the product, and for helping solve real-world problems.
01:09:03
And so, the growth aspect is keeping in mind: are people using our product? Are they using the open-source? Are they using the platform? Are they staying engaged with it? Do we have activation? Do we have retention? Do we have people in our open-source community talking about the issues in MLOps that they face? And for me, I really like connecting with folks. I like giving talks, doing workshops, doing events, and having conversations. Like, you know, Jon, we met at an event, and that to me is all part of growth: getting people to understand what Arize is, how to pronounce it, and what ML Observability is.
01:09:47
So, it's making sure that people even understand they're having issues in the first place. Like what we talked about earlier about Monitoring versus Observability, about retraining on an arbitrary schedule. They don't realize these things aren't as they seem until you pull the veil up and ask: are you struggling? Do you know if your users are churning? Do you know if you're maximizing the profitability of your models? And that's what I find really cool, because for a lot of teams a few years ago when I started, ML Observability was a luxury. Like, oh, that's nice to have, but we're focused on other areas. And now a lot of teams don't know how their models ever got by without observability, because they're thriving way more now and they feel safer about their models. Machine learning engineers feel better about putting their models in production, knowing that if something goes wrong, there are guardrails.
Jon Krohn: 01:10:53
Nice. Yeah, it's an increasingly important area that everybody in this space needs to be aware of. It's amazing to me that we hadn't done an episode focused exclusively on ML Observability until today. It was very obvious to me as soon as I met you that we needed to do this episode, and I'm glad you were willing to come on so quickly. Because, you know, when you describe situations like at Spotify, but even at much smaller companies, Machine Learning models depend on Machine Learning models, which depend on Machine Learning models in production. There are these cascades. And so, just one model drifting and having issues could mean that a user's experience really takes a nosedive.
Amber Roberts: 01:11:39
Yeah, one other interesting thing is that at ODSC I judged a hackathon, and these were great presentations, people built amazing products. But at the end, when I asked, does it work, do you know that it's working, because they would show one or two examples, how do you know it's working? How do you validate it? Do you have any KPIs? The answer was always, "Oh, those will be our next steps." It's interesting, because people are very focused on getting the AI to work, and they think, oh, later on we'll figure out if this actually adds value, when we could be finding value right away for our customers. And I think that is the area of ML Observability and LLM Observability that Arize is really focused on.
Jon Krohn: 01:12:26
Nice. Very cool. And I guess I should have done this right at the very beginning, now that you mention the pronunciation thing, but Arize is spelled A-R-I-Z-E ("zed", or "zee" if you're in the United States). Awesome. Amber, Xander, this has been fabulous. I've learned a ton, and now our audience is very much aware of ML Observability and its importance, if they didn't know about it before. Before I let you go, I ask all of my guests for a book recommendation. So, Amber, maybe you want to go first?
Amber Roberts: 01:12:59
Yes. I recommend Designing Machine Learning Systems. It's an O'Reilly book written by Chip Huyen, and I believe this is the one where our Co-Founder Aparna wrote either a chapter or part of a chapter on ML monitoring and how it can be put into place.
Jon Krohn: 01:13:17
Nice. Yeah, Chip was actually on the show recently, episode number 661, and the title of that book was the title of her episode. Really important topic, super popular book, and super popular woman.
Amber Roberts: 01:13:30
Oh yeah, Chip's great. She's also super nice. She'll be at a lot of these events; I recommend everyone say hi to her because she's a really great person to know in the space.
Jon Krohn: 01:13:41
For sure. All right, Xander, what have you got for us?
Xander Song: 01:13:43
I'm currently working through a book that was recommended to me by a tech lead on Phoenix, called Measure What Matters. It's actually a book about OKRs. We use OKRs pretty aggressively on the Phoenix team, and it's been really interesting to hear how some of the largest companies in the world and some of the most successful startups drove their growth and honed their priorities using that particular system. So, I'd recommend that one, Measure What Matters.
Jon Krohn: 01:14:26
Cool, great recommendation. All right, thanks to both of you, very knowledgeable speakers. You clearly know what you're doing, and I'm sure there are lots of listeners who would like to continue to learn from you after the episode. Xander, what's the best way people can follow you afterward?
Xander Song: 01:14:46
I'm not on Twitter at the moment, although I probably should get on Twitter. But for right now I would say LinkedIn, hit me up there.
Amber Roberts: 01:14:53
I do have a Twitter, astronomyamber, but I think LinkedIn is better. And if you have direct questions, you can join the Arize Community Slack and just Slack me in that community. I'm talking with community members every day.
Xander Song: 01:15:06
Me too. So, I would second that point. Yeah.
Jon Krohn: 01:15:10
Perfect. We'll be sure to include all of those links, your social media profiles as well as the Arize Slack channel, in the show notes. All right, awesome. Thanks very much, guys. And yeah, we'll have to catch up again sometime to see how your ML Observability and job-title journeys are progressing.
Amber Roberts: 01:15:31
Awesome. Thanks Jon.
Xander Song: 01:15:33
Thank you Jon.
Jon Krohn: 01:15:39
Nice. Thanks to Amber and Xander for that highly informative discussion. In today's episode, they filled us in on how ML Observability builds on ML monitoring to catch and fix production issues before they become a big deal. The various types of drift, for example feature drift, label drift, and model drift, can be tracked by comparing baseline probability distributions with production distributions, or, in the case of unstructured data such as with LLMs, by comparing embeddings. They talked about how Arize's open-source Phoenix library provides sophisticated tools for comparing embeddings, allowing us to compare LLM performance during training versus production. This allows us to monitor for drift as well as many other comparative use cases. And they talked about how the comparison of natural language embeddings allows us to check whether sensitive groups are being treated differently by a model, thereby flagging where unwanted bias may be occurring.
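For readers who want to see what that baseline-versus-production comparison looks like in practice on structured data, here is a minimal sketch using the Population Stability Index on a single numeric feature; the bin count, the synthetic data, and the usual 0.25 rule of thumb are general conventions, not anything specific to Arize:

# A minimal sketch of structured-data drift detection with the Population
# Stability Index (PSI), comparing a training baseline against production.
import numpy as np

def psi(baseline: np.ndarray, production: np.ndarray, bins: int = 10) -> float:
    # Bin edges come from the baseline so both windows are bucketed identically.
    edges = np.histogram_bin_edges(baseline, bins=bins)
    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    prod_pct = np.histogram(production, bins=edges)[0] / len(production)
    # Guard against log(0) on empty buckets.
    base_pct = np.clip(base_pct, 1e-6, None)
    prod_pct = np.clip(prod_pct, 1e-6, None)
    return float(np.sum((prod_pct - base_pct) * np.log(prod_pct / base_pct)))

baseline = np.random.normal(0, 1, 10_000)     # a feature as seen at training time
production = np.random.normal(0.3, 1, 2_000)  # the same feature this week
score = psi(baseline, production)
print(f"PSI = {score:.3f}")  # common rule of thumb: above ~0.25 is often treated as major drift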
01:16:29
As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Amber and Xander's social media profiles, as well as my own social media profiles, at www.superdatascience.com/689. That's www.superdatascience.com/689. If you live in the New York area and would like to engage with me in person, on July 14th I'll be filming a SuperDataScience episode live on stage at the New York R Conference. My guest will be Chris Wiggins, who is Chief Data Scientist at the New York Times as well as a faculty member at Columbia University. So, not only can we meet and enjoy a beer together, but you can also participate in a live episode of this podcast by asking Chris Wiggins your burning questions.
01:17:11
All right, thanks to my colleagues at Nebula for supporting me while I create content like this SuperDataScience episode for you. And thanks of course to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the SuperDataScience team for producing another exceptionally practical episode for us today. For enabling that super team to create this free podcast for you we are deeply grateful to our sponsors. Please consider supporting the show by checking out our sponsors’ links, which you can find in the show notes. Finally, thanks of course to you for listening all the way to the very end of the show. I hope I can continue to make episodes you enjoy for many years to come. Well, until next time, my friend, keep on rocking it out there and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon.