SDS 655: AI ROI: How to get a profitable return on an AI-project investment

Podcast Guest: Keith McCormick

February 21, 2023

Pandata’s Data Scientist in Residence Keith McCormick advocates for keeping it simple in data science. In this episode, Keith tells host Jon Krohn why simpler techniques should always be at the forefront of a data scientist’s mind when prototyping, arguing that data science applications should be understandable to everyone in the company. Keith and Jon also discuss how these measures protect data scientists themselves, and how they can prove their value on any data science project.

Thanks to our Sponsors: 
Interested in sponsoring a SuperDataScience Podcast episode? Visit JonKrohn.com/podcast for sponsorship information.
About Keith McCormick
Keith is a consultant, trainer, speaker, and author of seven books. His consulting specializes in helping analytics leaders build and manage their data science teams. His training, including 20 LinkedIn Learning courses and frequent conference workshops, has reached hundreds of thousands of individuals learning statistics, machine learning, and data science. He currently serves as Pandata’s Data Scientist in Residence.
Overview
An overly complex technique for exploring data can kill a business project outright. Keith McCormick believes that adding complexity is risky for data scientists because it inevitably excludes people from a project. With fewer people capable of understanding the work, such projects naturally take longer to complete, putting pressure on the company’s budget.
Keith urges data scientists to remember that the business world is not a Kaggle competition: business solutions must be viable, achievable and easy to understand. Data scientists must therefore spend time thinking about the models they produce. They must also be aware of what Keith terms the “human experience” of using a model, and build with that experience in mind. Because human psychology plays a large part in the makeup of a data science project, speed and ease of use will sometimes matter more than accuracy. Keith notes that in the real world such trade-offs are necessary, and judgment about which route to take should always be part of a data scientist’s toolkit.
While discussing how education fails to prepare data scientists for the world of work, Keith argues that learning complicated mathematics such as linear algebra will ultimately be less beneficial to students looking to get ahead in data science. He emphasizes the ultimate skill for data scientists instead: understanding how to draw conclusions from data in a way that helps improve the bottom line of a business.
Keith serves up some hard truths in this episode! He and Jon Krohn discuss how “insights” can never be the end product of a data science project, how to ensure you have a specific goal at the start of a project related to revenue, and why there is so much miscommunication between client and data scientist. Exclude the C-suite at your peril!
 
In this episode you will learn: 
  • What an Executive Data Scientist in Residence is [05:27]
  • What A.I. transparency is and how it relates to the field of Explainable A.I. (XAI) [17:34]
  • How companies can ensure they profit from AI projects [36:47]
  • Possible organizational structures for data science teams to be profitable [1:02:41]
  • The current gaps in data science education [1:09:58]
 
Podcast Transcript

Jon Krohn: 00:00:00

This is episode number 655 with Keith McCormick, prolific data science educator and executive data scientist in residence at Pandata. Today’s episode is brought to you by Glean.io, the platform for data insights, fast. 
00:00:14
Welcome to the SuperDataScience Podcast, the most listened-to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today. And now let’s make the complex simple. 
00:00:50
Welcome back to the SuperDataScience Podcast. Today I’m delighted to be joined by the oh-so-very-wise Keith McCormick on the show. Keith is executive data scientist in residence at Pandata, a consulting firm focused on transparent, human-centered AI. He’s also a predictive analytics instructor at the University of California, Irvine. He’s the creator of 20 LinkedIn Learning courses on machine learning and AI with hundreds of thousands of students in aggregate. And on top of all that, he’s the author of four statistics books with a recurring focus on doing stats with a software application called SPSS Modeler. Today’s episode should appeal to anyone who’s eager to get a return on an investment in an AI project, no matter whether you have a technical or non-technical background. In today’s episode, Keith details his straightforward approach to ensuring AI projects are successful, how AI projects need to be set up and managed in order to get a profitable return on the project, the corporate roles that need to be in place in order for a data science team to complete projects that drive value, what AI transparency is and how it relates to the field of explainable AI, and how data scientists who have advanced software-writing skills could benefit from the use of low-code, no-code tools. All right, you ready for this practical, information-rich episode? Let’s go. 
00:02:13
Keith, welcome back to the SuperDataScience Podcast. Where in the world are you calling in from? 
Keith McCormick: 00:02:19
Thanks. I’ve been looking forward to this. I’m home, been home more lately, which is lovely. I’m in the Raleigh Durham area of North Carolina, so not too far from Research Triangle Park and all that cool stuff. 
Jon Krohn: 00:02:34
Nice. That’s a pretty good region weather-wise, year-round, I guess, eh? 
Keith McCormick: 00:02:38
It is. Yeah. No, definitely a temperate zone. So spring and fall are fabulous here. 
Jon Krohn: 00:02:44
And for our listeners listening to this as opposed to watching this, which is most of you: if you wanna see perhaps the most magnificent background of any guest that we’ve ever had on the show, I think it could be Keith’s. It is definitely one of the top ones. It is magnificent. It is so magnificent, in fact, that I spent a while before we were recording trying to figure out if this was a Zoom background or not. And it is not. But yeah, it’s beautiful. Lots of rich woods and leather-bound books and beautiful artwork. Looks like a beautiful home there in North Carolina. 
Keith McCormick: 00:03:25
Yeah, I appreciate it. Well, because this is where I do webcam stuff, it’s probably the room that I keep the nicest. But it really started before shutdown. This was kind of a dark room, so I kept nice stuff in here to keep it out of the sunlight. That’s really how it started. And then all of a sudden shutdown happened. 
Jon Krohn: 00:03:47
Yeah. Yeah. The Covid lockdowns. Well, those are long gone. And actually, thank goodness for us, because with the Covid lockdowns over, you and I met in San Francisco at ODSC West 2022, which was in the Northern Hemisphere autumn in 2022. And that was my first time back at ODSC West and my first time back in California since the pandemic had happened. It was really nice to be there. I was able to catch up with tons of people who had been guests on the show. Actually, some of the most popular episodes of all time, those guests were all there together. So people like Ben Taylor, Sadie St. Lawrence, Matt Harrison, I’m missing others, great ones. Serg Masís, of course, whom I had met before in person. And then I also had the opportunity to meet people like you that I hadn’t met before.
00:04:43
It was really a treat. And we did something special. ODSC provisioned a little quiet room for us to record a short episode. So in episode number 628, you did a shorter Friday episode on the critical human element in AI. And so that’s a cool one to check out if you just want to hear quickly from Keith about how, if you don’t take into account the human that’s gonna be in the loop with the AI system, your AI system is sure to fail. And that is something actually that we’re gonna talk about in this episode. We’re gonna dig deeper into this idea of how to make AI projects successful. But first, let’s talk about your day-to-day role. So you are executive data scientist in residence at Pandata. And so Pandata is a data science consultancy. And before I met you, I had never heard of an executive data scientist in residence. So can you elaborate, Keith, on what that is? 
Keith McCormick: 00:05:44
Before Cal and I chose that title, I don’t think I’d heard of this particular combination of titles before. And a fun fact, we did the Five-Minute Friday at ODSC, which was near the very beginning of my time at Pandata. So to put Pandata into context, I’ll just kind of tell you the phases of my career. So, I started out as a software trainer for SPSS, and then IBM acquired SPSS. So for quite a few years, if I was sent out to do a consult, it’s because I was the guy that SPSS, or IBM, sent. And, you know, for data scientists that are trying to get established and do a little bit of personal branding, that can be a limitation, right? 
00:06:33
Cuz eventually you want people to find out what you can bring to the table. So for years I was a solopreneur, and Cal and I met not at the ODSC where you and I met, but at ODSC East some months earlier. And at this stage in my career, my favorite things are mentoring other data scientists, designing solutions, and I really enjoy the business development side too. You know, going out to conferences and so on. So a difference between how my career has progressed and Cal, who’s the CEO of Pandata, is that he’s very good at building a team, but building a team is hard. You’ve got accounting and logistics and staff and all those kinds of things. It’s not something that I aspire to. So with Pandata, what it’s allowed me to do is contribute the parts of the machine learning project that I think I’m best able to do and that, frankly, I best enjoy.
00:07:37
And it dovetails beautifully with what Pandata does because they have some wonderfully talented data scientists. But, you know, something magical happens about 10 years into your career, or after you’ve done about a dozen projects or so: you start to be able to look around corners and those kinds of things. So that’s the whole idea of being a data scientist in residence. It’s a part-time role for one, cuz I’ve got other irons in the fire, like LinkedIn Learning, but it’s a combination of mentoring, solution design, and going to conferences and getting the Pandata brand out there. All stuff that I really enjoy. But then when the project is more complicated and you need two or three data scientists, I couldn’t take on those projects as a solopreneur. I used to just kind of hand them over to colleagues that I trusted. But at Pandata I can take on projects like that because we have a whole team. 
Jon Krohn: 00:08:32
Nice. That was a great explanation. Yes. A solopreneur, the person that I’ve heard use that the most is Noah Gift, who was one of my first guests when I started hosting the SuperDataScience Podcast. So he was in episode number 467, and yeah, he has kind of a similar profile to you where he does a lot of university instruction and curriculum development, which you do. And he does do some consulting. It’s cool to see that you’ve now developed beyond that. But to see this kind of evolution where now you’re able to kind of get the best of both worlds, it sounds like: you get the independence and the thought leadership that you enjoyed as a solopreneur, but now you also have a dedicated group of talented data scientists and other kinds of technical people, and probably non-technical support people, who altogether allow you to broaden your impact into kinds of projects that you previously wouldn’t have been able to tackle.
Keith McCormick: 00:09:39
Yeah, no, I think you’ve framed it exactly right. I mean, take what somebody at a startup has to do, and we both have lots of startup founders in our circle of friends, right? You’ve got fundraising, recruitment, accounting, all kinds of stuff, which I just never aspired to. You know, it wasn’t the part of it that I wanted to do. But Noah Gift is an interesting example of this, actually. I know him by reputation only. I’ve never met him. I believe his specialty is MLOps. Is it not? Am I remembering that correctly? 
Jon Krohn: 00:10:18
That is, he does a lot on like AWS, GCP, that kind of stuff for sure. Yeah. But yeah, he’s got a broad range, he does machine learning stuff. He teaches graduate machine learning courses at universities. 
Keith McCormick: 00:10:32
And, again, I haven’t met him, but I imagine his motivations were very similar to mine in that if there is a pretty intense project coming along, you have to like stop the world, which is hard when you’ve got the book authoring and the teaching and the conference speaking and all these other things going on. But the other limitation of being a solopreneur, and this is why I feel so fortunate that I met Cal and joined Pandata, is that I can come in, in kind of an architect-type role, when I’m a solo external resource to support a team. And I’ve had lots of very interesting gigs over the years where that worked. But it assumes that the client side has quite a bit of bandwidth and quite a bit of talent to work with me, so that I’m doing design and mentoring, but that they’re doing much of the execution.
00:11:33
That’s not always the case. Sometimes there is no data science team and the client really needs you to do that first project while they’re trying to figure out even what their team composition will ultimately be. And again, those gigs I had to just walk away from. But Pandata can support those. Pandata, frankly, can just support a wider variety of projects than I could. So what could be better? You know, I think I’ve got another 15-plus years doing this in me, but I’m definitely in the last third of my career, and what could be better than contributing where I think I’m most valuable, but also the parts of the process that I think are the most fun? You know, because being the main execution lead is all but impossible for me. Cuz that means that for weeks or months, that’s all I’m doing, and I can’t stop the world that long at this point. 
Jon Krohn: 00:12:37
Instead of actually finding useful insights from your data, are you spending more time setting up your BI tool and answering repetitive questions from end users? Well, it’s time for you to adopt a modern data visualization tool. As detailed in episode number 653 with founder Carlos Aguilar, Glean.io is the lightweight BI solution that empowers teams to find insights within 10 minutes. With Glean.io, you can standardize metrics from your data warehouse and provide your team with explorable data visualizations right out of the gate. Start building a collaborative data culture today, visit Glean.io to request access. 
00:13:19
Cool. And so then, you know, now we have a good sense of what you do at Pandata. What kinds of projects does Pandata specialize in? So I know that there’s kind of this AI responsibility and transparency element to Pandata’s work. So like, maybe you could fill us in a bit more on that. Maybe, if it doesn’t step on any kind of intellectual property or privacy issues, you could go into a couple of case studies of projects that you’ve taken on at Pandata. 
Keith McCormick: 00:13:49
Well, you know, I’ll probably talk more in general terms. In the early days of Pandata, and I’ve been with Pandata, well, I’ve been working with Cal just about a year now (when we announced the role was a little bit later in the relationship), some of the earliest key clients for Pandata were in healthcare analytics, and of course, those are areas where transparency is required. But one of the reasons that I knew that Pandata was gonna be a good fit for me was that I believe philosophically, it’s kind of an ethical thing for me, that model transparency is something that’s important even when regulation doesn’t force you to have it. You know, it’s interesting, his name will probably come to me, but the developer of LIME, you know, one of the XAI techniques, he developed LIME because, I think it happened to be a computer vision thing, but it was performing well in his train and test partitions but totally not working when he went to deploy it. 
00:14:53
And, you know, sometimes I think we have a little bit too much faith in holdout sampling, because there are times when it can fail. And what was happening is the model was picking up an artifact of his data prep, which was polluting both the train and the test samples. But anyway, long story short is no one was forcing him to do explainable AI, but he needed it to debug, because if the model was a black box, he couldn’t figure out what the problem was and get rid of it. Okay, so Pandata may have started with client situations where regulation or other reasons forced their hand and they had to have a transparent solution.
00:16:00
But I believe, and I think certainly Cal and really everybody at Pandata believes, that it’s always desirable to have transparency even when you’re not required to. But now here we are in the current timeframe, and right on the horizon, the EU is working on an AI law that may require transparency. And all the details of the law aren’t clear yet because it hasn’t been passed yet. But this kind of regulation is probably coming. So what’s happened, just like a lot of things in one’s career in life, is that something that may have started with a couple of client situations starts to become a whole approach and a specialty of, you know, really trying to have this clarity and transparency, not just for regulatory reasons, but also ethics, and because it’s just good practice.
Jon Krohn: 00:16:54
Yeah. And so, explainable AI, fascinating area. One of the most popular episodes of 2022 was with Serg Masís, who is our researcher on this show. And I mean, hugely grateful to him. For regular listeners, you will hear his name all the time, but he does research on our guests. So, like, for today’s episode, a huge chunk, all of the best questions that you hear me ask Keith, they probably were suggested by Serg. So he did an amazing episode on explainable AI. He’s an expert in that space. He wrote a book on it. And yeah, SuperDataScience episode number 539. So you can check that out. But for you, Keith, I would love to know, when you talk about being able to develop transparent models, what does that mean? I mean, so like, you know, one obvious thing to me is I’m like, okay, well we could limit ourselves to regression models where we can see the beta weights and we can say, you know, this independent variable is contributing exactly this much to the result. 
00:17:58
Cuz we know exactly what the beta weight is and it doesn’t interact with the other input variables. So that’s like one way of doing it. But then we’re pretty limited in the approaches we can have. So you mentioned LIME there. So LIME is a tool alongside SHAP. They’re probably the two most popular tools for doing explainable AI. Is that enough? Is using these kinds of tools post hoc, after the model’s been developed, to have some sense of how inputs relate to outputs, even if we can’t get down to an individual weight level? Yeah. What do we need, I guess, in the healthcare space specifically? What do you need to show to be able to say we’re transparent? And yeah, what are you happy with at Pandata, in any industry? 
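To make the first of Jon’s two routes concrete, here is a minimal, hypothetical sketch in Python (assuming scikit-learn; the feature names and data are invented for illustration) of reading the beta weights straight off an inherently interpretable regression:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))  # hypothetical inputs: age, dose, prior visits
y = 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = LinearRegression().fit(X, y)
for name, beta in zip(["age", "dose", "prior_visits"], model.coef_):
    # each beta weight is that variable's direct, additive contribution to the prediction
    print(f"{name}: beta = {beta:+.2f}")

For a black-box model there is no weight to read off like this; a post hoc tool such as LIME or SHAP would have to be layered on top instead, which is exactly the distinction Jon is asking Keith about.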
Keith McCormick: 00:18:41
Well, it’s a tricky issue. And what I’m gonna speculate about here isn’t so much a Pandata answer as much as what I’ve come up with myself exploring the same kinds of things that I’m sure that Serg was talking about in that episode, because I prepared an XAI course. In fact, I’m losing track of my calendar here, but I think I’m giving a half-day XAI workshop in less than two weeks, or just about two weeks or something like that. I’m speaking at TDWI in Las Vegas in February. So the way that I usually describe it to folks is that that first scenario that you were describing, where you have something that’s inherently interpretable, is what some folks, and I like to use this phrase, call interpretable machine learning. 
Jon Krohn: 00:19:43
Right? So for me, the regression model, the regression model to you is interpretable machine learning. Yeah? 
Keith McCormick: 00:19:49
Yeah, I would agree. And then I would include in that, for instance, a single decision tree. So I think part of it is not being shy about using these simpler techniques, in two senses. One is, I think you should always, always use a simpler technique when you’re prototyping. It drives me crazy when a modeler will go right to a complex technique, maybe because they did a tournament of algorithms in AutoML or something like that. They probably had some kind of justification for doing it, but they’ll go right to the more complicated technique. And then if I ask them, how much additional accuracy did you gain by going more complicated, I don’t get an answer. That frustrates me. I just think that’s bad practice.
00:20:43
So I treat the IML solution as a pace car, so to speak, right? And then I go to see, if I go more complicated, what’s the gain? One of the reasons I think this is important, even for analytics leadership, you know, somebody like an analytics director or something like that, and they may have gotten that role because they got promoted out of BI or IT, so they might possibly be managing data scientists but not be a data scientist themselves. This kind of thing happens all the time. I always coach them to do that, because with that additional gain in accuracy, it’s not just that you’re gonna lose transparency, but you’re probably gonna be forced to do XAI on top of it. So you’ve just lengthened the project, so there’s actual, you know, dollars involved now, where the project has just become more expensive. So I reserve explainable AI not for the interpretable piece, but for when you have to add that explainability layer on top of a complex model. 
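As one hedged illustration of this “pace car” idea (a sketch only, using scikit-learn and a stock demo dataset rather than anything from Pandata), you could fit the interpretable model first and then ask what the added complexity actually buys:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# the interpretable "pace car": a single, shallow decision tree
pace_car = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# the more complex challenger: a boosted ensemble
challenger = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

acc_simple = accuracy_score(y_te, pace_car.predict(X_te))
acc_complex = accuracy_score(y_te, challenger.predict(X_te))
print(f"single tree: {acc_simple:.3f}   boosted ensemble: {acc_complex:.3f}")
print(f"accuracy gained by going more complicated: {acc_complex - acc_simple:+.3f}")

If that gain is tiny, the simpler, transparent model wins by default; if it is large, at least the decision to add complexity (and an XAI layer on top of it) is made with a number attached.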
Jon Krohn: 00:21:50
Yeah, yeah, yeah. And to your point of expense, those more complex models are probably going to cost more to train and they’re going to cost more to run at inference time as well. So there are other reasons why we might wanna have a simpler model, like a regression model or a decision tree. And in terms of winning a Kaggle competition or whatever, you know, there might be this tenth of a percent improvement in accuracy or AUC, and okay, cool, that wins you the tournament. But in practice that doesn’t change the human’s experience of using this model. Nobody’s going to notice that difference. But they will notice if it takes several seconds for the model to output something to the screen, you know, in a production web interface or something. Your end user of this model, if they have to wait, you could have been using a simpler model that was a tiny little bit less accurate, negligibly less accurate, but gives you results instantly, and it is cheaper to run.
Keith McCormick: 00:23:02
Yeah. Well, well said. First, thank you for making the distinction between training time and scoring time, because I find that folks forget about that, because maybe you can tolerate a long-ish training time if it’s gonna be done infrequently. But that scoring time might be critical. And what, you’re gonna wait until you get to the MLOps phase, and you’re gonna have your machine learning engineers, or whoever’s in charge of scalability issues, you’re gonna dump that problem on them? You have to be thoughtful about that early on. I remember being on a gig, it happened to be an IBM gig, and scalability was an issue, and I don’t remember the details, but it was something like, you know, Keith, your model’s taking like seven or eight milliseconds per record at score time, we’d love to get it down to four, how much accuracy will we lose if we go from 40 variables to 30, or 40 variables to 25, and so on. So in other words, I was having a constant dialogue with the person who was gonna put my model into production. But what was the other thing that I was gonna say about that? Yeah, so about score time, but then also, oh, yeah, the trade-off. So what I think is so important for us to remember on the technical side, on the data science side, although, you know, people are technical in different ways. 
00:24:48
Like, I don’t think of myself as being very savvy on the data engineering side, for instance. You know, there are all kinds of different technical. But for those of us that are technical on the data science side, I don’t think it’s our decision whether or not we go for that extra one-tenth of a point of accuracy. I think it’s a business decision. I mean, I really do. I hope that most people agree with me, but I think what happens is the fear is that the senior leadership or even analytics leadership won’t fully understand all the details. Everybody’s always talking about how they don’t trust their boss to go into the weeds with them. But that is really something that they have to decide. We don’t get to decide that. So as an external resource, or as a consultant, you know, at Pandata, that’s a client decision, and that’s a different kind of transparency. That’s not necessarily model transparency, but it’s a kind of process transparency that we would take very seriously on a consult. There are some consultants that just want to take the data, kind of disappear, and then come back with the model. And if that’s the case, they’re not giving the client the opportunity to decide between the more expensive model that’s a tenth of a percent more accurate. 
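A rough sketch of the scoring-time dialogue Keith recalls might look like the following (synthetic data; the 40/30/25 feature counts are taken from his anecdote, and naively keeping the first n columns stands in for real feature selection). The point is simply to put a milliseconds-per-record number and an accuracy number side by side so the business can make the trade-off:

import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=40, n_informative=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for n_feat in (40, 30, 25):  # candidate feature counts from the conversation
    # (naively keeping the first n_feat columns; real feature selection would be more careful)
    model = RandomForestClassifier(random_state=0).fit(X_tr[:, :n_feat], y_tr)
    start = time.perf_counter()
    preds = model.predict(X_te[:, :n_feat])
    ms_per_record = (time.perf_counter() - start) / len(X_te) * 1000
    acc = accuracy_score(y_te, preds)
    print(f"{n_feat} features: {ms_per_record:.3f} ms/record at score time, accuracy {acc:.3f}")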
Jon Krohn: 00:26:12
Yep. Yeah. You know, we’ve had a lot of guests on the show talk about this problem of data scientists being too siloed and not kind of working with the business iteratively to see what’s best. So yeah, thank you for highlighting that. There’s an interesting kind of nomenclature point of difference between you and Serg. So XAI, this explainable AI area, is in its infancy in a lot of ways. There aren’t a huge number of tools or a huge number of books out there today. I know that there will be in five years, and even more in 10 years. And it’s interesting to hear, so you define interpretable machine learning as models, like a regression model or a single decision tree, where it’s very straightforwardly interpretable. And it’s interesting, so Serg’s book is called Interpretable Machine Learning, and it’s all about XAI techniques like using LIME and SHAP. So it’s interesting. As the field develops we’ll probably start to have more standardized ways of describing these things. I totally get the way both of you are doing it, even though it’s different. Both ways make sense to me. 
Keith McCormick: 00:27:28
Yeah. And I’m glad you raised that issue, because when I give talks on the subject, I try to be cautious about that. I adopted the language that I did because I was influenced by two sources. The Association for Computing Machinery did a cover story on this. And I said, well, you know, if I try to use the terms the way the ACM is doing it, then why not? I’m just gonna kind of stick with that. Right? But there really isn’t a consensus. And somebody else that’s been a real influence, by the way, is Cynthia Rudin, who’s at Duke. I’ve never met her, but I’m a huge fan. And she thinks that, generally speaking, XAI is a mistake in that it’s never more than an approximation of the black box model anyway, right? Now, it might be a darn good approximation, but an approximation nonetheless. Right? So she thinks that it’s not really truthful to say this XAI is your model. You gotta be clear to the client, to the decision maker, that it’s not. Also, she thinks that we exaggerate this notion that you have to be more complex to be accurate. So she’s tried to develop algorithms where there’s a very, very thorough search of the problem space, but it results in one tree, not an ensemble. 
Jon Krohn: 00:29:07
Right. 
Keith McCormick: 00:29:08
And the reason we do random forests and other things is that trees are greedy, so they tend to optimize locally, and we don’t get a good overall optimization. So she says, well, that’s true. You don’t have to do CART like it’s 1984. Go ahead and leverage your GPU, leverage your fast contemporary computers, but try to have one tree pop out at the other end, not a thousand. 
Jon Krohn: 00:29:33
Right. 
Keith McCormick: 00:29:34
I’m not an expert, certainly, on her approach, but she’s been a real big influence. So whenever I present XAI, I don’t take as strong a position as she does. I think we have to have this skillset, we have to do this, but it’s this constant kind of cautionary reminder that maybe there are times that we go complicated when we don’t have to. 
Jon Krohn: 00:29:59
For sure. As we often discuss on air with guests, deep learning is the specific technique behind nearly all of the latest AI and machine learning capabilities. If you’ve been eager to learn exactly how deep learning works, my book, Deep Learning Illustrated, is the perfect place to start. Physical copies of Deep Learning Illustrated are available in seven languages, but you can also access it digitally via the O’Reilly learning platform. Within O’Reilly, you’ll find not only my book, but also more than 18 hours of corresponding video tutorials, if video is your preferred mode of learning. If you don’t already have access to O’Reilly via your employer or school, you can use our code SDSPOD23 to get a free 30-day trial. That’s SDSPOD23. We’ve got a link in the show notes. 
00:30:53
Now, what you’re saying there about not having to go complicated makes perfect sense to me, particularly when we’re working with structured data. I can believe that there’s a simpler model. But, you know, you’re nodding your head, so you gotta get where I’m going with this: when there’s situations like raw images or raw video or natural language, then it gets trickier. And it seems like, you know, the deep learning models that are very difficult to explain, they are able to handle all of the nuance and the pattern recognition in those big unstructured media files.
Keith McCormick: 00:31:24
Yeah. And you came up with exactly the same list that I would: video, image, audio, natural language processing. Even if it’s text, trying to do that finish-the-sentence or, you know, text-to-image and all this kind of stuff, that’s where not only does deep learning shine, but where five or 10 years ago we had made so little progress.
Jon Krohn: 00:31:50
Right, yeah. 
Keith McCormick: 00:31:51
But I’m a skeptic when it comes to deep learning on structured data. Maybe it’s because it hasn’t happened yet, or I haven’t been convinced yet. I’m not sure, but I, I haven’t seen anything that convinces me yet that I should be doing loan default with deep learning.
Jon Krohn: 00:32:13
Yep. And myself personally, as well as dozens of deep learning students that I’ve had over the years (so I have taught deep learning courses online or in person for many years now), you get these people that come from finance and are like, I’m gonna build the best stock prediction model ever using structured data, or, like you’re saying, loan default prediction or any of these kinds of things where you just have discrete independent variables as opposed to unstructured media files as inputs. And yeah, they don’t outperform the, you know, best-practice kind of linear regression model. So yeah. Anyway, I just wanna kind of highlight the difference. So have we answered the question? We’ve gone down a long tangent here, but my initial question was, what does transparency mean anyway? So we know now that, of course, you and I are on the same page: if we have what you call an interpretable model, like a regression model or a single decision tree, obviously that’s transparent. But then are there situations that you run into at Pandata where you are working with unstructured data and deep learning models and you need to be applying XAI, and there’s a point that you get to where you’re like, okay, it is transparent? Or is that just kind of not really in scope for the kinds of work you guys do? 
Keith McCormick: 00:33:33
No, no, no. I think, again, well, for instance, medical imaging is something that Pandata’s done work in. So now the XAI is on the table again, and that’s where, although I’m a huge fan, I’m not prepared to go as far as Cynthia Rudin, because I’m just not sure how you do that. Right. I’m not sure how you always restrict yourself to something that’s inherently interpretable if you’re dealing with something like medical images. But I think, you know, as we’ve been talking about it, and I bet my colleagues at Pandata would agree with this, such an important part of transparency is not just the transparency into the math, it’s the transparency into the process, so that the client can be a full, educated partner in the development of the solution. You know, what I always remind folks of is that the way that you judge a model isn’t that it’s better than some set of business rules that existed before the model or something like that. I mean, this gets into our Five-Minute Friday topic, really, where what you’re doing is you’re entering this new world. There’s the new world that has the model in it. So a proper model evaluation has to compare the entire human-computer collaboration that existed before the model was built and after the model was built; the humans are still part of the equation. So transparency, I think, is transparency into the process, right, so that the client really understands this, because they have to embrace it. 
Jon Krohn: 00:35:20
Cool. That makes a lot of sense. I like that. I like that definition. Yeah. Transparency is having insight into the process. You have some sense of why a result is coming out the way that it does. Cool. All right. So understanding what you’re doing at Pandata was the first topic that I wanted to cover, and we ended up having a really rich conversation there. I loved it. Now I’d like to transition to a topic that I alluded to that we’d be covering in this episode, which is getting a return on investment on AI projects. So for two decades now, you’ve been a practitioner, a trainer, and a speaker in what used to be known as data mining. You don’t hear that as much anymore. Now, I guess that’s just part of machine learning.
00:36:10
And one of the topics that you’ve been addressing lately is ensuring the return on investment in AI projects. So, Keith, how do companies fail to make a profit from an AI project? There’s lots of excitement about AI. Everyone’s like, oh my goodness, ChatGPT. This is gonna change everything, and it will absolutely change everything. But I think, you know, the vast majority of ideas that people have for AI projects don’t end up coming to fruition in a way that lets users actually make use of them as originally conceived. So, yeah. How do you succeed at an AI project? In particular, how does a company make a profit from an AI project? 
Keith McCormick: 00:36:54
Well, to try to tackle the first question, and I’ll run the risk of being a little bit blunt: I think the reason that sometimes companies don’t get a return on investment is that they don’t even try. They don’t even recognize that as the goal. Yeah. So sometimes I get a little bit of gentle pushback when I’m hanging out with data scientists on this, because I think they want to preserve their ability to explore the data, not entirely knowing where that’s gonna take them. And I get that. So, you know, sometimes people refer to the 20% time that I think Google popularized. I totally get that. But when I talk about how projects should always have ROI, I mean, you’re officially starting the project. There’s a kickoff. There is somebody that’s designing the solution. 
00:37:58
There’s probably somebody in a data steward role, there’s an internal customer, there’s a sponsor, the whole nine yards, right? If you’re gonna jump in and you’re talking hundreds of thousands of dollars of salary hours and external expense, or software expense or what have you, it is crazy to embark on that without a plan, right? So, I don’t wanna step on anybody’s toes by saying they shouldn’t have a few hours a week where they’re just looking for opportunity. I get it. But if you’re gonna do a real project that’s gonna take months and cost hundreds of thousands of dollars, if not, you know, break that million-dollar barrier, you’ve gotta have some kind of a plan. So I think what happens is that this exploratory thing almost gets turned into, you know, just the long-term thing, that’s all the data science team does. 
00:38:54
Or if the data science team turns into a help desk, that’s the other thing that I think is a real killer. Oh, yeah, management’s just curious about something, right? So I call those insight projects, and sometimes I’ll get a funny look when I give a talk on this, and people look at me like, what’s wrong with insight? I thought insight was like the goal. Well, you know, insight doesn’t pay the bills, because the whole idea of an insight project is the data scientist explores the data. Also, when do you know when exploring is done? You know, does exploring take two weeks? Does exploring take 10 weeks? Does exploring take four months? You know what I mean? So, I struggle with that. So, again, I’m cool with a little bit of time each week where the data scientist is given some freedom to explore a new opportunity. 
00:39:37
I like that idea, actually. But you get my idea, right? Just poke around in the data and try to find something valuable. When did that process really begin and end? How do you even measure how much money you spent on that, right? But then the notion is that the deliverable is some kind of a slide deck that gets presented to management. Management goes, wow, that’s like so cool. They bring that into a C-level meeting or what have you. But it’s just this exchange of ideas. It’s very hard to measure what’s going on, right? So for me, all data mining ever was is what we now call predictive analytics. And even predictive analytics is a little out of fashion. But I know some of the authors of CRISP-DM, which is the Cross-Industry Standard Process for Data Mining, and I’ve always felt like CRISP-DM was a better fit for supervised learning.
00:40:30
And the folks that wrote it said, no, we’re trying to cover unsupervised too. I don’t know if I’ve ever been completely sold on that. But the point is, that’s all it is. It’s making some kind of a prediction. So when I’m sitting down with a client, and this is literally the first hour, I want to know: what are we predicting? And sometimes it’s either not clear or there is no prediction, and we can maybe talk about that. But I want to know what we’re predicting, what tangible, measurable benefit they get by knowing that prediction in advance, and then what the heck they’re gonna do with it once they know, right? So I remember having a long conversation on, it was actually a government gig. So it was about federal funding and it was about project management and so on, right? 
00:41:25
So they wanted to know, could we predict in advance what projects would go over, you know, over schedule, over cost. So I said, well, great, if you knew that about a project, it’s an 18-month project, big expensive government project, and you knew at the six-month mark that it was likely to go beyond 18 months, what action were you gonna take? And Jon, it took 10 minutes to get a straight answer to this question, because they were like, well, “What do you mean?” I said, you know, you’ve now learned that this is gonna happen a year from now. How is that measurably valuable to you? They say, well, you know, if they go over, they just get extended, they’re important projects. Once they’re approved, we have to finish them. But it would be nice to know that it’s gonna be late. You’re not gonna accelerate it so it ends on time. You’re not gonna, there was no action. And you’d be surprised how many times conversations like that fizzle out.
00:42:35
So what I urge clients to do, and this is the first step, I think, in ensuring that this happens, is to have more than one project in an analytics portfolio, so that you’re constantly vetting them and deciding not only which projects are worth doing, but which projects should be done in the current calendar year. Cuz what is a data science team supposed to do if this is an extremely top-down process and they’re not allowed to kick the tires and question the project? Then the head of the data science team, if they say no to a project, they then have no current work. What should really happen is there should be three or six or a dozen projects that are important to management, and the data science team should get a chance to kick the tires, to estimate ROI, and say, you gave us this half dozen; we recommend as the data science team that four out of these six are a good fit for our team, and these two out of the six we should do immediately because they really have the potential for good ROI. 
Jon Krohn: 00:43:46
Yeah. As the chief data scientist and the co-founder of a machine learning company called Nebula, I am involved directly in this kind of process, so we prioritize projects. So there’s dozens of projects that, you know, if we had an infinite number of resources, an infinite number of data scientists, if we had dozens of data scientists, then they could each be tackling one of these projects. But we don’t have dozens of data scientists. And so I liaise between the executive team and my data scientists. It’s a bit of a chicken and egg, but the end state is clear. So you work with the data scientists to estimate how much work a project might take. And some projects are relatively discrete, where you can say, you know, there’s very few unknowns in being able to deliver that to you. And so, you know, we can estimate that it’s about two weeks of work full-time for one data scientist, or a month, or whatever. 
00:44:56
Like many things in computer science, we do our estimates in powers of two. So it’s like, it’s a one-week project. It’s a two-week project, four weeks, eight weeks. Yes. But then there’s other kinds of projects that are more like the exploratory ones that you’re describing, where we’re like, okay, we just got access to dozens of new API endpoints from this new vendor that we’re working with. We are pretty sure that some of them are gonna be really useful to us, but there’s also lots of other ones there that we should look into. And that falls into the bucket of the thing that you’re saying, like, you could explore this forever. Like, what other things could we be doing with our data or blending with our data? So for those kinds of projects, or projects where we might have some production machine learning model where we’re like, oh, you know, transformer architectures, maybe that’s something we should be considering for this natural language model that we already have in production. 
00:46:01
And then it’s like this other open-ended task where, like, you know, we don’t have a transformer model already in production, so there’s gonna be some time learning which kinds of architectures we might wanna work with, getting used to having to use a lot more hardware than we ever did before, and running into roadblocks time and again there. So there end up being these projects where it’s difficult to say when you’ve succeeded, where it’s difficult to know how much time you should invest in advance. Whereas on the other hand, there are projects where you’re like, okay, there are few unknowns and we can make these concise. And so what I like to do, and I’d actually love your feedback on this, getting some free data science consulting here, on air.
00:46:50
But what I end up doing is I like to blend projects for my team. And I’m lucky to have data scientists that are all capable of working independently and executing efficiently. So I can say, with those tasks that are relatively circumscribed, we’re getting those kinds of things done all the time. But then I’m like, you know, spend your morning working on that. Let’s make sure we’re getting somewhere on that, but then spend the afternoon trying to get this big open-ended project off the ground. Let’s understand that better and try to get it to a point where we can define a clear next step and express that clearly to management and say, you know, we have this big idea, it’s gonna take two weeks or a month of someone full-time to explore this fully in order to know whether there’s something here. And then we’ll be able to give you a better estimate after that, on next steps. Anyway, so I’ve been talking for a really long time. It’s your episode. 
Keith McCormick: 00:47:48
Yeah. No, I love the way you’ve kind of positioned that question, because as I’ve been listening to you, what it makes me think of is that there are probably a lot of data science teams that should picture themselves almost like they’re an internal consultancy. You know, imagine that you’re a consultancy like Pandata, for instance, right? You’re inevitably gonna have some non-billable time. It’s just part of getting the job done. So it would be internal meetings, professional development. You’re always just gonna have some non-billable time. But if a consultancy was 90% non-billable, you’ve got problems. So I would think that if you are an internal data science team at a startup, a software company, all these different scenarios, I don’t wanna rob them of that freedom. I mean, data scientists are creative people. That’s why they went into data science. They like to explore and try new ideas, and they want the freedom to take a shot at something that maybe has a low probability of success, and they don’t want management looking over their shoulder like every second. Right? In fact, that’s probably a good way to push a data scientist out the door if it was like way too structured. But if those data scientists think of this kind of like billable and non-billable time, the projects that have known ROI are analogous to billable time. And the projects that they really don’t know if they’re gonna pan out, they’re kind of exploratory, they’re riskier, they can think of that as non-billable time.
00:49:32
And it’s just like, “Hey, let’s not let our non-billable time become the majority of our time,” right? I mean, there’s a reason why, you know, Google came up with this 20% thing, right? And I think that alone kind of solves the problem, right? So I always like to say that predictive analytics teams should be self-funding. They should be profit centers, not cost centers. But I’m totally okay with 20% of their time not being profitable, because we need that time for employee satisfaction and to know what we’re gonna be doing two or three years from now. I really think it’s just keeping it in balance. 
Jon Krohn: 00:50:12
Yeah. And that’s also where the breakthroughs come from. That 20% time, while in the short term it’s a cost center, is where you could have a multiplicative effect in terms of profitability company-wide, if in that 20% time the data science team can come up with something that automates a big part of your platform or delivers a lot of value for your users.
Keith McCormick: 00:50:39
Yeah. If I’m not careful, I’m gonna misquote him, but I’m sure he will understand. So a quick shout-out to Ken Jee, who I’m sure everybody will be familiar with. I saw him give a keynote just about a year ago, and he was talking about, you know, keys to data science success and so on, and one of 
Jon Krohn: 00:51:01
ODSC East? 
Keith McCormick: 00:51:02
Oh, no, no, I don’t, yeah, 
Jon Krohn: 00:51:04
He did a keynote at ODSC East. Yeah. That was a… 
Keith McCormick: 00:51:07
Could it even have been in London, maybe? No, no, it was in Boston. Yeah. ODSC East. Yeah, it probably was. Yep. Yep. No, I remember it was in Boston. But one of the things that he was saying was, don’t pressure the data scientist to always produce, you know, ROI. I think, I’m quite sure, for the reasons that we just said, right? If you limit what they can do too much, you’re not gonna get one of the things that is most valuable about what they do. And we had speculated at some point we would have a conversation, probably recorded, specifically on this one issue. But when I said the gentle pushback that I sometimes hear, that’s really it, right? We have to come up with a way of doing this where the data scientists feel free, yet nonetheless, they have to spend the majority of the time, maybe not a hundred percent, but they have to spend the majority of the time on things that are known to produce value. Otherwise, you can’t keep the lights on.
00:52:10
So in terms of how to know that a particular project is gonna meet that criterion, for me, it’s pretty simple. You know, if you are gonna make a prediction, this is gonna sound, for some people, terribly old school. But I’m gonna take this risk because I feel strongly about this. I think that, generally speaking, we should be trying to predict yes-no questions. Not because we’re always presented with yes-no questions by the business. I know that that’s not the case. I’ll give you an interesting healthcare example in a second. But because when we deploy it and we’re trying to decide whether or not to intervene, that almost always comes in a yes-no form. So, for instance, take a classic use case like preventative maintenance: almost invariably you have to decide whether or not you’re gonna shut this thing down, inspect it, and verify that it needs maintenance.
00:53:10
But this could be a wind turbine. This could be you’re driving out to the middle of nowhere and climbing up those ladders, or you’re sending a drone up maybe to verify it or something like that. Every time you have unscheduled maintenance, it’s gonna be expensive. So it’s crazy to me not to do an estimate. So back to the healthcare example, and then I’ll comment a little bit on how to make this estimate. A classic use case in healthcare analytics is the 30-day readmit. Will a patient be readmitted on the same diagnostic code within 30 days? That’s a simple yes-no question. And then if the answer is yes, they are likely to return with the same diagnosis, then you talk to the subject matter experts, medical doctors among others, and say, how can we prevent this from happening? 
00:54:01
So it’s a simple yes-no. I remember being asked by a data scientist that worked for a healthcare provider once, let’s come up with a model that predicts the most likely next medical code that this patient will suffer from. So in that sense, it’s like a next-best-offer, like Netflix and Amazon and so on, right? But let’s just compare those two scenarios. What’s your chance of being correct out of the gate with the 30-day readmit, assuming the data’s in balance, or you force it to be in balance, you know, something like 50-50? You get the idea. What about guessing which of 15,000 or 60,000? It depends on how granular you let it be with the diagnostic code. And also, what if you do predict that diagnostic code and you’ve built such an amazing model that there’s a 20% chance that you’re right, which really would be remarkable if you’re trying to predict which of 15,000 it is, and there’s a 20% chance that you’ve nailed it, right? 
00:55:05
How does that help? What’s the hospital do with that? It’s not clear, right? I want clarity about what you’re gonna do with it. If I have clear instructions, I pretty much assume, and I know that not all data scientists would agree with me, but I pretty much assume if I have a clear mission statement, I can build the model. A lot of people say, oh, you might not have the data, or the data might not be good enough, or it simply might not be capable of being predicted. Over the years, I haven’t encountered those kinds of situations that often. If I have a clear mission statement, you know why I have faith that the model can be built? If it’s core to the business, if it really is critical to the business, they’re tracking it. And if they’re tracking it, there’s historical data. 
00:55:49
That’s why. So something new or too exploratory or too obscure might not have the data to support it, but something that’s core to the business almost always does, right? So I don’t want to get into a lot of speculation about whether or not we can build the model. If we’re choosing good projects, we can build the model. What I want to know is what they’re gonna use it for. I don’t need to know the interface. I’m not trying to step on anybody’s toes if they’re agile fans, for instance. We’re not trying to overthink the whole thing. We’re just trying to figure out what yes-no question we are trying to predict and why that’s useful. And that brings us to a confusion matrix. It’s really that simple. So I say that somebody is gonna come back within 30 days with the same disease; that’s either a true positive or a false positive.
00:56:40
And in 30 days we will know which. And a subject matter expert at the hospital should be able to identify for me, for each quadrant of the confusion matrix, what the cost is. What if we do the intervention and it wasn’t needed? That has an exact dollar estimate associated with it, or a narrow range. What if we fail to prevent it and the 30-day readmit happens? Just look at the last 12 months of data and you have a darn good estimate of what that costs. So it’s really as simple as trying to solve the problem, at least initially, with something that fits a confusion matrix, and then populate the confusion matrix with a financial estimate for each of the four outcomes. It’s not rocket science. 
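As a back-of-the-envelope sketch of the costing Keith describes (all counts and dollar figures below are invented; in practice the hospital’s subject matter experts would supply them), the ROI estimate is just the confusion matrix weighted by the value of each quadrant:

# hypothetical counts from a validated 30-day readmit model
confusion = {
    "true_positive":  120,   # flagged, and would have been readmitted
    "false_positive":  60,   # flagged, but the intervention wasn't needed
    "false_negative":  40,   # missed, readmit happened anyway
    "true_negative":  780,   # correctly left alone
}

# dollars per outcome, supplied by the business, not by the model
value_per_outcome = {
    "true_positive":   9_000,   # readmission avoided, net of intervention cost
    "false_positive": -1_500,   # unnecessary intervention
    "false_negative": -10_000,  # readmission cost incurred
    "true_negative":       0,
}

roi = sum(confusion[k] * value_per_outcome[k] for k in confusion)
print(f"Estimated net value of deploying the model: ${roi:,}")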
Jon Krohn: 00:57:27
This was awesome. That was so articulately explained, so clearly, and I had never thought of this before. This is kind of your litmus test for a successful AI project: one that has a binary outcome that’s key to the business, so historical data are inevitable. And yeah, anytime we have a binary outcome with a model, we can create that confusion matrix, and we can associate a financial cost with each of those four possible outcomes in the confusion matrix. I love that. That is such a clear system. You know, when we came into this topic and I knew that I was gonna be asking you about ROI, it’s the kind of thing where I feel like people often give hand-waving answers, and you just gave a crystal clear one. 
Keith McCormick: 00:58:31
Well, and you know, some people might say that that’s been oversimplified to the point of being naive. So let’s let that devil’s advocate argument play for a second. Let’s say they come back and say, well, that’s lovely, Keith, but we really want to know why they’re gonna be a 30-day readmit. This almost gets us back to explainable AI, right? Maybe we want a reason code or something like that. I would say, well, fine, you’ve built your 30-day readmit. You can put a good value on that. If you already have that in place and your risk scores are accurate, then why not do something simple like K-nearest neighbors and find out what patient was most like patient Smith, and not just come up with one diagnostic code, but maybe the whole list of diagnostic codes that that closest patient had.
00:59:30
The point is, you’re not going to derive value from the diagnostic code part, but it certainly can be a supplementary part of the solution. And I think you can see part of the problem: if you just go for the diagnostic code, it’s never gonna be accurate, and it’s not as clear how you can deploy it, right? But what you could do is take a model that maybe has only a middling level of performance and combine it with an accurate risk score: hey, here’s our best guess of the cluster of issues that’s gonna present this way. I remember I had a similar idea on an insurance fraud project where we built a really good model that gave a risk score on whether or not a particular claim was fraudulent. That’s where the value was gonna come from, because that’s what had the intervention associated with it. Do we send the investigators out or not? But I said, well, you know, let’s just play around with this a little bit. You know how it is: projects end, and it’s not always easy to renew them, the team’s not available, and so on. But we had this idea at the end, and I always wish we had followed through on it, where it would’ve been cool as an add-on to the project to say, what file that’s already been investigated looks most like this particular accident?
01:00:52
It could even turn out to be the same criminal, right? These were staged accidents, so it was an organized crime thing that we were specifically looking for. But you can do that as a supplementary thing, or you could even do social network analysis. Go ahead and start with the binary, get your money back, which, by the way, for that project, I think was 30 million or something, and that was annually. So you’ve got your money back, you’ve got the goodwill. Now go ahead and supplement it with things like social network analysis, and maybe you can give the investigator a clue as to who might be behind this, because phone numbers are matching or something, you know? But you get the idea. The core of the project has to be something that’s deployment friendly and that’s gonna produce value. That’s not to imply that we’re gonna stop doing unsupervised learning and association rules and other things. It’s just that if you lead with the binary, that’s something that senior management’s gonna be able to understand, and it helps you come up with a budget and it helps you come up with a schedule. 
Jon Krohn: 01:02:02
I love it, Keith. I have learned a lot from you today, and I’m going to be using this in my career. This has been invaluable for me. So earlier on, in the context of ROI, we were talking about how we want data scientists to be a profit center as opposed to a cost center, so that the exploratory, less well-defined cost center stuff is taking up a fraction of the time, say something like the Google 20%, which I don’t think they do anymore, but they did for a long time. So, in your view, is there an ideal org structure for the data science team within an organization in order to ensure that they can be profitable, or that they can be perceived favorably by the executives in the company? 
Keith McCormick: 01:02:57
Yeah, so this is a tough one, because we were talking about a lack of consensus in XAI terminology. Talk about a lack of consensus: who data scientists report to and what the organizational chart should look like, there’s zero consensus, right? In my experience, probably about a third of the time, data scientists report up through IT, and we’ve been candid about a whole bunch of other issues, so I might as well be candid about this too. I don’t think that’s ideal, right? So that’s probably the mode. It’s not the majority, but it’s the most common. And I think you really get that cost center, profit center confusion there, because then you get into this whole, oh, if we spend more on software, then we can spend less on people. I don’t think people say that explicitly, but that’s why people get excited about these tools that citizen data scientists can use and so on.
01:03:57
All that conversation seems to happen over on the IT side. And then the other thing that ends up happening, I think, is sometimes a bit of an obsession, I’m being a little bit strong here, but a bit of an obsession with keeping the number of tools to a minimum. Everybody has to use the same tools, and that’s to minimize tool maintenance and tool licenses. I get it. But all those conversations I hear more on the IT side than I do from data science teams, when data scientists don’t report up through a data science leader. So I’m a fan of having somebody like a CAO, which historically, this is my understanding at least, is mostly in healthcare and quant shops on Wall Street, where you get, you know, a CAO. But I really think you need somebody with C-suite influence, even if they don’t literally have a C in their title, right? And that’s because only somebody with that amount of political capital is gonna be able to ensure that when you’re talking about value, you’re talking about maximum value across the whole enterprise. I just think there has to be somebody with some influence there. And I don’t think that a CAO is the same as a CDO. And, you know, now we run the risk of kind of getting in the weeds here, right? But I don’t think that data science and data governance are the same thing. I mean… 
Jon Krohn: 01:05:30
No, no, they’re very different. To me, that’s super obvious: a chief data officer is more concerned with data privacy and good governance of data structures, whereas a chief analytics officer, which could be more similar to a chief data scientist, those two roles, CAO and CDS, are more concerned with building new models, automating things, and providing predictive analytics to the business. 
Keith McCormick: 01:06:02
Well, here’s another distinction too. The thing is, we agree that it’s almost a no-brainer that they’re different, yet I think there are a lot of organizations that try to combine them, because maybe they can’t justify having both roles or whatever. But the chief data officer, and even the CIO for that matter, clearly those are cost centers, right? It’s a cost of doing business. We need a certain data infrastructure, we need to have data governance, we need to have privacy. But you’re not ensuring data privacy to make money; it’s a cost of doing business, right? So I really want the CAO to be laser-focused, solely focused, on identifying which projects have value and making sure that those are the projects their team is assigned to do.
01:06:54
So another way of putting it is that the number one job of the CAO is to maintain the analytics portfolio. So now we’re getting almost into portfolio management, PMO-type stuff, right? So what does that mean for an organizational structure? One of these days I wanna have a long conversation, preferably over some wine, with somebody who’s a true world-class PMO expert. There actually are a couple who also teach for LinkedIn Learning that I’ve had short conversations with, but someday I wanna have a lengthy conversation with them about how to really make that work. How should a CAO be collaborating with a project management office, and so on, right? But again, the CAO should be all about making sure that their team has the resources they need to complete the projects that have the greatest value. That’s it. But I don’t think most organizations have anybody like that. In a lot of organizations, the highest rank, to put it that way, of the most important data scientist in the organization is maybe more like a senior manager or a director level. 
Jon Krohn: 01:08:10
Yeah, I think that’s right. I mean, there might be companies out there that wanna find somebody who can be like a chief data scientist or a CAO, and it’s hard to find people with that level of experience, or when you do find them, they can be very pricey. 
Keith McCormick: 01:08:26
Yeah, that’s absolutely true. But think about the number of folks, I’m sure you have them among your acquaintances on LinkedIn, the number of folks that are in that league, that are really incredibly talented, and if you look at their LinkedIn, they’ve been at, you know, eight organizations in 10 years, or maybe more than 10 organizations in 10 years, right? I don’t think it’s just cost. Organizations don’t know how to keep them. And I think part of the problem is they think that somebody like that is their first hire. So now that person arrives and they’ve got no team. Well, how are they supposed to do anything? So they’re going crazy, spending probably a fourth or a third of their day working with HR to try to hire a team. And hiring a good data scientist can take three to six months or more. So next thing you know, it’s almost the end of the year, and they have that painful, what-have-you-done-for-me-lately conversation with senior management. And they say, well, for almost a year, I’ve been trying to put an infrastructure into place. I’m pretty sure that’s the story at a lot of organizations. 
Jon Krohn: 01:09:47
Yeah, that’s insightful. You could be right. So, you’ve mentioned LinkedIn a number of times in recent sentences. 
Keith McCormick: 01:09:55
Oh, yeah. 
Jon Krohn: 01:09:56
And you know, you are an instructor of data science and machine learning courses, as has come up already in this episode, through LinkedIn Learning as well as through the University of California, Irvine. So, from your vantage point there: we’ve talked a lot already in this episode about what’s missing from companies in order to have success with AI, but what do you think is missing from data science education, both formal and informal, to help aspiring practitioners prepare for industry? 
Keith McCormick: 01:10:31
Well, again, I’m apologizing for being old school, I guess, almost 30 years doing this, depending on when you start counting. I guess I shouldn’t apologize for that, right? People are always influenced by how they learned. But also, I can take the long view, right? So here it is, here’s the number one thing that’s missing: CRISP-DM, or something like it. Really understanding not just the machine learning lifecycle, like, you know, people put a little diagram on a slide, and you start with problem definition, then you do data prep. I don’t mean lip service to it. I mean really understanding the machine learning lifecycle and the implications for project management and working with clients, in other words, the points in the process where the client should be making the decision and not me, you know? So as much as I think that Kaggle has been a positive influence for our community, and I really do think that that’s true, it has distorted our view of what a machine learning project is, because it’s a modeling competition. It’s not a machine learning project competition, it’s a modeling competition.
01:11:53
So I think that we have, I mean, again, I’m gonna make myself sound like the old guy, okay? I think we have a whole generation of data scientists that are, well, now I think I’m oversimplifying, because it was true when I started out too, but we’ve always had an obsession with modeling and modeling algorithms, and haven’t understood enough how you get a project from beginning to end, including the cultural and organizational stuff that comes with that. And I don’t know why no time is spent on that in something like a machine learning bootcamp, where the primary reason somebody’s there is to really sharpen their skills, usually with a specific set of tools like Python libraries and so on, right? I kind of get why it doesn’t get fitted in there. But how data science master’s programs, for instance, don’t spend enough time on this? They have no excuse. It’s crazy. I mean, they should have entire courses dedicated to it. Maybe not a half dozen of them, but for goodness’ sake, at least one. 
Jon Krohn: 01:13:11
Right. I think you’re absolutely right, and I hadn’t thought the thought that you just conveyed so clearly. It kind of reminds me of a thought that I have had: it’s surprising to me how much of primary and secondary school education is focused on abstract skills like calculus and chemistry. The vast majority of the people who take those courses memorize everything about them, really know how to do partial differential calculus, and understand what compounds need to be combined in chemistry to create some other compound, and yet more than 90%, maybe more than 99%, of people who do those courses never use those skills in their work or in their life. And, I don’t know, I’ve seen on TV shows that there are things like home economics courses or whatever, but, I don’t know. 
01:14:26
Growing up in Ontario, in Canada, we didn’t have anything like that. So you can kind of see the analogy here. It’s interesting how there are topics where I guess curriculum developers think, wow, this is a really fascinating thing, it’s something about the universe that is extraordinary. This could be data science models, like, you know, that ChatGPT works. Wow, that’s crazy. How does it work? Or that organic chemistry works, how does that work? Partial derivative calculus. Wow. But it’s rarely practical information that somebody needs to know. Now, some small portion of the people who do those courses, yes, they go on to study it in university, they do a PhD in it, they become an expert in it. And you don’t have to get a PhD to be an expert; you can also develop it in the field. But it is just interesting to me. It seems unbalanced. 
Keith McCormick: 01:15:36
Well, you know, I’ve seen a lot of these lists of the 10 skills you need to become a data scientist or whatever. So let’s just take one thing that often comes up. I went to an engineering school, so I took quite a bit of calculus, up through differential equations and everything. It wasn’t really my jam, but I survived it. Linear algebra, for whatever reason, was a mental block. Maybe I didn’t like the professor, I don’t know. But that’s always on the list of things you absolutely, positively need to do data science. And that’s not in my bag of tricks. But why does this even come up in conversation? Well, because if you’re gonna work for one of the FAANG companies, then you’re doing machine learning at such an unbelievable scale.
01:16:31
I mean, take Netflix, for instance. Is it billions of transactions? How many seconds or minutes do you have to wait? It’s just an unbelievable scale. Something like a fifth of the world’s population probably is on Netflix, and I think it really is something like that, right? Hundreds of millions, certainly. LinkedIn is almost a billion, isn’t it? Seven or eight hundred million, something like that. Unbelievable scale. So if you’re working for companies like this, you’re not using off-the-shelf software. You have to write these algorithms from scratch. But why on earth is that our model for what a data science team should look like at a regional bank that’s trying to prevent loan defaults, or an insurance company that’s trying to prevent fraud, or a small health group that has a dozen hospitals? It’s insane. 
01:17:28
To me, it just doesn’t make any sense. Now, I’m not saying across the board that you shouldn’t solve those problems with code, or that really knowing what’s going on behind the scenes isn’t valuable. But for me, knowing the history of the algorithms and how they work is sufficient to manipulate the hyperparameters and so on, right? I don’t necessarily have to write the algorithms from memory, from scratch. I get why some people find that skill valuable, but I’ve been doing this for enough decades that I’m quite comfortable saying I don’t need to know that to do what I do. And I think I’m bringing plenty of value to these clients, right? So there just seems to be some real confusion about what you really need. To send someone out with an extra helping of that kind of stuff, but leave them completely and utterly unprepared to scope a project and write a client contract, doesn’t make any sense to me, because then the solopreneurs have to learn that on their own, which can be very painful. Or, hopefully, if they’re gonna go the consulting route, they do a round of work more on the business consulting side before they go more technical, or something, right? I mean, somehow that has to be addressed, but I don’t know why that’s not addressed in a data science master’s. 
Jon Krohn: 01:18:59
Yeah. I think Serg noted for me in his research that you have previously discussed something like a doctor’s residency for data science. 
Keith McCormick: 01:19:08
Oh, well, that’s actually a gentleman, Usama Fayyad, who says that, and I borrow it from him. He’s been in the business for decades. He was one of the co-chairs of the first KDD conference, and he is the inaugural director, I think is his title, of Northeastern’s program for experiential AI. That’s the metaphor he uses, and I think it’s really powerful: okay, you’ve got your medical degree, but you have to do a residency before you go out into the field. But he says that because it’s completely absent from everybody’s program, except for the program that he chairs, right? And I agree with him that more people have to do that. So at Northeastern, they work with postdocs, they work with doctoral students, they work with the whole gamut. They could speak to the details better than I can, but they work on real-world projects together, almost like a university environment and think tank and consultancy all rolled up into one. 
Jon Krohn: 01:20:19
Yeah, it’s a great idea, although I didn’t know that it came from that vantage point of having a program that does it.
Keith McCormick: 01:20:30
To my knowledge, the only one. 
Jon Krohn: 01:20:32
Yeah. Good. All right. So we’ve spent a lot of time in this episode on what we need to have successful data science projects within a company. We’ve now also spent time talking about what’s missing in data science education. Are there any particular tools that you think an aspiring data scientist must learn? Is there some tool or tools out there that you think are critical that some people are missing? 
Keith McCormick: 01:21:01
Well, it’s not so much that. I mean, there are some tools, obviously, that I use on a regular basis, and I’ll share those in a moment. But I think we have to find a way to deal with the fact that for a long time, at least, the cool kids had to do everything a hundred percent in code, right? Because the idea was that if you didn’t do that, then somehow you were revealing that you didn’t have the skillset that the cool kids had, or what have you, right? I’ve been on calls with clients, I’ve actually had tons of these kinds of mentoring calls, where for whatever reason we have to do a calculation or rerun a model on the fly, and the most efficient way is gonna be whatever lets us do it right this minute. So I’ve had the experience where on the other side of the Zoom call, they’re using a tool like IBM SPSS Modeler, which I used forever, or now more often it’s KNIME, right? But I’ve also been on calls where they’re in a Jupyter Notebook and they’re doing Python or whatever, okay? I think that low-code, no-code is just plain faster.
01:22:35
And again, I run the risk of not being invited to have lunch with the cool kids if I say that. But, again, it’s partly, I don’t wanna be overly blunt, but sometimes it feels like a show-off kind of thing. But the other reason, and this is the more legitimate reason: if we don’t do everything in code, then we’re gonna run into problems when we go into production, because we’re gonna have to rewrite it or something like that. That’s usually the argument that’s made. Or, if everybody in the organization is a Python coder, then that is inherently gonna make things more efficient. It makes a lot of logical sense, doesn’t it? But I’ve never observed it to be true. 
01:23:22
I’ve never observed a seamless deployment from a prototype model, ever, in almost 30 years. And I’ve never actually experienced being inside a building where everybody uses the same tool, right? So I get that it’s a logical argument, but if it doesn’t exist in the real world, it doesn’t carry a lot of weight with me. So I think that low-code, no-code shouldn’t be just for the so-called citizen data scientist, because I’ve never really understood why everybody in the building should be building models. Everybody in the building should be exploring data, but that’s BI, not machine learning, right? So I don’t think that’s the argument for low-code, no-code. I think the argument for low-code, no-code is that it’s faster. 
Jon Krohn: 01:24:21
Yeah. I think especially because there are lots of gotchas that you need to have some expertise in when you’re building models, like selection bias, feature drift, there are tons of examples. So those could be dangerous if anybody could be doing it. But in terms of exploring data and being able to draw your own conclusions from data, plotting some trends, yeah, everybody should be able to do that. And I definitely get your point that it does not seem like the coolest thing to be discussing low-code, no-code for data scientists, but you can see how it would make things more efficient and reduce the number of interoperability issues between different people’s processes when moving things to production. So you have a lot of experience with low-code, no-code tools. You’ve created a lot of courses and written a lot of books on these kinds of tools. You started with SPSS, which is the Statistical Package for the Social Sciences, if I’m remembering correctly. 
Keith McCormick: 01:25:29
Yeah.
Jon Krohn: 01:25:30
And that was my first exposure to statistics, and I thought it was great. I felt like I had a good understanding of what was going on under the hood. Whether I’m using a point-and-click UI in SPSS or writing it as a line of code in some Python package, it’s the same level of abstraction, really. 
Keith McCormick: 01:25:57
Well, there’s so much to say about this. A quick note that supposedly they retired the acronym a long time ago, so it became, you know, kind of like MCI, actually, does that company still even exist? But there are certain…
Jon Krohn: 01:26:13
Acronyms, yeah. It’s like BP, British Petroleum. 
Keith McCormick: 01:26:15
Yeah, yeah. Eventually it just becomes the acronym, because they didn’t wanna limit themselves to just the social sciences. But the metaphor that I think is appropriate here, and it really fits SPSS too, is that it’s like a single-lens reflex camera, or, you know, even those really cool mirrorless cameras now. And I think you probably have more audio-video talent than I do. But these cameras, it could be a $5,000 camera or something, it’s got an auto setting, but you can turn that thing off. And that’s what I think is powerful about a low-code tool, which is probably the more accurate way to describe something like KNIME, for instance, the KNIME Analytics Platform. Yeah.
Jon Krohn: 01:27:03
Let’s spell that out for our listeners: KNIME. I’ll be sure to have a link to that in the show notes. It is a tool that I first learned about at ODSC West 2019, at a lunch or something. I was sitting next to somebody who was giving a talk on KNIME, and I hadn’t heard of it before, but now it pops up all the time. So, let our listeners know what it is. 
Keith McCormick: 01:27:28
Yeah. So the K in KNIME, by the way, stands for Konstanz, Germany, because that’s where it was developed, at the University of Konstanz. But it’s what they call visual programming, which is also what SPSS Modeler is. SPSS Statistics is what you were referring to: it physically looks like Excel, but it’s got very different features and it does statistics. SPSS Modeler looks like a flowchart, and that came out, believe it or not, in ’94. It was called Clementine way back then. KNIME has been around for probably a little less than half that time, but many years now, 10, 12, 15 years plus. And KNIME also looks like a flowchart. You can rapidly draw the flowchart, and if you’re gonna run everything on defaults, you don’t have to go into the symbols and make changes, but you can indeed make those changes.
01:28:28
So, in that sense, again, I invite people to imagine that they’re working with a really nice camera. They could get a simple everyday lens that works under most conditions, a simple zoom lens or something, and then keep everything on auto, and they’d never have to go to photography school and learn all the fancy settings, but the fancy settings are there. So you can rapidly prototype, but then you can go deeper when you need to. And I think that’s the power of tools like this, particularly for young data science teams. Where I really think the coding thing starts to become a problem is when it completely drives not just what the team aspires to do, but starts to affect things like hiring. “Oh, we can’t hire that person because we’re a Python shop.” Right? Or, we can’t hire that person with 25 years of insurance fraud experience because they’re a SAS person.
01:29:44
Then I think the tail is wagging the dog, and it starts to get a little crazy, right? So I just encourage folks to say that these tools should be on the table. That doesn’t mean that everybody in the world should adopt one of these tools, but we should make it clear that it’s okay to learn them. And the other thing I would add is that I think introductory data science classes, generally speaking, should use these tools, because then you can focus on the concepts. You know, somebody might take two semesters of a data science class using a tool like this, and at the master’s level they might decide to do an MBA. What a talented analytics manager that person’s gonna be, because they really understand some of it, right? Whereas if we put Python 101 as a barrier to entry, we’re reducing the top of the funnel.
01:30:42
And from an education standpoint, that doesn’t make a bit of sense. So, the way I’ve solved this problem at UC Irvine, because I’m not faculty, well, adjunct, whatever you want to call it, I’m an instructor in the department of continuing ed, and I nearly walked away from it a couple of years ago because they were gonna change the program and it wasn’t quite clear where my courses were gonna fit. Invariably there’s an issue about whether or not my courses should come after a Python intro or what have you, or whether everybody’s gonna do Python, or whether there’s ever gonna be a program that involves an option other than Python, right? It’s a debate that universities are all having. But what they did to entice me, and it was kind of an offer I couldn’t refuse, was a three-course sequence I’m working on, probably gonna start next fall, where I’m the only instructor. And the reason that this solved the problem was, since it’s only three courses, there’s no pushback that it’s gonna be KNIME, because there’s no room in three courses. 
01:31:52
You can’t have the first course be Python grammar, or a third of your time is over, right? You’re not gonna be able to learn much in a short certificate program like this. I don’t know where the students are headed. I don’t know if they’re doing a career change, or if they’re someone that’s maybe more IT or data engineering but wants to learn more about machine learning than they already know, right? But with three courses that are 10 weeks each, I can weave in the project management and the value estimation that we were talking about earlier, because if I had a course called Machine Learning Project Management, nobody would want to take it, because they would think it wasn’t sexy enough for their CV or for LinkedIn, right? So I have to kind of sneak it in. It reminds me of TV dinners when I was a little kid, I’m dating myself big time, because I grew up with Tang and TV dinners and all that kind of stuff, the tin foil thing that you put in the oven. 
01:33:03
Underneath the vegetables, they would put a little picture of a pirate or whatever. They had to trick you into eating the vegetables. I think it’s crazy that we have to trick people into worrying about analytics project management, but universities will tell you that if that’s a separate course, it doesn’t get the enrollment that a parade of algorithms gets. 
Jon Krohn: 01:33:34
Yeah. Fascinating perspective there, and I agree with everything you said. And it sounds like you have a LinkedIn Learning course that you’re currently developing that is tailored to a business audience. 
Keith McCormick: 01:33:50
Oh yeah, very quickly on that, because what I was describing just now was the UC Irvine course, but I do have a course on LinkedIn Learning that addresses exactly that. It actually already exists: it’s Predictive Analytics Essentials: Data Mining. And the reason that I still have data mining in the title, even though that phrase is a little out of fashion, is because there’s a substantial amount of material in it on CRISP-DM, not just a statement of CRISP-DM, but how I use CRISP-DM in working with clients and writing client contracts and managing projects, the whole nine yards. And I’ve actually got two versions of that, one for data scientists and one for executives. So there’s a predictive analytics essentials for executives, and then the other one has some other name, but if they search for me on LinkedIn, they’ll find it. 
01:34:52
But I know what you’re referring to. In just a couple of weeks, I’m gonna fly out to Santa Barbara, twist my arm, it’s a beautiful place to be. The LinkedIn studios are there, and it’s actually in a studio environment that you do this recording, either audio only, as will be the case with this one, well, actually there’s a camera inside the soundproof booth, it’s a really cool setup, or they also have a full-blown studio with two cameras and a director and so on. This particular course won’t be in that format, but it’s an executive guide to AutoML, so it’s not a practitioner’s AutoML course. Rather, it walks senior executives and analytics management through the implications of this technology for hiring, for team composition, for how long projects take, and so on. And no one will be surprised at this stage what my general opinion on this is, which is that these are tools that can facilitate the data scientist, not replace them. That’s the one-sentence version. But I go into all the phases, which phases the technology does really remarkably well and which phases are really human phases, and how that all works together.
Jon Krohn: 01:36:08
Cool. I haven’t heard of a course like that before, but it sounds like it’ll be valuable. It sounds like you’re tapping into a great market there. Big opportunity. Nice. So what an episode, Keith, I’ve learned a ton from you today. Thank you so much for sharing your wisdom with us. As we’re reaching the end of the episode, this is the time that I ask for a book recommendation. And given all of your wisdom, I expect this is gonna be an interesting one.
Keith McCormick: 01:36:34
Sure. So, a little bit of a random choice from a bit of a book geek, but this is called The Ghost Map. Can you see the author? It’s up at the top: Steven Johnson. He actually has, I think it was a Google Talk or something like that; if you Google this, you’ll be able to see him giving a lecture on it. But what’s the deal with The Ghost Map? There was this epidemic, really, is the way to say it, in London many, many years ago. It was cholera, and everybody thought that it was airborne, but it’s actually waterborne. The reason I’m recommending it: one, it’s a cool story. I’ve actually been to the neighborhood in London where this happened, and there’s a famous pump there. And what’s cool about it is that proving to others that your hypothesis about the data is right is not the same as figuring it out yourself. 
01:37:36
In other words, it’s really a whole story around correlation is not causation. But the way you usually talk about that is, therefore give up, we can’t talk about causation. If you’re trying to save lives, though, you do have to get to causation, because you have to decide: is this an air problem or a water problem? In this case, it was a water problem, and nobody thought that it was, right? By the way, the hero of the story is John Snow, not the Game of Thrones Jon Snow, but he’s the one that figured out that a particular pump was the source of this problem for a neighborhood. And you can actually go there, and there’s a John Snow pub. So, for stats geeks, it reads like a novel, and it’s really cool. 
Jon Krohn: 01:38:22
Yeah, and you’ve got the right audience here, because on a recent episode of the show we were discussing this same story, though not from the same perspective as you had there. So this book, The Ghost Map, sounds great. We didn’t discuss the specific book, but something epidemiological. Oh, I know who it was with: it was Charlotte Deane, she’s an Oxford professor, and she was involved in the UK government effort to fight Covid. And so we got talking about kinda epidemic… 
Keith McCormick: 01:38:52
Oh, wow, wow. 
Jon Krohn: 01:38:53
Yeah, John Snow came up. Anyway, Keith, it has been so awesome having you on the show. We’ll have to do it again sometime soon, because it feels like we really only just got started. You know, we’ve been recording for hours, but I feel like I’ve only scratched the surface of the wealth of knowledge that you have on data science. So we’ll have to have you on again soon. In the meantime, how can our listeners keep up with the latest from you?
Keith McCormick: 01:39:21
Three quick ways. I do have keithmccormick.com, sure, find me there if you like, but better than that is to follow me on LinkedIn, because I’m very active on LinkedIn almost every day. Certainly check out LinkedIn Learning: if you go to LinkedIn Learning and you search for my name, as opposed to one of the courses, there’s actually a homepage of sorts that lists them all, one of which is on causality, by the way. So I have a little video in one of the courses on this, which is kind of a fun tie-in. And then finally, once a month I do an office hours, usually around the middle of the month. What I do is either talk on my own about stuff that’s in the courses, or I invite other instructors or other guests to talk about that stuff. So that’s a lot of fun, and that’s a great way to stay in touch. 
Jon Krohn: 01:40:16
Nice. Thanks for those. We will be sure to include links to those in the show notes. Keith, thanks so much, man. It’s been awesome connecting with you again. And yeah, catch you again soon. 
Keith McCormick: 01:40:28
Thanks so much. I enjoyed it. 
Jon Krohn: 01:40:34
Keith really knows his stuff. Hopefully, we can get him on again soon so we can continue to benefit from his wisdom. In today’s episode, Keith filled us in on how AI is transparent if the client it’s being provided to has clear insight into how the model works; how, in order for an AI project to be profitable, it can’t be open-ended: you’ve got to structure it to have a specific revenue goal from the outset; and how AI models with a binary yes-no output are very likely to result in a successful AI project, because all four quadrants of the resulting confusion matrix can be associated with a specific income or cost. There is a YouTube tutorial by yours truly in the show notes if you aren’t sure what a confusion matrix is or would like a refresher. Keith also talked about how data science teams need to be led by someone with C-suite-level influence in order for the team to get the resources they need to complete the projects that drive value, enabling data science to be a profit center instead of a cost center within the organization. 
01:41:27
And he talked about how CRISP-DM, the Cross-Industry Standard Process for Data Mining, is a process model that helps ensure AI projects are successful. As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Keith’s social media profiles, as well as my own social media profiles, at www.superdatascience.com/655. That’s www.superdatascience.com/655. Beyond social media, another way we could interact is coming up on March 1st, when I’ll be hosting a virtual conference on natural language processing with large language models like BERT and the GPT series architectures. It’ll be interactive, practical, and it’ll feature some of the most influential scientists and instructors in the large natural language model space as speakers. It’ll be live in the O’Reilly platform, which many employers and universities provide access to. Otherwise, you can grab a free 30-day trial of O’Reilly using our special code SDSPOD23. We’ve got a link to that code ready for you in the show notes. 
01:42:26
Thanks to my colleagues at Nebula for supporting me while I create content like this SuperDataScience episode for you. And thanks of course to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the SuperDataScience team for producing another information-rich episode for us today. For enabling that super team to create this free podcast for you, we are deeply grateful to our sponsors, whom I’ve hand-selected as partners because I expect their products to be genuinely of interest to you. Please consider supporting this free show by checking out our sponsors’ links, which you can find in the show notes. And if you yourself are interested in sponsoring an episode, you can get all the details on how by making your way to jonkrohn.com/podcast. 
01:43:03
Last but not least, thanks to you for listening. We wouldn’t be here at all without you. So until next time, my friend, keep on rocking it out there and I’m looking forward to enjoying another round of the SuperDataScience Podcast with you very soon. 