SDS 699: The Modern Data Stack, with Harry Glaser

Podcast Guest: Harry Glaser

July 25, 2023

Harry Glaser and Jon Krohn discuss Modelbit’s capabilities for automating the deployment of ML models from notebooks into production, reducing the time and effort spent ‘translating’ work from one mode to another. Harry also expands on the importance of automating this task and how developments in ML modeling have widened access to data analysis for entire teams, whatever their level of expertise.

Thanks to our Sponsors:
Interested in sponsoring a SuperDataScience Podcast episode? Visit JonKrohn.com/podcast for sponsorship information.
About Harry Glaser

Harry Glaser is the co-founder and CEO of Modelbit, the fastest way for data scientists to deploy ML models. Previously he was co-founder and CEO of Periscope Data, which was acquired by Sisense. He started his career A/B testing the search results page at Google. On nights and weekends he still contributes a little code to Modelbit, much to the frustration of his co-founder.
Overview
In this episode, Harry Glaser, co-founder and CEO of Modelbit, walked through several technical essentials regarding the production application of ML models. He expanded on the importance of version control for modeling in GitHub and GitLab, which helps users collaborate with their team on a model without losing critical information. And, as anyone who has worked with Jupyter Notebook will know, introducing even the smallest of errors can prevent an entire operation from running!
Harry also explained how data scientists who use notebooks for prototyping, iterating, building, versioning, and training their models must make those experiments ‘production ready’. Traditional methods demand rewriting models from the notebook into software engineering code. Rewriting can become a stumbling block, as it isn’t always the case that the data scientist who produced the notebook models can translate them into engineering code.
Harry saw an opportunity to automate this process, wrapping the model’s code, data, and libraries to obliterate this stumbling block for any production task. To Harry, this isn’t just an excellent solution for companies’ data scientists. Automating production-ready models has a knock-on effect throughout the company’s hierarchy of needs, saving substantial production time and effort and improving the bottom line.
Finally, Harry discussed using business intelligence products like Looker and Tableau to monitor those production-ready models. He encouraged users to go beyond typical scatterplot and line chart analyses and spend more time ‘at play’. Getting a data scientist to examine statistical methods that measure the model’s performance according to chosen parameters is one decisive way to drive value. And with Looker, people company-wide can play with the models via a simple interface that helps everyone access and understand critical information for product development and performance.
Listen to Harry and Jon talk through these topics as well as key definitions for CI/CD, load balancing, logging, and why they are all critical to the success of a production-ready ML model.
In this episode you will learn:

  • What the modern data stack is [03:28]
  • Version control for data scientists [13:30]
  • CI/CD, load balancing and logging [20:38]
  • Snowflake vs. Redshift [30:10]
  • How tools like Looker and Tableau help monitor models [35:26] 

Podcast Transcript

Jon Krohn: 00:00

This is episode number 699 with Harry Glaser, CEO of Modelbit. Today’s episode is brought to you by the AWS Insiders podcast. 
00:08
Welcome to the SuperDataScience podcast, the most listened-to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today. And now let’s make the complex simple. 
00:43
Welcome back to the SuperDataScience podcast. We’ve got an information-dense episode for you today with the exceptionally clear and concise entrepreneur, Harry Glaser. Harry is Co-Founder and CEO of Modelbit, a San Francisco-based startup that has raised 5 million dollars in venture capital to make the productionization of machine learning models as fast and as simple as possible. Previously, he was Co-Founder and CEO of Periscope Data, a code-driven analytics platform that was acquired by Sisense for 130 million dollars. And prior to that, he was a product manager at Google. He holds a degree in Computer Science from the University of Rochester. Today’s episode is squarely targeted at practicing data scientists, but could be of interest to anyone who’d like to enrich their understanding of the modern data stack and how ML models are deployed into production applications. In the episode, Harry details the major tools available for deploying ML models, the best practices for model deployment, such as version control, CI/CD, load balancing and logging, the data warehouse options for running models, what model orchestration is, and how BI tools can be leveraged to collaborate on model prototypes across your organization. All right, you ready for this mega practical episode? Let’s go. 
02:01
Harry, welcome to the SuperDataScience podcast. Delighted to have you here. Where are you calling in from today? 
Harry Glaser: 02:07
Hello. From our office in sunny, downtown San Francisco. It’s beautiful today. 
Jon Krohn: 02:11
Oh, is it actually? I thought that was sarcasm. 
Harry Glaser: 02:12
It’s gorgeous. No, no, it’s gorgeous out. It’s the middle of summer and we’re right across the street from the San Francisco Giants’ stadium in South Beach, and it’s an absolutely gorgeous day. 
Jon Krohn: 02:24
Nice. Yeah, I have some experience in that area. I spent a couple of weeks, I guess the year before the pandemic, so like 2019, recording a bunch of content, like 15 hours of content on deep learning for the O’Reilly platform, and they put me up in a hotel right where you’re describing, right across from the Giants’ stadium. 
Harry Glaser: 02:47
Sounds amazing. Yeah.
Jon Krohn: 02:48
It was a nice area. Obviously lots of great Mexican food within walking distance. 
Harry Glaser: 02:52
Where, where in the world are you? 
Jon Krohn: 02:55
I’m in New York. Thank you for asking. Guests very rarely ask. 
Harry Glaser: 03:00
I’m jealous. Love New York. 
Jon Krohn: 03:02
Yeah. You know, big pros, big cons in the big city. So, Harry, you’re the Co-Founder and CEO of Modelbit. It’s a machine learning model deployment company that makes it very easy to get models out in as little as a line of code without any of the hassles of machine learning operations, and Modelbit integrates with the modern data stack. And so that’s what I was thinking would be a really fascinating topic to dig into with our listeners: what the modern data stack is. So, for example, that easy deployment that Modelbit offers can be done from Jupyter Notebooks, Deepnote, and Hex. Jupyter Notebooks I’m intimately familiar with; any of my teaching materials I put in Jupyter Notebooks, particularly in Google Colab, which makes it very easy for me to be teaching online.
03:56
People don’t need to be doing installation, and you can be right there in the notebook getting your plots; lots of interaction just makes it very easy to follow along in that kind of lesson content. And I know that a lot of people will even prototype their models or their data pre-processing flows in Jupyter Notebooks. So, I think this is something that probably a lot of our listeners are familiar with, but fill us in on Jupyter and what it’s like to deploy from Jupyter, why somebody would want to do that as opposed to traditional MLOps, and then we’ll carry on with the modern data stack from there afterward. 
Harry Glaser: 04:34
Sure. So, I think almost all of the data scientists that we work with, our customers, build their models and train the first versions of their models in Jupyter Notebooks or a similar technology. And Deepnote and Hex are just sort of cloud-collaborative Jupyter Notebooks. Maybe not fair to call them Jupyter, exactly; they’re rewrites, right? So, they’re both in the cloud, in the web browser, but they allow considerably more collaboration, and they run the code in the cloud and not on your laptop. But other than that, they’re all notebook experiences, right. And our data scientists will use those notebooks almost exclusively for prototyping, experimenting with their model, building their model, training the first versions of their model. And then there’s this big friction point when it’s time to deploy the model. Because now you have to get all this stuff that’s been built in this very experimental way out of the notebook and into something that resembles something that could go into production, right? And classically, that looks like rewriting, right? Opening up VS Code or your favorite editor and rewriting whatever was in the notebook in legit production-ready software engineering code. 
05:51
And if the data scientist doesn’t have that skillset, which is common, then they have to give their notebook to a software engineer or a machine learning engineer who does have that skillset. So, big friction point. So, what we do, and what we hope is useful for folks, is in that same notebook where you were experimenting and training and building the first version of your model, simply call mb.deploy(), and as the parameter to that function call, pass in the function that calls your model or the model itself, and we will wrap up the entire code of the model, the data of the model, as well as the entire Python environment, right? Which is very important. The particular libraries, the particular versions of those libraries, the system packages those libraries depend on, the version of Python itself. All that goes into a Docker container, put in a production environment behind a REST API. And then you get your version control and your CI/CD and your logging and load balancing and so forth. 
06:47
But the key insight is the data scientist is almost always in this Jupyter notebook building the model. And if we can just automate the process of taking it from there into production, we can hopefully save them a lot of time and stress. 
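To make that workflow concrete, here is a minimal sketch of the notebook side of what Harry describes. The mb.deploy() call is the one named in the episode; the login step, the toy model, and the function name are illustrative assumptions rather than Modelbit’s exact API.

```python
# A minimal sketch of the notebook-to-production flow described above.
# mb.deploy() is named in the episode; the login step and the toy model
# below are illustrative assumptions, not Modelbit's exact API.
import modelbit
import numpy as np
from sklearn.linear_model import LogisticRegression

mb = modelbit.login()  # authenticate the notebook session (assumed)

# Toy training data standing in for real features and labels
X_train = np.random.rand(100, 3)
y_train = (X_train.sum(axis=1) > 1.5).astype(int)
model = LogisticRegression().fit(X_train, y_train)

def predict_lead_score(f1: float, f2: float, f3: float) -> float:
    # Feature engineering and input validation would live here too
    return float(model.predict_proba([[f1, f2, f3]])[0][1])

# Wraps the function, the trained model object, and the notebook's
# Python environment (library versions, system packages) into a
# container behind a REST API
mb.deploy(predict_lead_score)
```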
Jon Krohn: 07:02
Very cool. That makes a ton of sense to me. And so, yeah, for someone like me who nobody ever lets near the production [crosstalk 00:07:10] 
Harry Glaser: 07:12
That’s me too. Me too. And that’s really common, right? You have these data scientists who are really good at these big honking models that are awesome, but it’s a different skillset from battle-tested, production-ready software engineering code. 
Jon Krohn: 07:25
Yeah. Yeah. And we have data scientists on my team that definitely have the machine learning engineering background, but they come from computer science backgrounds and kind of moved into data science. As opposed to someone like me who came from the science side, and they’re kinda like, “we’ll take it from here, Jon.” 
Harry Glaser: 07:41
Yeah, yeah. Right. And at some companies, they want the data scientists to have that career path into ML engineering, but at many more companies, it’s like, like you said, “we’ll take it from here”, you know? And that creates a bottleneck on how much machine learning they can ship, which is [crosstalk 00:07:52] 
Jon Krohn: 07:52
Absolutely. In fact, something that I’ve read, and that fits roughly with my experience having been a Chief Data Scientist now for coming on 10 years, is that data scientists can churn out models far more quickly than those models can be engineered into production. So, it’s something like a one-to-four ratio that you need of data scientists to engineers, whether they’re machine learning engineers or backend engineers, to be able to put those models into production and have them be performant. So, it sounds like Modelbit is the perfect solution to get that number way down, potentially one to zero. 
Harry Glaser: 08:31
Yeah. I mean, one to zero is our goal. Absolutely. And we have customers that operate at that ratio. And to your point, I mean, I think one to four is probably right in a sort of scenario where you’re staffing MLEs to deploy models. I’ve also seen one to two in top tech companies. And what we see is like, imagine now dragging that ratio out to the right as you want to scale up the amount of machine learning you do because you’re a machine learning company, right? Let’s say you’re building a startup or even a big company, and machine learning is the core of your value prop, and you have a team of data scientists, and now you want to do a lot of machine learning. Well, your cost just goes crazy hiring all these machine learning engineers, to say nothing of: can you even hire them? Are they available? And so there’s this hierarchy of needs, right? You supply the data scientist with something that makes their life better. And then, up the hierarchy of needs, by doing that you supply the company with something that saves them a lot of money / provides a lot of business value. 
Jon Krohn: 09:32
Yeah, totally. It makes a huge amount of sense. It’s a big problem that it looks like you’re doing a great job of solving. So, okay, so Jupyter Notebooks, we’ve got a great sense of that. Now, you also can deploy with Modelbit from Deepnote. So, is that something similar, kind of a notebook like Jupyter? 
Harry Glaser: 09:53
Yes, that’s right. So, Deepnote and Hex and Noteable, and there are others, are just cloud-based notebook environments, very similar to Jupyter, but sort of modern, ground-up, cloud-based rewrites. 
Jon Krohn: 10:05
Very cool. Gotcha. 
Harry Glaser: 10:07
And to be clear, I mean, sorry, to be clear, we are agnostic about that. It’s a customer question, a user question, what kind of notebook environment they prefer. 
Jon Krohn: 10:16
Right? Right. So, Modelbit is a library that can just be imported. It doesn’t matter what notebook environment you’re using. 
Harry Glaser: 10:23
That’s exactly right.
Jon Krohn: 10:25
Cool. Do you ever see people deploying from something that isn’t a notebook? 
Harry Glaser: 10:30
Yeah, I’d say there’s a maturity curve that we see, and I think pretty much all, if not all, customers have started from a place of wanting to deploy models from a notebook. What’s funny is, even the most senior data scientists in the industry who are deploying the most sophisticated neural nets and deep nets are still building them in notebooks. Because the experimental nature of the notebook and the visual programming nature of it make it easy to evaluate models as you are training them. But as the model itself gets closer to a software engineering project, where the model was initially built months or years ago and has been retrained several times in pipelines based on new data, we move to a more Git-based workflow, where you check out the code and data, modify it, check it back in, and/or run a pipeline inside of Modelbit that reruns the training, automatically or manually evaluates the results of the training, and automatically or manually redeploys. But those are typically at the more mature end of the maturity curve. And to be clear, I’m saying the model’s maturity curve, where we’ve moved from the experimental stage to the living-in-the-product stage. 
Jon Krohn: 11:42
Very cool. That makes a lot of sense. So, then in that case, you could be using an IDE like PyCharm or VS Code. People have all their favorites. And then you get into the big Vim and Emacs debate. 
Harry Glaser: 11:59
I’m old enough to have coded for a living in Vim. I miss it. I still do :wq sometimes on my keyboard, and then Slack is like, what the f***. But yes, I mean, at a nuts-and-bolts level, you can deploy to Modelbit a couple of ways. One of them is from the notebook, as we’ve discussed. Another one of them is with a git-push. And yeah, if you’re using the Git-based workflow, because it’s a very mature model that’s been in production a while, and you want to check it out, modify it, check it back in, check it out, branch it, push to the branch, evaluate the branch, and then merge the branch, whatever, yeah, you would use something like VS Code or PyCharm or Vim or Emacs or whatever your flavor is. 
Jon Krohn: 12:36
This episode is supported by the AWS Insiders podcast: a fast-paced, entertaining and insightful look behind the scenes of cloud computing, particularly Amazon Web Services. I checked out the AWS Insiders show myself, and enjoyed the animated interactions between seasoned AWS expert Rahul (he’s managed over 45,000 AWS instances in his career) and his counterpart Hilary, a charismatic journalist-turned-entrepreneur. Their episodes highlight the stories of challenges, breakthroughs, and cloud computing’s vast potential that are shared by their remarkable guests, resulting in both a captivating and informative experience. To check them out yourself, search for AWS Insiders in your podcast player. We’ll also include a link in the show notes. My thanks to AWS Insiders for their support.
13:26
Awesome, Harry, that all makes perfect sense to me. Let’s dig a bit more into that idea of Git version control. 
Harry Glaser: 13:36
Yeah. 
Jon Krohn: 13:37
So, version control tools like Git, these were built for software engineers originally, but data scientists are using them regularly today. Do you think that that’s ideal for data scientists? Do you think [crosstalk 00:13:51] 
Harry Glaser: 13:52
Well, I do think version control is ideal for data scientists. So, let’s start there. We want to have some version control, right? We built a model, we deployed it, then we changed it, then we redeployed it. The old one can’t be lost. That’s not the answer. There are a couple of things that are challenging about Git for data scientists. Number one, the models themselves are typically very large objects, right? I’ve seen these big neural nets be hundreds of megabytes or gigabytes, and you could get, you know, up from there, right? And if you look across the industry at other adjacent industries that have very large assets in their version control, they typically don’t use Git. Git is notably bad at large binaries. And so the video game industry, for example, will use like Perforce, right? But in the tech industry, that is not generally gonna be an option, because there’s an army of software engineers at your company and they’re using Git and they’re on GitHub or GitLab. And so we want to be in GitHub or GitLab. So, that just means what Modelbit will do is it’ll be backed by Git and it’ll interact with your Git account, but we will stub the large files in Git and replace them essentially with a metadata file that specifies their signed URL in S3. And so that’s just a little improvement that we make. The second and more interesting. Yeah, this- 
Jon Krohn: 15:16
So, this idea of stubbing means that it looks and feels to the user on the front end like they’re just using Git. But on the back end, there’s some cleverness that references back to an S3 bucket in AWS. 
Harry Glaser: 15:32
You literally just run git-push and we’ll notice a large file, put that metadata stub in, upload the large file. I should specify these things in order. First we upload the thing, then we get the pre-signed URL information, we put that in the Git, we check it in. Great. And then when you git-pull, that process happens in reverse: as part of the git-pull, the file gets downloaded and put in the local file system. So, that’s sort of a quality-of-life improvement that we make. You get to version control your models, you get to put them in your GitHub or GitLab account in the same place as all your other code at your company. Everybody can collaborate on it. You can do all your GitHub Actions, you can do merge requests and branching, and it all works, but we don’t bog down your Git repository with huge binary files. So, nice little win. I also think what’s important is, you know, calling back to mb.deploy() from a notebook. You know, you’re developing your model in a notebook, and checking in the ipynb file itself is like annoying. They’re just not amazing in- 
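The mechanics Harry describes resemble Git LFS-style pointer files. Here is a hedged Python sketch of the push side of that idea, assuming boto3 and a hypothetical bucket and stub format; it illustrates the pattern, not Modelbit’s actual implementation.

```python
# Illustrative sketch of large-file stubbing (not Modelbit's actual format):
# upload the model artifact to S3, then commit a small metadata stub instead.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-model-artifacts"  # hypothetical bucket name

def stub_large_file(local_path: str, key: str) -> str:
    # 1. Upload the real file first
    s3.upload_file(local_path, BUCKET, key)
    # 2. Generate a pre-signed URL so a later pull can fetch it back
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": key},
        ExpiresIn=7 * 24 * 3600,
    )
    # 3. Write the stub that gets checked into Git in place of the binary
    stub_path = local_path + ".stub.json"
    with open(stub_path, "w") as f:
        json.dump({"s3_key": key, "presigned_url": url}, f, indent=2)
    return stub_path

# On git-pull the reverse happens: read the stub, download the file,
# and place it in the local filesystem.
```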
Jon Krohn: 16:28
A hundred percent. Yeah. Because very small changes that you make to the Jupyter notebook end up being a huge apparent change. Yeah. Though you’re [inaudible 00:16:39] the whole thing. Yeah, exactly. 
Harry Glaser: 16:40
So, some companies prefer to have the notebook itself also in Git, just as a management thing, so you can get it, which is fine. But in terms of version controlling the model and the model code, it’s not the right metaphor. And so when you call mb.deploy() from the notebook, the code of the model, usually there’s a bunch of Python code that does like feature engineering and input checking and things like this, that code as well as the model itself get put into Git as source files, rather than the ipynb file getting put into Git. And so that’s really nice from a model versioning point of view. So, what you see is a clean diff of the actual model code, and if you roll back, you’re rolling back the model code, you’re not rolling back the notebook. 
17:20
And so the notebook persists as this kind of experimental interface. So, that’s really cool too. And the upshot of all that is that Modelbit is not actually backed by some multi-tenant MySQL database. It’s backed by your git-repo. So, when you go to app.modelbit.com/, you know, your company/whatever, there’s a fetch from a git-repo that happens that powers what you see on the screen. And then when you make changes in the app, that’s a commit that’s made to the git-repo. DBT also notably works this way. And we found it’s a really nice sort of workflow for data professionals. 
Jon Krohn: 17:53
Very cool. That sounds really good. And so, a while ago, I don’t know if you have now ventured onto this, but you had like a number one, and then I interrupted you and dug into the stubbing just as you were saying number two. 
Harry Glaser: 18:05
To bring us back to the table of contents, I think number one was the sort of stubbing, which is, at the end of the day, just a quality-of-life improvement, but we’re proud of it. And then number two is the way the notebook and the web application interact with your git-repo, which we think is maybe more strategic. 
Jon Krohn: 18:18
Nice. Very cool. Yeah, that does sound like a nice way of interacting with these notebooks. It resonates for me because I just take Jupyter Notebooks, these ipynb files, and, you know, for my teaching stuff online, for example, at these GitHub repos where the Jupyter Notebook is, I make this tiny change, like I’ll change a lowercase letter to a capital letter in one of the- 
Harry Glaser: 18:45
Right, the whole JSON representation goes crazy. Yeah.
Jon Krohn: 18:48
Then it makes me feel like I’m making these huge commits. ’Cause it’ll be like, “you changed 10,000 lines of code.” 
Harry Glaser: 18:57
I like that. Right. I still get to check in a little code every now and then to Modelbit. Probably not a great use of time, but I’m just hoping that five years from now, when we’re a big company, the engineers will see me in the git-blame still. I get some credit from them, you know? Yeah. Yeah.
Jon Krohn: 19:12
Nice. So, all right, so that covers version control. You’ve mentioned GitHub and GitLab. So, GitHub, probably most listeners are familiar with. It is a place, sure, that you can be storing your code. Lots of other assets potentially, although that’s not a great idea in the sense of large models, as we’ve just been talking about. What’s GitLab? How’s that different from GitHub?
Harry Glaser: 19:35
Yeah, they’re very, very similar. They are both sort of hosted Git repositories with a bunch of features that help you with things like CI/CD and maybe some webpage hosting and things like that. GitLab seems to me, I’m not an expert on this, but it seems to position themselves more as an enterprise platform. GitHub, by contrast, has this whole community-focused aspect of what they do. They host a lot of open-source projects, GitHub stars and comments, and all this stuff. And GitLab seems a little more enterprise-focused in their positioning, which suggests to me that they might have a bunch more enterprisey features. Although, again, I’m not an expert on this, but I will say they seem to be broadly similar.
Jon Krohn: 20:13
Yeah. So, we got into this whole conversation about GitHub and version control because you were mentioning how Modelbit allows you to deploy very simply, potentially with just one line of code. And then all of these kinds of things like version control, CI/CD – Continuous Integration, Continuous Deployment – things like load balancing, things like logging, these all then come automatically. So, for our listeners who are data scientists who are prototyping models and haven’t done the more software engineering, machine learning engineering aspects of model deployment: what is CI/CD? What is load balancing? What is logging? 
Harry Glaser: 20:52
Yeah, sure. In no particular order. I mean, actually backing up, you know, when you call mb.deploy(), and we’re really proud that there’s this sort of one line, now it’s in production. Well, in production means a lot of things. And it has a lot of needs, right? There’s a reason why, prior to Modelbit, there were full-time ML engineers, full-time software engineers working on this. So, if you get a lot of load to your product, maybe it’s Monday morning, maybe you have a big new marketing promotion, whatever, a lot of people are using your product now, that might result in more load on the machine learning model, and you need literally a load balancer sitting in front of it, you know, managing that load so it doesn’t crash. So we supply one of those.
21:31
You know, CI/CD is, as you make changes and push them, all of your testing and all of your deployment processes run automatically. And so we certainly supply that, but also, maybe more importantly, it’s pretty common for companies to have these set up, let’s say in their GitHub account, where, as part of a push or as part of a merge request, these various testing and integration processes run. And because we’re backed by Git in your GitHub account, all of those would run for your model deployments too, automatically, which is a really nice win for the rest of the company in terms of getting these data science models in the mainline with the rest of the company’s products and processes. And then, you know, you mentioned logging, which obviously, once you’ve deployed the thing, you want to know what inputs did it receive, what inferences did it give? And if you have any logging code as part of your model, you know, you want to print out what it was thinking or what parts of the code were reached or whatever. All that gets put in the logs, and the logs can be integrated with your company’s favorite logging system or just viewed in your Modelbit application. And so all those things, I mean, there’s a bunch more, but there’s a whole constellation of things that get stood up as part of a deployment, so that you have a relatively full-featured suite that you will discover you need once you deploy a model. 
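To illustrate the logging piece, here is a minimal, tool-agnostic Python sketch of the pattern Harry describes, recording every input and inference so they can be analyzed later. It is a generic decorator, not Modelbit’s internal logger.

```python
# A generic sketch of inference logging: record inputs, outputs, and
# timing for every call so they can be analyzed later. This is a common
# pattern, not Modelbit's internal implementation.
import json
import time
import functools

def log_inferences(log_path: str):
    def decorator(predict_fn):
        @functools.wraps(predict_fn)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = predict_fn(*args, **kwargs)
            record = {
                "ts": start,
                "inputs": {"args": args, "kwargs": kwargs},
                "inference": result,
                "latency_ms": (time.time() - start) * 1000,
            }
            # Append one JSON record per call (JSONL format)
            with open(log_path, "a") as f:
                f.write(json.dumps(record, default=str) + "\n")
            return result
        return wrapper
    return decorator

@log_inferences("inferences.jsonl")
def predict_lead_score(f1: float, f2: float, f3: float) -> float:
    return 0.5  # stand-in for a real model call
```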
Jon Krohn: 22:47
Yeah, that all makes perfect sense to me. The load balancing is obviously critical, because you want to be deploying a model that lots of people are gonna use. And so you can’t just have it running on a single server. As you start to have 10 users, a hundred users, a thousand users, ten thousand users, the model is going to need to be scaled up. Your whole application is gonna need to be scaled up. So, it’s cool that Modelbit allows you to do that in an automated way. And then in addition, this logging point that you’re making is so critical, because once you have your machine learning model in production, you need to understand how your users are using the model with respect to the inputs, because there are going to be inputs that they provide that result in suboptimal outputs. 
23:36
And you’re gonna want to be able to dig into those. You’re gonna want to be able to check the kinds of inputs that people are putting in that weren’t anticipated. Issues like drift, which we’ll get into, can be handled that way. And then critically, what the outputs were, given a particular input. This is so important, especially in this large language model, LLM world that we’re in, where those inputs and outputs can end up being invaluable for fine-tuning your production model or some other model that you might like to build. So, these user inputs and outputs can end up being proprietary data that provides your company with a defensible moat. Because let’s say, you know, for your original production model, you might have used data that you scraped from the internet, and your competitor could get those data as well. But once you start to have your users interacting with your platform as they use it more and more, you have this proprietary, defensible data set that makes- 
Harry Glaser: 24:37
No, it’s an important point, because you’ll get to a place where, I mean, you mentioned drift, you’ll get to a place where eventually the model will drift in some capacity, right? And you are trying to understand why. You might start out by thinking, I just need to log the inferences, or I just need to output the inferences and I’ll figure it out. But then you want to know why the inferences are changing, and all of a sudden you discover, gosh, it would be really nice to know if the inputs have been changing over the last month or year or whatever it’s been. And so having access to those inputs as well, and all the log information, it’s one of those things that you discover you need retroactively. 
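One simple, standard way to quantify that kind of input drift from logged data, not specific to Modelbit, is a two-sample Kolmogorov-Smirnov test comparing a recent window of a feature against a reference window:

```python
# A standard, tool-agnostic drift check: compare a recent window of a
# logged input feature against a reference window with a two-sample
# Kolmogorov-Smirnov test (SciPy).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)  # training-era inputs
recent = rng.normal(loc=0.3, scale=1.0, size=5_000)     # last month's inputs

stat, p_value = ks_2samp(reference, recent)
if p_value < 0.01:
    print(f"Possible input drift: KS statistic={stat:.3f}, p={p_value:.2e}")
```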
Jon Krohn: 25:09
Awesome. Yeah, exactly. And so I think the only one that we haven’t gotten back to yet is the CI/CD. So, what is Continuous Integration / Continuous Deployment? 
Harry Glaser: 25:19
Yeah. It’s broadly the idea that every time you either push to Git or do a pull request in GitHub and merge that branch into the main branch, there’s a set of technical processes that run, and they’re usually about running the tests. You have your unit tests, and then, as it’s named Continuous Integration, you’ll have your integration tests and whatever else needs to happen to ensure that the thing is ready for production, right? And so, as a mature tech company, you might just want a set of tests that runs every time a change is made to production, that makes sure that the site doesn’t crash, the core functionality works, users can still log in, whatever. Machine learning models are no exception. If you’re pushing this machine learning model into the product, we want to ensure it’s not gonna break the product. 
26:09
And so, because we’re backed by your GitHub repo, you do an mb.deploy(), that triggers a git-push, and if you’ve been using a branch and you merge that branch, it triggers your CI/CD. The product’s tests will run again before the model arrives in main and is therefore being used in the product. Companies with mature engineering processes really appreciate that aspect of it, that Modelbit’s pushes are integrated into the same mainline CI/CD as every other part of the company’s code processes. 
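As a sketch of the kind of pre-merge gate Harry is describing, here are a few hypothetical pytest checks a CI pipeline could run on every push before a model version reaches the main branch. The module and function names are invented for illustration.

```python
# Hypothetical sanity tests a CI pipeline might run on every push/merge
# before a model reaches the main branch (e.g., via pytest).
import pytest
from my_model import predict_lead_score  # hypothetical module under test

def test_returns_probability_in_range():
    score = predict_lead_score(0.2, 0.5, 0.9)
    assert 0.0 <= score <= 1.0

def test_rejects_missing_inputs():
    with pytest.raises(TypeError):
        predict_lead_score(0.2)  # too few features should fail loudly

def test_known_input_regression():
    # Pin an inference on a fixed input so silent behavior changes fail CI
    assert predict_lead_score(0.0, 0.0, 0.0) == pytest.approx(0.5, abs=0.2)
```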
Jon Krohn: 26:39
Nice. Very good explanation, as all of your explanations have been, Harry. You’re masterful at it. 
Harry Glaser: 26:43
Thanks. Thanks. Oh, thank you.
Jon Krohn: 26:44
Nice. So, all right, so we’ve talked about deploying very easily using Modelbit. We’ve talked about these notebooks and other environments that we can be using for prototyping and developing our models. We’ve talked about these software engineering concepts around CI/CD, load balancing, logging. Another topic area is, once we have our model running in production, Modelbit can be helpful there as well, right? So, there’s a really popular platform these days, Snowflake. Even like the Snowflake conference, it’s unprecedented how many people I had reaching out to me, people that I know who are VCs or data scientists, saying, oh, I guess I’ll be seeing you at the Snowflake conference. And I was like, “well, you won’t, I don’t really deal with that myself.” 
Harry Glaser: 27:39
I just got back, actually, we were there. I wasn’t there the whole time. 
Jon Krohn: 27:42
Of course, of course you were. 
Harry Glaser: 27:43
But some of our customers were there. So, you know, I will never pass up, especially in these times, an opportunity to shake the hand of a customer; an opportunity to buy a beer for a customer is not to be taken lightly. So, if a customer lets me know they’ll be at the Snowflake conference, the Snowflake Summit, as they called it, then we went. I also would be remiss if I didn’t mention they’re a great partner to us and actually an investor in Modelbit. So, it was great to see, we were able to catch up with the Snowflake team as well and update them on our business, which they invested in, but more importantly, see our customers. And so we integrate with Snowflake in a variety of ways. I mean, it does tend to have this sort of center of gravity for the company’s data platform and the company’s data team. We were talking about logging, and we sync the logs into Snowflake, which can be really useful. So, we were talking about logging those inputs and outputs. 
28:29
Well, now you want to build a graph of the inputs and outputs, right? Obviously. And you want to compute [inaudible 00:28:34] like statistical methods you might want to use to check out model drift. And so you could build all that on top of Snowflake. We also allow you to get inferences directly in Snowflake, which is actually really cool. So, you know, you might, as part of your daily sort of rebuild of data in Snowflake, want to use a machine learning model to calculate something. Maybe you got a new customer and you want to calculate a health score or something like that. Or you want to re-score the health scores of old customers, and you want to save that score in Snowflake so that it’s accessible to the rest of your business systems. Every time you deploy a model into Modelbit, it creates a SQL function in your Snowflake warehouse, and we also work with Redshift, so in your Redshift warehouse, for you to get inferences from that model, which people find really valuable in these kinds of back-office scenarios. 
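Here is a hedged sketch of what calling such a warehouse-side inference function might look like from Python with the Snowflake connector. The function name EXT_PREDICT_LEAD_SCORE, the table, and the connection details are all hypothetical; the episode confirms only that a SQL function is created per deployment.

```python
# Hedged sketch: querying a deployed model's warehouse function from
# Python. The function name, table, and credentials are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="...",
    warehouse="ANALYTICS_WH", database="PROD", schema="PUBLIC",
)

cur = conn.cursor()
# Re-score existing customers in-warehouse via the deployed model's function
cur.execute("""
    SELECT customer_id,
           EXT_PREDICT_LEAD_SCORE(f1, f2, f3) AS health_score
    FROM customers
""")
for customer_id, health_score in cur.fetchall():
    print(customer_id, health_score)
cur.close()
conn.close()
```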
Jon Krohn: 29:25
Mathematics forms the core of data science and machine learning. And now with my Mathematical Foundations of Machine Learning course, you can get a firm grasp of that math, particularly the essential linear algebra and calculus. You can get all the lectures for free on my YouTube channel. But if you don’t mind paying a typically small amount for the Udemy version, you get everything from YouTube plus fully worked solutions to exercises and an official course completion certificate. As countless guests on the show have emphasized, to be the best data scientist you can be, you’ve got to know the underlying math. So check out the links to my Mathematical Foundations of Machine Learning course in the show notes or at jonkrohn.com/udemy. That’s jonkrohn.com/U-D-E-M-Y. 
30:09
Very cool. Why would somebody choose Snowflake versus say Redshift?
Harry Glaser: 30:14
Gosh, this goes back to my last company, which I think you mentioned at the start. We ran a BI company before we started Modelbit, and we spent a lot of time with Redshift and Snowflake. You know, they’re broadly similar. Snowflake is maybe more well-known for separating the compute layer from the storage layer. And so that’s really nice if, say, you have a relatively modest amount of data but you need a lot of computational power, or vice versa; maybe more common is vice versa, where you want to scale up the amount of data storage you’re doing without needing to scale up the number of computers you have actually doing the processing of that data. And so that makes it really nice for really, really large data sets. It’s also just modern. It’s got a lot of nice modern web application features, and they are pushing the envelope in a lot of ways, sort of allowing you to do inline Python with their Snowpark feature. They’ve been advertising a bunch of generative AI stuff in the warehouse at their Snowflake Summit this week. So, we like Snowflake and we’re proud to be partners with them.
Jon Krohn: 31:18
Very cool. Nice. But for all of these positives around solutions like Snowflake and Redshift that people could be using and running their models from, my understanding from my conversation with you before we started recording is that, for the most part, people actually prefer to run their models within their own product.
Harry Glaser: 31:40
You know, data science and machine learning is just moving so fast right now. That’s actually my favorite part of being in this space, how fast it’s moving. And so, yeah, something that’s taking off right now, and has become the most popular way to use Modelbit, is to deploy your ML models into your product. You may have a product whose main core value, and this is really common in like fintech, where the underwriting model that predicts fraud or anticipates how much credit they should extend to a user is the core product or the core business model. And they will need that model in their product. Your user signs up for the product, say they’re on a mobile app, they’re signing up for a product, and the product needs to predict the creditworthiness of the user right then and there. And so they need their machine learning model to run in their product. And it’s very common to use Modelbit to deploy those models that way. 
Jon Krohn: 32:29
Very cool. So yeah, so this is the idea that somebody already has their app stood up, they already have their infrastructure stood up, and it’s just easiest to run right in there. Cool. In the context of Snowflake, we ended up mentioning DBT as well. So DBT is another one of those names that’s taken on a life of its own; you know, it’s the name of a company and yet it is almost like a verb in our industry. So, I know that Modelbit in particular integrates with DBT for the purpose of orchestrating models. So, there’s another verb. What does orchestration mean? 
Harry Glaser: 33:10
Yeah. So, I’m not sure if DBT would agree with this framing or not, maybe they would, but one way to think about DBT is as an orchestration layer for your data in your warehouse, like Snowflake. And so what DBT will do, and we’re big users of DBT, we love them, what DBT will do is allow you to build the data organization inside of your warehouse on a schedule. So, let’s say you’re getting new users every day, or you’re logging new information from your application every day, raw into Snowflake. And then DBT will allow you to build your sort of data models, and these are using the word model differently than we have been using the word model, but your data models inside of your warehouse, so that you have an agreed source of truth for what is a customer, what is a user, how much revenue did we have, how much did we churn this month, whatever. You’ll build that into your warehouse itself using DBT, using commonly agreed-upon definitions between your data and business teams. 
34:05
And you may want to run your ML models as part of that process. Let’s say you write a row of data into Snowflake every time a new lead comes in through your homepage, right? And signs up for the product. Well, you might want to score that lead. You might want to know, is it an A lead or a B lead or a C lead? And based on the lead score, you might then want to make some business decisions, like is it gonna get routed to the enterprise sales team or the mid-market sales team? And that’s a machine learning model that runs to do that lead score. It might be a classifier or a regressor, and you can run it right from DBT. So, again, if you deploy a model using Modelbit, it is available immediately in your DBT layer, and then you can run that function from your DBT layer to get that inference and then write it back to the warehouse as part of the data model, so that you can make your business decisions. 
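As a hedged illustration of that lead-scoring step, here is what it might look like expressed as a dbt Python model (supported in dbt 1.3+ on Snowflake via Snowpark). The upstream model name raw_leads and the warehouse function EXT_PREDICT_LEAD_SCORE are hypothetical; Harry describes calling the deployed model’s SQL function from the DBT layer, and this is one way to write that step.

```python
# Hypothetical dbt Python model (e.g., models/lead_scores.py) that scores
# each new lead with the deployed model's warehouse function and grades
# it for sales routing. Names here are illustrative, not Modelbit's API.
from snowflake.snowpark.functions import call_function, col, when

def model(dbt, session):
    leads = dbt.ref("raw_leads")  # hypothetical upstream dbt model

    # Call the SQL function the deployment created in the warehouse
    score = call_function("EXT_PREDICT_LEAD_SCORE",
                          col("f1"), col("f2"), col("f3"))
    scored = leads.with_column("lead_score", score)

    # Route based on the score: A leads to enterprise sales, and so on
    return scored.with_column(
        "lead_grade",
        when(col("lead_score") > 0.8, "A")
        .when(col("lead_score") > 0.5, "B")
        .otherwise("C"),
    )
```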
Jon Krohn: 34:51
Very cool. Makes perfect sense. Very nice explanation. And so, all right, so we’ve gone through all of these steps where we’ve got our model deployed, we’ve got CI/CD and load balancing and logging making sure we’re tracking all the right things in production, that information is being stored in a data warehouse like Snowflake or Redshift, and then we have DBT that’s orchestrating all of this, potentially including running models to create additional data, to make inferences. One last key step once we have all this up and running is making sure that our models are behaving well in production. So, fill us in on how tools like Looker and Tableau can be used to monitor our models once they’re up and running. 
Harry Glaser: 35:44
Yeah, it’s really common, you know, once your model is deployed, to want to know how it’s doing, right? At minimum, you’ll want to look at some kind of scatter plot or line chart of your inferences over time. But there are also various statistical methods that you might want to apply to measure more rigorously whether the model is performing according to the specifications it ought to be performing to. There are dedicated ML observability platforms. We integrate with them, we like them, but we have observed that it’s maybe most common that you already have a business intelligence product like a Looker or a Tableau or a Power BI or something like that. And so we can just make the raw logging information, the inputs and outputs, like we said, available in a business intelligence tool. And it’ll be really common that maybe the head of data at the company or some of the BI analysts or data scientists already have dashboards they look at every morning. 
36:34
And so just a quick glance at how the model is doing. Did it crash? Is it returning all ones or all zeros? Does it seem to be trending in a bad direction? That will just be part of their daily check. So that’s the bread-and-butter use case for business intelligence platforms like Looker and Tableau with Modelbit. But the sort of cool one that I like is you can actually use it also to play with a model. So, this notion of just poking at it, playing with it, right? It doesn’t sound rigorous, but it’s something every data scientist does. You’ve trained a model in a notebook, you’re trying it out, you’re deciding whether you like it, whether it’s good enough to deploy. You’ll obviously have statistical methods and statistical functions, R2 scores and whatever, that tell you whether it is better than other versions of the same model. 
37:19
But I promise, every data scientist in the world also just pokes at it, plays with it, gives it their favorite test inputs and sees what it does. Does a little Matplotlib output and just checks it out, right? What does it do on my favorite inputs? That’s something you do in code, but the business people can also poke and prod and play with the model. You deploy it, you get to play with it in Looker, you get little dropdowns that just say, okay, let’s try it on this lead, let’s try it on that lead. Our biggest customer, what if they had come in as a lead today? Would the model correctly identify them as a good lead? Let’s just try it, right? Let’s just play with it. And so using the business intelligence product as a natural home for playing with and toying with models, and giving the business people the same ability to toy with and play with models as the data scientists have, is a fun little addition to the product, and we’re proud of it. 
Jon Krohn: 38:04
Nice. Yeah, that makes a lot of sense. So, in addition to being able to do monitoring of production systems, for example, looking out for the various kinds of drift that are out there, like your data changing over time or your model no longer being relevant, these various kinds of drift that can happen and that we’ve covered in a fair bit of detail in preceding episodes of the show. So, in addition to being able to do that kind of production monitoring, we can also be using these BI tools like Looker and Tableau to be able to provide an interface for less technical people, maybe product people or people in leadership or salespeople, customer success people, to be able to play around with models and see how changing configurations in them- 
Harry Glaser: 38:52
Yeah, to the extent that we can bring the same, you know, one of the most fun things about sort of getting to play with the technology itself in like a notebook or in code is just the experimental power of it. The ability to stand something up new today, to build something. If we can bring that same sense of experimentation and exploration to the business teams too, and let them just feel the power of the ML a little bit instead of just receiving the business results of it, I think that’s good. I think that’s a good day’s work, you know, we’re proud of it. 
Jon Krohn: 39:24
Nice. Yeah, that makes a lot of sense. And then, so Tableau, Looker, what are the pros and cons of these particular tools? Like, you know, why would you use one or the other? 
Harry Glaser: 39:34
They’re both really popular. They’re both older now and a little bit more of a sort of on-premises flavor. And so I think a lot of the ways that Looker and Tableau get purchased are experienced purchasing managers who already have relationships, or, you know, really hardcore on-premises privacy and security requirements. But there’s a modern generation of tools, and they tend to be a lot cheaper and more innovative: Preset, which is based on an open-source product called Superset; Redash. You can use notebooking tools like Hex, which we talked about earlier, as dashboards, and that’s fun. Snowflake comes with Snowsight, which is a really cool product. So, you know, if you’re asking, I spent 10 years in BI, and so for me, those are the cutting edge of BI, and that’s where I would put my recommendation.
Jon Krohn: 40:24
Very cool. Thanks for those. That’s a really nice list. So, in addition to the more established tools like Looker and Tableau, we’ve got things like Preset, Superset, Hex, which we obviously talked about earlier in the context of notebooks, as well as Snowflake’s Snowsight. Crystal clear. All right, so Harry, you clearly know a ton about the data stack today, about how people are using data tools, particularly for machine learning. Where’s all of this going? So, where’s the stack going? Where’s machine learning going? I think you probably have some unique insights into this. 
Harry Glaser: 40:59
Well, gosh, like I said, the coolest thing about working in this industry at this time is how unbelievably fast it’s moving. And I think a corollary to how unbelievably fast it’s moving is how unbelievably foolish we all look next week evaluating our predictions from this week. And if you think weekly is not the right cadence, I would challenge you on that, because it’s moving really, really fast. [inaudible 00:41:21] as of the day that we recorded this, which is, I think, a month before we posted, so, you know, some humility here, but the hotness right now is these large language models. And so, if I can give a quick history lesson, we have this technology, which I believe came from Google, but it may have been a collaboration with other research labs as well, around the ability to make neural nets much larger than they were before. And we called those deep neural nets. And I originally became aware of these when they were used to win at Go against top Go players, and that was cool. And then-
Jon Krohn: 41:59
The deep neural net stuff itself goes back to the fifties. 
Harry Glaser: 42:02
Well, neural nets certainly go back to the [crosstalk 00:42:04] 
Jon Krohn: 42:04
Yeah, yeah. But you’re talking about being able to scale them up. So there are key individual moments like the AlexNet architecture in 2012. That’s University of Toronto researchers, but Geoff Hinton is also, I don’t know if he was at Google at that time or not. 
Harry Glaser: 42:25
Yeah. And then, you know, building on that, tensors and TensorFlow, and then Facebook putting what is, in my opinion, maybe a nicer, easier-to-use layer on it with PyTorch, and then that has led to the development of these language embeddings and then these large language models. And now it’s hit the public consciousness, because a large language model is the first time that an average user can, like, play with one in real time and see what it can do. 
Jon Krohn: 42:55
I also thought where you were gonna go with that initially is, I thought you were gonna say how large language models depend on this particular transformer architecture of deep learning, and that was developed at Google. 
Harry Glaser: 43:06
Yes. Sorry, I’m like fast-forwarding and- 
Jon Krohn: 43:09
Yeah, yeah, yeah. 
Harry Glaser: 43:10
That’s absolutely correct, and the summary point I was trying to get to eventually, when I stopped rambling about history, was that these large language models, and the chatbots on top of them and whatever, are like the hotness right now. But at every checkpoint step along the way, we’ve seen new results for businesses, even in more mundane scenarios as well. And so non-large-language models, but still deep neural nets, have been, I mean, I think deep nets are now the most popular type of model that’s deployed into Modelbit. And you see all kinds of, I mean, we have these customer use cases that are wild. I mean, veterinary technology that looks at high-resolution images of blood cells in order to predict the outcome of blood tests. You know, marketing automation software that can tell you what revenue you will have if you change your marketing spend in certain channels in certain ways. These are just top of mind right now, companies I talked to today. A security company that looks at live footage from security cameras and will actively detect threats. These things are just the coolest, and you see things that weren’t possible six months ago based on these new technologies. 
Jon Krohn: 44:21
That’s cool. It’s interesting that deep learning is the most popular in Modelbit. That’s probably something that’s really recent, probably with LLMs, that that’s just happened, because it was often the case that deep learning was useful for some particular use cases, like machine vision stuff, some natural language processing stuff. But there were all kinds of use cases, and still today, I think this is true, with like tabular data, where you probably wouldn’t go to a deep learning model first. You’d probably consider something like a random forest. So, it’s interesting that, and it probably correlates with this generative AI, this LLM explosion, this transformer explosion, that now deep learning is the most popular model type overall in Modelbit. 
Harry Glaser: 45:09
I think there are two things happening. One, to your point, is that there’s an unlocking of machine learning on new data types where that would’ve maybe been prohibitively expensive in real time before. And so that’s your, like, real-time security detection of incoming video feeds, right? It’s not tabular data; it’s much, much higher-bandwidth data, much larger data. But to your point, we also have the use of incredibly large, sophisticated machine learning technologies on what is fundamentally tabular data, like that marketing spend example I gave you. And yeah, I agree, six months ago they probably were using regression or a random forest or something, and now they’re using a neural net. It’s wild. And, you know, I also hear from peers and friends some skepticism about whether neural nets are really the right answer for that. Is this model truly outperforming what a regression would do? But I think the answer is TBD, and I think it’s awesome that we’re all experimenting with these things. 
Jon Krohn: 46:08
Cool. All right, Harry, this has been a fabulous episode. You’ve been able to provide really concise, information-dense answers to questions up and down the data stack. It’s been super informative for me, and so I imagine for a lot of our listeners as well. Before I let you go, Harry, do you have a book recommendation for us? 
Harry Glaser: 46:32
Oh, do we have to stay work-related, or just any book recommendation? 
Jon Krohn: 46:35
No, could be anything. 
46:36
It can be a “not safe for work.” 
Harry Glaser: 46:40
Well, maybe on our next podcast, when the audience and I know each other better. But I think, you know, just front of mind, I’m a sucker for Kazuo Ishiguro. He had a new novel out in the last couple years. And The Remains of the Day is one of my favorites, Never Let Me Go is one of my favorites. And maybe touching on the prospect of this, this … all right, I’m gonna try to tie it all in. Let’s see how I do. We were talking earlier about the humility that comes from a fast-moving industry and how all our predictions will be wrong by the time this episode actually airs. And Ishiguro is famous for the unreliable narrator, you know, where it’s sort of a light tragedy where, around the climax of the novel, two-thirds of the way in, three-quarters of the way in, you realize that the person who’s been telling the story is misunderstanding their own world in some tragic way. And so I think it also helps communicate humility to those of us in a fast-moving industry. How about that? I think Ishiguro would tell me to take a hike, but there you go. 
Jon Krohn: 47:38
That was epic. That is the most interesting tie-in from a book recommendation to the rest of an episode’s content that I’ve had in the hundreds of episodes that I’ve been hosting this show. So, very nice. 
Harry Glaser: 47:52
This is just so I can tweet about it and say, “how is Kazuo Ishiguro related to machine learning?” A thread, one of twelve.
Jon Krohn: 47:59
Perfect. All right. And so, speaking of which, my final question for you is how people can follow you. So, it sounds like Twitter is one of those places. 
Harry Glaser: 48:07
Oh, sure. So, first of all, most importantly, Modelbit is at www.modelbit.com, M-O-D-E-L-B-I-T dot com. We’re also at Modelbit on Twitter, and I am @HarryGlaser, H-A-R-R-Y-G-L-A-S-E-R, on Twitter. 
Jon Krohn: 48:20
Nice. All right. Thank you so much, Harry. I’m sure lots of folks will want more information from you on this fast-moving data and machine learning world after the episode. Thank you so much for coming on and providing this fascinating episode and hopefully we’ll catch up again with you sometime in the future. 
Harry Glaser: 48:37
Thank you, Jon. It’s a pleasure. 
Jon Krohn: 48:44
Nice. Harry’s crisp delivery made creating this episode a breeze. I hope you took a ton away from it. In today’s episode, Harry filled us in on how Deepnote and Hex are cloud-based, Jupyter-like notebooks designed for data scientists to be able to collaborate easily. How version control, CI/CD, load balancing, and logging are essential for effectively deploying an ML model. How data warehouses like Snowflake and Redshift can be leveraged for running your ML models. How DBT can be used to orchestrate the organization of data in your warehouse, including to run ML models that add valuable data to your warehouse. And how BI tools can be used not only to monitor ML models in production for issues like drift, but also to allow stakeholders to play around with ML models as you prototype them. 
49:26
As always, you can get all the show notes including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Harry’s social media profiles, as well as my own social media profiles at www.superdatascience.com/699. That’s www.superdatascience.com/699. And if you enjoyed this episode, nothing’s more valuable to me than if you rate the show on your favorite podcasting app or give the video a thumbs up on the SuperDataScience YouTube channel. And of course, if you have friends or colleagues that would love the show, spread the good word. All right, thanks to my colleagues at Nebula for supporting me while I create content like this SuperDataScience episode for you. And thanks of course to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the SuperDataScience team for producing another super informative episode for us today. For enabling that super team to create this free podcast for you, we are deeply grateful to our sponsors. Please consider supporting this show by checking out our sponsors’ links, which you can find in the show notes. 
50:25
Finally, thanks of course to you for listening. I’m so grateful to have you tuning in and I hope I can continue to make episodes you love for years and years to come. Well, until next time, my friend, keep on rocking it out there and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon. 