SDS 792: In Case You Missed It in May 2024

Podcast Guest: Jon Krohn

June 12, 2024

Jon Krohn shares his favorite moments from his guest interviews in the month of May.

 
In case you missed our previous episodes of “In Case You Missed It”, this is our new end-of-month series where we share our favorite clips from the last month. In May, we heard from Sol Rashidi, Navdeep Martin, and Demetrios Brinkmann, who gave Jon Krohn their thoughts on a great range of topics that should interest anyone working in AI, from the importance of understanding the differences between job titles in AI and tech to learning about the companies that are working to tackle the climate crisis.
You’ll also hear from Luis Serrano of Serrano Academy, who talks about his latest course on semantic search and LLMs, created in partnership with deeplearning.ai, and the one key item crucial to running semantic searches.
You can listen to the full episodes completely for free through the links below.  

Podcast Transcript

Jon Krohn: 00:00:06

This is Episode #792, our “In Case You Missed It in May” episode. 
00:00:28
Welcome to the Super Data Science Podcast, I’m your host, Jon Krohn. This is an “In Case You Missed It” like we’ve been doing in recent months. This one highlights the best parts of conversations that we had on the show last month, in May 2024. 
00:00:28
We begin with episode 785, where I interview fellow YouTuber, Dr. Luis Serrano. Luis runs Serrano Academy, and they recently partnered with deeplearning.ai to create a course on semantic search and large language models. To understand semantic search and LLMs, we need to understand embeddings as well. So, in this interview clip, Luis talks to me about what embeddings are, how they function, and how essential they are for running semantic search queries. 
00:01:09
Let’s talk a bit more about semantic search. Let’s dive into embeddings.
Luis Serrano: 00:01:12 
Sure. 
Jon Krohn: 00:01:13
So how would you describe embeddings and say how semantic search differs from a traditional keyword search? 
Luis Serrano: 00:01:23
Yeah, yeah, yeah, great question. So to me, embeddings are the most important object in LLMs. And I would go as far as saying that in many fields of machine learning they’re the most important object, because it’s really where the rubber meets the road. It’s where we translate our language to computer language. Computer language is numbers and only numbers. And so if we’re working with images, we need an image embedding that turns images into numbers. If we’re working with text, we need a text embedding that turns text into numbers, or sounds into numbers. For anything we want, we need an embedding. And if that embedding is not super strong, you’re not doing anything. So the better embeddings get, the better LLMs get. If somebody comes up with a better embedding tomorrow, believe me, all the models get better, because it’s just a better way to turn text into numbers. So some problems that were hard 10 years ago, like classification, are now much easier because embeddings are better.
00:02:15
What’s an embedding? An embedding is just that: you’re literally associating your text, your words, with a bunch of numbers. So a vector, a list of numbers. I like to see them graphically. I always bring everything down to the simplest example. I like to see it as sending words to pairs of numbers. What’s a pair of numbers? It’s a coordinate in a plane. And so I imagine all the words flying around, and words that are similar get put close together. If I have an apple and a pear, they get sent to similar pairs of numbers, which is a similar location.
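Luis’s picture of words as coordinates can be sketched in a few lines of Python. The 2-D vectors below are invented for illustration; a real embedding model would produce thousands of dimensions per word:

```python
import math

# Toy 2-D "embeddings": each word is mapped to a pair of numbers,
# and similar words (apple, pear) land at nearby coordinates.
# These vectors are made up for illustration, not from a real model.
embeddings = {
    "apple":  (0.9, 0.8),
    "pear":   (0.85, 0.75),
    "rocket": (-0.7, 0.1),
}

def cosine_similarity(a, b):
    """Closeness of two vectors: near 1.0 = same direction, near 0 or below = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

sim_fruit = cosine_similarity(embeddings["apple"], embeddings["pear"])
sim_other = cosine_similarity(embeddings["apple"], embeddings["rocket"])
assert sim_fruit > sim_other  # the fruits sit closer together in the plane
```

Cosine similarity is one standard way to measure the “closeness” Luis describes; real systems compare vectors with it (or with Euclidean distance) in exactly this spirit, just in far higher dimensions.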
00:02:50
And I also like to think of each coordinate as a description of the word. So in a very toyish example, I could think of the first one as the size and the second one as the flavor, or something like that. But if you have thousands of numbers, then you’re pretty much describing your sentence or your word or your piece of text in a very, very, very detailed way, using a lot of numbers. And maybe these numbers mean something to us, maybe they don’t, maybe they mean something to the computer, but it’s really that.
00:03:24
And then one of the things that embeddings are really useful for is semantic search. When we search, for example… I always use this example. Let’s say you search for a visa to travel from Brazil to the USA, and I want to find the article that has as many words as possible in common with that query. And I find one that says visa to travel from the USA to Brazil. It’s the complete opposite. It doesn’t help me at all, because it’s different, it’s flying in the wrong direction, but it matched all the words. So that’s keyword search. If I find the documents that match all my words, or as many of my words as possible, it’s decent, but it doesn’t get you there, because I could reorganize the words and change the meaning of sentences. That’s when we use semantic search.
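Luis’s visa example can be made concrete: a naive keyword scorer gives the opposite-direction document a perfect score, because the bag of words is identical. This is an illustrative sketch, not code from the episode:

```python
# Keyword search scores documents by word overlap with the query.
# Luis's example: reversing "Brazil" and "USA" flips the meaning,
# but the bag of words -- and therefore the keyword score -- is identical.
def keyword_score(query, document):
    """Fraction of distinct query words that appear in the document."""
    q_words = set(query.lower().split())
    d_words = set(document.lower().split())
    return len(q_words & d_words) / len(q_words)

query = "visa to travel from Brazil to USA"
wrong_direction = "visa to travel from USA to Brazil"

score = keyword_score(query, wrong_direction)
assert score == 1.0  # perfect keyword match, completely wrong answer
```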
00:04:19
So in these embeddings, you’re locating sentences in a plane or in space. And if I have all my documents and they’re flying around, and I send the question to my embedding, then it’s very likely that the answer is going to be close. So embeddings are just a way of mapping, of putting all the text flying in space, and then it’s easy to search in that space instead and find an answer. So we’ve seen a lot of improvement with semantic search versus keyword search. And then when you throw in things like Rerank, for example, it’s very useful, because, see, the closest sentence to something is not necessarily the answer. If you ask me a question, you ask me, what color is an apple? And I answer you, what color is an orange? Then you say, well, what the hell? And I say, well, I just gave you the closest sentence, but it’s not the answer. So Rerank helps you actually locate the answer with some extra training. But yeah, semantic search is in a great place. I’ve seen some great results. So yeah, I’ve been happy to see that.
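The search-then-rerank point can be sketched with toy coordinates. The embeddings below are made up; the takeaway is that plain nearest-neighbor retrieval surfaces the most *similar* sentence (here, a related question), which is exactly why a trained reranker is then applied to the retrieved candidates:

```python
import math

# Semantic search sketch: the query and all documents live as points in
# space, and we retrieve whatever lies closest. The 2-D coordinates
# below are invented, standing in for a real model's thousands of
# dimensions.
documents = {
    "Apples are typically red or green.": (0.90, 0.80),
    "What color is an orange?":           (0.84, 0.86),
    "Rockets burn liquid fuel.":          (-0.70, 0.10),
}
query = (0.85, 0.85)  # hypothetical embedding of "What color is an apple?"

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

ranked = sorted(documents, key=lambda d: distance(documents[d], query))
# The nearest neighbor is the *similar question*, not the answer --
# exactly the failure Luis describes, and why a reranking model is
# needed to promote the actual answer among the retrieved candidates.
assert ranked[0] == "What color is an orange?"
assert ranked[1] == "Apples are typically red or green."
```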
Jon Krohn: 00:05:29
Very nice. So I’ll try to repeat back to you some of what you said. So embeddings allow us to get to a language of computing: we speak in, say, natural language or math or code, and we represent things in that kind of form. When we want a machine to be able to do something helpful for us with that language, we need to convert it into some kind of numeric format. And I love that idea that the simplest embedding would probably have just two numbers, two float values, that you represent some word or some sentence or some document by. And then you can very easily imagine that in a two-dimensional space.
00:06:17
If you all of a sudden have three dimensions, three numbers that you’re using to describe the location of that word or document or piece of code or whatever it is you are embedding, if you go from two to three, then you can imagine it in the 3D visual space we inhabit. And then it becomes hard for us to visualize in four or more dimensions, but for a machine to represent this is trivial. The math is all the same, the linear algebra is the same. And so it’s very common with organizations like Cohere for embeddings to have thousands of dimensions. And as you’re saying, that then allows a huge amount of granularity for semantic search.
00:07:01
So we can embed any natural language, code, math, whatever it is that we want to represent. Its meaning can be understood well in this high-dimensional numeric space that the machine has, and that allows it to fulfill lots of different kinds of tasks. So you mentioned clustering earlier, but I think one of the most interesting applications that we see today with generative AI is things like being able to answer your questions. And so by using this semantic search, you can get into the right region based on the person’s query, like you said, what color is an apple, or something like that. That can bring you to some region of that high-dimensional space, with many thousands of dimensions, where everything’s about the color of fruits. But then once we get to that region, Rerank allows us to find a great answer to the question, as opposed to just some phrase related to the question.
Luis Serrano: 00:08:00
Exactly. Yeah, very well put. 
Jon Krohn: 00:08:03
Nice. Awesome. So that helps us understand embeddings and Cohere’s business model: how they’re helping out the enterprise by ensuring that they have great embeddings, facilitating all kinds of these enterprise capabilities. So what kind of experience… If I’m a user of an application that is powered by Cohere embeddings, versus maybe some old or poorly trained embeddings, maybe some embeddings that I tried to make myself that weren’t well made. Are you able to give some examples of how my user experience will be improved by better embeddings? I guess one we already talked about is question answering.
Luis Serrano: 00:08:54
Yeah. I mean, everything becomes easier. The analogy I use is: imagine if you had a great book, but a really awful translation. So you can only enjoy it so much. You can only understand so much. And so a good embedding just translates your data really well into numbers. For example, 10 years ago, let’s say we had embeddings that were so-so, or even no embeddings; you could just use every coordinate as a different word, and you would have a huge space. But anyway, the fact is, let’s say we have poor embeddings 10 years ago, and you want to build a classification model that tells you, I don’t know, if emails are spam or not, but the region between the spam emails and the non-spam emails is very, very complex. It’s very curvy. In order to separate them well, you have to use a really big neural network to come up with a really complicated boundary. There may be problems of overfitting, et cetera.
00:09:45
Let’s say you have an amazing embedding that picks up things so well that it puts all the spam emails on one side and all the non-spam emails on the other side, and it’s really a line that cuts our plane. I’m exaggerating, but a good embedding makes these problems easier. I’ve done classification problems with three or four examples, so you can do much better now, as opposed to before, when I needed thousands of examples and a huge model. So that’s just classification, but you know, clustering is the same thing. If I have a really good embedding, the clustering becomes much easier because things are located in the right place. So the embedding is the most fundamental thing, and many companies just want the embedding. They’re like, “Okay, we have a bunch of machine learning stuff. I just want your embedding and I can do wonders with it.”
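Luis’s few-examples claim can be illustrated with a nearest-centroid classifier over hypothetical embedding coordinates: when the embedding already separates the classes, three labeled examples per class are enough for a trivially simple decision rule.

```python
# Luis's point: with a strong embedding, spam and non-spam separate
# cleanly, so a tiny classifier over a handful of examples is enough.
# Below, a nearest-centroid classifier over made-up 2-D embeddings.
def centroid(points):
    """Average the points to get a single center per class."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(2))

# Three labeled examples per class (coordinates are hypothetical).
spam     = [(0.9, 0.1), (0.8, 0.2), (0.95, 0.15)]
not_spam = [(0.1, 0.9), (0.2, 0.8), (0.15, 0.85)]

spam_c, ham_c = centroid(spam), centroid(not_spam)

def classify(embedding):
    """Assign a new email's embedding to whichever class center is nearer."""
    d_spam = sum((a - b) ** 2 for a, b in zip(embedding, spam_c))
    d_ham  = sum((a - b) ** 2 for a, b in zip(embedding, ham_c))
    return "spam" if d_spam < d_ham else "not spam"

assert classify((0.85, 0.1)) == "spam"
assert classify((0.1, 0.8)) == "not spam"
```

The real work has all been pushed into the embedding; the classifier on top is just “which centroid is closer”, which is the spirit of the straight-line boundary Luis describes.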
Jon Krohn: 00:10:38
Our next clip is from episode 781, with Sol Rashidi. Sol is an executive who has held C-suite data positions across a range of well-known Fortune 100s, so I was keen to learn how she builds teams that work with data. In this clip, you’ll hear how important it is to her that she keeps distinct the roles of data scientists and data engineers on her teams. 
00:10:57
When you are building out teams, you keep funding for data science and data engineering separate. Do you want to tell us about that? 
Sol Rashidi: 00:11:04
Yeah. I think the field of data engineering is probably one of the most underappreciated spaces and probably the one that I adore, revere, and protect the most. And I wrote a post about this in LinkedIn once. I’m like, “Nothing functions, nothing happens without a data engineer.” I don’t care if it’s a digital campaign. I don’t care if it’s a new product launch. I don’t care if you’re standing up a system so that you could sell in a new market. It could be anything. It does not happen without a data engineer because no business process can be enabled and no growth can be achieved without information, and information is derived from data, and it’s all about making sure that data’s in the right place at the right time for the right people. And who makes that happen? It’s the community of data engineers.
00:11:55
I only have that appreciation because a long time ago, I was a data engineer. I accidentally fell into it. I think you and I were chatting about me going into professional sports and saying, “Okay, I need to grow up and take on a real job.” I found my tribe and my community, but apparently I was a horrible coder. They were like, “You’re not allowed to touch a code ever again.” I was like, “Okay, so what do you want me to do?” They’re like, “We don’t know, but how about you communicate what we’re doing because we want to do the work. We don’t want to talk about the work.” And I was like, “Oh, okay.” So that’s how I migrated from being a data engineer to translating what the data engineering team was doing. And that’s where I gained an appreciation from it. 
00:12:31
Moving the data, protecting the data, making sure it’s of the right hygiene, the orchestration of it, the availability of it, like everything is because of a data engineer. I don’t need a big team of data engineers. I need a lean and mean team of really great data engineers, and I protect them very, very much so because then when I hire data scientists, I don’t want them doing data-engineering work. Their job is mostly focused on the modeling, the algorithms, the predictions, next best actions, recommendations. It’s what they can do with the information once it’s in the state it needs to be in, but I don’t want them playing with the mining, the cleaning, the curating, all that other stuff. 
00:13:11
So I’m very respectful of crafts. So as a data engineer, you’re a master of your craft. As a data scientist, you’re a master of your craft. Now, of course, with the world of full-stack developers, we have back-end developers and front-end developers. Now we have full-stack and the two blend. And you can be a good full-stack developer, but how can you really be a master of both sides of the fence unless all you’re doing is learning 24/7? I think it’s very difficult, so I respect the division of labor and I let people do what they love to do without having to do the grunt work of things that they chose not to do. So I always try to keep the two separate. 
Jon Krohn: 00:13:48
Defining a tech role is evidently no mean feat, so we spend a little extra time on this with another clip on AI roles. In this clip, taken from episode number 787, I explore the differences between ML engineering and MLOps positions with my guest, MLOps community leader Demetrios Brinkmann.
00:14:05
The thing that I was trying to distinguish there, which I’m not sure we got much insight into, was this difference between ML engineer and MLOps. What do you think about that one? 
Demetrios: 00:14:18
Yeah, my fault on being muddy in my answers and going on total tangents, I digress. 
Jon Krohn: 00:14:25
No, that was super interesting. I’m glad we got all that out of you.
Demetrios: 00:14:30
So I would say that is again, the job title. It’s so funny that you mentioned that because an ML engineer can be someone that works on the platform and does MLOps, and it feels like that’s more of what is expected from ML engineers, but it’s not always. And sometimes ML engineers, you’ll see a job posting for an ML engineer, but really it’s a data scientist who does machine learning. 
Jon Krohn: 00:15:01
So yeah, so the ML engineer, in a larger organization with enough people, might sit between the data scientist and the MLOps person. Where the data scientist might be working completely offline, training the model in various ways offline. Then the ML engineer is tinkering with that model, maybe rewriting it using a completely different library to make it performant in production. And then the MLOps person takes that production code that the ML engineer put together, puts it into production, and makes sure that the way that happens is repeatable, that there’s great operations around it. And a key term here, which I think to you, as the leader of the MLOps Community, there’s a term that you’ve used a few times that I think is so obvious to you, but a bit nebulous even to me: you said, “Someone in MLOps is working on the platform.” What does that mean, working on the platform?
Demetrios: 00:16:05
That’s just the velocity of being able to take an idea and put it into production. That is the platform that is there for you to go from exploration mode in a Jupyter Notebook to battle-hardened code that is being monitored and, when it starts to drift, can be retrained. So that is what I consider the platform, and platform engineers build it. ML platform engineer is another term that you’ll hear thrown around. So you get a little bit of everything, I think, and that’s because it is still so new that people are trying to figure out, okay, what do we need and what kind of person is going to help us get there? Let’s try and create a job title for that.
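One piece of the platform Demetrios describes, monitoring for drift and flagging retraining, can be sketched crudely. The threshold and statistics here are invented placeholders; production systems use proper statistical drift tests rather than a simple mean comparison:

```python
# Toy drift monitor: compare the live feature mean against the training
# baseline and flag retraining when it shifts too far. The 0.25
# threshold is an arbitrary placeholder for illustration.
def needs_retraining(training_values, live_values, threshold=0.25):
    """Flag drift when the live mean moves away from the training mean."""
    train_mean = sum(training_values) / len(training_values)
    live_mean = sum(live_values) / len(live_values)
    return abs(live_mean - train_mean) > threshold

baseline = [0.48, 0.50, 0.52, 0.49, 0.51]   # feature averaged ~0.5 at training time
stable   = [0.47, 0.52, 0.50, 0.49]         # production data looks the same
drifted  = [0.90, 0.85, 0.95, 0.88]         # the distribution has shifted

assert needs_retraining(baseline, stable) is False
assert needs_retraining(baseline, drifted) is True
```

A platform wires checks like this into the serving path so the notebook-to-production loop Demetrios mentions can close automatically.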
Jon Krohn: 00:17:01
So platform is kind of like the system. It’s setting up the system, setting up the operations. Cool. Cool, cool.
00:17:07
Job titles are so important for companies to know who to hire, and also for those of you who want to keep your career moving at the same momentum as the AI industry. Our final clip is from episode 783 with Navdeep Martin. No topical discussion about using AI for good in the world would be complete without getting my guest’s thoughts on tackling the climate crisis, especially as Navdeep is building a product that aims to contribute to climate tech in a major way. In this clip, I ask Navdeep how companies are using AI systems to identify best practices that help safeguard our natural environment.
00:17:39
Beyond Flypower and this particular use case that we’ve gone over, what are other ways, maybe to get the brain cells of our listeners going on other ways that they can be thinking about AI to tackle climate change, which is probably something of concern to many of our listeners, and I think a lot of our listeners also want to be making a positive social impact with the data science that they do and the companies that they start, the products that they build. So to get those juices flowing, to get those neurons firing, what are some other examples of ways that AI can be used to tackle climate change? 
Navdeep Martin: 00:18:09
Yeah, absolutely. Prior to Flypower, I worked with a disaster resilience nonprofit. Their name was IBTS, and they were basically advising municipalities on how to prepare for a natural disaster. So in climate tech, you have the prevention companies, which is where Flypower sits: can we prevent climate change from happening? And then you have the other side: basically, climate change is going to happen, so how do we deal with it? So IBTS was on this other side of how are we going to be ready when there’s a natural disaster?
00:18:53
So the municipalities that they’d work with would be towns or cities that wanted to be ready for a natural disaster in their area. So they would engage with IBTS. The IBTS team would take 40 hours on average to produce their initial assessment of that area. So let’s say it’s Fairfax, Virginia; they want to understand, “Hey, what do I need to do to be ready for a natural disaster?” There’d be two people producing this report. They’d be doing extensive research all over the web to be able to produce it. So their finished product was actually 26 pages of content compiled by these two individuals.
Jon Krohn: 00:19:31
Very cool. So this was IBTS. What does that stand for? 
Navdeep Martin: 00:19:35
IBTS stands for Institute for Building Technology and Safety. 
Jon Krohn: 00:19:39
Nice. So while Flypower is doing its best, its utmost, to avoid climate change happening as much as we can, there are other folks out there, like IBTS, that understand that the world has already warmed, we’re already seeing disastrous effects, some regions are more impacted than others, and IBTS is looking to come up with solutions to prepare regions that are likely to be affected. So my apologies here, but where does the AI come in? There’s a generative AI element in what they’re doing?
Navdeep Martin: 00:20:16
So IBTS, as a part of that research effort, would go to the county’s website first off, so orlando.gov, whatever the county dot gov typically is. And those websites are massive. So what they would be trying to do is ascertain whether the county had adopted the best practices themselves to be ready for that natural disaster. So in addition to the county’s website, they’d go to these other ones, like the CDC and FEMA, really trying to understand that particular area.
00:20:51
So they came to me and said, “Hey, can you make this easier for us? We want to try out generative AI. We think this is a really good use case.” So we went down that path and said, “Okay, well, give us your finished reports.” And they didn’t have that many, actually. They had maybe 10… I think it was less than 10 finished reports. So again, this is a great use of generative AI: I don’t need a ton of finished reports. I just need to generally understand the questions. So their reports were questions, and then here’s the answer that they had produced from these websites. So yeah, it’s taking that and being able to do that for a county it’s never seen before.
Jon Krohn: 00:21:30
Gotcha, gotcha, gotcha. So this is actually… There are quite a few parallels between what you’re doing at Flypower and what companies like IBTS are doing. What’s different is the underlying data that they’re using for their generative AI systems and for the kinds of reports that they craft. 
Navdeep Martin: 00:21:45
Yeah, that’s right. I think it’s like a… This example is symbolic of what many other companies are going through right now, which is how do I store this data so that when asked a question, it picks the right thing on the retrieval augmentation side, but also what questions should I be asking? And that’s the fun part. But yeah, how do you tweak this so that it’s not hallucinating, that it’s not coming up with the wrong stuff?
Jon Krohn: 00:22:16
All right, that’s it for today’s In Case You Missed It episode. To be sure not to miss any of our exciting upcoming episodes, subscribe to this podcast if you haven’t already. But most importantly, I hope you’ll just keep on listening. Until next time, keep on rockin’ it out there, and I’m looking forward to enjoying another round of the Super Data Science Podcast with you very soon.