SDS 659: Open-Source Tools for Natural Language Processing

Podcast Guest: Vincent D. Warmerdam

March 7, 2023

Thanks to several recommendations from our listeners, Jon Krohn discovered this week’s guest Vincent Warmerdam. Jon and Vincent talked about the most valuable open-source software libraries for data scientists looking to develop AI or NLP applications, how to manage “skill anxiety” in data science, and how to fix bad labels with the right annotation tools.

Thanks to our Sponsors:
Interested in sponsoring a SuperDataScience Podcast episode? Visit JonKrohn.com/podcast for sponsorship information.
About Vincent D. Warmerdam
Vincent is a senior data professional who worked as an engineer, researcher, team lead, and educator in the past. He’s especially interested in understanding algorithmic systems so that one may prevent failure. As such, he prefers simpler solutions that scale, as opposed to the latest and greatest from the hype cycle. You may know him from his koaning.io blog, his many open source projects, some of his PyData talks or his calmcode.io project. He currently works as a Machine Learning Engineer over at Explosion, the company behind Prodigy and spaCy.
Overview
Vincent is currently experimenting with a new annotation tool, Prodigy, that can structure the information returned by the technology for data labeling. Vincent expounds the clever tricks he uses to get such tools to return only the data he needs. He also says how enjoyable it is to respond to Prodigy’s support forum requests, not least for their diversity in scope. He counts academics, journalists, and even dentists among those who query him.
This focus on responding to the latest real-life problems is a running theme in Vincent’s work. His educational platform Calmcode has hundreds of snackable video tutorials about everything to do with software engineering and more: Vincent established Calmcode after realizing the need for high-quality educational content that a) didn’t rehash the usual datasets, b) helped reduce “skill anxiety” for complete beginners, and c) explained the necessity to understand the context around datasets. For Vincent, “it’s not the algorithm that matters…and skipping that part is a serious flaw in education.” He goes beyond the typical classroom environment, teaching registered users not only the latest algorithms but also how to collaborate on projects and other “soft skills” that are nevertheless vital for the success of a project.
Finally, Vincent emphasizes the importance of linguistics and its application whenever data scientists or engineers want to get to grips with NLP. Trained linguists, Vincent notes, can become essential allies for NLP practitioners, helping to solve data problems by laying bare the elements and structure of language.
If you are a Prodigy user and want to give Vincent feedback on his open recipes, you can do so at github.com/explosion/prodigy!   
In this episode you will learn:  

  • How Vincent came to work with De Speld [08:57]
  • Vincent’s role at Explosion [18:59]
  • How users can apply spaCy [21:46]
  • Prodigy: Annotate training data more efficiently with scripts [26:28] 
  • How to manage “skill anxiety” with Calmcode [32:32]
  • How Vincent fixed bad labels [42:47]
  • The value of understanding linguistics for NLP [54:42]
  • How to constrain artificial stupidity [1:02:38]
 
Items mentioned in this podcast:

Podcast Transcript

Jon: 00:00:00

This is episode number 659 with Vincent Warmerdam, Machine Learning Engineer at Explosion. Today’s episode is brought to you by epic LinkedIn Learning instructor Keith McCormick. 
00:00:15
Welcome to the SuperDataScience Podcast, the most listened-to podcast in the data science industry. Each week, we bring you inspiring people and ideas to help you build a successful career in data science. I’m your host, Jon Krohn. Thanks for joining me today. And now let’s make the complex simple. 
00:00:46
Welcome back to the SuperDataScience Podcast. Today you’re in for a treat with the brilliant, highly technical and sharp-witted Vincent Warmerdam. Vincent is a Machine Learning Engineer at Explosion, the extraordinary German software company that specializes in developer tools for AI and Natural Language Processing, such as spaCy, which is arguably the leading open-source library for NLP, Prodigy, a data annotation tool and Thinc, which is a deep learning library. Vincent is also renowned for several open-source tools of his own, including a labeling tool called bulk, and a tool for fixing poor labels called Doubtlab. He is behind an educational platform called Calmcode that has over 600 short and conspicuously enjoyable video tutorials about software engineering concepts. He was co-founder and chair of PyData Amsterdam, and has delivered countless amusing and insightful PyData talks. He holds a master’s in Econometrics and Operations Research from VU Amsterdam. 
00:01:44
Today’s episode will appeal primarily to technical listeners as it focuses primarily on ideas and open-source software libraries that are invaluable for data scientists, particularly those developing AI or NLP applications. In this episode, Vincent details the prompt recipes he developed to enable OpenAI GPT architectures to perform tremendously helpful NLP tasks such as labeling. He talks about the super popular open-source libraries he’s developed on his own, as well as with Explosion. He talks about the software tools he uses daily, including several invaluable open-source packages made by other folks, and he talks about how both linguistics and operations research are extremely useful fields to be a better NLP practitioner and machine learning practitioner, respectively. All right, you ready for this top-drawer episode? Let’s go.
00:02:39
Vincent, welcome to the SuperDataScience Podcast. I’m delighted to have you on the air. Where in the world are you calling in from? 
Vincent: 00:02:47
Hi, I’m calling in from Harlem in the Netherlands, which is a city between Amsterdam and the beach. 
Jon: 00:02:54
I love that, we talked a bit about the history of New York, cause I’m in New York, which was previously New Amsterdam, and I had included in, on this obvious connection between Harlem and New York and the Harlem that you are in, that I’m probably mispronouncing even just, it turns out I can’t pronounce anything in Dutch. I tried to pronounce Vincent’s last name to him. He said it to me, and he was like, you just, there’s some vowel sounds. You just don’t know. 
Vincent: 00:03:23
It’s, well, it’s not necessarily a lack of knowledge. It’s more that, Warmerdam, like we have some, we have some very particular, regional-like sounds of vowels and consonants in the country that I live. And that’s something that, you know, not every other language has, but yeah, my American friends have always called me Warmerdam. 
Jon: 00:03:42
Warmerdam. 
Vincent: 00:03:43
Yeah. And the Dutch friends say Warmedam, there’s like subtle differences, but it’s just a name. 
Jon: 00:03:49
But yeah, so I probably mispronounced Harlem, as well. At least your Harlem.
Vincent: 00:03:53
Well, it’s also spelled differently. So like, the city that I live in is Harlem, which is spelled with double A, and I believe Harlem in New York is spelled with a single a, like New York used to be called New Amsterdam is a bit of history there. And also I think like certain boroughs in New York are called after Dutch cities if I’m not mistaken as well…
Jon: 00:04:12
Yeah. And so we were, we were talking about how in the New Amsterdam part of New York, so when it was a Dutch colony, it was the very southern tip of Manhattan, which is where I live. And down here there’s no grid. Whereas once the English took over from the Dutch, everything else that they built in Manhattan and the rest of the city is on this really structured grid. So people who have been to New York, most of New York, they’re familiar with this, streets go east-west. So as you go northward, you climb streets very rigidly and avenues run north-south. And so a lot of people find navigating New York pretty easy, but Downtown Manhattan, the old New Amsterdam part, the old New Amsterdam, where I live, there, yeah, there isn’t this grid cuz it grew organically. And then you were describing to me a really interesting thing about the way that Amsterdam is shaped. Cause you were like, “Oh, Amsterdam also isn’t on a grid”, but I was like, it has a really nice structure cause it has these, semi-circles emanating from the center. And you told me… 
Vincent: 00:05:18
Yeah. So I should maybe, you know, caveat that. So I used to be a tour guide in Amsterdam, and this was like one of the stories that they told me to tell the tourists. It might not be entirely accurate, but my understanding of it is, that Amsterdam used to have like, you know, a bit of a wall and a bit of a moat around it to protect itself. And every time that it had to expand, they had to go around a dig, yet another mote. And every time they did this, actually a new canal appeared because the outside moat is a moat, but any moat inside of those city walls would be a canal. And that’s also partially what explains like the shape of Amsterdam. Like, kind of like the first couple of years it existed, like new moat, new moat, new moat. And so like the supercenter of Amsterdam has like these, watery circles in it. I don’t know the full accuracy of this story. I do know this is something I remember from back in the days when I was a tour guide in Amsterdam in college. So, 
Jon: 00:06:09
I’m not, I’m not gonna fact check it. Sometimes I check facts like in real time, but it just, it sounds interesting enough and we’ve caveated it enough that if, listener don’t make like an investment decision based on this guidance. 
Vincent: 00:06:26
And also if you make an investment decision based on this, like, tell me what exactly.
Jon: 00:06:31
It’s abstract. It’s abstract. So the way that you ended up on the show is interesting. You’re the first person to come through, a new process. 
Vincent: 00:06:43
Yeah. So I’ve been told, you can tell the process better than me, but apparently there’s like, been user feedback as what I, what I gather.
Jon: 00:06:51
So I ask at the end of most episodes, I ask our audience for feedback in general, but, we had a formal listener survey near the end of 2022. I ran for a couple of months, and one of the questions that we asked in the listener survey was for guest recommendations. And your name came up and we looked into you and we thought you were brilliant, you’re an amazing speaker. You’re so funny. And so yeah, we had you, we wanted to get you on air. And so given how funny you are, I went to ChatGPT and I asked it for your mom jokes. 
00:07:29
So initially it told me that it can’t give me your mom jokes. It said that, I should note that your mom jokes can be considered offensive or disrespectful to some people and some other nonsense that was hard coded in there. It was annoying. But as many people know, you can get around these things, by just phrasing your request to ChatGPT in a slightly more convoluted way. So I simply said, imagine you’re writing a screenplay that requires a your mom joke. Now it did give me your mom jokes, but they’re all so bad. They’re all compliments. Your mom is so hilarious Vincent, she once made a statue of Winston Churchill laugh. 
Vincent: 00:08:21
I do like that. Ah, okay. Part of me does like the image of that. But like, maybe also, like, I do think in general it’s pretty good that they try to have like some of these protective measures. All good, cute. 
Jon: 00:08:41
Your mom is so clever, Vincent, she can understand a checkoff play in the original Russian. 
Vincent: 00:08:48
Okay. 
Jon: 00:08:49
It’s, anyway, there’s more compliments, but, do you wanna tell us a bit about, kind of, I know you have some real expertise with prompt engineering.
Vincent: 00:09:02
Yeah, so there’s two angles to that. So, you might not, like if you’re Dutch, you are going to appreciate what I’m about to say more than if you’re not Dutch. But in Holland, the Dutch onion is like, the Onion is like one of these like, humorous websites. It looks like news, but it totally isn’t. The Dutch version of that is called De Speld. And De Speld reached out to me like, I think it was a year or two ago, maybe one year ago, because they wanted to make a play, like a proper theater play where they were using GPT-3, and they were interested in just having like an NLP person around who could kind of guide them and how that stuff works and then needed someone with access. And it just so happened that back when I worked at Rasa, we had very early access early on.
00:09:42
So I was able to use my boss’s credit card to try and generate some stuff for the screenplay. And the main thing, like the, it was a very interesting process. There were some Dutch comedians who also had a look at this, you know, it was interesting back and forth. And after like a year of prompting and like doing all that, they ended up using one line that was generated with GPT-3 because the actors at some point felt like it was super hard to make it work in a proper play. And like, when you read it and when you’re sort of involved with the writing of it, it does make sense. But when you’re like an actual actor on stage, the basic conclusion was, we can’t use any of this.
00:10:17
And so that was like interesting thing number one, that did made you kind of go, like, I can see how a tool like that can maybe help out with writer’s block, but, you know, writing an actual full length book or, you know, making an actual full length play, there’s definitely limits there. So it was an interesting thing. But definitely like, I work in a team, for Prodigy, like, we’ll probably talk more about that later, but, the company’s called Explosion. We make spaCy, that’s like one of the things you probably have heard of. But we also have this great annotation tool. So there’s also like a professional interest when it comes to this, tech. Because one thing that we like to use it for is to maybe help out with data labeling. And it’s a bit of an experimental thing because of course these methods are definitely kind of knew. But you can engineer a prompt that says, “Hey GPT-3 here is a sentence. Can you detect all the ingredients in it? Like find the entities?” And, you know, you cannot expect it to always get it right. But if you can prompt the request in such a way that the information you get back is somewhat structured. Like you can actually say,
Jon: 00:11:22
Nouns and verbs and, like broader parts of speech. Like, 
Vincent: 00:11:27
I mean, we haven’t tried it for parts of speech. 
Jon: 00:11:30
Phrase,
Vincent: 00:11:31
We’ve mainly used it for named entities and text classification at this point. So like, imagine I have lots of texts from forums about recipes and I’m interested in finding kitchen equipment and ingredients and maybe names of regions and names of the cuisine. And it can try to detect some of those things. Now it’s not gonna be perfect, but the thing that’s kind of interesting, to us is that you can have a somewhat general pre highlighting step, which can then, populate the annotation interface inside of Prodigy. And then possibly, you don’t have to use the mouse cursor as much. Sometimes OpenAI will get it right, and it will just be a very easy, accept this entire sentence with all the annotations. So that’s kind of a cool trick. And similarly, you can do something with, text classification as well.
00:12:20
So you can say, “Hey, GPT-3, here’s a sentence, and here are, like a couple of classes and like, can you maybe like, tell me if this class is in this sentence represented somehow”, and how that’s also going to, like, you can use the same annotation trick. But another thing you can do is you can do this for a whole bunch of documents that you have lying around. And if, let’s say, one of your labels is relatively rare, then you can tell ChatGPT, like GPT-3, like, hey, annotate these thousand examples. And then afterwards, let’s just grab the ones where we find the rare class. And you know, it’s just, like there’s interesting stuff you can do with this. 
Jon: 00:12:58
You’re, you’re about to say to simulate the rare class. Yeah? 
Vincent: 00:13:02
Well, we’re not gonna, well, so, OpenAI will try to detect the rare class for us, which basically means that we don’t manually have to go through all the examples. Just the ones where OpenAI has said, it’s likely that the class of interest is in here. So it, when you’re dealing with like a rare class that you would like to give more examples of, this is a trick that you could use to maybe find the examples of interest without having to go through everything. 
Jon: 00:13:27
Right. But couldn’t we then also, I’m just speaking hypothetically here cause I haven’t tried anything like it, but couldn’t we have it simulate new instances of the rare class we might want to eyeball? 
Vincent: 00:13:41
So, well, so that’s something, that’s a task called paraphrasing. I think it’s called paraphrasing. The thing is, I have NLP linguistic friends who are definitely more in the field, so I might be using the wrong term, but I believe it’s called paraphrasing something like, “Hey, here are five examples of someone ordering food at Pizza Hut, let’s say, can you generate more examples like that?”, and that’s something you could also do with chat, with, with GPT-3, like tech. And I’ve tried a bunch of techniques there. The only thing that’s a bit tricky is that you often do end up with something that’s obviously simulated, and that can also, like if you train the machine learning model, you do hope that it generalizes beyond synthetic patterns, if you will. Because in the end, we’re gonna use this in production where actual humans are interacting with our model.
00:14:29
And if we start synthesizing lots of text that’s unlike what our user type, then we might make the model worse. So, I have done some experiments with it and it’s definitely like something I’m eager to keep an eye on. But there’s still some prompt engineering and tuning, you know, you gotta do here. So that’s, that’s, it remains tricky. But it’s nice to have like a zero shot trick around, so to say, such that even if you don’t have a pre-trained model for a label of interest, you might still be able to get some help with your annotations. And that’s, that’s definitely like a cool thing. 
Jon: 00:14:58
Yeah. And this, even just as we’ve been speaking, I cannot believe that I had not thought of trying a GPT architecture for helping me annotate the data that I work with. That is a huge problem that I face. 
Vincent: 00:15:12
So you go to, github.com/explosion/prodigyopenairecipes. If you are a Prodigy user, you can go ahead and use this. It’s all documented, feedback would be super welcome. You are also able to customize the prompts yourself. And like another trick that’s in there that I do think is pretty neat, there’s also a trick that’s, like a terminology list, if you will. Something like, “Hey, I wanna have a model that can detect, let’s say video games” or something like that. You can just go to, GPT-3 and say, “Hey, generate me a list of video games”, and just keep on going, such that, you know, I just have a list, that I can go ahead and use. And sure, like, it might help you with just the beginning segment of your annotation process, but having such a list in general is just kind of useful. 
00:15:58
And there’s, you know, this is stuff that we are exploring internally. Like are there, some other things just, you know, kind of pragmatic recipes that we can come up with where we keep the human in the loop. Cause we don’t trust what comes out of OpenAI all the time. But if we can make it, you know, more enjoyable, quicker, to get high quality data, that’s a good thing. So, there’s definitely like a professional interest to look at OpenAI right now. I can imagine a future where OpenAI’s not gonna be the only provider of this. Maybe at some point we’re also gonna be able to point to like a local model and not necessarily a thing in the cloud. We don’t necessarily know what will happen. The future is hard to predict. But, I do think some of these techniques can be generally useful to help you get high quality training data. That’s definitely, like, some of the experiments that we’re running do seem very promising, that I can confirm. 
Jon: 00:16:48
Super cool. Recently, in Episode #655, Keith McCormick and I discussed how to get a profitable return on an AI project investment. To allow you to learn about Keith’s profitable project process in detail, he’s kindly providing listeners of this podcast with free access to his LinkedIn Learning course on ensuring ROI. All you have to do is follow Keith McCormick on LinkedIn and follow the special hashtag #SDSKeith. The link gives you temporary course access but with plenty of time to finish it. Getting a profitable return on your A.I. projects is the very definition of success. Check out the hashtag #SDSKeith on LinkedIn to get started right away. 
00:17:30
So OpenAI prompt recipes available through the Prodigy GitHub. I’ll be sure to include that in the show notes. And so, yeah, so speaking more broadly, in addition to this data annotation tool Prodigy and the super famous Natural Language Processing package, spaCy, s p a c y. 
Vincent: 00:17:50
Capital C, y. 
Jon: 00:17:51
Yeah. Capital C, y. Do you know why that is? 
Vincent: 00:17:54
Yeah. So, originally, when Matt was working on this, he wanted to make a tokenizer, and Matt’s Australian, so I don’t know if this is really true, but I believe like, you know, it’s space, 
Jon: 00:18:08
Something to do with moats expanding. 
Vincent: 00:18:10
No, no. That’s like spaCy sounds like an Australian way of saying, you know, we’re making spaces appear. I think that’s kind of where part of it came from. But another part of it was, it was written in Cython to make it fast. So Cy in that sense. And it’s kind of nice, funny and distinctive, I think when there’s a capital C in the middle. I could be wrong in the details here, but I believe a combination of these reasons is why they called it, spaCy with a C in the middle. 
Jon: 00:18:34
Yeah. So make sure when you Google it, you add that capital C. 
Vincent: 00:18:38
I’m sure Google can manage without it. No, but it’s like whenever I write documentation for it, like I do make it a point that you spell spaCy with a capital C. So any, like, I’ve made some scikit-learn plugins and usually the class name has to start with a capital letter and I adhere to that rule except for when space is involved, cause… 
Jon: 00:18:57
Nice. So, yeah, so, Prodigy, data annotation tool, this super famous NLP package spaCy, and also the deep learning library Thinc. 
Vincent: 00:19:05
Yep. 
Jon: 00:19:05
Those are all products by Explosion, where you are a machine learning engineer. So what does that role entail? I think you primarily work on Prodigy, right? 
Vincent: 00:19:16
Yes. So my role changed a little bit recently. Like when I started, a lot of what I did was like, a little bit more, developer content stuff. So like tutorials and things like that was part of my work. But recently I’ve transitioned and I’m just a engineer on the Prodigy product. So that annotation tool that we have, you know, if you’re on the support forum, you’ll definitely see me. I’m around to fix bugs, add new features, and there’s other people in my team as well. But also like those OpenAI recipes that I was, talking about earlier. That’s also work that we do. It’s a really cool gig, I gotta say cuz on the support forum, we get like, very elaborate requests from like people, like we have, like there’s an academic group, like NLP in humanities that tend to be like a pretty big segment, we’ve got some journalists who are doing stuff with NLP. I even noticed a dentist who wants to do computer vision. 
00:20:13
So it’s a really fun mix of like, people with proper problems and they’re looking for good annotation practices. And I can sort of, be in the loop there and, either give some advice or work on, software that can help them with it. We also offer some consultancy these days. So if you’re interested in like a custom spaCy model, that’s something that we do. If you’re interested in like, help with your annotation practices, that’s something that we can do too. And kind of the cool thing about the company setup is that, you don’t wanna be a open-source maintainer or like tool maintainer in general that maintains the tool but doesn’t use it. But by offering these consultancy services, we also get inspiration for like, new features. And we’re also like, like in a good way, confronted with our own software, like occasionally, because we, you really don’t wanna be the person who maintains a tool but doesn’t use it. Like that would be a shame. So that’s kind of part of what I do. But my main focus and role right now is on the Prodigy team. 
Jon: 00:21:12
Cool. So let’s talk about Prodigy in more detail in a moment, but just quickly give an overview. I’ve used spaCy a fair bit. So you mentioned one of the use cases there. You can use it for like tokenizing a document. So when you have a big piece of natural language, it could be a whole book or it could be, a prompt into a ChatGPT or whatever, any natural human language, you can pass it through spaCy to identify where the individual words are. And I guess you mentioned that that was kind of one of the original use cases and why it might be called spaCy in the first place. But what other kinds of things can we do with spaCy? 
Vincent: 00:21:48
So, spaCy also provides some pre-trained models. And the nice thing about those is, there’s some open data sets that you can train, you know, different models for different languages, but it also means that you can attach information to the tokens. So one of the things that people like to use it for is to know, like let’s say for example, I’m interested in detecting programming languages in text. If you’re interested in that, then Go is by far the hardest programming language, because usually the word Go, appears in the English language, as not a programming language. But there’s a trick, because if you know that the word Go is used as a verb, okay, then it’s probably not a programming language, but if it’s used as a noun, right? Oh, okay. Then, it might be used. 
00:22:29
Then that might actually be a programming language. So there’s this extra, you know, like grammar information that, we can also attach to some of these tokens that can be useful in an NLP pipeline. The thinking here is that you might be able to build a rule-based system on top of the statistical system that these models provide. These models also come with some extra features. So, we, there’s an attempt to also detect entities like dates or amounts of money or companies or people’s names. And in my experience, they tend to be pretty good. Like all statistical models, they’re not necessarily perfect. These models have to be trained on the data set, and if your data set is unlike the data set that we train on, it’s not gonna be perfect. But in general, it’s a very reasonable place to start.
00:23:15
And the nice thing is you can just tip and install it and it runs. One of the other main features of spaCy is that, while the library definitely offers transformer models as well these days, one of the focuses of the library is just to remain fast. Like, these models are meant to be robust in production as well. That’s also like a pretty design, like core design principle. There’s lots of very cool hashing tricks in the library if you’re interested. But that’s kind of what, like super quick in a nutshell what spaCy can do. You can also train your own models, you can do classification, there’s all sorts of new NLP-ish things in the pipeline as well. But spaCy is trying to be a somewhat general NLP tool, that you can customize for a production use case. The tokenizer supports 42 languages, if I’m not mistaken. And I think we have pre-trained models for 12 languages. And typically for each model we have like a small model that’s very, very lightweight and a large model that has more word embeddings in it that’s a bit heavier, but should have higher accuracy. Choice is yours, is kind of the idea there. 
Jon: 00:24:20
Yeah. And we should take a step back here for some of our listeners, it will be very obvious that these kinds of things that we’re talking about, especially when you mentioned something like a PIP install, that all of these are free open-source Python libraries. But, I thought I should make that explicit for those listening out there that don’t know what a PIP install means. So this is basically all of these software libraries that we’ve been talking about so far and that we’ll continue to talk about for the next little bit, are these open-source libraries that you can very quickly install if you’re already familiar with the programming language Python. And so, yeah, spaCy one of the most popular tools in Python for handling natural language in the efficient way that Vincent just described. Super cool. 
Vincent: 00:25:01
One small caveat there. Prodigy is paid, so, spaCy, definitely open-source, and we have lots of, like other open-source packages to support that. But the Prodigy labeling tool is a paid service at the moment, or like a paid tool. The pricing is pretty cool though, like, you just pay once and then you can use it for life. It’s kind of like how Photoshop used to work, back in 2008. But Prodigy is kind of has been the classic funding model for the spaCy open-source tool. That’s kind of set up. 
Jon: 00:25:33
Cool. And then what about this deep learning library Thinc, T H I N C?
Vincent: 00:25:41
Yes. So that’s the library that spaCy uses under the hood. And there’s some, differences with other deep learning libraries. I should admit that I’m not the, like, I don’t know the full details of think, like the different colleagues of mine are definitely more on that. But, yeah, it’s just a way that serves us very well. Like, it gives us more, the impression I have is it definitely gives us way more control over our own destiny in a way. And think also allows you to integrate with TensorFlow and PyTorch, and there’s like lots of things that you could definitely do with it. But, I believe the design principle is that, it’s kind of nice just, own, the entire pipeline and that’s also what spaCy uses under the hood. 
Jon: 00:26:26
Nice. So then why don’t we dig a bit more into Prodigy specifically, given that that’s what you work on? 
Vincent: 00:26:32
Sure. I’m able to get, I know way more about that than Thinc that’s definitely a thing I do want to caveat. Yeah. 
Jon: 00:26:38
So we know that it’s a data annotation tool. We know that we can access these OpenAI prompt recipes through it, but that’s probably not the main reason why it exists. 
Vincent: 00:26:47
That’s a recent thing that we added, but the, like, maybe one thing that’s kind of nice to maybe mention about it, The way that it works is like if you, if you think about the people who we’re trying to sort of, give a really cool tool for, like the whole thing is scriptable. So we have annotation components, kind of like Lego bricks. So you can imagine like, hey, we have an element that can render a sentence where you can select named entities, and there’s an element where you’re able to provide some text under it, and there’s another element that allows you to annotate a photo. But basically the way that you wanna mix and match that, combined with machine learning models that you own, it’s the, one of the main features is that we allow you to script all that stuff yourself.
00:27:34
So those OpenAI recipes that I alluded to earlier, those are just literally some Python scripts you could say, that just interact with the annotation front end. So the main cool thing with Prodigy is yes, we offer some, we call them recipes, like for specific tasks, batteries are included, you can just go ahead and annotate. But if you wanna do like something a little bit more experimental and a little bit more specific, we also just allow you to make your own interfaces. That’s like one of the core features here. And that also makes it such that, if you have your own, weird little active learning, machine learning model for data de-duplication, then we’re not gonna block you from using that in our annotation interfaces. That’s just completely up to you. And that I think is the, is at least from my experience, the most powerful feature, in Prodigy. It’s also what attracted me, to, to the employer, I should say. Like, the fact that everything there was scriptable, just made me super productive, in the past. 
Jon: 00:28:34
Very cool. How did you end up there in the first place? 
Vincent: 00:28:37
So, that’s an interesting story. So I’m about to just share like, my version of the story. I can imagine the founders might have a slightly different take, but the way that I remember it, is that, a couple of years ago there was this conference, called spaCy In Real Life, the spaCy IRL conference. I believe it was 2019, but I could be mistaken. 
Jon: 00:28:58
Yeah, the odds of a spaCy IRL conference in 2020 are much slimmer.
Vincent: 00:29:02
Yeah. So I’m pretty sure it wasn’t then, but, but the thinking was like, Hey, I, back then I had never done much with NLP before, and it just seemed like spaCy was kind of a cool tool. So I kind of figured, I know nothing about it, let’s just go to a conference, and then probably when I come back from the conference, I’ll have learned stuff. So I also signed up for the workshop, and basically, you know, I was just very curious during the workshop playing around with like, lots of their tools. And then what happened was, eventually after the conference you go to a bar and the bar was super full. So we went to another bar, and then I walked up, I think it was Ines, like one of the founders of the company, and I said, you know, this bar is, is better because it’s way more spaCy. 
00:29:45
Really, really bad corny joke, but I thought it was funny. But anyway, like the way that I like to tell the story is I like to think that that joke made a bit of an impression on them. Like, I was definitely like a curious individual during their workshops, and then they, you know, they reached out and they kind of said, Vincent, we kinda like your personality and vibe, and we’re looking for someone who might be able to make educational content, around like, spaCy. And the thinking was like, it would be kind of cool if like, someone who’s kind of new to NLP, could just take a problem and step by step solve it. So it’d be less about Syntax and Docs, and it’d be more about, no, I’m just gonna try and solve a problem, but use spaCy to do that. And hopefully by just doing that, we’ll be able to go through all the different features of spaCy, and that’ll just be a couple of cool videos. And then, you know, I kind of told them like, I would love to do that. The only thing I would want in return is, I’m gonna have a lifetime license of Prodigy,
00:30:45
And, I wanna be able to ask you questions. So again, back in, back then I was well aware of the fact that NLP was like a new domain for me, but I have access to like, the people who made spaCy. Like, this is amazing, Like, like, this is gonna be great for me to learn NLP stuff. So, for a couple of years I was just in their Slack channel, and once in a while I would be working on like another iteration of another video, and that’s how I met them. This is the story. And then after a few years, you know, I also started working for this, chatbot startup called Rasa. And after a while, you know, I felt more confident about my NLP skills and I just reached out like, Hey, I, I feel like switching employers, is, the folks have an opening for someone like me. And they basically said yes. And that led to the shortest job interview I’ve ever had, because I had already been working with them for like years beforehand. 
Jon: 00:31:34
Yeah. Yeah.
Vincent: 00:31:35
But yeah, the, the reason I worked there, my version of the story is Bad Pun in a Bar. 
Jon: 00:31:42
Nice, so, well, it seems like an amazing place to work. They’re working on amazing things there. Want the best possible start in Machine Learning? SuperDataScience’s top instructors Kirill and Hadelin are back creating courses and have released a brand-new ML course that will give you that perfect start. It’s called “Machine Learning in Python, Level 1.” From their experience teaching Machine Learning for over 6 years and collecting feedback from their 2 million plus students, they know exactly what you need to be quickly on your way toward ML expertise. You will get crystal clear explanations of introductory machine learning theory backed by practical, hands-on case studies with working code. Enroll today at www.superdatascience.com/start and get ahead of the game! Again, that’s www.superdatascience.com/start. 
00:32:32
Outside of Explosion, you have other open-source tools that you’ve developed though. So you’re very well known for your bulk labeling tool, called Bulk. 
Vincent: 00:32:44
Yep.
Jon: 00:32:45
As well as a project called Calmcode. 
Vincent: 00:32:49
Yes. 
Jon: 00:32:49
So people can get to that at Calmcode.io, and I’ll, I’ll have links to everything we talk about in the show notes. But, Calmcode has hundreds of video tutorials to remedy skill anxiety, what you call skill anxiety. So maybe you can tell us what skill anxiety is and how you use Calmcode, how you have tools and thoughts at Calmcode that make one’s professional life more enjoyable? 
Vincent: 00:33:16
Yeah. So, it might help to sort of explain how that thing got created. So at some point in my professional career, I was kind of just looking around at this educational content for data science, because, you know, I, it’s kind of lucrative to be in this field, right? Like you can, you can charge pretty high rates, sometimes like a thousand dollars a day per person for like a data science class. Like those are rates that I’ve seen. But then I started looking at some of the educational content and I was just getting, you know, a little bit fed up with the poor quality of it. And in particular, the main thing that was a thorn in my eye was, the scikit-learn, has a, used to have a data set called Load Boston, which had Boston House prices. 
Jon: 00:33:58
Yes, oh yeah. 
Vincent: 00:33:59
And, you know, if you’re aware of this data set, there’s an obvious like, flaw with it. And one of the columns in this data set, like you try to predict the house price. So some of the things that are in there as I think square footage of the house, distance, the Hudson River, like, there’s a couple of these features, but one of them was related to like skin color. I believe the column name was something along the lines of, percentage of blacks in your town. Like, something ridiculous, like, like some something prompt like I may be getting a detail wrong, but it, there’s something along those lines and something where you do look at it and kind of go, why is this data set used in every single data science tutorial? Like a very good first question is why is this in scikit-learn in the first place?
00:34:40
But moreover, if you’re charging a thousand dollars a day to teach people how to use scikit-learn and how to do machine learning for God’s sake, you gotta talk about, like explore the data set first. But this data set was, is used in like, it’s less now, of course, but like, it was used on so many open-source projects and it was in so many O’Reilly books. And, this was like one of those like examples where I just kind of got properly frustrated with how I, how this data science education was taking place. Cause for starters, I do think like if you’re teaching scikit-learn, you gotta teach the whole, hey, it’s not calling Fit Predict. It’s not the algorithm that matters, it’s also all the stuff around it. And skipping that part, I think is a serious flaw in education, but it’s also the thing that, imagine if you’re new in this field and you’re just trying to learn good practices. 
00:35:28
There are so many tools to learn, and there’s also this feeling of almost, you gotta know all the tools in order to get the job. Anyway, all this stuff was making me look at the field and just getting kind of frustrated. And then I kind of figured, well, if I’m this frustrated, then maybe I can get some energy. But just making a website that just tries to not be this, like if I were to design like a better path, if I were, if I were starting out right now, what would be like a better learning environment? And I was looking for a word that would like describe it better. And I came to the word calm. I just felt like one of the downsides of the way data science is taught now is a lack of calm. Where instead of saying, you gotta learn this tool, it’s a little bit more of, here are just some tools to get you through today.
00:36:08
Like just some, some useful tools that will just take some pain away, and maybe just some tricks that will just, make your day-to-day a bit nicer. And that’s how I kind of came to the Calmcode name. And that’s also how I got to the tag. Like, I can imagine there’s a little bit of skill anxiety, like new people trying to get into the field. I kind of, I think having a more diverse group of people around is probably good for the field. So, like that’s a theme I want to tackle. I would like these video lessons, to kind of start from scratch and to be nice and short and simple. Kind of like that feeling when you’re watching a great lightning talk. Like, if you can learn something in five minutes, that’s like a really, you know, kind of a cool feeling.
00:36:45
So I try to, have like maybe five videos that are no longer than five minutes, and that can kind of be a course. They’re just introduces you to how you might wanna use a tool. And I also just think, I’m, I try not to make any big promises, but just tools and thoughts that might make your professional life more enjoyable. I think there’s also kind of like a nice thing to kind of just strive for again, because it emphasizes this calm I’m trying to achieve. I’m not trying to teach you the latest and greatest, I’m just trying to help you get through the day. And hopefully, this website helps you in, helps inspire you to try some new tools. But yeah, I should also admit, I recently got Head of the Baby. So I’ve been focusing on that.
00:37:25
Like, I’ve not, like I’ve really not done anything with the website for like, like almost a year at this point, I think. But there’s been good word of mouth and the website’s been getting traction and it’s almost like I’ve, yeah, it’s been getting more and more traffic when I’ve been doing less and less of it. But it’s also because there’s like, almost 90 courses on the website now. But yeah, that’s, that’s kind of how this website got created and I would love to do more of the website. It’s just not a priority in my life at the moment, which I also think is like a nice calm attitude. Like the website would just run even if I don’t do anything with it, which is kind of a nice calm experience for me. But that’s, that’s kind of the origin. And I, I’ve been getting like really cool, just feedback, like people writing personal emails, that, that’s just really cool to see. So if I’ve been, it’s a service I offer for free. People can donate coffee if they want, but they don’t have to but that’s, that’s kind of the pitch and the project and what it is about. 
Jon: 00:38:23
Cool. I’m sure there are lots of listeners that are gonna check it out. It sounds like a great resource. So a big part of what you talk about in your educational content is reusable and clean code and how critical it is for software engineering. But data scientists, when they’re scripting, they often have their proverbial untitled 12 dot IPI notebook that they’re working in. So you just have all these Jupiter files, named untitled, aggregating, in the tabs in your browser, and then, later in your file directory. So why is it that data scientists are so bad at this, at naming their notebooks and keeping their code organized? 
Vincent: 00:39:09
I mean, I don’t also, it’s hard for me to say because, I think that there is this thing called like hidden knowledge, if you will. Like, if you, I suppose you wanna learn pandas if you go to the documentation page of pandas by the way, is this data frame library, in case you’re, you have not heard of it. But part of the goal of the website is to teach you like all the buttons that you can press. And it’s kind of like, if I were to explain to you how to use a hammer, then I would say, well, it goes tap tap. And that’s how you kind of use a hammer. But if I think about like, if you wanna build a house, then it’s more than just banging a hammer, it’s also knowing where the nails should go in.
00:39:51
And I’ve kind of noticed that, maybe it’s not so much the lack of the ability to program, it’s just that maybe programming is just more than syntax. I think that’s kind of the issue here. And if you take a college course, I don’t know if they teach you git in the introduction to Python programming, and I don’t know if they teach you how good collaboration could work between colleagues and I also don’t know if they teach you good unit testing practices. And, and it’s kind of like the metaphor of I can teach you how an oven works, but that doesn’t make you a good cook yet. So what I try to do is, of course I have to discuss syntax cuz you kind of have to, but what I try to do with the website is also, emphasize the whole thing of, well, it’s one thing to notice syntax, but it’s also like the way you wanna think about the syntax and how you wanna apply it. 
00:40:37
It’s kind of like the bigger thing around it. So I don’t know if I can properly blame the data science, like there’s a stereotype, right? Like the, I don’t think we can, it’s good to point a finger to, to the data scientist like, oh, you can’t program. Cause I think the issue is more of you’ve not been around people, that know what it’s like to be like, it’s more of an engineering culture thing, maybe that’s, that’s kind of my perspective on it. I know people, that are, way better at syntax than me, but I like to think that one of the things that I try to do on a day-to-day is how can we make our professional life day-to-day just a little bit nicer and maybe, you don’t have to know how Python meta estimators work in order to do that. 
00:41:16
Maybe it’s a little bit more of, Hey, how can we prevent bugs from seeping in and stuff like that. I do think like part of it is probably that, Jupyter Notebook is just such a, such a fun, playful environment to just try stuff out, that one thing that people might wanna do more of is just take a step back. Like there’s the play mode and maybe there’s also the clean mode. But a lot of this is also, having a good example around you that can sort of rethink your workflow. It takes years to to, to get that. So I also don’t think it’s a good expectation to say, oh, you’re fresh outta college. I now expect you to be able to do proper production work. It takes a while, I think. 
Jon: 00:42:02
Cool. Yeah. So I wanted you to really take a crack at data scientist there, but you’re really nice to them anyway. 
Vincent: 00:42:08
I mean, so it’s, but it’s also like, where do you learn these skills that you need? Like, is that something you really learn in college? Is it like the, is it college professor the archetype of like, amazing production quality coding practices? And if it isn’t, well maybe we shouldn’t expect that. 
Jon: 00:42:24
Right. 
Vincent: 00:42:24
So again, I, I’m a bit, I’m a bit more of a believer of, if you wanna learn this sort of thing, the best place to do that is to go to like open-source-y conferences and just listen and share more anecdotes. That’s, I think the better, sort of way to get there.
Jon: 00:42:39
Cool. Well, so in addition to this kind of clean code that you talk about a lot, this issue in data science, another issue that you talk about a lot in data science at these kinds of open-source-y conferences is data quality. So, you’ve spoken about that a lot. We already talked about your bulk labeling tool, bulk, but you also have another tool for fixing bad labels called Doubtlab. 
Vincent: 00:43:04
Yes.
Jon: 00:43:04
Yeah. Do you wanna tell us about that one? 
Vincent: 00:43:07
Yeah, so the origin story of that one is pretty funny too. So, a couple years ago, I, well, a year ago actually, I used to work at a company called Rasa and they built chatbots. And I noticed something interesting in the, base example that we had at Rasa. Like, you have to imagine, a part of what Rasa does is a classification problem where text goes in and we gotta figure out what the intent of the user is at that point in time. So like if I say “hello”, the intent is to greet and if I say “goodbye”, the intent is to say, “goodbye”. But then, we had this one example in the base tutorial that, was “Good afternoon”. And that one was pretty interesting because it was listed both as an example for hello and as an example for goodbye. And if you think about language, right, that makes complete sense because I need to, like, if I say the beginning of the conversation, then it’s a different intent than at the end, but just from the text itself, you cannot really guess that very well. And then I started,
Jon: 00:44:05
Over Christmas I was watching this really dumb, but pretty fun musical comedy based on a Christmas Carol. So, Dickens Christmas Carol, it’s starring, Will Ferrell and Ryan Reynolds. I can’t remember what the movie’s called, but that’s enough detail that you can find it if you really want to. And in one of the musical numbers, it’s the name of the musical number must be Good Afternoon. And the whole point of that is another use case other than just hello and goodbye. But it, this one it was, it was supposed to be like an insult. That’s kinda the main point of this song. It was like, there’s good afternoon to you, sir. It’s like a, 
Vincent: 00:44:46
Oh, rather kinda a snarky sarcasm thing.
Jon: 00:44:48
Yeah, exactly. And I don’t know that was the highlight of the film, I’d say.
Vincent: 00:44:54
So, so sarcasm I think is also like in general, like a really good example of something that can totally mess up like any sentiment data set, by the way, but, but like for all of these like examples, the one thing I did notice is like, hey, I kind of wanna have an automated tool that can find me these examples where we need to be a little bit careful. Because if you have the same text across two classes, that’s gonna confuse any algorithm that follows. That’s, you know, we gotta, I don’t, it’s not necessarily like I know what the best practice would be at that point, but I kind of wanna have a system that can just say, you know, once every half here or so all the new data that came in, can we just find examples? We just gotta check again.
00:45:32
And then you start thinking like, okay, what are some techniques that allow you to do this? And techniques aren’t necessarily very complex, like assuming you have a classifier, you can just say when the classifier is super sure, but it’s making a prediction for the wrong class according to the label. All right. Something weird’s happening, I don’t know what yet, but like, that’s an interesting example to check, if an example comes in, but another algorithm says it’s an outlier. All right, so those are kind of weird. Suppose an example comes in and the algorithm’s just super unsure. All right? That’s also a reason to doubt it. Suppose I’ve got like a very complex model and a very simple model and they disagree on this one example. Okay, it doesn’t have to be wrong immediately, but okay, that’s another reason to maybe doubt, what’s happening there. 
00:46:15
And that’s kind of how the name of Doubtlab came to be. I was just looking for just some heuristics and some tricks where given some data set and some models you trained on it, we can just come up with a couple of reasons to doubt the example at hand. And the way it just kind of works is when multiple flags for doubt, fire off at the same time, you can start prioritizing those examples for your annotation. So there’s also like a little link with Prodigy there. There’s, if you go to the YouTube channel for Explosion, you’ll see a demo of me how to, you know, apply these techniques for, a text, data set. But that was the pitch and that was kind of the thing I was going for. And once I made that tool, I was also thinking like, “Hey, maybe I need to try this out on, these techniques out on a couple of data sets”. 
00:47:00
And then very quickly you learned that a lot of these, like standard benchmark data sets have tons of label errors. There’s a website called labelerrors.com, and like, there’s a research project around that. There’s another tool behind that called, CleanLab, has a slightly different assumptions, but also a tool helps you find bad labels. But they did some research and they estimated, like the Google quick draw data set has like, on average 10% label error. The Amazon sentiment data set has on average 2% label error. The paper’s a good read, but the main, the main lesson I have from that is like, oh man, like all those algorithms that like improve the state of the art for like 0.5%, they might be overfitting on like the subset that’s badly labeled. That’s a dangerous thing. 
00:47:47
We need to have more tools for this. Oh my God. So that’s how I created Doubtlab was basically just kind of scratching my own itch. And it’s, and again, like I wanna be a little bit careful that I don’t suggest it’s a bad label per se, because usually, you know, you kind of wanna have a good meeting on the goal of the algorithm. And that usually helps inform whether or not a label is actually bad. But I think I like the word doubt there. Like there are reasons to doubt something, and this library just tries to make that easy. And there’s a couple of like really cool tricks that you could do here as well. Like you can have a, like the docs also have this one example where I just build two models and they use different embeddings under the hood.
00:48:28
So there’s one model that uses, bite pair embeddings. And, and the way that that works is they have like sub-word embeddings. And the moment that you have like very long sentences, I try to average all the word embeddings. So information gets lost on very long sentences, but if you have very short sentences, a lot of context still remains. And then I have a bag of words model, which on some data sets has like the opposite effect. The more words there are, the more context the bag of word model kind of has. And then when you throw it into Doubtlab, you kind of find out that both models have, a different sniff test for bad labels, so to say. So, so there’s lots of like really cool tricks that you could do in this domain by using different, embedding models. 
00:49:09
I also wrote a couple of libraries like Embetter that helped me sort of just swap out a couple of embedding models and scikit-learn pipelines. But a lot of this stuff revolves around me trying to scratch my own itch, and I like to do that in public. But Doubtlab has been getting some traction and some love, which is great. But I do wanna emphasize, it’s not necessarily a super fancy trick that I’m doing. I, it’s also kind of my style. I really try to prefer like a whole bunch of simple things before I move on to the complex thing. And nine times out of 10, the simple trick tends to work pretty well. But there are also like plenty of examples where you need something more complex. But, I like to have tools that allow me to try the simple thing first. Keeping the human in the loop and all that.
Jon: 00:49:49
I like that. That’s a nice life philosophy. Beyond your own tools, of which there are so many that you’re either, that you’ve either spearheaded entirely or that you work on as part of Explosion, beyond those, what other kinds of tools do you use all the time? So like, it seems clear that Python is your programming language of toys.
Vincent: 00:50:11
I mean, funny, so I’m also pretty big on R still like I, that was my, kind of my first serious language that I used, you know, at clients back in the day. I’m not doing as much of it, but I still think that there’s a couple of like, very cool things that R can do that not other languages can do. Yeah. Like I also do a little, the plotting, but also the whole dplyr stuff is like really well designed. And also the fact that you can just, translate that to like different SQL back ends and the whole, oh, we can define our own operator. Boom. Can’t do that in Python. So like, there, there’s definitely, like, I like the fact that I’ve been exposed to more than just one programming language in that sense, cuz I like to think it broadens my perspective a bit. But, I also do a bit of JavaScript, right? if I wanna do cool interactive stuff in the front end, which, you know, I kind of wanna do, it helps to know a little bit about that. But in terms of like other libraries, like, there’s a couple of libraries that I just recommend to a lot of people. And especially if you’re in data science, like, oh dear, check out the deon library. Deon is a data science checklist. 
Jon: 00:51:13
How do you spell that? I haven’t heard of that. 
Vincent: 00:51:14
D e o n. There’s a Calmcode course if you’re interested. But it’s really just a data science checklist, made by this company called Driven Data. In general, I believe they also host a couple of Kaggle like competitions, for kind of in the data science for good space. And some good stuff came outta that company. One of the, like, my favorite of which is this thing called deon. The way it just works is you can just say deon and then it will generate you a checklist that you can check and get. So, the nice thing is you could then say, oh, did we think about, privacy sensitive information? And then someone can check that off and the person who checked that off is in Git. So it’s a checking system where you can actually do a bit of bookkeeping.
00:51:55
But moreover, for each point, that checklist, you can go to their website and there’s like newspaper articles of examples of the thing that went wrong, which can definitely help motivate your manager to take it serious. The library also offers, like customizability. So if you have specific legal requirements within your company, let’s say you can add that to your own custom, list, but it’s just such a sensible idea, just, like, here’s a couple of things you might not have thought of. Just check it. So deon is definitely like a thing I really like to advise people. And then in terms of like, I don’t know, libraries that I use on a day-to-day, I mean, one thing that Explosion does very interestingly is that, they try to make like open-source projects to support other open-source projects. 
00:52:50
So just to give a small example, there’s this, really cool serialization library called Seriously, which a is a, really, really cute name, but b, there’s just some high-performance sensible things that if you’re doing stuff with text where, you know, there can be nested data because we have a text string and there can be entities in there. So there’s like, you know, keys and a list inside of a key in a list. JSONL files are just nice to have around. And, seriously, it’s just the library where you can just save it or load it from like very sensible formats. And that’s used a lot inside of Prodigy. It’s also used a lot inside of spaCy. But I kind of like it when there’s just a couple of companies who, maintain smaller projects. It’s kind of like a Lego brick that you can reuse. Even if it’s just kind of like a small little thing, that’s kind of a thing that I, tend to prefer because smaller libraries tend to break way less as well. So those are kind of packages that I kind of keep, try to keep an eye out for. Also because they make good candidates for Calmcode, if I’m honest. Just a few of these like small tools that do like one thing very, very well. I tend to like a lot. 
Jon: 00:54:03
Super cool. 
Vincent: 00:54:05
I like, maybe one of them that I thought was also kind of like a cute one. There’s a code checking tool called Interrogate, which I like to use on new projects. The only thing it does is it just checks if you wrote a doc string for your functions and also for your tests. And it doesn’t check the contents of your documentation string, but it does seem like a pretty good habit. Like, hey, everything that’s public should have a documentation string. It’s like a simple check that you can add to your pipeline, and it’s the only thing that the little project does, but it’s super useful. So it’s stuff like that I do, really like. 
Jon: 00:54:40
Yeah, that’s a great tip. Beyond just these individual kinds of software tools, another kind of, broad tool or, set of knowledge that we were talking about prior to recording that you have found a lot of value in is linguistics in general. 
Vincent: 00:54:58
Yes. So, I wanna give a small shout out to Rachel Tatman. She was a direct colleague of mine back at Rasa. And, and she has like a proper PhD in like linguistics. And I remember like a lot of the times where, you know, I wasn’t super knowledgeable about NLP yet, where I came up with an idea like, “Hey, could, could we maybe make a little algorithm that does this?” And Rachel will very often say, “but Vincent, that’s not necessarily how language works. Check it, read this paper” like, she was like very, very impressive encyclopedia of knowledge in that sense. And I do wanna give her a shout out. But like, like one thing that you can sort of ask yourself sometimes is like, okay, we have a classification task for NLP, what’s the linguistic complexity of it?
00:55:42
So like, if I wanna say hello, how many different ways of saying hello are there? And then okay, if you have a little bit of linguistic knowledge, you can kind of start thinking like, okay, maybe a word list suffices there, right? Or then we have the go example where, okay, maybe if I know it’s a verb or now, maybe that’s just sufficient to cover a lot of this, like linguistic complexity. But it’s also like reading up on linguistics also helps you realize that language is just more than a bag of words. So I, from the Emily Bender book, one of my favorite examples is, let’s say you have a sentence, the lion eats the man. Just that can be a sentence. The order of the words really matters in terms of what’s being communicated. Because if I suddenly start saying, the man eats the lion, the bag of words representation is exactly the same, but something completely different being communicated, right?
00:56:34
So then you like, if you just take a little bit more of, “Hey, how does language work? And is my algorithm covering enough of the complexity? I just found that to be like a very useful state of mind when starting with a NLP project. You cannot necessarily assume that if you just dump everything into a model that’ll just go ahead and do it. Just thinking about a bit more about the linguistics, tends to be super useful. And especially if you’re into like, doing non-English NLP, I always find that like, if you yourself speak more than one language, then there’s also a bit more of an appreciation that, English tends to be just an easy language, just from a morphological perspective. But also, you know, tricks like TFIDF tend to work. Okay. Bag of words tends to work okay. But other languages have like way bigger vocabularies just because their grammar allows for a way more complex ways of saying things. Just an appreciation of that I think is also just super useful. 
Jon: 00:57:31
Super cool. So I’ve been looking this up a little bit as you’ve been speaking. There’s a few Emily Bender books out there that seem relevant, but you’re talking about the Linguistics Fundamentals for NLP book. 
Vincent: 00:57:40
Yes, that one I like. I will say, the book is still just a little bit intimidating to me because my linguistic background is just, quite small. But it’s just that, I do like the first chapter in that book is just the sentence, language is more than a bag of words. And I will say just, just that one chapter has been, like a very good reminder, for me because you know, my background’s a bit more in like algorithms on tabular data. And yeah, there are reasons why you can’t just run XGBoost on text and assume it works the same way. It just doesn’t work that way. Right. But, but like, the co-founder of Explosion Ines, like, she did a talk a while ago where she explains that like, if you’re doing NLP, just have a, if it’s like a Dungeon Dragons character sheet, you do wanna make sure there’s like at least one skill point in linguistics if you’re in this field.
00:58:30
It’ll just help you, in a lot of small ways. And I guess like there’s one linguistic phenomenon that I think is kind of cute also, just as an example, and we talked about this one, before we start recording. So like, there’s things you can do in Spanish that you simply cannot do in English. And, like, just the example, I’m holding a cup right now, and if I wanna make it very clear that I am the person holding this cup, you’ll notice that I have to emphasize I have this cup. The reason is that, the word have, like the verb conjugation doesn’t necessarily imply who’s holding the cup because I have, or you have, right. The word have doesn’t say who’s holding the cup. So if I wanna emphasize that I am holding the cup, I like, I have to do something with my voice in order to communicate that.
00:59:29
Now, if you have a language like Spanish, it’s different because it’s yo tengo or, or tu tienes. Like the word itself is different depending on who’s holding the cup. And that’s also why in Spanish, usually the word I or you is not in the sentence, because the verb itself just explains who’s doing the holding. And that leads to something that’s actually a little bit interesting because, if I wanna emphasize something in English again, like I have to use my voice. And the same with Dutch by the way. But in Spanish, if I wanna emphasize that I’m holding a cup, I just have to use the word yo, which means I, together with the verb conjugation. And that means that in Spanish, you can whisper emphasis, which you cannot do in English because like, you know, there’s all sorts of stuff like this that, if you like, language is definitely more interesting than you might think. And, and having just some of this knowledge, especially if you just dip a little bit in the realm of non-English NLP, it’s just super useful. Like, imagine being completely oblivious to the fact that you don’t have to use the word you, or we, in order to communicate, who’s acting on the verb, like, you can imagine like a couple of use cases where text classification becomes a whole lot harder when there’s stuff like that happening. 
Jon: 01:00:48
Wow, so interesting. Yeah. There’s so much here that I wanna dig into with respect to linguistics. I feel like it’s a big unexplored space for me. I didn’t, I didn’t formally train in NLP, but that is what I’ve been doing every day for the last eight or nine years. And, 
Vincent: 01:01:08
Is there a person with a background in linguistics, like in your team or company? 
Jon: 01:01:13
No. 
Vincent: 01:01:14
Might be fun. So I’m a little bit privileged here because I do have to, like, at Explosion, we have like a couple of, we have more than one PhD in linguistics, right? Like, there’s, there’s knowledge around, and we have the same thing at Rasa. So I was definitely in a very luxurious position to sort of, you know, you kinda learn from folks around you, cuz I have a different way of thinking about certain types of problems. But I, like, I do think in general, if you are a very serious NLP company, it would be a bit weird if there wasn’t a linguist around. Just, just, I mean, that’s, well that’s my feeling after having been exposed to linguists for a while. Like, yeah. And again, like exceptions always exist. It’s another, a general claim or anything like that, but, linguistics if only for the fun of it, it’s a very interesting field. More interesting than I would’ve thought. So I’m definitely happy I’ve been exposed to, a bunch of linguists in my career at this point. 
Jon: 01:02:03
I’ve, I’ve taken a note down here to stop using XGBoost on my raw character strings. 
Vincent: 01:02:08
That, that sounds like something that, might, so, might wanna think more about the linguistic complexity of the task, when designing an algorithm. That’s, I, and I also, also wanna stress like I’m by no means a expert on linguistics and, again, like the humble experience that I do have is that I’m surrounded by people who are right? So, there’s definitely like some privilege in this that I do wanna acknowledge. 
Jon: 01:02:31
Cool. Well, in addition to being super, super humble, you are also super funny. So, you’ve been an organizer of PyData Amsterdam, and done lots of amusing PyData talks, including one about data science being a profession of solving the wrong problem. You’ve got another great one titled How to Constrain Artificial Stupidity. And I found out prior to recording that these are two of your personally favorite talks. 
Vincent: 01:02:58
Yeah, yeah. I’m pretty, I’m pretty happy with those talks. Yeah. 
Jon: 01:03:03
We’ll, we’ll include links to those in the show notes. Do you wanna fill us in a bit on this constraints topic? 
Vincent: 01:03:11
Sure. Yeah. So, the, one thing that helps to mention here is that, I I skipped around in college, and no, so like, so there’s a couple of reasons why, like let’s say that you have a very, very good algorithm. It has 99% accuracy on whatever task. Like one awkward thing that might happen, right, is that people start trusting it blindly. And that’s not necessarily the algorithm’s fault, but you do have to kind of wonder, well, maybe the world changes, but if no one is checking on that, this, this whole blind faith and algorithms thing is definitely something to watch out for. But there’s also like subtle that can still go wrong. So for example, let’s say have a very highly accurate model, that does classification. And let’s imagine, you know, we have a 2D plot where in the middle there’s green dots, red dots and blue dots, let’s say.
01:04:06
And the algorithm is very good at separating the blue dots from the red ones, from the green ones. You know, if you go to the scikit-learn landing page, you see, you might, you might have seen this image, and let’s say that we really have a good accuracy on this. Well, then something you might wanna do is you might wanna say, well, hang on, maybe the model can be uncertain sometimes. So what I wanna do then is when the model’s uncertain, which is usually like the near the boundary region of these, of, of these two points of, of two colors, maybe we wanna say, oh, let’s not, let’s not automate a decision there because we are uncertain that’s something you might wanna do. Okay. And we have a very, highly accurate model. So then when it’s uncertain, it’s probably also going to do something good for us there then. Right? That’s good. 
01:04:47
But here’s the awkward thing, if we think about the red dots, right? You gotta imagine that there’s like a region where we are certain, and if you think about how these algorithms sometimes work, there’s usually like a binary line that separates the red dots from the rest. And on one side it’s basically saying it’s not red. And on the other side it’s saying the further away on that side you are, the more red it gets. So now I can say, okay, let’s imagine that we have a new data point coming in that’s miles away from any data we’ve seen before, but it’s definitely in that red region according to the boundary line. Well then it’s unlike anything you’ve seen before. So any algorithm that says I’m super certain it’s red will be wrong.
Jon: 01:05:27
Right. 
Vincent: 01:05:29
This is, I apologize a little bit because this is way easier to explain with like a visual, 
Jon: 01:05:33
Yeah. Visuals. But I guess like the general idea is that a model that is very accurate on your training data when you have out of sample data points, in reality it can become, it can act extremely confidently about those out of sample data points. 
Vincent: 01:05:52
But I say, but the same argument you can have for your validation data set too, as long as whatever comes in in production is different than whatever you trained and benchmarked on, right? Accuracy is not enough even on the test set. And there’s also like the, and there’s lots of these experiments that you can do. And my blog has a couple of these benchmarks where, on your training and validation set, random force is better, but then you increase the noise on your data and then maybe a logistic regression is better. It kind of depends on your definition of the problem. But the main thing I really like to drive home is just this idea that, when you report an accuracy number, it is just a number, which is a huge reduction of what happens in reality. And it comes from this academic background where if you have a data set and you wanna run a benchmark, you need a number to compare, but always remember, that CSV file you have on disk is not the same as reality. And then reducing that down to like a single number we optimize for is skipping a whole lot of aspects of your problem, maybe. So feel free to take some of those numbers with a grain of salt. You can run a very mature benchmark, but it’s hard and you wanna think about it a couple of times before you pick an algorithm based on the a number you ran somewhere. 
Jon: 01:07:02
For sure. You need to be investigating individual cases. And thinking about what’s going on here, like what’s happening in situations where your algorithm is most confident or least confident or when it most confidently mises something, what’s happening in these situations can be extremely interesting because if you just follow, there are all kinds of, in my years, deploying models into production, there are countless scenarios where if I had gone by accuracy or area under the curve metrics alone, I would’ve deployed disastrous models in real life because they were doing funky things. 
Vincent: 01:07:43
It’s stuff like, hey, I’ve built a chatbot and the performance on my data is very well. Why is someone talking French to it? I’ve only benchmarked it on English, you know, it’s, there’s all these axis of problems that, go well beyond an accuracy number. I might write a book at some point called accuracy is not enough. That’s like the main thing I wanna communicate here, because there’s so many things that can still go wrong, even when the accuracy seems amazing. 
Jon: 01:08:08
Cool. Well, in this episode, you have impressed me with a vast amount of knowledge that you have across a broad range of areas, and it leads me to want to ask you my favorite question that I rarely ask. Cause I think you’re gonna have an interesting answer. So, thanks to a whole bunch of tailwinds that we have in data science and AI, so like ever cheaper data storage, ever cheaper compute, ever more abundant sensors, interconnectivity between people and data modeling innovations, technology in our space is advancing at an exponentially faster pace each year. And, you know, it seems like every couple of months lately this crazy new foundation model comes out that is doing things that are far beyond what I was expecting we’d be able to do at this time. So given this, what excites you about what could happen in your lifetime with AI? Like, what do you hope to look back on when you retire? 
Vincent: 01:09:16
So, I mean, this is a bit of a weird example perhaps, but, back when I was at Rasa, like again the chatbot company, we had this, I think there were students, I could be wrong, but there were these people that made a chatbot for, fire disaster moment. Like when it’s fire season in, in California, you know, you wanna get a, you wanna get good information. And what they did is they just made a chatbot and it was pretty simple. You would texted your location and then I would figure out, what, like what local government websites would have the information you’d need. And that, and if I recall it correctly, they were also able to just send you, oh, you gotta evacuate now. Like, that was part of the service, so to say. And looking at that, you do kind of go, oh, geez, that sounds useful, 
01:10:07
But it was extremely like tech. Like, yeah, there was an intent classifier in there in the mix, but it was basically saying, “Hey, where do you live?” Fuzzy matching, look up in a database and we now provide you with a useful service that might save your life, right? And you kinda look at that and you kind of go, well, gee, imagine that they were worried about reinforcement learning instead, like, wouldn’t that have been like a huge distraction? So like, yeah, you can kind of, you can dream about the future and yeah, there’s cool stuff happening. Sure. But how about like, my attitude is always like, okay, but two feet on the ground now, like, what problem actually gets solved? I think that’s a way more interesting thing to think about. And also like, there’s so many of these like anecdotes from like, this is one that I gave in the, one of the talks, but like there’s one, from the World Food Organization. 
01:10:55
So the thinking there was, you have these regions where there’s request, for food. So they might say, we need chicken, we need lentils, we need rice, because you know, people are, need of food here. And, and you know, there were people doing the logistical planning for that, trying to get it as cheaply as possible. They’ve been added for like 10 years, I think, at that point. And then someone just said like, “Hey, maybe you’ve defined the problem in the wrong way. Maybe they’re not necessarily interested in rice and lentils, they’re just interested in nutrients. So if you can give them pasta and beans, that will be fine too.” I think, and again, just from redefining that problem, they were able to save like 5% in cost of logistics, which is for a problem that people spent 10 years on a crazy high number, right?
01:11:40
But, you know, maybe we don’t have to care too much about the tech if the application is just badass, right? And so the thing is, I’m not necessarily too worried about the I mean, there’s definitely concerns, right? Like, I don’t necessarily like the fact that you can use OpenAI for spam. It’s definitely like a very legitimate concern and disinformation stuff. People ought to think about that governments too. So that to, to be concerned with that I think is fine. But if I were to plan my personal career, how about we just solve worthwhile problems and not necessarily worry about the tech we need for that? Like, it seems like a better life to me. And again, this is like super personal, but just focus on the problem, find a really interesting, fun problem, maybe find one with low hanging fruit, that’d be great, but there’s so many problems, right?
01:12:26
Like maybe don’t, I don’t care how you, how you address it, but if you can, if you can remedy some serious issues, that’s great. Like, I don’t necessarily care how you do it. You don’t need deep learning most, like all the time for any of that. But I think, I don’t know that, like, I think that that’s like an idea that seems to get forgotten into the mix when people focus on like new techniques popping up, right? Techniques can be cool, but only if they solve a problem. And I would be more concerned about the problem that you’re solving, and understanding that very well, than to say, “Hey, I’ve got this very interesting algorithm. Does anyone have a problem for it?” 
Jon: 01:13:02
Nice, yeah, that’s a good 
Vincent: 01:13:03
I don’t know, that’s like my reaction when I hear questions like that. 
Jon: 01:13:06
No, it’s a really great and really sober perspective. And so a bit closer to home and a bit closer in time. What’s next for Explosion specifically in the near future? 
Vincent: 01:13:21
So the thing is, like, there’s a bunch of cool things that, so I’m on the Explosion slack, and I see tons of stuff that I can’t talk about, yet. 
Jon: 01:13:28
Tell us all of those, tell us all of them. Everybody who’s listening promises they won’t tell anyone else. 
Vincent: 01:13:34
Yeah. So, no, but like the main thing I can say is it’s just there’s cool stuff in the pipeline, like, all like some open-source stuff as well that I’m super eager to sort of share once it’s out there. Give us a follow, there’s gonna be cool stuff. And also at some point I can share stuff that I’ve been working on. Part of the stuff that we are working on right now are those OpenAI recipes. And I am also part of that team at the moment. So like, if there’s feedback on that, like you can fire away, on Twitter and do all that for me. But there’s just cool stuff in the pipeline, follow us and you’ll know soon. 
Jon: 01:14:10
Nice. So, what are the best ways to follow you and Explosion? 
Vincent: 01:14:16
I mean, you can follow us on the LinkedIns and the Twitters. We also have some, Mastodon accounts. I’m on the Fosstodon one, Explosion is on the Sigmoid one. I think we also have a mailing list, but, basically the way it also works is just follow anyone who works there. We tend to like retweet or [inaudible 01:17:31] or, we try to remind everyone of the cool stuff we’re working on. So if you follow any of us, you’ll be in the loop. 
Jon: 01:14:44
Nice. And then tell us just a little bit about that Fosstodon thing. So that’s, like, I’ve heard of Mastodon, it’s an open-source implementation of a micro blog like Twitter. 
Vincent: 01:14:54
So imagine that like, you have twitter.com and you have, twitter.com, but I’m making up the name, both same user interfaces. It’s just that you have a account on one of those two servers, and not the other one. And that’s kinda like email. So the thinking here is that maybe we can have a Twitter for open-source-y developers, and you can still follow people from other services, but like the hub is for people with a common interest. So I’m on the Fosstodon one, which is a little bit more focused on open-source. And you’ve got another one, Sigmoid is the one I think that the Explosion account is under. That’s a little bit more for NLP-focused specialists and or hobbyists. A little bit academics as well, if I’m not mistaken. But they, you can still follow people on other servers, but the idea is your main source of information is gonna come from within the server and you kind of try to find your crowd. That’s kind of the thinking there. It’s also like a alternative to Twitter in times when it’s not necessarily certain what’s gonna happen there. 
Jon: 01:15:52
Indeed. Cool. All right. So we’ve already covered how to follow, which is usually my ultimate question. My pen ultimate one is usually, do you have a book recommendation for us? 
Vincent: 01:16:07
So I think there’s two books that I tend to recommend to a lot of people. One of them is a kid’s book. I don’t know if they have them in the US but like, there’s a Dutch book called De Telduivel, which means the Counting Devil, which is basically like a fairytale about a boy who’s afraid of maths. And then in his dreams comes the counting devil, who then, they’re kind of like in a almost cute Disney kind of way, gives him like fairy tales that, that basically teaches the kid about numbers. And it’s pretty fundamental maths too. It’s kind of like a funny little thing. And I read that when I was eight and, everyone in my maths class who got straight A’s read that thing as a kid. So if you have kids of like that age, that might be a really cool gift. 
01:16:50
Just, like, wanna give that shout out. The second book I might recommend is, there’s an operations researcher from, the eighties called, I believe John Ackoff. Some of the best talks ever are from that person also because, back in the eighties he was writing papers like, the Future of Data Science has Passed. I’m sorry, the Future of Operations Research has passed. And they gave all sorts of reasons why algorithms fail in production. And you know, half of those is reasons applied today to data science. Like he was definitely way ahead of his time. Systems thinking is also something that he, I think, founded. And he has this one book called The Art of Problem Solving, where there’s just a bunch of anecdotes from the eighties where, I don’t know, I just got a lot of inspiration of like lateral thinking out of that.
01:17:38
Not every anecdote is as amazing, but, the, I don’t know, I just found that to be very, very refreshing. You typically have to go to an archive in order to buy it. It’s kind of a rare thing. But if you’re interested in like, I don’t know, like John Ackoff is definitely a person worth to Google. And he’s written a couple of like, good papers and if you can get your hands on the book, that’s also grand. But I think there’s good YouTube videos too. John Ackoff, old school, nice operations research person.
Jon: 01:18:08
Good, great recommendations. And I’m not surprised given how the rest of the episode has been. Vincent, this has been an awesome episode. Thank you so much for coming on and I can’t wait to have you on again sometime in the future. 
Vincent: 01:18:22
Sure. Thanks for having me and, enjoy your day. 
Jon: 01:18:31
Boom. What a mind-blowing conversation. I learned so much from Vincent and had an absolute blast doing it. In today’s episode, Vincent filled us in on the super practical OpenAI prompt recipes he’s been developing. As part of his work on the Prodigy data annotation tool, he talked about how spaCy was devised as a general purpose NLP library packed with helpful pre-trained models. He talked about the Calmcode platform that he created to succinctly introduce data science tools, how he highly recommends the third-party open-source libraries, Deon, Seriously and Interrogate to enable you to be a data scientist with cleaner, more effective code. He talked about how Linguistics knowledge helps him be a better NLP practitioner and how his operations research background, particularly the field’s emphasis on constraints, helps him be a better data scientist in general. As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, and the URLs for Vincent’s social media profiles, as well as my own social media profiles at www.superdatascience.com/659. That’s super data science.com/659. 
01:19:38
If you enjoyed this episode, I’d greatly appreciate it if you left a review on your favorite podcasting app or on the SuperDataScience YouTube channel, and of course, subscribe If you haven’t already. I also encourage you to let me know your thoughts on this episode directly by following me on LinkedIn or Twitter and then tagging me in a post about it. Your feedback is invaluable for helping us shape future episodes of the show. Thanks to my colleagues at Nebula for supporting me while I create content like this SuperDataScience episode for you. And thanks of course to Ivana, Mario, Natalie, Serg, Sylvia, Zara, and Kirill on the SuperDataScience team for producing another stellar episode for us today.
01:20:12
For enabling that super team to create this free podcast for you, we are deeply grateful to our sponsors whom I’ve hand selected as partners because I expect their products to be genuinely of interest to you. Please consider supporting this free show by checking out our sponsors links, which you can find in the show notes. And if you yourself are interested in sponsoring an episode, you can get the details on how by making your way to jonkrohn.com/podcast. And thanks of course to you for listening. It’s only because you listen that I’m here. Until next time, my friend, keep on rocking it out there and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon. 
Show All

Share on

Related Podcasts