SDS 710: LangChain: Create LLM Applications Easily in Python

Podcast Guest: Kris Ograbek

September 1, 2023

This week, host Jon Krohn and guest Kris Ograbek take an unexpected twist! After discussing the intricacies of LLM-based projects and how they come to life with LangChain and Hugging Face Transformers library, the tables turn as Kris ends up quizzing Jon on some curious questions.

About Kris Ograbek
Kris is a builder and content creator. He got into the Large Language Models field in May 2023. His current tech stack includes LangChain, the OpenAI API, Hugging Face, and Streamlit. He believes the fastest way of learning new things is through building and teaching. His motto: “Build something once, make it count 10 times.” That’s why Kris keeps creating small projects and teaching others how to do the same. He turns every project into several pieces of content for LinkedIn, Medium, and YouTube. One of those projects caught Jon’s attention, which led to the recording of this podcast episode. So he made one project count 100 times 🙂
Overview
In another riveting episode of the Super Data Science podcast, Kris Ograbek, a content creator with expertise in LLM-based projects, takes the spotlight. Kris introduced listeners to the power and potential of the LangChain framework, renowned for the development of LLM applications. Through a step-by-step description, he broke down the construction of a chatbot that answers questions about the very podcast’s episodes. As a fun twist, Kris, a longtime listener, took a turn as an interviewer, posing thought-provoking questions to the show’s usual host, evoking answers to questions that listeners might have pondered over.
Kris highlighted the efficiency of LangChain, which boasts more than 80 data loaders, ensuring data readiness for nearly any natural-language application. He stressed the significance of segmenting large natural-language datasets into manageable portions. These are then converted into vector embeddings using tools like the OpenAI Embeddings API. The objective is to ensure the language fits snugly within the context window of an LLM.
As they wrapped up the episode, they spoke about off-the-shelf LLMs, such as GPT-3.5 or GPT-4, which are equipped to address any question interactively and intuitively.

Did you enjoy the podcast?
  • How are off-the-shelf LLMs like GPT-3.5 and GPT-4 reshaping our understanding and interaction with natural language processing?
  • Download The Transcript

Podcast Transcript

Jon Krohn: 00:00:06

This is episode number 710 with the Large Language Model specialist, Kris Ograbek. 
00:00:27
Welcome back to the Super Data Science Podcast. Today I’m joined by the intrepid Kris Ograbek. Kris is a content creator who specializes in creating LLM-based projects, large language model-based projects. And so he does this with Python libraries like LangChain and the Hugging Face Transformers library. And then he uses the projects to teach these LLM techniques to whoever wants to learn them. Previously, he worked as a software engineer in Germany. He holds a Master’s in Electrical and Electronics Engineering from the Wrocław University of Science and Technology.
00:00:58
In this episode, Kris tells us about the exceptionally popular LangChain framework for developing LLM applications. Specifically, he shows how powerful LangChain is by walking us step-by-step through a chatbot he built that interactively answers questions about episodes of the Super Data Science Podcast. How cool is that? And having been a listener to the Super Data Science Podcast for years, at the end of today’s episode, Kris flips the script on me and asks me some of the burning questions that he has for me. Questions that perhaps many other listeners also have wondered the answers to. All right, let’s get to it. Let’s jump right into our conversation.
00:01:37
Kris, welcome to The Super Data Science Podcast. It’s so cool to have you here. You can fill the audience in on more of the details, but my understanding is that you’ve been listening to the Super Data Science Podcast since before I became host. So it’s about three years now that I’ve been hosting the show. Kirill Eremenko was host of the show for four years before that. And yeah, I guess you were listening back in the Kirill days. I think you’ve listened to most episodes, and so we connected on LinkedIn when I took over as host three years ago, and we’ve exchanged some messages. I’ve been seeing your content, and over those years, you’ve become more and more of a content creator and really disciplined about creating content on a schedule. And lately you’ve been focused on creating content related to large language models, kind of hands on, which I thought is really cool. So yeah, so fill me in on what I got wrong about that or maybe fill in some more detail for the listeners.
Kris Ograbek: 00:02:45
You got nothing wrong. Yeah, I’ve been creating content for years. I think I started in 2021, but during that time I really struggled to stick to one niche, and I’ve had 10 different niches in that time. But yeah, since May I’ve been focusing on large language models, and this time I hope that’s the last niche I’ll ever have.
Jon Krohn: 00:03:17
Well, I guess it can evolve, but yeah, you want that to be the one. Because you’ve had posts over the years about habits and the brain. That was a focus for a while, and they’re always so well written. It always shows up very high in my newsfeed because I always click to read more and I often react, so it shows up. And for our listeners, we’ll be sure to include a link to Kris’s LinkedIn in the show notes, and you can look back and see that it stretches back years. Every post is great quality. You have a real knack, Kris, for clearly explaining something and turning it into something that’s very digestible. People expect things in this fast-paced, easy-to-consume format on social media, and you do that very well while also conveying real information, often information that I haven’t seen anywhere else.
00:04:22
So I think you’re doing a great job of it, and as you and I both know, it’s about sticking to the process, and eventually things just grow. You learn from iterating and you get better and better as things go on. Yeah, super cool. Can’t wait to see how that continues to evolve. Also, this reminds me as I’m having this conversation: I think you recently posted, at the time of recording at least, about why LLMs in particular are your niche. I remember you have this framework for it: you’re passionate about it and you have some expertise in this area. Do you know what post I’m talking about?
Kris Ograbek: 00:05:09
Yes, I know which post you’re talking about. And yeah, this comes from Naval because he talks a lot about- 
Jon Krohn: 00:05:19
Naval Ravikant.
Kris Ograbek: 00:05:21
Yeah, Naval Ravikant is one of those people who really, really influenced me as a creator. And he talks about having specific knowledge, which is something that you specialize in, something that you’re probably the best in the world at, because he also says, be the best in the world at what you do and keep redefining what you do until it becomes true. And it is very hard to have this specific knowledge at the beginning because it takes time, it takes a lot of time. And at the beginning, we are only copycats, I’d say. Because when you start something new, especially because this field is so huge and actually hard to enter, I mean, you need to learn a lot at the beginning in order to build something on your own. And for me, also, building is the best way of learning. So yeah, I basically have this natural curiosity to learn those things, and I just believe large language models, or some sub-niche of it, can become my specific knowledge, but first I need to do those repetitions.
Jon Krohn: 00:06:53
And so you specifically, in this post, and I’ll try to remember to include it in the show notes, but basically it starts off with you saying how you’re not interested in creating just ChatGPT content. There’s a lot of that out there. A lot of people are creating ChatGPT content. A lot of people are providing prompt engineering guide PDFs or whatever, but you’re like, there’s a lot of people that can do that, but you have more unique skills. You can do more than just type into a user interface and get responses back from GPT-4: you can call an API. You love Python, LangChain, which we’re going to focus on in this episode, the Hugging Face Transformers library. Compared to the number of people in the world who can prompt GPT-4 in the ChatGPT interface, that’s a relatively unique skillset.
00:07:51
And then you’re really curious about it, open-source LLMs, you’re really excited about it, passionate about it, and so that makes it easy to be focused on this area. So I think that building LLM-powered projects is something in which you already have some specialization, but as you do it more and more, and as this field evolves more and more, eventually, some years from now, you will be truly an expert at it.
Kris Ograbek: 00:08:21
Either in a very deep niche or a combination of some niches, because there are probably two ways of specializing: either you go very deep, or you combine your skillset into something useful. As a generalist, I’ve got many things I enjoy doing, and I know more than the average person about 20 different things. But the challenge is to combine them into something valuable. And I believe that large language models are a field where you can really, really combine this prompt engineering and your domain knowledge and some other interests into something really valuable for others.
Jon Krohn: 00:09:18
For sure, no question. Yeah, I mean, I also have no doubts about your other point about just being knowledgeable about a lot of topics. From following your posts over the years, you’re obviously doing tons of reading and have tons of content out there on lots of different interesting topics related to people performing at their best or having their best habits, that kind of thing. Awesome. So let’s actually jump into the technical topic for this episode, now that the audience is familiarized with you and your connection to the show. The immediate reason why I wanted to have you on the show was because you created this YouTube video, published August 3rd, and I’ll have a link to that video in the show notes. It’s on your personal YouTube channel, Kris Ograbek, and the video is called “Chat with Your Favorite Podcast: The LangChain & GPT-3.5 Guide to QA Over Documents”. And so you can explain this project to us and the main parts of the video, what’s covered in it. And hopefully at the same time you can be explaining what LangChain is. This is something that we talk about a lot on the show, but I haven’t had an episode dedicated to it. So I think that this episode is perfect for that.
00:10:46
So the reason why this video particularly caught my attention, of the videos that you’ve been creating, is because it’s all about creating a large language model tool for being able to chat with a podcast episode. And the podcast episode that you picked was a Super Data Science episode, and it was the one with Harpreet Sahota as my guest recently, right? So it was episode 693, the YOLO-NAS one, right?
Kris Ograbek: 00:11:16
Yeah. Do you know all the numbers? 
Jon Krohn: 00:11:20
No, I have a spreadsheet in front of me. Okay. No, no, yeah, no, I always got the spreadsheet in front of me. That’s the secret when I do the citation. And so I have to really quickly scramble, because I’m speaking and at the same time I’m like, “Oh man, it would be great to be able to cite this episode number.” And so I’m quickly scrolling in my spreadsheet to find. Recent episodes are typically easier. I don’t have to scroll back so far or quickly using the find function in the spreadsheet or something. Yeah. So tell us about LangChain. Tell us about just this project of using LangChain to be able to chat with a podcast, how the idea came about and how it all works.
Kris Ograbek: 00:12:08
Okay, so the idea came from two sources. First is the DeepLearning.AI course, you know DeepLearning.AI from Andrew Ng, right?
Jon Krohn: 00:12:24
Yeah, yeah, Andrew Ng.
Kris Ograbek: 00:12:24
And the website, so Andrew Ng, the co-founder of Coursera, started creating those short courses, because usually Coursera is a paid platform, but he started creating those one-to-two-hour courses in collaboration with people from OpenAI and from LangChain. In this case, he’s created two courses with the founder of LangChain, or co-founder. And one of them is literally Chat with Your Data. So that’s where they explain really step by step how to do that, and they get into detail. And the second source is that me as a creator, as a content creator, and also a heavy content consumer, I want to have a tool or a way of interacting with what I consume in order to basically save time. Because in this case, Control+F is not enough. And so I basically followed the steps from the course in order to create my own project with the data from your podcast. And I’ve experimented with some other solutions like having multiple episodes in one and then feeding the vector database with all of them. But for this video, I just decided to stick with a single podcast episode and test the results I get from it.
Jon Krohn: 00:14:23
Yeah, so basically the premise is, for every Super Data Science episode, our podcast manager, Ivana, painstakingly goes through the transcript. We use an automated tool to generate a transcript from the episode, but there are all kinds of things that it gets wrong because there are so many technical terms in data science that aren’t in its vocabulary. Ivana goes through and fixes all those things. And then we have guests from all over the world, and guests have different accents, so sometimes it doesn’t transcribe people with accents perfectly. Ivana has told me that for a Tuesday episode, which is often over an hour long, it takes her two or three hours to go through the transcript, but if the guest has an accent, it can be four or five hours to go through and clean everything up. But that kind of thing is getting better all the time. I wonder if we should be using something like the OpenAI Whisper API instead. I think that’s supposed to be really good with accents. Anyway-
Kris Ograbek: 00:15:29
Do you know what’s running behind YouTube’s autogenerated transcripts?
Jon Krohn: 00:15:37
I don’t know offhand, but yeah, I don’t know [crosstalk 00:15:42]. Yeah, yeah, I mean, Google’s pretty good at machine learning stuff. 
Kris Ograbek: 00:15:46
They’ve even got some models, right?
Jon Krohn: 00:15:50
Yeah, I think they have a few. So yeah, it actually is pretty interesting. That’s something that you can do as well: for any YouTube video, there’s a way that you can view the transcript. Yeah, so that’s cool. But we also have transcripts on the Super Data Science website for every episode. When I say that you can get everything from the show notes, all the links that we mentioned in this episode, what I mean specifically is that this episode is number 710, so you go to www.superdatascience.com/710, and that’ll take you to the show notes for the episode. And we have the transcript as well that Ivana has gone through, so we’ve got this big transcript. Yeah. So I don’t know, what did you use for your project? Did you use the YouTube transcript or the Super Data Science one?
Kris Ograbek: 00:16:49
I used yours because I knew it was more accurate. And yeah, it was also easier.
Jon Krohn: 00:16:56
So you grabbed the transcript from the episode that I did with Harpreet recently, 693, and then you take that, and what’s the next step? You need to pre-process the data for LangChain, or how does that work?
Kris Ograbek: 00:17:10
Yeah, that’s where the convenience of LangChain comes in, because if you want to chat with your data in the end, the first step is that you need to load the data. LangChain offers 80+ loaders, and they are for various types of inputs: PDFs, text files, sources like Wikipedia, Slack messages, you can think of anything. And they’re all documented on their website, in the documentation, so you can find the whole list of those loaders. And what’s the biggest advantage of that? Loading the data takes you two lines of code. You just import the loader and pass it the source, either the path to your file or a link to a website or anything. And it converts this data into a standardized LangChain Document. So you take any type of input, and as the output you’ve got a LangChain Document, and from this point on, every step will be the same regardless of the type of data you’re choosing.
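[Editor’s note: in code, the loading step Kris describes really is just a couple of lines. A hedged sketch under the mid-2023 LangChain API (newer releases moved loaders into `langchain_community`); the file name is invented, and `langchain` plus `pypdf` are assumed to be installed.]

```python
def load_transcript(pdf_path: str):
    """Load a transcript PDF into LangChain's standardized Document objects.

    PyPDFLoader is one of the 80+ loaders mentioned above; swapping in a
    different loader (web pages, Slack exports, ...) leaves every later
    step of the pipeline unchanged.
    """
    from langchain.document_loaders import PyPDFLoader  # pip install langchain pypdf
    loader = PyPDFLoader(pdf_path)
    return loader.load()  # list of Documents, one per PDF page

# Usage (hypothetical file name):
# docs = load_transcript("sds_693_transcript.pdf")
# print(len(docs), docs[0].page_content[:80])
```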
Jon Krohn: 00:18:39
Nice. Yeah. So one of the great advantages of LangChain, you’re saying, is that regardless of what kind of data you have as an input, there are 80+ of these different kinds of data loaders. So for example, for this one, you used the PDF of the transcript from episode number 693, so there’s this PyPDFLoader that you used, and then boom, it’s in there. And then what, you need to split up the document into smaller chunks?
Kris Ograbek: 00:19:06
And then you use the splitter, and it’s also from LangChain, and LangChain offers several types of splitters. But yeah, the one I’ve used was the recursive, I don’t remember the longer name, but recursive character-
Jon Krohn: 00:19:23
Recursive character text splitter, I’m looking at your code. 
Kris Ograbek: 00:19:27
So it takes care of the fact that, because you split by characters, so let’s say every thousand characters, this one ensures that you don’t just cut a word in half. The separators are basically new lines or white spaces. And then you’ve got this chunk size and the chunk overlap, so that, let’s say, you also don’t want to cut a sentence in half when you finish the split. So you’ve got 150 or 200 characters overlapped, which means the last sentence or two sentences of one split will be the first sentences in the next split. But in your case, it split really evenly, every page into two chunks. So a 40-page PDF is 80 splits of 1,000 characters.
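[Editor’s note: a rough pure-Python sketch of what Kris is describing: fixed-size chunks that break on whitespace, with a character-budget overlap carried from each chunk into the next. This is a simplified stand-in for LangChain’s RecursiveCharacterTextSplitter, not its actual implementation.]

```python
def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 150) -> list[str]:
    """Greedy whitespace-aware splitter (simplified illustration).

    Each chunk holds at most `chunk_size` characters, words are never cut
    in half, and consecutive chunks share roughly `chunk_overlap` trailing
    characters so sentences aren't lost at chunk boundaries.
    """
    words = text.split()
    chunks: list[str] = []
    current: list[str] = []
    length = 0  # running length of " ".join(current), plus a little slack
    for word in words:
        if current and length + 1 + len(word) > chunk_size:
            chunks.append(" ".join(current))
            # carry the trailing words of this chunk into the next (overlap)
            overlap, size = [], 0
            for w in reversed(current):
                if size + len(w) + 1 > chunk_overlap:
                    break
                overlap.insert(0, w)
                size += len(w) + 1
            current, length = overlap, size
        current.append(word)
        length += len(word) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks
```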
Jon Krohn: 00:20:32
And the purpose of this splitting is so that I guess each of these chunks, so you’re talking about 80 chunks created from a single podcast episode of Super Data Science, and each of those chunks then gets vectorized. Is that the next step? 
Kris Ograbek: 00:20:48
Yeah, that will be the next step: creating vector embeddings and then feeding them into a vector database where you store them-
Jon Krohn: 00:21:02
Yeah, so we have 80 vectors. So each one of these 80 chunks gets mapped into a different location in a high-dimensional space. So that then whatever question somebody asks, we map their question also into that same high-dimensional space, and we say, okay, what are the most nearby chunks? Because those are the chunks that are most likely going to contain the answer to their question or the next response to their chat prompt. 
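[Editor’s note: the retrieval idea Jon sketches here can be illustrated with toy vectors: embed the question, score it against every chunk with cosine similarity, and keep the top k. Real embeddings come from a model and have hundreds or thousands of dimensions; these 3-dimensional vectors are made up purely for illustration.]

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Indices of the k chunks whose embeddings are closest to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Hand-made 3-D vectors: chunk 0 is "about habits", chunks 1-2 "about embeddings".
chunk_vectors = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.3], [0.0, 0.7, 0.4]]
habits_question = [0.95, 0.05, 0.0]
print(top_k(habits_question, chunk_vectors))  # → [0, 1]: the "habits" chunk ranks first
```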
Kris Ograbek: 00:21:28
How hard is it to understand how it works? Because I had this breakthrough moment when I went through that LangChain course, seeing what’s actually happening. I think up until that point, I just knew those vector embeddings and vector databases do the magic, but without actually understanding what’s happening. Is it hard? Because you sometimes teach those things, right, or not?
Jon Krohn: 00:21:59
Yeah, I mean, it is a lot easier to explain visually. So I often do something that’s kind of easier to help people understand how these embeddings work. It’s kind of hard to visualize on a document level because the document has so many different words and concepts in it. So the location in this high-dimensional space is quite abstract because it represents the meaning of everything in that chunk, that half-page chunk that you’re describing. But an easy way to imagine it is that you can do the same kind of thing on an individual word level. So something that I often do in my teaching when I’m teaching natural language processing, as well as in my book, Deep Learning Illustrated, is I have Jupyter notebooks where we take a corpus of books. I can’t remember now exactly off the top of my head what I usually use, but there are online repositories of free books.
Kris Ograbek: 00:23:03
Gutenberg. 
Jon Krohn: 00:23:04
Gutenberg, Project Gutenberg, yeah, thank you. Exactly. So with Project Gutenberg, you can download books, and you could download a relatively small number of books for a fast demo. So I think I use something like 20 or 30 books. These are classic books that are no longer under copyright, so you have the right to access them. And so yeah, I grab 20 or 30 of them for the purposes of a class demo, and then we create vector embeddings of the individual words. And again, this is something where I use a lot of visuals, and maybe it takes me an hour or so to really explain in detail how this works. But at a high level, it happens through processes like this very popular algorithm from Tomáš Mikolov, now over 10 years old, called word2vec, which converts all of the words in your corpus, your body of books (corpus is just Latin for body). So from these 20 or 30 books, you create word vectors.
00:24:13
And it’s interesting with word vectors, because the way that we’re able to map them into a high-dimensional space is proximity to other words. This is something that we’ve known for about a century. There’s an Austrian philosopher, Wittgenstein, who had this idea already a century ago: that a word tends to be the average of the words around it. And this is something that’s kind of weird to imagine. When you only look at some small examples, it doesn’t seem to be true. But when you do this over a very large number of words, it turns out that you can, not kind of predict, you can predict what a word is going to be based on the words to the right of it and to the left of it.
00:25:10
So using that kind of idea, you’re able to convert words into this high-dimensional space, and you decide how many dimensions there are. For the purposes of being able to visualize this, it’s easy to imagine creating a three-dimensional space. Now, that’s not practical; that doesn’t have enough nuance to be useful as an actual word vector or document vector embedding space. But you could imagine it in two dimensions or three dimensions. And then, once you’ve converted all of your words into this vector space, if you find the word “pants” somewhere in your vector space, there are likely to be words like “shirt” and “shoes” and “hat” nearby. If you find the word “Monday”, then other days of the week are going to be nearby. And not far away from those days of the week, you’re also going to find all the months. So the months will all cluster together near the region where you find all the days of the week.
00:26:17
So then over the whole space, for the purposes of visualizing it, you’re imagining a three-dimensional space. But in practice, this is a many-dimensional space, and how many dimensions you have is a hyper-parameter that you decide as the creator of the vector space. The more dimensions you have, the more computational complexity there is, but potentially the more nuance is captured in there. So there’s this kind of trade-off between nuance and compute complexity. So yeah, I’ve given you a very long answer to that question.
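[Editor’s note: a hedged sketch of the word-vector demo Jon describes, training word2vec with the `gensim` library. `gensim` is assumed to be installed, and the corpus, dimension, and window are illustrative choices, not the exact settings from his notebooks.]

```python
def train_word_vectors(tokenized_sentences, dim=64):
    """Train word2vec embeddings on a tokenized corpus.

    `tokenized_sentences` is a list of token lists, e.g. drawn from 20-30
    Project Gutenberg books. `dim` is the hyper-parameter Jon mentions:
    more dimensions capture more nuance but cost more compute.
    """
    from gensim.models import Word2Vec  # pip install gensim
    model = Word2Vec(tokenized_sentences, vector_size=dim, window=5, min_count=2)
    return model.wv

# Usage sketch (corpus is a hypothetical variable):
# wv = train_word_vectors(corpus)
# wv.most_similar("monday")  # expect other weekdays to score highly
```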
Kris Ograbek: 00:26:50
I know. 
Jon Krohn: 00:26:51
But yeah, it’s the same with what you’re describing. So you have these 80 chunks, and it’s the same kind of idea. It’s not a word; it’s kind of the average meaning of all of the words in that chunk that gets embedded into a location. So at the beginning of this episode, you and I were talking about habits and that kind of thing, and so those chunks will be in one region of the document embedding that you would create if you used LangChain on this very episode. Whereas this conversation that we’ve been having about vector embeddings would be in a different region of the vector embedding space. That’s a [crosstalk 00:27:30] meta.
Kris Ograbek: 00:27:30
Yeah. So in this case, if you wanted to chat with this podcast and you had a question about habits, then your query will also be embedded, and then it’ll find the most similar chunks, and they will be on the first pages of this transcript. Unless I just confused the-
Jon Krohn: 00:28:03
Yeah, now we’re talking about habits in the context of embeddings, and we’re really making it complex, but still, it should kind of work. We didn’t talk about habits that much at the beginning, so maybe it brings up that later context too, and that context can be relevant. So then you could chat with this podcast transcript and ask: what example did they use to describe embeddings, and did they use habits? That could be a question.
Kris Ograbek: 00:28:36
Yeah, yeah, exactly. 
Jon Krohn: 00:28:36
And then it’ll find that region, that portion of the transcript, and then it can respond: yes, they did talk about habits as an example of one of the kinds of regions in the vector space.
Kris Ograbek: 00:28:52
Yeah. And in my project, because I’ve used the OpenAI embeddings, the default one, I think the vectors are over 1,000 dimensions, over 1,000 elements long. I haven’t checked exactly, but they’re pretty long.
Jon Krohn: 00:29:13
Yeah, I think that’s right. I think the OpenAI embeddings by default are just a little over a thousand dimensions. From memory, I think that’s correct. So yeah, you end up with this thousand-dimensional space, which, for this number of documents, is probably overkill: putting 80 documents into a thousand-dimensional space. But yeah, I guess with the way that these embeddings work, you’re going to get a lot of nuance. It’s going to work very well, for sure.
Kris Ograbek: 00:29:49
Yeah, because there are, I think, many hyper-parameters that I don’t really know how to optimize yet, like this split length. I just said 1,000, or 1,200, I don’t remember exactly right now, but maybe I could have used the whole page to increase the size of the chunks but decrease the number of chunks. So that’s something definitely that we all should experiment with, because I always encourage people to take what I talk about and just experiment by themselves. I also always share all the code on my GitHub and so on.
Jon Krohn: 00:30:39
Yeah, and we’ll be sure to include a link to your GitHub as well in the show notes, so people can check out the code and the video for themselves. But so this kind of gives the idea: LangChain makes it very easy to load the data in and to split the data into chunks of a reasonable size. And again, yeah, that’s a hyper-parameter that people need to figure out, exactly what size to use based on their particular use case. And then you convert each chunk into a vector embedding, for example with the OpenAI Embeddings API, which is a great choice-
Kris Ograbek: 00:31:10
Only you have to pay. 
Jon Krohn: 00:31:13
Oh, right, right. But I mean, yeah, I guess you would have to pay. But for this kind of document size, it would be a fairly small cost. It must’ve been dollars.
Kris Ograbek: 00:31:26
Yeah, not even a dollar. No.
Jon Krohn: 00:31:29
Yeah, not a dollar. Yeah. But yeah, if somebody is a data scientist at a law firm and you’re going to have to convert every clause of millions of documents that the law firm has, then it could start to get pretty expensive. So yeah, you might want to develop your own in-house embeddings for that, which I guess is another story for another day. So then, once we have these embeddings, I guess LangChain also makes it very easy to retrieve the relevant documents. And then that’s where the LLM finally comes in, right? It’s at this final point. So you can explain that.
Kris Ograbek: 00:32:11
Yeah, so then we need to have our query, because that’s the whole point: a question or a query that goes into that vector database. And LangChain also has these dedicated modules for that, and they call it RetrievalQA, I think, a chain, I’m not sure. I don’t have it in front of me.
Jon Krohn: 00:32:35
Yeah, yeah, I’ve got that exact line of code coincidentally up on my screen right now. So it’s in LangChain’s chains module; it’s called RetrievalQA.
Kris Ograbek: 00:32:47
Yeah, exactly. And that’s where you initialize it with some parameters. I think you give it the splits, the vector database you’ve created previously, and the prompt that you want to use. And in this prompt, you basically inform your large language model that you’ve got a question: please answer this question based on the context I’m going to provide you. And the context comes from the vector database, from this similarity search, which is probably a huge topic itself, but based on your query, you get the most similar splits. And the number of splits is also a parameter that you give. It can be 3, 4, 5. Also something to test. And in RetrievalQA, you also define the large language model you want to use, and LangChain works perfectly fine with OpenAI, so you can use GPT-3.5, or 4 if you’ve got access.
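[Editor’s note: putting the whole conversation together, a hedged end-to-end sketch of the pipeline (load, split, embed, store, retrieve, ask) under the mid-2023 LangChain API. It assumes `langchain`, `pypdf`, `chromadb`, and `openai` are installed and `OPENAI_API_KEY` is set; the file name and question are invented for illustration, and this is not Kris’s exact code.]

```python
def build_qa_chain(pdf_path: str, k: int = 4):
    """Build a question-answering chain over one podcast transcript.

    `k` is the parameter Kris mentions: how many of the most similar
    splits are retrieved and handed to the LLM as context.
    """
    from langchain.document_loaders import PyPDFLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.vectorstores import Chroma
    from langchain.chat_models import ChatOpenAI
    from langchain.chains import RetrievalQA

    docs = PyPDFLoader(pdf_path).load()                      # 1. load
    splits = RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=150).split_documents(docs)  # 2. split
    vectordb = Chroma.from_documents(splits, OpenAIEmbeddings())   # 3. embed + store
    return RetrievalQA.from_chain_type(                      # 4. retrieve + answer
        llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0),
        retriever=vectordb.as_retriever(search_kwargs={"k": k}),
    )

# Usage (hypothetical file name and question):
# qa = build_qa_chain("sds_693_transcript.pdf")
# qa.run("What did they say about YOLO-NAS?")
```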
Jon Krohn: 00:34:07
Yeah, that’s awesome. So very cool. So yeah, I know that in your video you used GPT-3.5. Why did you choose to use that instead of GPT-4? Was it because it’s faster and cheaper?
Kris Ograbek: 00:34:22
I have waited for so long for the GPT-4 access, really. 
Jon Krohn: 00:34:29
Oh. Yeah, you don’t have API access yet. That’s wild. Okay, so- 
Kris Ograbek: 00:34:33
Now I’ve got it, but I didn’t at the time.
Jon Krohn: 00:34:36
Yeah, yeah. Oh, cool man. Well, all right, so this kind of approach, I imagine, based on everything you’ve told me, is extensible. We could theoretically do this over all of the Super Data Science transcripts. We would have more chunks, so we’d have more points in our embedding space. But you could have a chat with all of the episodes of Super Data Science.
Kris Ograbek: 00:35:02
Yeah. But then the challenge is that if you want to aim at a particular episode, you probably can’t, because in the separate chunks you don’t really have that information. I mean, you do have information about which episode it was, or maybe which file you’ve used for that, which transcript, but in the query itself, you don’t really talk with the metadata. I mean, there are options for that, but-
Jon Krohn: 00:35:40
So that is an extra complexity. Well, I think you and I should talk after the show, Kris, because maybe we could have a project. Maybe Kirill and I would be interested in funding a project where you try to do something like that: we create an application where our listeners could come in and talk to any of the episodes of the Super Data Science Podcast, or all of them together. I think that could be a fun idea. So let’s definitely talk after the show and see if that can happen. Yeah, I would love to do that. So yeah, I’ve made a note.
Kris Ograbek: 00:36:13
Cool. 
Jon Krohn: 00:36:14
Awesome. All right, so this has been great. This has definitely been a detailed introduction to how to use LangChain and how Q&A works over a large document. Very cool. So hopefully our hands-on practitioners now have a sense of how they can go and do this themselves and how it works. And even for folks listening who maybe aren’t data science or machine learning practitioners, hopefully the kind of visual explanations that we provided of embeddings and that kind of thing help people understand how these large language model Q&A tools work over a very large document or a very large set of documents. So Kris, before you and I started recording, you said that you had some questions for me. You wanted to turn the tables on me; having been listening to the show for so many years, you had some things you wanted to ask.
Kris Ograbek: 00:37:12
May I make a small addition to what we talked about? Because I think-
Jon Krohn: 00:37:15
Oh yeah, for sure. 
Kris Ograbek: 00:37:16
For listeners who don’t have so much experience with large language models, it’s important to say why we even need vector embeddings. I mean, just to mention a few words about the context length, because that’s our bottleneck, right? GPT models like 3.5 and 4 have a pretty long context length, 4,000 or 8,000 tokens, which is really decent. But if we’ve got really many, many pages or large data, we can’t feed our large language models with that because it exceeds the context length. And we can think of the context length as the memory of the model. So if we tried to feed in 20,000 tokens, which is maybe 15,000 words, ChatGPT itself basically wouldn’t allow it, and your large language model would forget what you talked about at the beginning. That’s why we need to overcome this bottleneck with vector embeddings. It’s the smart way of taking our data but feeding only the relevant data to the large language model.
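The chunking step Kris is describing, splitting a large text into overlapping pieces so each piece fits comfortably inside the context window, can be sketched as a simple word-level splitter. The sizes and word-level granularity here are arbitrary illustration choices; LangChain’s own text splitters are more sophisticated (recursive, token-aware):

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into overlapping word chunks.

    Each chunk holds at most `chunk_size` words, and consecutive chunks share
    `overlap` words so that sentences cut at a boundary still appear whole in
    at least one chunk.
    """
    assert chunk_size > overlap, "overlap must be smaller than chunk size"
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk is then embedded separately, so only the handful of relevant chunks, rather than the whole transcript, needs to fit in the model’s context window.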
Jon Krohn: 00:38:49
Yeah, exactly. Very well said. And I’m glad that you mentioned that, because it was a key aspect that I glossed over: why do we need to be creating all these chunks anyway? And this is why, because the bottleneck for this Q&A is whatever the context window is of the large language model. So yeah, I think with GPT-4 now, it’s like an 8,000 sub-word-token context window, which is probably something like 6,000 words. And it’s interesting, there are things now like Claude from Anthropic, which apparently has a 100,000-token context window now, which, I mean, you could just put this whole transcript in. But, I mean, I haven’t experimented with context windows that large myself personally, but I think that there are still trade-offs; the attention over that entire 100,000-token window is not necessarily going to be as good as breaking it up into chunks and focusing on the most relevant parts specifically.
Kris Ograbek: 00:39:46
Okay. Okay. I’m glad you said that because I’m not privileged enough to use Claude 2. It’s available only in the USA and UK. 
Jon Krohn: 00:39:55
Yeah, well, full disclosure, I haven’t used it yet either. But, yeah, I wouldn’t be surprised, and don’t quote me on this, but I wouldn’t be surprised if breaking the text into chunks, finding the most relevant chunks, and providing them in a smaller context window could still lead to better results than something like Claude 2. But who knows, I could be wrong. Yeah. Do you want to do those questions for me now, Kris? We covered all the key technical items for this episode.
Kris Ograbek: 00:40:29
Let’s start with the podcast itself. Okay. Because usually when you start something, like myself, I’m a small creator because like I said, I’ve changed my niches so many times. I’ve got 50 subscribers on YouTube, so if I create bad-quality content, nobody sees it. My videos have 100 views. And usually you need those repetitions, those iterations to become good at something. And people start noticing you only when you are already good, and then they call you the overnight success or something, you know what I mean? But you’ve taken over a very, very popular podcast. So it means you had no chance to be bad because I mean, you had to be great from the start. How was it?
Jon Krohn: 00:41:35
Yeah, there was certainly, I guess I had some butterflies, especially the first few times. The reason why Kirill asked me to be host of the Super Data Science Podcast was because I had piloted my own show, which was called the Artificial Neural Network News Network, A4N, and I’ll include a link to that in the show notes for people who are curious, for historical reasons. We created A4N in February 2020. The first episode was recorded that month, and the idea for the show was that it was a newsroom. So I was the anchor of the show, and my colleagues, three of the data scientists from my team, whom I still work with today all these years later, the four of us were sitting around a big table, and we were talking about the news. It was deliberately cheesy, kind of supposed to be like an 80s newsroom, cheesy music. We’d have our headline story, and then I’d say, “All right, and now over to Andrew with sports.” And Andrew would talk about a cheating story in a Kaggle competition. Then I’d go, “All right, over to Vince with weather.” And Vince would talk about how climate change is being tackled with machine learning. So this kind of thing. And I really loved that format. But then in March 2020, the pandemic hit in New York. We were supposed to record the second episode, the four of us again in person doing the newsroom thing, but then we were going to have this special guest, Ben Taylor, who now goes by Jepson Taylor.
00:43:16
And with the pandemic hitting, Ben’s wife was like, no, you’re not allowed to fly to New York in the middle of a global pandemic to film this episode. And then eventually no one even wanted to come into the office and record the episode in person. So we went to this more standard podcast interview format, like you and I are doing right now. So in the second episode of the show, Ben was our first guest, and we did five episodes in total. And on one of those episodes, Kirill Eremenko was the guest. Kirill had asked me to be on the Super Data Science Podcast around that same time. My book, Deep Learning Illustrated, had recently come out, so Kirill reached out to me and said, “Hey, I know about your book, do you want to be on my podcast?”
00:44:07
And I was like, of course I want to be on the Super Data Science Podcast, one of the most-listened-to in the industry. So I came on his show, and at the end of it I asked, “I’ve just started this podcast; Ben’s been a guest on the show. Do you want to be a guest on the show?” And Kirill said, “I’ve made over 400 episodes as the host, but I’ve only once before been a guest on someone else’s podcast.” So that was an honor for us to have him on, and I think he had a good time. We also had a funny thing where, impromptu, we made an ad and put it in the episode as a joke. So there’s a fake ad in the episode of A4N with Kirill in it, this thing about an app for finding toilet paper, because in the pandemic-
Kris Ograbek: 00:44:56
Oh my god [crosstalk 00:44:57]
Jon Krohn: 00:44:56
You couldn’t find toilet paper. So yeah, it was this really silly joke ad, and Vince Petaccio did an amazing job doing the main voice in that fake ad. So, I don’t know, I guess it was something like six months later that Kirill reached out to me and said, “I have something that I want to talk to you about.” And he and I hadn’t been talking since then, so I didn’t know what this was going to be about. And he was like, “I’ve been hosting this show for four years and I’d like to pass it off to somebody, and I’d like that person to be you.” So I guess he really liked that fake ad; he knew that I could read sponsor messages, or, I don’t know. So I kind of auditioned involuntarily, and it goes to show this process of creating content, same kind of thing. At that time we had only created a few episodes, and obviously there weren’t a huge number of listeners, but it gave me some practice, and I just happened to have these interactions with Kirill, and for whatever reason that led to Kirill feeling like, “You know what, Jon Krohn is the guy to take over as host of the show. I trust him to do it.” And then Kirill and I kind of ramped up. Kirill created this huge document, which I actually still have open in front of me right now because it serves as a checklist and a template for all kinds of things. Kirill wrote this 16-page handover document for me, and then we had meetings where we went over it and over the process, and he introduced me to our podcast manager and our editor and all the other people involved in the show.
00:46:40
And then we co-hosted an episode together. So there’s an episode with Syafri Bahar, which I’m quickly looking up in my spreadsheet, episode number 427, that Kirill and I co-hosted together. And from then on I started hosting it on my own. And the first guest that I ever had on my own on the Super Data Science Podcast was Ben Taylor. Yeah, [crosstalk 00:47:07]. Yeah, that was a coincidence. Kirill suggested him; he was like, “Who’d be great for your first episode and be a really easy guest? Ben Taylor.” And I was like, he was my first guest on my own show too. It’s crazy. Yeah, I love Ben. He’s been on so many episodes of the show now. Anyway, so that was a very long-winded answer to your question. Hopefully some audience members, and maybe even you, Kris, found that context interesting.
Kris Ograbek: 00:47:31
Yeah, I like the context. I love it, because probably nobody has heard that story before. But how did you handle the pressure? Because you were clearly being compared to Kirill, and you know that his podcast was so successful, and you didn’t really have too much room for mistakes.
Jon Krohn: 00:48:00
Yeah, I mean, well, we did lose some audience members, definitely. There was something like a 30% dip in audience when I took over as host. So for some people, I obviously wasn’t what they were looking for. Kirill is an amazing personality and obviously has a distinctive style that is different from mine. I listened to a bunch of his episodes before I became host, but it doesn’t make sense for me to pretend to be Kirill and try to be like Kirill. I’ve got to kind of be myself. And so that means some people left because I wasn’t exactly their cup of tea. But by the second quarter of me hosting, we were back to Kirill’s level, partly through his guidance, because he and I still meet regularly and figure out how to keep adapting the show; he’s constantly giving me advice. In the beginning he was giving me tons and tons of advice, listening to every episode, helping me out with transitions between topics, asking, is this the best idea for the audience, or is that? So this kind of Socratic, discursive approach that he has really helped me figure out what’s the best thing for me and for the show. So yeah, I really had to hit the ground running.
00:49:23
I hope that now, several hundred episodes later, I’m doing a better job than I was in the beginning. But yeah, I’m not sure I was overwhelmed with anxiety. I certainly had some hesitation, but I did my best. I spent a lot of time preparing for those first episodes. I still make sure that I’m well prepared for every episode today, but in the beginning it was about making sure that I’d crossed all the T’s, dotted all the I’s, and was really well prepared, so I could just dive in feet first, do my best, and iterate and improve. There you go. Anything else, Kris, or were those your key questions?
Kris Ograbek: 00:50:13
I’ve got many questions about the AI space, so I don’t know how much time I can steal from you.
Jon Krohn: 00:50:20
Maybe let’s pick, I don’t know, pick one or two more of your favorite AI questions. Maybe if I have shorter answers to them, then we can- 
Kris Ograbek: 00:50:27
Okay, so I’ll try not to ask a tricky question. Do you use ChatGPT every day?
Jon Krohn: 00:50:37
Yeah, I use the ChatGPT user interface most days for sure. Yeah.
Kris Ograbek: 00:50:43
Is there any unique way you use it? Because most people just-
Jon Krohn: 00:50:51
Mostly I’m constantly blown away by the things that it can do well. There are all kinds of things I put in there. I’m looking right now at my chat history, and there are mathematical questions, programming questions. At the time of us recording this, I had just recorded last Friday’s episode, where I do a tour of the ChatGPT Code Interpreter. And as I was creating that episode, I realized I hadn’t used the Code Interpreter much before. I’d been using GPT-4, the standard model, to provide me with a lot of help on coding in the past, and it does an incredible job. It’s absolutely mind-blowing to me, the things that it does and how accurate it is. And then this Code Interpreter just takes it to the next level, because you’re executing the Python code right there in the ChatGPT browser, you’re uploading files, it’s creating machine learning models for you and running them and coming up with strategies. That’s wild. I mean, it’s like magic. As data scientists, and as people in general, with these kinds of tools, these state-of-the-art LLMs, we are supercharged, superpowered. It makes everything so easy.
00:52:32
If you’re ever stuck with any kind of mundane task, like writing an email to somebody: oftentimes when I’m writing an email, I immediately have an idea in my head, as I start to reply, of exactly what to say. And so that’s easy; I just type that into Gmail and go. But for maybe one in every four or five messages, I’m not exactly sure how to start. And ChatGPT just gets you started. It makes it so easy to get that first draft to iterate on, and it makes it more fun. It’s interactive; you feel less alone in a way. And it’s so pleasant to you. It has this really friendly way of being helpful. I think I come out of interactions with GPT-4 a happier, more positive person. I’m more likely to be nice to people I’m close to, as well as just random people I see in the street, because of the way that they’ve fine-tuned this with RLHF. It really embodies the best of human nature and brings out the best in me in return.
Kris Ograbek: 00:53:52
Oh my god, I’ve got 10 follow-up questions. I could share my screen right now, I mean, I could, but I’ve got this question: Have you used ChatGPT as a coach?
Jon Krohn: 00:54:02
Oh yeah. I mean, not explicitly, except-
Kris Ograbek: 00:54:07
That’s kind of what you meant just now, having those interactions. I think that’s because coaching is like this meta kind of work, but-
Jon Krohn: 00:54:15
Yeah, it’s implicit. So, most people probably don’t have an email-writing coach, but with ChatGPT, all of a sudden you kind of have that on hand. And it’s very positive; it’s like it’s taken all these coaching courses and knows exactly how to frame constructive feedback to you in the way that’s the most validating and the most helpful. So yeah, I guess in that sense it is kind of a coach. And then I also definitely use it in business situations. You have to be careful with what kinds of business information you put into the ChatGPT interface, although with the API you don’t have to be quite as careful: if you use the GPT-4 API or the GPT-3.5 API or whatever, they automatically delete all messages after 30 days.
00:55:08
I think, I mean, I’m not a lawyer, read the terms and conditions yourself, but my memory of the Terms and Conditions for the OpenAI API is that after 30 days the data are deleted, and the data are only stored for that short period of time, they say, to make sure that you’re not misusing the platform, using it for nefarious activity or something. So probably a lot of our listeners are technical enough that you could have a Jupyter notebook or Colab notebook, or your IDE, set up in a way so that you could be having a conversation with GPT-4 without using the ChatGPT interface. And then you can worry a bit less about having proprietary data in there. But for general business questions, where I know there’s not going to be some issue, just common kinds of circumstances, “Oh, there’s this complexity with this particular prospective client or coworker; help me work through what the best approach would be or what I should say,” is definitely a way that you can use the tool. So you can use it as a coach in that way, for sure.
Kris Ograbek: 00:56:19
Yeah. Oh, that’s so interesting. I feel that we could talk for hours. I really don’t think I should steal more of your time. 
Jon Krohn: 00:56:30
Yeah, I actually don’t mind; I’ve got time right now. But maybe for our listeners, since they didn’t necessarily sign up for it when we started this episode, they probably weren’t expecting the Q&A to go the other way. So I hope that they enjoyed this little bit of it. And yeah, Kris, thank you so much for coming on the show yourself. You were the guest on this episode, introducing us to LangChain and giving us this awesome, practical introduction with lots of examples. We’ll have links to everything for our listeners to follow up on themselves, so that they can be deploying LangChain LLM Q&A applications. And your YouTube channel has lots of other ideas for ways that people can be using LLMs and LangChain hands-on, so there’s lots to check out there. Kris, before I let you go, I know you read tons of books, so this is probably going to be a tricky question, but do you have a book recommendation for our audience?
Kris Ograbek: 00:57:30
Yeah, of course. It depends, of course, on where you are or what you’ve read, but Atomic Habits is just a must-read if you want to stay consistent with what you do. But I think the book that helped me a lot is called The Long Game by Dorie Clark, because, like I said, I’ve really, really struggled to stick to one niche, and I’ve read it recently for the second time, just to ensure that I won’t change my mind in six months again. And in this book there is a very interesting question that you should ask yourself: how to create something once but make it count 10 times. And I think that’s really influenced what I do right now, because when I learn something, I immediately create projects. I actually quit learning something new without a project; I always implement. And then from that project I create a video, and then I write a Medium article. And in the case of the project we talked about today, my article also got promoted by, let’s say, a famous person on Medium with 90,000 followers. And now I’m talking to you. So I’ve made a single project count 20 times right now, which I think is a very interesting concept to think about, because I always jumped from one thing to another, and right now, when I finish a project, I slow down, teach others, and share everything I learn.
Jon Krohn: 00:59:28
Yeah, great recommendations. Dorie Clark is someone we have some connections to, and we’re thinking about getting her on the show, hopefully soon. And then, yeah, Atomic Habits, a hugely influential book for me as well. I’d been following James Clear’s blog for years before he wrote Atomic Habits, and his own Atomic Habits approach was critical to this kind of iterative, stick-with-the-process, keep-creating mindset: eventually you learn from your mistakes, you iterate and improve, you recognize that nothing’s going to be perfect the first time, and then magic happens, like you’re saying, and you get that 20x impact from your one project, this LangChain project or this podcast or whatever. We’re following this approach, definitely heavily influenced by James Clear, and he’s someone I’ve known since 2013, so maybe someday he’ll want to do an episode on the show as well. He doesn’t do podcast episodes as much as he did in the run-up to the release of his book, but maybe someday we’ll have the honor of him being on the show as well.
Kris Ograbek: 01:00:47
That would be awesome. 
Jon Krohn: 01:00:48
Yeah, for sure. All right, Kris, thanks so much. So other than your YouTube channel, which we’ll definitely have in the show notes, and your LinkedIn profile, are there any other ways that people should be following you? I guess Medium.
Kris Ograbek: 01:01:02
Yeah, yeah, I do the same there. I mean, once a week you can expect an article that’s pretty much, like I said, based on the same project, but some people just prefer it that way.
Jon Krohn: 01:01:19
Nice. All right, Kris, thank you so much for this awesome episode, and for also turning the tables on me, which has never happened in the several hundred episodes that I’ve been hosting of this show. So that was kind of a fun thing to do for a bit; I hope the audience enjoyed it as well. This has definitely been a longer Friday episode, it feels more like a Tuesday episode at this point, but I hope it was awesome for everyone out there. And Kris, it was so great to finally meet you on air and make that connection. I hope you keep listening, and maybe we’ll have a chat-with-Super-Data-Science app up at some point soon. Let’s follow up on that.
Kris Ograbek: 01:01:58
Yeah, sure. 
Jon Krohn: 01:01:59
Yeah, catch you again soon, Kris. 
Kris Ograbek: 01:02:01
Thanks for having me. 
Jon Krohn: 01:02:03
Nice, well, that was something different, eh? I hope you took a lot from that discussion with Kris, and hopefully you didn’t mind me being the guest at the end for a bit there. In today’s episode, Kris covered how LangChain has 80+ data loaders, making it easy for you to prepare your data for a natural-language application. He talked about how large natural-language datasets need to be broken up into chunks and then converted into vector embeddings, for example by using the OpenAI Embeddings API, so that the relevant language fits into the context window of an LLM. And then he explained how an off-the-shelf LLM such as GPT-3.5 or GPT-4 can interactively and intuitively answer whatever questions you might have, including about Super Data Science episodes, if that’s what you want a chatbot to talk about.
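The pipeline recapped here, load documents, split them into chunks, embed the chunks, retrieve the relevant ones, and hand them to an LLM, can be strung together end to end. Below is a toy sketch that fakes the embedding step with a tiny bag-of-words count instead of a real embeddings API, purely to show how the stages connect; every name in it is invented for illustration:

```python
from collections import Counter

# Toy fixed vocabulary; a real embedding model learns a dense representation.
VOCAB = ["langchain", "embeddings", "podcast", "loaders", "chunks"]

def embed(text):
    """Toy 'embedding': count occurrences of a tiny fixed vocabulary.
    A real pipeline would call an embeddings API here instead."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in VOCAB]

def build_index(documents, chunk_size=8):
    """Split each document into word chunks and store (chunk, embedding) pairs."""
    index = []
    for doc in documents:
        words = doc.split()
        for start in range(0, len(words), chunk_size):
            chunk = " ".join(words[start:start + chunk_size])
            index.append((chunk, embed(chunk)))
    return index

def retrieve(index, question, k=2):
    """Return the k chunks most similar (by dot product) to the question."""
    qv = embed(question)
    scored = sorted(index,
                    key=lambda item: sum(a * b for a, b in zip(qv, item[1])),
                    reverse=True)
    return [chunk for chunk, _ in scored[:k]]
```

In a real app, `embed` would call an embeddings API, `build_index` would sit behind one of LangChain’s data loaders and text splitters, and the retrieved chunks would be stuffed into a prompt for GPT-3.5 or GPT-4 rather than returned directly.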
01:02:49
All right, that’s it for today’s episode. Support this show by sharing, reviewing, or subscribing, but most importantly, just keep listening. Until next time, keep on rocking it out there my friend. I’m looking forward to enjoying another round of the Super Data Science Podcast with you very soon. 