
69 minutes

Machine Learning, Data Science, Artificial Intelligence

SDS 847: AI Engineering 101, with Ed Donner

Podcast Guest: Ed Donner

Tuesday Dec 24, 2024

Subscribe on Website, Apple Podcasts, Spotify, Stitcher Radio or TuneIn


Ed Donner co-founded the AI-driven recruitment platform Nebula.io with the SuperDataScience Podcast’s host, Jon Krohn. Ed and Jon reminisce about how they launched their company, the growing opportunities for data scientists, how to choose an LLM, and today’s top technical terms in AI.


Interested in sponsoring a Super Data Science Podcast episode? Email natalie@superdatascience.com for sponsorship information.

About Ed
Ed Donner is a technology leader and repeat founder of AI startups, and has had the immense privilege of working with Jon for a decade! Ed is the co-founder and CTO of Nebula.io, the platform to source, understand, engage and manage talent, built with proprietary LLMs. Previously, Ed was founder and CEO of AI startup untapt, acquired in 2021. Before that, Ed was a Managing Director at JPMorgan Chase, leading a team of 300 software engineers, after a 15-year tech career on Wall Street. Ed is happiest when he’s tinkering with LLMs or demonstrating their astonishing power to others; he runs Live Events on the O’Reilly platform and currently has a best-selling, top-rated Udemy course on practical LLM engineering.

Overview
“Networking events are always useless,” said Ed Donner, days before meeting Jon Krohn at just such a mixer. He and Jon would go on to co-found the startup Nebula, which applies AI to the field of talent. Instead of scanning a person’s career and CV for keywords, Nebula finds patterns in skill sets to identify the best job fit. As the founders of this AI-driven recruitment platform, Ed and Jon have a keen eye for where the latest AI and data science jobs are headed. Ed notes that it’s currently a great time to be looking for a career in LLM engineering and data science, with a cumulative 8,800 job openings in the US. He says LLM engineering is a hybrid role encompassing data science, software engineering, and model deployment. LLM engineers are tasked with finding the best LLM for any given business problem by understanding the data and the outcome metrics, all within the company’s outlined budget.

Ed gives listeners a little insight into his process for finding the right LLM for a task. To limit the overwhelm, he uses leaderboards. LLM leaderboards help engineers quickly sort and filter down to the most relevant LLM for their purposes, and Ed recommends that engineers get to grips with multiple leaderboards (which we’ve listed in the show notes) to hone their approach. Ed himself has created a fun test of LLM quality with his game, Outsmart LLM Arena. In the game, models compete with each other in matches that test their powers of negotiation and collaboration. The bonus is that users get to learn about each model’s private strategy in plotting and planning!

Ed also offered definitions for emerging technical terms in AI: RAG, fine-tuning, and agentic AI. The first, RAG (retrieval augmented generation), helps users sort through their own store of documents and retrieve the most relevant context for answering their questions. RAG also has many variations. Perhaps the most sophisticated is query-conditioned RAG, which Ed says is similar to a technique he uses at Nebula. With query-conditioned RAG, an LLM first rewrites a complex user request, such as planning an international holiday, into a query better suited to the vector store, so that responses come back augmented with the most relevant documents from a vast document collection.

Listen to the episode to hear more about how to use RAG, Ed’s definitions of fine tuning and agentic, and why it’s a great time to be an LLM engineer.

In this episode you will learn:
  • (11:15) What an AI engineer does
  • (19:23) Defining today’s key terms in AI: RAG, fine-tuning, and agentic AI
  • (27:09) How to select an LLM
  • (49:41) Pitting LLMs against each other in a game
  • (53:14) What to do once you’ve selected an AI model  

Items mentioned in this podcast: 
Follow Ed:
Jon Krohn: 00:00:00
This is episode number 847 with Ed Donner, co-founder and CTO of Nebula.

00:00:11
Welcome to the SuperDataScience Podcast, the most listened to podcast in the data science industry. Each week we bring you fun and inspiring people and ideas, exploring the cutting edge of machine learning, AI, and related technologies that are transforming our world for the better. I'm your host, Jon Krohn. Thanks for joining me today and now let's make the complex simple.

00:00:33
Welcome back to the SuperDataScience Podcast. After working alongside this brilliant mind and brilliant communicator for a decade, it is at long last my great pleasure to introduce you all to the extraordinary Edward Donner.

00:00:58
Ed is Co-Founder and CTO of Nebula.io, a platform that leverages generative AI and encoding AI to source, understand, engage, and manage talent. Previously, he was co-founder and CEO of an AI startup called Untapt that was acquired in 2020. Prior to becoming a tech entrepreneur, Ed had a 15-year stint leading technology teams on Wall Street, at the end of which he was a managing director at JP Morgan Chase, leading a team of 300 software engineers. He holds a master's in physics from Oxford University. Today's episode will appeal most to hands-on practitioners, particularly those interested in becoming an AI engineer or leveling up their command of AI engineering skills. In today's episode, Ed details what an AI engineer, also known as an LLM engineer, is.

00:01:44
He fills us in on the data that indicate AI engineers are in nearly as much demand today as data scientists. He talks about what an AI engineer actually does day-to-day, how AI engineers decide which LLMs to work with for a given task, including considerations like open versus closed source models, what model size to select, and what leaderboards to follow. He provides tools for efficiently training and deploying LLMs, and he fills us in on LLM-related techniques including RAG and agentic AI. Are you ready for this magnificent episode? Let's go.

00:02:21
Ed, welcome to the SuperDataScience Podcast. This is a long time coming for me. We've been working together side by side for a decade.

Ed Donner: 00:02:30
It's, Jon, a total joy to be on the podcast. It's also super surreal because I've seen so many episodes. I've interacted with you so much and now it feels like this is really happening. I'm really on the podcast.

Jon Krohn: 00:02:43
Well, it's going to be a great one. So, you're here to talk to us about AI engineering and we have some amazing topics planned as usual with you. You've gone over and above in terms of preparation. I think it's the most prepared I've ever been for a podcast episode, so I know it's going to be great. For people listening, you probably get to enjoy great sound quality. If you're watching the YouTube version, you get great video quality, because we are here together in person in New York where we have worked together for 10 years. So, tell us a bit about your background, how we met, and what led you to the content that you're creating now on AI engineering and LLMs.

Ed Donner: 00:03:22
Oh, for sure. So, I guess the main thing about me is that I am a nerd. I'm a tech guy. I'm a software developer, and I'm one of these software developers who is also competent at people management. I started my career, and spent most of it, at JP Morgan in risk management technology. I started out as a coder and that's something that I absolutely loved. But very quickly, I found myself getting into the ranks of middle management, and there I was in a world of PowerPoints and spreadsheets. There are some people who would say, "I got to that point in my career and I realized this is what I was destined for, and I was never very good at coding." I felt very much the opposite. I felt that this is not what I was destined for. I was meant to be coding. So, I left JP Morgan.

00:04:11
At the time, I think I was running a team of about 300 people and doing nothing but PowerPoints and spreadsheets. I left and went back to coding in my apartment and started a small AI startup called Untapt. Somehow I managed to convince you to come and do it with me. So, we built up this startup, which was about taking AI models and applying them to the field of talent. We were working with some of the deep neural networks of the time and thinking to ourselves, "These models are so effective at understanding the nuance of language. Can we use this to encode the thought process of a really good recruiter, something which looks at someone's career and doesn't look for keywords, but looks at patterns, understands what sorts of skills they bring to the table, and matches them up with jobs?"

Jon Krohn: 00:05:03
Exactly.

Ed Donner: 00:05:03
So we used that. We built up Untapt. We were fabulously acquired a few years ago, which was a wonderful moment for us. Then with the parent company, we spun off our second startup and we are co-founders of Nebula.io, which is also applying AI to the field of talent.

Jon Krohn: 00:05:25
Exactly. I do love telling the story of exactly how we met, which was we went to the same constituent college of Oxford University. So, Oxford University is made up of 39 constituent colleges, and if you go to Oxford, you must also be a part of one of these colleges. So, as an undergrad, you actually apply to the college itself. As a graduate student like I was when I went to Oxford, you get admitted to the university and then you pick your top two choices for college and hope that you get one of them. I got my number one choice, which was Magdalen College. You are also a Magdalen College alumnus.

00:06:05
People will be trying to spell that. It looks like it should be pronounced the way Magdalen is spelled, but it has this old Latin pronunciation. Although this is now a bit of an aside, something that I often think about is how do we know how people pronounce things, you know? How do we know how classical Greek or classical Latin was pronounced?

Ed Donner: 00:06:25
The mystery.

Jon Krohn: 00:06:28
So anyway, that is really an aside. But yeah, so we were both alumni of the same Magdalen College and Magdalen College has a very active alumni community in New York. A decade ago, I think it must've been about 2014, we met at this little alumni event in a garden in the East Village of Manhattan and I loved your energy. You just left JP Morgan at that time and you were so excited about what you were doing. I knew that you were this brilliant technical founder that I wanted to be working with and it's been a brilliant 10 years since.

Ed Donner: 00:07:01
As you know, the side story there as well is that I'm a horrible introvert and I absolutely detest going to these kinds of networking events. My dad had seen that there was a networking event for Magdalen College in New York and he called me up and he said, "This is the thing that you should really go to." I said, "These networking events are always useless. Nothing good ever comes from a networking event." He said, "All right, I'll make one deal with you. Go to this one networking event, and if nothing good comes of it, you never have to go to another one again, but just go to this one." I went to that networking event and the first person I met was Jon. Then Jon comes and joins Untapt a few months later and now I have to go to every networking event.

Jon Krohn: 00:07:43
Yes, exactly. I wasn't necessarily going to go into that part of the story. If you are going to divulge that on air, then you're also going to have to tell us your middle name on air as well.

Ed Donner: 00:07:54
Never happening.

Jon Krohn: 00:07:56
The mystery continues on that front. Okay. So, yeah, this is exciting. Most recently you've been creating content, which is amazing. I've gone to some of your live trainings on the O'Reilly platform. You've developed a Udemy course, which we have linked to in the show notes, called LLM Engineering: Master AI and Large Language Models, and it's a bestseller.

Ed Donner: 00:08:22
So the first thing I'll say on this is that I was encouraged to do this by somebody I know who is a masterful, masterful educator, who just so incredibly has such a talent for explaining super complicated things in ways that everyone can understand in really accessible ways. Of course, yes, I refer to you. You're really amazing at this and it's something that I've seen you do over the years and I've been so amazed by it. I remembered that I used to actually do tech training myself at IBM many years ago, and I used to love it. It was really great fun. It's really enjoyable explaining things and you encouraged me to have a shot at this. I did and I ran some O'Reilly events and I absolutely love them.

00:09:11
So, now, yes, I've got this Udemy class. It's interesting, I had this fascinating realization that there is this new type of job that's come almost out of nowhere. It's a job that I call like an LLM engineer, but it's also something that's called an AI engineer or an AI architect. If you go onto LinkedIn and you do a search, you'll see there's about 4,000 job openings for LLM engineer right now in the US, compared to 4,800 for data science. It's very similar.

Jon Krohn: 00:09:44
Oh really, I did not know that. That is wild.

Ed Donner: 00:09:45
It's a kind of hybrid job. It's a job that is part data scientist, part software engineer, and part perhaps what we would've called an ML engineer, someone that's deploying models into production. It's got a bit of all three, and this new role is someone that knows how to select models, how to apply them with things like RAG and agentic AI, and then how to deploy them into production. This is a whole new category. So, I thought, "Is there already training that covers that sweet spot?" I couldn't find anything there. So, I decided to build this Udemy class and yeah, it's been amazing. It's really taken off. It's been there for about six or seven weeks, and yeah, 14,000 people have taken it so far.

Jon Krohn: 00:10:35
Wild.

Ed Donner: 00:10:36
I mean, that's nothing by your exponential amounts of usage.

Jon Krohn: 00:10:40
I don't know if I would've had numbers like that in that same timeline. I think it's just a matter of time before you overtake me. It's such a hot topic and it's such a good course. Also, by offering it as live trainings first, you were able to get real-time feedback on the content, figure out what works well, and figure out how to explain things even better. So, people get this nicely polished material now in Udemy.

Ed Donner: 00:11:02
Yeah, for sure. The O'Reilly platform has definitely given me that opportunity as has watching much of your training, which as I say, it's absolutely phenomenal.

Jon Krohn: 00:11:15
So yeah, I think I interrupted you as you were explaining what an AI engineer is. You talked about how it's a hybrid of data science, software engineering, and ML engineering. What does it involve in terms of day-to-day tasks? What are the responsibilities of this AI or LLM engineer?

Ed Donner: 00:11:30
So the first thing that an AI engineer has to do is select which model, which LLM they're going to be using for a problem. It turns out, this is probably the most common question that I get asked, and you probably get asked a lot too, which is like, what's the best model? What's the best LLM? Of course, the answer is there isn't one best LLM. There's the right LLM to use for the task at hand. You have to start by really understanding the requirements. The first step is to drill into the business requirements and use that to guide your decision process. Usually at least there are three major categories of things that you're looking at. First of all, you're looking at the data. What's the quality and quantity of data? Is it structured? Is it unstructured?

00:12:21
You really get a sense of the data you're working with. Then you look at the evaluation criteria. What are you going to be using to decide whether or not this model is fit for purpose, whether it's solving your problem? I'm not so much thinking there of model metrics like cross-entropy loss. I'm thinking of business outcome metrics. In our case at Nebula, are the right people being shortlisted for the right job? You think about what you're trying to accomplish with your commercial solution and find the metrics for that. Then the third category is the non-functional stuff: the budget, how much can you spend on training, how much can you spend on inference, what's your time to market? Do you need something next month or can you spend six months building this?

00:13:04
This really will help steer whether you're working with closed source or open source and help you make a lot of these decisions. But often the first step before you do any of that, before you build any LLMs is building a baseline model, something which often isn't an LLM at all. I don't know if you remember back in the day at Untapt, before we were working with deep neural networks at the very beginning, we actually started with a heuristic model that was just like janky code with a bunch of if statements, but it gave us a starting point. I don't think we ever put that in production, but it gave us something we could use and measure against our outcomes. Then after that, I remember you built a logistic regression model, if you remember that.

Jon Krohn: 00:13:49
Yeah, yeah. It's interesting, because the time at Untapt when we were building these natural language processing models, these NLP models, to figure out who was a great fit for a given role covered the same period when deep learning burst onto the scene and became easy to use. So, while it is potentially a good idea to be building a baseline model and testing that before using some big LLM, which might be overkill for your use case, another constraint that a lot of people often have is how much data they have. Although LLMs have turned that on its head, because LLMs can be quite performant with a small amount of data. You can even fine-tune them with a relatively small amount of data.

00:14:36
But it used to be the case historically that if you had fewer data, you would use a simpler model. So, at the very beginning of the Untapt platform, before there were any users, you don't have any real data to work with, so it makes sense to say, okay, let's use a heuristic approach. Today it's interesting because you could actually just ask an LLM to write the code or to rate a profile. Back then it was feature engineering that you had to do, where you wrote functions to pass over whatever document was being fed into the model to pull out, okay, is "software engineer", is that character string, mentioned in this description? Then okay, we'll put a binary yes into the software engineer column. So, obviously, that's really simplistic, but it did actually go some of the way, even that heuristic model.
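
To make that concrete, here is a minimal sketch of the kind of keyword-heuristic features feeding a simple logistic regression baseline described above; the keywords, profiles, and labels are invented purely for illustration.

```python
# Minimal sketch: keyword-heuristic features plus a logistic regression baseline.
# Keywords, profiles, and labels are illustrative toy data.
import numpy as np
from sklearn.linear_model import LogisticRegression

KEYWORDS = ["software engineer", "data scientist", "python", "machine learning"]

def keyword_features(profile_text: str) -> list[int]:
    """One binary flag per keyword: 1 if the string appears in the profile, else 0."""
    text = profile_text.lower()
    return [int(kw in text) for kw in KEYWORDS]

profiles = [
    "Senior software engineer with Python and cloud experience",
    "Marketing manager focused on brand campaigns",
    "Data scientist applying machine learning to churn prediction",
    "Office administrator and receptionist",
]
labels = [1, 0, 1, 0]  # 1 = good fit for a technical role in this toy example

X = np.array([keyword_features(p) for p in profiles])
y = np.array(labels)

baseline = LogisticRegression().fit(X, y)  # the easy baseline to beat later
print(baseline.predict_proba([keyword_features("Python machine learning engineer")]))
```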

Ed Donner: 00:15:25
It helps you, it gives you that baseline, it gives you the sense of this is the low bar, and then as you work towards building a more nuanced LLM, you can see the benefits. You can see the improvement you make on that baseline.

Jon Krohn: 00:15:38
I don't think today you would recommend building a heuristic natural language model.

Ed Donner: 00:15:42
No, no, but a traditional machine learning model perhaps to start with. It's good to start with-

Jon Krohn: 00:15:47
Logistic regression model.

Ed Donner: 00:15:48
... logistic regression. Right, right. For sure.

Jon Krohn: 00:15:51
Cool. All right. So, yeah, I interrupted you as you were saying that before you even select an LLM, you build a baseline model that you test, so you give yourself an easy baseline, and then selecting the LLM is the next step.

Ed Donner: 00:16:03
Well, so you first have to choose whether you're going to go the closed source route or the open source route. That's a big decision point and I would say that almost always the first answer is start closed source.

Jon Krohn: 00:16:17
Exactly.

Ed Donner: 00:16:17
Like begin of course with a model like GPT-4o mini, with something that-

Jon Krohn: 00:16:24
I would recommend beginning with the most expensive model, because in the beginning, when you're doing a bunch of prototyping on your own, the costs are going to be so trivial that you can use the most beefy one. Use the full GPT-4o, see if you can do it there, and then maybe reassess once you're thinking about going into production and expecting a lot of users. But actually, in our recent interview with Andrew Ng, he said to just leave it on that really expensive model, because for almost all proofs of concept that you build, even if you deploy them into production, you're going to be very lucky if they cost you tens of dollars running in production initially. So, you don't need to be worried about saving dollars or tens of dollars by switching to GPT-4o mini. Anyway, it's just one perspective.
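
As a concrete illustration of that prototyping advice, here is a minimal sketch of calling a top closed-source model through the OpenAI Python SDK; the prompt is invented, and an OPENAI_API_KEY environment variable is assumed.

```python
# Minimal sketch: prototype on the strongest closed-source model first,
# then swap the model string for a cheaper one later if costs ever matter.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # start beefy; prototyping costs are usually trivial
    messages=[
        {"role": "system", "content": "You rate how well a candidate fits a job, 0-10."},
        {"role": "user", "content": "Job: Senior Python engineer. Candidate: 8 years Python, AWS."},
    ],
)
print(response.choices[0].message.content)
```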

Ed Donner: 00:17:13
No, for sure. Of course, it makes total sense. I think there are some situations where you would move to open source and maybe somewhere you would start with open source. Obviously, the one that's most common and the one that guided us at Nebula is where you have a vast amount of proprietary data which has nuanced information captured in that data. We want to fine-tune a model that we believe will be able to outperform the frontier because we have this proprietary data set and that's obviously a great reason to do it. You would still probably start with GPT-4o, but then you would use that to train a model. Another situation that's very common is if you have private data.

00:17:56
You have data that's sensitive and that you are not comfortable sending to a third party. You're not comfortable for it to leave your infrastructure at all, despite some of the guarantees that you might get from an OpenAI enterprise agreement. In those cases, you would want to use open source, keep the data locally, and run it on your own models. There might be some situations where at inference time you are very focused on API costs, and so you can reduce the costs by running an open source model. Then the final thing I can think of is if you're trying to build models to run on device or without a network connection; then again, of course, you would need to work perhaps not with an LLM, but with an SLM, a small language model like a Llama 3.2 or something like that.
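
For the private-data and on-device cases Ed describes, here is a minimal sketch of running a small open-source model locally with the Hugging Face transformers pipeline; the model ID is illustrative (Llama weights are gated behind a license acceptance), and any small instruct model could be substituted.

```python
# Minimal sketch: run a small open-source model locally so data never leaves your machine.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",  # illustrative small model for local use
)

prompt = "Summarize this candidate profile in one sentence: 8 years of Python, AWS, led a team of five."
print(generator(prompt, max_new_tokens=80)[0]["generated_text"])
```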

Jon Krohn: 00:18:42
Nice. I guess maybe one thing to add: you said you can save money with open source, and that would be the case particularly if you were doing a lot of calls at inference time. If you're going to have a small number of calls at inference time, then spinning up a server with a GPU running that open source model could end up being way more expensive than just calling a closed source model's API.

Ed Donner: 00:19:04
For sure, depending on the number of parameters. If it turns out that your problem can be solved with a small model, then it could be quite cheap. But if you're starting to talk about Llama 3.1 405B, then yes, it's probably going to be cheaper to use GPT-4o.

Jon Krohn: 00:19:18
Yeah, yeah, unless you have tons of volume. Okay. There's just a few other maybe techniques here you want to talk about that AI engineers use, these kinds of key things like RAG, fine-tuning, agentic AI. These are all very trendy terms these days.

Ed Donner: 00:19:33
Yeah, for sure. So, once a model has been selected, the next step for an AI engineer is to figure out, okay, how are we going to optimize applying this model to the problem at hand? Of course, the world then breaks into two: there are the kinds of things you can do at training time and the kinds of things you can do at inference time. Training optimization is where we all began. That's what everyone was doing maybe a year and a half ago: we were all fine-tuning models, using QLoRA to fine-tune an open source model, or you can also fine-tune a closed source model as well. But increasingly over the last year and a half, people have been using more and more inference-time techniques to better optimize the model for the problem at hand.
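
For the training-time path Ed mentions, here is a hedged, minimal sketch of QLoRA-style fine-tuning with the Hugging Face transformers and peft libraries; the model ID, adapter hyperparameters, and hardware assumptions are illustrative placeholders rather than Ed's actual recipe.

```python
# Hedged sketch of QLoRA-style fine-tuning: 4-bit quantized base weights plus small LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Llama-3.1-8B"  # placeholder; any causal LM works in principle

bnb_config = BitsAndBytesConfig(          # load the base weights in 4-bit (the "Q" in QLoRA)
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(base_model)  # used to tokenize your training data
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(                 # small trainable adapter matrices (the LoRA part)
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of the base model's weights
# From here you would run a standard Trainer / SFTTrainer loop over your proprietary dataset.
```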

00:20:22
I think that the OG of this is probably multi shot prompting, which we were all doing way back, giving a bunch of examples, providing that to the model and say, "Hey, here's a few examples and now here's the new question you're being asked." Then along came RAG, where you take the question and you then look up in your database, "Okay, do I have information that I could provide the model that's going to help it answer this question?"

00:20:49
RAG has lots of fancy stuff. There are lots of techniques you can use to get better and better at selecting the right information that's going to be most likely to be useful to the model, to bias it towards giving a high-accuracy result. RAG comes in lots of flavors. There's a hierarchical RAG where you make multiple queries to your vector store, first to get bigger, broader documents and then to make more fine-grained passes afterwards.

Jon Krohn: 00:21:14
That's news to me. Cool.

Ed Donner: 00:21:15
So for example, let's say you've got an airline that has a chatbot, and someone asks the question, "I'm going to Paris this Christmas. What could I do in Paris over the Christmas time?" Maybe first of all, there would be a RAG query just to retrieve all of the information about travel to Paris, back would come a lot of information, and then a second query would select out holiday-season activities so that it could be just a little bit more precise in the context that it's providing.

Jon Krohn: 00:21:43
I see. The second one is doing RAG on the first round of documents that was brought back. I got you. That's cool.

Ed Donner: 00:21:49
There's actually another kind which is somewhat similar to that, called query-conditioned RAG, which is a bit similar to something that we do at Nebula. It's where you take the original query from the user, like "I want to go to Paris and do holiday-related things," you pass that to an LLM, and you have the LLM rewrite it in the form of a query that's going to be most applicable for RAG to run against your vector data store. So, that gives you those two rounds before you look up in the data store, which again increases the accuracy of the context that you're providing the model.
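
Here is a minimal sketch of that query-conditioned idea, assuming an OpenAI-style chat API: the LLM first rewrites the user's chatty question into a retrieval-friendly query, and only then is the vector store consulted. The `search_vector_store` helper is a hypothetical placeholder for whatever store you actually use.

```python
# Minimal sketch of query-conditioned RAG: rewrite the query with an LLM, then retrieve.
from openai import OpenAI

client = OpenAI()

def rewrite_for_retrieval(user_question: str) -> str:
    """Ask the LLM to turn a chatty question into a focused search query."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Rewrite the user's question as a short search query "
                                          "for a vector database. Return only the query."},
            {"role": "user", "content": user_question},
        ],
    )
    return resp.choices[0].message.content

def search_vector_store(query: str, k: int = 5) -> list[str]:
    """Hypothetical placeholder for a real vector-store lookup (Chroma, FAISS, pgvector, ...)."""
    return [f"[document matching '{query}' #{i}]" for i in range(k)]

question = "I want to go to Paris for the holidays. What can I do there?"
retrieval_query = rewrite_for_retrieval(question)   # e.g. "Paris Christmas holiday activities"
context = search_vector_store(retrieval_query)
answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Answer using the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(answer.choices[0].message.content)
```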

Jon Krohn: 00:22:27
Yeah, it's very cool. That has been powerful for us at Nebula for sure. I think one thing that we may not have mentioned is that RAG stands for retrieval augmented generation, meaning that it's generative AI, but you're augmenting the responses by retrieving relevant documents, relevant information, from a potentially vast store of billions and billions of documents that you can search over efficiently.

Ed Donner: 00:22:52
For sure. Of course, as you say, usually the way you're doing that is you're taking the question and you're passing it through an encoder LLM that's able to turn that question, that block of text into a series of numbers. You can think of those series of numbers as representing a point in multidimensional space which reflects in some way the meaning of that text.

 00:23:17
If you've also taken all of your millions of documents and you've also found little vectors associated with each one, then the idea behind RAG is you just find which of these documents are most close or closest to this question that you're asking. You take those documents and you just shove them in the prompt and you say to the LLM, "Hey, let me provide you with some context that might help you in answering this question." That's the idea.
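
To make that flow concrete, here is a minimal sketch of the vanilla RAG pipeline just described: embed the question, find the closest documents by vector similarity, and stuff them into the prompt. The embedding model, the toy in-memory document list, and the prompt wording are illustrative assumptions.

```python
# Minimal sketch of vanilla RAG: embed, retrieve nearest documents, stuff the prompt.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

documents = [
    "Our airline flies to Paris daily from New York.",
    "Checked baggage allowance is 23kg per passenger.",
    "Seasonal events in Paris include Christmas markets along the Champs-Elysees.",
]
doc_vectors = np.array([embed(d) for d in documents])

question = "What can I do in Paris at Christmas?"
q_vec = embed(question)

# Cosine similarity between the question vector and every document vector.
sims = doc_vectors @ q_vec / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_vec))
top_docs = [documents[i] for i in np.argsort(sims)[::-1][:2]]

context = "\n".join(top_docs)
prompt = f"Context:\n{context}\n\nQuestion: {question}"
answer = client.chat.completions.create(
    model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}]
)
print(answer.choices[0].message.content)
```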

Jon Krohn: 00:23:43
Nice. Yeah. Then fine-tuning and agentic AI, I think, are the remainder, or do you want to dig into those later?

Ed Donner: 00:23:49
No, no, for sure. These are all topics we could spend an entire podcast talking about, of course, and they're fascinating. Agentic AI is of course all the rage at the moment. This is such a hot topic, and with agentic AI there are some hallmark situations when you think, "Okay, agentic is going to work for me here." One of them is where you have a complex problem and it's clear that it's going to make sense to break it down into a series of smaller steps that are each well-defined and that together will solve the bigger problem. That of course sounds like agentic AI. Another obvious case is when you want to be able to make use of tools. For example, suppose you're writing a model which is going to generate some code.

00:24:37
Potentially, you want to be able to call a tool that will execute that code and then inform the LLM whether it worked or not, and it can use the tool to iterate on that solution. The third situation where you might use agentic AI is probably the hardest one to explain. It's the case where you're trying to solve a problem where the LLM needs to exist beyond the construct of a chat with a user. It's got a longer lifespan than that. It's got some autonomous existence. Maybe to make it concrete, an example might be, again, if we think of this airline that's having this conversation with a user that says, "I'd like to go to Paris for the holidays. What can I do?" They have that conversation.

00:25:23
Maybe a few days later, the LLM detects that the ticket price to Paris has come down by 100 bucks and it proactively text messages that same user and says, "I've noticed the price of the ticket has come down. Are you now interested in your Christmas holiday in Paris?" And so from that perspective, it's got this existence that goes beyond this one chat conversation and that sounds like, "Okay, that's an agentic AI solution right there."
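
As an aside for readers following along in code, here is a minimal sketch of that proactive, longer-lived pattern in the airline example; `get_current_fare` and `send_text_message` are hypothetical stand-ins for real integrations, and in a real agent an LLM would decide when and how to reach out.

```python
# Minimal sketch of a proactive agent: a long-lived watcher outside any chat session.
import time

def get_current_fare(route: str) -> float:
    """Hypothetical placeholder for a real fare-lookup API call."""
    return 420.0

def send_text_message(user_id: str, message: str) -> None:
    """Hypothetical placeholder for a real SMS/notification integration."""
    print(f"[SMS to {user_id}] {message}")

def watch_fare(user_id: str, route: str, baseline_price: float, drop_threshold: float = 100.0):
    """Runs independently of the original chat: notify the user if the fare drops enough."""
    while True:
        fare = get_current_fare(route)
        if baseline_price - fare >= drop_threshold:
            send_text_message(
                user_id,
                f"The {route} fare dropped to ${fare:.0f}. Still interested in Paris for the holidays?",
            )
            break
        time.sleep(60 * 60)  # check again in an hour

watch_fare(user_id="traveler-42", route="NYC-Paris", baseline_price=550.0)
```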

Jon Krohn: 00:25:50
Yeah, I think the key difference is agentic AI allows you to be proactive as opposed to reactive. So, if you're using an LLM in a non-agentic setting, like going to ChatGPT and typing in a query, it is reacting to your query exactly as you said perfectly there with that example of an agentic system noticing, proactively scraping the web to look for deals, noticing that the price has gone down and notifying you.

Ed Donner: 00:26:16
That's a really nice way of explaining it for sure. Then there's one small recent delta on agentic AI that's very similar, which is the emergence of reasoning frameworks, which is perhaps just a way of applying agentic AI. But these are frameworks which are able to make multiple calls usually to the same LLM to guide it through the process of reasoning. Very much like the way we see o1 Preview reasons its way through a particular thread. So, reasoning frameworks is another technique that AI engineers can use to try and get more from their models.
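
To make the idea concrete, here is a hedged sketch of a draft-critique-revise loop, one simple way such a reasoning framework can call the same LLM several times; the prompts and model name are illustrative, not any particular framework's API.

```python
# Hedged sketch of a "reasoning framework": several calls to the same model
# that draft, critique, and revise an answer before returning it.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"

def ask(system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system}, {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

question = "A train leaves at 9:40 and the trip takes 2h 35m. When does it arrive?"

draft = ask("Answer step by step.", question)
critique = ask("You are a strict checker. List any mistakes in this reasoning.",
               f"Question: {question}\n\nProposed answer:\n{draft}")
final = ask("Revise the answer, fixing the issues raised.",
            f"Question: {question}\n\nDraft:\n{draft}\n\nCritique:\n{critique}")
print(final)
```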

Jon Krohn: 00:26:58
Do you ever feel isolated, surrounded by people who don't share your enthusiasm for data science and technology? Do you wish to connect with more like-minded individuals? Well, look no further, Super Data Science community is the perfect place to connect, interact, and exchange ideas with over 600 professionals in data science, machine learning, and AI. In addition to networking, you can get direct support for your career through the mentoring program, where experienced members help beginners navigate. Whether you're looking to learn, collaborate, or advance your career, our community is here to help you succeed. Join Kirill, Hadelin and myself, and hundreds of other members who connect daily. Start your free 14-day trial today at superdatascience.com and become a part of the community.

00:27:43
Very cool. So, we've already touched a little bit on how selecting a model tends to be one of the biggest and most important roles for an AI engineer in order for an application to be effective. We've touched a little bit on deciding between a closed source or an open source model. Is there anything else you'd like to cover on that?

Ed Donner: 00:27:59
I think probably talking more about the open source part of that and about something which I love talking about, which is benchmarks and leaderboards and all that stuff, because I've really gotten into that. So, yeah, personally, I found myself super overwhelmed when I first started trying to find the right open source model to use for a problem, because there are just so many of them and it's so hard to figure out, short of prototyping with all of them, which one is going to be right for the task I'm trying to work on. Of course, one of the first places you go is Hugging Face's Open LLM Leaderboard, which is just a treasure trove of useful information. But again, there's a lot to it.

00:28:47
So, when you bring that up and you see all of this information, one of the first things that I always do is check the box to show the number of parameters in the table, because you need to ground yourself, comparing apples and apples, and see which sizes of models are being compared. I also use the filter to zoom in, and then you need to look at the different benchmarks and identify which of them are going to be most relevant for the problem you're trying to solve. There are a lot of problems with benchmarks. There are a lot of known limitations. They can be gamed. There are plenty of examples of contamination having happened and of people overfitting to benchmarks, but they still give you a decent indication of what you're working with.

00:29:31
So, they give you a good grounding to pick a model. Probably my favorite benchmark, if I can go right into one of them, is a benchmark called GPQA, which stands for Google-Proof Question and Answers, which is a really fun one. They came up with this metric in a paper just over a year ago. It was November 2023 that they published it. The idea is they wanted to come up with a metric that was far out there, something that the models won't be able to solve for a long time. We want to set a really high bar here. That was the thinking.

00:30:09
So, they come up with GPQA and the idea is it is 448 difficult questions in physics, biology, and chemistry that someone who's either taking a PhD or has a PhD should just about be able to solve. In fact, if you put it to people with that caliber, they tend to get 65% on average in the GPQA test. 65% is like an expert human level. If you give it to normal people like me without a PhD and you say, "Okay, here's the thing, you can use Google. You can spend half an hour as much as you want just going through Google, figure out the answer to these questions," then people will score 35%, a dismal 35%.

Jon Krohn: 00:31:03
A key thing here is that we're not taking just any PhD and saying, okay, somebody has a biology PhD, and testing them on chemistry PhD questions. These 448 hard science questions are segregated by discipline when you quote that 65% for PhD-level humans. It's not just, "Oh, this person has a PhD and therefore they're really smart and they can do all these questions." It's that they can do the subset of science questions for their own discipline.

Ed Donner: 00:31:28
That's exactly right. That's exactly right.

Jon Krohn: 00:31:30
65% of the time.

Ed Donner: 00:31:31
65% of the time. When this first came out, it seemed like that was a long way out. Then Claude 3.5 Sonnet came out earlier this year and it scored around 59%. People were just shocked. Wow. Already it's approaching human expert level. Then a few weeks ago, the new Claude 3.5 Sonnet came out and it's exactly 65%. It's on par with expert humans. It's outrageous. It's so spectacular. And then, yeah, I guess it's no news to you, but o1 Preview, of course, shatters all these numbers. o1 Preview is above 70% already; it's above PhD level in these kinds of subjects.

Jon Krohn: 00:32:19
It's just the Preview model.

Ed Donner: 00:32:20
It's just the Preview model. We believe that Orion is around the corner. That's what the rumors tell us.

Jon Krohn: 00:32:26
Tell me about Orion.

Ed Donner: 00:32:28
Orion is apparently the code name for either GPT-5 or maybe it will be o2, whatever it's going to be. Apparently OpenAI's next model is code-named Orion. In fact, I think the most recent speculation is that people are starting to see diminishing returns from these models. There was an article in Bloomberg suggesting that maybe we won't be as blown away as we're expecting by the next OpenAI model. It's been somewhat surprising that Claude 3.5 Opus still hasn't come out when 3.5 Sonnet came out a while ago. We're still on 3.0 Opus. So, we're wondering whether that's because it's not yet ready, not at that level yet. But anyway, the next generation of models is, I'm sure, going to push GPQA even more.

Jon Krohn: 00:33:19
I've completely forgotten about the Opus thing, right? Because the way that Claude does their sizings, you've got Haiku, which is their lightweight one, like a GPT-4o mini and then Sonnet. Just in my head, because I've been using Claude 3.5 Sonnet and it's so great and it does outperform Claude 3.0 Opus on so many tasks, I'd completely forgotten about that Opus that that's the bigger model. We're still waiting on it.

Ed Donner: 00:33:44
We are, we are... but people are saying a couple of months from now, early in the new year, is when we can expect Orion and the next Anthropic model. I'm sure that these benchmarks will be blown away yet again. If you look at the open source models on the Hugging Face leaderboard, you'll see that they are not at this level yet, which is not hugely surprising. I think the winning model right now, and this changes all the time, is the Qwen 2.5 model from Alibaba Cloud.

Jon Krohn: 00:34:15
Oh, really?

Ed Donner: 00:34:16
Yeah, that's the leader when it comes to GPQA. It's scoring about 22%, so it's not even close, not even at my level.

Jon Krohn: 00:34:26
Some spellings here for, I guess, all of our listeners since we don't have words that show up on the screen in the video version either. So, Orion is not like an Irish last name. It's like Orion's belt, O-R-I-O-N. Then this Qwen that you just mentioned from Alibaba is Q-W-E-N, like Gwen, but with a Q. It's Qwen 2.5.

Ed Donner: 00:34:50
Yes, the 32 billion version of it is the one that scores best in GPQA as of now, but these things change all the time, which is why it's well worth bookmarking the Hugging Face LLM leaderboard. I guess we'll be able to add some bookmarks to the broadcast notes.

Jon Krohn: 00:35:06
To the show notes. Yeah, for sure.

Ed Donner: 00:35:07
Lovely. I have about eight or nine leaderboard bookmarks that I go to a lot and they're such an incredible resource.

Jon Krohn: 00:35:15
If you send us your eight or nine, we will put them in there or you can maybe create a little GitHub gist with your eight and we could link to that. Your choice. We'll see what happens.

Ed Donner: 00:35:27
Figure it out. I'll send it to you. Indulge me to go through another couple of metrics that I love that are just fun to mention-

Jon Krohn: 00:35:35
Of course, fascinating.

Ed Donner: 00:35:36
... some of the benchmarks. So, there's one that's called MuSR, M-U-S-R, that you'll also see on the Hugging Face leaderboard; it stands for Multi-Step Soft Reasoning. Again, this is another of these metrics, this one about thinking through difficult problems, and there's a bunch of questions that are put to the models. The one that I enjoy hearing about the most is that they are given 1,000-word short stories of a murder mystery, and they have to respond with who has the motive, the means, and the opportunity.

Jon Krohn: 00:36:08
That's funny.

Ed Donner: 00:36:08
They said, "Yeah, it's that great." So you'll see the results of MuSR. So, if you're looking for reasoning capabilities, that's what you would look at. Then there's a benchmark called MMLU-Pro. MMLU is a very well-known metric that was used a lot for language understanding, but it was fairly-

Jon Krohn: 00:36:27
Well, massive multitask language understanding. Yeah.

Ed Donner: 00:36:30
Exactly. It was criticized for that. There was certainly some contamination and there was also ambiguity in the metric, but Hugging Face is now using one called MMLU-Pro, which fixes those problems and which is a better metric. So, if you're looking for a model that can show understanding of language, then that's the metric you would look at. Then one more that's on there, and then I'll stop with the metrics, is called BBHard. BBHard, which stands for Big Bench Hard, is another one that they came up with two years ago.

 00:37:06
The idea again was that this would be testing for future capabilities, for things which LLMs can't do today but that one day we hope they'll be able to do. It was in response to the fact that LLMs were getting into the 90s on all of the existing benchmarks. One example, one of the categories of questions, is about identifying whether text has sarcasm in it, which is a really clever, nuanced test that, as of a couple of years ago, LLMs really struggled with. As of now, Claude 3.5 Sonnet scores 93%.

Jon Krohn: 00:37:40
Wow.

Ed Donner: 00:37:41
Unbelievable. Unbelievable. So, already this future capability is the present capability.

Jon Krohn: 00:37:47
Yeah, it's moving really quickly, which makes AI engineering or LLM engineering a super fun field and also evidently creating all these jobs rivaling the number of data scientists that we have job postings for out there.

Ed Donner: 00:38:00
For sure. There's a different leaderboard altogether on Hugging Face that not many people know about, which is unbelievably useful, which I feel like I'm doing a public service by telling people about it. It's called the Hugging Face LLM-Perf Leaderboard. It's a leaderboard that helps you understand your hardware requirements and your timing, your latency of using different sized models on different hardware. If you go to this leaderboard, there's tons of information there, but you have to know that there's a separate tab called Find the Best Model and you have to click on that tab. That's where all the magic happens. That's the juicy part.

Jon Krohn: 00:38:37
Wow.

Ed Donner: 00:38:38
It shows you a scatter plot, where every dot in the scatter plot represents a model. The X-axis is the latency: how long will the model take to respond? The Y-axis is the accuracy, the performance against benchmarks. Each dot also has a size, and the size of the marker is how much GPU RAM the model takes up. You'll see all of the models out there and you can hover over them and get the details on them. That means that if you've got a particular hardware setup, say a certain box with 40 gigs of GPU RAM, and you need it to respond within a certain time, you can simply look at this diagram and pick out exactly which models you can work with. So, it's a really useful resource.
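
If you wanted to reproduce that kind of chart for your own shortlist of models, a minimal sketch might look like the following; the latency, accuracy, and memory numbers are made up purely to show the plotting idea.

```python
# Minimal sketch: latency vs. accuracy scatter plot with marker size showing GPU memory.
import matplotlib.pyplot as plt

models = {
    # name: (latency_seconds, benchmark_accuracy, gpu_ram_gb) -- illustrative values only
    "small-3B": (0.4, 0.52, 6),
    "medium-8B": (0.9, 0.61, 16),
    "large-70B": (3.5, 0.72, 140),
}

for name, (latency, accuracy, ram) in models.items():
    plt.scatter(latency, accuracy, s=ram * 10, alpha=0.6)  # marker size ~ GPU RAM
    plt.annotate(name, (latency, accuracy))

plt.xlabel("Latency (seconds per response)")
plt.ylabel("Benchmark accuracy")
plt.title("Pick models that fit your latency and memory budget")
plt.show()
```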

Jon Krohn: 00:39:31
Very cool. Now, consider situations like we have at Nebula, where we have a completely new kind of task, though maybe there are some competitors out there trying to do similar kinds of things. Going back to the Untapt example, still today a big part of our AI R&D involves finding the right people for the right job. And we can do that so much better now. I mean, imagine if we could go back in time five years and show the kinds of results that we have; it would completely knock our socks off, and LLMs play a big part in that. They're not everything, there are other tricks as well, but LLMs play a big part in us getting great match results back to our users. So, we've talked about leaderboards so far, but these leaderboards measure relatively generic intelligence. Do you have any thoughts on what you would do when you have a specific problem like the one we have?

Ed Donner: 00:40:27
Well, the short answer is that it's what data scientists like to call "empirical," as they sometimes say.

Jon Krohn: 00:40:38
That's right. It's such a sophisticated sounding word.

Ed Donner: 00:40:39
Yeah, right. It's one that you have to learn to say to senior management, to executives. You say, "Well, that's empirical," or "an area of active research" is the other little expression. But what it means is there is a lot of trial and error involved. A lot of what you'll do is look at these benchmarks, which might help guide you towards three or four models. Maybe you're starting with Phi from Microsoft and Gemma from Google and a couple of others, maybe Qwen from Alibaba Cloud and Llama from Meta. Then you'll build prototypes with a subset of your data and you will be measuring them against some of the outcome metrics that we talked about at the beginning. You'll be using that to help guide your decision towards which model to take.

00:41:24
The same will apply for the different techniques when you're thinking about, "Okay, am I going to be doing fine-tuning or working at training time or am I going to be working at inference time working with RAG, working with agentic AI?" It will again come down to some empirical work as some trial and error. There are rules of thumb. If you're focused on trying to improve the accuracy and specialist skills, then it tends to lend itself towards RAG. We talked about some of the pointers towards agentic AI before, so you'll use those to guide you, but at the end of the day, trial and error, try a few things out, see what results you get.
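
A minimal sketch of that empirical loop might look like the following; the candidate model names are illustrative, and `run_model` is a stub standing in for real API or local inference calls so the structure is clear.

```python
# Minimal sketch: compare a few candidate models on a small evaluation set
# using a business outcome metric rather than a model-internal loss.
candidates = ["gpt-4o-mini", "llama-3.1-8b", "qwen-2.5-7b", "gemma-2-9b"]  # illustrative names

eval_set = [
    {"job": "Senior Python engineer", "candidate": "8 years Python, AWS", "good_fit": True},
    {"job": "Senior Python engineer", "candidate": "Retail store manager", "good_fit": False},
]

def run_model(model_name: str, job: str, candidate: str) -> bool:
    """Stub: call the model and parse whether it predicts a good fit."""
    return "python" in candidate.lower()  # placeholder so the sketch runs end to end

def outcome_metric(model_name: str) -> float:
    """Fraction of evaluation examples where the model agrees with the ground truth."""
    hits = sum(
        run_model(model_name, ex["job"], ex["candidate"]) == ex["good_fit"] for ex in eval_set
    )
    return hits / len(eval_set)

scores = {m: outcome_metric(m) for m in candidates}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```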

Jon Krohn: 00:42:02
I guess in that context, I'd add a couple of things. You might need to create a test set, which, depending on the task, you might need to create manually, or you could actually use some of these state-of-the-art proprietary LLMs, or the really big open source ones like Llama 3.1 405B, to generate synthetic data that are relevant to your use case. One thing that I would suggest you be careful about in that circumstance is to be sure that your simulated data cover the range of use cases that you anticipate your users will have. So, imagine you were to naively go to Claude 3.5 Sonnet and describe our problem, saying, "I want to create a test set of jobs and candidates, and I want you to come up with some reasonable score for how well they match with each other."

00:43:00
If you just naively do that, you could end up stuck in a relatively small part of the whole sample space that you might want all of your samples to come from. So, a cool trick here that we have leveraged at Nebula ourselves is to use real platform data to seed the simulation. So in our case, you could have a user come in and say, "Find me a data scientist in New York." That isn't a very rich query on its own for our test data set, but we could have that be the starting point, that piece of information, and then Claude 3.5 Sonnet, or whatever proprietary API or big open source model you're using to simulate data with, can make a data scientist profile, can make data scientist jobs to match against and related jobs to score against.

00:44:00
Then you are getting a real sense of, okay, these are the users that we have. They cover this range of industries, and you get that nice range of seeds that are representative of what your user base is looking for.
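
Here is a hedged sketch of that seeding trick: take a sparse real query from the platform and ask a strong model to expand it into a richer synthetic test example. The prompt, model choice, and JSON schema are assumptions for illustration, not Nebula's actual pipeline.

```python
# Hedged sketch: seed synthetic test data from real platform queries so the
# test set covers the range of real usage.
import json
from openai import OpenAI

client = OpenAI()

real_seed_queries = [
    "Find me a data scientist in New York",
    "Looking for a backend engineer with Go experience in Berlin",
]

def expand_seed(seed: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "Based on this real search query from our platform, invent a realistic "
                "candidate profile, a matching job description, and a 0-10 fit score. "
                f"Return JSON with keys profile, job, score.\n\nQuery: {seed}"
            ),
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

synthetic_test_set = [expand_seed(q) for q in real_seed_queries]
print(synthetic_test_set[0])
```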

Ed Donner: 00:44:12
Yeah, it makes total sense. Also, there are some companies that specialize in helping with that process, both generating synthetic data and also building real data sets. So, there's Scale AI at scale.com, which is really great at that; that's what they do. That also gives me an opportunity to mention that they also have a leaderboard. I'm a leaderboar.

Jon Krohn: 00:44:39
A leaderboar.

Ed Donner: 00:44:43
There's a set of leaderboards called the SEAL leaderboards, which they have made, which are specifically about applying models to business problems. They've got a bunch of them. They measure things like tool use: they've got one specifically for use of tools, which is really interesting, measuring which models are better at being able to use tools. They've got one really interesting one called adversarial robustness. That is specifically testing models to see how good they are at refusing to answer questions that are inappropriate and at not allowing themselves to be led astray. That's particularly important and relevant, because if, for example, you are working on a chatbot that you're going to have as your airline's customer support chatbot, you want to make sure that people aren't going to be able to generate-

Jon Krohn: 00:45:31
Make a bomb.

Ed Donner: 00:45:32
Have it make a bomb, or make something that's very memeable and is going to be embarrassing and be posted all over the place. So, knowing that you're picking a model that is strong in terms of adversarial robustness is something that's very helpful. So, scale.com generates test data, is very useful for that, and also has some great business-specific leaderboards.

Jon Krohn: 00:45:54
Eager to learn about large language models and generative AI but don't know where to start? Check out my comprehensive two-hour training, which is available in its entirety on YouTube. Yep. That means not only is it totally free, but it's ad-free as well. It's a pure educational resource. In the training, we introduce deep learning transformer architectures and how these enable the extraordinary capabilities of state-of-the-art LLMs. And it isn't just theory; my hands-on code demos, which feature the Hugging Face and PyTorch Lightning Python libraries, guide you through the entire life cycle of LLM development, from training to real-world deployment. Check out my Generative AI with Large Language Models hands-on training today on YouTube. We've got a link for you in the show notes.

 00:46:38
Nice, S-C-A-L-E. Any other leaderboards that we must know about, Edward?

Ed Donner: 00:46:44
Well, if you're going to open that door, let me see. Yes, there's one called Vellum. In fact, I would say that the Vellum.ai leaderboard is my first bookmark. It's my number one bookmark. It's the first one I have up because what Vellum has, which is really hard to find when you need it, is for all of the main frontier models, the cost, the cost per million input tokens and the cost per million output tokens and the context window size.

Jon Krohn: 00:47:14
I actually see, it's off-camera, a laptop that Ed left open while we're recording and it's on that Vellum page. So, I've noted it as context windows, model names, and input cost in terms of million tokens. It's funny, because as you were describing this, I was like, "Did we already talk about this on the show? Why am I so familiar with Vellum?" It's because I've been reading about it over your shoulder.

Ed Donner: 00:47:39
I am a leaderboar. It's right there, always on my laptop. So, yeah, Vellum.ai is really useful. It also has a bunch of other things too. It has that BBHard metric that we mentioned before, the future-capabilities one. The great thing about Vellum, compared to something like Hugging Face, is that with Vellum you see both open-source and closed-source models together. So, you can see that some of the massive open-source models, like the Llama 405B that you mentioned, are competing with frontier models. It's creeping up the leaderboards. So, it is really useful to see it from that point of view.

00:48:17
I think, well, there are a couple of other leaderboards. So, there's a big code leaderboard from Hugging Face, which is one way you can see how good models are at writing code. It doesn't just cover Python coding, which is the metric that many people know, but it also has Java coding, JavaScript, and C++. So, it's another really good one. Then Hugging Face has a whole ton of leaderboards. They have leaderboards for medical models, models that are specialized in the medical domain. It has financial services leaderboards, different spoken languages, so Spanish and Korean and Japanese, and vision-generating leaderboards. So, Hugging Face has tons of these.

Jon Krohn: 00:49:00
Very nice. I do see there's one more on your list that I'd like to call out, which is LMSYS, L-M-S-Y-S, which is really cool because that one, instead of having...so the evaluations are in head-to-head comparisons where human users evaluate whether output A or output B is superior given the query that they put in. So, it's a more expensive leaderboard to run to be able to collect those data. I don't mean that you're necessarily paying the users to come, but it's labor-intensive. There's a lot of effort that goes into that, but it provides a unique perspective relative to a lot of the other leaderboards out there. We have a whole episode actually about that. So, Professor Joey Gonzalez was on this show maybe about a year ago now. So, we'll have a link in the show notes to that episode for people to check out.

Ed Donner: 00:49:55
I can mention that they've actually changed their name. It's no longer called LMSYS. As of very recently, they've just changed their name. As you're seeing, it's called lmarena.ai. That is the new name of LMSYS.

Jon Krohn: 00:50:10
A better name.

Ed Donner: 00:50:11
It is a better name, but it's been around for so long and everyone knows LMSYS that it's hard to change that. But yes, it is now called lmarena.ai, and if you go there now, it's such a great thing for everyone in the community to do. We could all contribute to this arena by doing it ourselves and by going through and voting on a model. I believe as of right now, that the top slot is going to the very new version of Gemini 1.5 Flash.

Jon Krohn: 00:50:39
Oh really?

Ed Donner: 00:50:39
Gemini for the win. It's come right up to the top. Sorry, Gemini 1.5 Pro is right up at the top. People think this might be a preview version of Gemini 2, which is expected in the next month or so. So, right now, as of the time we're recording this, it may have been dethroned by the time you listen to this, but as of now, that is at the top.

Jon Krohn: 00:51:03
Well, all of this talk about other leaderboards and competitions evaluating LLMs, you yourself have come up with an innovative metric and innovative test of LLM quality. You want to tell us about it?

Ed Donner: 00:51:18
Well, thank you for bringing that up. So, it was a super fun project that I did. It brought me great joy. Because I'm such a leaderboar, I decided I should try and do one myself. I'd had this idea of having models compete with each other, setting up some very simple rules: you'd have four models playing against each other. They'd be given a certain number of coins. In fact, they start with 12 coins, and in each turn, they have to choose to take a coin from one of the other players and to give a coin to one of the other players. But before they make their decision, they have an opportunity to speak to all of the other players privately by sending them a message, and they receive the messages and use that to decide who they will take from and give to.

 00:52:13
They can build alliances, they can gang up on other models, and there is an extra bonus if two of the players, two of the LLMs decide to both gang up on the same model, then they get an extra top-up of coins. So, they are incented to try and form alliances. So, it's called Outsmart, that's what I called it, because models outsmart each other and I have it running. I guess I'll put that link in the show notes as well. A lot of people have run it and played with it. One of the really fun things about it is that you could see the trace, you can see what's going on in terms of both what they are saying to all of the other models. Also, they get to share their private strategy. They're told that the other models will not be told the strategy. This is just for their own record keeping.

00:53:04
So, they get to say what they're doing and they do hatch plans. They are certainly devious in what they do. It's also the case that the stronger models tend to do better. I was excited to see whether with that construct of rules, it would be able to separate out the stronger models from the weaker models. So, that you'll see there's a leaderboard there to look at, and you can play the game yourself and see how they scheme against each other. It was so much fun to make it. Honestly, it was just a joy.
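
Purely to illustrate the structure Ed describes, here is a hedged sketch of an Outsmart-style round; in the real game, LLM calls drive the messages and the take/give decisions, whereas this sketch uses a random stub so it runs on its own, and it omits the alliance bonus.

```python
# Hedged sketch of an Outsmart-style round: four players, 12 coins each,
# private messages, then a take-from and give-to choice per turn.
import random

players = ["gpt-4o-mini", "claude-3-5-haiku", "llama-3.1-8b", "gemini-1.5-flash"]  # illustrative
coins = {p: 12 for p in players}

def choose_moves(player: str, others: list[str], messages: dict) -> tuple[str, str]:
    """Stand-in for an LLM deciding whom to take from and whom to give to."""
    return random.choice(others), random.choice(others)

for turn in range(3):
    # In the real game each player first sends private messages to the others.
    private_messages = {p: f"{p}'s secret pitch to the table" for p in players}
    for p in players:
        others = [o for o in players if o != p]
        take_from, give_to = choose_moves(p, others, private_messages)
        coins[take_from] -= 1   # take a coin from one player
        coins[p] += 1
        coins[p] -= 1           # give a coin to another player
        coins[give_to] += 1

print(coins)
```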

Jon Krohn: 00:53:35
Cool. So, when I go to Outsmart and I play with it, what do I do? What do I get to change as a user?

Ed Donner: 00:53:43
So really all you get to do is kick off the game.

Jon Krohn: 00:53:46
You don't choose an LLM.

Ed Donner: 00:53:49
No, but I encourage people, you can download the code and then put in your own keys and choose the LLM. But because it's using my keys for this, I'm using cheap models in the public internet version of it. But, it's very much, it's super easy to download and very configurable. So, you can then have it running yourself to try out different models. As a voyeur, you kick off the game and you can then watch while they compete. You can watch their messages to each other as they gang up on each other and the outcome of your game then gets recorded and added to the ELO rating of the different models.

Jon Krohn: 00:54:26
Very cool. So, that gives us a nice round tour of the various ways that we can be evaluating and selecting an LLM for our task. We've talked about closed source and open source. We've talked about various leaderboards out there, including your own. Once you've selected a model and applied it to the problem, and you know which one you'd like to use, what does an AI engineer do? Do they hand it off to another team to productionize, or what?

Ed Donner: 00:54:53
That's a great question, and the answer is that it depends. There are different boundaries in different organizations, but it is becoming increasingly common for an AI engineer or the AI engineering team to also be responsible for productionizing their model. Again, there are a number of different ways that can happen as well. One of the ways that's very popular and that I love is to use a product like modal.com. We use it at Nebula and it's the most fabulous platform. It's called a serverless AI platform. It allows you to deploy your models so that they're running in the cloud.

00:55:32
Then you can call it through an endpoint, but you can also just have Python code that either runs your model locally or calls out to the cloud, and the code looks almost identical. So, it feels almost as if you're just running something locally, when in fact it's calling out to the cloud. The great innovation about Modal is that you only pay for the clock cycles that your model is actually running for. When you call out to it, it spins up and you start paying. It will then service your request, stay warm for a couple of minutes in case other requests come in, and then calmly shut down, and you stop paying. That's of course very, very useful in startup land.
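
To make the "looks like local code" point concrete, here is a minimal sketch of the pattern using Modal's Python SDK; the app name, GPU choice, and the generate function are illustrative placeholders, not Nebula's setup.

```python
# Minimal sketch of the Modal pattern (illustrative; app name, GPU type, and the
# generate() body are placeholders, not Nebula's actual deployment).
import modal

app = modal.App("example-llm-service")


@app.function(gpu="T4")  # spins up in Modal's cloud only when called; you pay while it runs
def generate(prompt: str) -> str:
    # In a real deployment you would load your model here and run inference on the GPU.
    return f"(model output for: {prompt})"


@app.local_entrypoint()
def main():
    # This reads like a local function call, but .remote() executes in the cloud,
    # stays warm briefly for follow-up requests, then shuts down so billing stops.
    print(generate.remote("Summarize serverless GPU inference in one sentence."))
```

You would run this with `modal run` (or deploy it with `modal deploy`) and could also expose it as a web endpoint; treat the specifics as a sketch rather than a recipe.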

Jon Krohn: 00:56:15
We've had Erik Bernhardsson, the CEO of Modal, on the show.

Ed Donner: 00:56:18
Wow.

Jon Krohn: 00:56:19
He sat right where you're sitting and we did an episode. It might be two years ago now. Again, we'll have that for you in the show notes. Yeah, Erik Bernhardsson, brilliant guy. Actually, he recently posted on LinkedIn his GitHub contributions over the last year, that 7-by-52 grid of contributions that people have on their GitHub profiles. It makes you feel really silly. It's like every day.

Ed Donner: 00:56:48
He is an overdeliverer. I'm sitting on hallowed ground. That's amazing. Modal is a fabulous product. If anyone hasn't played with it, you can, because you also get... I think it's $30 of free credit every month, so there's really no excuse if you're out there dabbling with models. It's so easy, and it's free to try out running a model on a Modal serverless endpoint. And he makes it very easy to deploy models to production. So, that's one way to do it. There are some other similar products, but I don't think the others let you pay in that way; I don't think they have that innovation. But there's Hugging Face endpoints, which are very useful. There's RunPod, which is another good one. In fact, I think you recently had Luca Antiga on the show from lightning.ai, and they have Lightning Studios, which allows you to deploy onto their cloud, which is another fabulous platform.

Jon Krohn: 00:57:47
Yeah, Lightning Studios makes it easy to prototype in a Jupyter Notebook setting where you can very quickly be using powerful GPUs in the cloud for your task, but then you can seamlessly transition that to a production application, and they're doing cool work over there. We'll have Luca Antiga's episode in the show notes for you as well, I'm sure.

Ed Donner: 00:58:08
And there's a more comprehensive way that AI engineers can be deploying as well. That is, if you're responsible for building out something like a RAG pipeline, if you've been optimizing at inference time as well as at training time and have built functionality around that, then you might be responsible for deploying that entire service. That could be something you put into a Docker container and deploy with Kubernetes on AWS or GCP, and you are responsible for that whole production service. It might have multiple endpoints that the engineering team is calling into. So, that's another model as well.
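
As a rough illustration of that kind of service (not Nebula's actual pipeline), here is a minimal FastAPI wrapper around a RAG flow that could be containerized with Docker and run on Kubernetes; retrieve() and generate() are stand-ins for your vector search and LLM call.

```python
# Minimal sketch of a RAG service behind an HTTP endpoint (illustrative only).
# retrieve() and generate() are placeholders for your vector store and LLM call.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class Query(BaseModel):
    question: str
    top_k: int = 5


def retrieve(question: str, top_k: int) -> list[str]:
    # Placeholder: query your vector store for the top_k most relevant chunks.
    return [f"chunk {i} relevant to {question!r}" for i in range(top_k)]


def generate(question: str, context: list[str]) -> str:
    # Placeholder: call your chosen LLM with the question plus the retrieved context.
    return f"Answer to {question!r}, grounded in {len(context)} retrieved chunks."


@app.post("/rag/answer")
def answer(query: Query) -> dict:
    context = retrieve(query.question, query.top_k)
    return {"answer": generate(query.question, context), "sources": context}
```

A Dockerfile for this could be as simple as installing fastapi and uvicorn and setting `CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]`, with Kubernetes handling replicas and routing; the details will depend on your stack.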

 00:58:45
Then, to go even further in that direction, if you're working with agents and you've built out an agentic AI platform, you may have an agentic environment where you have multiple agents collaborating, and you would potentially be responsible for deploying all of that. Some of the agent platforms give you a way to deploy onto their cloud. LangGraph, for example, has LangGraph Platform, which lets you deploy your full set of agents onto their cloud. CrewAI has CrewAI Enterprise, which is another one. So, you can use some of these production enterprise agent platforms to be running your agents in production.
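
To give a flavor of what such an agentic workload can look like in code, here is a minimal two-agent graph using LangGraph's StateGraph API; the node names and placeholder functions are illustrative, and a real deployment to LangGraph Platform or CrewAI Enterprise would add the platform's own configuration on top of this.

```python
# Minimal two-agent LangGraph sketch (illustrative; node logic is a placeholder
# rather than a real LLM call, and deployment config is platform-specific).
from typing import TypedDict

from langgraph.graph import StateGraph, START, END


class State(TypedDict):
    task: str
    draft: str
    review: str


def researcher(state: State) -> dict:
    # Placeholder for an agent that drafts an answer to the task with an LLM.
    return {"draft": f"Draft answer for: {state['task']}"}


def reviewer(state: State) -> dict:
    # Placeholder for a second agent that critiques the draft.
    return {"review": f"Looks reasonable: {state['draft']}"}


builder = StateGraph(State)
builder.add_node("researcher", researcher)
builder.add_node("reviewer", reviewer)
builder.add_edge(START, "researcher")
builder.add_edge("researcher", "reviewer")
builder.add_edge("reviewer", END)

graph = builder.compile()
print(graph.invoke({"task": "Summarize our deployment options", "draft": "", "review": ""}))
```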

Jon Krohn: 00:59:28
Fascinating episode, Ed. I knew it would be. So, what should people's next steps be after this episode if they want to continue learning about AI engineering or pursuing a career in the field?

Ed Donner: 00:59:39
Oh, well, I would of course give a totally shameless plug for this Udemy class I made. It was so much fun making this class. It was such a joy. As I say, you encouraged me to do it and I'm super grateful. I feel like I identified a gap, because I didn't see other courses out there that would give you that comprehensive grounding in all of these different bits: choosing the right model, the training-time optimizations, the inference-time optimizations, deployment, RAG, and agentic AI. So, the full package is in the Udemy course. Yeah, I do hope that some people check it out.

Jon Krohn: 01:00:24
That's LLM Engineering: Master AI and Large Language Models. We'll have a link to that Udemy course in the show notes. Yup. Sorry, I interrupted you.

Ed Donner: 01:00:31
Wonderful. I would give perhaps a slightly less shameless plug, but a much greater plug, for another Udemy class, which would be your Udemy class on the mathematical foundations of machine learning. I just feel like, again, this is one of those extraordinary things. It's such a treasure trove of educational material; this is the kind of thing one would otherwise need to have done a university degree in. You'd need to invest so much time and money to learn this kind of foundational information about deep learning, and it's all available. In fact, I think on your YouTube much of the content is free, and then if you want to do the exercises and the notebooks, then you've got a value add on the-

Jon Krohn: 01:01:17
Yeah, exactly. So, there are two versions of my Machine Learning Foundations curriculum. There's a studio-recorded version, where Pearson, the big educational giant, paid for me to be in a studio for four long weekends. It was like 15 recording days, something like that. Then they had a professional production team convert that into amazing, high-quality professional recordings, and the entire curriculum is available today on the O'Reilly platform. They also very generously gave me a unique and unusual carve-out where they allowed me to also record the same material at home using a webcam and publish that to YouTube and this Udemy course.

01:02:05
Now, the downside of doing it on my own is that I've only gotten halfway through. So, the linear algebra is covered, the calculus is covered, and a little bit of the probability theory, but there's lots more probability theory still to come, plus the statistics, data structures and algorithms, the computer science that still needs to get in there on YouTube and Udemy. But yeah, it's all freely available on YouTube. We'll have a link to that in the show notes. The only difference between the YouTube version and the Udemy version is that if you want to support me a little bit, then you can buy it on Udemy. That's nice, and you also get access to fully worked solutions in the Udemy version. But all of the educational content is in either.

Ed Donner: 01:02:41
I hardly need to tell your audience this, but honestly you are so masterful at this, being able to break things down into bite-sized chunks and build on top of them. I've seen you explain things so many times, and it's just a tremendous resource.

Jon Krohn: 01:02:55
Thank you, Ed. I really appreciate it. Actually, a question that you might have as a listener is: why do I need to learn, say, the mathematical foundations of machine learning anymore when so much can be done by LLMs? It's amazing: when you're building, training, or deploying a production AI system, there are tons of opportunities. If you can get into the nitty-gritty of how these things work and how you can deploy them, you can end up with clever things that are specific to your application, that will be proprietary, and that will actually give you a moat as a company.

Ed Donner: 01:03:25
Absolutely. Yeah, it's incredibly important to have that foundation, and you see that people who don't have it have a much harder time, particularly if you hit a problem with training or you're starting to see diminishing returns; without that foundation to fall back on, it's much, much harder.

Jon Krohn: 01:03:42
Exactly. All right, and then my two questions for all of my guests: do you have a book recommendation for us?

Ed Donner: 01:03:49
Okay. I mean, I was prepared. I did know.

Jon Krohn: 01:03:54
Pretending that you didn't know.

Ed Donner: 01:03:56
So first of all, I should say that I am terrible at reading. I am a really bad, bad reader.

Jon Krohn: 01:04:03
You could be listening to books at the gym.

Ed Donner: 01:04:04
I did that for a while, but not enough. I come from a family that are all avid readers. They are really prolific readers. I'm the scientist in the family, and they're highly literary. So, when we have our weekly family Zooms, they're really split in half. The first half of the family Zoom is everybody asking me technical support questions: why is the WiFi so flaky? Why can't I read my email? That's the first half. Then the second half is all about, "Oh, that's such a good read, this book. It's so good." Everyone else in my family seems to get through at least a book a week-

Jon Krohn: 01:04:49
Oh, my goodness. Wow.

Ed Donner: 01:04:50
They do force me to read from time to time. The most recent thing I read, which was six months ago, but still quite recent, is a book called Klara and the Sun, if you know that.

Jon Krohn: 01:05:00
I know, I started it.

Ed Donner: 01:05:03
So it's by Kazuo Ishiguro. In some ways it's not a science fiction book, but it certainly has science fiction elements to it; he's not a science fiction writer. He also wrote The Remains of the Day, which is a super famous, prize-winning book. But this one is about a humanoid robot, and it's set in the future. The humanoid robot is known as an AF, an artificial friend, and it's written from the first-person point of view of this artificial friend, which is really powerful. The thing about it that is so astonishing is that this book was written about a year before ChatGPT came out. So, it predates ChatGPT, which is now something that is on all of our minds, and that has made it so much more relevant. It's really moving. I urge you to finish it.

Jon Krohn: 01:06:02
I'll get going on it again. Thank you.

Ed Donner: 01:06:04
Oh yeah, oh yeah, it's great. Really amazing.

Jon Krohn: 01:06:05
All right, and then quickly, as we're losing light here. We're recording under natural light, though it's funny that as I said that, a lamp turned on.

Ed Donner: 01:06:14
I think you did that.

Jon Krohn: 01:06:14
Yeah, it's just an indicator that it's sunset. No, it was on cue. Incredible. Although not bright enough to light our faces on camera. So, very last question for you: how should people connect with you after the episode or follow your work?

Ed Donner: 01:06:26
Okay, well, thank you for asking that, and for me, it's simple. The best way to be in touch with me is through LinkedIn. I absolutely love LinkedIn and connecting with people. You hear some people saying, "If you want to connect with me on LinkedIn, then write a short thing saying where you heard this," or something.

Jon Krohn: 01:06:43
I do that.

Ed Donner: 01:06:43
Well, I'm not like that. I'm a LinkedIn board.

Jon Krohn: 01:06:48
You're a LION. I can't remember what it stands for, but if you have LION in all caps in your LinkedIn description, it means you accept all requests.

Ed Donner: 01:06:57
I see. Well, then I'm one of those. I can't be fussy. I'm fully open to all LinkedIn requests. Please bring them on. I love being part of the community. If this has interested you, if the field of AI engineering is something that seems interesting and you have career questions, then I'm always available for you to bounce questions off at any time. I'm pretty responsive as well. If you have ideas, if you have something that you're working on that involves LLMs and you're thinking, "What model should I use? Which leaderboard should I consult?" or something like that, then please hit me up. I love engaging in this stuff, so connect with me on LinkedIn. That's the best way.

Jon Krohn: 01:07:40
Very generous, very generous offer, and very generous of you to spend so much of your valuable time with us here. Thank you, Ed, for this amazing episode. It's been such a delight after 10 years of working together to have you on the show.

Ed Donner: 01:07:51
It's been such a joy to be on the show. I knew I'd have a blast and I did. Thank you, Jon. Thank you for having me. It's been really great.

Jon Krohn: 01:08:04
It was such a treat to finally share the incredible Ed Donner with you today. In today's episode, Ed covered AI engineering as a hybrid role combining data science, software engineering, and ML engineering skills, with approximately 4,000 current job openings in the US, rivaling the number of job openings for data scientists. He talked about how, when selecting an LLM, you want to consider starting with closed-source models like GPT-4o for prototyping, then potentially moving to open-source options if you have proprietary-data privacy requirements or high inference costs. He filled us in on how key techniques in AI engineering include fine-tuning models with domain-specific data, retrieval-augmented generation for enhancing responses with relevant context, and agentic AI for autonomous, proactive systems.

01:08:47
He also talked about important benchmarks for evaluating models, like GPQA for testing expert-level knowledge, MMLU-Pro for language understanding, and BIG-Bench Hard (BBH) for testing advanced capabilities. Then finally, he left us with deployment options for AI systems, including modal.com for serverless deployment, Lightning Studios for seamless prototyping to production, Docker and Kubernetes for full production services, and specialized platforms for agentic AI deployment. As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Ed's social media profiles, as well as my own, at superdatascience.com/847.

01:09:29
Thanks, of course, to everyone on the SuperDataScience Podcast team: our podcast manager Sonja Brajovic, our media editor Mario Pombo, partnerships manager Natalie Ziajski, researcher Serg Masis, our writers Dr. Zara Karschay and Sylvia Ogweng, and our founder Kirill Eremenko. Thanks to all of those folks for producing another magnificent episode for us today. For enabling that super team to create this free podcast for you, we're deeply grateful to our sponsors. You can support this show by checking out our sponsors' links, which are in the show notes, and if you yourself are interested in sponsoring an episode, you can get the details on how to do that by heading to jonkrohn.com/podcast.

01:10:07
Otherwise, share this episode with folks who'd love to learn about AI engineering. Review the show on your favorite podcasting platform or YouTube. Subscribe obviously if you're not already a subscriber. But most importantly, I just hope you'll keep on tuning in to more episodes in the future. I'm so grateful to have you listening, and I hope I can continue to make episodes you love for years and years to come. Until next time, keep on rocking it out there and I'm looking forward to enjoying another round of the SuperDataScience Podcast with you very soon.
