
13 minutes

Data Science | Artificial Intelligence

SDS 860: DeepSeek R1: SOTA Reasoning at 1% of the Cost

Podcast Guest: Jon Krohn

Friday Feb 07, 2025

Subscribe on Website, Apple Podcasts, Spotify, Stitcher Radio or TuneIn


DeepSeek-curious? This Five-Minute Friday is for you! Jon Krohn investigates the overwhelming overnight success of this new LLM, the product of a Chinese hedge fund. DeepSeek is a market newcomer, and yet it runs shoulder to shoulder with behemoths from OpenAI, Anthropic and Google like it’s all in a day’s work. 


Interested in sponsoring a Super Data Science Podcast episode? Email natalie@superdatascience.com for sponsorship information.
 
News of DeepSeek’s capabilities had an enormous effect on the economy, contributing to a significant fall in Nvidia’s share prices. Listen to the episode to hear just how significantly DeepSeek is disrupting the market, what DeepSeek’s comparatively low costs mean for that recently announced $500 billion Stargate project, and where we now stand on the map towards artificial general intelligence.

ITEMS MENTIONED IN THIS EPISODE
Jon Krohn: 00:05 
This is episode number 860 on DeepSeek R1.

00:27 
Welcome back to the Super Data Science Podcast. I'm your host, Jon Krohn. Let's start off with a couple of recent reviews of the show, like we sometimes do on Fridays. The first one's from Remnasa, who provided an Apple Podcasts review saying that they listen often and that they've been listening to the Super Data Science Podcast for a couple of years now, and they always find the content fascinating. They say that sometimes the content is a bit over their head, but you can bet that they look up the information and learn from every episode. Very cool. Nice to hear that, Remnasa. 

00:59 
And we had a second Apple Podcasts review as well. This one is from sailATX. It says the Super Data Science Podcast is a fantastic way to keep up with the world of AI and the people who work in the industry. They say that the guests we have on the show are spot on and always interesting. They also have nice things to say about my YouTube calculus course, which is helping them brush up on their math for a data science course that they're taking. Cool. Good luck with that course, sailATX, and I hope both of you, Remnasa and sailATX, continue to enjoy this show. 

01:33 
Thanks for all the recent ratings and feedback on Apple Podcasts, Spotify, and all the other podcasting platforms out there, as well as for likes and comments on YouTube videos. As a bit of friendly competition, which I mentioned for the first time a couple of weeks ago: regular listeners may know that I've guest co-hosted the excellent Last Week in AI podcast half a dozen times, and both regular hosts of that show, Andrey and Jeremie, have been my guests on the Super Data Science Podcast. Well, despite their show being many years younger than the Super Data Science Podcast, they are closing in on us in terms of number of Apple Podcasts ratings. At the time of recording, we have 286 and they're at 255, so we're staying ahead. And since I last mentioned this a couple of weeks ago, both podcasts have had about five new ratings each, so we're neck and neck, but it does seem like I need you to keep at it and press toward 300 ratings there on Apple Podcasts. So help me stay ahead of Andrey and Jeremie by heading to your podcasting app and rating the Super Data Science Podcast there. Bonus points if you leave written feedback; I'll be sure to read it on air like I did today. 

02:40 
All right, into the meat of today's episode now. In recent weeks, I'm sure you've noticed, there's been a ton of excitement over DeepSeek, a Chinese AI company that was spun out of a Chinese hedge fund just two years ago. DeepSeek's V3, a stream-of-consciousness chatbot-style model, caught the world's attention because it was able to perform near state-of-the-art models like OpenAI's GPT-4o and Google's Gemini 2.0 Flash. But it was DeepSeek's reasoning model that really made waves. It's kind of like OpenAI's o1 reasoning model: instead of producing a stream of consciousness, these reasoning models review what they've been "thinking" before necessarily outputting anything, and this kind of reasoning has turned out to be great for the same kinds of tasks that you might ponder and reason over with pencil and paper: math problems, computer science problems, those kinds of things. You can hear more about these kinds of reasoning models in episode 820 of this show, but suffice it to say that DeepSeek's reasoning model, called R1, caused huge economic disruption, with Nvidia's share price falling by 17% and the Nasdaq falling several percent last Monday. At the time of writing, DeepSeek's R1 reasoning model is statistically (that is, within a 95% confidence interval) tied for first place on the overall LM Arena leaderboard with the top models. 

It's literally in first place, statistically speaking, alongside GPT-4o and Gemini 2.0 Flash from Google. The LM Arena leaderboard is one of many leaderboards you could use to compare LLM performance, but it's particularly interesting because it involves humans blindly rating one model's output against another's. 
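Leaderboards like this typically convert those blind pairwise votes into Elo-style ratings, and models count as statistically tied when the confidence intervals around their ratings overlap. As a rough illustration of the mechanic only (this is not LM Arena's actual implementation; the function names and K-factor here are illustrative), a single human vote can update ratings like this:

```python
# Illustrative Elo-style update from one blind pairwise human vote.
# Not LM Arena's actual code; names and the K-factor are assumptions.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift both ratings toward the observed outcome of one comparison."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1 - e_w)
    ratings[loser] -= k * (1 - e_w)

ratings = {"model_a": 1000.0, "model_b": 1000.0}
update(ratings, winner="model_a", loser="model_b")
print(ratings)  # {'model_a': 1016.0, 'model_b': 984.0}
```

Over many thousands of votes the ratings converge, and bootstrapped confidence intervals around them determine which models share first place.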

04:38 
It's an interesting leaderboard; you can hear a ton about it in episode number 707 of this podcast if you'd like to. Anyway, this great performance, being at the top of the LM Arena leaderboard and other leaderboards out there, caught global attention, first because DeepSeek is an obscure Chinese company while all the previous top models were devised by Americans, specifically by Bay Area tech giants. More consequentially than great-power politics, however, DeepSeek's R1 caused a global economic tsunami because it is comparable in performance to the best OpenAI, Google, and Anthropic models while costing a fraction as much to train. 

05:19 
There are all kinds of complexities, externalities, and estimates to take into account when trying to compare costs between two different LLMs at two different companies. For example, what about the cost of training runs that didn't pan out? But speaking in rough approximations, training a single DeepSeek-V3 or DeepSeek-R1 model appears to cost on the order of millions of dollars, while training a state-of-the-art Bay Area model like o1, Gemini, or Claude 3.5 Sonnet reportedly costs on the order of hundreds of millions of dollars. So, about 100x more. 

05:52 
As I've stated on this show several times, even without conceptual scientific breakthroughs, simply scaling up the transformer architecture that underlies o1, Gemini, or Claude (such as by increasing training dataset size, increasing the number of model parameters, increasing training-time compute, or, in the case of reasoning models like o1, increasing inference-time compute) will lead to impressive LLM improvements that overtake more and more humans on more and more cognitive tasks and bring machines in the direction of artificial general intelligence. If you don't know what AGI is, you can check out episodes number 748 and 824 for more on all of what I just said in that last sentence. Implicit in this scaling statement, however, is that if researchers can devise major conceptual scientific breakthroughs with respect to how machines learn (you know, actually making scientific breakthroughs instead of just scaling things up), we could accelerate toward AGI even more rapidly. 
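To make the scaling idea concrete, here's a back-of-the-envelope sketch built on two widely cited heuristics: the approximation that training a transformer costs roughly 6 FLOPs per parameter per token, and the "Chinchilla" rule of thumb of roughly 20 training tokens per parameter. The numbers are illustrative, not any lab's actual budget:

```python
# Back-of-the-envelope transformer training-compute sketch.
# Both constants are rough heuristics, not exact engineering figures.

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

def chinchilla_tokens(n_params: float) -> float:
    """Compute-optimal token budget: roughly 20 tokens per parameter."""
    return 20 * n_params

n = 70e9                  # a hypothetical 70B-parameter model
d = chinchilla_tokens(n)  # ~1.4 trillion training tokens
print(f"{training_flops(n, d):.2e} FLOPs")  # 5.88e+23 FLOPs
```

Scaling any term up by 10x multiplies the compute bill accordingly, which is why, absent efficiency breakthroughs, ever-larger models mean ever-larger GPU orders.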

06:54 
If conceptual breakthroughs in AI model development can allow machines to improve their cognitive capabilities while also learning more efficiently, this would reduce server-farm energy consumption and the loss of fresh water through server cooling, and, of course, it would save plain old financial costs associated with running AI models. DeepSeek achieved such a conceptual breakthrough by combining a number of existing ideas, like mixture-of-experts models (you can learn more about those in episode 778), with brand-new major efficiencies, such as a GPU communications accelerator called DualPipe that schedules the way data passes between the couple thousand GPUs DeepSeek appears to have trained R1 with, to get the breathtaking results that they did. 
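To give a flavor of the mixture-of-experts idea (this is a toy sketch, not DeepSeek's actual architecture, and all names and numbers here are illustrative), a router sends each token to only its top-k experts, so most of the model's parameters sit idle for any given token, which is the source of the efficiency win:

```python
# Toy sketch of mixture-of-experts (MoE) top-k routing.
# Illustrative only; not DeepSeek's actual implementation.
import math

def softmax(xs):
    """Numerically stable softmax over a list of gate logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(gate_logits, k=2):
    """Pick the top-k experts for one token and renormalize their weights."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return [(i, probs[i] / total) for i in top]

# One token's gate logits over 8 experts; only 2 experts are activated,
# so the other 6 experts' parameters do no work for this token.
chosen = route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2)
print(chosen)  # experts 1 and 4, with weights summing to 1
```

Because only a small fraction of experts fire per token, an MoE model can hold far more total parameters than a dense model with the same per-token compute cost.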

07:40 
Now, 2,000 GPUs might sound like a lot, but it's again about 1% of the number of chips Meta's Mark Zuckerberg and xAI's Elon Musk have bragged about procuring in a given year for potentially training a single, ever-larger next large language model. I'm not going to go further into the technical details of the DeepSeek models in this episode, but if you'd like to dig into the technical aspects more deeply, I have provided a link to DeepSeek's full R1 paper as well as an exceptionally detailed, well-written blog post on the online tech news site The Next Platform that breaks down that paper. 

08:19 
Moving beyond technical aspects to geopolitics, DeepSeek's success demonstrates that the American sanctions that prevent Chinese firms from accessing the latest, most powerful Nvidia chips have been ineffective. These sanctions were explicitly designed to prevent China from overtaking the US on the road to AGI, particularly given the military implications of having access to a machine that could far exceed human cognitive capabilities. But now a Chinese firm has figured out how to approach US firms' AI capabilities with about 1% of the quantity of chips, at about 1% of the cost, and using less capable Nvidia chips than American firms have access to. On a side note, in a separate quandary for the Chinese Communist Party, for geopolitical reasons they'd probably prefer that DeepSeek's intellectual property be kept proprietary, and yet DeepSeek graciously open-sourced their work for the world to leverage to advance AI research as well as AI application development. 

09:13 
All of the DeepSeek V3 and R1 source code and model weights are available on GitHub (I've got a link to that in the show notes), and all of that source code and those model weights are available for use under the highly permissive MIT license. All aspects of models like those from OpenAI, Google, Anthropic, and xAI are, on the other hand, proprietary, so that's another big positive for the AI community from the folks at DeepSeek. This level of openness from DeepSeek goes far beyond even what so-called open LLMs like Meta's Llama family offer, because Meta provides model weights but not source code, and Meta's unusual license includes constraints such as requiring companies with more than 700 million monthly active users to obtain a special license before using Llama models. 

10:00 
Beyond providing their models open source, DeepSeek also created an iOS app. It was number one in the Apple App Store at the time of recording this episode, but I would caution you against using the DeepSeek app because, per the app's privacy policy, anything you input into DeepSeek's app is collected by the company and stored on servers in China. If you'd like to privately use a DeepSeek model but don't want to take the time or money to download the raw model weights and run them on your own hardware yourself, you can use a platform like Ollama. I've got a link in the show notes to the R1 model from DeepSeek provided by Ollama, so you can do just that. 

10:35 
Okay, so hopefully you're excited that you now have untethered access to state-of-the-art AI capabilities, but that should only be the beginning of your excitement. Markedly more efficient LLM training calls into question the recent $6 billion raises by OpenAI, xAI, and Anthropic, much of which would have been earmarked for training ever-larger transformer architectures for ever-longer inference time; it looks like those raises may no longer be very well-allocated capital. And DeepSeek's release ended up being coincidentally, but nevertheless comically, timed with the announcement of the $500 billion Stargate AI infrastructure project, which featured the CEOs of OpenAI, Oracle, and SoftBank alongside Donald Trump. That enormous $500 billion Stargate figure probably only made sense when bean counters assumed LLMs would keep growing by orders of magnitude in the coming years. And, correspondingly, Nvidia's share price took a 17% hit in one day, although at the time of writing and recording this podcast episode, some of that share price hit had recovered. 

11:44 
But that share price took that 17% hit because shareholders realized the LLM size increases they'd baked into future GPU orders may no longer come to fruition. For most of us, though, certainly for me and probably for most listeners, markedly more efficient LLM training and a return to the open-source model that dominated AI research until just a few years ago is fabulous news. Increased LLM efficiency in particular means fewer environmental issues associated with AI, and it means that developing, training, and running AI models is more economical, so developing practical AI applications becomes cheaper and more widely available for all to use and benefit from around the world. These are exciting times indeed. Dream up something big and make it happen; there's never been an opportunity to make an impact like there is today.

12:37 
All right, that's it for today's episode. If you enjoyed it or know someone who might, consider sharing this episode with them, leave a review of the show on your favorite podcasting platform, tag me in a LinkedIn or Twitter post with your thoughts, and, if you aren't already, obviously, subscribe to the show. Most importantly, however, I just hope you'll keep on listening. Until next time, keep on rockin' it out there, and I'm looking forward to enjoying another round of the Super Data Science Podcast with you very soon.
