SDS 684: Get More Language Context out of your LLM

Podcast Host: Jon Krohn

June 2, 2023

Open-source LLMs, FlashAttention and generative AI terminology: Host Jon Krohn gives us the lift we need to explore the next big steps in generative AI. 

 
How can open-source large language models (LLMs) compete with GPT-4’s capabilities? For this Five-Minute Friday, host Jon Krohn digs into a solution spearheaded by researchers at Stanford University: the “exact attention” algorithm FlashAttention.
Several open-source LLMs are on the market, and their relatively small model size makes them a great “fits-in-your-pocket” solution for running natural-language tasks. Despite this efficiency, open-source LLMs struggle to compete with OpenAI’s GPT architectures, largely because of their comparatively small context windows, which determine how much text an LLM can take into account when predicting the next word in a sequence. The compute and memory demands of self-attention grow quadratically with the length of the context window, which is why open-source options keep their windows small. Smaller context windows give a model fewer opportunities to capture the semantic interconnections between words, limiting its ability to consider broader context and narrowing the scope of the text it can generate.
On this week’s episode, find out how FlashAttention’s researchers managed to roughly triple model-training speed and make attention at inference time about seven times faster.

Podcast Transcript

(00:02):
This is Five-Minute Friday on FlashAttention. 

(00:19):
In most recent Friday episodes since mid-April, I’ve been singing the praises of open-source large language models (LLMs) like Vicuña that can approach GPT-4’s state-of-the-art capability. But, in comparison to recent OpenAI GPT architectures, which have hundreds of billions of parameters, these open-source ones we’ve been talking about are much smaller. The model families I covered typically range from about a billion model parameters up to 13 billion model parameters. This relatively small size, particularly when paired with the parameter-efficient fine-tuning I covered in Episode #674, means that these powerful LLMs can often be fine-tuned to the natural-language tasks of your choice on a single GPU and they are relatively fast and inexpensive to run in production. 
(01:08):
A big downside of these open-source LLMs that I hadn’t mentioned yet is that, relative to GPT-4, the size of their context window is pretty small. A context window is the number of tokens a language model can handle. For the uninitiated, you can think of tokens as words, although in reality they are typically sub-words; you can hear all about this in Episode #626. There is a special GPT-4 version, currently in a limited beta, that has a context window of 32,000 tokens, which corresponds to about 25,000 words or about 50 pages of single-spaced text, meaning you could provide a ton of natural-language context to GPT-4. Even the standard GPT-4, which you can access immediately for just $20/month via a ChatGPT Plus subscription, has a context window of 8,000 tokens, corresponding to about 12 pages of text. That’s quite a lot.
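If you’d like to see the word-versus-token distinction for yourself, here is a minimal Python sketch using the Hugging Face transformers library. The GPT-2 tokenizer is just an illustrative, publicly available stand-in; it is not the tokenizer OpenAI uses for GPT-4.

```python
# pip install transformers
from transformers import AutoTokenizer

# Illustrative choice: GPT-2's tokenizer is small and publicly available.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "FlashAttention widens the context window your LLM can afford."
tokens = tokenizer.tokenize(text)

print(len(text.split()), "words")  # rough whitespace word count
print(len(tokens), "tokens")       # typically more tokens than words
print(tokens)                      # note the sub-word pieces
```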
(02:01):
In contrast, in order to get decent real-time performance from an open-source LLM like Vicuña, you might need to limit your model to a range of about 500 to 1,000 tokens. That’s just a page or two of text, which is going to be a prohibitively small amount of context for a lot of everyday natural-language tasks. To understand the reason for this strict limitation, let’s quickly digress by explaining that LLMs consist of deep learning architectures called Transformers. These Transformers have something called a self-attention module that allows the Transformer to “pay attention” to the most important semantic context within a stretch of natural language. The computational and memory complexity of this self-attention is quadratic: it scales with the square of the length of the language sequence input into the model. So, roughly speaking, if we want to add ten more words of context, 10² is equal to 100, so adding just ten words of context can increase the amount of compute and memory our LLM needs by a factor of about 100. If we add, say, 500 more words of text, 500² is equal to 250,000, corresponding to a pretty massive increase in the required compute power and memory for our LLM. This is the power of quadratic scaling.
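To make that quadratic scaling concrete, here is a minimal PyTorch sketch of naive single-head self-attention. The n-by-n score matrix it materializes is exactly the piece whose compute and memory grow with the square of the sequence length.

```python
import torch

def naive_attention(q, k, v):
    # The (n, n) score matrix is the quadratic culprit: its size grows
    # with the square of the sequence length n.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

d = 64  # head dimension (illustrative)
for n in (500, 1000, 2000):
    q, k, v = (torch.randn(n, d) for _ in range(3))
    _ = naive_attention(q, k, v)
    # Each float32 entry of the n x n score matrix costs 4 bytes.
    print(f"n={n:5d}  score-matrix entries={n * n:>9,}  ~{n * n * 4 / 1e6:.1f} MB")
```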
(03:17):
The solution to this quadratic scaling problem was devised last year by researchers at Stanford University, and it’s called FlashAttention. Using GPT-2 as a benchmark, the researchers demonstrated a 3x speed-up on model training and a 7x speed-up on attention at inference time. And the longer the natural-language sequence you feed your LLM, the greater the relative speed-up it will experience as a result of using FlashAttention. Pretty cool, right? 
(03:48):
Well, how does it work? I don’t have time to go into a huge amount of detail, and I don’t think it’s really suited to a podcast format, but you can check out the FlashAttention paper, which I’ve got in the show notes. In a nutshell, and using a bit of technical jargon that I’m not going to take the time to break down today, the way FlashAttention attains these remarkable speed-ups is by having Transformers’ self-attention computations happen largely in your GPU’s super-fast on-chip SRAM instead of your GPU’s relatively slow HBM memory. 
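For readers who want a bit more than the podcast can cover, here is a simplified, single-head PyTorch sketch of the tiling and online-softmax idea from the paper. It processes the keys and values one block at a time, so the full n-by-n score matrix is never materialized; the real speed-up, though, comes from fusing this loop into a CUDA kernel that keeps each block in SRAM, which plain PyTorch code like this cannot express.

```python
import torch

def blockwise_attention(q, k, v, block=128):
    """Single-head sketch of the tiling + online-softmax trick: keys and
    values are processed one block at a time, so the full n x n score
    matrix is never materialized. Plain PyTorch, for illustration only."""
    n, d = q.shape
    out = torch.zeros(n, v.shape[-1])
    row_max = torch.full((n,), float("-inf"))
    row_sum = torch.zeros(n)
    for start in range(0, n, block):
        kj = k[start:start + block]
        vj = v[start:start + block]
        s = (q @ kj.T) / d ** 0.5                # scores for this block only
        new_max = torch.maximum(row_max, s.max(dim=-1).values)
        p = torch.exp(s - new_max[:, None])
        rescale = torch.exp(row_max - new_max)   # adjust earlier partial sums
        row_sum = rescale * row_sum + p.sum(dim=-1)
        out = rescale[:, None] * out + p @ vj
        row_max = new_max
    return out / row_sum[:, None]

q, k, v = (torch.randn(1024, 64) for _ in range(3))
reference = torch.softmax(q @ k.T / 64 ** 0.5, dim=-1) @ v
print((blockwise_attention(q, k, v) - reference).abs().max())  # tiny float error
```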

(04:21):
So, how do you take advantage of FlashAttention yourself to increase the context window of your LLMs without exploding your LLMs’ compute and memory requirements too much? Well, look no further. I’ve included a link to the FlashAttention GitHub repo in the show notes, and the technique is already integrated into at least half a dozen popular tools. For example, it’s available via PyTorch’s nn.Transformer module, via Hugging Face’s transformers library, and via Microsoft’s DeepSpeed inference engine. I’ve also provided a link in the show notes to the full list of current FlashAttention integrations.
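As a concrete starting point, here is a minimal sketch assuming PyTorch 2.0 or later: torch.nn.functional.scaled_dot_product_attention will dispatch to a fused, FlashAttention-style kernel when the GPU, dtype, and shapes support it, and fall back to other attention implementations otherwise.

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Illustrative shapes: (batch, heads, sequence length, head dim).
q, k, v = (torch.randn(1, 8, 2048, 64, device=device, dtype=dtype)
           for _ in range(3))

# On supported GPUs, PyTorch routes this call to a fused, memory-efficient
# attention kernel; on CPU it simply falls back to a standard implementation.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 2048, 64])
```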
(04:59):
All right, that’s it for today. FlashAttention is super useful for broadening the context window of your LLMs, making them useful for a much broader range of natural-language tasks, and it’s readily available to you via a range of popular sources. I hope you build something super cool with it!
(05:17):
Thanks to Shaan Khosla, a data scientist on my team at Nebula.io, for inspiring today’s episode with a FlashAttention-focused installment of his weekly Let’s Talk Text newsletter. His newsletter is free and available via Substack, so check it out if you’re looking for bite-sized chunks of technical natural language processing content each week.
(05:35):
And as for us, well, until next time, keep on rockin’ it out there, my friend, and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon. 