
7 minutes

Data Science

SDS 626: Subword Tokenization with Byte-Pair Encoding

Podcast Guest: Jon Krohn

Friday Nov 11, 2022

Subscribe on Website, Apple Podcasts, Spotify, Stitcher Radio or TuneIn


Jon Krohn delivers a mini-bootcamp on tokenization this week, comparing word tokenization, character tokenization and subword tokenization. Less than 10 minutes is all you need to better understand this NLP-related process.
 

If you're new to tokenization, it's simply transforming a long string of characters into shorter entities called tokens. There are, of course, several types of tokenization, which is exactly what Jon tackles in this seven-minute episode. Starting with standard word-level tokenization, Jon explains the pros and cons of all three types and explores their roles in techniques like Word2Vec, GloVe, ELMo and more.

In word-level tokenization, the white space between words is used to split the string into word tokens. In character tokenization, the string is split into its individual characters. These two methods, however, contain critical flaws. But as Jon explains, NLP researchers have developed a solution: subword tokenization, where the tokens aren't quite as coarse as words but also not as granular as characters.

Intrigued and eager to learn more? Take a 10-minute break to dive into this fascinating topic with Jon.

Interested in sponsoring a SuperDataScience Podcast episode? Visit JonKrohn.com/podcast for sponsorship information.

ITEMS MENTIONED IN THIS PODCAST:
  • Shaan Khosla's Let's Talk Text Substack newsletter

DID YOU ENJOY THE PODCAST?
  • Now that you're familiar with tokenization, can you list more pros and cons for each type of tokenization?
  • Download The Transcript
(00:05): This is Five-Minute Friday on Subword Tokenization with Byte-Pair Encoding.

(00:27): When working with written natural language data as we do with many natural language processing models, a step we typically carry out while preprocessing the data is tokenization. In a nutshell, tokenization is the conversion of a long string of characters into smaller units that we call tokens.

(00:45): The standard way to tokenize natural language has historically been something called word-level tokenization. This is a conceptually straightforward kind of tokenization: We can, for example, simply use the white space between words to identify where one word ends and the next word begins, thereby converting a natural language string like “the cat sat” into three tokens: “the” and “cat” and “sat”.
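
As a concrete illustration, here is a minimal Python sketch of whitespace-based word-level tokenization. It is a toy version: real word-level tokenizers also handle punctuation, casing and similar details that are ignored here.

```python
# Minimal word-level tokenization sketch: split a string on whitespace.
# Real tokenizers also handle punctuation, casing, etc.
def word_tokenize(text: str) -> list[str]:
    """Split text into word-level tokens using whitespace."""
    return text.split()

print(word_tokenize("the cat sat"))  # ['the', 'cat', 'sat']
```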

(01:11): This word-level tokenization is used by techniques like Word2Vec and GloVe, two popular NLP techniques for quantitatively representing the relative meaning of words. A big drawback with such word-level tokenization is that if a word didn’t show up enough times in our training data, then when the NLP model encounters that word in production, there’s no way to handle it. In situations like this, we consider the new token to be unknown and therefore it is ignored by the model — even though the word might have been important to our production NLP application.
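
To make that unknown-token problem concrete, here is a hypothetical sketch: a tiny word-level vocabulary in which any out-of-vocabulary word collapses to a single <unk> id. The vocabulary and the <unk> convention are illustrative assumptions, not taken from any particular model.

```python
# Hypothetical illustration of the out-of-vocabulary problem with word-level tokens:
# any word missing from the training vocabulary collapses to a single <unk> id,
# so its meaning is effectively lost to the model.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}

def encode(tokens: list[str]) -> list[int]:
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

print(encode("the dog sat".split()))  # [1, 0, 3] -- "dog" becomes <unk>
```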

(01:45): To avoid the big unknown-token issue that word-level tokenization has, we can use something called character-level tokenization instead. With character-level tokenization, a natural language string like “the cat sat” is converted now into eleven tokens instead of three, so each of the characters in the sentence “the cat sat”: “t”, “h”, “e”, “space”, “c”, “a”, “t”, “space”, “s”, “a”, “t”. All of those characters, including the spaces, are included in the tokenization. This way now when we encounter a word outside of our model’s vocabulary in production, we don’t need to ignore that word — instead the model can leverage its aggregate representation of the characters that make up the new word to represent the new word.
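
Here is a minimal sketch of character-level tokenization using the same “the cat sat” example:

```python
# Character-level tokenization: every character, including spaces, becomes a token.
def char_tokenize(text: str) -> list[str]:
    return list(text)

tokens = char_tokenize("the cat sat")
print(tokens)       # ['t', 'h', 'e', ' ', 'c', 'a', 't', ' ', 's', 'a', 't']
print(len(tokens))  # 11
```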

(02:29): A technique called ELMo — which stands for “Embeddings from Language Models” — is a prominent example of an NLP technique that uses character-level tokenization. Unfortunately, character-level tokenization also has its own drawbacks. For one, it requires a large number of tokens to represent a sequence of text. In addition, unlike a word, a character doesn’t on its own convey any meaning, which can result in suboptimal model performance.

(02:54): So, we’ve now learned that both word-level and character-level tokenization have critical flaws. What can we do? Well, thankfully NLP researchers have devised a solution, something called subword tokenization. Subwords sit between words and characters: They aren’t as coarse as words, but they aren’t as meaningless or small as characters. Subword tokenization blends the computational efficiency of word-level tokens with the capacity of character-level tokenization to handle out-of-vocabulary words — it ends up being the best of both worlds!

(03:28): There are many algorithms out there for tokenizing strings of natural language into subwords, many of which rely upon a concept called byte-pair encoding. The general idea is that we specify how many subwords we’d like to have in our vocabulary and then we rely on byte-pair encoding to determine what the particular subwords should be given the natural language we provide to it. So, there is a four-step process here. First, the byte-pair encoding algorithm performs word-level tokenization. Second, it splits each individual word-level token into granular character-level tokens. Third, it computes how frequently tokens occur next to each other across all the words in our natural language data. And then finally, it repeatedly merges the most frequently occurring adjacent pair of tokens, recounting after each merge, until the number of subwords you specified is reached.
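
For the curious, here is a minimal Python sketch of that four-step procedure, in the spirit of the classic byte-pair encoding algorithm. It is a toy illustration, not a production tokenizer: the sample corpus, the whitespace splitting, and the number of merges (which, together with the initial characters, determines the final vocabulary size) are all simplified assumptions.

```python
# Toy byte-pair encoding trainer following the four steps described above.
from collections import Counter

def train_bpe(text: str, num_merges: int) -> list[tuple[str, str]]:
    # Step 1: word-level tokenization via whitespace, keeping word counts.
    word_counts = Counter(text.split())
    # Step 2: split each word-level token into character-level tokens.
    words = {word: list(word) for word in word_counts}

    merges = []
    for _ in range(num_merges):
        # Step 3: count how often adjacent tokens occur next to each other.
        pair_counts = Counter()
        for word, count in word_counts.items():
            symbols = words[word]
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Step 4: merge the most frequent adjacent pair wherever it appears,
        # then repeat until num_merges merges have been performed.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        for word, symbols in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            words[word] = merged
    return merges

# Prints the learned merge rules, in the order they were learned.
print(train_bpe("low lower lowest related relate relation", num_merges=8))
```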

(04:24): Once computed this way, the beauty of subwords is that — unlike characters — subwords do have meaning and so they can be recombined to represent out-of-vocabulary words efficiently. For example, let’s say that after we ran byte-pair encoding over our natural language data it learned to represent the subword tokens “re”, “lat”, and “ed”. These three subwords can be combined to form the word “related”. Now, in a contrived example, to demonstrate the power of this technique, let’s say that the word “unrelated” wasn’t in our training data. When our NLP application comes across the word “unrelated” in production, it should nevertheless be able to efficiently represent the meaning of the word “unrelated” because not only did the byte-pair encoding tokenize “re”, “lat”, and “ed” but let’s assume that it tokenized the subword “un” as well. The subword “un” and its negation of meaning would allow our NLP application to represent that “unrelated” means the opposite of “related” even though it never encountered the word “unrelated” during training. Very cool, and very powerful!
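
To illustrate the recombination idea, here is a hypothetical sketch that segments the unseen word “unrelated” using the subwords from the example above. The greedy longest-match segmentation used here is a simplification chosen for clarity; an actual byte-pair encoding tokenizer would instead apply its learned merge rules in order.

```python
# Hypothetical illustration: representing an unseen word with known subwords.
# The subword vocabulary below is assumed, matching the spoken example.
subword_vocab = {"un", "re", "lat", "ed",
                 "u", "n", "r", "e", "l", "a", "t", "d"}

def greedy_segment(word: str, vocab: set[str]) -> list[str]:
    pieces, i = [], 0
    while i < len(word):
        # Take the longest vocabulary entry that matches at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Fall back to the raw character if nothing in the vocab matches.
            pieces.append(word[i])
            i += 1
    return pieces

print(greedy_segment("unrelated", subword_vocab))  # ['un', 're', 'lat', 'ed']
```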

(05:33): The upshot is that byte-pair encoding is indeed so powerful that it and closely related subword tokenizers are crucial components behind many of the leading NLP models of today, such as GPT-3, BERT, and XLNet. So, if you didn’t understand the broad strokes of tokenization, particularly this influential byte-pair encoding approach to tokenization, prior to today’s episode, hopefully you do now!

(05:55): Thanks to Shaan Khosla, a data scientist on my team at my machine learning company Nebula, for inspiring this Five-Minute Friday episode by writing about tokenization with byte-pair encoding in his Let’s Talk Text Substack newsletter. He uses the newsletter to provide a weekly, easy-to-read summary of a recent key natural language processing paper or concept, so you can subscribe to that for free if that’s something you’re interested in — we’ve provided a link to Shaan’s Substack in the show notes.

(06:23): Ok, that’s it for this episode. Until next time, keep on rockin’ it out there, folks, and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon. 
