1
00:00:00,165 --> 00:00:01,416
(screen whooshing)
2
00:00:01,416 --> 00:00:02,716
(sticker popping)
3
00:00:02,716 --> 00:00:03,549
(screen whooshing)
4
00:00:03,549 --> 00:00:05,603
- Let's take a look at
word-based tokenization.
5
00:00:07,650 --> 00:00:09,780
Word-based tokenization is the idea
6
00:00:09,780 --> 00:00:11,940
of splitting the raw text into words
7
00:00:11,940 --> 00:00:14,673
by splitting on spaces
or other specific rules,
8
00:00:16,020 --> 00:00:17,163
like punctuation.
9
00:00:18,900 --> 00:00:21,810
In this algorithm, each
word has a specific number
10
00:00:21,810 --> 00:00:23,463
or ID attributed to it.
11
00:00:24,360 --> 00:00:27,270
Here, "let's" has the ID 250,
12
00:00:27,270 --> 00:00:30,150
"do" has 861, and "tokenization"
13
00:00:30,150 --> 00:00:33,393
followed by an exclamation mark has 345.
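As a rough sketch of this mapping (the code and vocabulary below are illustrative, not the exact rules used in the video), a naive word-based tokenizer can simply split on spaces and look each word up in a dictionary of IDs:

```python
# Hand-written vocabulary for this example: each word gets an arbitrary ID.
vocab = {"Let's": 250, "do": 861, "tokenization!": 345}

def word_tokenize(text):
    # The simplest word-based rule: split the raw text on spaces.
    return text.split()

tokens = word_tokenize("Let's do tokenization!")
print(tokens)                              # ["Let's", 'do', 'tokenization!']
print([vocab[token] for token in tokens])  # [250, 861, 345]
```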
14
00:00:34,380 --> 00:00:36,000
This approach is interesting
15
00:00:36,000 --> 00:00:38,100
as the model has representations
16
00:00:38,100 --> 00:00:40,233
that are based on entire words.
17
00:00:42,720 --> 00:00:45,960
The information held in
a single number is high,
18
00:00:45,960 --> 00:00:48,240
as a word contains a lot of contextual
19
00:00:48,240 --> 00:00:49,803
and semantic information.
20
00:00:53,070 --> 00:00:55,473
However, this approach
does have its limits.
21
00:00:56,610 --> 00:01:00,570
For example, the word "dog" and
the word "dogs" are very similar
22
00:01:00,570 --> 00:01:01,923
and their meanings are close.
23
00:01:03,210 --> 00:01:05,550
Word-based tokenization, however,
24
00:01:05,550 --> 00:01:08,520
will attribute entirely
different IDs to these two words
25
00:01:08,520 --> 00:01:10,110
and the model will therefore learn
26
00:01:10,110 --> 00:01:12,930
two different embeddings
for these two words.
27
00:01:12,930 --> 00:01:15,090
This is unfortunate as
we would like the model
28
00:01:15,090 --> 00:01:18,240
to understand that these
words are indeed related,
29
00:01:18,240 --> 00:01:21,483
and that "dogs" is simply the
plural form of the word "dog".
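To illustrate that consequence in code (a sketch assuming PyTorch and made-up IDs, not taken from the video): two distinct IDs select two independent rows of the embedding matrix, so nothing ties the vectors for "dog" and "dogs" together.

```python
import torch
from torch import nn

# Made-up IDs: "dog" and "dogs" are unrelated entries in the vocabulary.
vocab = {"dog": 5, "dogs": 6}

# One independently learned vector per vocabulary entry.
embeddings = nn.Embedding(num_embeddings=10, embedding_dim=4)

dog_vec = embeddings(torch.tensor(vocab["dog"]))
dogs_vec = embeddings(torch.tensor(vocab["dogs"]))
print(dog_vec)   # nothing links this vector...
print(dogs_vec)  # ...to this one, even though the words are closely related
```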
30
00:01:22,980 --> 00:01:24,480
Another issue with this approach
31
00:01:24,480 --> 00:01:28,050
is that there are a lot of
different words in a language.
32
00:01:28,050 --> 00:01:29,490
If we want our model to understand
33
00:01:29,490 --> 00:01:32,160
all possible sentences in that language,
34
00:01:32,160 --> 00:01:35,850
then we will need to have an
ID for each different word.
35
00:01:35,850 --> 00:01:37,380
And the total number of words,
36
00:01:37,380 --> 00:01:40,080
which is also known as
the vocabulary size,
37
00:01:40,080 --> 00:01:41,913
can quickly become very large.
38
00:01:44,400 --> 00:01:47,640
This is an issue because each
ID is mapped to a large vector
39
00:01:47,640 --> 00:01:50,190
that represents the word's meaning,
40
00:01:50,190 --> 00:01:52,170
and keeping track of these mappings
41
00:01:52,170 --> 00:01:54,990
requires an enormous number of weights
42
00:01:54,990 --> 00:01:57,123
when the vocabulary size is very large.
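As a back-of-the-envelope sketch (the numbers are purely illustrative): with a word-level vocabulary of 500,000 entries and an embedding dimension of 768, the embedding matrix alone holds hundreds of millions of weights.

```python
# Purely illustrative numbers: a large word-level vocabulary and a typical hidden size.
vocab_size = 500_000
embedding_dim = 768

# One row of the embedding matrix per vocabulary entry.
embedding_weights = vocab_size * embedding_dim
print(embedding_weights)  # 384000000 -> 384 million weights for the embeddings alone
```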
43
00:01:59,160 --> 00:02:00,960
If we want our models to stay lean,
44
00:02:00,960 --> 00:02:04,440
we can opt for our tokenizer
to ignore certain words
45
00:02:04,440 --> 00:02:06,093
that we don't necessarily need.
46
00:02:08,400 --> 00:02:11,970
For example, here, when training
our tokenizer on a text,
47
00:02:11,970 --> 00:02:15,020
we might want to take only
the 10,000 most frequent words
48
00:02:15,020 --> 00:02:16,320
in that text,
49
00:02:16,320 --> 00:02:18,600
rather than taking all
the words in that text
50
00:02:18,600 --> 00:02:22,503
or all of that language's words
to create our base vocabulary.
51
00:02:23,790 --> 00:02:26,520
The tokenizer will know how
to convert those 10,000 words
52
00:02:26,520 --> 00:02:29,370
into numbers, but any other
word will be converted
53
00:02:29,370 --> 00:02:31,530
to the out-of-vocabulary word,
54
00:02:31,530 --> 00:02:33,783
or, as shown here, the unknown word.
55
00:02:35,280 --> 00:02:37,440
Unfortunately, this is a compromise.
56
00:02:37,440 --> 00:02:39,900
The model will have the
exact same representation
57
00:02:39,900 --> 00:02:42,390
for all words that it doesn't know,
58
00:02:42,390 --> 00:02:45,210
which can result in a
lot of lost information
59
00:02:45,210 --> 00:02:47,664
if many unknown words are present.
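A sketch of this trade-off (again illustrative: the corpus, the cutoff, and the "[UNK]" name are assumptions, not from the video): keep only the most frequent words seen during training and collapse every other word to a single unknown token.

```python
from collections import Counter

def build_vocab(corpus, max_size):
    # Count word frequencies in the training text and keep only the most frequent words.
    counts = Counter(word for text in corpus for word in text.split())
    vocab = {"[UNK]": 0}  # reserve an ID for the unknown token
    for word, _ in counts.most_common(max_size):
        vocab[word] = len(vocab)
    return vocab

def encode(text, vocab):
    # Every word outside the vocabulary falls back to the same "[UNK]" ID.
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split()]

corpus = ["the dog runs", "the dogs run", "the cat sleeps"]
vocab = build_vocab(corpus, max_size=4)
print(encode("the dog barks", vocab))  # "barks" was never seen, so it maps to [UNK]
```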
60
00:02:47,664 --> 00:02:50,581
(screen whooshing)