
1
00:00:00,234 --> 00:00:02,901
(page whirring)

2
00:00:04,260 --> 00:00:07,200
- Before diving into character-based tokenization,

3
00:00:07,200 --> 00:00:10,350
understanding why this kind of tokenization is interesting

4
00:00:10,350 --> 00:00:13,533
requires understanding the flaws of word-based tokenization.

5
00:00:14,640 --> 00:00:16,320
If you haven't seen the first video

6
00:00:16,320 --> 00:00:17,880
on word-based tokenization,

7
00:00:17,880 --> 00:00:21,450
we recommend you check it out before looking at this video.

8
00:00:21,450 --> 00:00:24,250
Okay, let's take a look at character-based tokenization.

9
00:00:25,650 --> 00:00:28,560
We now split our text into individual characters,

10
00:00:28,560 --> 00:00:29,673
rather than words.

11
00:00:32,850 --> 00:00:35,550
There are generally a lot of different words in languages,

12
00:00:35,550 --> 00:00:37,743
while the number of characters stays low.

13
00:00:38,610 --> 00:00:41,313
To begin, let's take a look at the English language:

14
00:00:42,210 --> 00:00:45,540
it has an estimated 170,000 different words,

15
00:00:45,540 --> 00:00:47,730
so we would need a very large vocabulary

16
00:00:47,730 --> 00:00:49,413
to encompass all words.

17
00:00:50,280 --> 00:00:52,200
With a character-based vocabulary,

18
00:00:52,200 --> 00:00:55,440
we can get by with only 256 characters,

19
00:00:55,440 --> 00:00:58,683
which includes letters, numbers and special characters.

20
00:00:59,760 --> 00:01:02,190
Even languages with a lot of different characters,

21
00:01:02,190 --> 00:01:04,800
like the Chinese languages, can have dictionaries

22
00:01:04,800 --> 00:01:08,130
with up to 20,000 different characters

23
00:01:08,130 --> 00:01:11,523
but more than 375,000 different words.

24
00:01:12,480 --> 00:01:14,310
So character-based vocabularies

25
00:01:14,310 --> 00:01:16,293
let us use fewer different tokens

26
00:01:16,293 --> 00:01:19,050
than the word-based tokenization dictionaries

27
00:01:19,050 --> 00:01:20,523
we would otherwise use.

28
00:01:23,250 --> 00:01:25,830
These vocabularies are also more complete

29
00:01:25,830 --> 00:01:28,950
than their word-based vocabulary counterparts.

30
00:01:28,950 --> 00:01:31,410
As our vocabulary contains all characters

31
00:01:31,410 --> 00:01:33,960
used in a language, even words unseen

32
00:01:33,960 --> 00:01:36,990
during the tokenizer training can still be tokenized,

33
00:01:36,990 --> 00:01:39,633
so out-of-vocabulary tokens will be less frequent.

34
00:01:40,680 --> 00:01:42,840
This includes the ability to correctly tokenize

35
00:01:42,840 --> 00:01:45,210
misspelled words, rather than discarding them

36
00:01:45,210 --> 00:01:46,623
as unknown straight away.

37
00:01:48,240 --> 00:01:52,380
However, this algorithm isn't perfect either.

38
00:01:52,380 --> 00:01:54,360
Intuitively, characters do not hold

39
00:01:54,360 --> 00:01:57,990
as much information individually as a word would hold.

40
00:01:57,990 --> 00:02:00,930
For example, "Let's" holds more information

41
00:02:00,930 --> 00:02:03,570
than its first letter "l".
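For readers following along in text, here is a minimal sketch of the idea described above. It is not code from the video: the build_char_vocab and char_tokenize helpers, the toy corpus, and the "<unk>" token are all hypothetical, assuming a plain Python dictionary as the vocabulary.

```python
# Minimal, illustrative character-based tokenizer.
# The helper names, the toy corpus and the "<unk>" token are hypothetical.

def build_char_vocab(corpus):
    """Collect every character seen in the training corpus into a vocabulary."""
    chars = sorted(set("".join(corpus)))
    # Reserve id 0 for characters never seen during training (should be rare).
    return {"<unk>": 0, **{ch: i + 1 for i, ch in enumerate(chars)}}

def char_tokenize(text, vocab):
    """Split the text into individual characters and map each one to an id."""
    return [vocab.get(ch, vocab["<unk>"]) for ch in text]

corpus = ["Let's do tokenization!", "Character-based tokenizers are simple."]
vocab = build_char_vocab(corpus)

print(len(vocab))                     # a few dozen entries, far below 170,000 words
print(char_tokenize("Let's", vocab))  # one id per character: L, e, t, ', s
# A misspelled word is still split into known characters
# instead of being replaced by a single unknown token.
print(char_tokenize("tokanization", vocab))
```

Running this shows a vocabulary of only a couple dozen entries, and the misspelled word is still tokenized character by character rather than becoming a single unknown token.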
42
00:02:03,570 --> 00:02:05,880
Of course, this is not true for all languages,

43
00:02:05,880 --> 00:02:08,880
as some languages, like ideogram-based languages,

44
00:02:08,880 --> 00:02:11,523
have a lot of information held in single characters,

45
00:02:12,750 --> 00:02:15,360
but for others, like languages based on the Roman alphabet,

46
00:02:15,360 --> 00:02:17,760
the model will have to make sense of multiple tokens

47
00:02:17,760 --> 00:02:20,670
at a time to get the information otherwise held

48
00:02:20,670 --> 00:02:21,753
in a single word.

49
00:02:23,760 --> 00:02:27,000
This leads to another issue with character-based tokenizers:

50
00:02:27,000 --> 00:02:29,520
their sequences are translated into a very large number

51
00:02:29,520 --> 00:02:31,593
of tokens to be processed by the model.

52
00:02:33,090 --> 00:02:36,810
And this can have an impact on the size of the context

53
00:02:36,810 --> 00:02:40,020
the model will carry around, and will reduce the size

54
00:02:40,020 --> 00:02:42,030
of the text we can use as input for our model,

55
00:02:42,030 --> 00:02:43,233
which is often limited.

56
00:02:44,100 --> 00:02:46,650
This tokenization, while it has some issues,

57
00:02:46,650 --> 00:02:48,720
has seen some very good results in the past,

58
00:02:48,720 --> 00:02:50,490
and so it should be considered when approaching

59
00:02:50,490 --> 00:02:52,680
a new problem, as it solves issues

60
00:02:52,680 --> 00:02:54,843
encountered in the word-based algorithm.

61
00:02:56,107 --> 00:02:58,774
(page whirring)
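To make the sequence-length drawback concrete, here is a small, hypothetical comparison (again, not from the video) between a naive whitespace word split and a character split of the same sentence; the example sentence and the token counts in the comments are only for illustration.

```python
# Hypothetical comparison of sequence lengths for the same sentence.

text = "Character-based tokenizers produce much longer sequences."

word_tokens = text.split()   # naive whitespace word split: 6 tokens
char_tokens = list(text)     # character split: 57 tokens (spaces included)

print(len(word_tokens), word_tokens)
print(len(char_tokens))
# With a limited context size, splitting into characters uses roughly
# ten times more of the model's input budget for this sentence.
```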