subtitles/en/15_subword-based-tokenizers.srt
1
00:00:06,450 --> 00:00:09,540
- Let's take a look at
subword based tokenization.
2
00:00:09,540 --> 00:00:11,610
Understanding why subword
based tokenization is
3
00:00:11,610 --> 00:00:13,980
interesting requires
understanding the flaws
4
00:00:13,980 --> 00:00:17,340
of word based and character
based tokenization.
5
00:00:17,340 --> 00:00:18,780
If you haven't seen the first videos
6
00:00:18,780 --> 00:00:22,020
on word based and character
based tokenization
7
00:00:22,020 --> 00:00:23,130
we recommend you check them
8
00:00:23,130 --> 00:00:24,780
out before looking at this video.
9
00:00:27,840 --> 00:00:31,493
Subword based tokenization
lies in between character based
10
00:00:31,493 --> 00:00:35,280
and word based tokenization algorithms.
11
00:00:35,280 --> 00:00:37,410
The idea is to find a middle ground
12
00:00:37,410 --> 00:00:39,486
between very large vocabularies,
13
00:00:39,486 --> 00:00:42,600
a large quantity of out-of-vocabulary tokens,
14
00:00:42,600 --> 00:00:45,360
and a loss of meaning
across very similar words
15
00:00:45,360 --> 00:00:48,630
for word based tokenizers,
and very long sequences
16
00:00:48,630 --> 00:00:51,330
as well as less meaningful
individual tokens
17
00:00:51,330 --> 00:00:53,133
for character based tokenizers.
18
00:00:54,840 --> 00:00:57,960
These algorithms rely on
the following principle.
19
00:00:57,960 --> 00:01:00,000
Frequently used words should not be split
20
00:01:00,000 --> 00:01:01,500
into smaller subwords
21
00:01:01,500 --> 00:01:03,433
while rare words should be decomposed
22
00:01:03,433 --> 00:01:05,103
into meaningful subwords.
23
00:01:06,510 --> 00:01:08,460
An example is the word dog.
24
00:01:08,460 --> 00:01:11,190
We would like our
tokenizer to have a single ID
25
00:01:11,190 --> 00:01:12,600
for the word dog rather
26
00:01:12,600 --> 00:01:15,363
than splitting it into the
characters 'd', 'o' and 'g'.
27
00:01:16,650 --> 00:01:19,260
However, when encountering the word dogs
28
00:01:19,260 --> 00:01:22,710
we would like our tokenizer to
understand that at the root
29
00:01:22,710 --> 00:01:24,120
this is still the word dog,
30
00:01:24,120 --> 00:01:27,030
with an added 's' that
slightly changes the meaning
31
00:01:27,030 --> 00:01:28,923
while keeping the original idea.
32
00:01:30,600 --> 00:01:34,080
Another example is a complex
word like tokenization
33
00:01:34,080 --> 00:01:37,140
which can be split into
meaningful subwords.
34
00:01:37,140 --> 00:01:37,973
The root
35
00:01:37,973 --> 00:01:40,590
of the word is token and
-ization completes the root
36
00:01:40,590 --> 00:01:42,870
to give it a slightly different meaning.
37
00:01:42,870 --> 00:01:44,430
It makes sense to split the word
38
00:01:44,430 --> 00:01:47,640
into two: token as the root of the word,
39
00:01:47,640 --> 00:01:49,950
labeled as the start of the word
40
00:01:49,950 --> 00:01:52,530
and ization as additional
information labeled
41
00:01:52,530 --> 00:01:54,393
as a completion of the word.
42
00:01:55,826 --> 00:01:58,740
In turn, the model will
now be able to make sense
43
00:01:58,740 --> 00:02:01,080
of token in different situations.
44
00:02:01,080 --> 00:02:04,602
It will understand that the
words token, tokens, tokenizing
45
00:02:04,602 --> 00:02:08,760
and tokenization have a
similar meaning and are linked.
46
00:02:08,760 --> 00:02:12,450
It will also understand that
tokenization, modernization
47
00:02:12,450 --> 00:02:16,200
and immunization, which
all have the same suffix,
48
00:02:16,200 --> 00:02:19,383
are probably used in the
same syntactic situations.
49
00:02:20,610 --> 00:02:23,130
Subword based tokenizers
generally have a way to
50
00:02:23,130 --> 00:02:25,890
identify which tokens are starts of words
51
00:02:25,890 --> 00:02:28,443
and which tokens complete starts of words.
52
00:02:29,520 --> 00:02:31,140
So here, token is the start
53
00:02:31,140 --> 00:02:35,100
of a word, and ##ization
is the completion of a word.
54
00:02:35,100 --> 00:02:38,103
Here, the ## prefix
indicates that ization is part
55
00:02:38,103 --> 00:02:41,013
of a word rather than the beginning of it.
56
00:02:41,910 --> 00:02:43,110
The ## prefix comes
57
00:02:43,110 --> 00:02:47,013
from the BERT tokenizer, based
on the WordPiece algorithm.
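To make the ## convention above concrete, here is a minimal Python sketch using the transformers library with the bert-base-uncased checkpoint (chosen here as an assumption, it is not named in the video); a frequent word like dog is kept whole, while tokenization is split into a word start and a ##-prefixed completion.

from transformers import AutoTokenizer

# Assumption: bert-base-uncased, a WordPiece tokenizer, is used purely for illustration.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("dog"))           # frequent word kept as a single token: ['dog']
print(tokenizer.tokenize("tokenization"))  # rarer word decomposed, e.g. ['token', '##ization']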
58
00:02:47,850 --> 00:02:50,700
Other tokenizers use other
prefixes, which can be
59
00:02:50,700 --> 00:02:52,200
placed to indicate parts of words
60
00:02:52,200 --> 00:02:55,083
like here, or starts of words instead.
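As a hedged sketch of such an alternative convention, the snippet below assumes the xlnet-base-cased checkpoint (a SentencePiece-based tokenizer picked here as an example); it marks tokens that start a word with a leading '▁' instead of marking completions with '##'.

from transformers import AutoTokenizer

# Assumption: xlnet-base-cased is used only to show a start-of-word marker convention.
tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

# Word-initial tokens carry a leading '▁'; the exact splits depend on the learned vocabulary.
print(tokenizer.tokenize("Let's talk about tokenization."))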
61
00:02:56,250 --> 00:02:57,083
There are a lot
62
00:02:57,083 --> 00:02:58,740
of different algorithms that can be used
63
00:02:58,740 --> 00:03:00,090
for subword tokenization
64
00:03:00,090 --> 00:03:02,670
and most models obtaining
state-of-the-art results
65
00:03:02,670 --> 00:03:03,780
in English today
66
00:03:03,780 --> 00:03:06,663
use some kind of subword
tokenization algorithm.
67
00:03:07,620 --> 00:03:10,953
These approaches help
reduce vocabulary sizes
68
00:03:10,953 --> 00:03:13,636
by sharing information
across different words,
69
00:03:13,636 --> 00:03:15,960
and by having prefixes
70
00:03:15,960 --> 00:03:18,630
and suffixes understood as such.
71
00:03:18,630 --> 00:03:20,700
They keep meaning across
very similar words
72
00:03:20,700 --> 00:03:23,103
by recognizing the similar
tokens making them up.
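As a rough illustration of the vocabulary-size point, the sketch below compares two subword tokenizers (the checkpoints are assumptions made for this example); both vocabularies stay in the tens of thousands of tokens, far smaller than the hundreds of thousands of entries a purely word based vocabulary can require.

from transformers import AutoTokenizer

# Assumption: these checkpoints are chosen only to show typical subword vocabulary sizes.
for checkpoint in ["bert-base-uncased", "gpt2"]:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # WordPiece (BERT) and byte-level BPE (GPT-2) both keep the vocabulary around 30k-50k tokens.
    print(checkpoint, tokenizer.vocab_size)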