subtitles/en/52_wordpiece-tokenization.srt
1
00:00:00,151 --> 00:00:02,818
(air whooshing)
2
00:00:05,520 --> 00:00:08,370
- Let's look together at
the training strategy
3
00:00:08,370 --> 00:00:11,851
of the WordPiece algorithm,
and how it performs
4
00:00:11,851 --> 00:00:15,150
the tokenization of a text, once trained.
5
00:00:19,351 --> 00:00:23,580
WordPiece is a tokenization
algorithm introduced by Google.
6
00:00:23,580 --> 00:00:25,653
It is used, for example, by BERT.
7
00:00:26,640 --> 00:00:28,020
To our knowledge,
8
00:00:28,020 --> 00:00:31,590
the code of WordPiece
has not been open-sourced.
9
00:00:31,590 --> 00:00:33,510
So we base our explanations
10
00:00:33,510 --> 00:00:36,903
on our own interpretation
of the published literature.
11
00:00:42,090 --> 00:00:44,883
So, what is the training
strategy of WordPiece?
12
00:00:46,200 --> 00:00:48,663
Similarly to the BPE algorithm,
13
00:00:48,663 --> 00:00:52,380
WordPiece starts by establishing
an initial vocabulary
14
00:00:52,380 --> 00:00:54,660
composed of elementary units,
15
00:00:54,660 --> 00:00:58,773
and then increases this
vocabulary to the desired size.
16
00:00:59,970 --> 00:01:01,950
To build the initial vocabulary,
17
00:01:01,950 --> 00:01:04,920
we divide each word in the training corpus
18
00:01:04,920 --> 00:01:07,443
into the sequence of
letters that make it up.
19
00:01:08,430 --> 00:01:11,820
As you can see, there is a small subtlety.
20
00:01:11,820 --> 00:01:14,190
We add two hash signs (##)
in front of the letters
21
00:01:14,190 --> 00:01:16,083
that do not start a word.
22
00:01:17,190 --> 00:01:20,430
By keeping only one occurrence
per elementary unit,
23
00:01:20,430 --> 00:01:23,313
we now have our initial vocabulary.
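As a rough sketch (not the original WordPiece code, which has not been open-sourced), this initial splitting and vocabulary construction could look as follows in Python; the toy word-frequency corpus and the variable names (word_freqs, splits, initial_vocab) are hypothetical, chosen only for illustration.

```python
# Hypothetical toy corpus: word -> number of occurrences in the corpus
word_freqs = {"hugging": 4, "face": 4, "hug": 2, "hugger": 1, "learning": 1}

# Split each word into its letters; letters that do not start a word
# are prefixed with "##"
splits = {
    word: [c if i == 0 else f"##{c}" for i, c in enumerate(word)]
    for word in word_freqs
}

# Keep only one occurrence per elementary unit: the initial vocabulary
initial_vocab = sorted({unit for split in splits.values() for unit in split})
print(initial_vocab)
```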
24
00:01:26,580 --> 00:01:29,823
We will list all the
existing pairs in our corpus.
25
00:01:30,990 --> 00:01:32,640
Once we have this list,
26
00:01:32,640 --> 00:01:35,253
we will calculate a score
for each of these pairs.
27
00:01:36,630 --> 00:01:38,400
As for the BPE algorithm,
28
00:01:38,400 --> 00:01:40,750
we will select the pair
with the highest score.
29
00:01:43,260 --> 00:01:44,340
Take, for example,
30
00:01:44,340 --> 00:01:47,343
the first pair composed
of the letters H and U.
31
00:01:48,510 --> 00:01:51,390
The score of a pair is
simply equal to the frequency
32
00:01:51,390 --> 00:01:54,510
of appearance of the pair,
divided by the product
33
00:01:54,510 --> 00:01:57,330
of the frequency of
appearance of the first token,
34
00:01:57,330 --> 00:02:00,063
and the frequency of appearance
of the second token.
35
00:02:01,260 --> 00:02:05,550
Thus, at a fixed frequency
of appearance of the pair,
36
00:02:05,550 --> 00:02:09,913
if the subparts of the pair are
very frequent in the corpus,
37
00:02:09,913 --> 00:02:11,823
then this score will be decreased.
38
00:02:13,140 --> 00:02:17,460
In our example, the pair
HU appears four times,
39
00:02:17,460 --> 00:02:22,460
the letter H four times,
and the letter U four times.
40
00:02:24,030 --> 00:02:26,733
This gives us a score of 0.25.
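Continuing the hypothetical splits and word_freqs from the sketch above, the pair score described here could be computed like this; compute_pair_scores is an illustrative helper, not an official API.

```python
from collections import defaultdict

def compute_pair_scores(splits, word_freqs):
    # Count how often each unit and each adjacent pair of units appears
    unit_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        for unit in split:
            unit_freqs[unit] += freq
        for a, b in zip(split, split[1:]):
            pair_freqs[(a, b)] += freq
    # score(pair) = freq(pair) / (freq(first unit) * freq(second unit))
    return {
        pair: freq / (unit_freqs[pair[0]] * unit_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }
```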
41
00:02:28,410 --> 00:02:30,960
Now that we know how to
calculate this score,
42
00:02:30,960 --> 00:02:33,360
we can do it for all pairs.
43
00:02:33,360 --> 00:02:35,217
We can now add to the vocabulary
44
00:02:35,217 --> 00:02:38,973
the pair with the highest score,
after merging it, of course.
45
00:02:40,140 --> 00:02:43,863
And now we can apply this same
merge to our split corpus.
46
00:02:45,780 --> 00:02:47,490
As you can imagine,
47
00:02:47,490 --> 00:02:50,130
we just have to repeat the same operations
48
00:02:50,130 --> 00:02:53,013
until we have the vocabulary
at the desired size.
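A possible sketch of this training loop, reusing the hypothetical compute_pair_scores, splits, word_freqs and initial_vocab from the previous sketches; the merge_pair helper and the target size of 30 are illustrative assumptions.

```python
def merge_pair(a, b, splits):
    # Merge every adjacent occurrence of (a, b) in the split corpus
    merged = a + b[2:] if b.startswith("##") else a + b
    for word, split in splits.items():
        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                split = split[:i] + [merged] + split[i + 2:]
            else:
                i += 1
        splits[word] = split
    return merged

vocab = list(initial_vocab)
target_size = 30  # arbitrary size for this toy example
while len(vocab) < target_size:
    scores = compute_pair_scores(splits, word_freqs)
    if not scores:  # no pairs left to merge
        break
    best = max(scores, key=scores.get)
    vocab.append(merge_pair(*best, splits))
```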
49
00:02:54,000 --> 00:02:55,800
Let's look at a few more steps
50
00:02:55,800 --> 00:02:58,113
to see the evolution of our vocabulary,
51
00:02:58,957 --> 00:03:01,773
and also the evolution of
the length of the splits.
52
00:03:06,390 --> 00:03:09,180
And now that we are happy
with our vocabulary,
53
00:03:09,180 --> 00:03:12,663
you are probably wondering how
to use it to tokenize a text.
54
00:03:13,830 --> 00:03:17,640
Let's say we want to tokenize
the word "huggingface".
55
00:03:17,640 --> 00:03:20,310
WordPiece follows these rules.
56
00:03:20,310 --> 00:03:22,530
We will look for the
longest possible token
57
00:03:22,530 --> 00:03:24,960
at the beginning of the word.
58
00:03:24,960 --> 00:03:28,920
Then we start again on the
remaining part of our word,
59
00:03:28,920 --> 00:03:31,143
and so on until we reach the end.
60
00:03:32,100 --> 00:03:35,973
And that's it. Huggingface is
divided into four sub-tokens.
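A minimal sketch of this longest-prefix tokenization, using for example the vocab built in the sketch above; the [UNK] fallback for words with no known prefix mirrors what BERT's tokenizer does, and the exact sub-tokens obtained for "huggingface" depend on the merges that were learned.

```python
def wordpiece_tokenize(word, vocab):
    tokens = []
    while word:
        # Look for the longest token in the vocabulary that matches
        # the beginning of the (remaining) word
        end = len(word)
        while end > 0 and word[:end] not in vocab:
            end -= 1
        if end == 0:
            return ["[UNK]"]  # no known prefix: the whole word is unknown
        tokens.append(word[:end])
        # Start again on the remaining part, marked with "##"
        word = word[end:]
        if word:
            word = "##" + word
    return tokens

print(wordpiece_tokenize("huggingface", set(vocab)))
```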
61
00:03:37,200 --> 00:03:39,180
This video is about to end.
62
00:03:39,180 --> 00:03:41,370
I hope it helped you to understand better
63
00:03:41,370 --> 00:03:43,653
what is behind the word WordPiece.
64
00:03:45,114 --> 00:03:47,864
(air whooshing)