subtitles/en/52_wordpiece-tokenization.srt
1
00:00:00,151 --> 00:00:02,818
(air whooshing)
2
00:00:05,520 --> 00:00:08,370
- Let's look together at
the training strategy
3
00:00:08,370 --> 00:00:11,851
of the WordPiece algorithm,
and how it performs
4
00:00:11,851 --> 00:00:15,150
the tokenization of a text, once trained.
5
00:00:19,351 --> 00:00:23,580
WordPiece is a tokenization
algorithm introduced by Google.
6
00:00:23,580 --> 00:00:25,653
It is used, for example, by BERT.
7
00:00:26,640 --> 00:00:28,020
To our knowledge,
8
00:00:28,020 --> 00:00:31,590
the code of WordPiece
has not been open-sourced.
9
00:00:31,590 --> 00:00:33,510
So we base our explanations
10
00:00:33,510 --> 00:00:36,903
on our own interpretation
of the published literature.
11
00:00:42,090 --> 00:00:44,883
So, what is the training
strategy of WordPiece?
12
00:00:46,200 --> 00:00:48,663
Similarly to the BPE algorithm,
13
00:00:48,663 --> 00:00:52,380
WordPiece starts by establishing
an initial vocabulary
14
00:00:52,380 --> 00:00:54,660
composed of elementary units,
15
00:00:54,660 --> 00:00:58,773
and then increases this
vocabulary to the desired size.
16
00:00:59,970 --> 00:01:01,950
To build the initial vocabulary,
17
00:01:01,950 --> 00:01:04,920
we divide each word in the training corpus
18
00:01:04,920 --> 00:01:07,443
into the sequence of
letters that make it up.
19
00:01:08,430 --> 00:01:11,820
As you can see, there is a small subtlety.
20
00:01:11,820 --> 00:01:14,190
We add two hash signs (##)
in front of the letters
21
00:01:14,190 --> 00:01:16,083
that do not start a word.
22
00:01:17,190 --> 00:01:20,430
By keeping only one occurrence
per elementary unit,
23
00:01:20,430 --> 00:01:23,313
we now have our initial vocabulary.
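As a rough sketch (not the original WordPiece code, which has not been open-sourced), this initial splitting and vocabulary construction could look as follows in Python; the toy word-frequency corpus and the variable names (word_freqs, splits, initial_vocab) are hypothetical, chosen only for illustration.

```python
# Hypothetical toy corpus: word -> number of occurrences in the corpus
word_freqs = {"hugging": 4, "face": 4, "hug": 2, "hugger": 1, "learning": 1}

# Split each word into its letters; letters that do not start a word
# are prefixed with "##"
splits = {
    word: [c if i == 0 else f"##{c}" for i, c in enumerate(word)]
    for word in word_freqs
}

# Keep only one occurrence per elementary unit: the initial vocabulary
initial_vocab = sorted({unit for split in splits.values() for unit in split})
print(initial_vocab)
```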
24
00:01:26,580 --> 00:01:29,823
We will list all the
existing pairs in our corpus.
25
00:01:30,990 --> 00:01:32,640
Once we have this list,
26
00:01:32,640 --> 00:01:35,253
we will calculate a score
for each of these pairs.
27
00:01:36,630 --> 00:01:38,400
As for the BPE algorithm,
28
00:01:38,400 --> 00:01:40,750
we will select the pair
with the highest score.
29
00:01:43,260 --> 00:01:44,340
Take, for example,
30
00:01:44,340 --> 00:01:47,343
the first pair composed
of the letters H and U.
31
00:01:48,510 --> 00:01:51,390
The score of a pair is
simply equal to the frequency
32
00:01:51,390 --> 00:01:54,510
of appearance of the pair,
divided by the product
33
00:01:54,510 --> 00:01:57,330
of the frequency of
appearance of the first token,
34
00:01:57,330 --> 00:02:00,063
and the frequency of appearance
of the second token.
35
00:02:01,260 --> 00:02:05,550
Thus, at a fixed frequency
of appearance of the pair,
36
00:02:05,550 --> 00:02:09,913
if the subparts of the pair are
very frequent in the corpus,
37
00:02:09,913 --> 00:02:11,823
then this score will be decreased.
38
00:02:13,140 --> 00:02:17,460
In our example, the pair
HU appears four times,
39
00:02:17,460 --> 00:02:22,460
the letter H four times,
and the letter U four times.
40
00:02:24,030 --> 00:02:26,733
This gives us a score of 0.25.
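Continuing the hypothetical splits and word_freqs from the sketch above, the pair score described here could be computed like this; compute_pair_scores is an illustrative helper, not an official API.

```python
from collections import defaultdict

def compute_pair_scores(splits, word_freqs):
    # Count how often each unit and each adjacent pair of units appears
    unit_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        for unit in split:
            unit_freqs[unit] += freq
        for a, b in zip(split, split[1:]):
            pair_freqs[(a, b)] += freq
    # score(pair) = freq(pair) / (freq(first unit) * freq(second unit))
    return {
        pair: freq / (unit_freqs[pair[0]] * unit_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }
```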
41
00:02:28,410 --> 00:02:30,960
Now that we know how to
calculate this score,
42
00:02:30,960 --> 00:02:33,360
we can do it for all pairs.
43
00:02:33,360 --> 00:02:35,217
We can now add to the vocabulary
44
00:02:35,217 --> 00:02:38,973
the pair with the highest score,
after merging it, of course.
45
00:02:40,140 --> 00:02:43,863
And now we can apply this same
merge to our split corpus.
46
00:02:45,780 --> 00:02:47,490
As you can imagine,
47
00:02:47,490 --> 00:02:50,130
we just have to repeat the same operations
48
00:02:50,130 --> 00:02:53,013
until we have the vocabulary
at the desired size.
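A possible sketch of this training loop, reusing the hypothetical compute_pair_scores, splits, word_freqs and initial_vocab from the previous sketches; the merge_pair helper and the target size of 30 are illustrative assumptions.

```python
def merge_pair(a, b, splits):
    # Merge every adjacent occurrence of (a, b) in the split corpus
    merged = a + b[2:] if b.startswith("##") else a + b
    for word, split in splits.items():
        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                split = split[:i] + [merged] + split[i + 2:]
            else:
                i += 1
        splits[word] = split
    return merged

vocab = list(initial_vocab)
target_size = 30  # arbitrary size for this toy example
while len(vocab) < target_size:
    scores = compute_pair_scores(splits, word_freqs)
    if not scores:  # no pairs left to merge
        break
    best = max(scores, key=scores.get)
    vocab.append(merge_pair(*best, splits))
```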
49
00:02:54,000 --> 00:02:55,800
Let's look at a few more steps
50
00:02:55,800 --> 00:02:58,113
to see the evolution of our vocabulary,
51
00:02:58,957 --> 00:03:01,773
and also the evolution of
the length of the splits.
52
00:03:06,390 --> 00:03:09,180
And now that we are happy
with our vocabulary,
53
00:03:09,180 --> 00:03:12,663
you are probably wondering how
to use it to tokenize a text.
54
00:03:13,830 --> 00:03:17,640
Let's say we want to tokenize
the word "huggingface".
55
00:03:17,640 --> 00:03:20,310
WordPiece follows these rules.
56
00:03:20,310 --> 00:03:22,530
We will look for the
longest possible token
57
00:03:22,530 --> 00:03:24,960
at the beginning of the word.
58
00:03:24,960 --> 00:03:28,920
Then we start again on the
remaining part of our word,
59
00:03:28,920 --> 00:03:31,143
and so on until we reach the end.
60
00:03:32,100 --> 00:03:35,973
And that's it. Huggingface is
divided into four sub-tokens.
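A minimal sketch of this longest-prefix tokenization, using for example the vocab built in the sketch above; the [UNK] fallback for words with no known prefix mirrors what BERT's tokenizer does, and the exact sub-tokens obtained for "huggingface" depend on the merges that were learned.

```python
def wordpiece_tokenize(word, vocab):
    tokens = []
    while word:
        # Look for the longest token in the vocabulary that matches
        # the beginning of the (remaining) word
        end = len(word)
        while end > 0 and word[:end] not in vocab:
            end -= 1
        if end == 0:
            return ["[UNK]"]  # no known prefix: the whole word is unknown
        tokens.append(word[:end])
        # Start again on the remaining part, marked with "##"
        word = word[end:]
        if word:
            word = "##" + word
    return tokens

print(wordpiece_tokenize("huggingface", set(vocab)))
```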
61
00:03:37,200 --> 00:03:39,180
This video is about to end.
62
00:03:39,180 --> 00:03:41,370
I hope it helped you to understand better
63
00:03:41,370 --> 00:03:43,653
what is behind the word WordPiece.
64
00:03:45,114 --> 00:03:47,864
(air whooshing)