(air whooshing)

- Let's see together what the training strategy of the WordPiece algorithm is, and how it performs the tokenization of a text once trained.

WordPiece is a tokenization algorithm introduced by Google. It is used, for example, by BERT. To our knowledge, the code of WordPiece has not been open-sourced, so we base our explanations on our own interpretation of the published literature.

So, what is the training strategy of WordPiece? Similarly to the BPE algorithm, WordPiece starts by establishing an initial vocabulary composed of elementary units, and then increases this vocabulary until it reaches the desired size.

To build the initial vocabulary, we divide each word in the training corpus into the sequence of letters that make it up. As you can see, there is a small subtlety: we add two hash signs, "##", in front of the letters that do not start a word. By keeping only one occurrence per elementary unit, we now have our initial vocabulary.

We then list all the existing pairs in our corpus. Once we have this list, we calculate a score for each of these pairs. As with the BPE algorithm, we select the pair with the highest score.

Take, for example, the first pair, composed of the letters "h" and "u". The score of a pair is simply equal to the frequency of appearance of the pair, divided by the product of the frequency of appearance of the first token and the frequency of appearance of the second token. Thus, for a fixed frequency of appearance of the pair, if the subparts of the pair are very frequent in the corpus, then this score will be decreased.

In our example, the pair "hu" appears four times, the letter "h" four times, and the letter "u" four times. This gives us a score of 0.25.

Now that we know how to calculate this score, we can do it for all pairs. We can then add to the vocabulary the pair with the highest score, after merging it, of course. And now we can apply this same fusion to our split corpus.
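To make the scoring and merging steps concrete, here is a minimal sketch in Python of one training iteration. This is our own illustration, not Google's (unreleased) implementation: the toy corpus, the `word_freqs` counts, and the helper names are all made up for the example.

```python
from collections import defaultdict

# Toy corpus: each word with a made-up frequency, already split into
# elementary units. Letters that do not start a word carry the "##" prefix.
splits = {
    "hugging": ["h", "##u", "##g", "##g", "##i", "##n", "##g"],
    "face": ["f", "##a", "##c", "##e"],
}
word_freqs = {"hugging": 4, "face": 4}

def compute_pair_scores(splits, word_freqs):
    token_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        for i, token in enumerate(split):
            token_freqs[token] += freq
            if i < len(split) - 1:
                pair_freqs[(token, split[i + 1])] += freq
    # score(a, b) = freq(ab) / (freq(a) * freq(b))
    return {
        pair: freq / (token_freqs[pair[0]] * token_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }

def merge_pair(a, b, splits):
    # Merging drops the "##" prefix of the second token.
    new_token = a + (b[2:] if b.startswith("##") else b)
    for word, split in splits.items():
        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                split = split[:i] + [new_token] + split[i + 2:]
            else:
                i += 1
        splits[word] = split
    return new_token

scores = compute_pair_scores(splits, word_freqs)
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])  # e.g. ('h', '##u') with score 4 / (4 * 4) = 0.25
merge_pair(*best_pair, splits)       # apply the fusion to the split corpus
```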
As you can imagine, we just have to repeat the same operations until the vocabulary reaches the desired size. Let's look at a few more steps to see the evolution of our vocabulary, and also the evolution of the length of the splits.

And now that we are happy with our vocabulary, you are probably wondering how to use it to tokenize a text. Let's say we want to tokenize the word "huggingface". WordPiece follows these rules: we look for the longest possible token at the beginning of the word, then we start again on the remaining part of the word, and so on until we reach the end, as sketched in the code example below. And that's it: "huggingface" is divided into four sub-tokens.

This video is about to end. I hope it helped you better understand what is behind the word WordPiece.

(air whooshing)
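As a companion to the rule described above, here is a minimal sketch of the longest-match-first tokenization, again as our own interpretation. The vocabulary below is hypothetical, just large enough to split "huggingface" into four pieces; the actual sub-tokens depend on the vocabulary learned during training.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first tokenization of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Look for the longest vocabulary token matching the beginning
        # of the remaining part of the word.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the "##" prefix
            if piece in vocab:
                current = piece
                break
            end -= 1
        if current is None:
            return [unk_token]  # no match at all: the whole word is unknown
        tokens.append(current)
        start = end
    return tokens

# Hypothetical vocabulary, assumed for the illustration.
vocab = {"hugg", "##ing", "##fac", "##e", "h", "##u"}
print(wordpiece_tokenize("huggingface", vocab))
# ['hugg', '##ing', '##fac', '##e']  -> four sub-tokens, as in the video
```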