- You are at the right place if you want to understand what the Byte Pair Encoding subword tokenization algorithm is, how to train it, and how the tokenization of a text is done with this algorithm.

The BPE algorithm was initially proposed as a text compression algorithm, but it is also very well suited as a tokenizer for your language models. The idea of BPE is to divide words into a sequence of "subword units", which are units that appear frequently in a reference corpus, that is, the corpus we use to train it.

How is a BPE tokenizer trained? First of all, we have to get a corpus of texts. We will not train our tokenizer on this raw text: we will first normalize it, then pre-tokenize it. As the pre-tokenization divides the text into a list of words, we can represent our corpus in another way, by gathering together identical words and maintaining a counter, here represented in blue.

To understand how the training works, we consider this toy corpus composed of the following words: huggingface, hugging, hug, hugger, etc. BPE is an algorithm that starts with an initial vocabulary and then increases it to the desired size.

To build the initial vocabulary, we start by separating each word of the corpus into a list of the elementary units that compose it, here, the characters. We list in our vocabulary all the characters that appear, and they will constitute our initial vocabulary.

Let's now see how to increase it. We return to our split corpus: we go through the words one by one and count all the occurrences of token pairs. The first pair is composed of the tokens 'h' and 'u', the second of 'u' and 'g', and we continue like that until we have the complete list.

Once we know all the pairs and their frequency of appearance, we choose the one that appears most frequently. Here it is the pair composed of the letters 'l' and 'e'.
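A minimal Python sketch of this pair-counting step, assuming the corpus has already been pre-tokenized into a word-frequency dictionary; the word counts and function names below are illustrative, not taken from the video:

```python
from collections import Counter

# Toy corpus after pre-tokenization: each word with its count (illustrative numbers)
word_freqs = {
    "hug": 8, "hugging": 4, "huggingface": 3, "hugger": 2,
    "learn": 4, "learner": 7, "learning": 8,
}

# Initial splits: each word as a list of elementary units (characters)
splits = {word: list(word) for word in word_freqs}

def count_pairs(word_freqs, splits):
    """Count how often each pair of adjacent tokens appears across the corpus."""
    pair_freqs = Counter()
    for word, freq in word_freqs.items():
        tokens = splits[word]
        for left, right in zip(tokens, tokens[1:]):
            pair_freqs[(left, right)] += freq
    return pair_freqs

pair_freqs = count_pairs(word_freqs, splits)
best_pair = max(pair_freqs, key=pair_freqs.get)
print(best_pair, pair_freqs[best_pair])  # the most frequent pair becomes the next merge rule
```

Which pair wins depends on the counts you choose; in the video's example it is the pair 'l' and 'e'.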
We note our first merge rule and we add the new token to our vocabulary. We can then apply this merge rule to our splits: you can see that we have merged all the pairs of tokens composed of the tokens 'l' and 'e'.

And now, we just have to repeat the same steps with our new splits. We calculate the frequency of occurrence of each pair of tokens, we select the pair with the highest frequency, we note it in our merge rules, we add the new token to the vocabulary, and then we merge all the pairs of tokens composed of the tokens 'le' and 'a' in our splits.

And we can repeat this operation until we reach the desired vocabulary size. Here, we stopped when our vocabulary reached 21 tokens. We can see that the words of our corpus are now divided into far fewer tokens than at the beginning of the training, and that our algorithm has learned the radicals 'hug' and 'learn', and also the verbal ending 'ing'.

Now that we have learned our vocabulary and merge rules, we can tokenize new texts. For example, if we want to tokenize the word 'hugs', first we divide it into elementary units, so it becomes a sequence of characters. Then, we go through our merge rules in order and apply each one whenever we can. Here, we can merge the letters 'h' and 'u', and then we can merge these two tokens to get the new token 'hug'. When we get to the end of our merge rules, the tokenization is finished.

And that's it. I hope that now the BPE algorithm has no more secrets for you!
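As a rough Python sketch of this tokenization step, here is one way to apply an ordered list of learned merge rules to a new word; the merge rules listed are hypothetical, not the exact ones learned in the video:

```python
def bpe_tokenize(word, merges):
    """Split a word into characters, then apply the learned merge rules in order."""
    tokens = list(word)
    for left, right in merges:  # merges are applied in the order they were learned
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                tokens[i:i + 2] = [left + right]  # merge the adjacent pair into one token
            else:
                i += 1
    return tokens

# Hypothetical merge rules, listed in the order they were learned during training
merges = [("h", "u"), ("hu", "g"), ("i", "n"), ("in", "g")]

print(bpe_tokenize("hugs", merges))     # -> ['hug', 's']
print(bpe_tokenize("hugging", merges))  # -> ['hug', 'g', 'ing']
```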