1
00:00:00,000 --> 00:00:02,667
(air whooshing)

2
00:00:05,310 --> 00:00:06,420
- In this video,

3
00:00:06,420 --> 00:00:09,881
we will study together
'the Unigram Language Model

4
00:00:09,881 --> 00:00:13,288
subword tokenization algorithm'.

5
00:00:13,288 --> 00:00:15,567
The overall training strategy

6
00:00:15,567 --> 00:00:18,450
of a Unigram Language Model tokenizer

7
00:00:18,450 --> 00:00:21,480
is to start with a very large vocabulary

8
00:00:21,480 --> 00:00:24,240
and then to remove
tokens at each iteration

9
00:00:24,240 --> 00:00:27,300
until we reach the desired size.

10
00:00:27,300 --> 00:00:28,530
At each iteration,

11
00:00:28,530 --> 00:00:30,930
we will calculate a loss
on our training corpus

12
00:00:30,930 --> 00:00:33,480
thanks to the Unigram model.

13
00:00:33,480 --> 00:00:37,470
As the loss calculation depends
on the available vocabulary,

14
00:00:37,470 --> 00:00:40,563
we can use it to choose how
to reduce the vocabulary.

15
00:00:41,550 --> 00:00:43,620
So we look at the evolution of the loss

16
00:00:43,620 --> 00:00:47,103
by removing in turn each
token from the vocabulary.

17
00:00:48,000 --> 00:00:50,430
We will choose to remove the p-percent of tokens

18
00:00:50,430 --> 00:00:52,200
which increase the loss the least.

19
00:00:56,310 --> 00:00:57,540
Before going further

20
00:00:57,540 --> 00:01:00,240
in the explanation of
the training algorithm,

21
00:01:00,240 --> 00:01:02,973
I need to explain what
a Unigram model is.

22
00:01:04,183 --> 00:01:06,030
The Unigram Language Model

23
00:01:06,030 --> 00:01:08,493
is a type of Statistical Language Model.

24
00:01:09,450 --> 00:01:10,980
A Statistical Language Model

25
00:01:10,980 --> 00:01:13,530
will assign a probability to a text

26
00:01:13,530 --> 00:01:18,090
considering that the text is
in fact a sequence of tokens.

27
00:01:18,090 --> 00:01:21,090
The simplest sequences
of tokens to imagine

28
00:01:21,090 --> 00:01:24,753
are the words that compose the
sentence or the characters.

29
00:01:26,130 --> 00:01:28,890
The particularity of
the Unigram Language Model

30
00:01:28,890 --> 00:01:32,010
is that it assumes that
the occurrence of each word

31
00:01:32,010 --> 00:01:34,533
is independent of the words before it.

32
00:01:35,400 --> 00:01:37,620
This assumption allows us to write

33
00:01:37,620 --> 00:01:39,570
that the probability of a text

34
00:01:39,570 --> 00:01:42,210
is equal to the product
of the probabilities

35
00:01:42,210 --> 00:01:43,953
of the tokens that compose it.
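As a minimal sketch of this product rule in Python: the function name and the three probabilities below are hypothetical, chosen only for illustration.

```python
import math

def sequence_probability(tokens, token_probs):
    """Unigram assumption: multiply the probabilities of the tokens."""
    return math.prod(token_probs[t] for t in tokens)

# Hypothetical token probabilities, not taken from the video.
token_probs = {"a": 0.5, "b": 0.25, "c": 0.25}

# P("a", "b", "a") = P("a") * P("b") * P("a")
print(sequence_probability(["a", "b", "a"], token_probs))
```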

36
00:01:45,840 --> 00:01:50,220
It should be noted here that
it is a very simple model

37
00:01:50,220 --> 00:01:53,850
which would not be suited
to the generation of text

38
00:01:53,850 --> 00:01:57,840
since this model would always
generate the same token,

39
00:01:57,840 --> 00:02:00,453
the one which has the
greatest probability.

40
00:02:01,320 --> 00:02:03,360
Nevertheless, to do tokenization,

41
00:02:03,360 --> 00:02:05,790
this model is very useful to us

42
00:02:05,790 --> 00:02:07,440
because it can be used

43
00:02:07,440 --> 00:02:10,893
to estimate the relative
likelihood of different phrases.

44
00:02:14,100 --> 00:02:15,000
We are now ready

45
00:02:15,000 --> 00:02:19,830
to return to our explanation
of the training algorithm.

46
00:02:19,830 --> 00:02:21,690
Let's say that we have
a training corpus

47
00:02:21,690 --> 00:02:23,880
with 10 times the word hug,

48
00:02:23,880 --> 00:02:25,410
12 times the word pug,

49
00:02:25,410 --> 00:02:27,330
5 times the word lug,

50
00:02:27,330 --> 00:02:28,560
4 times bug

51
00:02:28,560 --> 00:02:29,943
and 5 times dug.

52
00:02:33,120 --> 00:02:34,560
As said earlier,

53
00:02:34,560 --> 00:02:37,473
the training starts with a big vocabulary.

54
00:02:38,460 --> 00:02:41,400
Obviously, as we are using a toy corpus,

55
00:02:41,400 --> 00:02:44,430
this vocabulary will not be that big

56
00:02:44,430 --> 00:02:46,773
but it should show you the principle.

57
00:02:47,610 --> 00:02:51,870
A first method is to list all
the possible strict substrings

58
00:02:51,870 --> 00:02:53,823
and that's what we'll do here.

59
00:02:54,780 --> 00:02:58,170
We could also have used the BPE algorithm

60
00:02:58,170 --> 00:03:00,010
with a very large vocabulary size

61
00:03:01,410 --> 00:03:05,103
but for now, the strict
substrings are enough.
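Listing the strict substrings of the toy corpus could be sketched as follows. The helper name is our own, and we assume each substring is counted once per occurrence of the word that contains it.

```python
from collections import Counter

# Toy corpus from the video: word -> number of occurrences.
corpus = {"hug": 10, "pug": 12, "lug": 5, "bug": 4, "dug": 5}

def strict_substring_counts(word_freqs):
    """Count every substring that is shorter than the word itself."""
    counts = Counter()
    for word, freq in word_freqs.items():
        n = len(word)
        for start in range(n):
            for end in range(start + 1, n + 1):
                if end - start < n:  # strict: skip the whole word
                    counts[word[start:end]] += freq
    return counts

counts = strict_substring_counts(corpus)
for token, count in sorted(counts.items()):
    print(token, count)
```

With this corpus the sketch yields 13 tokens: the 7 single characters plus the 6 two-letter substrings.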

62
00:03:06,990 --> 00:03:09,120
The training of the Unigram tokenizer

63
00:03:09,120 --> 00:03:12,093
is based on the
Expectation-Maximization method.

64
00:03:13,320 --> 00:03:15,120
At each iteration,

65
00:03:15,120 --> 00:03:17,430
we estimate the
probabilities of the tokens

66
00:03:17,430 --> 00:03:18,430
of the vocabulary

67
00:03:20,130 --> 00:03:23,100
and then we remove the p-percent of tokens

68
00:03:23,100 --> 00:03:26,070
whose removal increases the loss the least

69
00:03:26,070 --> 00:03:28,900
and which do not belong
to the basic characters

70
00:03:29,880 --> 00:03:33,150
as we want to keep in our final vocabulary

71
00:03:33,150 --> 00:03:36,693
the basic characters to be
able to tokenize any word.

72
00:03:37,770 --> 00:03:39,641
Let's go for it!

73
00:03:39,641 --> 00:03:42,360
The probability of a
token is simply estimated

74
00:03:42,360 --> 00:03:44,760
by the number of appearances of this token

75
00:03:44,760 --> 00:03:46,440
in our training corpus

76
00:03:46,440 --> 00:03:50,133
divided by the total number of
appearances of all the tokens.
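This estimate can be sketched in a few lines. The counts below are an assumption: they are what we get by counting each strict substring in the toy corpus (hug 10, pug 12, lug 5, bug 4, dug 5).

```python
# Assumed counts of each strict substring in the toy corpus.
token_counts = {
    "h": 10, "u": 36, "g": 36, "hu": 10, "ug": 36,
    "p": 12, "pu": 12, "l": 5, "lu": 5,
    "b": 4, "bu": 4, "d": 5, "du": 5,
}

# Probability of a token = its count / total count of all tokens.
total = sum(token_counts.values())
token_probs = {tok: count / total for tok, count in token_counts.items()}

print(total)
print(token_probs["h"], token_probs["ug"])
```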

77
00:03:51,510 --> 00:03:54,390
We could use this vocabulary
to tokenize our words

78
00:03:54,390 --> 00:03:56,283
according to the Unigram model.

79
00:03:57,150 --> 00:04:00,892
We will do it together
to understand two things:

80
00:04:00,892 --> 00:04:04,110
how we tokenize a word
with a Unigram model

81
00:04:04,110 --> 00:04:07,803
and how the loss is
calculated on our corpus.

82
00:04:09,088 --> 00:04:12,263
The Unigram LM tokenization
of our text 'Hug'

83
00:04:12,263 --> 00:04:15,270
will be the one with the highest
probability of occurrence

84
00:04:15,270 --> 00:04:17,403
according to our Unigram model.

85
00:04:19,080 --> 00:04:21,750
To find it, the simplest way to proceed

86
00:04:21,750 --> 00:04:24,120
would be to list all the
possible segmentations

87
00:04:24,120 --> 00:04:25,800
of our text 'Hug',

88
00:04:25,800 --> 00:04:29,340
calculate the probability of
each of these segmentations

89
00:04:29,340 --> 00:04:32,043
and then choose the one with
the highest probability.
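This brute-force approach could look like the sketch below. The function names are our own, and the token counts are assumed to come from counting strict substrings in the toy corpus.

```python
import math

# Assumed counts of each strict substring in the toy corpus.
token_counts = {
    "h": 10, "u": 36, "g": 36, "hu": 10, "ug": 36,
    "p": 12, "pu": 12, "l": 5, "lu": 5,
    "b": 4, "bu": 4, "d": 5, "du": 5,
}
total = sum(token_counts.values())
token_probs = {tok: c / total for tok, c in token_counts.items()}

def segmentations(word, vocab):
    """Yield every split of `word` into tokens that all belong to `vocab`."""
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in vocab:
            for rest in segmentations(word[end:], vocab):
                yield [prefix] + rest

# Score each segmentation by the product of its token probabilities.
scored = [
    (seg, math.prod(token_probs[t] for t in seg))
    for seg in segmentations("hug", token_probs)
]
for seg, prob in scored:
    print(seg, prob)

best_seg, best_prob = max(scored, key=lambda pair: pair[1])
print("best:", best_seg, best_prob)
```

The number of segmentations grows quickly with the length of the word, so this is only practical for short examples like this one.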

90
00:04:33,210 --> 00:04:34,920
With the current vocabulary,

91
00:04:34,920 --> 00:04:38,640
two tokenizations get
exactly the same probability.

92
00:04:38,640 --> 00:04:40,080
So we choose one of them

93
00:04:40,080 --> 00:04:42,603
and keep in memory the
associated probability.

94
00:04:43,710 --> 00:04:46,380
To compute the loss on
our training corpus,

95
00:04:46,380 --> 00:04:48,570
we need to tokenize as we just did

96
00:04:48,570 --> 00:04:50,673
all the remaining words in the corpus.

97
00:04:52,290 --> 00:04:56,430
The loss is then the sum over
all the words in the corpus

98
00:04:56,430 --> 00:04:58,920
of the frequency of occurrence of the word

99
00:04:58,920 --> 00:05:02,670
multiplied by the negative
of the log of the probability

100
00:05:02,670 --> 00:05:05,463
associated with the
tokenization of the word.

101
00:05:07,620 --> 00:05:10,803
We obtain here a loss of 170.
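Here is a hedged sketch of this loss computation. We assume the natural logarithm and token counts obtained by counting strict substrings; the function names are our own. Under these assumptions the result is about 170.4, consistent with the value quoted above.

```python
import math

corpus = {"hug": 10, "pug": 12, "lug": 5, "bug": 4, "dug": 5}

# Assumed counts of each strict substring in the toy corpus.
token_counts = {
    "h": 10, "u": 36, "g": 36, "hu": 10, "ug": 36,
    "p": 12, "pu": 12, "l": 5, "lu": 5,
    "b": 4, "bu": 4, "d": 5, "du": 5,
}
total = sum(token_counts.values())
token_probs = {tok: c / total for tok, c in token_counts.items()}

def best_probability(word, probs):
    """Highest probability over all segmentations of `word` (brute force)."""
    if not word:
        return 1.0
    best = 0.0
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in probs:
            best = max(best, probs[prefix] * best_probability(word[end:], probs))
    return best

def corpus_loss(word_freqs, probs):
    """Sum over words of frequency * -log(probability of best tokenization)."""
    return sum(
        freq * -math.log(best_probability(word, probs))
        for word, freq in word_freqs.items()
    )

loss = corpus_loss(corpus, token_probs)
print(round(loss, 1))
```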

102
00:05:13,830 --> 00:05:18,630
Remember, our initial goal
was to reduce the vocabulary.

103
00:05:18,630 --> 00:05:21,870
To do this, we will remove
a token from the vocabulary

104
00:05:21,870 --> 00:05:24,213
and calculate the associated loss.

105
00:05:27,630 --> 00:05:30,627
Let's remove for example, the token 'ug'.

106
00:05:31,920 --> 00:05:35,370
We notice that the tokenization for 'hug'

107
00:05:35,370 --> 00:05:39,990
with the letter 'h' and the
token 'ug' is now impossible.

108
00:05:39,990 --> 00:05:42,240
Nevertheless, as we saw earlier

109
00:05:42,240 --> 00:05:45,180
that two tokenizations
had the same probability,

110
00:05:45,180 --> 00:05:47,730
we can still choose the
remaining tokenization

111
00:05:47,730 --> 00:05:51,093
with a probability of 1.10e-2.

112
00:05:52,410 --> 00:05:55,350
The tokenizations of the
other words of the corpus

113
00:05:55,350 --> 00:05:57,060
also remain unchanged.

114
00:05:57,060 --> 00:06:00,600
And finally, even if we
remove the token 'ug'

115
00:06:00,600 --> 00:06:05,403
from our vocabulary the
loss remains equal to 170.

116
00:06:06,630 --> 00:06:08,100
For this first iteration,

117
00:06:08,100 --> 00:06:10,080
if we continue the calculation,

118
00:06:10,080 --> 00:06:13,050
we would notice that we
could remove any token

119
00:06:13,050 --> 00:06:16,110
without it impacting the loss.
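We can check this with a small sketch. It assumes that, when we test a removal, the probabilities of the other tokens are left untouched, and that the loss uses the natural logarithm. The function names are our own.

```python
import math

corpus = {"hug": 10, "pug": 12, "lug": 5, "bug": 4, "dug": 5}
token_counts = {
    "h": 10, "u": 36, "g": 36, "hu": 10, "ug": 36,
    "p": 12, "pu": 12, "l": 5, "lu": 5,
    "b": 4, "bu": 4, "d": 5, "du": 5,
}
total = sum(token_counts.values())
token_probs = {tok: c / total for tok, c in token_counts.items()}

def best_probability(word, probs):
    """Highest probability over all segmentations of `word` (brute force)."""
    if not word:
        return 1.0
    best = 0.0
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in probs:
            best = max(best, probs[prefix] * best_probability(word[end:], probs))
    return best

def corpus_loss(word_freqs, probs):
    return sum(
        freq * -math.log(best_probability(word, probs))
        for word, freq in word_freqs.items()
    )

base_loss = corpus_loss(corpus, token_probs)

# Try removing each token that is not a single character.
deltas = {}
for tok in token_probs:
    if len(tok) == 1:
        continue  # basic characters are never removed
    reduced = {t: p for t, p in token_probs.items() if t != tok}
    deltas[tok] = corpus_loss(corpus, reduced) - base_loss

for tok, delta in deltas.items():
    print(tok, round(delta, 6))
```

Each word has two equally likely tokenizations here, so removing any single two-letter token leaves one of them available.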

120
00:06:16,110 --> 00:06:19,200
We will therefore choose at
random to remove the token 'ug'

121
00:06:19,200 --> 00:06:21,843
before starting a second iteration.

122
00:06:24,240 --> 00:06:27,300
So we estimate again the
probability of each token

123
00:06:27,300 --> 00:06:30,630
before calculating the impact
of each token on the loss.

124
00:06:32,160 --> 00:06:33,990
For example, if we remove now

125
00:06:33,990 --> 00:06:36,290
the token composed of
the letters 'h' and 'u',

126
00:06:37,350 --> 00:06:41,013
there is only one possible
tokenization left for hug.

127
00:06:41,940 --> 00:06:44,700
The tokenization of the
other words of the corpus

128
00:06:44,700 --> 00:06:45,633
is not changed.

129
00:06:46,560 --> 00:06:47,393
In the end,

130
00:06:47,393 --> 00:06:49,200
we obtain by removing the token

131
00:06:49,200 --> 00:06:52,749
composed of the letters 'h'
and 'u' from the vocabulary,

132
00:06:52,749 --> 00:06:56,430
a loss of 168.

133
00:06:56,430 --> 00:06:59,490
Finally, to choose which token to remove,

134
00:06:59,490 --> 00:07:02,490
we will for each remaining
token of the vocabulary,

135
00:07:02,490 --> 00:07:04,800
which is not an elementary token,

136
00:07:04,800 --> 00:07:07,380
calculate the associated loss.

137
00:07:07,380 --> 00:07:09,843
Then, we compare these losses with one another.

138
00:07:11,730 --> 00:07:13,800
The token which we will remove

139
00:07:13,800 --> 00:07:17,340
is the token which impacts
the loss the least,

140
00:07:17,340 --> 00:07:18,870
here the token 'bu'.
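This second iteration can be sketched as follows. We assume the probabilities are re-estimated by simply renormalizing the counts once 'ug' is gone, and that the loss uses the natural logarithm. The function names are our own.

```python
import math

corpus = {"hug": 10, "pug": 12, "lug": 5, "bug": 4, "dug": 5}

# Assumed counts after 'ug' was removed in the first iteration.
token_counts = {
    "h": 10, "u": 36, "g": 36, "hu": 10,
    "p": 12, "pu": 12, "l": 5, "lu": 5,
    "b": 4, "bu": 4, "d": 5, "du": 5,
}
total = sum(token_counts.values())
token_probs = {tok: c / total for tok, c in token_counts.items()}

def best_probability(word, probs):
    """Highest probability over all segmentations of `word` (brute force)."""
    if not word:
        return 1.0
    best = 0.0
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in probs:
            best = max(best, probs[prefix] * best_probability(word[end:], probs))
    return best

def corpus_loss(word_freqs, probs):
    return sum(
        freq * -math.log(best_probability(word, probs))
        for word, freq in word_freqs.items()
    )

# Loss obtained after removing each non-elementary token in turn.
losses = {}
for tok in token_probs:
    if len(tok) == 1:
        continue  # keep the basic characters
    reduced = {t: p for t, p in token_probs.items() if t != tok}
    losses[tok] = corpus_loss(corpus, reduced)

# Tokens sorted from least to most harmful to remove.
ranking = sorted(losses, key=losses.get)
for tok in ranking:
    print(tok, round(losses[tok], 2))
```

Under these assumptions, removing 'hu' gives a loss of about 168.2 and 'bu' comes out as the cheapest token to remove.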

141
00:07:20,040 --> 00:07:22,380
We had mentioned at the
beginning of the video

142
00:07:22,380 --> 00:07:24,930
that at each iteration we could remove

143
00:07:24,930 --> 00:07:27,093
p-percent of the tokens.

144
00:07:29,356 --> 00:07:33,000
The second token that could
be removed at this iteration

145
00:07:33,000 --> 00:07:34,317
is the token 'du'.

146
00:07:36,510 --> 00:07:37,920
And that's it.

147
00:07:37,920 --> 00:07:39,720
We just have to repeat these steps

148
00:07:39,720 --> 00:07:43,203
until we get the vocabulary
of the desired size.

149
00:07:45,030 --> 00:07:46,500
One last thing.

150
00:07:46,500 --> 00:07:50,310
In practice, when we tokenize
a word with a Unigram model,

151
00:07:50,310 --> 00:07:53,130
we don't compute the
set of probabilities of

152
00:07:53,130 --> 00:07:55,500
all the possible splits of a word

153
00:07:55,500 --> 00:07:58,770
before comparing them to keep the best one

154
00:07:58,770 --> 00:08:01,440
but we use the Viterbi algorithm

155
00:08:01,440 --> 00:08:04,563
which is a much more efficient way to do it.
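A minimal sketch of a Viterbi-style tokenization, assuming the token probabilities estimated from the toy corpus: the function name is our own, and this is only an illustration of the dynamic-programming idea, not the code of a particular library.

```python
import math

# Assumed counts of each strict substring in the toy corpus.
token_counts = {
    "h": 10, "u": 36, "g": 36, "hu": 10, "ug": 36,
    "p": 12, "pu": 12, "l": 5, "lu": 5,
    "b": 4, "bu": 4, "d": 5, "du": 5,
}
total = sum(token_counts.values())
token_probs = {tok: c / total for tok, c in token_counts.items()}

def viterbi_tokenize(word, probs):
    """Return (tokens, probability) of the most likely split of `word`."""
    n = len(word)
    # best[i] = (best log-probability of word[:i], start of its last token)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None, 0.0  # the word cannot be tokenized with this vocabulary
    # Walk back through the stored start positions to recover the tokens.
    tokens, end = [], n
    while end > 0:
        start = best[end][1]
        tokens.append(word[start:end])
        end = start
    return tokens[::-1], math.exp(best[n][0])

tokens, prob = viterbi_tokenize("hug", token_probs)
print(tokens, prob)
```

Each prefix of the word is scored only once, so the work grows with the square of the word length instead of with the number of possible splits.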

156
00:08:06,540 --> 00:08:07,680
And that's it!

157
00:08:07,680 --> 00:08:09,270
I hope that this example

158
00:08:09,270 --> 00:08:10,987
has allowed you to better understand

159
00:08:10,987 --> 00:08:12,933
the Unigram tokenization algorithm.

160
00:08:14,355 --> 00:08:17,022
(air whooshing)