In this video we will see together what the purpose of training a tokenizer is, what the key steps to follow are, and what the easiest way to do it is.

You will ask yourself the question "Should I train a new tokenizer?" when you plan to train a new model from scratch.

A trained tokenizer will not be suitable for your corpus if your corpus is in a different language, uses new characters such as accented or uppercase letters, has a domain-specific vocabulary (medical or legal, for example), or uses a different style, for example the language of another century.

If I take the tokenizer trained for the bert-base-uncased model and ignore its normalization step, we can see that tokenizing the English sentence "Here is a sentence adapted to our tokenizer" produces a rather satisfactory list of tokens, in the sense that this sentence of eight words is tokenized into nine tokens.

On the other hand, if I use the same tokenizer on a sentence in Bengali, we see that either a word is divided into many subtokens, or the tokenizer does not know one of the Unicode characters and returns only an unknown token.

The fact that a common word is split into many tokens can be problematic, because language models can only handle a sequence of tokens of limited length. A tokenizer that excessively splits your initial text may even hurt the performance of your model. Unknown tokens are also problematic, because the model will not be able to extract any information from the unknown part of the text.

In this other example, we can see that the tokenizer replaces words containing accented characters and capital letters with unknown tokens.

Finally, if we use this tokenizer again on medical vocabulary, we see once more that a single word is divided into many subtokens: four for "paracetamol" and four for "pharyngitis".
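To check this kind of behavior yourself, here is a minimal sketch using the transformers library; the exact subtoken splits mentioned in the comments are illustrative and may differ slightly depending on the tokenizer version.

```python
from transformers import AutoTokenizer

# Load the tokenizer trained for the bert-base-uncased checkpoint
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# An English sentence close to the tokenizer's training data:
# eight words become a short list of mostly whole-word tokens
print(tokenizer.tokenize("Here is a sentence adapted to our tokenizer"))

# An out-of-domain medical term gets split into several subtokens,
# e.g. something like ['para', '##ce', '##tam', '##ol']
print(tokenizer.tokenize("paracetamol"))
```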
Most of the tokenizers used by current state-of-the-art language models need to be trained on a corpus that is similar to the one used to pretrain the language model. This training consists of learning rules to divide the text into tokens, and the way these rules are learned and applied depends on the chosen tokenizer model.

Thus, to train a new tokenizer it is first necessary to build a training corpus composed of raw texts. Then you have to choose an architecture for your tokenizer. Here there are two options: the simplest is to reuse the same architecture as a tokenizer used by another, already trained model; otherwise it is also possible to design your tokenizer entirely yourself, but that requires more experience and attention. Once the architecture is chosen, you can train this tokenizer on the corpus you have built. Finally, the last thing you need to do is save the learned rules to be able to use this tokenizer.

Let's take an example: say you want to train a GPT-2 model on Python code. Even if Python code is usually written in English, this type of text is very specific and deserves a tokenizer trained on it. To convince you of this, we will see at the end the difference this makes on an example.

For that we are going to use the train_new_from_iterator method that all the fast tokenizers of the library have, and thus in particular GPT2TokenizerFast. This is the simplest method in our case to get a tokenizer adapted to Python code.

Remember, the first thing is to gather a training corpus. We will use a subpart of the CodeSearchNet dataset containing only Python functions from open source libraries on GitHub. Conveniently, this dataset is known to the datasets library and we can load it in two lines of code.

Then, as the train_new_from_iterator method expects an iterator of lists of texts, we create the get_training_corpus function, which will return such an iterator, as shown in the sketch below.
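Here is a minimal sketch of these two steps; the dataset identifier code_search_net and the column name whole_func_string are assumptions about how CodeSearchNet is distributed on the Hub, so adapt them to the corpus you actually use.

```python
from datasets import load_dataset

# Load the Python portion of CodeSearchNet in two lines of code
# (dataset name and configuration are assumptions)
raw_datasets = load_dataset("code_search_net", "python")

def get_training_corpus():
    # Yield the raw function texts in batches of 1,000 so the whole
    # corpus never needs to be held in memory at once
    dataset = raw_datasets["train"]
    for start in range(0, len(dataset), 1000):
        yield dataset[start : start + 1000]["whole_func_string"]
```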
Now that we have our iterator over our corpus of Python functions, we can load the GPT-2 tokenizer architecture. Here old_tokenizer is not adapted to our corpus, but we only need one more line to train it on our new corpus. An argument that is common to most of the tokenization algorithms used at the moment is the size of the vocabulary; here we choose the value 52,000.

Finally, once the training is finished, we just have to save our new tokenizer locally, or push it to the Hub to be able to reuse it very easily afterwards.

Now let's see on an example whether it was useful to retrain a tokenizer similar to the GPT-2 one. With the original GPT-2 tokenizer we see that all spaces are isolated, and the method name randn, relatively common in Python code, is split in two. With our new tokenizer, single and double indentations have been learned, and the method name randn is tokenized into a single token.

And with that, you now know how to train your very own tokenizers.
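For reference, here is a minimal sketch that puts these last steps together; the repository name code-search-net-tokenizer and the example snippet are placeholders, and get_training_corpus is the function defined in the earlier sketch.

```python
from transformers import AutoTokenizer

# Load the GPT-2 tokenizer architecture (not yet adapted to our corpus)
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Train a tokenizer of the same type on our Python corpus,
# with a vocabulary size of 52,000
tokenizer = old_tokenizer.train_new_from_iterator(get_training_corpus(), 52000)

# Save it locally, or push it to the Hub to reuse it later
# (the repository name is a placeholder)
tokenizer.save_pretrained("code-search-net-tokenizer")
# tokenizer.push_to_hub("code-search-net-tokenizer")

# Compare the two tokenizers on a small Python example
example = "def add_numbers(a, b):\n    return a + b"
print(old_tokenizer.tokenize(example))
print(tokenizer.tokenize(example))
```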