subtitles/en/54_building-a-new-tokenizer.srt
1
00:00:00,188 --> 00:00:02,855
(air whooshing)
2
00:00:05,400 --> 00:00:07,500
In this video, we will see how
3
00:00:07,500 --> 00:00:11,310
you can create your own
tokenizer from scratch.
4
00:00:11,310 --> 00:00:15,000
To create your own tokenizer,
you will have to think about
5
00:00:15,000 --> 00:00:18,180
each of the operations
involved in tokenization.
6
00:00:18,180 --> 00:00:22,440
Namely, the normalization,
the pre-tokenization,
7
00:00:22,440 --> 00:00:25,233
the model, the post-processing,
and the decoding.
8
00:00:26,100 --> 00:00:28,350
If you don't know what normalization,
9
00:00:28,350 --> 00:00:30,900
pre-tokenization, and the model are,
10
00:00:30,900 --> 00:00:34,531
I advise you to go and see
the videos linked below.
11
00:00:34,531 --> 00:00:37,110
The post-processing gathers
all the modifications
12
00:00:37,110 --> 00:00:40,860
that we will carry out
on the tokenized text.
13
00:00:40,860 --> 00:00:43,890
It can include the
addition of special tokens,
14
00:00:43,890 --> 00:00:46,290
the creation of an attention mask,
15
00:00:46,290 --> 00:00:48,903
but also the generation
of a list of token type IDs.
16
00:00:50,220 --> 00:00:53,487
The decoding operation
occurs at the very end,
17
00:00:53,487 --> 00:00:54,660
and will allow passing
18
00:00:54,660 --> 00:00:57,753
from the sequence of IDs to a sentence.
19
00:00:58,890 --> 00:01:01,800
For example, you can see that the ## prefixes
20
00:01:01,800 --> 00:01:04,260
have been removed, and the tokens
21
00:01:04,260 --> 00:01:07,323
composing the word today
have been grouped together.
22
00:01:10,860 --> 00:01:13,440
In a fast tokenizer, all these components
23
00:01:13,440 --> 00:01:16,413
are gathered in the
backend_tokenizer attribute.
24
00:01:17,370 --> 00:01:20,070
As you can see with
this small code snippet,
25
00:01:20,070 --> 00:01:22,020
it is an instance of a tokenizer
26
00:01:22,020 --> 00:01:23,763
from the tokenizers library.
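The on-screen snippet looks roughly like this; the bert-base-uncased checkpoint is only an example, any fast tokenizer would do.
```python
from transformers import AutoTokenizer

# Load any fast tokenizer (the checkpoint here is just an example).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Its backend_tokenizer attribute is an instance of tokenizers.Tokenizer.
print(type(tokenizer.backend_tokenizer))
# <class 'tokenizers.Tokenizer'>
```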
27
00:01:25,740 --> 00:01:28,263
So, to create your own tokenizer,
28
00:01:29,970 --> 00:01:31,770
you will have to follow these steps.
29
00:01:33,270 --> 00:01:35,433
First, create a training dataset.
30
00:01:36,690 --> 00:01:39,000
Second, create and train a tokenizer
31
00:01:39,000 --> 00:01:41,700
with the tokenizers library.
32
00:01:41,700 --> 00:01:46,700
And third, load this tokenizer
into a transformers tokenizer.
33
00:01:49,350 --> 00:01:50,850
To understand these steps,
34
00:01:50,850 --> 00:01:54,573
I propose that we recreate
a BERT tokenizer together.
35
00:01:56,460 --> 00:01:58,893
The first thing to do
is to create a dataset.
36
00:01:59,970 --> 00:02:02,460
With this code snippet
you can create an iterator
37
00:02:02,460 --> 00:02:05,430
on the wikitext-2-raw-v1 dataset,
38
00:02:05,430 --> 00:02:08,160
which is a rather small
dataset in English,
39
00:02:08,160 --> 00:02:09,730
perfect for the example.
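A sketch of that snippet; the batch size of 1,000 texts is an assumption.
```python
from datasets import load_dataset

# Load the small English wikitext-2-raw-v1 dataset.
dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")

def get_training_corpus():
    # Yield the raw text in batches so the trainer can iterate over it.
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]
```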
40
00:02:12,210 --> 00:02:13,920
Here we tackle the big part,
41
00:02:13,920 --> 00:02:17,373
the design of our tokenizer
with the tokenizers library.
42
00:02:18,750 --> 00:02:22,020
We start by initializing
a tokenizer instance
43
00:02:22,020 --> 00:02:26,133
with a WordPiece model because
it is the model used by BERT.
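For example, assuming the usual [UNK] unknown token of BERT:
```python
from tokenizers import Tokenizer, models

# Initialize a tokenizer around a WordPiece model, the model used by BERT.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
```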
44
00:02:29,100 --> 00:02:32,190
Then we can define our normalizer.
45
00:02:32,190 --> 00:02:35,891
We will define it as a
succession of two normalizations
46
00:02:35,891 --> 00:02:39,453
used to clean up characters
not visible in the text,
47
00:02:40,590 --> 00:02:43,440
one lowercasing normalization,
48
00:02:43,440 --> 00:02:47,253
and two last normalizations
used to remove accents.
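A sketch of that normalizer; the two cleanup Replace patterns are assumptions, while the lowercasing and accent removal follow BERT's usual recipe.
```python
from tokenizers import Regex, normalizers

tokenizer.normalizer = normalizers.Sequence(
    [
        # Two cleanup normalizations (patterns are assumptions): drop control
        # characters, then turn any whitespace character into a plain space.
        normalizers.Replace(Regex(r"[\p{Cc}\p{Cf}]"), ""),
        normalizers.Replace(Regex(r"\s"), " "),
        # One lowercasing normalization.
        normalizers.Lowercase(),
        # Two last normalizations to remove accents: decompose, then strip.
        normalizers.NFD(),
        normalizers.StripAccents(),
    ]
)
```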
49
00:02:49,500 --> 00:02:53,553
For the pre-tokenization, we
will chain two pre_tokenizers.
50
00:02:54,390 --> 00:02:58,200
The first one separating the
text at the level of spaces,
51
00:02:58,200 --> 00:03:01,533
and the second one isolating
the punctuation marks.
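For example:
```python
from tokenizers import pre_tokenizers

tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
    [
        pre_tokenizers.WhitespaceSplit(),  # split the text on spaces
        pre_tokenizers.Punctuation(),      # isolate the punctuation marks
    ]
)
```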
52
00:03:03,360 --> 00:03:06,360
Now, we can define the
trainer that will allow us
53
00:03:06,360 --> 00:03:09,753
to train the WordPiece model
chosen at the beginning.
54
00:03:11,160 --> 00:03:12,600
To carry out the training,
55
00:03:12,600 --> 00:03:14,853
we will have to choose a vocabulary size.
56
00:03:16,050 --> 00:03:17,910
Here we choose 25,000.
57
00:03:17,910 --> 00:03:21,270
And we also need to
specify the special tokens
58
00:03:21,270 --> 00:03:24,663
that we absolutely want
to add to our vocabulary.
59
00:03:29,160 --> 00:03:33,000
In one line of code, we can
train our WordPiece model
60
00:03:33,000 --> 00:03:35,553
using the iterator we defined earlier.
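That one line could look like this, reusing the iterator and trainer defined above:
```python
# Train the WordPiece model on the batches yielded by our iterator.
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)
```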
61
00:03:39,060 --> 00:03:42,570
Once the model has been
trained, we can retrieve
62
00:03:42,570 --> 00:03:46,560
the IDs of the special
class ([CLS]) and separator ([SEP]) tokens,
63
00:03:46,560 --> 00:03:49,413
because we will need them to
post-process our sequence.
64
00:03:50,820 --> 00:03:52,860
Thanks to the TemplateProcessing class,
65
00:03:52,860 --> 00:03:57,210
we can add the CLS token at
the beginning of each sequence,
66
00:03:57,210 --> 00:04:00,120
and the SEP token at
the end of the sequence,
67
00:04:00,120 --> 00:04:03,873
and between two sentences if
we tokenize a pair of texts.
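A sketch of that post-processor, using the token IDs retrieved above:
```python
from tokenizers import processors

tokenizer.post_processor = processors.TemplateProcessing(
    # [CLS] at the beginning, [SEP] at the end of a single sequence.
    single="[CLS]:0 $A:0 [SEP]:0",
    # For a pair of texts, a second [SEP] separates the two sentences,
    # and the second sentence gets token type ID 1.
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)
```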
68
00:04:07,260 --> 00:04:10,500
Finally, we just have
to define our decoder,
69
00:04:10,500 --> 00:04:12,690
which will allow us to remove the ## prefixes
70
00:04:12,690 --> 00:04:14,610
at the beginning of the tokens
71
00:04:14,610 --> 00:04:17,193
that must be reattached
to the previous token.
72
00:04:21,300 --> 00:04:22,260
And there it is.
73
00:04:22,260 --> 00:04:25,110
You have all the necessary lines of code
74
00:04:25,110 --> 00:04:29,403
to define your own tokenizer
with the tokenizers library.
75
00:04:30,960 --> 00:04:32,280
Now that we have a brand new tokenizer
76
00:04:32,280 --> 00:04:35,400
with the tokenizer library,
we just have to load it
77
00:04:35,400 --> 00:04:38,463
into a fast tokenizer from
the transformers library.
78
00:04:39,960 --> 00:04:42,630
Here again, we have several possibilities.
79
00:04:42,630 --> 00:04:44,430
We can load it in the generic class,
80
00:04:44,430 --> 00:04:48,330
PreTrainedTokenizerFast, or
in the BertTokenizerFast class
81
00:04:48,330 --> 00:04:52,353
since we have built a
BERT-like tokenizer here.
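A sketch of both options; the special tokens passed to the generic class are assumed to be the ones used during training above.
```python
from transformers import BertTokenizerFast, PreTrainedTokenizerFast

# Generic class: we have to tell it which special tokens we used.
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# Or the BERT-specific class, since we built a BERT-like tokenizer.
wrapped_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)
```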
82
00:04:57,000 --> 00:04:59,670
I really hope this video
has helped you understand
83
00:04:59,670 --> 00:05:02,133
how you can create your own tokenizer,
84
00:05:03,178 --> 00:05:06,240
and that you are ready now to navigate
85
00:05:06,240 --> 00:05:08,070
the tokenizers library documentation
86
00:05:08,070 --> 00:05:11,367
to choose the components for
your brand new tokenizer.
87
00:05:12,674 --> 00:05:15,341
(air whooshing)