subtitles/en/44_fast-tokenizer-superpowers.srt
1
00:00:05,010 --> 00:00:06,270
- The fast tokenizers
2
00:00:06,270 --> 00:00:08,580
of the Transformers library are fast,
3
00:00:08,580 --> 00:00:11,490
but they also implement features
that will be super useful
4
00:00:11,490 --> 00:00:14,536
for data pre-processing
and post-processing.
5
00:00:14,536 --> 00:00:17,239
Let's have a look at them!
6
00:00:17,239 --> 00:00:18,650
First, let's have a look
7
00:00:18,650 --> 00:00:21,690
at the usual output of a tokenizer.
8
00:00:21,690 --> 00:00:24,278
We get input IDs that correspond to tokens,
9
00:00:24,278 --> 00:00:27,960
but we lose a lot of
information in the process.
10
00:00:27,960 --> 00:00:29,010
For instance,
11
00:00:29,010 --> 00:00:31,856
here the tokenization is the
same for the two sentences
12
00:00:31,856 --> 00:00:35,373
even though one has several
more spaces than the other.
13
00:00:36,300 --> 00:00:39,150
Just having the input
IDs is thus not enough
14
00:00:39,150 --> 00:00:42,330
if we want to match some
tokens with a span of text,
15
00:00:42,330 --> 00:00:43,320
something we'll need to do
16
00:00:43,320 --> 00:00:46,111
when tackling question
answering, for instance.
17
00:00:46,111 --> 00:00:47,592
It's also difficult to know
18
00:00:47,592 --> 00:00:50,850
whether two tokens belong
to the same word or not.
19
00:00:50,850 --> 00:00:52,860
It looks easy when you
look at the output
20
00:00:52,860 --> 00:00:55,650
of a BERT tokenizer where
we just need to look
21
00:00:55,650 --> 00:00:56,779
for the ## prefix.
22
00:00:56,779 --> 00:00:59,040
But other tokenizers have different ways
23
00:00:59,040 --> 00:01:00,987
to tokenize parts of words.
24
00:01:00,987 --> 00:01:04,470
For instance, RoBERTa
adds this special Ġ symbol
25
00:01:04,470 --> 00:01:06,491
to mark the tokens at
the beginning of the word
26
00:01:06,491 --> 00:01:09,570
and T5 uses this special underscore symbol
27
00:01:09,570 --> 00:01:11,150
for the same purpose.
28
00:01:11,150 --> 00:01:14,760
Thankfully, the fast tokenizers
keep track of the word
29
00:01:14,760 --> 00:01:16,230
each token comes from,
30
00:01:16,230 --> 00:01:19,571
with the word_ids method you
can use on their outputs.
31
00:01:19,571 --> 00:01:21,870
The output is not necessarily clear,
32
00:01:21,870 --> 00:01:24,076
but assembled in a nice
table like this,
33
00:01:24,076 --> 00:01:26,853
we can look at the word
position for each token.
34
00:01:27,930 --> 00:01:30,220
Even better, the fast
tokenizers keep track
35
00:01:30,220 --> 00:01:33,198
of the span of characters
each token comes from,
36
00:01:33,198 --> 00:01:35,760
and we can get them when calling it on one
37
00:01:35,760 --> 00:01:37,221
or several texts by adding
38
00:01:37,221 --> 00:01:40,470
the return_offsets_mapping=True argument.
39
00:01:40,470 --> 00:01:42,312
In this instance, we can
see how the offsets jump
40
00:01:42,312 --> 00:01:45,650
between the ## token
and the super token,
41
00:01:45,650 --> 00:01:49,992
because of the multiple spaces
in the initial sentence.
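Here is a minimal sketch of that call (the checkpoint and the sentence are again assumptions for illustration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # assumed checkpoint

sentence = "Let's talk about tokenizers    superpowers."  # note the extra spaces
encoding = tokenizer(sentence, return_offsets_mapping=True)

# Each token is paired with the (start, end) character span it was taken from,
# so the offsets jump over the extra spaces in the original sentence.
for token, (start, end) in zip(encoding.tokens(), encoding["offset_mapping"]):
    print(token, (start, end), repr(sentence[start:end]))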
42
00:01:49,992 --> 00:01:52,110
To enable this, the fast tokenizers
43
00:01:52,110 --> 00:01:54,270
store additional information at each step
44
00:01:54,270 --> 00:01:55,440
of their internal pipeline.
45
00:01:55,440 --> 00:01:57,951
That internal pipeline
consists of normalization,
46
00:01:57,951 --> 00:02:00,360
where we apply some cleaning to the text,
47
00:02:00,360 --> 00:02:02,621
like lowercasing or removing accents;
48
00:02:02,621 --> 00:02:04,088
pre-tokenization,
49
00:02:04,088 --> 00:02:06,530
which is where we split
the texts into words;
50
00:02:06,530 --> 00:02:09,360
then we apply the model of the tokenizer,
51
00:02:09,360 --> 00:02:11,725
which is where the words
are split into tokens,
52
00:02:11,725 --> 00:02:13,748
before finally doing the post-processing,
53
00:02:13,748 --> 00:02:16,023
where special tokens are added.
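As a rough way to peek at the first two of those steps, a fast tokenizer exposes its backend pipeline; the uncased BERT checkpoint below is an assumption:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
backend = tokenizer.backend_tokenizer

# Normalization: cleaning such as lowercasing and accent removal.
print(backend.normalizer.normalize_str("Héllo,   hów are U?"))

# Pre-tokenization: splitting the text into words, each with its character offsets.
print(backend.pre_tokenizer.pre_tokenize_str("Hello,   how are you?"))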
54
00:02:17,100 --> 00:02:19,050
From the beginning to
the end of the pipeline,
55
00:02:19,050 --> 00:02:21,390
the tokenizer keeps track
of each span of text
56
00:02:21,390 --> 00:02:23,853
that corresponds to each
word, then each token.
57
00:02:24,990 --> 00:02:26,100
We'll see how useful it is
58
00:02:26,100 --> 00:02:27,990
when we tackle the following tasks:
59
00:02:27,990 --> 00:02:29,549
when doing masked language modeling,
60
00:02:29,549 --> 00:02:32,407
one variation that gets
state-of-the-art results
61
00:02:32,407 --> 00:02:35,040
is to mask all the tokens of a given word
62
00:02:35,040 --> 00:02:37,440
instead of randomly chosen tokens.
63
00:02:37,440 --> 00:02:40,800
This will require us to
use the word IDs we saw.
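Here is a hedged sketch of that idea, with a hypothetical helper (not a Transformers API), grouping token positions by word before masking whole words:

import random

def whole_word_mask(input_ids, word_ids, mask_token_id, mask_prob=0.15):
    # Group token positions by the word they come from (None marks special tokens).
    words = {}
    for position, word_id in enumerate(word_ids):
        if word_id is not None:
            words.setdefault(word_id, []).append(position)

    labels = [-100] * len(input_ids)  # -100 is ignored by the loss
    for positions in words.values():
        if random.random() < mask_prob:
            for position in positions:
                labels[position] = input_ids[position]  # predict the original token
                input_ids[position] = mask_token_id     # mask every token of the word
    return input_ids, labels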
64
00:02:40,800 --> 00:02:42,329
When doing token classification,
65
00:02:42,329 --> 00:02:45,090
we'll need to convert the
labels we have on words,
66
00:02:45,090 --> 00:02:47,250
to labels on each token.
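A simplified sketch of that conversion (hypothetical helper; a real version would also adjust B-/I- labels for inside tokens):

def align_labels_with_tokens(word_labels, word_ids):
    token_labels = []
    for word_id in word_ids:
        if word_id is None:
            token_labels.append(-100)                  # special tokens are ignored by the loss
        else:
            token_labels.append(word_labels[word_id])  # repeat the word's label on each of its tokens
    return token_labels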
67
00:02:47,250 --> 00:02:48,480
As for the offset mappings,
68
00:02:48,480 --> 00:02:50,610
they will be super useful
when we need to convert
69
00:02:50,610 --> 00:02:53,436
token positions in a
sentence into a span of text,
70
00:02:53,436 --> 00:02:55,800
which we'll need when
we're looking
71
00:02:55,800 --> 00:02:56,813
at question answering
72
00:02:56,813 --> 00:02:58,680
or when grouping the tokens corresponding
73
00:02:58,680 --> 00:03:01,023
to the same entity in
token classification.
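And here is a hedged sketch of mapping predicted token positions back to a span of the original text (the checkpoint, context, and predicted positions are made up):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # assumed checkpoint
context = "Fast tokenizers keep track of character offsets."
encoding = tokenizer(context, return_offsets_mapping=True)

start_token, end_token = 1, 2  # pretend these are the model's predicted positions
start_char = encoding["offset_mapping"][start_token][0]
end_char = encoding["offset_mapping"][end_token][1]
print(context[start_char:end_char])  # the answer as a span of the original text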
74
00:03:02,160 --> 00:03:03,450
To have a look at these tasks,
75
00:03:03,450 --> 00:03:04,950
check the videos linked below!