The fast tokenizers of the Transformers library are fast, but they also implement features that will be super useful for data pre-processing and post-processing. Let's have a look at them!

First, let's have a look at the usual output of a tokenizer. We get input IDs that correspond to tokens, but we lose a lot of information in the process. For instance, here the tokenization is the same for the two sentences, even if one has several more spaces than the other. Just having the input IDs is thus not enough if we want to match some tokens with a span of text, something we'll need to do when tackling question answering, for instance. It's also difficult to know when two tokens belong to the same word or not. It looks easy when you just look at the output of a BERT tokenizer, where we just need to look for the ## prefix. But other tokenizers have different ways to tokenize parts of words: for instance, RoBERTa adds the special Ġ symbol to mark the tokens at the start of a word, and T5 uses the special underscore symbol ▁ for the same purpose.

Thankfully, the fast tokenizers keep track of the word each token comes from, with a word_ids method you can use on their outputs. The output is not necessarily clear, but assembled together in a nice table like this, we can look at the word position for each token. Even better, the fast tokenizers keep track of the span of characters each token comes from, and we can get those spans when calling the tokenizer on one or several texts by adding the return_offsets_mapping=True argument. In this instance, we can see how we jump positions between the "##" token and the "super" token, because of the multiple spaces in the initial sentence.

To enable this, the fast tokenizers store additional information at each step of their internal pipeline.
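To make those two features concrete, here is a minimal sketch; the checkpoint name and the example sentence are just placeholders, and the exact tokens you get back will depend on the tokenizer you pick.

```python
from transformers import AutoTokenizer

# Any fast tokenizer works here; "bert-base-cased" is just an example checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sentence = "Let's talk about tokenizers    superpowers."
encoding = tokenizer(sentence, return_offsets_mapping=True)

# tokens() shows the tokens themselves; word_ids() maps each token back to the
# index of the word it comes from (None for special tokens like [CLS] and [SEP]).
print(encoding.tokens())
print(encoding.word_ids())

# Each offset is a (start, end) character span in the original sentence; the
# extra spaces show up as a jump between the spans of consecutive tokens.
for token, (start, end) in zip(encoding.tokens(), encoding["offset_mapping"]):
    print(token, (start, end), repr(sentence[start:end]))
```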
That internal pipeline consists of normalization, where we apply some cleaning to the text, like lowercasing or removing accents; pre-tokenization, which is where we split the texts into words; then the model of the tokenizer, which is where the words are split into tokens; before finally doing the post-processing, where special tokens are added.

From the beginning to the end of the pipeline, the tokenizer keeps track of each span of text that corresponds to each word, then to each token.

We'll see how useful this is when we tackle the following tasks. When doing masked language modeling, one variation that gets state-of-the-art results is to mask all the tokens of a given word instead of randomly chosen tokens; this will require us to use the word IDs we just saw. When doing token classification, we'll need to convert the labels we have on words to labels on each token. As for the offset mappings, they will be super useful when we need to convert token positions in a sentence into a span of text, which we'll need when looking at question answering or when grouping the tokens corresponding to the same entity in token classification.

To have a look at these tasks, check the videos linked below!
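As a rough sketch of the token-classification case mentioned above, word_ids lets us spread each word-level label onto all of the tokens of that word; the words, labels, and checkpoint below are made up for illustration.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Hypothetical pre-split words with one label per word (e.g. for NER).
words = ["Sylvain", "works", "at", "Hugging", "Face"]
word_labels = [1, 0, 0, 2, 2]

encoding = tokenizer(words, is_split_into_words=True)

# Special tokens get -100 (a label commonly ignored by the loss); every other
# token simply inherits the label of the word it comes from.
token_labels = [
    -100 if word_id is None else word_labels[word_id]
    for word_id in encoding.word_ids()
]
print(encoding.tokens())
print(token_labels)
```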
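And for the offset mappings, here is a small sketch of turning a pair of token positions (as a question-answering model would predict them) back into a span of the original text; the context and the token indices are invented for the example.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

context = "Fast tokenizers keep track of the characters each token comes from."
encoding = tokenizer(context, return_offsets_mapping=True)

# Pretend a QA model predicted the answer starts at token 1 and ends at token 3.
start_token, end_token = 1, 3
start_char, _ = encoding["offset_mapping"][start_token]
_, end_char = encoding["offset_mapping"][end_token]
print(context[start_char:end_char])
```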