- The tokenizer pipeline. In this video, we'll look at how a tokenizer converts raw text to numbers that a Transformer model can make sense of, like when we execute this code.

Here is a quick overview of what happens inside the tokenizer object: first, the text is split into tokens, which are words, parts of words, or punctuation symbols. Then the tokenizer adds potential special tokens and converts each token to its unique ID as defined by the tokenizer's vocabulary. As we'll see, it doesn't quite happen in this order, but presenting it like this is better for understanding.

The first step is to split our input text into tokens; we use the tokenize method for this. To do that, the tokenizer may first perform some operations, like lowercasing all words, then follow a set of rules to split the result into small chunks of text.

Most Transformer models use a subword tokenization algorithm, which means that a given word can be split into several tokens, like "tokenize" here. Look at the "Tokenization algorithms" video linked below for more information. The ## prefix we see in front of "ize" is a convention used by BERT to indicate that this token is not the beginning of a word. Other tokenizers may use different conventions, however: for instance, ALBERT tokenizers add a long underscore in front of all the tokens that had a space before them, a convention shared by all SentencePiece tokenizers.

The second step of the tokenization pipeline is to map those tokens to their respective IDs, as defined by the vocabulary of the tokenizer. This is why we need to download a vocabulary file when we instantiate a tokenizer with the from_pretrained method: we have to make sure we use the same mapping as when the model was pretrained. To do this, we use the convert_tokens_to_ids method.
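As a rough sketch of those first two steps (the bert-base-uncased checkpoint and the example sentence here are only illustrative assumptions, not taken from the video), the code could look like this:

    from transformers import AutoTokenizer

    # Checkpoint and sentence are illustrative assumptions.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    # Step 1: split the raw text into tokens.
    tokens = tokenizer.tokenize("Let's try to tokenize!")
    print(tokens)
    # something like: ['let', "'", 's', 'try', 'to', 'token', '##ize', '!']

    # Step 2: map each token to its ID in the tokenizer's vocabulary.
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    print(input_ids)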
You may have noticed that we don't have the exact same result as in our first slide (or maybe not, as this looks like a list of random numbers anyway), in which case, allow me to refresh your memory: we had a number at the beginning and a number at the end that are missing here; those are the special tokens. The special tokens are added by the prepare_for_model method, which knows the indices of those tokens in the vocabulary and just adds the proper numbers in the input IDs list.

You can look at the special tokens, and more generally at how the tokenizer has changed your text, by using the decode method on the outputs of the tokenizer object.

As for the prefix marking the beginning of words / parts of words, the special tokens vary depending on which tokenizer you're using: the BERT tokenizer uses [CLS] and [SEP], but the RoBERTa tokenizer uses HTML-like anchors <s> and </s>.

Now that you know how the tokenizer works, you can forget all those intermediate methods and just remember that you have to call it on your input texts. The output of a tokenizer doesn't just contain the input IDs, however. To learn what the attention mask is, check out the "Batch input together" video. To learn about token type IDs, look at the "Process pairs of sentences" video.
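Continuing the same assumed checkpoint and sentence as in the sketch above, the last step and the one-call shortcut could look like this:

    from transformers import AutoTokenizer

    # Same illustrative checkpoint and sentence as before (assumptions).
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    tokens = tokenizer.tokenize("Let's try to tokenize!")
    input_ids = tokenizer.convert_tokens_to_ids(tokens)

    # Step 3: prepare_for_model adds the special tokens
    # ([CLS] and [SEP] for BERT) around the IDs.
    final_inputs = tokenizer.prepare_for_model(input_ids)
    print(final_inputs["input_ids"])

    # decode shows how the tokenizer changed the text:
    # the decoded string has [CLS] and [SEP] around the original sentence.
    print(tokenizer.decode(final_inputs["input_ids"]))

    # In practice, just call the tokenizer on the raw text: it runs the whole
    # pipeline and also returns the attention mask (and token type IDs for BERT).
    print(tokenizer("Let's try to tokenize!"))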