1
00:00:00,286 --> 00:00:02,869
(subtle blast)
2
00:00:04,694 --> 00:00:07,380
- In this video, we will see together
3
00:00:07,380 --> 00:00:09,930
what is the normalizer component
4
00:00:09,930 --> 00:00:13,023
that we'd find at the
beginning of each tokenizer.
5
00:00:14,550 --> 00:00:16,830
The normalization operation consists
6
00:00:16,830 --> 00:00:19,890
in applying a succession
of normalization rules
7
00:00:19,890 --> 00:00:20,853
to the raw text.
8
00:00:21,870 --> 00:00:25,710
We choose normalization rules
to remove noise in the text
9
00:00:25,710 --> 00:00:27,900
which seems useless for the learning
10
00:00:27,900 --> 00:00:30,363
and use of our language model.
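
As a minimal sketch of such a succession of rules, here is how a chain of normalizers might be composed with the Hugging Face Tokenizers library; the particular rules used here (NFD, lowercasing, accent stripping) are illustrative choices, not the ones any specific model prescribes.

```python
# Sketch of a succession of normalization rules with the Tokenizers
# library; the chosen rules are illustrative, not prescribed.
from tokenizers import normalizers
from tokenizers.normalizers import NFD, Lowercase, StripAccents

# The rules run in order: decompose, lowercase, then drop combining accents.
normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
print(normalizer.normalize_str("Héllo Wórld"))  # "hello world"
```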
11
00:00:33,090 --> 00:00:37,470
Let's take a very diverse
sentence with different fonts,
12
00:00:37,470 --> 00:00:39,780
upper and lower case characters,
13
00:00:39,780 --> 00:00:43,083
accents, punctuation and multiple spaces,
14
00:00:43,920 --> 00:00:46,683
to see how several tokenizers normalize it.
15
00:00:48,488 --> 00:00:50,730
The tokenizer from the FNet model
16
00:00:50,730 --> 00:00:53,700
has transformed the letters
with font variants
17
00:00:53,700 --> 00:00:57,480
or circled characters into their basic versions
18
00:00:57,480 --> 00:00:59,733
and has removed the multiple spaces.
19
00:01:00,960 --> 00:01:03,960
And now if we look at the normalization
20
00:01:03,960 --> 00:01:05,880
with RetriBERT's tokenizer,
21
00:01:05,880 --> 00:01:08,010
we can see that it keeps characters
22
00:01:08,010 --> 00:01:12,090
with several font variants
and keeps the multiple spaces,
23
00:01:12,090 --> 00:01:14,223
but it removes all the accents.
24
00:01:16,170 --> 00:01:18,870
And if we continue to
test this normalization
25
00:01:18,870 --> 00:01:23,040
of many other tokenizers
associated with models
26
00:01:23,040 --> 00:01:25,110
that we can find on the Hub,
27
00:01:25,110 --> 00:01:28,833
we see that they also propose
other kinds of normalization.
28
00:01:33,900 --> 00:01:35,850
With the fast tokenizers,
29
00:01:35,850 --> 00:01:39,060
it's very easy to observe
the normalization chosen
30
00:01:39,060 --> 00:01:41,193
for the currently loaded tokenizer.
31
00:01:42,330 --> 00:01:46,140
Indeed, each instance of a fast tokenizer
32
00:01:46,140 --> 00:01:48,030
has an underlying tokenizer
33
00:01:48,030 --> 00:01:51,390
from the HuggingFace
Tokenizers library stored
34
00:01:51,390 --> 00:01:53,643
in the backend_tokenizer attribute.
35
00:01:54,690 --> 00:01:58,470
This object itself has
a normalizer attribute
36
00:01:58,470 --> 00:02:01,830
that we can use thanks to
the normalize_str method
37
00:02:01,830 --> 00:02:03,153
to normalize a string.
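
For example, a small sketch of these two attributes in action, assuming the real albert-base-v2 checkpoint:

```python
# Sketch: inspect the normalizer of a fast tokenizer through its
# backend_tokenizer attribute (albert-base-v2 is an example checkpoint).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
backend = tokenizer.backend_tokenizer  # a tokenizers.Tokenizer instance
print(backend.normalizer.normalize_str("Héllo hôw are ü?"))
# With this checkpoint: "hello how are u?" (lowercased, accents stripped)
```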
38
00:02:04,560 --> 00:02:08,700
It is thus very practical
that this normalization,
39
00:02:08,700 --> 00:02:11,070
which was used at the time of the training
40
00:02:11,070 --> 00:02:12,903
of the tokenizer, was saved,
41
00:02:13,857 --> 00:02:16,200
and that it applies automatically
42
00:02:16,200 --> 00:02:19,233
when you ask a trained
tokenizer to tokenize a text.
43
00:02:21,000 --> 00:02:25,500
For example, if we hadn't
included the ALBERT normalizer,
44
00:02:25,500 --> 00:02:28,770
we would have had a lot of unknown tokens
45
00:02:28,770 --> 00:02:30,930
by tokenizing this sentence
46
00:02:30,930 --> 00:02:33,213
with accents and capital letters.
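
A hedged sketch of this behavior, again assuming albert-base-v2: because the saved normalizer runs automatically inside tokenization, the accented, capitalized sentence comes out as ordinary subword tokens rather than unknown tokens.

```python
# Sketch, assuming albert-base-v2: the saved normalizer is applied
# automatically, so the accented sentence should yield no unknown tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
tokens = tokenizer.tokenize("Héllo hôw are ü?")
print(tokens)                         # subword tokens for "hello how are u?"
print(tokenizer.unk_token in tokens)  # expected False, thanks to the normalizer
```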
47
00:02:35,730 --> 00:02:38,370
This transformation can
also be undetectable
48
00:02:38,370 --> 00:02:40,050
with a simple print.
49
00:02:40,050 --> 00:02:42,810
Indeed, keep in mind that for a computer,
50
00:02:42,810 --> 00:02:45,840
text is only a succession of 0s and 1s,
51
00:02:45,840 --> 00:02:47,820
and it happens that different successions
52
00:02:47,820 --> 00:02:51,363
of 0s and 1s render the
same printed character.
53
00:02:52,380 --> 00:02:56,403
The 0s and 1s come in groups
of 8, forming bytes.
54
00:02:57,480 --> 00:03:00,690
The computer must then
decode this sequence of bytes
55
00:03:00,690 --> 00:03:02,493
into a sequence of code points.
56
00:03:04,530 --> 00:03:09,530
In our example, the two bytes
are decoded using UTF-8
57
00:03:09,900 --> 00:03:11,403
into a single code point.
58
00:03:12,450 --> 00:03:15,090
The Unicode standard then allows us
59
00:03:15,090 --> 00:03:18,191
to find the character
corresponding to this code point,
60
00:03:18,191 --> 00:03:20,283
the c cedilla.
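
This decoding step can be reproduced in plain Python: the two bytes 0xC3 0xA7 are the UTF-8 encoding of the single code point U+00E7.

```python
# The two bytes 0xC3 0xA7, decoded as UTF-8, give one code point: U+00E7.
raw = bytes([0xC3, 0xA7])
text = raw.decode("utf-8")
print(text)            # ç
print(len(text))       # 1 -- a single code point
print(hex(ord(text)))  # 0xe7 -- LATIN SMALL LETTER C WITH CEDILLA
```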
61
00:03:21,499 --> 00:03:23,790
Let's repeat the same operation
62
00:03:23,790 --> 00:03:26,577
with this new sequence
composed of 3 bytes.
63
00:03:27,420 --> 00:03:30,543
This time it is transformed
into two code points,
64
00:03:31,410 --> 00:03:35,280
which also correspond to
the c cedilla character.
65
00:03:35,280 --> 00:03:36,780
It is in fact the composition
66
00:03:36,780 --> 00:03:39,810
of the Unicode Latin Small Letter C
67
00:03:39,810 --> 00:03:42,240
and the combining cedilla.
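
The same check on the three-byte sequence shows two code points that print identically but compare as different strings.

```python
# The three bytes 0x63 0xCC 0xA7 decode to two code points: 'c'
# followed by U+0327 COMBINING CEDILLA.
import unicodedata

raw = bytes([0x63, 0xCC, 0xA7])
text = raw.decode("utf-8")
print(text)                                 # ç -- looks the same
print(len(text))                            # 2 -- two code points
print([unicodedata.name(c) for c in text])  # ['LATIN SMALL LETTER C',
                                            #  'COMBINING CEDILLA']
print(text == "\u00e7")                     # False -- different for the computer
```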
68
00:03:42,240 --> 00:03:45,000
But it's annoying because
what appears to us
69
00:03:45,000 --> 00:03:46,680
to be a single character
70
00:03:46,680 --> 00:03:49,653
is not at all the same
thing for the computer.
71
00:03:52,470 --> 00:03:57,240
Fortunately, there are Unicode
normalization standards
72
00:03:57,240 --> 00:04:02,130
known as NFC, NFD, NFKC or NFKD
73
00:04:02,130 --> 00:04:04,893
that allow erasing some
of these differences.
74
00:04:05,730 --> 00:04:08,223
These standards are
often used by tokenizers.
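
In Python, these standards are exposed through the standard library's unicodedata module; a quick sketch:

```python
# Unicode normalization erases the difference between the two encodings:
# NFC composes 'c' + combining cedilla back into the single code point.
import unicodedata

composed = "\u00e7"      # ç as one code point
decomposed = "c\u0327"   # ç as two code points
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```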
75
00:04:09,900 --> 00:04:12,090
In all these previous examples,
76
00:04:12,090 --> 00:04:15,510
even if the normalizations
changed the look of the text,
77
00:04:15,510 --> 00:04:17,970
they did not change the content;
78
00:04:17,970 --> 00:04:19,177
you could still read,
79
00:04:19,177 --> 00:04:21,987
"Hello world, let's
normalize this sentence."
80
00:04:22,980 --> 00:04:25,980
However, you must be aware
that some normalizations
81
00:04:25,980 --> 00:04:30,363
can be very harmful if they are
not adapted to their corpus.
82
00:04:31,620 --> 00:04:34,387
For example, if you take
the French sentence,
83
00:04:34,387 --> 00:04:38,790
"Un pere indigne," which
means "An indignant father,"
84
00:04:38,790 --> 00:04:42,510
and normalize it with the
bert-base-uncased tokenizer
85
00:04:42,510 --> 00:04:44,313
which removes the accents,
86
00:04:45,150 --> 00:04:48,000
then the sentence
becomes "Un pere indigne"
87
00:04:48,000 --> 00:04:49,707
which means "An unworthy father".
88
00:04:53,460 --> 00:04:56,760
If you watched this video
to build your own tokenizer,
89
00:04:56,760 --> 00:04:59,610
there are no absolute rules
for choosing whether or not to apply
90
00:04:59,610 --> 00:05:02,970
a normalization to a new tokenizer,
91
00:05:02,970 --> 00:05:06,210
but I advise you to take
the time to select them
92
00:05:06,210 --> 00:05:10,743
so that they do not make you
lose important information.
93
00:05:12,296 --> 00:05:14,879
(subtle blast)