- In this video, we will look together at the normalizer component that we find at the beginning of each tokenizer.

The normalization operation consists of applying a succession of normalization rules to the raw text. These rules are chosen to remove noise from the text that seems useless for the training and use of our language model.

Let's take a very diverse sentence, with different fonts, upper- and lower-case characters, accents, punctuation, and multiple spaces, to see how several tokenizers normalize it.

The tokenizer from the FNet model has transformed the letters with font variants or circled characters into their basic versions and has removed the multiple spaces. If we now look at the normalization performed by RetriBERT's tokenizer, we can see that it keeps the characters with several font variants and keeps the multiple spaces, but removes all the accents. And if we continue testing the normalization of many other tokenizers, associated with models that we can find on the Hub, we see that they offer still other kinds of normalization.

With fast tokenizers, it is very easy to observe the normalization chosen for the currently loaded tokenizer. Indeed, each instance of a fast tokenizer has an underlying tokenizer from the Hugging Face Tokenizers library, stored in its backend_tokenizer attribute. This object itself has a normalizer attribute, whose normalize_str method we can use to normalize a string.
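As a minimal sketch of what this looks like in code (assuming the transformers library is installed; albert-base-v2 is just an illustrative checkpoint, and the sample sentence is ours):

```python
from transformers import AutoTokenizer

# Load a fast tokenizer; albert-base-v2 is just an illustrative checkpoint.
tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")

# backend_tokenizer is the underlying tokenizer from the Tokenizers library,
# and its normalizer carries the normalization saved with the model.
normalizer = tokenizer.backend_tokenizer.normalizer
print(normalizer.normalize_str("Héllò hôw are ü?"))
# ALBERT's normalizer strips accents and lowercases, so this prints
# something like: hello how are u?
```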
It is thus very practical that the normalization used when the tokenizer was trained is saved along with it, and that it is applied automatically when you ask a trained tokenizer to tokenize a text. For example, if we hadn't included the ALBERT normalizer, we would have gotten a lot of unknown tokens when tokenizing this sentence with accents and capital letters.

This transformation can also be undetectable with a simple print. Keep in mind that, for a computer, text is only a succession of 0s and 1s, and different successions of 0s and 1s can render the same printed character. The 0s and 1s are grouped by eight to form bytes, and the computer must then decode this sequence of bytes into a sequence of code points. In our example, the two bytes 0xC3 0xA7 are decoded using UTF-8 into a single code point, and the Unicode standard then allows us to find the character corresponding to this code point: the c cedilla, ç.

Let's repeat the same operation with a new sequence composed of three bytes, 0x63 0xCC 0xA7. This time it is transformed into two code points, which also correspond to the c cedilla character: it is in fact the composition of the Unicode Latin Small Letter C and the Combining Cedilla. This is annoying, because what appears to us to be a single character is not at all the same thing for the computer.

Fortunately, there are Unicode normalization standards, known as NFC, NFD, NFKC, and NFKD, that erase some of these differences. These standards are often used by tokenizers.

In all the previous examples, even if the normalizations changed the look of the text, they did not change its content: you could still read "Hello world, let's normalize this sentence." However, you must be aware that some normalizations can be very harmful if they are not suited to their corpus. For example, if you take the French sentence "Un père indigné," which means "An indignant father," and normalize it with the bert-base-uncased tokenizer, which lowercases the text and removes the accents, the sentence becomes "un pere indigne," which means "An unworthy father."

If you watched this video to build your own tokenizer, there are no absolute rules for choosing or rejecting a normalization for a new tokenizer, but I advise you to take the time to select your normalization rules so that they do not make you lose important information.
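To make the byte-level discussion above concrete, here is a small sketch using only Python's standard library; it reproduces the c cedilla example and shows how NFC and NFD erase the difference between the two encodings:

```python
import unicodedata

# Two different byte sequences that print as the same character.
single = b"\xc3\xa7".decode("utf-8")     # one code point: U+00E7, "ç"
combined = b"c\xcc\xa7".decode("utf-8")  # two code points: U+0063 + U+0327

print(single, combined)            # both render as "ç"
print(single == combined)          # False: not the same thing for the computer
print(len(single), len(combined))  # 1 vs 2 code points

# NFC composes the pair into a single code point; NFD decomposes it back.
print(unicodedata.normalize("NFC", combined) == single)  # True
print(unicodedata.normalize("NFD", single) == combined)  # True
```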
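Finally, the harmful-normalization example can be reproduced the same way as before (again assuming the transformers library; bert-base-uncased is the tokenizer named above):

```python
from transformers import AutoTokenizer

# bert-base-uncased lowercases and strips accents, which is harmful here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.backend_tokenizer.normalizer.normalize_str("Un père indigné"))
# Prints "un pere indigne": the accents that distinguish "indigné"
# (indignant) from "indigne" (unworthy) are lost.
```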