1
00:00:00,286 --> 00:00:02,869
(微妙的爆炸)
(subtle blast)
2
00:00:04,694 --> 00:00:07,380
- 在这个视频中,我们将一起看到
- In this video, we will see together
3
00:00:07,380 --> 00:00:09,930
什么是规范化组件
what is the normalizer component
4
00:00:09,930 --> 00:00:13,023
我们会在每个 tokenizer 的开头找到它。
that we'd find at the beginning of each tokenizer.
5
00:00:14,550 --> 00:00:16,830
规范化操作就是
The normalization operation consists
6
00:00:16,830 --> 00:00:19,890
将一系列规范化规则
of applying a succession of normalization rules
7
00:00:19,890 --> 00:00:20,853
应用到原始文本上。
to the raw text.
8
00:00:21,870 --> 00:00:25,710
我们选择规范化规则来去除文本中的噪音,
We choose normalization rules to remove noise in the text
9
00:00:25,710 --> 00:00:27,900
这些噪音似乎对语言模型的学习
that seems useless for the learning
10
00:00:27,900 --> 00:00:30,363
和使用没有帮助。
and use of our language model.
11
00:00:33,090 --> 00:00:37,470
让我们来看一个非常多样化的句子,用不同的字体
Let's take a very diverse sentence with different fonts,
12
00:00:37,470 --> 00:00:39,780
大写和小写的字符,
upper and lower case characters,
13
00:00:39,780 --> 00:00:43,083
重音符号、标点符号和多个空格,
accents, punctuation and multiple spaces,
14
00:00:43,920 --> 00:00:46,683
看看几个 tokenizer 是如何规范化它的。
to see how several tokenizers normalize it.
15
00:00:48,488 --> 00:00:50,730
来自 FNet 模型的 tokenizer
The tokenizer from the FNet model
16
00:00:50,730 --> 00:00:53,700
将带有字体变体
has transformed the letters with font variants
17
00:00:53,700 --> 00:00:57,480
或带圈的字母转换成了它们的基本形式
or circles into their basic versions
18
00:00:57,480 --> 00:00:59,733
并删除了多个空格。
and has removed the multiple spaces.
19
00:01:00,960 --> 00:01:03,960
现在,如果我们看一下规范化
And now if we look at the normalization
20
00:01:03,960 --> 00:01:05,880
使用 Retribert 的分词器,
with Retribert's tokenizer,
21
00:01:05,880 --> 00:01:08,010
我们可以看到它保留了字符
we can see that it keeps characters
22
00:01:08,010 --> 00:01:12,090
具有多种字体变体并保留多个空格,
with several font variants and keeps the multiple spaces,
23
00:01:12,090 --> 00:01:14,223
但它删除了所有的重音。
but it removes all the accents.
24
00:01:16,170 --> 00:01:18,870
如果我们继续测试
And if we continue to test the normalization
25
00:01:18,870 --> 00:01:23,040
与各模型关联的许多其他 tokenizer 的规范化
of many other tokenizers associated with models
26
00:01:23,040 --> 00:01:25,110
(这些都可以在 Hub 上找到),
that we can find on the Hub,
27
00:01:25,110 --> 00:01:28,833
我们会看到它们还提供了其他类型的规范化。
we see that they also offer other kinds of normalization.
28
00:01:33,900 --> 00:01:35,850
使用快速 tokenizer ,
With the fast tokenizers,
29
00:01:35,850 --> 00:01:39,060
很容易观察到选择的规范化
it's very easy to observe the normalization chosen
30
00:01:39,060 --> 00:01:41,193
对于当前加载的分词器。
for the currently loaded tokenizer.
31
00:01:42,330 --> 00:01:46,140
事实上,每个快速 tokenizer 的实例
Indeed, each instance of a fast tokenizer
32
00:01:46,140 --> 00:01:48,030
都有一个来自 HuggingFace Tokenizers 库的
has an underlying tokenizer
33
00:01:48,030 --> 00:01:51,390
底层 tokenizer ,
from the HuggingFace Tokenizers library stored
34
00:01:51,390 --> 00:01:53,643
存储在 backend_tokenizer 属性中。
in the backend_tokenizer attribute.
35
00:01:54,690 --> 00:01:58,470
这个对象本身有一个 normalizer 属性,
This object itself has a normalizer attribute
36
00:01:58,470 --> 00:02:01,830
我们可以通过它的 normalize_str 方法
that we can use, via its normalize_str method,
37
00:02:01,830 --> 00:02:03,153
来规范化一个字符串。
to normalize a string.
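A minimal sketch of the attribute chain just described, assuming the `transformers` library is installed and the tokenizer files can be downloaded from the Hub. The helper name `normalize_with` and the `albert-base-v2` checkpoint in the example call are illustrative assumptions, not taken from the video.

```python
def normalize_with(checkpoint: str, text: str) -> str:
    """Return `text` as normalized by the fast tokenizer of `checkpoint`."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoTokenizer

    # AutoTokenizer returns a fast tokenizer when one is available.
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # backend_tokenizer is the underlying object from the Tokenizers library.
    normalizer = tokenizer.backend_tokenizer.normalizer
    if normalizer is None:
        # Some tokenizers define no normalizer; the text is then left unchanged.
        return text
    return normalizer.normalize_str(text)


# Example call (downloads the tokenizer files on first use):
# print(normalize_with("albert-base-v2", "Héllò  World"))
```

Only fast tokenizers expose `backend_tokenizer`, so this sketch would raise an `AttributeError` for a checkpoint that only ships a slow tokenizer.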
38
00:02:04,560 --> 00:02:08,700
因此非常方便的是,
It is thus very practical that this normalization,
39
00:02:08,700 --> 00:02:11,070
训练 tokenizer 时所使用的规范化
which was used at the time of the training
40
00:02:11,070 --> 00:02:12,903
被保存了下来,
of the tokenizer, was saved,
41
00:02:13,857 --> 00:02:16,200
并且会自动应用
and that it is applied automatically
42
00:02:16,200 --> 00:02:19,233
当你要求训练过的 tokenizer 对文本进行分词时。
when you ask a trained tokenizer to tokenize a text.
43
00:02:21,000 --> 00:02:25,500
例如,如果我们没有包含 ALBERT 的规范化器,
For example, if we hadn't included the ALBERT normalizer,
44
00:02:25,500 --> 00:02:28,770
我们会得到很多未知的 token
we would have had a lot of unknown tokens
45
00:02:28,770 --> 00:02:30,930
在对这个带有重音符号和大写字母的
when tokenizing this sentence
46
00:02:30,930 --> 00:02:33,213
句子进行分词时。
with accents and capital letters.
47
00:02:35,730 --> 00:02:38,370
这种转换也可能无法
This transformation can also be undetectable
48
00:02:38,370 --> 00:02:40,050
通过简单的打印察觉。
with a simple print.
49
00:02:40,050 --> 00:02:42,810
确实,请记住,对于计算机来说,
Indeed, keep in mind that for a computer,
50
00:02:42,810 --> 00:02:45,840
文本只是一连串的 0 和 1,
text is only a succession of 0s and 1s,
51
00:02:45,840 --> 00:02:47,820
而不同的 0 和 1 序列
and it happens that different successions
52
00:02:47,820 --> 00:02:51,363
碰巧可以呈现出相同的打印字符。
of 0s and 1s render the same printed character.
53
00:02:52,380 --> 00:02:56,403
0 和 1 以 8 个为一组组成一个字节。
The 0s and 1s are grouped by 8 to form bytes.
54
00:02:57,480 --> 00:03:00,690
然后计算机必须解码这个字节序列
The computer must then decode this sequence of bytes
55
00:03:00,690 --> 00:03:02,493
成一系列代码点。
into a sequence of code points.
56
00:03:04,530 --> 00:03:09,530
在我们的示例中,2 个字节通过 UTF-8 解码
In our example, the 2 bytes are decoded using UTF-8
57
00:03:09,900 --> 00:03:11,403
成一个单一的代码点。
into a single code point.
58
00:03:12,450 --> 00:03:15,090
然后 Unicode 标准允许我们
The Unicode standard then allows us
59
00:03:15,090 --> 00:03:18,191
找到与此代码点对应的字符,
to find the character corresponding to this code point,
60
00:03:18,191 --> 00:03:20,283
即带软音符的 c (ç)。
the c cedilla.
61
00:03:21,499 --> 00:03:23,790
让我们重复同样的操作
Let's repeat the same operation
62
00:03:23,790 --> 00:03:26,577
处理这个由 3 个字节组成的新序列。
with this new sequence composed of 3 bytes.
63
00:03:27,420 --> 00:03:30,543
这次它被转换为两个代码点,
This time it is transformed into two code points,
64
00:03:31,410 --> 00:03:35,280
它们同样对应 c cedilla (ç) 字符。
which also correspond to the c cedilla character.
65
00:03:35,280 --> 00:03:36,780
它实际上是
It is in fact the composition
66
00:03:36,780 --> 00:03:39,810
Unicode 拉丁文小写字母 C
of the Unicode Latin Small Letter C
67
00:03:39,810 --> 00:03:42,240
和组合用软音符的组合。
and the combining cedilla.
68
00:03:42,240 --> 00:03:45,000
但这很烦人,因为在我们看来
But it's annoying because what appears to us
69
00:03:45,000 --> 00:03:46,680
是单个字符的东西,
to be a single character
70
00:03:46,680 --> 00:03:49,653
对于计算机来说完全不是一回事。
is not at all the same thing for the computer.
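The two byte sequences can be compared with a short standard-library sketch. The byte values below are the standard UTF-8 encodings of the precomposed and decomposed c cedilla; they are supplied here for illustration and are not read from the video's slides.

```python
import unicodedata

# Precomposed form: 2 bytes that decode to a single code point.
composed = b"\xc3\xa7".decode("utf-8")
# Decomposed form: 3 bytes that decode to two code points.
decomposed = b"c\xcc\xa7".decode("utf-8")

for label, s in (("composed", composed), ("decomposed", decomposed)):
    # unicodedata.name gives the official Unicode name of each code point.
    names = [unicodedata.name(ch) for ch in s]
    print(label, len(s.encode("utf-8")), "bytes,", len(s), "code point(s):", names)

# Both strings usually render identically, yet they do not compare equal.
print(composed == decomposed)
```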
71
00:03:52,470 --> 00:03:57,240
幸好有 Unicode 规范化标准,
Fortunately, there are Unicode normalization standards
72
00:03:57,240 --> 00:04:02,130
称为 NFC、NFD、NFKC 和 NFKD,
known as NFC, NFD, NFKC and NFKD
73
00:04:02,130 --> 00:04:04,893
可以消除其中的一些差异。
that allow erasing some of these differences.
74
00:04:05,730 --> 00:04:08,223
这些标准通常由 tokenizer 使用。
These standards are often used by tokenizers.
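As a rough illustration of these normalization forms, the sketch below uses Python's standard `unicodedata` module rather than a tokenizer. The sample characters (a c cedilla in two encodings, a circled h and a double-struck H) are chosen for this example and are an assumption, not the sentence shown in the video.

```python
import unicodedata

composed = "\u00e7"      # c cedilla as a single code point
decomposed = "c\u0327"   # letter c followed by a combining cedilla

# NFC composes and NFD decomposes, so both spellings can be unified.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)

# The "K" (compatibility) forms also map font variants to basic letters.
fancy = "\u24d7\u210d"   # circled small h, double-struck capital H
nfkc = unicodedata.normalize("NFKC", fancy)
print(nfkc)
```

NFC and NFD leave the circled and double-struck letters untouched; only the compatibility forms NFKC and NFKD replace them with plain letters.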
75
00:04:09,900 --> 00:04:12,090
在前面所有这些例子中,
In all these previous examples,
76
00:04:12,090 --> 00:04:15,510
即使规范化改变了文本的外观,
even if the normalizations changed the look of the text,
77
00:04:15,510 --> 00:04:17,970
它们并没有改变内容;
they did not change the content;
78
00:04:17,970 --> 00:04:19,177
你仍然可以阅读,
you could still read,
79
00:04:19,177 --> 00:04:21,987
“Hello world,让我们规范一下这句话。”
"Hello world, let's normalize this sentence."
80
00:04:22,980 --> 00:04:25,980
但是,你必须知道一些规范化
However, you must be aware that some normalizations
81
00:04:25,980 --> 00:04:30,363
如果它们与语料库不适配,可能会非常有害。
can be very harmful if they are not adapted to their corpus.
82
00:04:31,620 --> 00:04:34,387
例如,如果你取法语句子
For example, if you take the French sentence,
83
00:04:34,387 --> 00:04:38,790
“Un père indigné”,意思是 “一位愤慨的父亲”,
"Un père indigné," which means "An indignant father,"
84
00:04:38,790 --> 00:04:42,510
并使用 bert-base-uncased tokenizer 对其进行规范化,
and normalize it with the bert-base-uncased tokenizer
85
00:04:42,510 --> 00:04:44,313
它会去除重音符号,
which removes the accents,
86
00:04:45,150 --> 00:04:48,000
那么句子就变成了 “Un pere indigne”,
then the sentence becomes "Un pere indigne"
87
00:04:48,000 --> 00:04:49,707
意思是 “一个不称职的父亲”。
which means "An unworthy father".
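The loss of meaning can be reproduced without loading a tokenizer. The sketch below is a standard-library emulation of accent stripping plus lowercasing, roughly what an uncased BERT-style normalizer does; it is not the library's own code, and the helper name `strip_accents_and_lowercase` is hypothetical.

```python
import unicodedata


def strip_accents_and_lowercase(text: str) -> str:
    """Emulate an uncased, accent-stripping normalizer with the standard library."""
    # NFD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn"), then lowercase.
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return without_marks.lower()


indignant = "Un père indigné"   # "An indignant father"
unworthy = "Un père indigne"    # "An unworthy father"

# After normalization the two sentences can no longer be told apart.
print(strip_accents_and_lowercase(indignant))
print(strip_accents_and_lowercase(indignant) == strip_accents_and_lowercase(unworthy))
```

Because the uncased normalizer also lowercases, the result starts with a lowercase "un" rather than the "Un" quoted in the subtitles.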
88
00:04:53,460 --> 00:04:56,760
如果你观看此视频是为了构建自己的 tokenizer ,
If you watched this video to build your own tokenizer,
89
00:04:56,760 --> 00:04:59,610
是否为新 tokenizer 选用某种规范化
there are no absolute rules for choosing whether or not
90
00:04:59,610 --> 00:05:02,970
并没有绝对的规则,
to apply a normalization to a new tokenizer,
91
00:05:02,970 --> 00:05:06,210
但我建议你花时间仔细选择,
but I advise you to take the time to select them
92
00:05:06,210 --> 00:05:10,743
以免丢失重要信息。
so that they do not make you lose important information.
93
00:05:12,296 --> 00:05:14,879
(微妙的爆炸)
(subtle blast)