subtitles/zh-CN/16_the-tokenization-pipeline.srt
1
00:00:00,479 --> 00:00:03,396
(物体呼啸)
(object whooshing)
2
00:00:05,610 --> 00:00:06,873
- 分词器的 pipeline。
[译者注: token, tokenization, tokenizer 等词均译成了“分词”,实则不翻译最佳]
- The tokenizer pipeline.
3
00:00:07,920 --> 00:00:10,570
在本视频中,我们将了解分词器如何转换
In this video, we'll look at how a tokenizer converts
4
00:00:11,433 --> 00:00:12,480
从原始文本到数字,
raw texts to numbers,
5
00:00:12,480 --> 00:00:14,970
这些数字是 Transformer 模型能够理解的,
that a Transformer model can make sense of,
6
00:00:14,970 --> 00:00:16,520
就像我们执行这段代码时一样。
like when we execute this code.
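[Note: a minimal sketch of the kind of code meant here, assuming the "bert-base-uncased" checkpoint used throughout the course; the exact snippet shown on screen may differ:]

    from transformers import AutoTokenizer

    # Load a pretrained tokenizer and call it directly on raw text.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    inputs = tokenizer("Let's try to tokenize!")
    print(inputs["input_ids"])
    # A list of integers the model can consume; for this checkpoint it
    # starts with 101 ([CLS]) and ends with 102 ([SEP]).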
7
00:00:17,760 --> 00:00:18,690
这是一个快速概述
Here is a quick overview
8
00:00:18,690 --> 00:00:21,630
对于 tokenizer 对象内部发生的事情:
of what happens inside the tokenizer object:
9
00:00:21,630 --> 00:00:24,360
首先,文本被分成分词,
first, the text is split into tokens,
10
00:00:24,360 --> 00:00:27,453
它们是单词、单词的一部分或标点符号。
which are words, parts of words, or punctuation symbols.
11
00:00:28,440 --> 00:00:31,500
然后分词器会添加可能需要的特殊分词
Then the tokenizer adds potential special tokens
12
00:00:31,500 --> 00:00:34,680
并将每个分词转换为各自唯一的 ID
and converts each token to its unique respective ID
13
00:00:34,680 --> 00:00:36,843
由分词器的词汇表定义。
as defined by the tokenizer's vocabulary.
14
00:00:37,710 --> 00:00:40,380
正如我们将要看到的,它并不完全是按照这个顺序发生的,
As we'll see, it doesn't quite happen in this order,
15
00:00:40,380 --> 00:00:43,233
但这样更利于理解。
but doing it like this is better for understanding.
16
00:00:44,280 --> 00:00:47,670
第一步是将我们的输入文本拆分为分词。
The first step is to split our input text into tokens.
17
00:00:47,670 --> 00:00:49,653
为此,我们使用 tokenize 方法。
We use the tokenize method for this.
18
00:00:50,550 --> 00:00:54,030
为此,分词器可能首先执行一些操作,
To do that, the tokenizer may first perform some operations,
19
00:00:54,030 --> 00:00:56,880
比如将所有单词小写,然后遵循一组规则
like lowercasing all words, then follow a set of rules
20
00:00:56,880 --> 00:00:59,283
将结果分成小块文本。
to split the result in small chunks of text.
21
00:01:00,480 --> 00:01:02,286
大多数 Transformer 模型使用
Most Transformer models use
22
00:01:02,286 --> 00:01:04,890
子词(subword)分词化算法,这意味着
a subword tokenization algorithm, which means
23
00:01:04,890 --> 00:01:06,750
一个给定的单词可以被拆分
that one given word can be split
24
00:01:06,750 --> 00:01:10,050
成几个分词,例如这里的 tokenize。
into several tokens, like tokenize here.
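[Note: a sketch of this first step, continuing with the bert-base-uncased tokenizer loaded above; the printed tokens are illustrative:]

    tokens = tokenizer.tokenize("Let's try to tokenize!")
    print(tokens)
    # ['let', "'", 's', 'try', 'to', 'token', '##ize', '!']
    # "tokenize" is split into "token" and "##ize" by the subword algorithm.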
25
00:01:10,050 --> 00:01:12,570
查看下面的 “分词化算法” 视频链接
Look at the "Tokenization algorithms" video link below
26
00:01:12,570 --> 00:01:13,743
了解更多信息。
for more information.
27
00:01:14,760 --> 00:01:17,820
我们在 ize 前面看到的 ## 前缀是
The ## prefix we see in front of ize is
28
00:01:17,820 --> 00:01:19,830
BERT 的约定, 用来表示
a convention used by BERT to indicate
29
00:01:19,830 --> 00:01:22,762
这个分词不是单词的开头。
this token is not the beginning of the word.
30
00:01:22,762 --> 00:01:26,310
然而,其他分词器可能使用不同的约定:
Other tokenizers may use different conventions however:
31
00:01:26,310 --> 00:01:29,984
例如,ALBERT 分词器会添加一个长下划线
for instance, ALBERT tokenizers will add a long underscore
32
00:01:29,984 --> 00:01:31,620
在所有分词前
in front of all the tokens
33
00:01:31,620 --> 00:01:34,920
即前面带有空格的那些分词,这是一个共同的约定
that had a space before them, which is a convention shared
34
00:01:34,920 --> 00:01:37,700
为所有 sentencepiece 分词器所共用。
by all sentencepiece tokenizers.
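[Note: a sketch of the same step with a sentencepiece-based tokenizer, assuming the "albert-base-v1" checkpoint; tokens that had a space before them carry the long-underscore prefix:]

    from transformers import AutoTokenizer

    albert_tokenizer = AutoTokenizer.from_pretrained("albert-base-v1")
    print(albert_tokenizer.tokenize("Let's try to tokenize!"))
    # Tokens such as '▁let', '▁try', '▁to' start with the '▁' prefix.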
35
00:01:38,580 --> 00:01:41,040
分词化管道的第二步是
The second step of the tokenization pipeline is
36
00:01:41,040 --> 00:01:43,470
将这些分词映射到它们各自的 ID
to map those tokens to their respective IDs
37
00:01:43,470 --> 00:01:45,770
由分词器的词汇表所定义的。
as defined by the vocabulary of the tokenizer.
38
00:01:46,770 --> 00:01:48,690
这就是我们需要下载文件的原因
This is why we need to download the file
39
00:01:48,690 --> 00:01:50,580
当我们实例化一个分词器时
when we instantiate a tokenizer
40
00:01:50,580 --> 00:01:52,400
使用 from_pretrained 方法。
with the from_pretrained method.
41
00:01:52,400 --> 00:01:54,390
我们必须确保我们使用相同的映射
We have to make sure we use the same mapping
42
00:01:54,390 --> 00:01:56,520
和模型被预训练时一致。
as when the model was pretrained.
43
00:01:56,520 --> 00:01:59,703
为此,我们使用 convert_tokens_to_ids 方法。
To do this, we use the convert_tokens_to_ids method.
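[Note: a sketch of this second step, continuing from the token list above:]

    ids = tokenizer.convert_tokens_to_ids(tokens)
    print(ids)
    # One integer per token, looked up in the vocabulary downloaded by
    # from_pretrained; no special tokens are added at this stage.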
44
00:02:01,050 --> 00:02:01,883
我们可能已经注意到
We may have noticed
45
00:02:01,883 --> 00:02:03,540
我们没有完全相同的结果
that we don't have the exact same results
46
00:02:03,540 --> 00:02:05,580
和第一张幻灯片中的一样;或者你没注意到,
as in our first slide, or not,
47
00:02:05,580 --> 00:02:07,920
毕竟这看起来就像一串随机数字,
as this looks like a list of random numbers anyway,
48
00:02:07,920 --> 00:02:10,680
如果是这样,请允许我帮你回忆一下。
in which case, allow me to refresh your memory.
49
00:02:10,680 --> 00:02:15,350
之前开头和结尾各有一个数字,现在它们不见了,
We had a number at the beginning and a number at the end that are missing,
50
00:02:15,350 --> 00:02:17,130
它们就是特殊分词。
those are the special tokens.
51
00:02:17,130 --> 00:02:20,340
特殊分词由 prepare_for_model 方法添加
The special tokens are added by the prepare_for_model method
52
00:02:20,340 --> 00:02:22,350
它知道这些分词的索引
which knows the indices of these tokens
53
00:02:22,350 --> 00:02:25,680
在词汇表中,并将适当的数字添加
in the vocabulary and just adds the proper numbers
54
00:02:25,680 --> 00:02:27,243
到输入 ID 列表中。
in the input IDs list.
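[Note: a sketch of this last step, continuing from the ids above:]

    final = tokenizer.prepare_for_model(ids)
    print(final["input_ids"])
    # The ids above with the special-token ids added: for BERT,
    # 101 ([CLS]) prepended and 102 ([SEP]) appended.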
55
00:02:28,590 --> 00:02:29,541
你可以查看特殊分词
You can look at the special tokens
56
00:02:29,541 --> 00:02:30,990
更一般地说,
and, more generally,
57
00:02:30,990 --> 00:02:33,870
关于分词器如何更改你的文本,
at how the tokenizer has changed your text,
58
00:02:33,870 --> 00:02:35,280
通过使用 decode 方法
by using the decode method
59
00:02:35,280 --> 00:02:37,503
在分词器对象的输出上。
on the outputs of the tokenizer object.
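[Note: a sketch of inspecting the result with decode, continuing from above:]

    print(tokenizer.decode(final["input_ids"]))
    # Something like: "[CLS] let's try to tokenize! [SEP]"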
60
00:02:38,490 --> 00:02:39,423
至于表示词首的前缀
As for the prefix for beginning
61
00:02:39,423 --> 00:02:44,160
或词中部分的前缀,以及特殊分词,它们都取决于
of words / parts of words, and the special tokens, they vary depending
62
00:02:44,160 --> 00:02:46,500
你正在使用哪个分词器。
on which tokenizer you're using.
63
00:02:46,500 --> 00:02:48,810
所以这个分词器使用 CLS 和 SEP,
So this tokenizer uses CLS and SEP,
64
00:02:48,810 --> 00:02:52,417
但是 RoBERTa 分词器使用类似 HTML 的锚点
but the RoBERTa tokenizer uses HTML-like anchors
65
00:02:52,417 --> 00:02:55,230
<s> 和 </s>
<s> and </s>.
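[Note: a sketch contrasting the special-token conventions, assuming the "roberta-base" checkpoint:]

    from transformers import AutoTokenizer

    roberta_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    enc = roberta_tokenizer("Let's try to tokenize!")
    print(roberta_tokenizer.decode(enc["input_ids"]))
    # Something like: "<s>Let's try to tokenize!</s>"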
66
00:02:55,230 --> 00:02:57,090
既然你知道分词器是如何工作的,
Now that you know how the tokenizer works,
67
00:02:57,090 --> 00:02:59,390
你可以忘记所有那些中间方法,
you can forget all those intermediate methods,
68
00:03:00,283 --> 00:03:01,650
只需记住直接调用它
and just remember that you only have to call it
69
00:03:01,650 --> 00:03:02,913
在你的输入文本上。
on your input texts.
70
00:03:03,870 --> 00:03:05,310
不过,分词器的输出并不
The output of a tokenizer doesn't
71
00:03:05,310 --> 00:03:07,853
只包含输入 ID。
just contain the input IDs, however.
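[Note: a sketch of the full output, continuing with the bert-base-uncased tokenizer; BERT-style tokenizers also return token_type_ids and an attention_mask:]

    print(tokenizer("Let's try to tokenize!"))
    # Something like:
    # {'input_ids': [101, ..., 102], 'token_type_ids': [0, ...], 'attention_mask': [1, ...]}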
72
00:03:07,853 --> 00:03:09,750
要了解注意力掩码是什么,
To learn what the attention mask is,
73
00:03:09,750 --> 00:03:12,360
查看 “一起批量输入” 视频。
check out the "Batch input together" video.
74
00:03:12,360 --> 00:03:14,220
要了解分词类型 ID,
To learn about token type IDs,
75
00:03:14,220 --> 00:03:16,570
观看 “处理句子对” 视频。
look at the "Process pairs of sentences" video.
76
00:03:18,003 --> 00:03:20,920
(物体呼啸)
(object whooshing)