subtitles/zh-CN/50_what-is-pre-tokenization.srt
1
00:00:05,550 --> 00:00:08,910
- 分词化管线涉及几个步骤
- The tokenization pipeline involves several steps
2
00:00:08,910 --> 00:00:11,073
将原始文本转换为数字。
that convert raw text into numbers.
3
00:00:12,180 --> 00:00:14,280
在这段视频中,我们将看到
In this video, we will see what happens
4
00:00:14,280 --> 00:00:16,293
预分词化步骤中会发生什么。
during the pre-tokenization step.
5
00:00:18,390 --> 00:00:22,110
预分词化操作是在
The pre-tokenization operation is the operation performed
6
00:00:22,110 --> 00:00:24,630
文本规范化之后、
after the normalization of the text
7
00:00:24,630 --> 00:00:27,633
应用分词化算法之前执行的操作。
and before the application of the tokenization algorithm.
8
00:00:29,112 --> 00:00:31,110
此步骤包括应用一些
This step consists in applying rules
9
00:00:31,110 --> 00:00:32,550
无需学习的规则,
that do not need to be learned
10
00:00:32,550 --> 00:00:34,563
以对文本进行初步划分。
to perform a first division of the text.
11
00:00:38,160 --> 00:00:41,310
让我们看看几个分词器是如何
Let's look at how several tokenizers
12
00:00:41,310 --> 00:00:43,143
对此示例进行预分词的。
pre-tokenize in this example.
13
00:00:46,200 --> 00:00:50,820
gpt2 的预分词化在空格
The gpt2 pre-tokenization divides the text on spaces
14
00:00:50,820 --> 00:00:55,820
和一些标点符号处划分文本,但不在撇号处。
and some punctuation, but not on the apostrophe.
15
00:00:57,750 --> 00:01:01,170
我们还注意到空格已被替换为
We also notice that spaces have been replaced
16
00:01:01,170 --> 00:01:03,813
上面有一个点的大写字母 G。
by capital G with a dot above.
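This behaviour can be checked directly in code. Below is a minimal sketch, assuming the transformers library and the gpt2 checkpoint are available: the fast tokenizer exposes its Rust backend, and the pre-tokenizer can be run on its own.

```python
from transformers import AutoTokenizer

# Load the fast (Rust-backed) GPT-2 tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Run only the pre-tokenization step on an example sentence (note the double space).
print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?"))
# [('Hello', (0, 5)), (',', (5, 6)), ('Ġhow', (6, 10)), ('Ġare', (10, 14)),
#  ('Ġ', (14, 15)), ('Ġyou', (15, 19)), ('?', (19, 20))]
# Spaces become the "Ġ" symbol and punctuation is split off, as described above.
```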
17
00:01:07,170 --> 00:01:09,540
Albert 的预分词化将文本
Albert's pre-tokenization divides the text
18
00:01:09,540 --> 00:01:11,043
在空格层面进行划分,
at the level of spaces,
19
00:01:11,970 --> 00:01:15,300
在句子的开头加一个空格,
adds a space at the beginning of the sentence,
20
00:01:15,300 --> 00:01:18,873
并用特殊下划线替换空格。
and replaces spaces with a special underscore.
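The same kind of check can be sketched for Albert, assuming the albert-base-v1 checkpoint; the exact offsets may differ, but the words should come back prefixed with the special underscore character.

```python
from transformers import AutoTokenizer

# Albert's fast tokenizer uses a SentencePiece-style (Metaspace) pre-tokenizer.
tokenizer = AutoTokenizer.from_pretrained("albert-base-v1")

print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are you?"))
# Expect something like [('▁Hello,', (0, 6)), ('▁how', (7, 10)), ...]:
# a "▁" is added at the start of the sentence, every space becomes "▁",
# and punctuation stays attached to the neighbouring word.
```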
21
00:01:20,580 --> 00:01:24,780
最后,Bert 的预分词化将文本
Finally, Bert's pre-tokenization divides the text
22
00:01:24,780 --> 00:01:28,083
在标点符号和空格层面进行划分。
at the level of punctuation and spaces.
23
00:01:28,920 --> 00:01:31,260
但与之前的分词器不同,
But unlike the previous tokenizers,
24
00:01:31,260 --> 00:01:33,780
空格既不会被转换,
spaces are not transformed
25
00:01:33,780 --> 00:01:37,293
也不会被并入此预分词器生成的 token 中。
and integrated into tokens produced with this pre-tokenizer.
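And a corresponding sketch for Bert, assuming the bert-base-uncased checkpoint:

```python
from transformers import AutoTokenizer

# Bert's fast tokenizer pre-tokenizes on both whitespace and punctuation.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?"))
# [('Hello', (0, 5)), (',', (5, 6)), ('how', (7, 10)), ('are', (11, 14)),
#  ('you', (16, 19)), ('?', (19, 20))]
# Spaces are simply dropped: they are neither transformed nor kept inside
# the produced tokens (the double space just disappears from the output).
```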
26
00:01:40,080 --> 00:01:42,120
通过这三个例子,
Through these three examples,
27
00:01:42,120 --> 00:01:45,330
我们可以观察到两种主要的操作类型,
we could observe the two main types of operations
28
00:01:45,330 --> 00:01:47,073
它们由预分词化带来;
brought by the pre-tokenization;
29
00:01:48,420 --> 00:01:49,900
对文本进行一些更改,
some changes to the text
30
00:01:50,820 --> 00:01:54,180
以及将字符串划分为 token,
and the division of the string into tokens
31
00:01:54,180 --> 00:01:56,043
这些 token 可以与单词相关联。
that can be associated to words.
32
00:01:59,430 --> 00:02:04,230
最后,快速分词器的后端分词器
Finally, the backend tokenizer of the fast tokenizers
33
00:02:04,230 --> 00:02:07,680
还让我们可以测试预分词化操作,
also allows us to test the pre-tokenization operation
34
00:02:07,680 --> 00:02:11,253
而且非常容易,这要归功于它的 pre_tokenize_str 方法。
very easily, thanks to its pre_tokenize_str method.
35
00:02:12,630 --> 00:02:14,970
我们注意到这个操作的输出
We notice that the output of this operation
36
00:02:14,970 --> 00:02:18,450
由 token 和偏移量组成,
is composed of both tokens and offsets,
37
00:02:18,450 --> 00:02:21,960
这些偏移量可以将 token 关联到它们
which allow us to link the tokens to their positions in the text
38
00:02:21,960 --> 00:02:23,943
在方法输入文本中的位置。
given as input to the method.
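To make those offsets concrete, here is a small sketch reusing the bert-base-uncased example above: each (start, end) pair indexes into the input string, so every token can be traced back to the exact span it came from.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Hello, how are  you?"

# pre_tokenize_str returns (token, (start, end)) pairs; the offsets index into `text`.
for token, (start, end) in tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text):
    print(token, "->", repr(text[start:end]))
# Each token maps back to its position in the original text,
# e.g. 'how' comes from characters 7 to 10 of the input.
```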
39
00:02:25,650 --> 00:02:28,860
此操作定义了分词化所能产生的
This operation defines the largest tokens
40
00:02:28,860 --> 00:02:31,740
最大 token,
that can be produced by the tokenization,
41
00:02:31,740 --> 00:02:36,090
或者换句话说,随后将产生的
or in other words, the boundaries of the sub-tokens
42
00:02:36,090 --> 00:02:37,653
子 token 的边界。
which will be produced then.
43
00:02:40,050 --> 00:02:41,850
以上就是关于预分词器
And that's all for the characteristics
44
00:02:41,850 --> 00:02:43,203
特性的全部内容。
of the pre-tokenizers.