1
00:00:05,550 --> 00:00:08,910
- 分词化管线涉及几个步骤,
- The tokenization pipeline involves several steps

2
00:00:08,910 --> 00:00:11,073
将原始文本转换为数字。
that convert raw text into numbers.

3
00:00:12,180 --> 00:00:14,280
在这段视频中,我们将看到
In this video, we will see what happens

4
00:00:14,280 --> 00:00:16,293
预分词化步骤中会发生什么。
during the pre-tokenization step.

5
00:00:18,390 --> 00:00:22,110
预分词化操作是在
The pre-tokenization operation is the operation performed

6
00:00:22,110 --> 00:00:24,630
文本规范化之后、
after the normalization of the text

7
00:00:24,630 --> 00:00:27,633
应用分词化算法之前执行的操作。
and before the application of the tokenization algorithm.

8
00:00:29,112 --> 00:00:31,110
此步骤包括应用一些
This step consists of applying rules

9
00:00:31,110 --> 00:00:32,550
不需要学习的规则,
that do not need to be learned

10
00:00:32,550 --> 00:00:34,563
来对文本进行初步划分。
to perform a first division of the text.

11
00:00:38,160 --> 00:00:41,310
让我们看看几个分词器是如何
Let's look at how several tokenizers

12
00:00:41,310 --> 00:00:43,143
对此示例进行预分词的。
pre-tokenize this example.

13
00:00:46,200 --> 00:00:50,820
gpt2 的预分词化在空格
The gpt2 pre-tokenization divides the text on spaces

14
00:00:50,820 --> 00:00:55,820
和一些标点符号处划分文本,但不在撇号处划分。
and some punctuation, but not on the apostrophe.

15
00:00:57,750 --> 00:01:01,170
我们还注意到空格被替换成了
We also notice that spaces have been replaced

16
00:01:01,170 --> 00:01:03,813
上方带一个点的大写字母 G。
by a capital G with a dot above.

17
00:01:07,170 --> 00:01:09,540
Albert 的预分词化将文本
Albert's pre-tokenization divides the text

18
00:01:09,540 --> 00:01:11,043
在空格层面划分,
at the level of spaces,

19
00:01:11,970 --> 00:01:15,300
在句子的开头加一个空格,
adds a space at the beginning of the sentence,

20
00:01:15,300 --> 00:01:18,873
并用特殊的下划线替换空格。
and replaces spaces with a special underscore.

21
00:01:20,580 --> 00:01:24,780
最后,Bert 的预分词化将文本
Finally, Bert's pre-tokenization divides the text

22
00:01:24,780 --> 00:01:28,083
在标点符号和空格层面划分。
at the level of punctuation and spaces.

23
00:01:28,920 --> 00:01:31,260
但与之前的分词器不同,
But unlike the previous tokenizers,

24
00:01:31,260 --> 00:01:33,780
空格不会被转换,
spaces are not transformed

25
00:01:33,780 --> 00:01:37,293
也不会被整合到此预分词器生成的 token 中。
nor integrated into the tokens produced with this pre-tokenizer.

26
00:01:40,080 --> 00:01:42,120
通过这三个例子,
Through these three examples,

27
00:01:42,120 --> 00:01:45,330
我们可以观察到预分词化带来的
we can observe the two main types of operations

28
00:01:45,330 --> 00:01:47,073
两种主要的操作类型:
brought by pre-tokenization:

29
00:01:48,420 --> 00:01:49,900
对文本进行一些更改,
some changes to the text

30
00:01:50,820 --> 00:01:54,180
以及将字符串划分为
and the division of the string into tokens

31
00:01:54,180 --> 00:01:56,043
可以与单词相关联的 token。
that can be associated with words.

32
00:01:59,430 --> 00:02:04,230
最后,快速分词器的后端分词器
Finally, the backend tokenizer of the fast tokenizers

33
00:02:04,230 --> 00:02:07,680
还可以让我们非常轻松地测试
also allows us to test the pre-tokenization operation

34
00:02:07,680 --> 00:02:11,253
预分词化操作,这要归功于它的 pre_tokenize_str 方法。
very easily, thanks to its pre_tokenize_str method.

35
00:02:12,630 --> 00:02:14,970
我们注意到这个操作的输出
We notice that the output of this operation

36
00:02:14,970 --> 00:02:18,450
由 token 和偏移量共同组成,
is composed of both tokens and offsets,

37
00:02:18,450 --> 00:02:21,960
偏移量可以将每个 token 与它在文本中的位置关联起来,
which allow us to link each token to its position in the text

38
00:02:21,960 --> 00:02:23,943
该文本是作为方法的输入给出的。
given as input to the method.
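A minimal sketch of how the behaviour described above can be checked, assuming the transformers library is installed; the gpt2, albert-base-v1 and bert-base-uncased checkpoints are used here only as examples of the three pre-tokenizers mentioned in the video:

from transformers import AutoTokenizer

text = "Let's talk about pre-tokenization!"

# Every fast tokenizer exposes its backend tokenizer, whose pre-tokenizer
# offers pre_tokenize_str: it returns (piece, (start, end)) pairs, i.e. the
# pre-tokens together with their offsets in the input text.
for checkpoint in ["gpt2", "albert-base-v1", "bert-base-uncased"]:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    pieces = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    print(checkpoint, pieces)
    # gpt2 marks spaces with "Ġ", albert with "▁", bert simply drops them.
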
39
00:02:25,650 --> 00:02:28,860
此操作定义了分词化
This operation defines the largest tokens

40
00:02:28,860 --> 00:02:31,740
所能产生的最大 token,
that can be produced by the tokenization,

41
00:02:31,740 --> 00:02:36,090
或者换句话说,定义了随后将产生的
or in other words, the boundaries of the sub-tokens

42
00:02:36,090 --> 00:02:37,653
子 token 的边界。
which will then be produced.

43
00:02:40,050 --> 00:02:41,850
以上就是关于预分词器
And that's all for the characteristics

44
00:02:41,850 --> 00:02:43,203
特性的全部内容。
of the pre-tokenizers.
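As a small follow-up to that last point, a sketch assuming the bert-base-uncased checkpoint (an arbitrary choice): the sub-tokens produced by the tokenization algorithm stay inside the boundaries set by the pre-tokenizer.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Let's test my pre-tokenizer."

# The offsets returned by pre_tokenize_str mark the largest tokens that can
# ever be produced; the WordPiece model only splits further inside each span.
for piece, (start, end) in tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text):
    print(piece, "->", tokenizer.tokenize(text[start:end]))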