subtitles/zh-CN/12_tokenizers-overview.srt

1
00:00:00,450 --> 00:00:01,509
(intro whooshing)

2
00:00:01,509 --> 00:00:02,720
(smiley snapping)

3
00:00:02,720 --> 00:00:03,930
(words whooshing)

4
00:00:03,930 --> 00:00:04,920
- In the next few videos,

5
00:00:04,920 --> 00:00:06,720
we'll take a look at the tokenizers.

6
00:00:07,860 --> 00:00:09,240
In natural language processing,

7
00:00:09,240 --> 00:00:12,930
most of the data that we handle consists of raw text.

8
00:00:12,930 --> 00:00:14,280
However, machine learning models

9
00:00:14,280 --> 00:00:17,103
cannot read or understand text in its raw form,

10
00:00:18,540 --> 00:00:20,253
they can only work with numbers.

11
00:00:21,360 --> 00:00:23,220
So the tokenizer's objective

12
00:00:23,220 --> 00:00:25,923
will be to translate the text into numbers.

13
00:00:27,600 --> 00:00:30,240
There are several possible approaches to this conversion,

14
00:00:30,240 --> 00:00:31,110
and the objective

15
00:00:31,110 --> 00:00:33,453
is to find the most meaningful representation.

16
00:00:36,240 --> 00:00:39,390
We'll take a look at three distinct tokenization algorithms.

17
00:00:39,390 --> 00:00:40,530
We compare them one to one,

18
00:00:40,530 --> 00:00:42,600
so we recommend you take a look at the videos

19
00:00:42,600 --> 00:00:44,040
in the following order.

20
00:00:44,040 --> 00:00:45,390
First, "Word-based,"

21
00:00:45,390 --> 00:00:46,800
followed by "Character-based,"

22
00:00:46,800 --> 00:00:48,877
and finally, "Subword-based."

23
00:00:48,877 --> 00:00:51,794
(outro whooshing)