subtitles/zh-CN/12_tokenizers-overview.srt
1
00:00:00,450 --> 00:00:01,509
(开场呼啸声)
(intro whooshing)
2
00:00:01,509 --> 00:00:02,720
(笑脸啪嗒声)
(smiley snapping)
3
00:00:02,720 --> 00:00:03,930
(文字呼啸声)
(words whooshing)
4
00:00:03,930 --> 00:00:04,920
- 在接下来的几个视频中,
- In the next few videos,
5
00:00:04,920 --> 00:00:06,720
我们将看一下分词器
[译者注: token、tokenization、tokenizer 等词均译作“分词”,实则不翻译最佳]
we'll take a look at the tokenizers.
6
00:00:07,860 --> 00:00:09,240
在自然语言处理中,
In natural language processing,
7
00:00:09,240 --> 00:00:12,930
我们处理的大部分数据都是原始文本。
most of the data that we handle consists of raw text.
8
00:00:12,930 --> 00:00:14,280
然而,机器学习模型
However, machine learning models
9
00:00:14,280 --> 00:00:17,103
无法阅读或理解原始形式的文本,
cannot read or understand text in its raw form,
10
00:00:18,540 --> 00:00:20,253
它们只能处理数字。
they can only work with numbers.
11
00:00:21,360 --> 00:00:23,220
所以分词器的目标
So the tokenizer's objective
12
00:00:23,220 --> 00:00:25,923
将是把文本翻译成数字。
will be to translate the text into numbers.
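[译者注: 下面是一个极简的示意代码(非课程官方代码),展示“把文本翻译成数字”这一目标;其中的玩具词表是假设的,真实分词器会从语料库中学习词表。]

A minimal sketch (not the course's actual code) of the tokenizer's objective described above: mapping raw text to numbers a model can work with. The toy vocabulary here is a made-up assumption; real tokenizers learn theirs from a corpus.

```python
# Toy illustration: a tokenizer's job is to turn text into numbers.
text = "we will look at tokenizers"

# Hypothetical vocabulary built from this one sentence; real tokenizers
# learn a much larger vocabulary from a training corpus.
vocab = {word: idx for idx, word in enumerate(sorted(set(text.split())))}

# "Tokenize": split the text and look up each token's numeric ID.
input_ids = [vocab[word] for word in text.split()]
print(input_ids)  # → [3, 4, 1, 0, 2]
```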
13
00:00:27,600 --> 00:00:30,240
这种转换有几种可能的方法,
There are several possible approaches to this conversion,
14
00:00:30,240 --> 00:00:31,110
并且目标
and the objective
15
00:00:31,110 --> 00:00:33,453
是找到最有意义的表示。
is to find the most meaningful representation.
16
00:00:36,240 --> 00:00:39,390
我们将看看三种不同的分词算法。
We'll take a look at three distinct tokenization algorithms.
17
00:00:39,390 --> 00:00:40,530
我们会对它们进行一一比较,
We compare them one to one,
18
00:00:40,530 --> 00:00:42,600
所以我们建议你按以下顺序
so we recommend you take a look at the videos
19
00:00:42,600 --> 00:00:44,040
观看这些视频。
in the following order.
20
00:00:44,040 --> 00:00:45,390
首先,“基于单词”,
First, "Word-based,"
21
00:00:45,390 --> 00:00:46,800
其次是 “基于字符”,
followed by "Character-based,"
22
00:00:46,800 --> 00:00:48,877
最后,“基于子词”。
and finally, "Subword-based."
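[译者注: 下面的示意代码(非课程官方代码)对比了上述三种分词粒度;其中的子词切分结果是手写的示例,真实的子词切分由学到的词表决定。]

A sketch (not the course's code) contrasting the three granularities just listed. The subword split is hand-written for illustration; a real subword tokenizer's splits depend on its learned vocabulary, and "##" is the continuation marker used by BERT-style WordPiece vocabularies.

```python
text = "tokenization"

# Word-based: split on whitespace, keep whole words.
word_tokens = text.split()            # ["tokenization"]

# Character-based: every character is a token.
char_tokens = list(text)              # ["t", "o", "k", ...]

# Subword-based: frequent words stay whole, rare words split into
# meaningful pieces (example split; real splits come from a learned vocab).
subword_tokens = ["token", "##ization"]
```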
23
00:00:48,877 --> 00:00:51,794
(结尾呼啸声)
(outro whooshing)