subtitles/zh-CN/13_word-based-tokenizers.srt
1
00:00:00,165 --> 00:00:01,416
(屏幕呼啸)
(screen whooshing)
2
00:00:01,416 --> 00:00:02,716
(贴纸弹出)
(sticker popping)
3
00:00:02,716 --> 00:00:03,549
(屏幕呼啸)
(screen whooshing)
4
00:00:03,549 --> 00:00:05,603
- 让我们来看看基于单词的分词。
[译者注: token, tokenization, tokenizer 等词均译成了"分词", 实则不翻译最佳]
- Let's take a look at word-based tokenization.
5
00:00:07,650 --> 00:00:09,780
基于单词的分词化的想法是
Word-based tokenization is the idea
6
00:00:09,780 --> 00:00:11,940
将原始文本拆分成单词
of splitting the raw text into words
7
00:00:11,940 --> 00:00:14,673
通过按空格或其他特定规则拆分,
by splitting on spaces or other specific rules,
8
00:00:16,020 --> 00:00:17,163
比如标点符号。
like punctuation.
9
00:00:18,900 --> 00:00:21,810
在这个算法中,每个单词都有一个特定的数字
In this algorithm, each word has a specific number
10
00:00:21,810 --> 00:00:23,463
或者说分配给它的 ID。
or ID attributed to it.
11
00:00:24,360 --> 00:00:27,270
在这里,"let's" 的 ID 是 250,
Here, let's has the ID 250,
12
00:00:27,270 --> 00:00:30,150
"do" 是 861,并且分词化
do has 861, and tokenization
13
00:00:30,150 --> 00:00:33,393
后面跟感叹号时的 ID 是 345。
followed by an exclamation mark has 345.
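
(For reference, a minimal sketch of the idea described above: splitting on whitespace and punctuation, then assigning an arbitrary ID to each distinct word. The sentence, the regular expression, and the IDs are illustrative assumptions, not the actual mapping shown in the video.)

```python
import re

# Illustrative word-based tokenizer: split on whitespace and keep
# punctuation as separate tokens (a hypothetical rule, not the video's exact one).
def word_tokenize(text):
    return re.findall(r"\w+'?\w*|[^\w\s]", text)

# Assign an arbitrary ID to each distinct word seen in the text.
text = "Let's do tokenization!"
vocab = {}
for word in word_tokenize(text):
    vocab.setdefault(word, len(vocab))

print(word_tokenize(text))  # ["Let's", 'do', 'tokenization', '!']
print(vocab)                # {"Let's": 0, 'do': 1, 'tokenization': 2, '!': 3}
```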
14
00:00:34,380 --> 00:00:36,000
这个方法很有趣
This approach is interesting
15
00:00:36,000 --> 00:00:38,100
因为模型所拥有的表示
as the model has representations
16
00:00:38,100 --> 00:00:40,233
是基于整个单词的。
that are based on entire words.
17
00:00:42,720 --> 00:00:45,960
单个数字所承载的信息量很大,
The information held in a single number is high,
18
00:00:45,960 --> 00:00:48,240
因为一个词包含很多上下文
as a word contains a lot of contextual
19
00:00:48,240 --> 00:00:49,803
和语义信息。
and semantic information.
20
00:00:53,070 --> 00:00:55,473
然而,这种方法确实有其局限性。
However, this approach does have its limits.
21
00:00:56,610 --> 00:01:00,570
比如 dog 这个词和 dogs 这个词很相似
For example, the word dog and the word dogs are very similar
22
00:01:00,570 --> 00:01:01,923
它们的意思也很接近。
and their meaning is close.
23
00:01:03,210 --> 00:01:05,550
然而,基于单词的分词方法
The word-based tokenization, however,
24
00:01:05,550 --> 00:01:08,520
会给这两个词赋予完全不同的 ID
will attribute entirely different IDs to these two words
25
00:01:08,520 --> 00:01:10,110
因此模型将为这两个词
and the model will therefore learn
26
00:01:10,110 --> 00:01:12,930
学习两个不同的嵌入。
two different embeddings for these two words.
27
00:01:12,930 --> 00:01:15,090
这很可惜,因为我们希望模型
This is unfortunate as we would like the model
28
00:01:15,090 --> 00:01:18,240
能理解这些词确实是相关的,
to understand that these words are indeed related,
29
00:01:18,240 --> 00:01:21,483
而 dogs 只是 dog 这个词的复数形式。
and that dogs is simply the plural form of the word dog.
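
(A toy illustration of the limitation just described, under made-up IDs and randomly initialized embeddings: nothing in a word-level vocabulary ties "dog" and "dogs" together.)

```python
import numpy as np

# Hypothetical word-level vocabulary; the IDs are invented for illustration.
vocab = {"dog": 5, "dogs": 872}

# At initialization the embedding table is just random rows, one per ID.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 8))  # (vocab_size, embedding_dim)

dog_vec, dogs_vec = embeddings[vocab["dog"]], embeddings[vocab["dogs"]]
# Nothing links these two rows: any similarity between "dog" and "dogs"
# has to be learned from scratch during training.
```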
30
00:01:22,980 --> 00:01:24,480
这种方法的另一个问题是,
Another issue with this approach,
31
00:01:24,480 --> 00:01:28,050
语言中存在大量不同的词。
is that there are a lot of different words in the language.
32
00:01:28,050 --> 00:01:29,490
如果我们想让我们的模型理解
If we want our model to understand
33
00:01:29,490 --> 00:01:32,160
该语言中所有可能的句子,
all possible sentences in that language,
34
00:01:32,160 --> 00:01:35,850
那么我们就需要为每个不同的词分配一个 ID。
then we will need to have an ID for each different word.
35
00:01:35,850 --> 00:01:37,380
而总词数,
And the total number of words,
36
00:01:37,380 --> 00:01:40,080
也被称为词表大小,
which is also known as the vocabulary size,
37
00:01:40,080 --> 00:01:41,913
可能很快就会变得非常大。
can quickly become very large.
38
00:01:44,400 --> 00:01:47,640
这是一个问题,因为每个 ID 都映射到一个大向量
This is an issue because each ID is mapped to a large vector
39
00:01:47,640 --> 00:01:50,190
代表这个词的意思,
that represents the word's meaning,
40
00:01:50,190 --> 00:01:52,170
而要维护这些映射
and keeping track of these mappings
41
00:01:52,170 --> 00:01:54,990
就需要数量庞大的权重
requires an enormous number of weights
42
00:01:54,990 --> 00:01:57,123
当词表非常大的时候。
when the vocabulary size is very large.
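
(A back-of-the-envelope sketch of why this matters; the vocabulary size and embedding dimension below are assumptions chosen only for illustration.)

```python
# Rough count of the weights in the embedding table alone (illustrative numbers).
vocab_size = 500_000   # e.g. one ID per distinct word form in a large corpus
embedding_dim = 768    # a common hidden size for Transformer models

embedding_weights = vocab_size * embedding_dim
print(f"{embedding_weights:,} weights in the embedding table alone")
# 384,000,000 weights in the embedding table alone
```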
43
00:01:59,160 --> 00:02:00,960
如果我们希望我们的模型保持精简,
If we want our models to stay lean,
44
00:02:00,960 --> 00:02:04,440
我们可以选择让分词器忽略某些
we can opt for our tokenizer to ignore certain words
45
00:02:04,440 --> 00:02:06,093
我们不一定需要的词。
that we don't necessarily need.
46
00:02:08,400 --> 00:02:11,970
例如,在这里,当我们在一段文本上训练分词器时,
For example, here, when training our tokenizer on a text,
47
00:02:11,970 --> 00:02:15,020
我们可能只想选取该文本中
we might want to take only the 10,000 most frequent words
48
00:02:15,020 --> 00:02:16,320
最常用的 10,000 个单词。
in that text.
49
00:02:16,320 --> 00:02:18,600
而不是从该文本中提取所有单词
Rather than taking all words from that text
50
00:02:18,600 --> 00:02:22,503
或该语言的所有单词来创建我们的基础词表。
or all of the language's words to create our basic vocabulary.
51
00:02:23,790 --> 00:02:26,520
分词器将知道如何把这 10,000 个单词
The tokenizer will know how to convert those 10,000 words
52
00:02:26,520 --> 00:02:29,370
转换成数字,但任何其他词都会被转换成
into numbers, but any other word will be converted
53
00:02:29,370 --> 00:02:31,530
词表外(out-of-vocabulary)的词,
to the out-of-vocabulary word,
54
00:02:31,530 --> 00:02:33,783
或者像这里显示的那样,即未知词。
or like shown here, the unknown word.
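
(A minimal sketch, not the course's actual code, of limiting the vocabulary to the most frequent words and mapping everything else to an unknown token; the corpus, the vocabulary size, and the "[UNK]" token name are illustrative.)

```python
from collections import Counter

# Keep only the most frequent words and map every other word to a single
# unknown token with ID 0.
def build_vocab(words, max_size, unk_token="[UNK]"):
    vocab = {unk_token: 0}
    for word, _ in Counter(words).most_common(max_size):
        vocab[word] = len(vocab)
    return vocab

def encode(words, vocab, unk_token="[UNK]"):
    return [vocab.get(word, vocab[unk_token]) for word in words]

training_words = "the dog saw the dog and the dogs".split()
vocab = build_vocab(training_words, max_size=2)     # {'[UNK]': 0, 'the': 1, 'dog': 2}
print(encode("the dogs saw a dog".split(), vocab))  # [1, 0, 0, 0, 2]
```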
55
00:02:35,280 --> 00:02:37,440
不幸的是,这是一种妥协。
Unfortunately, this is a compromise.
56
00:02:37,440 --> 00:02:39,900
模型将使用完全相同的表示
The model will have the exact same representation
57
00:02:39,900 --> 00:02:42,390
来处理所有它不认识的单词,
for all words that it doesn't know,
58
00:02:42,390 --> 00:02:45,210
这可能会导致大量信息丢失
which can result in a lot of lost information
59
00:02:45,210 --> 00:02:47,664
如果存在许多未知单词。
if many unknown words are present.
60
00:02:47,664 --> 00:02:50,581
(屏幕呼啸)
(screen whooshing)