subtitles/zh-CN/14_character-based-tokenizers.srt
1
00:00:00,234 --> 00:00:02,901
(翻页)
(page whirring)
2
00:00:04,260 --> 00:00:07,200
- 在深入研究基于字符的分词化之前,
[译者注: token, tokenization, tokenizer 等词均译成了“分词”,实则不翻译最佳]
- Before diving into character-based tokenization,
3
00:00:07,200 --> 00:00:10,350
理解为什么这种分词化很有趣
understanding why this kind of tokenization is interesting
4
00:00:10,350 --> 00:00:13,533
需要了解基于单词的分词化的缺陷。
requires understanding the flaws of word-based tokenization.
5
00:00:14,640 --> 00:00:16,320
如果你还没有看过第一个视频,
If you haven't seen the first video
6
00:00:16,320 --> 00:00:17,880
也就是关于基于词的分词的视频,
on word-based tokenization
7
00:00:17,880 --> 00:00:21,450
我们建议你在观看此视频之前看一下。
we recommend you check it out before looking at this video.
8
00:00:21,450 --> 00:00:24,250
好的,让我们看一下基于字符的分词化。
Okay, let's take a look at character-based tokenization.
9
00:00:25,650 --> 00:00:28,560
我们现在将文本拆分为单个字符,
We now split our text into individual characters,
10
00:00:28,560 --> 00:00:29,673
而不是单词。
rather than words.
11
00:00:32,850 --> 00:00:35,550
语言中通常有很多不同的词,
There are generally a lot of different words in languages,
12
00:00:35,550 --> 00:00:37,743
而字符的数量则始终很少。
while the number of characters stays low.
13
00:00:38,610 --> 00:00:41,313
首先,让我们看一下英语,
To begin let's take a look at the English language,
14
00:00:42,210 --> 00:00:45,540
它估计有 170,000 个不同的词,
it has an estimated 170,000 different words,
15
00:00:45,540 --> 00:00:47,730
所以我们需要一个非常大的词汇表
so we would need a very large vocabulary
16
00:00:47,730 --> 00:00:49,413
来包含所有单词。
to encompass all words.
17
00:00:50,280 --> 00:00:52,200
使用基于字符的词汇表,
With a character-based vocabulary,
18
00:00:52,200 --> 00:00:55,440
我们可以只用 256 个字符,
we can get by with only 256 characters,
19
00:00:55,440 --> 00:00:58,683
其中包括字母、数字和特殊字符。
which includes letters, numbers and special characters.
20
00:00:59,760 --> 00:01:02,190
即使是有大量不同字符的语言
Even languages with a lot of different characters
21
00:01:02,190 --> 00:01:04,800
比如中文,其字典中可以收录
like the Chinese languages can have dictionaries
22
00:01:04,800 --> 00:01:08,130
多达 20,000 个不同的汉字
with up to 20,000 different characters
23
00:01:08,130 --> 00:01:11,523
但却有超过 375,000 个不同的词语。
but more than 375,000 different words.
24
00:01:12,480 --> 00:01:14,310
所以基于字符的词汇表
So character-based vocabularies
25
00:01:14,310 --> 00:01:16,293
让我们使用的不同分词
let us use fewer different tokens
26
00:01:16,293 --> 00:01:19,050
比基于单词的分词词典
than the word-based tokenization dictionaries
27
00:01:19,050 --> 00:01:20,523
原本所需要的更少。
we would otherwise use.
28
00:01:23,250 --> 00:01:25,830
这些词汇表也比
These vocabularies are also more complete
29
00:01:25,830 --> 00:01:28,950
对应的基于单词的词汇表更全面。
than their word-based vocabulary counterparts.
30
00:01:28,950 --> 00:01:31,410
由于我们的词汇表包含了一种语言中
As our vocabulary contains all characters
31
00:01:31,410 --> 00:01:33,960
用到的所有字符,即使是在分词器
used in a language, even words unseen
32
00:01:33,960 --> 00:01:36,990
训练期间未见过的词仍然可以被分词,
during the tokenizer training can still be tokenized,
33
00:01:36,990 --> 00:01:39,633
因此词表外的分词会更少出现。
so out-of-vocabulary tokens will be less frequent.
34
00:01:40,680 --> 00:01:42,840
这包括能够正确地分词
This includes the ability to correctly tokenize
35
00:01:42,840 --> 00:01:45,210
拼写错误的单词,而不是立即
misspelled words, rather than discarding them
36
00:01:45,210 --> 00:01:46,623
将它们当作未知词丢弃。
as unknown straight away.
37
00:01:48,240 --> 00:01:52,380
然而,这个算法也不完美。
However, this algorithm isn't perfect either.
38
00:01:52,380 --> 00:01:54,360
直觉上,单个字符所包含的信息
Intuitively, characters do not hold
39
00:01:54,360 --> 00:01:57,990
不如一个单词所包含的那么多。
as much information individually as a word would hold.
40
00:01:57,990 --> 00:02:00,930
例如,“Let's” 一词包含的信息
For example, "Let's" holds more information
41
00:02:00,930 --> 00:02:03,570
比它的首字母 “l” 更多。
than its first letter "l".
42
00:02:03,570 --> 00:02:05,880
当然,并非所有语言都如此,
Of course, this is not true for all languages,
43
00:02:05,880 --> 00:02:08,880
因为有些语言,比如基于表意文字的语言,
as some languages like ideogram-based languages
44
00:02:08,880 --> 00:02:11,523
在单个字符中就保存着大量信息,
have a lot of information held in single characters,
45
00:02:12,750 --> 00:02:15,360
但对于其他语言,比如基于罗马字母的语言,
but for others like roman-based languages,
46
00:02:15,360 --> 00:02:17,760
该模型必须一次性理解多个分词
the model will have to make sense of multiple tokens at a time
47
00:02:17,760 --> 00:02:20,670
才能获取原本保存
to get the information otherwise held
48
00:02:20,670 --> 00:02:21,753
在单个单词中的信息。
in a single word.
49
00:02:23,760 --> 00:02:27,000
这导致了基于字符的分词器的另一个问题,
This leads to another issue with character-based tokenizers,
50
00:02:27,000 --> 00:02:29,520
它们的序列会被转换成数量非常庞大的
their sequences are translated into a very large number
51
00:02:29,520 --> 00:02:31,593
需要由模型处理的分词。
of tokens to be processed by the model.
52
00:02:33,090 --> 00:02:36,810
而这会影响模型所要携带的
And this can have an impact on the size of the context
53
00:02:36,810 --> 00:02:40,020
上下文的大小,并会减小
the model will carry around, and will reduce the size
54
00:02:40,020 --> 00:02:42,030
我们可以用作模型输入的文本的长度,
of the text we can use as input for our model,
55
00:02:42,030 --> 00:02:43,233
这通常是有限的。
which is often limited.
56
00:02:44,100 --> 00:02:46,650
这种分词化虽然存在一些问题,
This tokenization, while it has some issues,
57
00:02:46,650 --> 00:02:48,720
但在过去取得过一些非常好的结果,
has seen some very good results in the past
58
00:02:48,720 --> 00:02:50,490
所以在处理新问题时
and so it should be considered
59
00:02:50,490 --> 00:02:52,680
应该被考虑,因为它解决了
when approaching a new problem as it solves issues
60
00:02:52,680 --> 00:02:54,843
在基于词的算法中遇到的一些问题。
encountered in the word-based algorithm.
61
00:02:56,107 --> 00:02:58,774
(翻页)
(page whirring)