1
00:00:00,234 --> 00:00:02,901
(page whirring)

2
00:00:04,260 --> 00:00:07,200
- Before diving into character-based tokenization,

3
00:00:07,200 --> 00:00:10,350
understanding why this kind of tokenization is interesting

4
00:00:10,350 --> 00:00:13,533
requires understanding the flaws of word-based tokenization.

5
00:00:14,640 --> 00:00:16,320
If you haven't seen the first video

6
00:00:16,320 --> 00:00:17,880
on word-based tokenization,

7
00:00:17,880 --> 00:00:21,450
we recommend you check it out before watching this video.

8
00:00:21,450 --> 00:00:24,250
Okay, let's take a look at character-based tokenization.

9
00:00:25,650 --> 00:00:28,560
We now split our text into individual characters,

10
00:00:28,560 --> 00:00:29,673
rather than words.

11
00:00:32,850 --> 00:00:35,550
There are generally a lot of different words in languages,

12
00:00:35,550 --> 00:00:37,743
while the number of characters stays low.

13
00:00:38,610 --> 00:00:41,313
To begin, let's take a look at the English language:

14
00:00:42,210 --> 00:00:45,540
it has an estimated 170,000 different words,

15
00:00:45,540 --> 00:00:47,730
so we would need a very large vocabulary

16
00:00:47,730 --> 00:00:49,413
to encompass all words.

17
00:00:50,280 --> 00:00:52,200
With a character-based vocabulary,

18
00:00:52,200 --> 00:00:55,440
we can get by with only 256 characters,

19
00:00:55,440 --> 00:00:58,683
which includes letters, numbers and special characters.

20
00:00:59,760 --> 00:01:02,190
Even languages with a lot of different characters,

21
00:01:02,190 --> 00:01:04,800
like Chinese, can have dictionaries

22
00:01:04,800 --> 00:01:08,130
with up to 20,000 different characters

23
00:01:08,130 --> 00:01:11,523
but more than 375,000 different words.

24
00:01:12,480 --> 00:01:14,310
So character-based vocabularies

25
00:01:14,310 --> 00:01:16,293
let us use fewer different tokens

26
00:01:16,293 --> 00:01:19,050
than the word-based tokenization dictionaries

27
00:01:19,050 --> 00:01:20,523
we would otherwise use.

28
00:01:23,250 --> 00:01:25,830
These vocabularies are also more complete

29
00:01:25,830 --> 00:01:28,950
than their word-based counterparts.

30
00:01:28,950 --> 00:01:31,410
As our vocabulary contains all the characters

31
00:01:31,410 --> 00:01:33,960
used in a language, even words unseen

32
00:01:33,960 --> 00:01:36,990
during the tokenizer training can still be tokenized,

33
00:01:36,990 --> 00:01:39,633
so out-of-vocabulary tokens will be less frequent.

34
00:01:40,680 --> 00:01:42,840
This includes the ability to correctly tokenize

35
00:01:42,840 --> 00:01:45,210
misspelled words, rather than discarding them

36
00:01:45,210 --> 00:01:46,623
as unknown straight away.

37
00:01:48,240 --> 00:01:52,380
However, this algorithm isn't perfect either.

38
00:01:52,380 --> 00:01:54,360
Intuitively, characters do not hold

39
00:01:54,360 --> 00:01:57,990
as much information individually as a word would.
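To make the vocabulary-size and coverage points above concrete, here is a minimal sketch in plain Python. The training sentence, the misspelled word, and the whitespace/character splitting are made-up assumptions for this note only; they are not the course's tokenizers, and real tokenizers are considerably more sophisticated.

# Toy comparison of word-based vs character-based vocabularies
# (illustrative sketch only; the "training" text is made up).
training_text = "the quick brown fox jumps over the lazy dog"

word_vocab = set(training_text.split())   # one entry per distinct word
char_vocab = set(training_text)           # one entry per distinct character

print(len(word_vocab), "word entries vs", len(char_vocab), "character entries")

# A misspelled, unseen word is out of vocabulary for the word-based approach...
new_word = "qick"
print(new_word in word_vocab)                   # False -> would map to an unknown token

# ...but every one of its characters is already in the character vocabulary.
print(all(c in char_vocab for c in new_word))   # True -> can still be tokenized

The price, as the video explains next, is that an individual character carries much less information than a whole word.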
40
00:01:57,990 --> 00:02:00,930
For example, "Let's" holds more information

41
00:02:00,930 --> 00:02:03,570
than its first letter "l".

42
00:02:03,570 --> 00:02:05,880
Of course, this is not true for all languages,

43
00:02:05,880 --> 00:02:08,880
as some, like ideogram-based languages,

44
00:02:08,880 --> 00:02:11,523
have a lot of information held in single characters,

45
00:02:12,750 --> 00:02:15,360
but for others, like Roman-alphabet-based languages,

46
00:02:15,360 --> 00:02:17,760
the model will have to make sense of multiple tokens at a time

47
00:02:17,760 --> 00:02:20,670
to get the information otherwise held

48
00:02:20,670 --> 00:02:21,753
in a single word.

49
00:02:23,760 --> 00:02:27,000
This leads to another issue with character-based tokenizers:

50
00:02:27,000 --> 00:02:29,520
their sequences are translated into a very large number

51
00:02:29,520 --> 00:02:31,593
of tokens to be processed by the model.

52
00:02:33,090 --> 00:02:36,810
And this can have an impact on the size of the context

53
00:02:36,810 --> 00:02:40,020
the model will carry around, and will reduce the size

54
00:02:40,020 --> 00:02:42,030
of the text we can use as input for our model,

55
00:02:42,030 --> 00:02:43,233
which is often limited.

56
00:02:44,100 --> 00:02:46,650
This tokenization, while it has some issues,

57
00:02:46,650 --> 00:02:48,720
has seen some very good results in the past,

58
00:02:48,720 --> 00:02:50,490
and so it should be considered

59
00:02:50,490 --> 00:02:52,680
when approaching a new problem, as it solves issues

60
00:02:52,680 --> 00:02:54,843
encountered in the word-based algorithm.

61
00:02:56,107 --> 00:02:58,774
(page whirring)
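As a rough illustration of the sequence-length issue described above, the sketch below is a made-up example (not part of the course materials): it splits the same sentence by whitespace and by character and compares the resulting token counts.

# Toy comparison of sequence lengths (illustrative sketch only).
text = "Character-based tokenizers trade a small vocabulary for much longer sequences."

word_tokens = text.split()   # word-based splitting
char_tokens = list(text)     # character-based splitting

print(f"word-based:      {len(word_tokens)} tokens")
print(f"character-based: {len(char_tokens)} tokens")

# The character-based sequence is several times longer for the same text,
# so a model with a fixed context length can fit far less input at once.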