1
00:00:06,450 --> 00:00:09,540
Let's take a look at subword-based tokenization.

2
00:00:09,540 --> 00:00:11,610
Understanding why subword-based tokenization

3
00:00:11,610 --> 00:00:13,980
is interesting requires understanding the flaws

4
00:00:13,980 --> 00:00:17,340
of word-based and character-based tokenization.

5
00:00:17,340 --> 00:00:18,780
If you haven't seen the first videos

6
00:00:18,780 --> 00:00:22,020
on word-based and character-based tokenization,

7
00:00:22,020 --> 00:00:23,130
we recommend you check them out

8
00:00:23,130 --> 00:00:24,780
before watching this video.

9
00:00:27,840 --> 00:00:31,493
Subword-based tokenization lies in between character-based

10
00:00:31,493 --> 00:00:35,280
and word-based tokenization algorithms.

11
00:00:35,280 --> 00:00:37,410
The idea is to find a middle ground

12
00:00:37,410 --> 00:00:39,486
between the very large vocabularies,

13
00:00:39,486 --> 00:00:42,600
large quantity of out-of-vocabulary tokens,

14
00:00:42,600 --> 00:00:45,360
and loss of meaning across very similar words

15
00:00:45,360 --> 00:00:48,630
of word-based tokenizers, and the very long sequences

16
00:00:48,630 --> 00:00:51,330
and less meaningful individual tokens

17
00:00:51,330 --> 00:00:53,133
of character-based tokenizers.

18
00:00:54,840 --> 00:00:57,960
These algorithms rely on the following principle:

19
00:00:57,960 --> 00:01:00,000
frequently used words should not be split

20
00:01:00,000 --> 00:01:01,500
into smaller subwords,

21
00:01:01,500 --> 00:01:03,433
while rare words should be decomposed

22
00:01:03,433 --> 00:01:05,103
into meaningful subwords.

23
00:01:06,510 --> 00:01:08,460
An example is the word "dog".

24
00:01:08,460 --> 00:01:11,190
We would like our tokenizer to have a single ID

25
00:01:11,190 --> 00:01:12,600
for the word "dog", rather

26
00:01:12,600 --> 00:01:15,363
than splitting it into the characters "d", "o", and "g".

27
00:01:16,650 --> 00:01:19,260
However, when encountering the word "dogs",

28
00:01:19,260 --> 00:01:22,710
we would like our tokenizer to understand that at the root,

29
00:01:22,710 --> 00:01:24,120
this is still the word "dog",

30
00:01:24,120 --> 00:01:27,030
with an added "s" that slightly changes the meaning

31
00:01:27,030 --> 00:01:28,923
while keeping the original idea.

32
00:01:30,600 --> 00:01:34,080
Another example is a complex word like "tokenization",

33
00:01:34,080 --> 00:01:37,140
which can be split into meaningful subwords.

34
00:01:37,140 --> 00:01:37,973
The root of the word

35
00:01:37,973 --> 00:01:40,590
is "token", and "-ization" completes the root

36
00:01:40,590 --> 00:01:42,870
to give it a slightly different meaning.
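A minimal sketch of this frequent-versus-rare behavior, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (both illustrative choices; the exact splits depend on the vocabulary the tokenizer was trained with):

from transformers import AutoTokenizer

# Load a pretrained subword tokenizer (WordPiece, as used by BERT).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A frequent word is kept whole as a single token...
print(tokenizer.tokenize("dog"))           # e.g. ['dog']

# ...while a rarer word is decomposed into meaningful subwords.
print(tokenizer.tokenize("tokenization"))  # e.g. ['token', '##ization']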
37
00:01:42,870 --> 00:01:44,430
It makes sense to split the word

38
00:01:44,430 --> 00:01:47,640
in two: "token" as the root of the word,

39
00:01:47,640 --> 00:01:49,950
labeled as the start of the word,

40
00:01:49,950 --> 00:01:52,530
and "-ization" as additional information, labeled

41
00:01:52,530 --> 00:01:54,393
as a completion of the word.

42
00:01:55,826 --> 00:01:58,740
In turn, the model will now be able to make sense

43
00:01:58,740 --> 00:02:01,080
of "token" in different situations.

44
00:02:01,080 --> 00:02:04,602
It will understand that the words "token", "tokens", "tokenizing",

45
00:02:04,602 --> 00:02:08,760
and "tokenization" have a similar meaning and are linked.

46
00:02:08,760 --> 00:02:12,450
It will also understand that "tokenization", "modernization",

47
00:02:12,450 --> 00:02:16,200
and "immunization", which all have the same suffix,

48
00:02:16,200 --> 00:02:19,383
are probably used in the same syntactic situations.

49
00:02:20,610 --> 00:02:23,130
Subword-based tokenizers generally have a way to

50
00:02:23,130 --> 00:02:25,890
identify which tokens are starts of words

51
00:02:25,890 --> 00:02:28,443
and which tokens complete starts of words.

52
00:02:29,520 --> 00:02:31,140
So here, "token" is marked as the start of a word,

53
00:02:31,140 --> 00:02:35,100
and "##ization" as the completion of a word.

54
00:02:35,100 --> 00:02:38,103
Here, the "##" prefix indicates that "ization" is part of a word

55
00:02:38,103 --> 00:02:41,013
rather than the beginning of it.

56
00:02:41,910 --> 00:02:43,110
The "##" comes

57
00:02:43,110 --> 00:02:47,013
from the BERT tokenizer, based on the WordPiece algorithm.

58
00:02:47,850 --> 00:02:50,700
Other tokenizers use other prefixes, which can be

59
00:02:50,700 --> 00:02:52,200
placed to indicate parts of words

60
00:02:52,200 --> 00:02:55,083
like here, or starts of words instead.

61
00:02:56,250 --> 00:02:57,083
There are a lot

62
00:02:57,083 --> 00:02:58,740
of different algorithms that can be used

63
00:02:58,740 --> 00:03:00,090
for subword tokenization,

64
00:03:00,090 --> 00:03:02,670
and most models obtaining state-of-the-art results

65
00:03:02,670 --> 00:03:03,780
in English today

66
00:03:03,780 --> 00:03:06,663
use some kind of subword tokenization algorithm.

67
00:03:07,620 --> 00:03:10,953
These approaches help in reducing vocabulary sizes

68
00:03:10,953 --> 00:03:13,636
by sharing information across different words,

69
00:03:13,636 --> 00:03:15,960
having the ability to have prefixes

70
00:03:15,960 --> 00:03:18,630
and suffixes understood as such.

71
00:03:18,630 --> 00:03:20,700
They keep meaning across very similar words

72
00:03:20,700 --> 00:03:23,103
by recognizing the similar tokens making them up.
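Under the same assumptions as the sketch above (Hugging Face transformers with the bert-base-uncased WordPiece tokenizer; splits shown are illustrative), a short sketch of separating word starts from word completions via the "##" convention:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("tokenization and modernization")
# e.g. ['token', '##ization', 'and', 'modern', '##ization']

# WordPiece marks completion tokens with "##"; every other token starts a word.
for token in tokens:
    kind = "completes a word" if token.startswith("##") else "starts a word"
    print(f"{token}: {kind}")

Tokenizers built on other algorithms invert this convention, marking the start-of-word tokens with a prefix instead, so the check above is specific to WordPiece-style vocabularies.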