subtitles/zh-CN/44_fast-tokenizer-superpowers.srt
1
00:00:05,010 --> 00:00:06,270
- 快速 tokenizer
- The fast tokenizers
2
00:00:06,270 --> 00:00:08,580
在 Transformers 库中不仅速度很快,
of the Transformers library are fast,
3
00:00:08,580 --> 00:00:11,490
还实现了非常有用的功能
but they also implement features that will be super useful
4
00:00:11,490 --> 00:00:14,536
用于数据预处理和后处理。
for data pre-processing and post-processing.
5
00:00:14,536 --> 00:00:17,239
让我们来看看吧!
Let's have a look at them!
6
00:00:17,239 --> 00:00:18,650
首先,让我们来看看
First, let's have a look
7
00:00:18,650 --> 00:00:21,690
tokenizer 的常规输出。
at the usual output of a tokenizer.
8
00:00:21,690 --> 00:00:24,278
我们得到对应于 token 的输入 ID,
We get input IDs that correspond to tokens,
9
00:00:24,278 --> 00:00:27,960
但是我们在这个过程中丢失了很多信息。
but we lose a lot of information in the process.
10
00:00:27,960 --> 00:00:29,010
例如,
For instance,
11
00:00:29,010 --> 00:00:31,856
这里两个句子的分词化是相同的
here the tokenization is the same for the two sentences
12
00:00:31,856 --> 00:00:35,373
即使其中一个比另一个多了几个空格。
even if one has several more spaces than the other.
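[A minimal Python sketch of this point; the checkpoint and sentences are illustrative assumptions, not taken from the video.]
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# The second sentence has several extra spaces.
ids_a = tokenizer("Let's talk about tokenizers!").input_ids
ids_b = tokenizer("Let's talk about     tokenizers!").input_ids
print(ids_a == ids_b)  # True: the extra spaces leave no trace in the input IDs
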
13
00:00:36,300 --> 00:00:39,150
因此,仅仅拥有输入 ID 是不够的
Just having the input IDs is thus not enough
14
00:00:39,150 --> 00:00:42,330
如果我们想将一些 token 与一段文本相匹配,
if we want to match some tokens with a span of text,
15
00:00:42,330 --> 00:00:43,320
而这是我们需要做的事情,
something we'll need to do
16
00:00:43,320 --> 00:00:46,111
例如,在处理问答时。
when tackling question answering, for instance.
17
00:00:46,111 --> 00:00:47,592
也很难知道
It's also difficult to know
18
00:00:47,592 --> 00:00:50,850
两个 token 是否属于同一个词。
when two tokens belong to the same word or not.
19
00:00:50,850 --> 00:00:52,860
如果只看 BERT 分词器的输出,
It looks easy when you just look at the output
20
00:00:52,860 --> 00:00:55,650
这看起来很容易,因为我们只需要找
of a BERT tokenizer where we just need to look
21
00:00:55,650 --> 00:00:56,779
其中的 ## 。
for the ##.
22
00:00:56,779 --> 00:00:59,040
但是其他分词器有不同的方式
But other tokenizers have different ways
23
00:00:59,040 --> 00:01:00,987
来对单词的片段进行分词。
to tokenize parts of words.
24
00:01:00,987 --> 00:01:04,470
例如,RoBERTa 添加了这个特殊的 Ġ 符号
For instance, RoBERTa adds this special Ġ symbol
25
00:01:04,470 --> 00:01:06,491
来标记位于单词开头的 token ,
to mark the tokens at the beginning of the word
26
00:01:06,491 --> 00:01:09,570
T5 使用这个特殊的下划线符号
and T5 uses this special underscore symbol
27
00:01:09,570 --> 00:01:11,150
来达到同样的目的。
for the same purpose.
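[A short sketch comparing the three conventions; the checkpoints are illustrative assumptions.]
from transformers import AutoTokenizer

for checkpoint in ["bert-base-cased", "roberta-base", "t5-small"]:
    tok = AutoTokenizer.from_pretrained(checkpoint)
    # BERT marks word continuations with ##, RoBERTa marks word starts
    # with Ġ, and T5 marks word starts with ▁.
    print(checkpoint, tok.tokenize("Fast tokenizers rock"))
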
28
00:01:11,150 --> 00:01:14,760
值得庆幸的是,快速 tokenizer 会记录
Thankfully, the fast tokenizers keep track of the word
29
00:01:14,760 --> 00:01:16,230
每个 token 来自哪个单词,
each token comes from,
30
00:01:16,230 --> 00:01:19,571
你可以在其输出上调用 word_ids 方法来获取。
with a word_ids method you can use on their outputs.
31
00:01:19,571 --> 00:01:21,870
这个输出本身不一定直观,
The output is not necessarily clear,
32
00:01:21,870 --> 00:01:24,076
但像这样汇总到一张清晰的表格中,
but assembled together in a nice table like this,
33
00:01:24,076 --> 00:01:26,853
我们就能看到每个 token 对应的单词位置。
we can look at the word position for each token.
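[A small sketch of the word_ids method; the checkpoint and sentence are illustrative assumptions.]
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoding = tokenizer("Fast tokenizers rock!")
# word_ids() maps each token to the index of the word it came from;
# special tokens like [CLS] and [SEP] map to None.
print(encoding.tokens())
print(encoding.word_ids())
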
34
00:01:27,930 --> 00:01:30,220
更好的是,快速 tokenizer 会跟踪
Even better, the fast tokenizers keep track
35
00:01:30,220 --> 00:01:33,198
每个 token 所来自的字符范围,
of the span of characters each token comes from,
36
00:01:33,198 --> 00:01:35,760
在对一个或多个文本调用分词器时,
and we can get them when calling it on one
37
00:01:35,760 --> 00:01:37,221
我们只需添加
or several texts by adding
38
00:01:37,221 --> 00:01:40,470
return_offsets_mapping=True 参数即可得到它们。
the return_offsets_mapping=True argument.
39
00:01:40,470 --> 00:01:42,312
在这个例子中,我们可以看到位置是如何跳跃的
In this instance, we can see how we jump positions
40
00:01:42,312 --> 00:01:45,650
在 ## token 和 super token 之间,
between the ## token and the super token,
41
00:01:45,650 --> 00:01:49,992
这是因为原句中有多个空格。
because of the multiple spaces in the initial sentence.
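[A sketch of the offset mappings; the sentence with extra spaces is an illustrative assumption.]
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sentence = "Super     tokenizers!"
encoding = tokenizer(sentence, return_offsets_mapping=True)
# Each (start, end) pair indexes into the original string, so the
# offsets jump over the extra spaces instead of hiding them.
for token, (start, end) in zip(encoding.tokens(), encoding.offset_mapping):
    print(token, (start, end), repr(sentence[start:end]))
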
42
00:01:49,992 --> 00:01:52,110
为了实现这一点,快速 tokenizer
To enable this, the fast tokenizers
43
00:01:52,110 --> 00:01:54,270
会在内部管线( pipeline )的每一步
store additional information at each step
44
00:01:54,270 --> 00:01:55,440
存储附加信息。
of their internal pipeline.
45
00:01:55,440 --> 00:01:57,951
该内部管线包括规范化,
That internal pipeline consists of normalization,
46
00:01:57,951 --> 00:02:00,360
即对文本进行一些清理,
where we apply some cleaning to the text,
47
00:02:00,360 --> 00:02:02,621
比如转为小写或去掉重音符号;
like lower casing or removing the accents;
48
00:02:02,621 --> 00:02:04,088
预分词化,
pre-tokenization,
49
00:02:04,088 --> 00:02:06,530
这是我们将文本分成单词的地方;
which is where we split the texts into words;
50
00:02:06,530 --> 00:02:09,360
然后我们应用 tokenizer 的模型,
then we apply the model of the tokenizer,
51
00:02:09,360 --> 00:02:11,725
这是单词被分成 token 的地方,
which is where the words are split into tokens,
52
00:02:11,725 --> 00:02:13,748
最后再进行后处理,
before finally doing the post processing,
53
00:02:13,748 --> 00:02:16,023
也就是添加特殊 token 的步骤。
where special tokens are added.
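[The first two steps can be inspected directly on a fast tokenizer's backend; a sketch, with illustrative checkpoint and inputs.]
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
backend = tokenizer.backend_tokenizer  # the underlying Rust tokenizer
# Normalization: cleaning such as lowercasing and stripping accents.
print(backend.normalizer.normalize_str("Héllo hôw are ü?"))
# Pre-tokenization: splitting the text into words, with their offsets.
print(backend.pre_tokenizer.pre_tokenize_str("Hello, how are you?"))
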
54
00:02:17,100 --> 00:02:19,050
从管线的开始到结束,
From the beginning to the end of the pipeline,
55
00:02:19,050 --> 00:02:21,390
tokenizer 会跟踪每一段文本范围
the tokenizer keeps track of each span of text
56
00:02:21,390 --> 00:02:23,853
先对应到每个单词,再对应到每个 token 。
that corresponds to each word, then each token.
57
00:02:24,990 --> 00:02:26,100
当我们处理以下任务时,
We'll see how useful it is
58
00:02:26,100 --> 00:02:27,990
就会看到它有多么有用:
when we tackle the following tasks:
59
00:02:27,990 --> 00:02:29,549
在进行掩码语言建模时
when doing masked language modeling
60
00:02:29,549 --> 00:02:32,407
一种能取得 SOTA 结果的变体
one variation that gets state-of-the-art results
61
00:02:32,407 --> 00:02:35,040
是屏蔽给定单词的所有 token
is to mask all the tokens of a given word
62
00:02:35,040 --> 00:02:37,440
而不是随机选择的单词。
instead of randomly chosen words.
63
00:02:37,440 --> 00:02:40,800
这就需要用到我们前面看到的单词 ID。
This will require us to use the word IDs we saw.
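[A minimal sketch of whole-word masking built on word_ids; simplified, since real data collators also decide how many words to mask.]
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoding = tokenizer("Fast tokenizers rock!")
word_ids = encoding.word_ids()

# Pick one word at random and mask every token that belongs to it.
words = sorted({w for w in word_ids if w is not None})
target = random.choice(words)
masked = [
    tokenizer.mask_token_id if w == target else i
    for i, w in zip(encoding.input_ids, word_ids)
]
print(tokenizer.decode(masked))
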
64
00:02:40,800 --> 00:02:42,329
在做 token 分类的时候,
When doing token classification,
65
00:02:42,329 --> 00:02:45,090
我们需要把单词级别的标签
we'll need to convert the labels we have on words,
66
00:02:45,090 --> 00:02:47,250
转换成每个 token 上的标签。
to labels on each token.
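[A sketch of that conversion; the words and labels are made up for illustration.]
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
words = ["Sylvain", "works", "at", "Hugging", "Face"]
word_labels = [1, 0, 0, 2, 2]  # one label per word

encoding = tokenizer(words, is_split_into_words=True)
# Give every token the label of its word; special tokens get -100 so
# the loss function ignores them.
token_labels = [
    -100 if w is None else word_labels[w] for w in encoding.word_ids()
]
print(encoding.tokens())
print(token_labels)
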
67
00:02:47,250 --> 00:02:48,480
至于偏移映射,
As for the offset mappings,
68
00:02:48,480 --> 00:02:50,610
当我们需要把句子中的 token 位置
it will be super useful when we need to convert
69
00:02:50,610 --> 00:02:53,436
转换成一段文本时,它会非常有用,
token positions in a sentence into a span of text,
70
00:02:53,436 --> 00:02:55,800
这一点我们在研究
which we'll need to know when we're looking
71
00:02:55,800 --> 00:02:56,813
问答任务时会需要,
at question answering
72
00:02:56,813 --> 00:02:58,680
或者在把对应同一实体的 token
or when grouping the tokens corresponding
73
00:02:58,680 --> 00:03:01,023
进行分组的 token 分类任务中也会需要。
to the same entity in token classification.
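[A sketch of that conversion for question answering; the context and the predicted token positions are made up for illustration.]
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
context = "Hugging Face is based in New York City."
encoding = tokenizer(context, return_offsets_mapping=True)

# Pretend a QA model predicted this span of token positions.
start_token, end_token = 6, 8
start_char = encoding.offset_mapping[start_token][0]
end_char = encoding.offset_mapping[end_token][1]
print(context[start_char:end_char])
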
74
00:03:02,160 --> 00:03:03,450
要深入了解这些任务,
To have a look at these tasks,
75
00:03:03,450 --> 00:03:04,950
请查看下方链接的视频!
check the videos linked below!