subtitles/zh-CN/44_fast-tokenizer-superpowers.srt

1
00:00:05,010 --> 00:00:06,270
- The fast tokenizers

2
00:00:06,270 --> 00:00:08,580
of the Transformers library are fast,

3
00:00:08,580 --> 00:00:11,490
but they also implement features that will be super useful

4
00:00:11,490 --> 00:00:14,536
for data pre-processing and post-processing.

5
00:00:14,536 --> 00:00:17,239
Let's have a look at them!

6
00:00:17,239 --> 00:00:18,650
First, let's have a look

7
00:00:18,650 --> 00:00:21,690
at the usual output of a tokenizer.

8
00:00:21,690 --> 00:00:24,278
We get input IDs that correspond to tokens,

9
00:00:24,278 --> 00:00:27,960
but we lose a lot of information in the process.

10
00:00:27,960 --> 00:00:29,010
For instance,

11
00:00:29,010 --> 00:00:31,856
here the tokenization is the same for the two sentences

12
00:00:31,856 --> 00:00:35,373
even if one has several more spaces than the other.

13
00:00:36,300 --> 00:00:39,150
Just having the input IDs is thus not enough

14
00:00:39,150 --> 00:00:42,330
if we want to match some tokens with a span of text,

15
00:00:42,330 --> 00:00:43,320
something we'll need to do

16
00:00:43,320 --> 00:00:46,111
when tackling question answering, for instance.

17
00:00:46,111 --> 00:00:47,592
It's also difficult to know

18
00:00:47,592 --> 00:00:50,850
whether two tokens belong to the same word or not.

19
00:00:50,850 --> 00:00:52,860
It looks easy when you just look at the output

20
00:00:52,860 --> 00:00:55,650
of a BERT tokenizer, where we just need to look

21
00:00:55,650 --> 00:00:56,779
for the ##.

22
00:00:56,779 --> 00:00:59,040
But other tokenizers have different ways

23
00:00:59,040 --> 00:01:00,987
to tokenize parts of words.

24
00:01:00,987 --> 00:01:04,470
For instance, RoBERTa adds this special G symbol

25
00:01:04,470 --> 00:01:06,491
to mark the tokens at the beginning of a word

26
00:01:06,491 --> 00:01:09,570
and T5 uses this special underscore symbol

27
00:01:09,570 --> 00:01:11,150
for the same purpose.

28
00:01:11,150 --> 00:01:14,760
Thankfully, the fast tokenizers keep track of the word

29
00:01:14,760 --> 00:01:16,230
each token comes from,

30
00:01:16,230 --> 00:01:19,571
with a word_ids method you can use on their outputs.

31
00:01:19,571 --> 00:01:21,870
The output is not necessarily clear,

32
00:01:21,870 --> 00:01:24,076
but assembled together in a nice table like this,

33
00:01:24,076 --> 00:01:26,853
we can look at the word position for each token.

34
00:01:27,930 --> 00:01:30,220
Even better, the fast tokenizers keep track

35
00:01:30,220 --> 00:01:33,198
of the span of characters each token comes from,

36
00:01:33,198 --> 00:01:35,760
and we can get them when calling it on one

37
00:01:35,760 --> 00:01:37,221
or several texts by adding

38
00:01:37,221 --> 00:01:40,470
the return_offsets_mapping=True argument.
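A minimal sketch of the two features just described, the word_ids method and the return_offsets_mapping=True argument; the checkpoint and the sentence are illustrative assumptions, not taken from the video:

from transformers import AutoTokenizer

# Any fast tokenizer works; bert-base-cased is an illustrative choice.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Extra spaces in the text do not change the tokens themselves.
encoding = tokenizer("Fast tokenizers have    superpowers!",
                     return_offsets_mapping=True)

print(encoding.tokens())
# word_ids() maps each token to the index of the word it comes from
# (None for special tokens such as [CLS] and [SEP]).
print(encoding.word_ids())
# offset_mapping holds the (start, end) character span of each token,
# so the extra spaces show up as a jump between consecutive offsets.
print(encoding["offset_mapping"])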
39
00:01:40,470 --> 00:01:42,312
In this instance, we can see how we jump positions

40
00:01:42,312 --> 00:01:45,650
between the ## token and the super token,

41
00:01:45,650 --> 00:01:49,992
because of the multiple spaces in the initial sentence.

42
00:01:49,992 --> 00:01:52,110
To enable this, the fast tokenizers

43
00:01:52,110 --> 00:01:54,270
store additional information at each step

44
00:01:54,270 --> 00:01:55,440
of their internal pipeline.

45
00:01:55,440 --> 00:01:57,951
That internal pipeline consists of normalization,

46
00:01:57,951 --> 00:02:00,360
where we apply some cleaning to the text,

47
00:02:00,360 --> 00:02:02,621
like lowercasing or removing accents;

48
00:02:02,621 --> 00:02:04,088
pre-tokenization,

49
00:02:04,088 --> 00:02:06,530
which is where we split the texts into words;

50
00:02:06,530 --> 00:02:09,360
then we apply the model of the tokenizer,

51
00:02:09,360 --> 00:02:11,725
which is where the words are split into tokens,

52
00:02:11,725 --> 00:02:13,748
before finally doing the post-processing,

53
00:02:13,748 --> 00:02:16,023
where special tokens are added.

54
00:02:17,100 --> 00:02:19,050
From the beginning to the end of the pipeline,

55
00:02:19,050 --> 00:02:21,390
the tokenizer keeps track of each span of text

56
00:02:21,390 --> 00:02:23,853
that corresponds to each word, then each token.

57
00:02:24,990 --> 00:02:26,100
We'll see how useful it is

58
00:02:26,100 --> 00:02:27,990
when we tackle the following tasks:

59
00:02:27,990 --> 00:02:29,549
when doing masked language modeling,

60
00:02:29,549 --> 00:02:32,407
one variation that gets state-of-the-art results

61
00:02:32,407 --> 00:02:35,040
is to mask all the tokens of a given word

62
00:02:35,040 --> 00:02:37,440
instead of randomly chosen tokens.

63
00:02:37,440 --> 00:02:40,800
This will require us to use the word IDs we saw.

64
00:02:40,800 --> 00:02:42,329
When doing token classification,

65
00:02:42,329 --> 00:02:45,090
we'll need to convert the labels we have on words

66
00:02:45,090 --> 00:02:47,250
to labels on each token.

67
00:02:47,250 --> 00:02:48,480
As for the offset mappings,

68
00:02:48,480 --> 00:02:50,610
they will be super useful when we need to convert

69
00:02:50,610 --> 00:02:53,436
token positions in a sentence into a span of text,

70
00:02:53,436 --> 00:02:55,800
which we'll need to know when we're looking

71
00:02:55,800 --> 00:02:56,813
at question answering

72
00:02:56,813 --> 00:02:58,680
or when grouping the tokens corresponding

73
00:02:58,680 --> 00:03:01,023
to the same entity in token classification.

74
00:03:02,160 --> 00:03:03,450
To have a look at these tasks,

75
00:03:03,450 --> 00:03:04,950
check the videos linked below!
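As a companion to the pipeline description above (normalization, pre-tokenization, model, post-processing), here is a hedged sketch of peeking at the first two steps through the backend_tokenizer attribute of a fast tokenizer; the checkpoint and strings are illustrative:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
backend = tokenizer.backend_tokenizer  # the underlying Rust tokenizer

# Normalization: cleaning such as lowercasing and removing accents.
print(backend.normalizer.normalize_str("Héllo hôw are ü?"))

# Pre-tokenization: splitting into words, already with character offsets.
print(backend.pre_tokenizer.pre_tokenize_str("Hello, how are you?"))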
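For the whole-word masking variation mentioned in the transcript, a minimal sketch (illustrative checkpoint and sentence) of using the word IDs to mask every token of one word:

import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer("Fast tokenizers have superpowers!")
word_ids = encoding.word_ids()

# Pick one word at random, skipping special tokens (word id None).
target = random.choice(sorted({w for w in word_ids if w is not None}))

# Mask every token belonging to that word, not just a single token.
input_ids = list(encoding["input_ids"])
for i, w in enumerate(word_ids):
    if w == target:
        input_ids[i] = tokenizer.mask_token_id

print(tokenizer.decode(input_ids))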
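And for token classification, a hedged sketch of spreading word-level labels onto tokens with word_ids(); word_labels here is a hypothetical list with one label per word:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer("Sylvain works at Hugging Face")
word_labels = [1, 0, 0, 3, 3]  # hypothetical tag ids, one per word

# Special tokens get -100 so the loss ignores them; every other
# token inherits the label of the word it comes from.
token_labels = [-100 if w is None else word_labels[w]
                for w in encoding.word_ids()]
print(token_labels)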