subtitles/en/21_preprocessing-sentence-pairs-(pytorch).srt
1
00:00:00,000 --> 00:00:03,083
(graphics whooshing)
2
00:00:05,370 --> 00:00:07,413
- How to pre-process pairs of sentences.
3
00:00:09,150 --> 00:00:11,340
We have seen how to
tokenize single sentences
4
00:00:11,340 --> 00:00:12,877
and batch them together in the
5
00:00:12,877 --> 00:00:15,810
"Batching inputs together video."
6
00:00:15,810 --> 00:00:18,330
If this code looks unfamiliar to you,
7
00:00:18,330 --> 00:00:20,030
be sure to check that video again.
8
00:00:21,330 --> 00:00:24,543
Here we will focus on tasks that
classify pairs of sentences.
9
00:00:25,620 --> 00:00:28,470
For instance, we may want to
classify whether two texts
10
00:00:28,470 --> 00:00:30,360
are paraphrased or not.
11
00:00:30,360 --> 00:00:32,880
Here is an example taken
from the Quora Question Pairs
12
00:00:32,880 --> 00:00:37,530
dataset, which focuses on
identifying duplicate questions.
13
00:00:37,530 --> 00:00:40,650
In the first pair, the two
questions are duplicates,
14
00:00:40,650 --> 00:00:42,000
in the second they are not.
15
00:00:43,283 --> 00:00:45,540
Another pair classification problem is
16
00:00:45,540 --> 00:00:47,400
when we want to know if two sentences are
17
00:00:47,400 --> 00:00:49,590
logically related or not,
18
00:00:49,590 --> 00:00:53,970
a problem called natural
language inference or NLI.
19
00:00:53,970 --> 00:00:57,000
In this example, taken
from the MultiNLI dataset,
20
00:00:57,000 --> 00:00:59,880
we have a pair of sentences
for each possible label.
21
00:00:59,880 --> 00:01:02,490
Contradiction, neutral or entailment,
22
00:01:02,490 --> 00:01:04,680
which is a fancy way of
saying the first sentence
23
00:01:04,680 --> 00:01:05,793
implies the second.
24
00:01:06,930 --> 00:01:08,820
So classifying pairs of
sentences is a problem
25
00:01:08,820 --> 00:01:10,260
worth studying.
26
00:01:10,260 --> 00:01:12,630
In fact, in the GLUE benchmark,
27
00:01:12,630 --> 00:01:15,750
which is an academic benchmark
for text classification,
28
00:01:15,750 --> 00:01:17,910
eight of the 10 datasets are focused
29
00:01:17,910 --> 00:01:19,953
on tasks using pairs of sentences.
30
00:01:20,910 --> 00:01:22,560
That's why models like BERT
31
00:01:22,560 --> 00:01:25,320
are often pre-trained
with a dual objective.
32
00:01:25,320 --> 00:01:27,660
On top of the language modeling objective,
33
00:01:27,660 --> 00:01:31,230
they often have an objective
related to sentence pairs.
34
00:01:31,230 --> 00:01:34,320
For instance, during
pretraining, BERT is shown
35
00:01:34,320 --> 00:01:36,810
pairs of sentences and must predict both
36
00:01:36,810 --> 00:01:39,930
the value of randomly masked
tokens, and whether the second
37
00:01:39,930 --> 00:01:41,830
sentence follows from the first or not.
38
00:01:43,084 --> 00:01:45,930
Fortunately, the tokenizer
from the Transformers library
39
00:01:45,930 --> 00:01:49,170
has a nice API to deal
with pairs of sentences.
40
00:01:49,170 --> 00:01:51,270
You just have to pass
them as two arguments
41
00:01:51,270 --> 00:01:52,120
to the tokenizer.
42
00:01:53,430 --> 00:01:55,470
On top of the input IDs
and the attention mask
43
00:01:55,470 --> 00:01:56,970
we studied already,
44
00:01:56,970 --> 00:01:59,910
it returns a new field
called token type IDs,
45
00:01:59,910 --> 00:02:01,790
which tells the model which tokens belong
46
00:02:01,790 --> 00:02:03,630
to the first sentence,
47
00:02:03,630 --> 00:02:05,943
and which ones belong
to the second sentence.
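As a minimal sketch of that API (the checkpoint name and the example sentences are just placeholders, not taken from the video):

from transformers import AutoTokenizer

# Any checkpoint with a token-type vocabulary works; bert-base-uncased is illustrative.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Pass the two sentences of a pair as two separate arguments.
inputs = tokenizer("How old are you?", "What is your age?")

print(inputs.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
print(inputs["token_type_ids"])
# 0 for tokens of the first sentence, 1 for tokens of the second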
48
00:02:07,290 --> 00:02:09,840
Zooming in a little bit,
here we have the input IDs
49
00:02:09,840 --> 00:02:12,180
aligned with the tokens
they correspond to,
50
00:02:12,180 --> 00:02:15,213
their respective token
type IDs and attention masks.
51
00:02:16,080 --> 00:02:19,260
We can see the tokenizer
also added special tokens.
52
00:02:19,260 --> 00:02:22,620
So we have a CLS token, the
tokens from the first sentence,
53
00:02:22,620 --> 00:02:25,770
a SEP token, the tokens
from the second sentence,
54
00:02:25,770 --> 00:02:27,003
and a final SEP token.
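To make that layout concrete, a small self-contained sketch (same illustrative checkpoint and placeholder pair as above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint
inputs = tokenizer("How old are you?", "What is your age?")     # placeholder pair

# Prints the tokens in the pattern described above:
# [CLS] <first sentence tokens> [SEP] <second sentence tokens> [SEP]
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"]))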
55
00:02:28,500 --> 00:02:30,570
If we have several pairs of sentences,
56
00:02:30,570 --> 00:02:32,840
we can tokenize them
together by passing the list
57
00:02:32,840 --> 00:02:36,630
of first sentences, then
the list of second sentences
58
00:02:36,630 --> 00:02:39,300
and all the keyword
arguments we studied already
59
00:02:39,300 --> 00:02:40,353
like padding=True.
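A sketch of that batched call (the sentence lists and the extra keyword arguments are illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint

first_sentences = ["How old are you?", "The movie was great."]        # placeholder data
second_sentences = ["What is your age?", "I really enjoyed the film."]

batch = tokenizer(
    first_sentences,
    second_sentences,
    padding=True,         # pad the shorter pair to the length of the longer one
    return_tensors="pt",  # PyTorch tensors, since this is the PyTorch version
)
print(batch["input_ids"].shape)
print(batch["token_type_ids"])
print(batch["attention_mask"])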
60
00:02:41,940 --> 00:02:43,140
Zooming in on the result,
61
00:02:43,140 --> 00:02:45,030
we can see the tokenizer added padding
62
00:02:45,030 --> 00:02:48,090
to the second pair of sentences
to make the two outputs
63
00:02:48,090 --> 00:02:51,360
the same length, and properly
dealt with token type IDs
64
00:02:51,360 --> 00:02:53,643
and attention masks for the two sentences.
65
00:02:54,900 --> 00:02:57,573
This is then all ready to
pass through our model.
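For completeness, a sketch of that last step, assuming a sequence-classification head (the checkpoint and num_labels are placeholders):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "bert-base-uncased"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

batch = tokenizer(
    ["How old are you?"], ["What is your age?"],  # placeholder pair
    padding=True, return_tensors="pt",
)
with torch.no_grad():
    outputs = model(**batch)
print(outputs.logits.shape)  # torch.Size([1, 2])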