subtitles/en/22_preprocessing-sentence-pairs-(tensorflow).srt

1
00:00:00,225 --> 00:00:02,892
(air whooshing)

2
00:00:05,578 --> 00:00:09,180
- How to preprocess pairs of sentences?

3
00:00:09,180 --> 00:00:11,490
We have seen how to tokenize single sentences

4
00:00:11,490 --> 00:00:13,020
and batch them together

5
00:00:13,020 --> 00:00:15,660
in the "Batching inputs together" video.

6
00:00:15,660 --> 00:00:18,060
If this code looks unfamiliar to you,

7
00:00:18,060 --> 00:00:19,760
be sure to check that video again!

8
00:00:21,101 --> 00:00:22,110
Here, we will focus on tasks

9
00:00:22,110 --> 00:00:24,033
that classify pairs of sentences.

10
00:00:24,900 --> 00:00:27,030
For instance, we may want to classify

11
00:00:27,030 --> 00:00:29,820
whether two texts are paraphrases or not.

12
00:00:29,820 --> 00:00:30,900
Here is an example taken

13
00:00:30,900 --> 00:00:33,180
from the Quora Question Pairs dataset,

14
00:00:33,180 --> 00:00:36,033
which focuses on identifying duplicate questions.

15
00:00:37,110 --> 00:00:40,650
In the first pair, the two questions are duplicates;

16
00:00:40,650 --> 00:00:43,620
in the second, they are not.

17
00:00:43,620 --> 00:00:44,730
Another classification problem

18
00:00:44,730 --> 00:00:46,980
is when we want to know if two sentences

19
00:00:46,980 --> 00:00:49,290
are logically related or not,

20
00:00:49,290 --> 00:00:52,173
a problem called Natural Language Inference or NLI.

21
00:00:53,100 --> 00:00:55,830
In this example taken from the MultiNLI dataset,

22
00:00:55,830 --> 00:00:59,460
we have a pair of sentences for each possible label:

23
00:00:59,460 --> 00:01:02,400
contradiction, neutral or entailment,

24
00:01:02,400 --> 00:01:04,680
which is a fancy way of saying the first sentence

25
00:01:04,680 --> 00:01:05,853
implies the second.

26
00:01:07,140 --> 00:01:09,000
So classifying pairs of sentences

27
00:01:09,000 --> 00:01:10,533
is a problem worth studying.

28
00:01:11,370 --> 00:01:13,770
In fact, in the GLUE benchmark,

29
00:01:13,770 --> 00:01:16,830
which is an academic benchmark for text classification,

30
00:01:16,830 --> 00:01:19,680
eight of the 10 datasets are focused on tasks

31
00:01:19,680 --> 00:01:20,973
using pairs of sentences.

32
00:01:22,110 --> 00:01:24,720
That's why models like BERT are often pretrained

33
00:01:24,720 --> 00:01:26,520
with a dual objective:

34
00:01:26,520 --> 00:01:28,890
on top of the language modeling objective,

35
00:01:28,890 --> 00:01:32,010
they often have an objective related to sentence pairs.

36
00:01:32,010 --> 00:01:34,560
For instance, during pretraining,

37
00:01:34,560 --> 00:01:36,690
BERT is shown pairs of sentences

38
00:01:36,690 --> 00:01:39,900
and must predict both the value of randomly masked tokens

39
00:01:39,900 --> 00:01:41,250
and whether the second sentence

40
00:01:41,250 --> 00:01:42,903
follows from the first or not.

41
00:01:44,070 --> 00:01:47,100
Fortunately, the tokenizer from the Transformers library

42
00:01:47,100 --> 00:01:50,550
has a nice API to deal with pairs of sentences:

43
00:01:50,550 --> 00:01:52,650
you just have to pass them as two arguments

44
00:01:52,650 --> 00:01:53,613
to the tokenizer.

45
00:01:54,900 --> 00:01:56,040
On top of the input IDs

46
00:01:56,040 --> 00:01:58,440
and the attention mask we studied already,

47
00:01:58,440 --> 00:02:01,530
it returns a new field called token type IDs,

48
00:02:01,530 --> 00:02:03,210
which tells the model which tokens

49
00:02:03,210 --> 00:02:05,100
belong to the first sentence

50
00:02:05,100 --> 00:02:07,350
and which ones belong to the second sentence.

51
00:02:08,670 --> 00:02:11,430
Zooming in a little bit, here are the input IDs,

52
00:02:11,430 --> 00:02:13,710
aligned with the tokens they correspond to,

53
00:02:13,710 --> 00:02:17,193
their respective token type ID and attention mask.

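To make the three fields concrete, here is a minimal sketch in plain Python of the layout a BERT-style tokenizer produces for one pair of sentences. It is an illustration only, not the Transformers implementation: `encode_pair` is a hypothetical helper, whitespace splitting stands in for real subword tokenization, the example sentences are made up, and it returns token strings rather than integer input IDs. With the library itself, the call described in the video has the form `tokenizer(sentence1, sentence2)`.

```python
def encode_pair(first, second):
    """Toy illustration of how a BERT-style tokenizer lays out a sentence pair."""
    # Assumption: whitespace splitting stands in for real subword tokenization.
    first_tokens = first.lower().split()
    second_tokens = second.lower().split()

    # Layout: [CLS] first sentence [SEP] second sentence [SEP]
    tokens = ["[CLS]"] + first_tokens + ["[SEP]"] + second_tokens + ["[SEP]"]

    # Token type IDs: 0 for [CLS], the first sentence and its [SEP];
    # 1 for the second sentence and the final [SEP].
    token_type_ids = [0] * (len(first_tokens) + 2) + [1] * (len(second_tokens) + 1)

    # There is no padding for a single pair, so every position is attended to.
    attention_mask = [1] * len(tokens)

    return {
        "tokens": tokens,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


encoded = encode_pair("How are you", "I am fine")
for token, type_id, mask in zip(
    encoded["tokens"], encoded["token_type_ids"], encoded["attention_mask"]
):
    print(f"{token:8} type={type_id} mask={mask}")
```

The printed rows line up each token with its token type ID and attention mask, which is the same alignment the video zooms in on.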
54
00:02:18,540 --> 00:02:21,300
We can see the tokenizer also added special tokens

55
00:02:21,300 --> 00:02:25,230
so we have a CLS token, the tokens from the first sentence,

56
00:02:25,230 --> 00:02:28,590
a SEP token, the tokens from the second sentence,

57
00:02:28,590 --> 00:02:30,153
and a final SEP token.

58
00:02:31,680 --> 00:02:33,720
If we have several pairs of sentences,

59
00:02:33,720 --> 00:02:35,640
we can tokenize them together

60
00:02:35,640 --> 00:02:38,280
by passing the list of first sentences,

61
00:02:38,280 --> 00:02:40,710
then the list of second sentences

62
00:02:40,710 --> 00:02:43,050
and all the keyword arguments we studied already,

63
00:02:43,050 --> 00:02:44,133
like padding=True.

64
00:02:45,510 --> 00:02:46,770
Zooming in at the result,

65
00:02:46,770 --> 00:02:49,050
we can see how the tokenizer added padding

66
00:02:49,050 --> 00:02:50,940
to the second pair of sentences,

67
00:02:50,940 --> 00:02:53,490
to make the two outputs the same length.

68
00:02:53,490 --> 00:02:55,620
It also properly dealt with token type IDs

69
00:02:55,620 --> 00:02:57,720
and attention masks for the two sentences.

70
00:02:59,010 --> 00:03:01,460
This is then all ready to pass through our model!

71
00:03:03,799 --> 00:03:06,466
(air whooshing)
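The batched case can be sketched the same way. The plain-Python example below mimics what padding does when several pairs are encoded together; it is a toy under stated assumptions, not the library's code. `encode_pair_batch` is a hypothetical helper, whitespace splitting replaces real tokenization, the sentences are made up, and the choice of token type ID 0 for padded positions is an assumption modeled on BERT's convention. With the library, the corresponding call passes the list of first sentences, the list of second sentences, and `padding=True`.

```python
def encode_pair_batch(first_sentences, second_sentences, pad_token="[PAD]"):
    """Toy illustration of encoding several sentence pairs with padding."""
    if len(first_sentences) != len(second_sentences):
        raise ValueError("Both lists must contain the same number of sentences.")

    rows = []
    for first, second in zip(first_sentences, second_sentences):
        # Assumption: whitespace splitting stands in for subword tokenization.
        a = first.lower().split()
        b = second.lower().split()
        tokens = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]
        type_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)
        rows.append((tokens, type_ids))

    # Pad every row on the right up to the longest row in the batch.
    max_len = max(len(tokens) for tokens, _ in rows)
    batch = {"tokens": [], "token_type_ids": [], "attention_mask": []}
    for tokens, type_ids in rows:
        n_pad = max_len - len(tokens)
        batch["tokens"].append(tokens + [pad_token] * n_pad)
        # Assumption: padded positions get token type ID 0.
        batch["token_type_ids"].append(type_ids + [0] * n_pad)
        # The attention mask is 1 for real tokens and 0 for padding.
        batch["attention_mask"].append([1] * len(tokens) + [0] * n_pad)
    return batch


batch = encode_pair_batch(
    ["How are you today", "Hello"],
    ["I am fine thanks", "Hi there"],
)
for row in zip(batch["tokens"], batch["token_type_ids"], batch["attention_mask"]):
    print(*row, sep="\n", end="\n\n")
```

The second pair is shorter, so it is the one that receives padding: its attention mask ends in zeros, telling the model to ignore those positions.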