subtitles/en/17_batching-inputs-together-(pytorch).srt:

1
00:00:00,373 --> 00:00:02,956
(subtle blast)

2
00:00:05,400 --> 00:00:07,590
- How to batch inputs together.

3
00:00:07,590 --> 00:00:09,240
In this video, we will see how

4
00:00:09,240 --> 00:00:11,073
to batch input sequences together.

5
00:00:12,137 --> 00:00:15,420
In general, the sentences we want to pass through our model

6
00:00:15,420 --> 00:00:17,670
won't all have the same lengths.

7
00:00:17,670 --> 00:00:19,740
Here, we are using the model we saw

8
00:00:19,740 --> 00:00:22,080
in the sentiment analysis pipeline

9
00:00:22,080 --> 00:00:24,063
and want to classify two sentences.

10
00:00:24,900 --> 00:00:27,360
When tokenizing them and mapping each token

11
00:00:27,360 --> 00:00:29,610
to its corresponding input IDs,

12
00:00:29,610 --> 00:00:31,593
we get two lists of different lengths.

13
00:00:33,240 --> 00:00:35,340
Trying to create a tensor or a NumPy array

14
00:00:35,340 --> 00:00:38,220
from those two lists will result in an error,

15
00:00:38,220 --> 00:00:41,043
because all arrays and tensors should be rectangular.

16
00:00:42,240 --> 00:00:44,160
One way to overcome this limit

17
00:00:44,160 --> 00:00:45,690
is to make the second sentence

18
00:00:45,690 --> 00:00:47,640
the same length as the first

19
00:00:47,640 --> 00:00:50,463
by adding a special token as many times as necessary.

20
00:00:51,600 --> 00:00:53,970
Another way would be to truncate the first sequence

21
00:00:53,970 --> 00:00:55,710
to the length of the second,

22
00:00:55,710 --> 00:00:58,140
but we would then lose a lot of information

23
00:00:58,140 --> 00:01:01,083
that might be necessary to properly classify the sentence.

24
00:01:02,190 --> 00:01:04,830
In general, we only truncate sentences

25
00:01:04,830 --> 00:01:06,840
when they are longer than the maximum length

26
00:01:06,840 --> 00:01:08,073
the model can handle.

27
00:01:09,720 --> 00:01:11,850
The value used to pad the second sentence

28
00:01:11,850 --> 00:01:13,740
should not be picked randomly;

29
00:01:13,740 --> 00:01:16,680
the model has been pretrained with a certain padding ID,

30
00:01:16,680 --> 00:01:19,533
which you can find in tokenizer.pad_token_id.

31
00:01:21,090 --> 00:01:22,800
Now that we have padded our sentences,

32
00:01:22,800 --> 00:01:24,303
we can make a batch with them.

33
00:01:25,380 --> 00:01:28,320
If we pass the two sentences through the model separately

34
00:01:28,320 --> 00:01:30,120
and batched together, however,

35
00:01:30,120 --> 00:01:32,100
we notice that we don't get the same results

36
00:01:32,100 --> 00:01:34,060
for the sentence that is padded,

37
00:01:34,060 --> 00:01:35,403
here, the second one.

38
00:01:36,390 --> 00:01:39,420
Is it a bug in the Transformers library? No.

39
00:01:39,420 --> 00:01:40,770
If you remember that Transformer models

40
00:01:40,770 --> 00:01:42,810
make heavy use of attention layers,

41
00:01:42,810 --> 00:01:45,210
this should not come as a total surprise;

42
00:01:45,210 --> 00:01:48,277
when computing the contextual representation of each token,

43
00:01:48,277 --> 00:01:50,910
the attention layers look at all the other words

44
00:01:50,910 --> 00:01:52,410
in the sentence.

45
00:01:52,410 --> 00:01:53,850
If we have just the sentence

46
00:01:53,850 --> 00:01:56,970
or the sentence with several padding tokens added,

47
00:01:56,970 --> 00:01:59,073
it's logical that we don't get the same values.
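[Editor's sketch] The steps described so far can be sketched in a few lines of Python. This is a minimal, illustrative sketch, assuming the sentiment-analysis checkpoint used earlier in the course (distilbert-base-uncased-finetuned-sst-2-english); the two example sentences are placeholders, not necessarily the ones shown on screen.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Two placeholder sentences of different lengths.
sentence1 = "I've been waiting for a HuggingFace course my whole life."
sentence2 = "I hate this so much!"

# Tokenize and map each token to its input ID: the two lists have
# different lengths, so they cannot be stacked into a rectangular tensor as-is.
ids1 = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentence1))
ids2 = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentence2))
print(len(ids1), len(ids2))

# Pad the shorter list with the padding ID the model was pretrained with.
padding = [tokenizer.pad_token_id] * (len(ids1) - len(ids2))
batched_ids = torch.tensor([ids1, ids2 + padding])

# Without an attention mask, the padded sentence gets different logits
# in the batch than when it is passed to the model on its own.
print(model(torch.tensor([ids2])).logits)
print(model(batched_ids).logits[1])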
48
00:02:00,270 --> 00:02:03,030
To get the same results with or without padding,

49
00:02:03,030 --> 00:02:05,340
we need to indicate to the attention layers

50
00:02:05,340 --> 00:02:08,070
that they should ignore those padding tokens.

51
00:02:08,070 --> 00:02:10,620
This is done by creating an attention mask,

52
00:02:10,620 --> 00:02:13,320
a tensor with the same shape as the input IDs,

53
00:02:13,320 --> 00:02:14,733
with zeros and ones.

54
00:02:15,780 --> 00:02:18,120
Ones indicate the tokens the attention layers

55
00:02:18,120 --> 00:02:20,100
should consider in the context

56
00:02:20,100 --> 00:02:22,100
and zeros the tokens they should ignore.

57
00:02:23,520 --> 00:02:26,760
Now, passing this attention mask along with the input IDs

58
00:02:26,760 --> 00:02:28,170
will give us the same results

59
00:02:28,170 --> 00:02:31,170
as when we sent the two sentences individually to the model.

60
00:02:32,400 --> 00:02:34,950
This is all done behind the scenes by the tokenizer

61
00:02:34,950 --> 00:02:36,900
when you apply it to several sentences

62
00:02:36,900 --> 00:02:38,613
with the flag padding=True.

63
00:02:39,599 --> 00:02:41,490
It will apply the padding with the proper value

64
00:02:41,490 --> 00:02:43,140
to the smaller sentences

65
00:02:43,140 --> 00:02:45,423
and create the appropriate attention mask.

66
00:02:46,993 --> 00:02:49,576
(subtle blast)
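[Editor's sketch] As a rough companion to the last part of the video, the sketch below builds an attention mask by hand and then lets the tokenizer do the same work with padding=True. The checkpoint, the dummy token IDs, and the sentences are placeholders chosen for illustration, not taken from the video.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# A manually padded batch: the second row ends with the padding ID.
batched_ids = [
    [200, 200, 200],                     # dummy token IDs for illustration
    [200, 200, tokenizer.pad_token_id],
]
# Attention mask with the same shape: 1 = attend to this token, 0 = ignore it.
attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]
outputs = model(
    torch.tensor(batched_ids),
    attention_mask=torch.tensor(attention_mask),
)
print(outputs.logits)

# The tokenizer pads with the proper value and builds the attention mask itself
# when called on several sentences with padding=True.
inputs = tokenizer(
    ["I've been waiting for a HuggingFace course my whole life.",  # placeholder sentences
     "I hate this so much!"],
    padding=True,
    return_tensors="pt",
)
print(inputs["attention_mask"])
print(model(**inputs).logits)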