subtitles/en/07_transformer-models-encoder-decoders.srt

1
00:00:00,520 --> 00:00:02,603
(swoosh)

2
00:00:04,230 --> 00:00:05,063
- In this video,

3
00:00:05,063 --> 00:00:07,638
we'll study the encoder-decoder architecture.

4
00:00:07,638 --> 00:00:12,243
An example of a popular encoder-decoder model is T5.

5
00:00:13,770 --> 00:00:16,980
In order to understand how the encoder-decoder works,

6
00:00:16,980 --> 00:00:18,630
we recommend you check out the videos

7
00:00:18,630 --> 00:00:22,590
on encoders and decoders as standalone models.

8
00:00:22,590 --> 00:00:24,990
Understanding how they work individually

9
00:00:24,990 --> 00:00:28,323
will help you understand how an encoder-decoder works.

10
00:00:30,510 --> 00:00:33,390
Let's start from what we've seen about the encoder.

11
00:00:33,390 --> 00:00:36,240
The encoder takes words as inputs,

12
00:00:36,240 --> 00:00:38,520
casts them through its layers,

13
00:00:38,520 --> 00:00:40,800
and retrieves a numerical representation

14
00:00:40,800 --> 00:00:42,663
for each word cast through it.

15
00:00:43,560 --> 00:00:46,470
We now know that this numerical representation

16
00:00:46,470 --> 00:00:49,473
holds information about the meaning of the sequence.

17
00:00:51,090 --> 00:00:54,243
Let's put this aside and add the decoder to the diagram.

18
00:00:56,610 --> 00:00:57,510
In this scenario,

19
00:00:57,510 --> 00:00:59,190
we're using the decoder in a manner

20
00:00:59,190 --> 00:01:00,960
that we haven't seen before.

21
00:01:00,960 --> 00:01:04,173
We're passing the outputs of the encoder directly to it.

22
00:01:05,356 --> 00:01:07,770
In addition to the encoder outputs,

23
00:01:07,770 --> 00:01:10,800
we also give the decoder a sequence.

24
00:01:10,800 --> 00:01:12,840
When prompting the decoder for an output

25
00:01:12,840 --> 00:01:14,190
with no initial sequence,

26
00:01:14,190 --> 00:01:16,140
we can give it the value that indicates

27
00:01:16,140 --> 00:01:18,060
the start of a sequence.

28
00:01:18,060 --> 00:01:20,919
And that's where the encoder-decoder magic happens.

29
00:01:20,919 --> 00:01:24,082
The encoder accepts a sequence as input.

30
00:01:24,082 --> 00:01:25,980
It computes a prediction,

31
00:01:25,980 --> 00:01:28,858
and outputs a numerical representation.

32
00:01:28,858 --> 00:01:33,120
Then, it sends that over to the decoder.

33
00:01:33,120 --> 00:01:36,300
It has, in a sense, encoded that sequence.

34
00:01:36,300 --> 00:01:38,130
And the decoder, in turn,

35
00:01:38,130 --> 00:01:40,847
using this input alongside its usual sequence input,

36
00:01:40,847 --> 00:01:43,906
will take a stab at decoding the sequence.

37
00:01:43,906 --> 00:01:46,530
The decoder decodes the sequence,

38
00:01:46,530 --> 00:01:48,360
and outputs a word.

39
00:01:48,360 --> 00:01:51,300
For now, we don't need to make sense of that word,

40
00:01:51,300 --> 00:01:53,100
but we can understand that the decoder

41
00:01:53,100 --> 00:01:56,103
is essentially decoding what the encoder has output.

42
00:01:57,008 --> 00:02:00,000
The start of sequence word here

43
00:02:00,000 --> 00:02:02,871
indicates that it should start decoding the sequence.

44
00:02:02,871 --> 00:02:06,870
Now that we have both the encoder's numerical representation

45
00:02:06,870 --> 00:02:09,570
and an initial generated word,

46
00:02:09,570 --> 00:02:11,343
we don't need the encoder anymore.

47
00:02:12,269 --> 00:02:15,540
As we have seen before with the decoder,

48
00:02:15,540 --> 00:02:18,720
it can act in an auto-regressive manner.
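The encode-once, decode-word-by-word behaviour just described can be sketched in Python with the transformers library. This is a minimal sketch, assuming the t5-small checkpoint, greedy decoding, and a hard cap of 20 generated tokens; these choices are illustrative assumptions, not something taken from the video.

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# t5-small is used here only as one example of an encoder-decoder checkpoint.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Run the encoder once: it turns the input words into a numerical representation.
inputs = tokenizer("translate English to French: Welcome to NYC", return_tensors="pt")
encoder_outputs = model.get_encoder()(**inputs)

# Start the decoder with the "start of sequence" value, then feed each word it
# outputs back in as input, together with the encoder representation.
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
for _ in range(20):
    logits = model(
        encoder_outputs=encoder_outputs,
        attention_mask=inputs["attention_mask"],
        decoder_input_ids=decoder_input_ids,
    ).logits
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy choice
    decoder_input_ids = torch.cat([decoder_input_ids, next_token], dim=-1)
    if next_token.item() == model.config.eos_token_id:  # the stopping value
        break

print(tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True))

In practice, model.generate() wraps this whole loop, including the start-of-sequence and stopping logic; the explicit version above only mirrors the steps the video walks through next.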
49
00:02:18,720 --> 00:02:22,933
The word it has just output can now be used as an input.

50
00:02:22,933 --> 00:02:26,188
This, in combination with the numerical representation

51
00:02:26,188 --> 00:02:28,560
output by the encoder,

52
00:02:28,560 --> 00:02:31,203
can now be used to generate a second word.

53
00:02:33,040 --> 00:02:35,910
Please note that the first word is still here,

54
00:02:35,910 --> 00:02:37,770
as the model still outputs it.

55
00:02:37,770 --> 00:02:39,240
However, we have grayed it out

56
00:02:39,240 --> 00:02:40,940
as we have no need for it anymore.

57
00:02:41,880 --> 00:02:44,070
We can continue on and on, for example,

58
00:02:44,070 --> 00:02:46,320
until the decoder outputs a value

59
00:02:46,320 --> 00:02:48,540
that we consider a stopping value,

60
00:02:48,540 --> 00:02:51,093
like a dot, meaning the end of a sequence.

61
00:02:53,580 --> 00:02:55,926
Here, we've seen the full mechanism

62
00:02:55,926 --> 00:02:57,540
of the encoder-decoder transformer.

63
00:02:57,540 --> 00:02:59,280
Let's go over it one more time.

64
00:02:59,280 --> 00:03:02,773
We have an initial sequence that is sent to the encoder.

65
00:03:02,773 --> 00:03:06,450
That encoder output is then sent to the decoder

66
00:03:06,450 --> 00:03:07,563
for it to be decoded.

67
00:03:08,760 --> 00:03:12,450
While we can now discard the encoder after a single use,

68
00:03:12,450 --> 00:03:14,427
the decoder will be used several times

69
00:03:14,427 --> 00:03:17,763
until we have generated every word that we need.

70
00:03:19,288 --> 00:03:21,510
So let's see a concrete example

71
00:03:21,510 --> 00:03:23,460
with translation language modeling,

72
00:03:23,460 --> 00:03:24,930
also called transduction,

73
00:03:24,930 --> 00:03:28,200
which is the act of translating a sequence.

74
00:03:28,200 --> 00:03:30,577
Here, we would like to translate this English sequence

75
00:03:30,577 --> 00:03:33,067
"Welcome to NYC" into French.

76
00:03:33,067 --> 00:03:35,460
We're using a transformer model

77
00:03:35,460 --> 00:03:38,070
that is trained for that task explicitly.

78
00:03:38,070 --> 00:03:40,560
We use the encoder to create a representation

79
00:03:40,560 --> 00:03:42,240
of the English sentence.

80
00:03:42,240 --> 00:03:44,730
We cast this to the decoder

81
00:03:44,730 --> 00:03:46,620
and, with the use of the start of sequence word,

82
00:03:46,620 --> 00:03:49,173
we ask it to output the first word.

83
00:03:50,029 --> 00:03:53,607
It outputs "Bienvenue", which means "Welcome".

84
00:03:53,607 --> 00:03:56,640
And we then use "Bienvenue"

85
00:03:56,640 --> 00:03:59,283
as the input sequence for the decoder.

86
00:04:00,188 --> 00:04:04,470
This, alongside the encoder's numerical representation,

87
00:04:04,470 --> 00:04:07,440
allows the decoder to predict the second word, "à",

88
00:04:07,440 --> 00:04:09,240
which is "to" in English.

89
00:04:09,240 --> 00:04:13,590
Finally, we ask the decoder to predict a third word.

90
00:04:13,590 --> 00:04:15,330
It predicts "NYC", which is correct.

91
00:04:15,330 --> 00:04:18,288
We've translated the sentence.

92
00:04:18,288 --> 00:04:20,760
Where the encoder-decoder really shines

93
00:04:20,760 --> 00:04:23,550
is that we have an encoder and a decoder,

94
00:04:23,550 --> 00:04:25,323
which often do not share weights.

95
00:04:26,256 --> 00:04:29,460
Therefore, we have an entire block, the encoder,

96
00:04:29,460 --> 00:04:31,650
that can be trained to understand the sequence

97
00:04:31,650 --> 00:04:34,290
and extract the relevant information.
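The "Welcome to NYC" walkthrough above can also be reproduced end to end with the high-level pipeline API. This is a sketch assuming the Helsinki-NLP/opus-mt-en-fr checkpoint, one example of an encoder-decoder model trained explicitly for English-to-French translation; the video itself does not name a specific checkpoint.

from transformers import pipeline

# Load an encoder-decoder checkpoint trained for English-to-French translation.
translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")

result = translator("Welcome to NYC")
print(result[0]["translation_text"])  # expected to be close to "Bienvenue à NYC"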
98
00:04:34,290 --> 00:04:36,450
For the translation scenario we saw earlier,

99
00:04:36,450 --> 00:04:38,760
for example, this would mean parsing

100
00:04:38,760 --> 00:04:42,003
and understanding what was said in the English language.

101
00:04:42,900 --> 00:04:45,960
It would mean extracting information from that language,

102
00:04:45,960 --> 00:04:49,413
and putting all of that into a vector dense with information.

103
00:04:50,361 --> 00:04:53,370
On the other hand, we have the decoder,

104
00:04:53,370 --> 00:04:56,850
whose sole purpose is to decode the numerical representation

105
00:04:56,850 --> 00:04:58,203
output by the encoder.

106
00:04:59,460 --> 00:05:01,170
This decoder can be specialized

107
00:05:01,170 --> 00:05:02,970
in a completely different language,

108
00:05:02,970 --> 00:05:05,403
or even a modality like images or speech.

109
00:05:07,170 --> 00:05:10,473
Encoder-decoders are special for several reasons.

110
00:05:11,310 --> 00:05:15,570
Firstly, they're able to manage sequence-to-sequence tasks,

111
00:05:15,570 --> 00:05:18,358
like the translation we have just seen.

112
00:05:18,358 --> 00:05:20,940
Secondly, the weights between the encoder

113
00:05:20,940 --> 00:05:24,540
and the decoder parts are not necessarily shared.

114
00:05:24,540 --> 00:05:27,172
Let's take another example of translation.

115
00:05:27,172 --> 00:05:30,810
Here we're translating "Transformers are powerful"

116
00:05:30,810 --> 00:05:32,048
into French.

117
00:05:32,048 --> 00:05:35,258
Firstly, this means that from a sequence of three words,

118
00:05:35,258 --> 00:05:39,030
we're able to generate a sequence of four words.

119
00:05:39,030 --> 00:05:42,480
One could argue that this could be handled with a decoder

120
00:05:42,480 --> 00:05:44,160
that would generate the translation

121
00:05:44,160 --> 00:05:46,260
in an auto-regressive manner,

122
00:05:46,260 --> 00:05:47,460
and they would be right.

123
00:05:49,980 --> 00:05:51,930
Another example of where sequence-to-sequence

124
00:05:51,930 --> 00:05:54,810
transformers shine is in summarization.

125
00:05:54,810 --> 00:05:58,379
Here we have a very long sequence, generally a full text,

126
00:05:58,379 --> 00:06:01,020
and we want to summarize it.

127
00:06:01,020 --> 00:06:04,020
Since the encoder and the decoder are separated,

128
00:06:04,020 --> 00:06:06,300
we can have different context lengths.

129
00:06:06,300 --> 00:06:08,910
For example, a very long context for the encoder,

130
00:06:08,910 --> 00:06:10,230
which handles the text,

131
00:06:10,230 --> 00:06:12,210
and a smaller context for the decoder,

132
00:06:12,210 --> 00:06:14,223
which handles the summarized sequence.

133
00:06:16,470 --> 00:06:18,840
There are a lot of sequence-to-sequence models.

134
00:06:18,840 --> 00:06:20,310
Here are a few examples

135
00:06:20,310 --> 00:06:22,500
of popular encoder-decoder models

136
00:06:22,500 --> 00:06:24,400
available in the transformers library.

137
00:06:25,829 --> 00:06:29,940
Additionally, you can load an encoder and a decoder

138
00:06:29,940 --> 00:06:32,130
inside an encoder-decoder model.

139
00:06:32,130 --> 00:06:35,190
Therefore, depending on the specific task you are targeting,

140
00:06:35,190 --> 00:06:38,700
you may choose to use specific encoders and decoders,

141
00:06:38,700 --> 00:06:42,613
which have proven their worth on these specific tasks.

142
00:06:42,613 --> 00:06:44,696
(swoosh)
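As a sketch of that last point, the transformers library can warm-start an encoder-decoder model from two independently pretrained checkpoints. bert-base-uncased and gpt2 are illustrative choices here, not checkpoints named in the video, and the combined model still has to be fine-tuned on the target task before it is useful.

from transformers import EncoderDecoderModel

# Pair a pretrained encoder (understands the input sequence) with a
# pretrained auto-regressive decoder (generates the output sequence).
# The cross-attention weights connecting the two are newly initialized,
# which is why fine-tuning is still required.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased",  # encoder
    "gpt2",               # decoder
)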