subtitles/en/06_transformer-models-decoders.srt

1
00:00:03,750 --> 00:00:07,140
- In this video, we'll study the decoder architecture.

2
00:00:07,140 --> 00:00:07,973
An example

3
00:00:07,973 --> 00:00:11,338
of a popular decoder-only architecture is GPT-2.

4
00:00:11,338 --> 00:00:14,160
In order to understand how decoders work,

5
00:00:14,160 --> 00:00:17,430
we recommend taking a look at the video regarding encoders:

6
00:00:17,430 --> 00:00:19,980
they're extremely similar to decoders.

7
00:00:19,980 --> 00:00:21,210
One can use a decoder

8
00:00:21,210 --> 00:00:23,760
for most of the same tasks as an encoder,

9
00:00:23,760 --> 00:00:27,330
albeit with, generally, a little loss of performance.

10
00:00:27,330 --> 00:00:28,890
Let's take the same approach we have taken

11
00:00:28,890 --> 00:00:30,300
with the encoder to try

12
00:00:30,300 --> 00:00:32,670
and understand the architectural differences

13
00:00:32,670 --> 00:00:34,803
between an encoder and a decoder.

14
00:00:35,777 --> 00:00:38,910
We'll use a small example, using three words.

15
00:00:38,910 --> 00:00:41,050
We pass them through the decoder.

16
00:00:41,050 --> 00:00:44,793
We retrieve a numerical representation for each word.

17
00:00:46,410 --> 00:00:49,350
Here, for example, the decoder converts the three words

18
00:00:49,350 --> 00:00:53,545
"Welcome to NYC" into these three sequences of numbers.

19
00:00:53,545 --> 00:00:56,040
The decoder outputs exactly one sequence

20
00:00:56,040 --> 00:00:58,740
of numbers per input word.

21
00:00:58,740 --> 00:01:00,630
This numerical representation can also

22
00:01:00,630 --> 00:01:03,783
be called a feature vector or a feature tensor.

23
00:01:04,920 --> 00:01:07,200
Let's dive into this representation.

24
00:01:07,200 --> 00:01:08,490
It contains one vector

25
00:01:08,490 --> 00:01:11,340
per word that was passed through the decoder.

26
00:01:11,340 --> 00:01:14,250
Each of these vectors is a numerical representation

27
00:01:14,250 --> 00:01:15,573
of the word in question.

28
00:01:16,920 --> 00:01:18,562
The dimension of that vector is defined

29
00:01:18,562 --> 00:01:20,703
by the architecture of the model.

30
00:01:22,860 --> 00:01:26,040
Where the decoder differs from the encoder is principally

31
00:01:26,040 --> 00:01:28,200
with its self-attention mechanism.

32
00:01:28,200 --> 00:01:30,843
It's using what is called masked self-attention.

33
00:01:31,860 --> 00:01:34,650
Here, for example, if we focus on the word "to",

34
00:01:34,650 --> 00:01:37,620
we'll see that its vector is absolutely unmodified

35
00:01:37,620 --> 00:01:39,690
by the "NYC" word.

36
00:01:39,690 --> 00:01:41,731
That's because all the words on the right, also known

37
00:01:41,731 --> 00:01:45,276
as the right context of the word, are masked.

38
00:01:45,276 --> 00:01:49,230
Rather than benefiting from all the words on the left and right,

39
00:01:49,230 --> 00:01:51,600
i.e. the bidirectional context,

40
00:01:51,600 --> 00:01:55,020
decoders only have access to a single context,

41
00:01:55,020 --> 00:01:58,203
which can be the left context or the right context.

42
00:01:59,539 --> 00:02:03,356
The masked self-attention mechanism differs

43
00:02:03,356 --> 00:02:04,320
from the self-attention mechanism

44
00:02:04,320 --> 00:02:07,110
by using an additional mask to hide the context

45
00:02:07,110 --> 00:02:09,390
on either side of the word:

46
00:02:09,390 --> 00:02:12,810
the word's numerical representation will not be affected

47
00:02:12,810 --> 00:02:14,853
by the words in the hidden context.
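
To make the masking described above concrete, here is a minimal sketch of masked (causal) self-attention in plain NumPy. It is not the video's own code: the three word vectors are random toy values, and the learned query/key/value projections of a real Transformer layer are omitted so that only the masking logic is visible.

# Minimal sketch of masked (causal) self-attention.
# Toy values only; real decoder layers also apply learned Q/K/V projections.
import numpy as np

def masked_self_attention(x):
    """x: (seq_len, dim) word vectors; returns (seq_len, dim) outputs."""
    seq_len, dim = x.shape
    scores = x @ x.T / np.sqrt(dim)                 # one row of attention scores per word
    mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)
    scores[mask] = -np.inf                          # hide each word's right context
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the visible words only
    return weights @ x                              # weighted sum of the visible words

# Three toy vectors standing in for "Welcome", "to", "NYC"
embeddings = np.random.randn(3, 4)
out = masked_self_attention(embeddings)
# The output for "to" (row 1) depends only on rows 0 and 1;
# "NYC" (row 2) is masked out, exactly as described in the video.
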
48
00:02:16,260 --> 00:02:18,330
So when should one use a decoder?

49
00:02:18,330 --> 00:02:22,380
Decoders, like encoders, can be used as standalone models,

50
00:02:22,380 --> 00:02:25,020
as they generate a numerical representation.

51
00:02:25,020 --> 00:02:28,320
They can also be used in a wide variety of tasks.

52
00:02:28,320 --> 00:02:31,260
However, the strength of a decoder lies in the way

53
00:02:31,260 --> 00:02:34,530
a word can only have access to its left context.

54
00:02:34,530 --> 00:02:36,690
Having only access to their left context,

55
00:02:36,690 --> 00:02:39,120
they're inherently good at text generation:

56
00:02:39,120 --> 00:02:41,010
the ability to generate a word,

57
00:02:41,010 --> 00:02:45,000
or a sequence of words, given a known sequence of words.

58
00:02:45,000 --> 00:02:45,833
This is known

59
00:02:45,833 --> 00:02:49,083
as causal language modeling, or natural language generation.

60
00:02:50,430 --> 00:02:53,520
Here's an example of how causal language modeling works.

61
00:02:53,520 --> 00:02:56,410
We start with an initial word, which is "my".

62
00:02:57,339 --> 00:02:59,973
We use this as input for the decoder.

63
00:03:00,810 --> 00:03:04,260
The model outputs a vector of numbers,

64
00:03:04,260 --> 00:03:07,230
and this vector contains information about the sequence,

65
00:03:07,230 --> 00:03:08,733
which is here a single word.

66
00:03:09,780 --> 00:03:11,430
We apply a small transformation

67
00:03:11,430 --> 00:03:13,110
to that vector so that it maps

68
00:03:13,110 --> 00:03:16,500
to all the words known by the model, which is a mapping

69
00:03:16,500 --> 00:03:19,890
that we'll see later, called a language modeling head.

70
00:03:19,890 --> 00:03:21,930
We identify that the model believes

71
00:03:21,930 --> 00:03:25,053
that the most probable following word is "name".

72
00:03:26,250 --> 00:03:28,710
We then take that new word and add it

73
00:03:28,710 --> 00:03:33,480
to the initial sequence: from "my", we are now at "my name".

74
00:03:33,480 --> 00:03:36,870
This is where the autoregressive aspect comes in.

75
00:03:36,870 --> 00:03:38,490
Autoregressive models

76
00:03:38,490 --> 00:03:42,513
reuse their past outputs as inputs in the following steps.

77
00:03:43,452 --> 00:03:46,980
Once again, we do the exact same operation.

78
00:03:46,980 --> 00:03:49,500
We pass that sequence through the decoder

79
00:03:49,500 --> 00:03:51,993
and retrieve the most probable following word.

80
00:03:52,978 --> 00:03:57,978
In this case, it is the word "is". We repeat the operation

81
00:03:58,230 --> 00:04:02,040
until we're satisfied. Starting from a single word,

82
00:04:02,040 --> 00:04:04,590
we've now generated a full sentence.

83
00:04:04,590 --> 00:04:07,890
We decide to stop there, but we could continue for a while.

84
00:04:07,890 --> 00:04:12,890
GPT-2, for example, has a maximum context size of 1,024.

85
00:04:13,170 --> 00:04:16,830
We could eventually generate up to 1,024 words,

86
00:04:16,830 --> 00:04:19,050
and the decoder would still have some memory

87
00:04:19,050 --> 00:04:21,003
of the first words in this sequence.
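
As a rough code counterpart to the causal language modeling walkthrough above, here is a minimal sketch of autoregressive generation with GPT-2 through the Hugging Face transformers library. The prompt, the greedy decoding choice, and the 10-token limit are illustrative assumptions, not part of the video.

# Minimal sketch of autoregressive (causal) generation with GPT-2.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Start from a single word; at each step the model picks the most probable
# next token, which is appended to the input for the following step.
inputs = tokenizer("My", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(output_ids[0]))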