(logo whooshing)

- Let's study the transformer architecture. This video is the introduction to the encoders, decoders, and encoder-decoder series of videos. In this series, we'll try to understand what makes a transformer network, and we'll explain it in simple, high-level terms. No advanced understanding of neural networks is necessary, but a basic understanding of vectors and tensors may help.

To get started, we'll take up this diagram from the original transformer paper, "Attention Is All You Need" by Vaswani et al. As we'll see here, we can leverage only some parts of it, according to what we're trying to do. We won't dive into the specific layers building up that architecture, but we'll try to understand the different ways this architecture can be used.

Let's first start by splitting that architecture into two parts. On the left we have the encoder, and on the right, the decoder. These two can be used together, but they can also be used independently. Let's understand how they work.

The encoder accepts inputs that represent text. It converts this text, these words, into numerical representations. These numerical representations can also be called embeddings, or features. We'll see that it uses the self-attention mechanism as its main component. We recommend you check out the video on encoders specifically to understand what this numerical representation is, as well as how it works. There, we'll study the self-attention mechanism in more detail, as well as its bi-directional properties.

The decoder is similar to the encoder: it can also accept text inputs. It uses a similar mechanism to the encoder, namely masked self-attention. It differs from the encoder due to its uni-directional property, and it is traditionally used in an auto-regressive manner. Here too, we recommend you check out the video on decoders to understand how all of this works.
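To make the encoder side concrete, here is a minimal sketch of turning text into numerical representations (features). It assumes the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint purely as an illustrative example; any encoder-only checkpoint would behave the same way.

```python
# Minimal sketch: extract features with an encoder-only model.
# Assumes the `transformers` library and the `bert-base-uncased`
# checkpoint (chosen only as an illustrative example).
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Let's study the transformer architecture.", return_tensors="pt")
outputs = model(**inputs)

# One vector per input token, computed with bi-directional self-attention:
# shape is (batch_size, sequence_length, hidden_size).
print(outputs.last_hidden_state.shape)
```

Each input token ends up with a feature vector that takes the whole sentence into account, which is what the bi-directional self-attention mechanism provides.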
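The decoder side can be sketched in a similar way. A decoder-only model such as `gpt2` (again, just an example checkpoint, not the one used in the video) relies on masked self-attention, so each position only sees the tokens before it, and generation proceeds one token at a time, re-using previous outputs.

```python
# Minimal sketch: auto-regressive generation with a decoder-only model.
# Assumes the `transformers` library and the `gpt2` checkpoint
# (chosen only as an illustrative example).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Transformers are", return_tensors="pt")

# Each newly generated token is appended to the input and fed back in,
# which is what "auto-regressive" refers to.
output_ids = model.generate(
    **inputs,
    max_new_tokens=10,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token by default
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```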
Combining the two parts results in what is known as an encoder-decoder, or a sequence-to-sequence transformer. The encoder accepts inputs and computes a high-level representation of those inputs. These outputs are then passed to the decoder. The decoder uses the encoder's output, alongside other inputs, to generate a prediction. It then predicts an output, which it will re-use in future iterations, hence the term auto-regressive. Finally, to get an understanding of encoder-decoders as a whole, we recommend you check out the video on encoder-decoders.

(logo whooshing)
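As a final illustration, here is a minimal sketch of the full encoder-decoder combination. It assumes the `transformers` library and the `t5-small` checkpoint (an arbitrary example of a sequence-to-sequence model): the encoder computes a representation of the input, and the decoder generates the output auto-regressively while attending to that representation.

```python
# Minimal sketch: sequence-to-sequence generation with an encoder-decoder model.
# Assumes the `transformers` library and the `t5-small` checkpoint
# (chosen only as an illustrative example; also needs `sentencepiece`).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# The encoder reads the full input sentence...
inputs = tokenizer("translate English to French: Welcome to the course.", return_tensors="pt")

# ...and the decoder generates the output one token at a time,
# re-using its previous predictions at each step.
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```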