subtitles/en/05_transformer-models-encoders.srt

1
00:00:00,253 --> 00:00:03,003
(intro striking)

2
00:00:04,440 --> 00:00:07,830
- In this video, we'll study
the encoder architecture.

3
00:00:07,830 --> 00:00:11,070
An example of a popular
encoder-only architecture is BERT,

4
00:00:11,070 --> 00:00:13,323
which is the most popular
model of its kind.

5
00:00:14,550 --> 00:00:16,950
Let's first start by
understanding how it works.

6
00:00:18,360 --> 00:00:20,910
We'll use a small example,
using three words.

7
00:00:20,910 --> 00:00:23,823
We use these as inputs and
pass them through the encoder.

8
00:00:25,290 --> 00:00:28,173
We retrieve a numerical
representation of each word.

9
00:00:29,970 --> 00:00:32,700
Here, for example, the encoder
converts those three words,

10
00:00:32,700 --> 00:00:37,350
"Welcome to NYC", into these
three sequences of numbers.

11
00:00:37,350 --> 00:00:40,350
The encoder outputs exactly
one sequence of numbers

12
00:00:40,350 --> 00:00:41,493
per input word.

13
00:00:42,330 --> 00:00:44,880
This numerical representation
can also be called

14
00:00:44,880 --> 00:00:47,163
a feature vector,
or a feature tensor.

15
00:00:49,080 --> 00:00:51,030
Let's dive into
this representation.

16
00:00:51,030 --> 00:00:52,740
It contains one vector per word

17
00:00:52,740 --> 00:00:54,540
that was passed
through the encoder.

18
00:00:56,130 --> 00:00:58,620
Each of these vectors is
a numerical representation

19
00:00:58,620 --> 00:01:00,033
of the word in question.

20
00:01:01,080 --> 00:01:03,300
The dimension of
that vector is defined

21
00:01:03,300 --> 00:01:05,520
by the architecture of the model.

22
00:01:05,520 --> 00:01:08,703
For the base BERT model, it is 768.

23
00:01:10,650 --> 00:01:13,230
These representations contain
the value of a word,

24
00:01:13,230 --> 00:01:15,240
but contextualized.

25
00:01:15,240 --> 00:01:18,570
For example, the vector
attributed to the word "to"

26
00:01:18,570 --> 00:01:22,290
isn't the representation
of only the "to" word.

27
00:01:22,290 --> 00:01:25,650
It also takes into account
the words around it,

28
00:01:25,650 --> 00:01:27,363
which we call the context.

29
00:01:28,650 --> 00:01:30,780
As in, it looks at the left context,

30
00:01:30,780 --> 00:01:32,970
the words on the left of
the one we're studying,

31
00:01:32,970 --> 00:01:34,980
here the word "Welcome",

32
00:01:34,980 --> 00:01:37,497
and the context on the right,
here the word "NYC",

33
00:01:38,348 --> 00:01:42,000
and it outputs a value for
the word given its context.

34
00:01:42,000 --> 00:01:45,420
It is therefore
a contextualized value.

35
00:01:45,420 --> 00:01:48,810
One could say that
the vector of 768 values

36
00:01:48,810 --> 00:01:51,993
holds the meaning of
the word within the text.

37
00:01:53,310 --> 00:01:56,073
It does this thanks to
the self-attention mechanism.

38
00:01:57,240 --> 00:02:00,630
The self-attention mechanism
relates different positions,

39
00:02:00,630 --> 00:02:02,850
or different words,
in a single sequence

40
00:02:02,850 --> 00:02:06,003
in order to compute a
representation of that sequence.

41
00:02:07,200 --> 00:02:09,000
As we've seen before, this means that

42
00:02:09,000 --> 00:02:11,130
the resulting representation of a word

43
00:02:11,130 --> 00:02:13,983
has been affected by other
words in the sequence.
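A minimal sketch of the step described above, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint as one example choice: passing "Welcome to NYC" through the encoder yields one 768-dimensional contextualized vector per token.

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Pass the three words "Welcome to NYC" through the encoder.
inputs = tokenizer("Welcome to NYC", return_tensors="pt")
outputs = model(**inputs)

# One 768-dimensional feature vector per token, including the
# special [CLS] and [SEP] tokens added by the tokenizer,
# e.g. torch.Size([1, 5, 768]) if each word maps to one token.
print(outputs.last_hidden_state.shape)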
44
00:02:15,840 --> 00:02:18,030
We won't dive into
the specifics here,

45
00:02:18,030 --> 00:02:19,680
but we'll offer some
further readings

46
00:02:19,680 --> 00:02:21,330
if you want to get a
better understanding

47
00:02:21,330 --> 00:02:22,953
of what happens under the hood.

48
00:02:25,050 --> 00:02:27,480
So why should one use an encoder?

49
00:02:27,480 --> 00:02:29,370
Encoders can be used
as stand-alone models

50
00:02:29,370 --> 00:02:31,263
in a wide variety of tasks.

51
00:02:32,100 --> 00:02:33,360
For example, BERT,

52
00:02:33,360 --> 00:02:35,670
arguably the most famous
transformer model,

53
00:02:35,670 --> 00:02:37,590
is a standalone encoder model,

54
00:02:37,590 --> 00:02:38,820
and at the time of release,

55
00:02:38,820 --> 00:02:40,440
it was the state of the art

56
00:02:40,440 --> 00:02:42,780
in many sequence
classification tasks,

57
00:02:42,780 --> 00:02:44,190
question answering tasks,

58
00:02:44,190 --> 00:02:46,743
and masked language modeling,
to cite only a few.

59
00:02:48,150 --> 00:02:50,460
The idea is that encoders
are very powerful

60
00:02:50,460 --> 00:02:52,470
at extracting vectors that carry

61
00:02:52,470 --> 00:02:55,350
meaningful information
about a sequence.

62
00:02:55,350 --> 00:02:57,870
These vectors can then be
handled down the road

63
00:02:57,870 --> 00:03:00,070
by additional neurons
to make sense of them.

64
00:03:01,380 --> 00:03:02,850
Let's take a look at some examples

65
00:03:02,850 --> 00:03:04,563
where encoders really shine.

66
00:03:06,210 --> 00:03:09,900
First of all, Masked
Language Modeling, or MLM.

67
00:03:09,900 --> 00:03:11,970
It's the task of predicting
a hidden word

68
00:03:11,970 --> 00:03:13,590
in a sequence of words.

69
00:03:13,590 --> 00:03:15,630
Here, for example, we
have hidden the word

70
00:03:15,630 --> 00:03:17,247
between "My" and "is".

71
00:03:18,270 --> 00:03:21,120
This is one of the objectives
with which BERT was trained.

72
00:03:21,120 --> 00:03:24,393
It was trained to predict
hidden words in a sequence.

73
00:03:25,230 --> 00:03:27,930
Encoders shine in this
scenario in particular,

74
00:03:27,930 --> 00:03:31,140
as bidirectional information
is crucial here.

75
00:03:31,140 --> 00:03:32,947
If we didn't have the
words on the right,

76
00:03:32,947 --> 00:03:34,650
"is", "Sylvain" and the ".",

77
00:03:34,650 --> 00:03:35,940
then there is very little chance

78
00:03:35,940 --> 00:03:38,580
that BERT would have been
able to identify "name"

79
00:03:38,580 --> 00:03:40,500
as the correct word.

80
00:03:40,500 --> 00:03:42,270
The encoder needs to have
a good understanding

81
00:03:42,270 --> 00:03:45,360
of the sequence in order
to predict a masked word,

82
00:03:45,360 --> 00:03:48,840
as even if the text is
grammatically correct,

83
00:03:48,840 --> 00:03:50,610
it does not necessarily make sense

84
00:03:50,610 --> 00:03:52,413
in the context of the sequence.
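A minimal sketch of masked language modeling, assuming the transformers fill-mask pipeline and bert-base-cased as one example checkpoint (BERT's mask placeholder is [MASK]):

from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-cased")

# Hide the word between "My" and "is"; BERT uses both the left
# and the right context to fill it back in, so "name" should
# rank among the top predictions.
for prediction in unmasker("My [MASK] is Sylvain."):
    print(prediction["token_str"], round(prediction["score"], 3))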
85
00:03:55,230 --> 00:03:56,580
As mentioned earlier,

86
00:03:56,580 --> 00:03:59,520
encoders are good at doing
sequence classification.

87
00:03:59,520 --> 00:04:02,883
Sentiment analysis is an example
of sequence classification.

88
00:04:04,410 --> 00:04:09,410
The model's aim is to identify
the sentiment of a sequence.

89
00:04:09,540 --> 00:04:11,280
It can range from giving a sequence

90
00:04:11,280 --> 00:04:12,960
a rating from one to five stars,

91
00:04:12,960 --> 00:04:15,900
if doing review analysis,
to giving a positive

92
00:04:15,900 --> 00:04:17,820
or negative rating to a sequence,

93
00:04:17,820 --> 00:04:19,220
which is what is shown here.

94
00:04:20,280 --> 00:04:22,950
For example, here,
given the two sequences,

95
00:04:22,950 --> 00:04:25,860
we use the model to
compute a prediction,

96
00:04:25,860 --> 00:04:27,420
and to classify the sequences

97
00:04:27,420 --> 00:04:30,393
among these two classes,
positive and negative.

98
00:04:31,230 --> 00:04:33,450
While the two sequences
are very similar,

99
00:04:33,450 --> 00:04:35,220
containing the same words,

100
00:04:35,220 --> 00:04:37,170
the meaning is entirely different,

101
00:04:37,170 --> 00:04:40,143
and the encoder model is able
to grasp that difference.

102
00:04:41,404 --> 00:04:44,154
(outro striking)
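A minimal sketch of this kind of sequence classification, assuming the transformers sentiment-analysis pipeline with its default checkpoint; the two sequences are a hypothetical pair (not the exact ones shown in the video) that contain the same words but carry opposite meanings:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")

# Same words, opposite meanings; an encoder-based classifier
# can tell the two apart thanks to contextualized vectors.
sequences = [
    "The movie was good, not bad at all.",
    "The movie was bad, not good at all.",
]
for seq, pred in zip(sequences, classifier(sequences)):
    print(seq, "->", pred["label"], round(pred["score"], 3))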