1 00:00:00,554 --> 00:00:03,304 (logo whooshing)
2 00:00:05,340 --> 00:00:07,563 - What happens inside the pipeline function?
3 00:00:08,760 --> 00:00:11,580 In this video, we will look at what actually happens
4 00:00:11,580 --> 00:00:13,080 when we use the pipeline function
5 00:00:13,080 --> 00:00:15,090 of the Transformers library.
6 00:00:15,090 --> 00:00:16,860 More specifically, we will look
7 00:00:16,860 --> 00:00:19,200 at the sentiment analysis pipeline,
8 00:00:19,200 --> 00:00:22,020 and how it went from the two following sentences
9 00:00:22,020 --> 00:00:23,970 to the positive and negative labels
10 00:00:23,970 --> 00:00:25,420 with their respective scores.
11 00:00:26,760 --> 00:00:29,190 As we have seen in the pipeline presentation,
12 00:00:29,190 --> 00:00:31,860 there are three stages in the pipeline.
13 00:00:31,860 --> 00:00:34,620 First, we convert the raw texts to numbers
14 00:00:34,620 --> 00:00:37,173 the model can make sense of, using a tokenizer.
15 00:00:38,010 --> 00:00:40,530 Then those numbers go through the model,
16 00:00:40,530 --> 00:00:41,943 which outputs logits.
17 00:00:42,780 --> 00:00:45,600 Finally, the post-processing step transforms
18 00:00:45,600 --> 00:00:48,150 those logits into labels and scores.
19 00:00:48,150 --> 00:00:50,700 Let's look in detail at those three steps
20 00:00:50,700 --> 00:00:53,640 and how to replicate them using the Transformers library,
21 00:00:53,640 --> 00:00:56,043 beginning with the first stage, tokenization.
22 00:00:57,915 --> 00:01:00,360 The tokenization process has several steps.
23 00:01:00,360 --> 00:01:04,950 First, the text is split into small chunks called tokens.
24 00:01:04,950 --> 00:01:08,550 They can be words, parts of words, or punctuation symbols.
25 00:01:08,550 --> 00:01:11,580 Then the tokenizer will add some special tokens,
26 00:01:11,580 --> 00:01:13,500 if the model expects them.
27 00:01:13,500 --> 00:01:16,860 Here the model expects a CLS token at the beginning
28 00:01:16,860 --> 00:01:19,743 and a SEP token at the end of the sentence to classify.
29 00:01:20,580 --> 00:01:24,180 Lastly, the tokenizer matches each token to its unique ID
30 00:01:24,180 --> 00:01:27,000 in the vocabulary of the pretrained model.
31 00:01:27,000 --> 00:01:28,680 To load such a tokenizer,
32 00:01:28,680 --> 00:01:31,743 the Transformers library provides the AutoTokenizer API.
33 00:01:32,730 --> 00:01:36,120 The most important method of this class is from_pretrained,
34 00:01:36,120 --> 00:01:38,910 which will download and cache the configuration
35 00:01:38,910 --> 00:01:41,853 and the vocabulary associated with a given checkpoint.
36 00:01:43,200 --> 00:01:45,360 Here the checkpoint used by default
37 00:01:45,360 --> 00:01:47,280 for the sentiment analysis pipeline
38 00:01:47,280 --> 00:01:51,986 is distilbert-base-uncased-finetuned-sst-2-english.
39 00:01:51,986 --> 00:01:53,700 (indistinct)
40 00:01:53,700 --> 00:01:56,490 We instantiate a tokenizer associated with that checkpoint,
41 00:01:56,490 --> 00:01:59,490 then feed it the two sentences.
42 00:01:59,490 --> 00:02:02,100 Since those two sentences are not of the same size,
43 00:02:02,100 --> 00:02:03,930 we will need to pad the shortest one
44 00:02:03,930 --> 00:02:06,030 to be able to build an array.
45 00:02:06,030 --> 00:02:09,840 This is done by the tokenizer with the option padding=True.
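As a rough illustration of this tokenization stage, here is a minimal Python sketch using the default checkpoint mentioned above; the two example sentences are placeholders rather than the exact sentences shown in the video, so the printed values will differ accordingly.

from transformers import AutoTokenizer

# Default checkpoint of the sentiment analysis pipeline, as mentioned above.
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Placeholder sentences standing in for the two sentences from the video.
raw_inputs = [
    "I've been waiting for a course like this my whole life.",
    "I hate this so much!",
]

# Pad the shorter sentence, truncate anything longer than the model's maximum,
# and return PyTorch tensors.
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)  # a dictionary with the keys input_ids and attention_mask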
46 00:02:09,840 --> 00:02:12,810 With truncation=True, we ensure that any sentence
47 00:02:12,810 --> 00:02:15,873 longer than the maximum the model can handle is truncated.
48 00:02:17,010 --> 00:02:19,620 Lastly, the return_tensors option
49 00:02:19,620 --> 00:02:22,323 tells the tokenizer to return a PyTorch tensor.
50 00:02:23,190 --> 00:02:25,590 Looking at the result, we see we have a dictionary
51 00:02:25,590 --> 00:02:26,670 with two keys.
52 00:02:26,670 --> 00:02:29,970 The first key, input IDs, contains the IDs of both sentences,
53 00:02:29,970 --> 00:02:32,550 with zeros where padding is applied.
54 00:02:32,550 --> 00:02:34,260 The second key, attention mask,
55 00:02:34,260 --> 00:02:36,150 indicates where padding has been applied,
56 00:02:36,150 --> 00:02:38,940 so the model does not pay attention to it.
57 00:02:38,940 --> 00:02:42,090 That is all there is inside the tokenization step.
58 00:02:42,090 --> 00:02:46,289 Now, let's have a look at the second step, the model.
59 00:02:46,289 --> 00:02:47,952 As for the tokenizer,
60 00:02:47,952 --> 00:02:51,133 there is an AutoModel API with a from_pretrained method.
61 00:02:51,133 --> 00:02:53,954 It will download and cache the configuration of the model
62 00:02:53,954 --> 00:02:56,280 as well as the pretrained weights.
63 00:02:56,280 --> 00:02:58,200 However, the AutoModel API
64 00:02:58,200 --> 00:03:00,630 will only instantiate the body of the model,
65 00:03:00,630 --> 00:03:03,420 that is, the part of the model that is left
66 00:03:03,420 --> 00:03:06,090 once the pretraining head is removed.
67 00:03:06,090 --> 00:03:08,610 It will output a high-dimensional tensor
68 00:03:08,610 --> 00:03:11,220 that is a representation of the sentences passed,
69 00:03:11,220 --> 00:03:12,690 but which is not directly useful
70 00:03:12,690 --> 00:03:15,030 for our classification problem.
71 00:03:15,030 --> 00:03:19,230 Here the tensor has two sentences, each of 16 tokens,
72 00:03:19,230 --> 00:03:23,433 and the last dimension is the hidden size of our model, 768.
73 00:03:24,900 --> 00:03:27,510 To get an output linked to our classification problem,
74 00:03:27,510 --> 00:03:31,170 we need to use the AutoModelForSequenceClassification class.
75 00:03:31,170 --> 00:03:33,330 It works exactly like the AutoModel class,
76 00:03:33,330 --> 00:03:35,130 except that it will build a model
77 00:03:35,130 --> 00:03:36,543 with a classification head.
78 00:03:37,483 --> 00:03:39,560 There is one auto class for each common NLP task
79 00:03:39,560 --> 00:03:40,960 in the Transformers library.
80 00:03:42,150 --> 00:03:45,570 Here, after giving our model the two sentences,
81 00:03:45,570 --> 00:03:47,820 we get a tensor of size two by two:
82 00:03:47,820 --> 00:03:50,943 one result for each sentence and for each possible label.
83 00:03:51,840 --> 00:03:53,970 Those outputs are not probabilities yet;
84 00:03:53,970 --> 00:03:56,100 we can see they don't sum to 1.
85 00:03:56,100 --> 00:03:57,270 This is because each model
86 00:03:57,270 --> 00:04:00,810 of the Transformers library returns logits.
87 00:04:00,810 --> 00:04:02,250 To make sense of those logits,
88 00:04:02,250 --> 00:04:05,910 we need to dig into the third and last step of the pipeline:
89 00:04:05,910 --> 00:04:10,620 post-processing. To convert logits into probabilities,
90 00:04:10,620 --> 00:04:13,470 we need to apply a SoftMax layer to them.
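Continuing the sketch above (same assumed checkpoint, placeholder sentences, and the inputs variable from the previous snippet), the model stage and the softmax could look roughly like this; the exact sequence length in the hidden-state shape depends on the sentences used.

import torch
from transformers import AutoModel, AutoModelForSequenceClassification

# Body of the model only: returns hidden states of shape
# (batch size, sequence length, hidden size), e.g. (2, 16, 768) in the video.
model = AutoModel.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

# Same checkpoint, but with the sequence classification head attached:
# the output is one logit per sentence and per label, here a 2 x 2 tensor.
classifier = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = classifier(**inputs)
print(outputs.logits.shape)  # torch.Size([2, 2])

# Post-processing: a softmax turns the logits into probabilities.
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)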
91 00:04:13,470 --> 00:04:14,610 As we can see,
92 00:04:14,610 --> 00:04:17,267 this transforms them into positive numbers
93 00:04:17,267 --> 00:04:18,663 that sum up to one.
94 00:04:18,663 --> 00:04:21,360 The last step is to know which of those corresponds
95 00:04:21,360 --> 00:04:23,580 to the positive or the negative label.
96 00:04:23,580 --> 00:04:28,020 This is given by the id2label field of the model config.
97 00:04:28,020 --> 00:04:30,390 The first probabilities, at index zero,
98 00:04:30,390 --> 00:04:32,250 correspond to the negative label,
99 00:04:32,250 --> 00:04:34,140 and the second ones, at index one,
100 00:04:34,140 --> 00:04:36,480 correspond to the positive label.
101 00:04:36,480 --> 00:04:37,950 This is how our classifier built
102 00:04:37,950 --> 00:04:40,230 with the pipeline function picked those labels
103 00:04:40,230 --> 00:04:42,240 and computed those scores.
104 00:04:42,240 --> 00:04:44,220 Now that you know how each step works,
105 00:04:44,220 --> 00:04:46,220 you can easily tweak them to your needs.
106 00:04:47,524 --> 00:04:50,274 (logo whooshing)
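To close the walkthrough in code, here is a short sketch (reusing the classifier, raw_inputs, and predictions names assumed in the earlier snippets) of how the id2label field of the model config can be used to pair each probability with its label.

# id2label maps each output index to a label,
# e.g. {0: "NEGATIVE", 1: "POSITIVE"} for this checkpoint.
id2label = classifier.config.id2label

for sentence, probs in zip(raw_inputs, predictions):
    scores = {id2label[i]: round(float(p), 4) for i, p in enumerate(probs)}
    print(sentence, scores)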