subtitles/en/70_using-a-debugger-in-a-notebook.srt

1
00:00:05,400 --> 00:00:08,150
- [Instructor] Using the Python debugger in a notebook.

2
00:00:09,540 --> 00:00:12,330
In this video, we'll learn how to use the Python debugger

3
00:00:12,330 --> 00:00:15,027
in a Jupyter Notebook or a Colab.

4
00:00:15,027 --> 00:00:17,070
For this example, we are running code

5
00:00:17,070 --> 00:00:19,775
from the token classification section,

6
00:00:19,775 --> 00:00:21,513
downloading the CoNLL dataset,

7
00:00:23,670 --> 00:00:25,503
looking a little bit at the data,

8
00:00:27,840 --> 00:00:29,250
before loading a tokenizer

9
00:00:29,250 --> 00:00:31,173
to preprocess the whole dataset.

10
00:00:32,880 --> 00:00:34,740
Check out the section of the course linked below

11
00:00:34,740 --> 00:00:35,823
for more information.

12
00:00:37,080 --> 00:00:38,520
Once this is done,

13
00:00:38,520 --> 00:00:41,580
we try to load eight features of the training dataset,

14
00:00:41,580 --> 00:00:43,080
and then batch them together

15
00:00:43,080 --> 00:00:45,210
using tokenizer.pad,

16
00:00:45,210 --> 00:00:46,760
and we get the following error.

17
00:00:48,090 --> 00:00:49,230
We use PyTorch here,

18
00:00:49,230 --> 00:00:51,330
with return_tensors="pt",

19
00:00:51,330 --> 00:00:53,273
but you would get the same error with TensorFlow.

20
00:00:54,120 --> 00:00:55,897
As we saw in the "How to debug an error?" video,

21
00:00:55,897 --> 00:00:59,160
the error message is at the end of the traceback.

22
00:00:59,160 --> 00:01:01,710
Here, it tells us we should use padding,

23
00:01:01,710 --> 00:01:04,290
which we are actually trying to do.

24
00:01:04,290 --> 00:01:05,610
So this is not useful at all,

25
00:01:05,610 --> 00:01:06,990
and we will need to go a little deeper

26
00:01:06,990 --> 00:01:08,610
to debug the problem.

27
00:01:08,610 --> 00:01:10,650
Fortunately, you can use the Python debugger

28
00:01:10,650 --> 00:01:13,170
any time you get an error in a Jupyter Notebook

29
00:01:13,170 --> 00:01:16,350
by typing the magic command %debug in a cell.

30
00:01:16,350 --> 00:01:18,450
Don't forget the percent at the beginning.

31
00:01:20,400 --> 00:01:21,870
When executing that cell,

32
00:01:21,870 --> 00:01:23,910
you go to the very bottom of the traceback,

33
00:01:23,910 --> 00:01:25,320
where you can type commands

34
00:01:25,320 --> 00:01:27,690
that will help you debug your script.

35
00:01:27,690 --> 00:01:29,250
The first two commands you should learn

36
00:01:29,250 --> 00:01:32,040
are u and d, for up and down.

37
00:01:32,040 --> 00:01:36,090
Typing u and enter will take you up one step

38
00:01:36,090 --> 00:01:38,910
in the traceback, to the previous instruction.

39
00:01:38,910 --> 00:01:41,190
Typing d and then enter will take you

40
00:01:41,190 --> 00:01:43,023
one step down in the traceback.

41
00:01:44,130 --> 00:01:47,910
Going up twice, we get to the point where the error was raised.

42
00:01:47,910 --> 00:01:51,510
The third command to learn for the debugger is p, for print.

43
00:01:51,510 --> 00:01:54,780
It allows you to print any value you want.

44
00:01:54,780 --> 00:01:58,740
For instance, typing p return_tensors and enter,

45
00:01:58,740 --> 00:02:02,893
we see the value pt that we passed to the pad function.

46
00:02:02,893 --> 00:02:05,370
We can also have a look at the batch_outputs dictionary

47
00:02:05,370 --> 00:02:07,353
this BatchEncoding object gets.

48
00:02:09,480 --> 00:02:12,600
The batch_outputs dictionary is a bit hard to dig into,

49
00:02:12,600 --> 00:02:15,360
so let's dive into smaller pieces of it.
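(A minimal sketch, in code, of the failing setup and the debugger session described above; the token IDs and label values are made up for illustration, and the ipdb commands mirror the ones shown in the video.)

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    # Two preprocessed features whose "labels" lists have different
    # lengths, like the CoNLL features used in the video (-100 marks
    # positions ignored by the loss).
    features = [
        {"input_ids": [101, 2023, 102], "labels": [-100, 3, -100]},
        {"input_ids": [101, 2023, 2003, 2146, 102], "labels": [-100, 3, 0, 7, -100]},
    ]

    # This raises the unhelpful "you should probably activate
    # truncation and/or padding" error:
    batch = tokenizer.pad(features, padding=True, return_tensors="pt")

    # In the next notebook cell, start the post-mortem debugger:
    %debug
    # Then, at the ipdb prompt:
    #   u                        <- up one frame in the traceback
    #   u                        <- up again, to where the error was raised
    #   p return_tensors         <- prints 'pt'
    #   p batch_outputs.keys()   <- any expression can be evaluated
    #   q                        <- quit the debugger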
50
00:02:15,360 --> 00:02:18,390
Inside the debugger, you can not only print any variable

51
00:02:18,390 --> 00:02:20,970
but also evaluate any expression,

52
00:02:20,970 --> 00:02:23,610
for instance, we can have a look at the input_ids key

53
00:02:23,610 --> 00:02:25,203
of this batch_outputs object.

54
00:02:27,600 --> 00:02:30,693
Or at the labels key of this batch_outputs object.

55
00:02:35,730 --> 00:02:37,320
Those labels are definitely weird:

56
00:02:37,320 --> 00:02:38,970
they are of various sizes,

57
00:02:38,970 --> 00:02:41,340
which we can actually confirm, if we want,

58
00:02:41,340 --> 00:02:43,983
by printing their sizes with a list comprehension.

59
00:02:52,290 --> 00:02:54,913
This is because the pad method of the tokenizer

60
00:02:54,913 --> 00:02:57,090
only takes care of the tokenizer outputs:

61
00:02:57,090 --> 00:03:00,450
the input IDs, attention mask, and token type IDs,

62
00:03:00,450 --> 00:03:02,340
so we have to pad the labels ourselves

63
00:03:02,340 --> 00:03:05,310
before trying to create a tensor with them.

64
00:03:05,310 --> 00:03:07,260
Once you are ready to exit the Python debugger,

65
00:03:07,260 --> 00:03:09,453
you can press q and enter, for quit.

66
00:03:10,320 --> 00:03:11,670
One way to fix the error

67
00:03:11,670 --> 00:03:14,313
is to manually pad all the labels to the longest length.

68
00:03:15,300 --> 00:03:17,400
Another way is to use a data collator

69
00:03:17,400 --> 00:03:19,863
specifically designed for token classification.

70
00:03:20,970 --> 00:03:22,950
You can also use the Python debugger directly

71
00:03:22,950 --> 00:03:23,850
in the terminal.

72
00:03:23,850 --> 00:03:25,943
Check out the video linked below to learn how.
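(A sketch of the two fixes mentioned at the end, reusing the features list and tokenizer from the snippet above; -100 is the label padding index ignored by the loss, as in the course.)

    # Fix 1: manually pad the labels to the longest length
    # before creating tensors.
    max_length = max(len(feature["labels"]) for feature in features)
    for feature in features:
        feature["labels"] = feature["labels"] + [-100] * (max_length - len(feature["labels"]))
    batch = tokenizer.pad(features, padding=True, return_tensors="pt")

    # Fix 2: use the data collator designed for token classification,
    # which pads the inputs and the labels (with -100) together, so it
    # also works on the original, unpadded features.
    from transformers import DataCollatorForTokenClassification

    data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
    batch = data_collator(features)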