subtitles/en/70_using-a-debugger-in-a-notebook.srt
1
00:00:05,400 --> 00:00:08,150
- [Instructor] Using the
Python debugger in a notebook.
2
00:00:09,540 --> 00:00:12,330
In this video, we'll learn
how to use the Python debugger
3
00:00:12,330 --> 00:00:15,027
in a Jupyter Notebook or a Colab.
4
00:00:15,027 --> 00:00:17,070
For this example, we are running code
5
00:00:17,070 --> 00:00:19,775
from the token classification section,
6
00:00:19,775 --> 00:00:21,513
downloading the CoNLL dataset,
7
00:00:23,670 --> 00:00:25,503
looking a little bit at the data,
8
00:00:27,840 --> 00:00:29,250
before loading a tokenizer
9
00:00:29,250 --> 00:00:31,173
to preprocess the whole dataset.
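(For reference, a minimal sketch of this setup, following the course's token classification section; the dataset and checkpoint names here are assumptions, and the label alignment is simplified:)

from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("conll2003")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_and_align_labels(examples):
    # Tokenize pre-split words; give each token the label of its word,
    # and -100 (the index ignored by the loss) to special tokens.
    tokenized = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = []
    for i, labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        all_labels.append([-100 if w is None else labels[w] for w in word_ids])
    tokenized["labels"] = all_labels
    return tokenized

tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)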
10
00:00:32,880 --> 00:00:34,740
Check out the section of
the course linked below
11
00:00:34,740 --> 00:00:35,823
for more information.
12
00:00:37,080 --> 00:00:38,520
Once this is done,
13
00:00:38,520 --> 00:00:41,580
we try to load eight features
from the training dataset,
14
00:00:41,580 --> 00:00:43,080
and then batch them together,
15
00:00:43,080 --> 00:00:45,210
using tokenizer.pad,
16
00:00:45,210 --> 00:00:46,760
and we get the following error.
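(The failing cell might look like this, reusing the names from the sketch above:)

features = [tokenized_datasets["train"][i] for i in range(8)]
batch = tokenizer.pad(features, padding=True, return_tensors="pt")
# Raises a ValueError telling us to activate truncation and/or padding.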
17
00:00:48,090 --> 00:00:49,230
We use PyTorch here,
18
00:00:49,230 --> 00:00:51,330
with return_tensors="pt"
19
00:00:51,330 --> 00:00:53,273
but you will get the same
error with TensorFlow.
20
00:00:54,120 --> 00:00:55,897
As we have seen in the "How
to debug an error?" video,
21
00:00:55,897 --> 00:00:59,160
the error message is at
the end of the traceback.
22
00:00:59,160 --> 00:01:01,710
Here, it tells us
we should use padding,
23
00:01:01,710 --> 00:01:04,290
which we are actually trying to do.
24
00:01:04,290 --> 00:01:05,610
So this is not useful at all,
25
00:01:05,610 --> 00:01:06,990
and we will need to go a little deeper
26
00:01:06,990 --> 00:01:08,610
to debug the problem.
27
00:01:08,610 --> 00:01:10,650
Fortunately, you can
use the Python debugger
28
00:01:10,650 --> 00:01:13,170
at any time you get an
error in a Jupyter Notebook
29
00:01:13,170 --> 00:01:16,350
by typing the magic
command %debug in a cell.
30
00:01:16,350 --> 00:01:18,450
Don't forget the percent at the beginning.
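(In a fresh cell right after the error, this is simply the IPython magic:)

%debug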
31
00:01:20,400 --> 00:01:21,870
When executing that cell,
32
00:01:21,870 --> 00:01:23,910
you go to the very bottom of the traceback
33
00:01:23,910 --> 00:01:25,320
where you can type commands
34
00:01:25,320 --> 00:01:27,690
that will help you debug your script.
35
00:01:27,690 --> 00:01:29,250
The first two commands you should learn,
36
00:01:29,250 --> 00:01:32,040
are u and d, for up and down.
37
00:01:32,040 --> 00:01:36,090
Typing u and enter will
take you up one step
38
00:01:36,090 --> 00:01:38,910
in the traceback to the
previous instruction.
39
00:01:38,910 --> 00:01:41,190
Typing d and then enter will take you
40
00:01:41,190 --> 00:01:43,023
one step down in the traceback.
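(A sketch of the session; the prompt may read pdb> or ipdb> depending on your setup. Typing u twice moves up two frames, d moves back down one:)

ipdb> u
ipdb> u
ipdb> d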
41
00:01:44,130 --> 00:01:47,910
Going up twice, we get to the
point where the error was raised.
42
00:01:47,910 --> 00:01:51,510
The third command to learn for
the debugger is p, for print.
43
00:01:51,510 --> 00:01:54,780
It allows you to print any value you want.
44
00:01:54,780 --> 00:01:58,740
For instance, typing p
return_tensors and enter,
45
00:01:58,740 --> 00:02:02,893
we see the value pt that we
pass to the pad function.
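(In the session this looks like:)

ipdb> p return_tensors
'pt'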
46
00:02:02,893 --> 00:02:05,370
We can also have a look
at the batch_outputs
47
00:02:05,370 --> 00:02:07,353
this BatchEncoding object gets.
48
00:02:09,480 --> 00:02:12,600
The batch_outputs dictionary
is a bit hard to dig into,
49
00:02:12,600 --> 00:02:15,360
so let's dive into smaller pieces of it.
50
00:02:15,360 --> 00:02:18,390
Inside the debugger you can
not only print any variable
51
00:02:18,390 --> 00:02:20,970
but also evaluate any expression,
52
00:02:20,970 --> 00:02:23,610
for instance, we can have a
look at the input_ids key of
53
00:02:23,610 --> 00:02:25,203
this batch_outputs object.
54
00:02:27,600 --> 00:02:30,693
Or at the labels key of
this batch_outputs object.
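(For instance, still at the debugger prompt:)

ipdb> p batch_outputs.keys()
ipdb> p batch_outputs["input_ids"]
ipdb> p batch_outputs["labels"]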
55
00:02:35,730 --> 00:02:37,320
Those labels are definitely weird:
56
00:02:37,320 --> 00:02:38,970
they are of various sizes,
57
00:02:38,970 --> 00:02:41,340
which we can actually confirm, if we want,
58
00:02:41,340 --> 00:02:43,983
by printing the sizes with
a list comprehension.
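(One way to write it, as a sketch:)

ipdb> p [len(labels) for labels in batch_outputs["labels"]]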
59
00:02:52,290 --> 00:02:54,913
This is because the pad
method of the tokenizer
60
00:02:54,913 --> 00:02:57,090
only takes care of the tokenizer outputs:
61
00:02:57,090 --> 00:03:00,450
input IDs, attention
mask, and token type IDs,
62
00:03:00,450 --> 00:03:02,340
so we have to pad the labels ourselves
63
00:03:02,340 --> 00:03:05,310
before trying to create
a tensor with them.
64
00:03:05,310 --> 00:03:07,260
Once you are ready to
exit the Python debugger,
65
00:03:07,260 --> 00:03:09,453
you can press q and enter for quit.
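(That is:)

ipdb> q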
66
00:03:10,320 --> 00:03:11,670
One way to fix the error
67
00:03:11,670 --> 00:03:14,313
is to manually pad the
labels to the longest.
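(A sketch of this fix; -100 is the label index the loss ignores, and tokenized_datasets comes from the setup above:)

features = [tokenized_datasets["train"][i] for i in range(8)]
max_length = max(len(f["labels"]) for f in features)
for f in features:
    f["labels"] = f["labels"] + [-100] * (max_length - len(f["labels"]))
batch = tokenizer.pad(features, padding=True, return_tensors="pt")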
68
00:03:15,300 --> 00:03:17,400
Another way is to use a data collator
69
00:03:17,400 --> 00:03:19,863
specifically designed
for token classification.
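(A sketch using the library's collator, which pads inputs and labels together:)

from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
batch = data_collator([tokenized_datasets["train"][i] for i in range(8)])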
70
00:03:20,970 --> 00:03:22,950
You can also use the
Python debugger directly
71
00:03:22,950 --> 00:03:23,850
in the terminal.
72
00:03:23,850 --> 00:03:25,943
Check out the video
linked below to learn how.