subtitles/zh-CN/70_using-a-debugger-in-a-notebook.srt
1
00:00:05,400 --> 00:00:08,150
- [讲师] 在笔记本中使用 Python 调试器。
- [Instructor] Using the Python debugger in a notebook.
2
00:00:09,540 --> 00:00:12,330
在本视频中,我们将学习如何使用 Python 调试器
In this video, we'll learn how to use the Python debugger
3
00:00:12,330 --> 00:00:15,027
在 Jupyter Notebook 或 Colab 中。
in a Jupyter Notebook or a Colab.
4
00:00:15,027 --> 00:00:17,070
对于这个例子,我们正在运行代码
For this example, we are running code
5
00:00:17,070 --> 00:00:19,775
从令牌分类部分,
from the token classification section,
6
00:00:19,775 --> 00:00:21,513
下载 CoNLL 数据集,
downloading the CoNLL dataset,
7
00:00:23,670 --> 00:00:25,503
稍微看一下数据,
looking a little bit at the data,
8
00:00:27,840 --> 00:00:29,250
在加载分词器之前
before loading a tokenizer
9
00:00:29,250 --> 00:00:31,173
预处理整个数据集。
to preprocess the whole dataset.
10
00:00:32,880 --> 00:00:34,740
查看下面链接的课程部分
Check out the section of the course linked below
11
00:00:34,740 --> 00:00:35,823
了解更多信息。
for more information.
12
00:00:37,080 --> 00:00:38,520
一旦完成,
Once this is done,
13
00:00:38,520 --> 00:00:41,580
我们尝试加载训练数据集的八个特征,
we try to load eight features of the training dataset,
14
00:00:41,580 --> 00:00:43,080
然后将它们批在一起,
and then batch them together,
15
00:00:43,080 --> 00:00:45,210
使用 tokenizer.pad,
using tokenizer.pad,
16
00:00:45,210 --> 00:00:46,760
我们得到以下错误。
and we get the following error.
17
00:00:48,090 --> 00:00:49,230
我们在这里使用 PyTorch,
We use PyTorch here,
18
00:00:49,230 --> 00:00:51,330
使用 return_tensors="pt"
with return_tensors="pt"
19
00:00:51,330 --> 00:00:53,273
但是你会在 TensorFlow 中遇到同样的错误。
but you will get the same error with TensorFlow.
20
00:00:54,120 --> 00:00:55,897
正如我们在“如何调试错误?”视频中看到的那样,
As we have seen in the "How to debug an error?" video,
21
00:00:55,897 --> 00:00:59,160
错误消息在回溯的末尾。
the error message is at the end of the traceback.
22
00:00:59,160 --> 00:01:01,710
在这里,它表明我们应该使用填充,
Here, it indicates that we should use padding,
23
00:01:01,710 --> 00:01:04,290
我们实际上正在尝试这样做。
which we are actually trying to do.
24
00:01:04,290 --> 00:01:05,610
所以这根本没有用,
So this is not useful at all,
25
00:01:05,610 --> 00:01:06,990
我们需要更深入一点
and we will need to go a little deeper
26
00:01:06,990 --> 00:01:08,610
调试问题。
to debug the problem.
27
00:01:08,610 --> 00:01:10,650
幸运的是,你可以使用 Python 调试器
Fortunately, you can use the Python debugger
28
00:01:10,650 --> 00:01:13,170
任何时候你在 Jupyter Notebook 中遇到错误
any time you get an error in a Jupyter Notebook
29
00:01:13,170 --> 00:01:16,350
通过在单元格中键入魔法命令 debug。
by typing the magic command, debug, in a cell.
30
00:01:16,350 --> 00:01:18,450
不要忘记开头的百分号。
Don't forget the percent sign at the beginning.
31
00:01:20,400 --> 00:01:21,870
执行该单元格时,
When executing that cell,
32
00:01:21,870 --> 00:01:23,910
你走到回溯的最底部
you go to the very bottom of the traceback
33
00:01:23,910 --> 00:01:25,320
你可以在其中键入命令
where you can type commands
34
00:01:25,320 --> 00:01:27,690
这将帮助你调试脚本。
that will help you debug your script.
35
00:01:27,690 --> 00:01:29,250
你应该学习的前两个命令,
The first two commands you should learn,
36
00:01:29,250 --> 00:01:32,040
是 u 和 d,代表向上和向下。
are u and d, for up and down.
37
00:01:32,040 --> 00:01:36,090
输入 u 并按回车会让你向上一步
Typing u and enter will take you up one step
38
00:01:36,090 --> 00:01:38,910
在上一条指令的回溯中。
in the traceback to the previous instruction.
39
00:01:38,910 --> 00:01:41,190
输入 d 然后按回车将带你
Typing d and then enter will take you
40
00:01:41,190 --> 00:01:43,023
在追溯中向下迈出一步。
one step down in the traceback.
41
00:01:44,130 --> 00:01:47,910
向上两次,我们就到了发生错误的位置。
Going up twice, we get to the point where the error was reached.
42
00:01:47,910 --> 00:01:51,510
为调试器学习的第三个命令是 p,用于打印。
The third command to learn for the debugger is p, for print.
43
00:01:51,510 --> 00:01:54,780
它允许你打印任何你想要的值。
It allows you to print any value you want.
44
00:01:54,780 --> 00:01:58,740
例如,键入 p return_tensors 并按回车,
For instance, typing p return_tensors and enter,
45
00:01:58,740 --> 00:02:02,893
我们看到传递给错误函数的值 pt。
we see the value pt that we pass to the bad function.
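The u, d, and p commands can be pictured without an interactive session. The sketch below is only an illustration: `pad` and `to_tensor` are hypothetical stand-ins, not the Transformers implementation. It walks the frames of a traceback with the standard library and reads a local variable in the caller's frame, which is roughly what typing u and then p return_tensors does in the debugger.

```python
import traceback


def to_tensor(features):
    # Hypothetical stand-in: fails when the rows have different lengths.
    lengths = {len(row) for row in features}
    if len(lengths) > 1:
        raise ValueError("Unable to create tensor: rows have different lengths")
    return features


def pad(features, return_tensors):
    # Hypothetical stand-in for the call that fails in the video.
    return to_tensor(features)


def frames_with_locals(exc):
    """Return (function name, local variables) per frame, outermost first."""
    return [
        (frame.f_code.co_name, dict(frame.f_locals))
        for frame, _lineno in traceback.walk_tb(exc.__traceback__)
    ]


try:
    pad([[1, 2, 3], [4, 5]], return_tensors="pt")
except ValueError as exc:
    stack = frames_with_locals(exc)

# The debugger starts in the innermost frame (the last entry here);
# `u` moves toward earlier entries and `d` moves back toward the end.
for name, local_vars in stack[-2:]:
    print(name, sorted(local_vars))

# Equivalent of `u` followed by `p return_tensors`.
print(stack[-2][1]["return_tensors"])
```

Outside a notebook, `pdb.post_mortem(tb)` from the standard library opens the same kind of interactive prompt on a traceback object.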
46
00:02:02,893 --> 00:02:05,370
我们还可以查看 batch_outputs,
We can also have a look at the batch_outputs
47
00:02:05,370 --> 00:02:07,353
即这个 BatchEncoding 对象接收到的内容。
this BatchEncoding object gets.
48
00:02:09,480 --> 00:02:12,600
batch_outputs 字典有点难以挖掘,
The batch_outputs dictionary is a bit hard to dig into,
49
00:02:12,600 --> 00:02:15,360
因此,让我们深入研究它的较小部分。
so let's dive into smaller pieces of it.
50
00:02:15,360 --> 00:02:18,390
在调试器内部,你不仅可以打印任何变量
Inside the debugger you can not only print any variable
51
00:02:18,390 --> 00:02:20,970
还要评估任何表达式,
but also evaluate any expression,
52
00:02:20,970 --> 00:02:23,610
例如,我们可以查看 input_ids 键
for instance, we can have a look at the input_ids key
53
00:02:23,610 --> 00:02:25,203
这个 batch_outputs 对象。
of this batch_outputs object.
54
00:02:27,600 --> 00:02:30,693
或者查看这个 batch_outputs 对象的 labels 键。
Or at the labels key of this batch_outputs object.
55
00:02:35,730 --> 00:02:37,320
这些标签肯定很奇怪:
Those labels are definitely weird:
56
00:02:37,320 --> 00:02:38,970
它们大小不一,
they are of various sizes,
57
00:02:38,970 --> 00:02:41,340
我们实际上可以确认,如果我们愿意的话,
which we can actually confirm, if we want,
58
00:02:41,340 --> 00:02:43,983
通过列表推导式打印它们的长度。
by printing the sizes with a list comprehension.
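As a rough illustration of that check, the sketch below uses a toy `batch_outputs` dictionary with made-up values (the real object in the video holds the CoNLL features) and compares the length of each row under both keys.

```python
# Toy stand-in for the batch_outputs object inspected in the debugger;
# the numbers are invented for illustration only.
batch_outputs = {
    "input_ids": [[101, 7, 8, 102], [101, 9, 102, 0]],
    "labels": [[0, 3, 0, 0], [0, 5, 0]],
}

# One length per example: input_ids were padded, labels were not.
input_sizes = [len(ids) for ids in batch_outputs["input_ids"]]
label_sizes = [len(labels) for labels in batch_outputs["labels"]]

print(input_sizes)  # [4, 4]
print(label_sizes)  # [4, 3]
```

In the debugger itself, the same expression can be typed after p, for example `p [len(l) for l in batch_outputs["labels"]]`.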
59
00:02:52,290 --> 00:02:54,913
这是因为 tokenizer 的 pad 方法
This is because the pad method of the tokenizer
60
00:02:54,913 --> 00:02:57,090
只处理分词器输出:
only takes care of the tokenizer outputs:
61
00:02:57,090 --> 00:03:00,450
输入 ID、注意掩码和令牌类型 ID,
input IDs, attention mask, and token type IDs,
62
00:03:00,450 --> 00:03:02,340
所以我们必须自己填充标签
so we have to pad the labels ourselves
63
00:03:02,340 --> 00:03:05,310
在尝试用它们创建张量之前。
before trying to create a tensor with them.
64
00:03:05,310 --> 00:03:07,260
一旦你准备好退出 Python 调试器,
Once you are ready to exit the Python debugger,
65
00:03:07,260 --> 00:03:09,453
你可以按 q 并回车退出。
you can press q and enter to quit.
66
00:03:10,320 --> 00:03:11,670
修复错误的一种方法
One way to fix the error
67
00:03:11,670 --> 00:03:14,313
就是手动把 labels 补到最长。
is to manually pad the labels to the longest.
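A minimal sketch of manual padding is shown below. The helper name `pad_labels` is hypothetical, and the padding value -100 is an assumption not stated in the video: it is the default index that PyTorch's cross-entropy loss ignores, so padded positions would not contribute to the loss.

```python
def pad_labels(label_lists, pad_value=-100):
    """Right-pad every label list to the length of the longest one.

    pad_value=-100 is an assumption: it matches the default ignore_index
    of PyTorch's cross-entropy loss.
    """
    longest = max(len(labels) for labels in label_lists)
    return [
        labels + [pad_value] * (longest - len(labels))
        for labels in label_lists
    ]


# Made-up labels of different lengths.
labels = [[3, 0, 7, 0], [5, 0], [1, 2, 0]]
padded = pad_labels(labels)

print(padded)
# [[3, 0, 7, 0], [5, 0, -100, -100], [1, 2, 0, -100]]
```

Once every row has the same length, the labels can be turned into a tensor together with the padded tokenizer outputs.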
68
00:03:15,300 --> 00:03:17,400
另一种方法是使用数据整理器
Another way is to use a data collator
69
00:03:17,400 --> 00:03:19,863
专为令牌分类而设计。
specifically designed for token classification.
70
00:03:20,970 --> 00:03:22,950
你也可以直接使用 Python 调试器
You can also use the Python debugger directly
71
00:03:22,950 --> 00:03:23,850
在终端。
in the terminal.
72
00:03:23,850 --> 00:03:25,943
查看下面的视频链接以了解操作方法。
Check out the video linked below to learn how.