subtitles/zh-CN/70_using-a-debugger-in-a-notebook.srt

1
00:00:05,400 --> 00:00:08,150
- [Instructor] Using the Python debugger in a notebook.

2
00:00:09,540 --> 00:00:12,330
In this video, we'll learn how to use the Python debugger

3
00:00:12,330 --> 00:00:15,027
in a Jupyter Notebook or a Colab.

4
00:00:15,027 --> 00:00:17,070
For this example, we are running code

5
00:00:17,070 --> 00:00:19,775
from the token classification section,

6
00:00:19,775 --> 00:00:21,513
downloading the CoNLL dataset,

7
00:00:23,670 --> 00:00:25,503
looking a little bit at the data,

8
00:00:27,840 --> 00:00:29,250
before loading a tokenizer

9
00:00:29,250 --> 00:00:31,173
to preprocess the whole dataset.

10
00:00:32,880 --> 00:00:34,740
Check out the section of the course linked below

11
00:00:34,740 --> 00:00:35,823
for more information.

12
00:00:37,080 --> 00:00:38,520
Once this is done,

13
00:00:38,520 --> 00:00:41,580
we try to load eight features of the training dataset,

14
00:00:41,580 --> 00:00:43,080
and then batch them together,

15
00:00:43,080 --> 00:00:45,210
using tokenizer.pad,

16
00:00:45,210 --> 00:00:46,760
and we get the following error.

17
00:00:48,090 --> 00:00:49,230
We use PyTorch here,

18
00:00:49,230 --> 00:00:51,330
with return_tensors="pt",

19
00:00:51,330 --> 00:00:53,273
but you will get the same error with TensorFlow.

20
00:00:54,120 --> 00:00:55,897
As we have seen in the "How to debug an error?" video,

21
00:00:55,897 --> 00:00:59,160
the error message is at the end of the traceback.

22
00:00:59,160 --> 00:01:01,710
Here, it indicates that we should use padding,

23
00:01:01,710 --> 00:01:04,290
which we are actually trying to do.

24
00:01:04,290 --> 00:01:05,610
So this is not useful at all,

25
00:01:05,610 --> 00:01:06,990
and we will need to go a little deeper

26
00:01:06,990 --> 00:01:08,610
to debug the problem.

27
00:01:08,610 --> 00:01:10,650
Fortunately, you can use the Python debugger

28
00:01:10,650 --> 00:01:13,170
any time you get an error in a Jupyter Notebook

29
00:01:13,170 --> 00:01:16,350
by typing the magic command, debug, in a cell.

30
00:01:16,350 --> 00:01:18,450
Don't forget the percent sign at the beginning.

31
00:01:20,400 --> 00:01:21,870
When executing that cell,

32
00:01:21,870 --> 00:01:23,910
you go to the very bottom of the traceback,

33
00:01:23,910 --> 00:01:25,320
where you can type commands

34
00:01:25,320 --> 00:01:27,690
that will help you debug your script.

35
00:01:27,690 --> 00:01:29,250
The first two commands you should learn

36
00:01:29,250 --> 00:01:32,040
are u and d, for up and down.

37
00:01:32,040 --> 00:01:36,090
Typing u and Enter will take you up one step

38
00:01:36,090 --> 00:01:38,910
in the traceback to the previous instruction.

39
00:01:38,910 --> 00:01:41,190
Typing d and then Enter will take you

40
00:01:41,190 --> 00:01:43,023
one step down in the traceback.

41
00:01:44,130 --> 00:01:47,910
Going up twice, we get to the point where the error was reached.

42
00:01:47,910 --> 00:01:51,510
The third command to learn for the debugger is p, for print.

43
00:01:51,510 --> 00:01:54,780
It allows you to print any value you want.

44
00:01:54,780 --> 00:01:58,740
For instance, typing p return_tensors and Enter,

45
00:01:58,740 --> 00:02:02,893
we see the value pt that we pass to the bad function.

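The u, p and q commands described above can be tried outside a notebook with the standard pdb module. The sketch below is only an illustration: pad and convert_to_tensors are hypothetical stand-ins for the real library functions, the error message is simulated, and the traceback here is one level shallower than in the video, so a single u is enough to reach the frame that holds return_tensors.

```python
import io
import pdb
import sys


def convert_to_tensors(features, tensor_type):
    # Hypothetical stand-in for the library code that fails deep in the traceback.
    raise ValueError("simulated failure: ragged lists cannot become a tensor")


def pad(features, return_tensors):
    # Hypothetical stand-in for tokenizer.pad: it only forwards its arguments.
    return convert_to_tensors(features, return_tensors)


def run_post_mortem(commands):
    """Replay debugger commands against the traceback of the failing call."""
    try:
        pad([[1, 2, 3], [4]], return_tensors="pt")
    except ValueError:
        tb = sys.exc_info()[2]
    out = io.StringIO()
    # Feeding commands through stdin keeps the session non-interactive.
    debugger = pdb.Pdb(stdin=io.StringIO(commands), stdout=out, readrc=False)
    debugger.reset()
    debugger.interaction(None, tb)  # start at the bottom frame of the traceback
    return out.getvalue()


# u: go up one frame, p: print a value, q: quit the debugger.
session = run_post_mortem("u\np return_tensors\nq\n")
print(session)
```

In a notebook, running %debug in a new cell after the error gives an interactive prompt that accepts the same commands, so none of this scaffolding is needed there.
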
46
00:02:02,893 --> 00:02:05,370
We can also have a look at the batch outputs

47
00:02:05,370 --> 00:02:07,353
that this BatchEncoding object gets.

48
00:02:09,480 --> 00:02:12,600
The batch outputs dictionary is a bit hard to dig into,

49
00:02:12,600 --> 00:02:15,360
so let's dive into smaller pieces of it.

50
00:02:15,360 --> 00:02:18,390
Inside the debugger, you can not only print any variable

51
00:02:18,390 --> 00:02:20,970
but also evaluate any expression.

52
00:02:20,970 --> 00:02:23,610
For instance, we can have a look at the input_ids key

53
00:02:23,610 --> 00:02:25,203
of this batch_outputs object.

54
00:02:27,600 --> 00:02:30,693
Or at the labels key of this batch_outputs object.

55
00:02:35,730 --> 00:02:37,320
Those labels are definitely weird:

56
00:02:37,320 --> 00:02:38,970
they are of various sizes,

57
00:02:38,970 --> 00:02:41,340
which we can actually confirm, if we want,

58
00:02:41,340 --> 00:02:43,983
by printing the sizes with a list comprehension.

59
00:02:52,290 --> 00:02:54,913
This is because the pad method of the tokenizer

60
00:02:54,913 --> 00:02:57,090
only takes care of the tokenizer outputs:

61
00:02:57,090 --> 00:03:00,450
input IDs, attention mask, and token type IDs,

62
00:03:00,450 --> 00:03:02,340
so we have to pad the labels ourselves

63
00:03:02,340 --> 00:03:05,310
before trying to create a tensor with them.

64
00:03:05,310 --> 00:03:07,260
Once you are ready to exit the Python debugger,

65
00:03:07,260 --> 00:03:09,453
you can press q and Enter to quit.

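The size check described above can be sketched in plain Python. The batch_outputs dictionary below is a hand-written, hypothetical stand-in for the object inspected in the debugger (the values are made up); it only reproduces the situation where input_ids are already padded to one length while labels are not.

```python
# Hypothetical stand-in for the batch_outputs seen in the debugger:
# input_ids were padded by the tokenizer, labels were left untouched.
batch_outputs = {
    "input_ids": [
        [101, 7, 8, 102, 0, 0],
        [101, 9, 102, 0, 0, 0],
        [101, 5, 6, 4, 3, 102],
    ],
    "labels": [
        [-100, 3, 0, -100],
        [-100, 7, -100],
        [-100, 0, 0, 5, 0, -100],
    ],
}

# The same expressions can be typed after p inside the debugger.
input_lengths = [len(ids) for ids in batch_outputs["input_ids"]]
label_lengths = [len(labels) for labels in batch_outputs["labels"]]

print(input_lengths)  # [6, 6, 6]
print(label_lengths)  # [4, 3, 6]

# More than one distinct length means the lists cannot form a rectangular tensor.
is_ragged = len(set(label_lengths)) > 1
print(is_ragged)  # True
```
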
66
00:03:10,320 --> 00:03:11,670
One way to fix the error

67
00:03:11,670 --> 00:03:14,313
is to manually pad the labels to the longest.

68
00:03:15,300 --> 00:03:17,400
Another way is to use a data collator

69
00:03:17,400 --> 00:03:19,863
specifically designed for token classification.

70
00:03:20,970 --> 00:03:22,950
You can also use the Python debugger directly

71
00:03:22,950 --> 00:03:23,850
in the terminal.

72
00:03:23,850 --> 00:03:25,943
Check out the video linked below to learn how.
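
The first fix, padding the labels manually to the longest sequence, might look like the sketch below. The helper name pad_labels and the sample features are hypothetical. The padding value -100 is an assumption: it is the index that PyTorch's cross-entropy loss ignores by default, so padded positions would not contribute to the loss.

```python
def pad_labels(features, pad_value=-100):
    """Return the labels of each feature, right-padded to the longest one.

    pad_value=-100 is an assumption: PyTorch's cross-entropy loss ignores
    that index by default.
    """
    max_length = max(len(feature["labels"]) for feature in features)
    return [
        feature["labels"] + [pad_value] * (max_length - len(feature["labels"]))
        for feature in features
    ]


# Hypothetical features with labels of different lengths.
features = [
    {"labels": [-100, 3, 0, -100]},
    {"labels": [-100, 7, -100]},
    {"labels": [-100, 0, 0, 5, 0, -100]},
]

padded = pad_labels(features)
print([len(labels) for labels in padded])  # [6, 6, 6]
```

Once every list has the same length, the labels can be turned into a tensor alongside the padded tokenizer outputs. The second fix mentioned in the video hands this step to a data collator for token classification instead of doing it by hand.
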