1
00:00:05,730 --> 00:00:07,590
- Let's study how to preprocess a dataset
2
00:00:07,590 --> 00:00:09,063
for token classification!
3
00:00:10,560 --> 00:00:12,660
Token classification covers any task
4
00:00:12,660 --> 00:00:14,940
that can be framed as labeling each word
5
00:00:14,940 --> 00:00:17,190
or token in a sentence,
6
00:00:17,190 --> 00:00:19,530
like identifying
persons, organizations
7
00:00:19,530 --> 00:00:21,093
and locations for instance.
8
00:00:22,170 --> 00:00:25,470
For our example, we will
use the CoNLL dataset,
9
00:00:25,470 --> 00:00:27,900
in which we remove columns we won't use
10
00:00:27,900 --> 00:00:29,940
and rename the other
ones to get to a dataset
11
00:00:29,940 --> 00:00:32,943
with just two columns, words and labels.
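A minimal sketch of that cleanup, assuming the Hub dataset id "conll2003" and its usual column names (which you should adapt to your own data):

from datasets import load_dataset

# Load CoNLL-2003 and drop the columns we won't use.
raw_datasets = load_dataset("conll2003")
raw_datasets = raw_datasets.remove_columns(["id", "pos_tags", "chunk_tags"])
# Rename the remaining columns to "words" and "labels".
raw_datasets = raw_datasets.rename_column("tokens", "words")
raw_datasets = raw_datasets.rename_column("ner_tags", "labels")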
12
00:00:34,200 --> 00:00:36,750
If you have your own dataset
for token classification,
13
00:00:36,750 --> 00:00:39,870
just make sure you clean your
data to get to the same point,
14
00:00:39,870 --> 00:00:43,290
with one column containing
words as list of strings
15
00:00:43,290 --> 00:00:45,540
and another containing labels as integers
16
00:00:45,540 --> 00:00:48,513
spanning from zero to your
number of labels minus one.
17
00:00:49,740 --> 00:00:52,290
Make sure you have your
label names stored somewhere.
18
00:00:52,290 --> 00:00:54,810
Here we get them from
the dataset features.
19
00:00:54,810 --> 00:00:57,660
So you are able to map the
integers to some real labels
20
00:00:57,660 --> 00:00:58,960
when inspecting your data.
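A sketch of how to grab those names, assuming the labels column is a Sequence of ClassLabel as in CoNLL-2003 (the printed order below is the usual CoNLL-2003 one):

# The ClassLabel feature stores the human-readable label names.
label_names = raw_datasets["train"].features["labels"].feature.names
print(label_names)
# ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']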
21
00:01:00,690 --> 00:01:03,510
Here we are doing named
entity recognition,
22
00:01:03,510 --> 00:01:05,430
so our labels are either O
23
00:01:05,430 --> 00:01:08,310
for words that do not
belong to any entity,
24
00:01:08,310 --> 00:01:13,310
LOC for location, PER for
person, ORG for organization
25
00:01:13,860 --> 00:01:15,603
and MISC for miscellaneous.
26
00:01:16,650 --> 00:01:18,540
Each label has two versions.
27
00:01:18,540 --> 00:01:21,960
The B labels indicate a
word that begins an entity
28
00:01:21,960 --> 00:01:25,503
while the I labels indicate a
word that is inside an entity.
29
00:01:27,180 --> 00:01:28,830
The first step to preprocess our data
30
00:01:28,830 --> 00:01:30,660
is to tokenize the words.
31
00:01:30,660 --> 00:01:33,120
This is very easily
done with the tokenizer.
32
00:01:33,120 --> 00:01:35,370
We just have to tell it we
have pre-tokenized the data
33
00:01:35,370 --> 00:01:37,503
with the flag is_split_into_words=True.
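A sketch of that call, assuming a fast-tokenizer checkpoint such as "bert-base-cased":

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# The words are already split, so we pass is_split_into_words=True.
inputs = tokenizer(raw_datasets["train"][0]["words"], is_split_into_words=True)
print(inputs.tokens())    # includes [CLS], [SEP] and subword pieces
print(inputs.word_ids())  # None for special tokens, the word index otherwise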
34
00:01:38,520 --> 00:01:40,380
Then comes the hard part.
35
00:01:40,380 --> 00:01:42,360
Since we have added special tokens
36
00:01:42,360 --> 00:01:45,270
and each word may have been
split into several tokens,
37
00:01:45,270 --> 00:01:48,090
our labels won't match the tokens anymore.
38
00:01:48,090 --> 00:01:50,670
This is where the word IDs
our fast tokenizer provides
39
00:01:50,670 --> 00:01:51,723
come to the rescue.
40
00:01:53,040 --> 00:01:55,500
They match each token to
the word it belongs to
41
00:01:55,500 --> 00:01:58,470
which allows us to map
each token to its label.
42
00:01:58,470 --> 00:01:59,303
We just have to make sure
43
00:01:59,303 --> 00:02:01,710
we change the B labels
to their I counterparts
44
00:02:01,710 --> 00:02:03,450
for tokens that are inside
45
00:02:03,450 --> 00:02:05,433
but not at the beginning of a word.
46
00:02:06,330 --> 00:02:09,120
The special tokens get a label of -100,
47
00:02:09,120 --> 00:02:11,070
which is how we tell the
Transformer loss functions
48
00:02:11,070 --> 00:02:14,607
to ignore them when computing the loss.
49
00:02:14,607 --> 00:02:16,890
The code is then pretty straightforward.
50
00:02:16,890 --> 00:02:18,660
We write a function that shifts the labels
51
00:02:18,660 --> 00:02:21,840
for tokens that are inside a word,
which you can customize,
52
00:02:21,840 --> 00:02:24,490
and use it when generating
the labels for each token.
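Here is a sketch of such a function, assuming the standard CoNLL-2003 label order in which each B-XXX label has an odd id and its I-XXX counterpart comes right after it:

def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens get -100 so the loss ignores them.
            new_labels.append(-100)
        elif word_id != current_word:
            # The first token of a new word keeps the word's label.
            current_word = word_id
            new_labels.append(labels[word_id])
        else:
            # Later tokens of the same word: switch B-XXX to I-XXX.
            label = labels[word_id]
            if label % 2 == 1:
                label += 1
            new_labels.append(label)
    return new_labels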
53
00:02:25,830 --> 00:02:28,260
Once that function to create
our labels is written,
54
00:02:28,260 --> 00:02:31,920
we can preprocess the whole
dataset using the map function.
55
00:02:31,920 --> 00:02:33,360
With the option batched=True,
56
00:02:33,360 --> 00:02:35,793
we unleash the speed
of our fast tokenizers.
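A sketch of that preprocessing step, reusing the tokenizer and the alignment function above (the column names follow the ones chosen earlier):

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["words"], truncation=True, is_split_into_words=True
    )
    new_labels = []
    for i, labels in enumerate(examples["labels"]):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))
    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)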
57
00:02:37,110 --> 00:02:40,350
The last problem comes when
we need to create a batch.
58
00:02:40,350 --> 00:02:42,150
Unless you changed the
preprocessing function
59
00:02:42,150 --> 00:02:43,890
to apply some fixed padding,
60
00:02:43,890 --> 00:02:45,900
we will get sentences of various lengths,
61
00:02:45,900 --> 00:02:47,900
which we need to pad to the same length.
62
00:02:48,930 --> 00:02:50,730
The padding needs to be
applied to the inputs
63
00:02:50,730 --> 00:02:51,900
as well as the labels,
64
00:02:51,900 --> 00:02:53,950
since we should have one label per token.
65
00:02:54,870 --> 00:02:58,260
Again, -100 indicates the
labels that should be ignored
66
00:02:58,260 --> 00:02:59,510
for the loss computation.
67
00:03:00,420 --> 00:03:01,560
This is all done for us
68
00:03:01,560 --> 00:03:04,050
by the DataCollatorForTokenClassification,
69
00:03:04,050 --> 00:03:06,740
which you can use in
PyTorch or TensorFlow.
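A sketch of the collator in use, with the library defaults:

from transformers import DataCollatorForTokenClassification

# Pads input_ids and attention_mask with the tokenizer's pad token,
# and pads the labels with -100 so the extra positions stay ignored.
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])
print(batch["labels"])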
70
00:03:06,740 --> 00:03:08,880
With all of this, you are
ready to either send your data
71
00:03:08,880 --> 00:03:11,190
and this data collator to the Trainer,
72
00:03:11,190 --> 00:03:13,320
or use the to_tf_dataset method
73
00:03:13,320 --> 00:03:15,333
and the fit method of your model.
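As a final sketch, here is how the two paths might look; the checkpoint, batch size and training arguments are assumptions:

# PyTorch: hand the dataset and collator to the Trainer.
from transformers import (
    AutoModelForTokenClassification, Trainer, TrainingArguments
)

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(label_names)
)
trainer = Trainer(
    model=model,
    args=TrainingArguments("ner-model"),
    train_dataset=tokenized_datasets["train"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

# TensorFlow: build a tf.data.Dataset with the same kind of collator,
# then call model.fit on it.
tf_collator = DataCollatorForTokenClassification(
    tokenizer=tokenizer, return_tensors="tf"
)
tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=tf_collator,
)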