1
00:00:00,165 --> 00:00:01,416
(screen whooshing)
2
00:00:01,416 --> 00:00:02,716
(sticker popping)
3
00:00:02,716 --> 00:00:03,549
(screen whooshing)
4
00:00:03,549 --> 00:00:05,603
- Let's take a look at
word-based tokenization.
5
00:00:07,650 --> 00:00:09,780
Word-based tokenization is the idea
6
00:00:09,780 --> 00:00:11,940
of splitting the raw text into words
7
00:00:11,940 --> 00:00:14,673
by splitting on spaces
or other specific rules,
8
00:00:16,020 --> 00:00:17,163
like punctuation.
9
00:00:18,900 --> 00:00:21,810
In this algorithm, each
word has a specific number
10
00:00:21,810 --> 00:00:23,463
or ID attributed to it.
11
00:00:24,360 --> 00:00:27,270
Here, "let's" has the ID 250,
12
00:00:27,270 --> 00:00:30,150
"do" has 861, and "tokenization"
13
00:00:30,150 --> 00:00:33,393
followed by an exclamation mark has 345.
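As a rough sketch of this mapping (the code and vocabulary below are illustrative, not the exact rules used in the video), a naive word-based tokenizer can simply split on spaces and look each word up in a dictionary of IDs:

```python
# Hand-written vocabulary for this example: each word gets an arbitrary ID.
vocab = {"Let's": 250, "do": 861, "tokenization!": 345}

def word_tokenize(text):
    # The simplest word-based rule: split the raw text on spaces.
    return text.split()

tokens = word_tokenize("Let's do tokenization!")
print(tokens)                              # ["Let's", 'do', 'tokenization!']
print([vocab[token] for token in tokens])  # [250, 861, 345]
```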
14
00:00:34,380 --> 00:00:36,000
This approach is interesting
15
00:00:36,000 --> 00:00:38,100
as the model has representations
16
00:00:38,100 --> 00:00:40,233
that are based on entire words.
17
00:00:42,720 --> 00:00:45,960
The information held in
a single number is high,
18
00:00:45,960 --> 00:00:48,240
as a word contains a lot of contextual
19
00:00:48,240 --> 00:00:49,803
and semantic information.
20
00:00:53,070 --> 00:00:55,473
However, this approach
does have its limits.
21
00:00:56,610 --> 00:01:00,570
For example, the word "dog" and
the word "dogs" are very similar
22
00:01:00,570 --> 00:01:01,923
and their meanings are close.
23
00:01:03,210 --> 00:01:05,550
Word-based tokenization, however,
24
00:01:05,550 --> 00:01:08,520
will attribute entirely
different IDs to these two words
25
00:01:08,520 --> 00:01:10,110
and the model will therefore learn
26
00:01:10,110 --> 00:01:12,930
two different embeddings
for these two words.
27
00:01:12,930 --> 00:01:15,090
This is unfortunate as
we would like the model
28
00:01:15,090 --> 00:01:18,240
to understand that these
words are indeed related,
29
00:01:18,240 --> 00:01:21,483
and that "dogs" is simply the
plural form of the word "dog".
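To illustrate that consequence in code (a sketch assuming PyTorch and made-up IDs, not taken from the video): two distinct IDs select two independent rows of the embedding matrix, so nothing ties the vectors for "dog" and "dogs" together.

```python
import torch
from torch import nn

# Made-up IDs: "dog" and "dogs" are unrelated entries in the vocabulary.
vocab = {"dog": 5, "dogs": 6}

# One independently learned vector per vocabulary entry.
embeddings = nn.Embedding(num_embeddings=10, embedding_dim=4)

dog_vec = embeddings(torch.tensor(vocab["dog"]))
dogs_vec = embeddings(torch.tensor(vocab["dogs"]))
print(dog_vec)   # nothing links this vector...
print(dogs_vec)  # ...to this one, even though the words are closely related
```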
30
00:01:22,980 --> 00:01:24,480
Another issue with this approach
31
00:01:24,480 --> 00:01:28,050
is that there are a lot of
different words in a language.
32
00:01:28,050 --> 00:01:29,490
If we want our model to understand
33
00:01:29,490 --> 00:01:32,160
all possible sentences in that language,
34
00:01:32,160 --> 00:01:35,850
then we will need to have an
ID for each different word.
35
00:01:35,850 --> 00:01:37,380
And the total number of words,
36
00:01:37,380 --> 00:01:40,080
which is also known as
the vocabulary size,
37
00:01:40,080 --> 00:01:41,913
can quickly become very large.
38
00:01:44,400 --> 00:01:47,640
This is an issue because each
ID is mapped to a large vector
39
00:01:47,640 --> 00:01:50,190
that represents the word's meaning,
40
00:01:50,190 --> 00:01:52,170
and keeping track of these mappings
41
00:01:52,170 --> 00:01:54,990
requires an enormous number of weights
42
00:01:54,990 --> 00:01:57,123
when the vocabulary size is very large.
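As a back-of-the-envelope sketch (the numbers are purely illustrative): with a word-level vocabulary of 500,000 entries and an embedding dimension of 768, the embedding matrix alone holds hundreds of millions of weights.

```python
# Purely illustrative numbers: a large word-level vocabulary and a typical hidden size.
vocab_size = 500_000
embedding_dim = 768

# One row of the embedding matrix per vocabulary entry.
embedding_weights = vocab_size * embedding_dim
print(embedding_weights)  # 384000000 -> 384 million weights for the embeddings alone
```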
43
00:01:59,160 --> 00:02:00,960
If we want our models to stay lean,
44
00:02:00,960 --> 00:02:04,440
we can opt for our tokenizer
to ignore certain words
45
00:02:04,440 --> 00:02:06,093
that we don't necessarily need.
46
00:02:08,400 --> 00:02:11,970
For example, here, when training
our tokenizer on a text,
47
00:02:11,970 --> 00:02:15,020
we might want to take only
the 10,000 most frequent words
48
00:02:15,020 --> 00:02:16,320
in that text,
49
00:02:16,320 --> 00:02:18,600
rather than taking all
the words in that text
50
00:02:18,600 --> 00:02:22,503
or all of that language's words
to create our base vocabulary.
51
00:02:23,790 --> 00:02:26,520
The tokenizer will know how
to convert those 10,000 words
52
00:02:26,520 --> 00:02:29,370
into numbers, but any other
word will be converted
53
00:02:29,370 --> 00:02:31,530
to the out-of-vocabulary word,
54
00:02:31,530 --> 00:02:33,783
or, as shown here, the unknown word.
55
00:02:35,280 --> 00:02:37,440
Unfortunately, this is a compromise.
56
00:02:37,440 --> 00:02:39,900
The model will have the
exact same representation
57
00:02:39,900 --> 00:02:42,390
for all words that it doesn't know,
58
00:02:42,390 --> 00:02:45,210
which can result in a
lot of lost information
59
00:02:45,210 --> 00:02:47,664
if many unknown words are present.
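A sketch of this trade-off (again illustrative: the corpus, the cutoff, and the "[UNK]" name are assumptions, not from the video): keep only the most frequent words seen during training and collapse every other word to a single unknown token.

```python
from collections import Counter

def build_vocab(corpus, max_size):
    # Count word frequencies in the training text and keep only the most frequent words.
    counts = Counter(word for text in corpus for word in text.split())
    vocab = {"[UNK]": 0}  # reserve an ID for the unknown token
    for word, _ in counts.most_common(max_size):
        vocab[word] = len(vocab)
    return vocab

def encode(text, vocab):
    # Every word outside the vocabulary falls back to the same "[UNK]" ID.
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split()]

corpus = ["the dog runs", "the dogs run", "the cat sleeps"]
vocab = build_vocab(corpus, max_size=4)
print(encode("the dog barks", vocab))  # "barks" was never seen, so it maps to [UNK]
```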
60
00:02:47,664 --> 00:02:50,581
(screen whooshing)