subtitles/en/15_subword-based-tokenizers.srt
1
00:00:06,450 --> 00:00:09,540
- Let's take a look at
subword based tokenization.
2
00:00:09,540 --> 00:00:11,610
Understanding why subword
based tokenization is
3
00:00:11,610 --> 00:00:13,980
interesting requires
understanding the flaws
4
00:00:13,980 --> 00:00:17,340
of word based and character
based tokenization.
5
00:00:17,340 --> 00:00:18,780
If you haven't seen the first videos
6
00:00:18,780 --> 00:00:22,020
on word based and character
based tokenization
7
00:00:22,020 --> 00:00:23,130
we recommend you check them
8
00:00:23,130 --> 00:00:24,780
out before looking at this video.
9
00:00:27,840 --> 00:00:31,493
Subword based tokenization
lies in between character based
10
00:00:31,493 --> 00:00:35,280
and word based tokenization algorithms.
11
00:00:35,280 --> 00:00:37,410
The idea is to find a middle ground
12
00:00:37,410 --> 00:00:39,486
between very large vocabularies,
13
00:00:39,486 --> 00:00:42,600
a large quantity of out-of-vocabulary tokens,
14
00:00:42,600 --> 00:00:45,360
and a loss of meaning
across very similar words
15
00:00:45,360 --> 00:00:48,630
for word based tokenizers,
and very long sequences
16
00:00:48,630 --> 00:00:51,330
as well as less meaningful
individual tokens
17
00:00:51,330 --> 00:00:53,133
for character based tokenizers.
18
00:00:54,840 --> 00:00:57,960
These algorithms rely on
the following principle.
19
00:00:57,960 --> 00:01:00,000
Frequently used words should not be split
20
00:01:00,000 --> 00:01:01,500
into smaller subwords
21
00:01:01,500 --> 00:01:03,433
while rare words should be decomposed
22
00:01:03,433 --> 00:01:05,103
into meaningful subwords.
23
00:01:06,510 --> 00:01:08,460
An example is the word dog.
24
00:01:08,460 --> 00:01:11,190
We would like our
tokenizer to have a single ID
25
00:01:11,190 --> 00:01:12,600
for the word dog rather
26
00:01:12,600 --> 00:01:15,363
than splitting it into the
characters 'd', 'o' and 'g'.
27
00:01:16,650 --> 00:01:19,260
However, when encountering the word dogs
28
00:01:19,260 --> 00:01:22,710
we would like our tokenizer to
understand that at the root
29
00:01:22,710 --> 00:01:24,120
this is still the word dog,
30
00:01:24,120 --> 00:01:27,030
with an added 's' that
slightly changes the meaning
31
00:01:27,030 --> 00:01:28,923
while keeping the original idea.
32
00:01:30,600 --> 00:01:34,080
Another example is a complex
word like tokenization
33
00:01:34,080 --> 00:01:37,140
which can be split into
meaningful subwords.
34
00:01:37,140 --> 00:01:37,973
The root
35
00:01:37,973 --> 00:01:40,590
of the word is token and
-ization completes the root
36
00:01:40,590 --> 00:01:42,870
to give it a slightly different meaning.
37
00:01:42,870 --> 00:01:44,430
It makes sense to split the word
38
00:01:44,430 --> 00:01:47,640
into two: token as the root of the word,
39
00:01:47,640 --> 00:01:49,950
labeled as the start of the word
40
00:01:49,950 --> 00:01:52,530
and ization as additional
information labeled
41
00:01:52,530 --> 00:01:54,393
as a completion of the word.
42
00:01:55,826 --> 00:01:58,740
In turn, the model will
now be able to make sense
43
00:01:58,740 --> 00:02:01,080
of token in different situations.
44
00:02:01,080 --> 00:02:04,602
It will understand that the
words token, tokens, tokenizing
45
00:02:04,602 --> 00:02:08,760
and tokenization have a
similar meaning and are linked.
46
00:02:08,760 --> 00:02:12,450
It will also understand that
tokenization, modernization
47
00:02:12,450 --> 00:02:16,200
and immunization, which
all have the same suffix,
48
00:02:16,200 --> 00:02:19,383
are probably used in the
same syntactic situations.
49
00:02:20,610 --> 00:02:23,130
Subword based tokenizers
generally have a way to
50
00:02:23,130 --> 00:02:25,890
identify which tokens are starts of words
51
00:02:25,890 --> 00:02:28,443
and which tokens complete starts of words.
52
00:02:29,520 --> 00:02:31,140
So here, token is the start
53
00:02:31,140 --> 00:02:35,100
of a word, and ##ization
is the completion of a word.
54
00:02:35,100 --> 00:02:38,103
Here, the ## prefix
indicates that ization is part
55
00:02:38,103 --> 00:02:41,013
of a word rather than the beginning of it.
56
00:02:41,910 --> 00:02:43,110
The ## prefix comes
57
00:02:43,110 --> 00:02:47,013
from the BERT tokenizer, based
on the WordPiece algorithm.
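To make the ## convention above concrete, here is a minimal Python sketch using the transformers library with the bert-base-uncased checkpoint (chosen here as an assumption, it is not named in the video); a frequent word like dog is kept whole, while tokenization is split into a word start and a ##-prefixed completion.

from transformers import AutoTokenizer

# Assumption: bert-base-uncased, a WordPiece tokenizer, is used purely for illustration.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("dog"))           # frequent word kept as a single token: ['dog']
print(tokenizer.tokenize("tokenization"))  # rarer word decomposed, e.g. ['token', '##ization']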
58
00:02:47,850 --> 00:02:50,700
Other tokenizers use other
prefixes, which can be
59
00:02:50,700 --> 00:02:52,200
placed to indicate parts of words
60
00:02:52,200 --> 00:02:55,083
like here, or starts of words instead.
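As a hedged sketch of such an alternative convention, the snippet below assumes the xlnet-base-cased checkpoint (a SentencePiece-based tokenizer picked here as an example); it marks tokens that start a word with a leading '▁' instead of marking completions with '##'.

from transformers import AutoTokenizer

# Assumption: xlnet-base-cased is used only to show a start-of-word marker convention.
tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

# Word-initial tokens carry a leading '▁'; the exact splits depend on the learned vocabulary.
print(tokenizer.tokenize("Let's talk about tokenization."))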
61
00:02:56,250 --> 00:02:57,083
There are a lot
62
00:02:57,083 --> 00:02:58,740
of different algorithms that can be used
63
00:02:58,740 --> 00:03:00,090
for subword tokenization
64
00:03:00,090 --> 00:03:02,670
and most models obtaining
state-of-the-art results
65
00:03:02,670 --> 00:03:03,780
in English today
66
00:03:03,780 --> 00:03:06,663
use some kind of subword
tokenization algorithm.
67
00:03:07,620 --> 00:03:10,953
These approaches help
reduce vocabulary sizes
68
00:03:10,953 --> 00:03:13,636
by sharing information
across different words,
69
00:03:13,636 --> 00:03:15,960
and by having prefixes
70
00:03:15,960 --> 00:03:18,630
and suffixes understood as such.
71
00:03:18,630 --> 00:03:20,700
They keep meaning across
very similar words
72
00:03:20,700 --> 00:03:23,103
by recognizing the similar
tokens making them up.
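As a rough illustration of the vocabulary-size point, the sketch below compares two subword tokenizers (the checkpoints are assumptions made for this example); both vocabularies stay in the tens of thousands of tokens, far smaller than the hundreds of thousands of entries a purely word based vocabulary can require.

from transformers import AutoTokenizer

# Assumption: these checkpoints are chosen only to show typical subword vocabulary sizes.
for checkpoint in ["bert-base-uncased", "gpt2"]:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # WordPiece (BERT) and byte-level BPE (GPT-2) both keep the vocabulary around 30k-50k tokens.
    print(checkpoint, tokenizer.vocab_size)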