subtitles/en/14_character-based-tokenizers.srt
1
00:00:00,234 --> 00:00:02,901
(page whirring)
2
00:00:04,260 --> 00:00:07,200
- Before diving into
character-based tokenization,
3
00:00:07,200 --> 00:00:10,350
understanding why this kind
of tokenization is interesting
4
00:00:10,350 --> 00:00:13,533
requires understanding the flaws
of word-based tokenization.
5
00:00:14,640 --> 00:00:16,320
If you haven't seen the first video
6
00:00:16,320 --> 00:00:17,880
on word-based tokenization,
7
00:00:17,880 --> 00:00:21,450
we recommend you check it out
before looking at this video.
8
00:00:21,450 --> 00:00:24,250
Okay, let's take a look at
character-based tokenization.
9
00:00:25,650 --> 00:00:28,560
We now split our text into
individual characters,
10
00:00:28,560 --> 00:00:29,673
rather than words.
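As a minimal Python sketch (not part of the video; the example text is made up), character-based tokenization simply treats every character as a token, in contrast to splitting on words:

text = "Let's do tokenization!"

# Word-based: split on whitespace (a simplistic stand-in for a word tokenizer)
word_tokens = text.split()
print(word_tokens)   # ["Let's", 'do', 'tokenization!']

# Character-based: every single character becomes a token
char_tokens = list(text)
print(char_tokens)   # ['L', 'e', 't', "'", 's', ' ', 'd', 'o', ...]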
11
00:00:32,850 --> 00:00:35,550
There are generally a lot of
different words in languages,
12
00:00:35,550 --> 00:00:37,743
while the number of characters stays low.
13
00:00:38,610 --> 00:00:41,313
To begin, let's take a look
at the English language:
14
00:00:42,210 --> 00:00:45,540
it has an estimated
170,000 different words,
15
00:00:45,540 --> 00:00:47,730
so we would need a very large vocabulary
16
00:00:47,730 --> 00:00:49,413
to encompass all words.
17
00:00:50,280 --> 00:00:52,200
With a character-based vocabulary,
18
00:00:52,200 --> 00:00:55,440
we can get by with only 256 characters,
19
00:00:55,440 --> 00:00:58,683
which include letters,
numbers, and special characters.
20
00:00:59,760 --> 00:01:02,190
Even languages with a lot
of different characters
21
00:01:02,190 --> 00:01:04,800
like Chinese can have dictionaries
22
00:01:04,800 --> 00:01:08,130
with up to 20,000 different characters
23
00:01:08,130 --> 00:01:11,523
but more than 375,000 different words.
24
00:01:12,480 --> 00:01:14,310
So character-based vocabularies
25
00:01:14,310 --> 00:01:16,293
let us use fewer different tokens
26
00:01:16,293 --> 00:01:19,050
than the word-based
tokenization dictionaries
27
00:01:19,050 --> 00:01:20,523
we would otherwise use.
28
00:01:23,250 --> 00:01:25,830
These vocabularies are also more complete
29
00:01:25,830 --> 00:01:28,950
than their word-based
counterparts.
30
00:01:28,950 --> 00:01:31,410
As our vocabulary contains all characters
31
00:01:31,410 --> 00:01:33,960
used in a language, even words unseen
32
00:01:33,960 --> 00:01:36,990
during the tokenizer training
can still be tokenized,
33
00:01:36,990 --> 00:01:39,633
so out-of-vocabulary tokens
will be less frequent.
34
00:01:40,680 --> 00:01:42,840
This includes the ability
to correctly tokenize
35
00:01:42,840 --> 00:01:45,210
misspelled words, rather
than discarding them
36
00:01:45,210 --> 00:01:46,623
as unknown straight away.
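As an illustrative sketch (a toy example, not from the video), a character vocabulary built from training text can still encode an unseen, misspelled word, whereas a word vocabulary falls back to an unknown token:

training_text = "let us do tokenization"

# Character vocabulary: every character seen during training
char_vocab = set(training_text)

# Word vocabulary: every word seen during training
word_vocab = set(training_text.split())

misspelled = "tokenizaton"  # missing an 'i'

# Word-based: the whole word is unknown
print([w if w in word_vocab else "[UNK]" for w in misspelled.split()])
# ['[UNK]']

# Character-based: every character is still in the vocabulary
print([c if c in char_vocab else "[UNK]" for c in misspelled])
# ['t', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'o', 'n']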
37
00:01:48,240 --> 00:01:52,380
However, this algorithm
isn't perfect either.
38
00:01:52,380 --> 00:01:54,360
Intuitively, characters do not hold
39
00:01:54,360 --> 00:01:57,990
as much information individually
as a word would hold.
40
00:01:57,990 --> 00:02:00,930
For example, "Let's"
holds more information
41
00:02:00,930 --> 00:02:03,570
than its first letter "l".
42
00:02:03,570 --> 00:02:05,880
Of course, this is not
true for all languages,
43
00:02:05,880 --> 00:02:08,880
as some, like
ideogram-based languages,
44
00:02:08,880 --> 00:02:11,523
have a lot of information
held in single characters,
45
00:02:12,750 --> 00:02:15,360
but for others, like Roman-based languages,
46
00:02:15,360 --> 00:02:17,760
the model will have to make
sense of multiple tokens
47
00:02:17,760 --> 00:02:20,670
at a time to get the
information otherwise held
48
00:02:20,670 --> 00:02:21,753
in a single word.
49
00:02:23,760 --> 00:02:27,000
This leads to another issue
with character-based tokenizers:
50
00:02:27,000 --> 00:02:29,520
their sequences are translated
into a very large number
51
00:02:29,520 --> 00:02:31,593
of tokens to be processed by the model.
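A rough comparison of sequence lengths (made-up sentence, not from the video):

sentence = "Character tokenizers produce much longer sequences."

print(len(sentence.split()))   # 6 word-level tokens
print(len(list(sentence)))     # 51 character-level tokens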
52
00:02:33,090 --> 00:02:36,810
And this can have an impact
on the size of the context
53
00:02:36,810 --> 00:02:40,020
the model will carry around,
and will reduce the size
54
00:02:40,020 --> 00:02:42,030
of the text we can use
as input for our model,
55
00:02:42,030 --> 00:02:43,233
which is often limited.
56
00:02:44,100 --> 00:02:46,650
This tokenization, while
it has some issues,
57
00:02:46,650 --> 00:02:48,720
has seen some very good
results in the past
58
00:02:48,720 --> 00:02:50,490
and so it should be
considered when approaching
59
00:02:50,490 --> 00:02:52,680
a new problem, as it solves issues
60
00:02:52,680 --> 00:02:54,843
encountered in the word-based algorithm.
61
00:02:56,107 --> 00:02:58,774
(page whirring)