1
00:00:06,450 --> 00:00:09,540
- Let's take a look at subword-based tokenization.

2
00:00:09,540 --> 00:00:11,610
Understanding why subword-based tokenization is

3
00:00:11,610 --> 00:00:13,980
interesting requires understanding the flaws

4
00:00:13,980 --> 00:00:17,340
of word-based and character-based tokenization.

5
00:00:17,340 --> 00:00:18,780
If you haven't seen the first videos

6
00:00:18,780 --> 00:00:22,020
on word-based and character-based tokenization,

7
00:00:22,020 --> 00:00:23,130
we recommend you check them

8
00:00:23,130 --> 00:00:24,780
out before looking at this video.

9
00:00:27,840 --> 00:00:31,493
Subword-based tokenization lies in between character-based

10
00:00:31,493 --> 00:00:35,280
and word-based tokenization algorithms.

11
00:00:35,280 --> 00:00:37,410
The idea is to find a middle ground

12
00:00:37,410 --> 00:00:39,486
between very large vocabularies,

13
00:00:39,486 --> 00:00:42,600
a large quantity of out-of-vocabulary tokens,

14
00:00:42,600 --> 00:00:45,360
and a loss of meaning across very similar words

15
00:00:45,360 --> 00:00:48,630
for word-based tokenizers, and very long sequences

16
00:00:48,630 --> 00:00:51,330
as well as less meaningful individual tokens

17
00:00:51,330 --> 00:00:53,133
for character-based tokenizers.

18
00:00:54,840 --> 00:00:57,960
These algorithms rely on the following principle:

19
00:00:57,960 --> 00:01:00,000
frequently used words should not be split

20
00:01:00,000 --> 00:01:01,500
into smaller subwords,

21
00:01:01,500 --> 00:01:03,433
while rare words should be decomposed

22
00:01:03,433 --> 00:01:05,103
into meaningful subwords.

23
00:01:06,510 --> 00:01:08,460
An example is the word dog.

24
00:01:08,460 --> 00:01:11,190
We would like our tokenizer to have a single ID

25
00:01:11,190 --> 00:01:12,600
for the word dog, rather

26
00:01:12,600 --> 00:01:15,363
than splitting it into the characters D, O and G.

27
00:01:16,650 --> 00:01:19,260
However, when encountering the word dogs,

28
00:01:19,260 --> 00:01:22,710
we would like our tokenizer to understand that at the root,

29
00:01:22,710 --> 00:01:24,120
this is still the word dog

30
00:01:24,120 --> 00:01:27,030
with an added S that slightly changes the meaning

31
00:01:27,030 --> 00:01:28,923
while keeping the original idea.

32
00:01:30,600 --> 00:01:34,080
Another example is a complex word like tokenization,

33
00:01:34,080 --> 00:01:37,140
which can be split into meaningful subwords.

34
00:01:37,140 --> 00:01:37,973
The root

35
00:01:37,973 --> 00:01:40,590
of the word is token, and -ization completes the root

36
00:01:40,590 --> 00:01:42,870
to give it a slightly different meaning.

37
00:01:42,870 --> 00:01:44,430
It makes sense to split the word

38
00:01:44,430 --> 00:01:47,640
into two: token as the root of the word,

39
00:01:47,640 --> 00:01:49,950
labeled as the start of the word,

40
00:01:49,950 --> 00:01:52,530
and ization as additional information, labeled

41
00:01:52,530 --> 00:01:54,393
as a completion of the word.

42
00:01:55,826 --> 00:01:58,740
In turn, the model will now be able to make sense

43
00:01:58,740 --> 00:02:01,080
of token in different situations.

44
00:02:01,080 --> 00:02:04,602
It will understand that the words token, tokens, tokenizing

45
00:02:04,602 --> 00:02:08,760
and tokenization have a similar meaning and are linked.
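A minimal sketch of this behaviour in Python, assuming the transformers library and the bert-base-uncased checkpoint (a WordPiece subword tokenizer); the exact splits depend on the learned vocabulary:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A frequent word is kept as a single token...
print(tokenizer.tokenize("dog"))           # e.g. ['dog']

# ...while a rarer, composed word is split into meaningful subwords
# that share the root "token".
print(tokenizer.tokenize("tokenization"))  # e.g. ['token', '##ization']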
46
00:02:08,760 --> 00:02:12,450
It will also understand that tokenization, modernization

47
00:02:12,450 --> 00:02:16,200
and immunization, which all have the same suffix,

48
00:02:16,200 --> 00:02:19,383
are probably used in the same syntactic situations.

49
00:02:20,610 --> 00:02:23,130
Subword-based tokenizers generally have a way to

50
00:02:23,130 --> 00:02:25,890
identify which tokens are at the start of a word

51
00:02:25,890 --> 00:02:28,443
and which tokens complete a word.

52
00:02:29,520 --> 00:02:31,140
So here, token is labeled as the start

53
00:02:31,140 --> 00:02:35,100
of a word, and ##ization as the completion of a word.

54
00:02:35,100 --> 00:02:38,103
Here, the ## prefix indicates that ization is part

55
00:02:38,103 --> 00:02:41,013
of a word rather than the beginning of it.

56
00:02:41,910 --> 00:02:43,110
The ## prefix comes

57
00:02:43,110 --> 00:02:47,013
from the BERT tokenizer, based on the WordPiece algorithm.

58
00:02:47,850 --> 00:02:50,700
Other tokenizers use other prefixes, which can be

59
00:02:50,700 --> 00:02:52,200
placed to indicate parts of words,

60
00:02:52,200 --> 00:02:55,083
like the one here, or the start of words instead.

61
00:02:56,250 --> 00:02:57,083
There are a lot

62
00:02:57,083 --> 00:02:58,740
of different algorithms that can be used

63
00:02:58,740 --> 00:03:00,090
for subword tokenization,

64
00:03:00,090 --> 00:03:02,670
and most models obtaining state-of-the-art results

65
00:03:02,670 --> 00:03:03,780
in English today

66
00:03:03,780 --> 00:03:06,663
use some kind of subword tokenization algorithm.

67
00:03:07,620 --> 00:03:10,953
These approaches help in reducing vocabulary sizes

68
00:03:10,953 --> 00:03:13,636
by sharing information across different words,

69
00:03:13,636 --> 00:03:15,960
with the ability to have prefixes

70
00:03:15,960 --> 00:03:18,630
and suffixes understood as such.

71
00:03:18,630 --> 00:03:20,700
They keep meaning across very similar words

72
00:03:20,700 --> 00:03:23,103
by recognizing the similar tokens making them up.
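A minimal sketch contrasting the two labeling conventions, assuming the transformers library; the albert-base-v2 checkpoint is used here purely as an illustration of a tokenizer that marks the start of words, and the printed splits depend on each learned vocabulary:

from transformers import AutoTokenizer

# WordPiece (BERT): the '##' prefix marks tokens that continue a word.
bert = AutoTokenizer.from_pretrained("bert-base-uncased")
print(bert.tokenize("tokenization"))    # e.g. ['token', '##ization']

# A SentencePiece-style tokenizer (ALBERT) instead marks word-initial
# tokens, with a '▁' prefix on tokens that start a word.
albert = AutoTokenizer.from_pretrained("albert-base-v2")
print(albert.tokenize("tokenization"))  # e.g. ['▁token', 'ization']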