1
00:00:00,450 --> 00:00:01,509
(intro whooshing)
2
00:00:01,509 --> 00:00:02,720
(smiley snapping)
3
00:00:02,720 --> 00:00:03,930
(words whooshing)
4
00:00:03,930 --> 00:00:04,920
- In the next few videos,
5
00:00:04,920 --> 00:00:06,720
we'll take a look at tokenizers.
6
00:00:07,860 --> 00:00:09,240
In natural language processing,
7
00:00:09,240 --> 00:00:12,930
most of the data that we
handle consists of raw text.
8
00:00:12,930 --> 00:00:14,280
However, machine learning models
9
00:00:14,280 --> 00:00:17,103
cannot read or understand
text in its raw form;
10
00:00:18,540 --> 00:00:20,253
they can only work with numbers.
11
00:00:21,360 --> 00:00:23,220
So the tokenizer's objective
12
00:00:23,220 --> 00:00:25,923
will be to translate
the text into numbers.
13
00:00:27,600 --> 00:00:30,240
There are several possible
approaches to this conversion,
14
00:00:30,240 --> 00:00:31,110
and the objective
15
00:00:31,110 --> 00:00:33,453
is to find the most
meaningful representation.
16
00:00:36,240 --> 00:00:39,390
We'll take a look at three
distinct tokenization algorithms.
17
00:00:39,390 --> 00:00:40,530
We'll compare them to one another,
18
00:00:40,530 --> 00:00:42,600
so we recommend you watch the videos
19
00:00:42,600 --> 00:00:44,040
in the following order.
20
00:00:44,040 --> 00:00:45,390
First, "Word-based,"
21
00:00:45,390 --> 00:00:46,800
followed by "Character-based,"
22
00:00:46,800 --> 00:00:48,877
and finally, "Subword-based."
23
00:00:48,877 --> 00:00:51,794
(outro whooshing)