1
00:00:00,450 --> 00:00:01,509
(intro whooshing)

2
00:00:01,509 --> 00:00:02,720
(smiley snapping)

3
00:00:02,720 --> 00:00:03,930
(words whooshing)

4
00:00:03,930 --> 00:00:04,920
- In the next few videos,

5
00:00:04,920 --> 00:00:06,720
we'll take a look at tokenizers.

6
00:00:07,860 --> 00:00:09,240
In natural language processing,

7
00:00:09,240 --> 00:00:12,930
most of the data that we handle consists of raw text.

8
00:00:12,930 --> 00:00:14,280
However, machine learning models

9
00:00:14,280 --> 00:00:17,103
cannot read or understand text in its raw form;

10
00:00:18,540 --> 00:00:20,253
they can only work with numbers.

11
00:00:21,360 --> 00:00:23,220
So the tokenizer's objective

12
00:00:23,220 --> 00:00:25,923
will be to translate the text into numbers.

13
00:00:27,600 --> 00:00:30,240
There are several possible approaches to this conversion,

14
00:00:30,240 --> 00:00:31,110
and the objective

15
00:00:31,110 --> 00:00:33,453
is to find the most meaningful representation.

16
00:00:36,240 --> 00:00:39,390
We'll take a look at three distinct tokenization algorithms.

17
00:00:39,390 --> 00:00:40,530
We'll compare them to one another,

18
00:00:40,530 --> 00:00:42,600
so we recommend you watch the videos

19
00:00:42,600 --> 00:00:44,040
in the following order:

20
00:00:44,040 --> 00:00:45,390
first, "Word-based,"

21
00:00:45,390 --> 00:00:46,800
followed by "Character-based,"

22
00:00:46,800 --> 00:00:48,877
and finally, "Subword-based."

23
00:00:48,877 --> 00:00:51,794
(outro whooshing)
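
A minimal sketch of the text-to-numbers translation described in the video, assuming the Hugging Face transformers library is installed and using bert-base-uncased as an example checkpoint (the video itself does not name a specific tokenizer):

    from transformers import AutoTokenizer

    # Load a pretrained tokenizer (bert-base-uncased is an assumed example checkpoint)
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    # Translate raw text, which the model cannot read directly,
    # into the numbers it can work with
    encoded = tokenizer("Machine learning models can only work with numbers.")
    print(encoded["input_ids"])  # a list of integer token IDs, one per token

The exact IDs depend on the tokenization algorithm and vocabulary of the chosen checkpoint, which is why the word-based, character-based, and subword-based approaches covered in the following videos produce different representations of the same text.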