1
00:00:00,450 --> 00:00:01,509
(intro whooshing)
2
00:00:01,509 --> 00:00:02,720
(smiley snapping)
3
00:00:02,720 --> 00:00:03,930
(words whooshing)
4
00:00:03,930 --> 00:00:04,920
- In the next few videos,
5
00:00:04,920 --> 00:00:06,720
we'll take a look at tokenizers.
6
00:00:07,860 --> 00:00:09,240
In natural language processing,
7
00:00:09,240 --> 00:00:12,930
most of the data that we
handle consists of raw text.
8
00:00:12,930 --> 00:00:14,280
However, machine learning models
9
00:00:14,280 --> 00:00:17,103
cannot read or understand
text in its raw form;
10
00:00:18,540 --> 00:00:20,253
they can only work with numbers.
11
00:00:21,360 --> 00:00:23,220
So the tokenizer's objective
12
00:00:23,220 --> 00:00:25,923
will be to translate
the text into numbers.
13
00:00:27,600 --> 00:00:30,240
There are several possible
approaches to this conversion,
14
00:00:30,240 --> 00:00:31,110
and the objective
15
00:00:31,110 --> 00:00:33,453
is to find the most
meaningful representation.
16
00:00:36,240 --> 00:00:39,390
We'll take a look at three
distinct tokenization algorithms.
17
00:00:39,390 --> 00:00:40,530
We'll compare them to one another,
18
00:00:40,530 --> 00:00:42,600
so we recommend you watch the videos
19
00:00:42,600 --> 00:00:44,040
in the following order.
20
00:00:44,040 --> 00:00:45,390
First, "Word-based,"
21
00:00:45,390 --> 00:00:46,800
followed by "Character-based,"
22
00:00:46,800 --> 00:00:48,877
and finally, "Subword-based."
23
00:00:48,877 --> 00:00:51,794
(outro whooshing)