Welcome to the Hugging Face tasks series! In this video we’ll take a look at Masked Language Modeling.

Masked language modeling is the task of predicting which words should fill in the blanks of a sentence. These models take masked text as input and output the possible values for that mask.

Masked language modeling is handy before fine-tuning your model for your task. For example, if you need to use a model in a specific domain, say, biomedical documents, models like BERT will treat your domain-specific words as rare tokens. If you train a masked language model using your biomedical corpus and then fine-tune your model on a downstream task, you will get better performance.

Classification metrics can’t be used, as there’s no single correct answer for the mask values. Instead, we evaluate the distribution of the mask values. A common metric to do so is the cross-entropy loss. Perplexity is also a widely used metric, and it is calculated as the exponential of the cross-entropy loss.

You can use any dataset with plain text and tokenize the text to mask the data.

For more information about Masked Language Modeling, check out the Hugging Face course.
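As a minimal sketch of that input/output behavior, here is how it might look with the `transformers` fill-mask pipeline; the checkpoint name and the example sentence are illustrative choices, not ones from the video:

```python
from transformers import pipeline

# Load a fill-mask pipeline; "distilbert-base-uncased" is just an illustrative checkpoint.
fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")

# The input contains the tokenizer's mask token ([MASK] for this checkpoint);
# the output is a list of candidate fills, each with a score.
results = fill_mask("The goal of life is [MASK].")
for r in results:
    print(f"{r['token_str']!r}  score={r['score']:.3f}")
```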
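To make the relationship between the two metrics concrete, a tiny illustration of turning a cross-entropy loss into perplexity; the loss value below is hypothetical, standing in for whatever your evaluation reports:

```python
import math

# Perplexity is the exponential of the mean cross-entropy loss.
cross_entropy_loss = 2.3  # hypothetical evaluation loss
perplexity = math.exp(cross_entropy_loss)
print(f"perplexity = {perplexity:.2f}")  # ~9.97
```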
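One possible sketch of tokenizing plain text and masking it for training, using `AutoTokenizer` and `DataCollatorForLanguageModeling` from `transformers`; the checkpoint name and sentences are assumptions, and the 15% masking rate is the library's default rather than something prescribed by the video:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Tokenize plain-text examples (the sentences here are just placeholders).
texts = [
    "Biomedical corpora contain many domain-specific terms.",
    "Masked language modeling predicts the hidden tokens.",
]
encodings = [tokenizer(t) for t in texts]

# The collator randomly masks 15% of the tokens and sets the labels
# so the model is trained to recover the original tokens.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
batch = collator(encodings)
print(batch["input_ids"].shape, batch["labels"].shape)
```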