1
00:00:00,621 --> 00:00:03,204
(upbeat music)
2
00:00:05,670 --> 00:00:08,520
- Text embeddings and semantic search.
3
00:00:08,520 --> 00:00:10,770
In this video, we'll explore
how Transformer models
4
00:00:10,770 --> 00:00:12,810
represent text as embedding vectors
5
00:00:12,810 --> 00:00:15,420
and how these vectors can be
used to find similar documents
6
00:00:15,420 --> 00:00:16,293
in a corpus.
7
00:00:17,730 --> 00:00:19,890
Text embeddings are just
a fancy way of saying
8
00:00:19,890 --> 00:00:22,170
that we can represent text
as an array of numbers
9
00:00:22,170 --> 00:00:23,640
called a vector.
10
00:00:23,640 --> 00:00:25,710
To create these embeddings, we usually use
11
00:00:25,710 --> 00:00:27,393
an encoder-based model like BERT.
12
00:00:28,530 --> 00:00:31,290
In this example, you can see
how we feed three sentences
13
00:00:31,290 --> 00:00:34,830
to the encoder and get
three vectors as the output.
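Roughly, that step might look like this in code (a sketch; the checkpoint name and sentences are illustrative assumptions, chosen to match the 384-dimensional model described later):

```python
# Sketch: feed three sentences to a BERT-style encoder.
# The checkpoint is an assumption; any encoder model works similarly.
import torch
from transformers import AutoTokenizer, AutoModel

model_ckpt = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

sentences = [
    "I took my dog for a walk",
    "Today is going to rain",
    "I took my cat for a walk",
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One hidden state per token: [batch_size, seq_len, hidden_size]
print(outputs.last_hidden_state.shape)
```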
14
00:00:34,830 --> 00:00:37,050
Reading the text, we can
see that walking the dog
15
00:00:37,050 --> 00:00:39,450
seems to be most similar
to walking the cat,
16
00:00:39,450 --> 00:00:41,350
but let's see if we can quantify this.
17
00:00:42,810 --> 00:00:44,040
The trick to doing the comparison
18
00:00:44,040 --> 00:00:45,630
is to compute a similarity metric
19
00:00:45,630 --> 00:00:48,210
between each pair of embedding vectors.
20
00:00:48,210 --> 00:00:51,120
These vectors usually live in
a very high-dimensional space,
21
00:00:51,120 --> 00:00:53,190
so a similarity metric can
be anything that measures
22
00:00:53,190 --> 00:00:55,740
some sort of distance between vectors.
23
00:00:55,740 --> 00:00:58,560
One very popular metric
is cosine similarity,
24
00:00:58,560 --> 00:01:00,390
which uses the angle between two vectors
25
00:01:00,390 --> 00:01:02,610
to measure how close they are.
26
00:01:02,610 --> 00:01:05,250
In this example, our
embedding vectors live in 3D
27
00:01:05,250 --> 00:01:07,110
and we can see that the
orange and gray vectors
28
00:01:07,110 --> 00:01:09,560
are close to each other
and have a smaller angle.
29
00:01:11,130 --> 00:01:12,510
Now one problem we have to deal with
30
00:01:12,510 --> 00:01:15,180
is that Transformer models
like BERT will actually return
31
00:01:15,180 --> 00:01:16,983
one embedding vector per token.
32
00:01:17,880 --> 00:01:20,700
For example, in the sentence,
"I took my dog for a walk,"
33
00:01:20,700 --> 00:01:23,853
we can expect several embedding
vectors, one for each word.
34
00:01:25,110 --> 00:01:27,870
For example, here we can
see that our model
35
00:01:27,870 --> 00:01:30,540
has produced 9 embedding
vectors per sentence,
36
00:01:30,540 --> 00:01:33,750
and each vector has 384 dimensions.
37
00:01:33,750 --> 00:01:36,210
But what we really want is
a single embedding vector
38
00:01:36,210 --> 00:01:37,353
for each sentence.
39
00:01:38,940 --> 00:01:42,060
To deal with this, we can use
a technique called pooling.
40
00:01:42,060 --> 00:01:43,050
The simplest pooling method
41
00:01:43,050 --> 00:01:44,520
is to just take the token embedding
42
00:01:44,520 --> 00:01:46,203
of the special CLS token.
43
00:01:47,100 --> 00:01:49,650
Alternatively, we can
average the token embeddings
44
00:01:49,650 --> 00:01:52,500
which is called mean pooling,
and this is what we do here.
45
00:01:53,370 --> 00:01:55,800
With mean pooling, the only
thing we need to make sure of
46
00:01:55,800 --> 00:01:58,410
is that we don't include the
padding tokens in the average,
47
00:01:58,410 --> 00:02:01,860
which is why you can see the
attention_mask being used here.
48
00:02:01,860 --> 00:02:05,100
This gives us a 384-dimensional
vector for each sentence
49
00:02:05,100 --> 00:02:06,600
which is exactly what we want.
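A minimal sketch of mean pooling with the attention mask, following the standard sentence-transformers recipe (it reuses `inputs` and `outputs` from the encoding sketch above):

```python
# Sketch: average the token embeddings, masking out padding tokens.
import torch

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state  # [batch, seq_len, hidden]
    # Expand the mask so padded positions contribute zero to the sum.
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    summed = torch.sum(token_embeddings * mask, dim=1)
    counts = torch.clamp(mask.sum(dim=1), min=1e-9)
    return summed / counts  # one vector per sentence: [batch, hidden]

sentence_embeddings = mean_pooling(outputs, inputs["attention_mask"])
print(sentence_embeddings.shape)  # e.g. torch.Size([3, 384])
```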
50
00:02:07,920 --> 00:02:09,810
And once we have our sentence embeddings,
51
00:02:09,810 --> 00:02:11,730
we can compute the cosine similarity
52
00:02:11,730 --> 00:02:13,113
for each pair of vectors.
53
00:02:13,993 --> 00:02:16,350
In this example, we use the
cosine_similarity function from scikit-learn
54
00:02:16,350 --> 00:02:19,140
and you can see that the sentence
"I took my dog for a walk"
55
00:02:19,140 --> 00:02:22,140
indeed has a strong overlap
with "I took my cat for a walk".
56
00:02:22,140 --> 00:02:23,240
Hooray! We've done it.
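That comparison might look like this with scikit-learn (a sketch reusing `sentence_embeddings` from the pooling step):

```python
# Sketch: pairwise cosine similarity between the three sentence embeddings.
from sklearn.metrics.pairwise import cosine_similarity

scores = cosine_similarity(sentence_embeddings.numpy())
print(scores.round(2))
# The dog/cat "walk" sentences should get the highest off-diagonal score.
```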
57
00:02:25,110 --> 00:02:27,180
We can actually take this
idea one step further
58
00:02:27,180 --> 00:02:29,220
by comparing the similarity
between a question
59
00:02:29,220 --> 00:02:31,170
and a corpus of documents.
60
00:02:31,170 --> 00:02:33,810
For example, suppose we embed every post
61
00:02:33,810 --> 00:02:35,430
in the Hugging Face forums.
62
00:02:35,430 --> 00:02:37,800
We can then ask a question, embed it,
63
00:02:37,800 --> 00:02:40,590
and check which forum
posts are most similar.
64
00:02:40,590 --> 00:02:42,750
This process is often
called semantic search,
65
00:02:42,750 --> 00:02:45,423
because it allows us to
compare queries with context.
66
00:02:47,040 --> 00:02:48,450
Creating a semantic search engine
67
00:02:48,450 --> 00:02:51,030
is actually quite simple
with the datasets library.
68
00:02:51,030 --> 00:02:53,340
First we need to embed all the documents.
69
00:02:53,340 --> 00:02:56,070
And in this example,
we take a small sample
70
00:02:56,070 --> 00:02:57,780
from the SQuAD dataset and apply
71
00:02:57,780 --> 00:03:00,180
the same embedding logic as before.
72
00:03:00,180 --> 00:03:02,280
This gives us a new
column called embeddings,
73
00:03:02,280 --> 00:03:04,530
which stores the embeddings
of every passage.
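A hedged sketch of that embedding step with the datasets library (the dataset split, sample size, and batch size are my own choices; it reuses the tokenizer, model, and mean_pooling from the sketches above):

```python
# Sketch: embed a small sample of SQuAD passages with Dataset.map.
from datasets import load_dataset

squad = load_dataset("squad", split="validation").shuffle(seed=42).select(range(100))

def embed(batch):
    inputs = tokenizer(batch["context"], padding=True, truncation=True,
                       return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return {"embeddings": mean_pooling(outputs, inputs["attention_mask"]).numpy()}

# Adds an "embeddings" column holding one vector per passage.
squad = squad.map(embed, batched=True, batch_size=16)
```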
74
00:03:05,880 --> 00:03:07,260
Once we have our embeddings,
75
00:03:07,260 --> 00:03:10,200
we need a way to find nearest
neighbors for a query.
76
00:03:10,200 --> 00:03:13,170
The datasets library provides
a special object called a FAISS index
77
00:03:13,170 --> 00:03:16,080
which allows you to quickly
compare embedding vectors.
78
00:03:16,080 --> 00:03:19,950
So we add the FAISS index,
embed a question and voila,
79
00:03:19,950 --> 00:03:21,870
we've now found the 3
most similar articles
80
00:03:21,870 --> 00:03:23,320
which might contain the answer.
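Those last steps might look like this (a sketch; `add_faiss_index` and `get_nearest_examples` are the datasets methods for this, and the question text is made up):

```python
# Sketch: build a FAISS index over the embeddings column and query it.
squad.add_faiss_index(column="embeddings")

question = "Why are model embeddings useful?"  # made-up example question
q_inputs = tokenizer(question, return_tensors="pt")
with torch.no_grad():
    q_outputs = model(**q_inputs)
q_embedding = mean_pooling(q_outputs, q_inputs["attention_mask"]).numpy()

# Retrieve the 3 most similar passages to the question embedding.
scores, samples = squad.get_nearest_examples("embeddings", q_embedding, k=3)
for score, context in zip(scores, samples["context"]):
    print(round(float(score), 2), context[:80])
```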
81
00:03:25,182 --> 00:03:27,849
(upbeat music)