(upbeat music)

- Text embeddings and semantic search. In this video, we'll explore how Transformer models represent text as embedding vectors, and how these vectors can be used to find similar documents in a corpus.

Text embeddings are just a fancy way of saying that we can represent text as an array of numbers called a vector. To create these embeddings, we usually use an encoder-based model like BERT.

In this example, you can see how we feed three sentences to the encoder and get three vectors as the output. Reading the text, we can see that walking the dog seems to be most similar to walking the cat, but let's see if we can quantify this.

The trick to doing the comparison is to compute a similarity metric between each pair of embedding vectors. These vectors usually live in a very high-dimensional space, so a similarity metric can be anything that measures some sort of distance between vectors. One very popular metric is cosine similarity, which uses the angle between two vectors to measure how close they are. In this example, our embedding vectors live in 3D, and we can see that the orange and grey vectors are close to each other and have a smaller angle between them.

Now, one problem we have to deal with is that Transformer models like BERT actually return one embedding vector per token. For example, in the sentence "I took my dog for a walk," we can expect several embedding vectors, one for each word. Here, we can see that the output of our model has produced nine embedding vectors per sentence, and each vector has 384 dimensions. But what we really want is a single embedding vector for each sentence.

To deal with this, we can use a technique called pooling. The simplest pooling method is to just take the token embedding of the special CLS token. Alternatively, we can average the token embeddings, which is called mean pooling, and that's what we do here.

With mean pooling, the only thing we need to make sure of is that we don't include the padding tokens in the average, which is why you can see the attention_mask being used here. This gives us a 384-dimensional vector for each sentence, which is exactly what we want.

And once we have our sentence embeddings, we can compute the cosine similarity for each pair of vectors. In this example, we use the cosine_similarity function from scikit-learn, and you can see that the sentence "I took my dog for a walk" does indeed have a strong overlap with "I took my cat for a walk". Hooray! We've done it.
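Here is a minimal sketch of the pipeline described so far, using the Transformers library and scikit-learn. The checkpoint (sentence-transformers/all-MiniLM-L6-v2, which produces 384-dimensional vectors) and the example sentences are assumptions for illustration, not necessarily the exact ones shown in the video.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity

# Assumed checkpoint: a small sentence-embedding model with 384-dimensional outputs.
model_ckpt = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

# Illustrative sentences.
sentences = [
    "I took my dog for a walk",
    "Today is going to rain",
    "I took my cat for a walk",
]

# Tokenize with padding so the three sentences form one batch.
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch_size, num_tokens, 384):
# one embedding vector per token.
token_embeddings = outputs.last_hidden_state

# Mean pooling: average the token embeddings, using the attention mask
# so that padding tokens do not contribute to the average.
mask = inputs["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embeddings.shape)  # torch.Size([3, 384])

# Cosine similarity between every pair of sentence embeddings.
scores = cosine_similarity(sentence_embeddings.numpy())
print(scores)
```

With a model along these lines, the dog and cat sentences should end up with the highest off-diagonal similarity score, matching the intuition above.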
We can actually take this idea one step further by comparing the similarity between a question and a corpus of documents. For example, suppose we embed every post in the Hugging Face forums. We can then ask a question, embed it, and check which forum posts are most similar. This process is often called semantic search, because it allows us to compare queries with contexts.

Creating a semantic search engine is actually quite simple with the Datasets library. First, we need to embed all the documents. In this example, we take a small sample from the SQuAD dataset and apply the same embedding logic as before. This gives us a new column called "embeddings", which stores the embedding of every passage.

Once we have our embeddings, we need a way to find the nearest neighbors of a query. The Datasets library provides a special FAISS index, which allows you to quickly compare embedding vectors. So we add the FAISS index, embed a question, and voilà: we've now found the three most similar articles, which might contain the answer.

(upbeat music)
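To round off the video, here is a rough sketch of that semantic-search workflow with the Datasets library. The checkpoint, the SQuAD split, the sample size, the column name, and the example question are all illustrative assumptions, and the FAISS index additionally requires the faiss package to be installed.

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModel

# Same assumed checkpoint as in the previous sketch.
model_ckpt = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

def embed(texts):
    """Mean-pooled sentence embeddings, as in the previous sketch."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return ((outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)).numpy()

# Take a small sample of SQuAD passages (split and sample size are assumptions).
squad = load_dataset("squad", split="validation").shuffle(seed=42).select(range(100))

# Add an "embeddings" column that stores one vector per passage.
squad = squad.map(
    lambda batch: {"embeddings": embed(batch["context"])},
    batched=True,
    batch_size=16,
)

# Build a FAISS index over that column (needs faiss-cpu or faiss-gpu installed).
squad.add_faiss_index(column="embeddings")

# Embed a question and retrieve the 3 nearest passages.
question = "Which team won Super Bowl 50?"  # illustrative question
scores, samples = squad.get_nearest_examples("embeddings", embed([question])[0], k=3)
for context in samples["context"]:
    print(context[:100])
```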