subtitles/zh-CN/41_text-embeddings-&-semantic-search.srt
1
00:00:00,621 --> 00:00:03,204
(欢快的音乐)
(upbeat music)
2
00:00:05,670 --> 00:00:08,520
- 文本嵌入和语义搜索。
- Text embeddings and semantic search.
3
00:00:08,520 --> 00:00:10,770
在本视频中,我们将探索 Transformer 模型如何
In this video we'll explore how Transformer models
4
00:00:10,770 --> 00:00:12,810
将文本表示为嵌入向量
represent text as embedding vectors
5
00:00:12,810 --> 00:00:15,420
以及在语料库中如何使用这些向量
and how these vectors can be used to find similar documents
6
00:00:15,420 --> 00:00:16,293
来查找相似文档。
in a corpus.
7
00:00:17,730 --> 00:00:19,890
文本嵌入只是一种时髦的说法
Text embeddings are just a fancy way of saying
8
00:00:19,890 --> 00:00:22,170
我们可以将文本表示为数字数组
that we can represent text as an array of numbers
9
00:00:22,170 --> 00:00:23,640
称之为向量。
called a vector.
10
00:00:23,640 --> 00:00:25,710
为了创建这些嵌入,我们通常使用
To create these embeddings we usually use
11
00:00:25,710 --> 00:00:27,393
基于编码器的模型,如 BERT。
an encoder-based model like BERT.
12
00:00:28,530 --> 00:00:31,290
在此示例中,你可以看到我们如何提供三个句子
In this example, you can see how we feed three sentences
13
00:00:31,290 --> 00:00:34,830
到编码器并获得三个向量作为输出。
to the encoder and get three vectors as the output.
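一个最小示例（并非视频原文代码）：用类似 BERT 的编码器对三个句子进行编码；模型名和第三个句子均为假设。
A minimal sketch (not the video's exact code) of feeding three sentences to a BERT-like encoder; the checkpoint name and the third sentence are assumptions.

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Assumed checkpoint: a 384-dimensional sentence encoder from the Hub.
checkpoint = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

sentences = [
    "I took my dog for a walk",
    "Today is going to rain",      # illustrative third sentence (assumption)
    "I took my cat for a walk",
]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One embedding per token: shape (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```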
14
00:00:34,830 --> 00:00:37,050
读一下输入文本,我们可以看到 walking the dog
Reading the text, we can see that walking the dog
15
00:00:37,050 --> 00:00:39,450
似乎与 walking the cat 最为相似,
seems to be most similar to walking the cat,
16
00:00:39,450 --> 00:00:41,350
但让我们看看我们是否可以对此进行量化。
but let's see if we can quantify this.
17
00:00:42,810 --> 00:00:44,040
进行比较的技巧
The trick to do the comparison
18
00:00:44,040 --> 00:00:45,630
是在每对嵌入向量之间
is to compute a similarity metric
19
00:00:45,630 --> 00:00:48,210
计算相似性度量。
between each pair of embedding vectors.
20
00:00:48,210 --> 00:00:51,120
这些向量通常存在于一个非常高维的空间中,
These vectors usually live in a very high-dimensional space,
21
00:00:51,120 --> 00:00:53,190
所以相似性度量可以是任何可用于
so a similarity metric can be anything that measures
22
00:00:53,190 --> 00:00:55,740
衡量向量之间某种距离的指标。
some sort of distance between vectors.
23
00:00:55,740 --> 00:00:58,560
一个非常流行的指标是余弦相似度,
One very popular metric is cosine similarity,
24
00:00:58,560 --> 00:01:00,390
它使用两个向量之间的角度
which uses the angle between two vectors
25
00:01:00,390 --> 00:01:02,610
来衡量他们有多接近。
to measure how close they are.
26
00:01:02,610 --> 00:01:05,250
在这个例子中,我们的嵌入向量存在于三维空间中
In this example, our embedding vectors live in 3D
27
00:01:05,250 --> 00:01:07,110
我们可以看到橙色和灰色向量
and we can see that the orange and grey vectors
28
00:01:07,110 --> 00:01:09,560
彼此靠近并且具有更小的角度。
are close to each other and have a smaller angle.
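余弦相似度的一个最小示例：用两个向量夹角的余弦（点积除以范数乘积）来衡量相似程度。
A minimal sketch of cosine similarity: the cosine of the angle between two vectors, i.e. their dot product divided by the product of their norms.

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||); close to 1 means a small angle
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Illustrative 3D vectors, like the ones in the figure (the values are made up).
u = np.array([1.0, 2.0, 3.0])
v = np.array([1.1, 2.1, 2.9])
print(cosine_similarity(u, v))
```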
29
00:01:11,130 --> 00:01:12,510
现在我们必须处理一个问题
Now one problem we have to deal with
30
00:01:12,510 --> 00:01:15,180
是像 BERT 这样的 Transformer 模型实际上会返回
is that Transformer models like BERT will actually return
31
00:01:15,180 --> 00:01:16,983
每个词元一个嵌入向量。
one embedding vector per token.
32
00:01:17,880 --> 00:01:20,700
例如在句子中,“I took my dog for a walk,”
For example in the sentence, "I took my dog for a walk,"
33
00:01:20,700 --> 00:01:23,853
我们可以期待几个嵌入向量,每个词一个。
we can expect several embedding vectors, one for each word.
34
00:01:25,110 --> 00:01:27,870
例如,在这里我们可以看到模型的输出
For example, here we can see the output of our model
35
00:01:27,870 --> 00:01:30,540
每个句子产生了 9 个嵌入向量,
has produced 9 embedding vectors per sentence,
36
00:01:30,540 --> 00:01:33,750
每个向量有 384 个维度。
and each vector has 384 dimensions.
37
00:01:33,750 --> 00:01:36,210
但我们真正想要的是每个句子
But what we really want is a single embedding vector
38
00:01:36,210 --> 00:01:37,353
对应一个单一的嵌入向量。
for each sentence.
39
00:01:38,940 --> 00:01:42,060
为了解决这个问题,我们可以使用一种称为 pooling 的技术。
To deal with this, we can use a technique called pooling.
40
00:01:42,060 --> 00:01:43,050
最简单的 pooling 方法
The simplest pooling method
41
00:01:43,050 --> 00:01:44,520
就是直接取特殊 CLS 词元
is to just take the token embedding
42
00:01:44,520 --> 00:01:46,203
的词元嵌入。
of the special CLS token.
43
00:01:47,100 --> 00:01:49,650
或者,我们可以对词元嵌入进行平均
Alternatively, we can average the token embeddings
44
00:01:49,650 --> 00:01:52,500
这就是所谓的 mean_pooling,也就是我们在这里所做的。
which is called mean_pooling and this is what we do here.
45
00:01:53,370 --> 00:01:55,800
使用 mean_pooling 时我们唯一需要确保的事情
With mean_pooling the only thing we need to make sure
46
00:01:55,800 --> 00:01:58,410
是我们不在平均值中包含 padding 词元,
is that we don't include the padding tokens in the average,
47
00:01:58,410 --> 00:02:01,860
这就是为什么你可以看到这里用到了 attention_mask。
which is why you can see the attention_mask being used here.
48
00:02:01,860 --> 00:02:05,100
这为每个句子提供了一个 384 维向量
This gives us a 384 dimensional vector for each sentence
49
00:02:05,100 --> 00:02:06,600
这正是我们想要的。
which is exactly what we want.
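mean pooling 的一个最小示例（沿用上面示例中的 inputs 和 outputs）：用 attention_mask 把 padding 词元排除在平均之外。
A minimal mean-pooling sketch, reusing `inputs` and `outputs` from the sketch above: the attention_mask keeps padding tokens out of the average.

```python
token_embeddings = outputs.last_hidden_state            # (batch, seq_len, hidden)
mask = inputs["attention_mask"].unsqueeze(-1).float()   # (batch, seq_len, 1)

# Zero out the padding positions, then divide by the number of real tokens.
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

# Alternative "CLS pooling": just take the first token's embedding.
# cls_embeddings = token_embeddings[:, 0]

print(sentence_embeddings.shape)  # (batch_size, hidden_size), e.g. (3, 384)
```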
50
00:02:07,920 --> 00:02:09,810
一旦我们有了句子嵌入,
And once we have our sentence embeddings,
51
00:02:09,810 --> 00:02:11,730
我们可以针对每对向量
we can compute the cosine similarity
52
00:02:11,730 --> 00:02:13,113
计算余弦相似度。
for each pair of vectors.
53
00:02:13,993 --> 00:02:16,350
在此示例中,我们使用 scikit-learn 中的函数
In this example we use the function from scikit-learn
54
00:02:16,350 --> 00:02:19,140
你可以看到 “I took my dog for a walk” 这句话
and you can see that the sentence "I took my dog for a walk"
55
00:02:19,140 --> 00:02:22,140
确实与 “I took my cat for a walk” 有很明显的重叠。
has indeed a strong overlap with "I took my cat for a walk".
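用 scikit-learn 计算两两余弦相似度的最小示例（沿用上面的 sentence_embeddings）。
A minimal sketch of the pairwise comparison with scikit-learn, reusing `sentence_embeddings` from above.

```python
from sklearn.metrics.pairwise import cosine_similarity

# Returns an (n_sentences, n_sentences) matrix of pairwise cosine similarities.
similarities = cosine_similarity(sentence_embeddings.numpy())
print(similarities)
```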
56
00:02:22,140 --> 00:02:23,240
万岁!我们做到了。
Hooray! We've done it.
57
00:02:25,110 --> 00:02:27,180
我们实际上可以将这个想法更进一步
We can actually take this idea one step further
58
00:02:27,180 --> 00:02:29,220
通过比较问题和文档语料库
by comparing the similarity between a question
59
00:02:29,220 --> 00:02:31,170
之间的相似性。
and a corpus of documents.
60
00:02:31,170 --> 00:02:33,810
例如,假设我们在 Hugging Face 论坛中
For example, suppose we embed every post
61
00:02:33,810 --> 00:02:35,430
嵌入每个帖子。
in the Hugging Face forums.
62
00:02:35,430 --> 00:02:37,800
然后我们可以问一个问题,嵌入它,
We can then ask a question, embed it,
63
00:02:37,800 --> 00:02:40,590
并检查哪些论坛帖子最相似。
and check which forum posts are most similar.
64
00:02:40,590 --> 00:02:42,750
这个过程通常称为语义搜索,
This process is often called semantic search,
65
00:02:42,750 --> 00:02:45,423
因为它允许我们将查询与上下文进行比较。
because it allows us to compare queries with context.
66
00:02:47,040 --> 00:02:48,450
使用 datasets 库
To create a semantic search engine
67
00:02:48,450 --> 00:02:51,030
创建语义搜索引擎其实很简单。
is actually quite simple in the datasets library.
68
00:02:51,030 --> 00:02:53,340
首先我们需要嵌入所有文档。
First we need to embed all the documents.
69
00:02:53,340 --> 00:02:56,070
在这个例子中,我们取了
And in this example, we take a small sample
70
00:02:56,070 --> 00:02:57,780
一个来自 squad 数据集的小样本
from the squad dataset and apply
71
00:02:57,780 --> 00:03:00,180
并应用与之前相同的嵌入逻辑。
the same embedding logic as before.
72
00:03:00,180 --> 00:03:02,280
这为我们提供了一个名为 embeddings 的新列,
This gives us a new column called embeddings,
73
00:03:02,280 --> 00:03:04,530
它存储每个段落的嵌入。
which stores the embeddings of every passage.
74
00:03:05,880 --> 00:03:07,260
一旦我们有了嵌入,
Once we have our embeddings,
75
00:03:07,260 --> 00:03:10,200
我们需要一种方法来为查询找到最近邻。
we need a way to find nearest neighbors for a query.
76
00:03:10,200 --> 00:03:13,170
datasets 库提供了一个名为 FAISS 的特殊对象
The datasets library provides a special object called FAISS
77
00:03:13,170 --> 00:03:16,080
这使你可以快速比较嵌入向量。
which allows you to quickly compare embedding vectors.
78
00:03:16,080 --> 00:03:19,950
所以我们添加 FAISS 索引,嵌入一个问题,瞧,
So we add the FAISS index, embed a question and voila,
79
00:03:19,950 --> 00:03:21,870
我们现在找到了 3 篇最相似的文章
we've now found the 3 most similar articles
80
00:03:21,870 --> 00:03:23,320
其中可能会包含答案。
which might store the answer.
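一个最小的语义搜索示例：这里假设有一个 embed(texts) 辅助函数（即上面的分词 + 编码 + mean pooling，返回 NumPy 数组），问题文本也只是示例。
A minimal semantic-search sketch with the datasets library. It assumes a hypothetical `embed(texts)` helper (the tokenizer + encoder + mean pooling from above, returning a NumPy array) and an illustrative question.

```python
from datasets import load_dataset

# Take a small sample of SQuAD passages and embed each one.
squad = load_dataset("squad", split="validation").shuffle(seed=42).select(range(100))
squad = squad.map(lambda x: {"embeddings": embed([x["context"]])[0]})  # embed() is an assumed helper

# Build a FAISS index over the embeddings column and query it with an embedded question.
squad.add_faiss_index(column="embeddings")
question_embedding = embed(["What is a corpus?"])[0]  # illustrative question
scores, samples = squad.get_nearest_examples("embeddings", question_embedding, k=3)
print(samples["context"])
```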
81
00:03:25,182 --> 00:03:27,849
(欢快的音乐)
(upbeat music)