subtitles/zh-CN/41_text-embeddings-&-semantic-search.srt

1
00:00:00,621 --> 00:00:03,204
(欢快的音乐)
(upbeat music)

2
00:00:05,670 --> 00:00:08,520
- 文本嵌入和语义搜索。
- Text embeddings and semantic search.

3
00:00:08,520 --> 00:00:10,770
在本视频中,我们将探索 Transformer 模型如何
In this video we'll explore how Transformer models

4
00:00:10,770 --> 00:00:12,810
将文本表示为嵌入向量
represent text as embedding vectors

5
00:00:12,810 --> 00:00:15,420
以及如何使用这些向量在语料库中
and how these vectors can be used to find similar documents

6
00:00:15,420 --> 00:00:16,293
查找相似文档。
in a corpus.

7
00:00:17,730 --> 00:00:19,890
文本嵌入只是一种时髦的说法,
Text embeddings are just a fancy way of saying

8
00:00:19,890 --> 00:00:22,170
意思是我们可以把文本表示为一个数字数组,
that we can represent text as an array of numbers

9
00:00:22,170 --> 00:00:23,640
称为向量。
called a vector.

10
00:00:23,640 --> 00:00:25,710
为了创建这些嵌入,我们通常使用
To create these embeddings we usually use

11
00:00:25,710 --> 00:00:27,393
基于编码器的模型,如 BERT。
an encoder-based model like BERT.

12
00:00:28,530 --> 00:00:31,290
在此示例中,你可以看到我们如何将三个句子输入
In this example, you can see how we feed three sentences

13
00:00:31,290 --> 00:00:34,830
编码器,并获得三个向量作为输出。
to the encoder and get three vectors as the output.

14
00:00:34,830 --> 00:00:37,050
读一下输入文本,我们可以看到 walking the dog
Reading the text, we can see that walking the dog

15
00:00:37,050 --> 00:00:39,450
似乎与 walking the cat 最为相似,
seems to be most similar to walking the cat,

16
00:00:39,450 --> 00:00:41,350
但让我们看看是否可以对此进行量化。
but let's see if we can quantify this.

17
00:00:42,810 --> 00:00:44,040
进行比较的技巧
The trick to do the comparison

18
00:00:44,040 --> 00:00:45,630
是在每对嵌入向量之间
is to compute a similarity metric

19
00:00:45,630 --> 00:00:48,210
计算相似性度量。
between each pair of embedding vectors.

20
00:00:48,210 --> 00:00:51,120
这些向量通常存在于一个非常高维的空间中,
These vectors usually live in a very high-dimensional space,

21
00:00:51,120 --> 00:00:53,190
所以相似性度量可以是任何能够衡量
so a similarity metric can be anything that measures

22
00:00:53,190 --> 00:00:55,740
向量之间某种距离的指标。
some sort of distance between vectors.

23
00:00:55,740 --> 00:00:58,560
一个非常流行的指标是余弦相似度,
One very popular metric is cosine similarity,

24
00:00:58,560 --> 00:01:00,390
它使用两个向量之间的角度
which uses the angle between two vectors

25
00:01:00,390 --> 00:01:02,610
来衡量它们有多接近。
to measure how close they are.

26
00:01:02,610 --> 00:01:05,250
在这个例子中,我们的嵌入向量存在于三维空间中,
In this example, our embedding vectors live in 3D

27
00:01:05,250 --> 00:01:07,110
我们可以看到橙色和灰色向量
and we can see that the orange and grey vectors

28
00:01:07,110 --> 00:01:09,560
彼此靠近,并且夹角更小。
are close to each other and have a smaller angle.

29
00:01:11,130 --> 00:01:12,510
现在我们必须处理的一个问题
Now one problem we have to deal with

30
00:01:12,510 --> 00:01:15,180
是像 BERT 这样的 Transformer 模型实际上会返回
is that Transformer models like BERT will actually return

31
00:01:15,180 --> 00:01:16,983
每个词元一个嵌入向量。
one embedding vector per token.

32
00:01:17,880 --> 00:01:20,700
例如在 “I took my dog for a walk” 这个句子中,
For example in the sentence, "I took my dog for a walk,"

33
00:01:20,700 --> 00:01:23,853
我们会得到好几个嵌入向量,每个词一个。
we can expect several embedding vectors, one for each word.

34
00:01:25,110 --> 00:01:27,870
例如,在这里我们可以看到模型的输出
For example, here we can see the output of our model

35
00:01:27,870 --> 00:01:30,540
为每个句子产生了 9 个嵌入向量,
has produced 9 embedding vectors per sentence,

36
00:01:30,540 --> 00:01:33,750
每个向量有 384 个维度。
and each vector has 384 dimensions.

37
00:01:33,750 --> 00:01:36,210
但我们真正想要的是每个句子
But what we really want is a single embedding vector

38
00:01:36,210 --> 00:01:37,353
对应一个单独的嵌入向量。
for each sentence.
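[Editor's note] A minimal sketch (not shown verbatim in the video) of how per-token embeddings are obtained from an encoder. The checkpoint name sentence-transformers/all-MiniLM-L6-v2 and the three example sentences are assumptions for illustration; any BERT-like encoder would behave similarly, and this particular checkpoint happens to produce 384-dimensional hidden states as mentioned in the narration.

    # Sketch: one embedding vector per token, shape (batch, num_tokens, 384)
    from transformers import AutoTokenizer, AutoModel
    import torch

    model_ckpt = "sentence-transformers/all-MiniLM-L6-v2"  # assumption: 384-dim encoder
    tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
    model = AutoModel.from_pretrained(model_ckpt)

    sentences = [
        "I took my dog for a walk",     # from the narration
        "I took my cat for a walk",     # from the narration
        "Today is a sunny day",         # placeholder third sentence (assumption)
    ]

    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Each sentence yields several token embeddings, not a single sentence vector yet
    print(outputs.last_hidden_state.shape)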
39
00:01:38,940 --> 00:01:42,060
为了解决这个问题,我们可以使用一种称为 pooling 的技术。
To deal with this, we can use a technique called pooling.

40
00:01:42,060 --> 00:01:43,050
最简单的 pooling 方法
The simplest pooling method

41
00:01:43,050 --> 00:01:44,520
就是直接取特殊 CLS 词元
is to just take the token embedding

42
00:01:44,520 --> 00:01:46,203
对应的词元嵌入。
of the special CLS token.

43
00:01:47,100 --> 00:01:49,650
或者,我们可以对词元嵌入进行平均,
Alternatively, we can average the token embeddings

44
00:01:49,650 --> 00:01:52,500
这就是所谓的 mean_pooling,也就是我们在这里所做的。
which is called mean_pooling and this is what we do here.

45
00:01:53,370 --> 00:01:55,800
使用 mean_pooling 时我们唯一需要确保的事情
With mean_pooling the only thing we need to make sure

46
00:01:55,800 --> 00:01:58,410
是不要把 padding 词元算进平均值里,
is that we don't include the padding tokens in the average,

47
00:01:58,410 --> 00:02:01,860
这就是为什么你可以看到这里用到了 attention_mask。
which is why you can see the attention_mask being used here.

48
00:02:01,860 --> 00:02:05,100
这为每个句子提供了一个 384 维向量,
This gives us a 384 dimensional vector for each sentence

49
00:02:05,100 --> 00:02:06,600
这正是我们想要的。
which is exactly what we want.

50
00:02:07,920 --> 00:02:09,810
一旦我们有了句子嵌入,
And once we have our sentence embeddings,

51
00:02:09,810 --> 00:02:11,730
我们就可以针对每对向量
we can compute the cosine similarity

52
00:02:11,730 --> 00:02:13,113
计算余弦相似度。
for each pair of vectors.

53
00:02:13,993 --> 00:02:16,350
在此示例中,我们使用 scikit-learn 中的函数,
In this example we use the function from scikit-learn

54
00:02:16,350 --> 00:02:19,140
你可以看到 “I took my dog for a walk” 这句话
and you can see that the sentence "I took my dog for a walk"

55
00:02:19,140 --> 00:02:22,140
确实与 “I took my cat for a walk” 高度相似。
has indeed a strong overlap with "I took my cat for a walk".

56
00:02:22,140 --> 00:02:23,240
万岁!我们做到了。
Hooray! We've done it.

57
00:02:25,110 --> 00:02:27,180
我们实际上可以将这个想法更进一步,
We can actually take this idea one step further

58
00:02:27,180 --> 00:02:29,220
通过比较一个问题和一个文档语料库
by comparing the similarity between a question

59
00:02:29,220 --> 00:02:31,170
之间的相似性。
and a corpus of documents.

60
00:02:31,170 --> 00:02:33,810
例如,假设我们对 Hugging Face 论坛中的
For example, suppose we embed every post

61
00:02:33,810 --> 00:02:35,430
每个帖子都计算了嵌入。
in the Hugging Face forums.

62
00:02:35,430 --> 00:02:37,800
然后我们可以提出一个问题,对它进行嵌入,
We can then ask a question, embed it,

63
00:02:37,800 --> 00:02:40,590
并检查哪些论坛帖子与它最相似。
and check which forum posts are most similar.

64
00:02:40,590 --> 00:02:42,750
这个过程通常称为语义搜索,
This process is often called semantic search,

65
00:02:42,750 --> 00:02:45,423
因为它允许我们将查询与上下文进行比较。
because it allows us to compare queries with context.

66
00:02:47,040 --> 00:02:48,450
使用 datasets 库
To create a semantic search engine

67
00:02:48,450 --> 00:02:51,030
创建语义搜索引擎其实非常简单。
is actually quite simple in the datasets library.

68
00:02:51,030 --> 00:02:53,340
首先我们需要嵌入所有文档。
First we need to embed all the documents.

69
00:02:53,340 --> 00:02:56,070
在这个例子中,我们取了
And in this example, we take a small sample

70
00:02:56,070 --> 00:02:57,780
一个来自 squad 数据集的小样本,
from the squad dataset and apply

71
00:02:57,780 --> 00:03:00,180
并应用与之前相同的嵌入逻辑。
the same embedding logic as before.

72
00:03:00,180 --> 00:03:02,280
这为我们提供了一个名为 embeddings 的新列,
This gives us a new column called embeddings,

73
00:03:02,280 --> 00:03:04,530
它存储每个段落的嵌入。
which stores the embeddings of every passage.

74
00:03:05,880 --> 00:03:07,260
一旦我们有了嵌入,
Once we have our embeddings,

75
00:03:07,260 --> 00:03:10,200
我们就需要一种方法来为查询找到最近邻。
we need a way to find nearest neighbors for a query.
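[Editor's note] A minimal sketch of the mean pooling and cosine similarity steps described above, continuing from the previous snippet. The helper name mean_pooling mirrors the narration; using the attention_mask keeps padding tokens out of the average, and scikit-learn's cosine_similarity computes the pairwise scores.

    # Sketch: mean pooling over tokens, masking out padding, then cosine similarity
    import torch
    from sklearn.metrics.pairwise import cosine_similarity

    def mean_pooling(model_output, attention_mask):
        token_embeddings = model_output.last_hidden_state
        mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        # Sum only the real tokens, then divide by the number of real tokens
        return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

    sentence_embeddings = mean_pooling(outputs, inputs["attention_mask"])
    print(sentence_embeddings.shape)  # one 384-dim vector per sentence

    # Pairwise cosine similarities; the dog/cat sentences should score highest together
    scores = cosine_similarity(sentence_embeddings.numpy())
    print(scores)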
76
00:03:10,200 --> 00:03:13,170
datasets 库提供了一个名为 FAISS 的特殊对象,
The datasets library provides a special object called FAISS

77
00:03:13,170 --> 00:03:16,080
它可以让你快速比较嵌入向量。
which allows you to quickly compare embedding vectors.

78
00:03:16,080 --> 00:03:19,950
所以我们添加 FAISS 索引,嵌入一个问题,瞧,
So we add the FAISS index, embed a question and voila,

79
00:03:19,950 --> 00:03:21,870
我们现在就找到了 3 篇最相似的文章,
we've now found the 3 most similar articles

80
00:03:21,870 --> 00:03:23,320
其中可能包含答案。
which might store the answer.

81
00:03:25,182 --> 00:03:27,849
(欢快的音乐)
(upbeat music)
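[Editor's note] A minimal sketch of the FAISS-based nearest-neighbor search in the datasets library. Here embeddings_dataset is assumed to be a Dataset with an "embeddings" column computed with the same mean-pooling logic as above, embed_text is a hypothetical helper returning a 1-D NumPy array for one string, and the example question is illustrative rather than taken from the video.

    # Sketch: add a FAISS index over the "embeddings" column, then query it
    embeddings_dataset.add_faiss_index(column="embeddings")

    question = "When was the first Nobel Prize awarded?"  # example query (assumption)
    question_embedding = embed_text(question)             # hypothetical helper

    # Retrieve the 3 nearest passages, which might contain the answer
    scores, samples = embeddings_dataset.get_nearest_examples(
        "embeddings", question_embedding, k=3
    )
    for score, title in zip(scores, samples["title"]):
        print(score, title)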