1
00:00:00,621 --> 00:00:03,204
(upbeat music)
2
00:00:05,670 --> 00:00:08,520
- Text embeddings and semantic search.
3
00:00:08,520 --> 00:00:10,770
In this video, we'll explore
how Transformer models
4
00:00:10,770 --> 00:00:12,810
represent text as embedding vectors
5
00:00:12,810 --> 00:00:15,420
and how these vectors can be
used to find similar documents
6
00:00:15,420 --> 00:00:16,293
in a corpus.
7
00:00:17,730 --> 00:00:19,890
Text embeddings are just
a fancy way of saying
8
00:00:19,890 --> 00:00:22,170
that we can represent text
as an array of numbers
9
00:00:22,170 --> 00:00:23,640
called a vector.
10
00:00:23,640 --> 00:00:25,710
To create these embeddings, we usually use
11
00:00:25,710 --> 00:00:27,393
an encoder-based model like BERT.
12
00:00:28,530 --> 00:00:31,290
In this example, you can see
how we feed three sentences
13
00:00:31,290 --> 00:00:34,830
to the encoder and get
three vectors as the output.
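Roughly, that step might look like this in code (a sketch; the checkpoint name and sentences are illustrative assumptions, chosen to match the 384-dimensional model described later):

```python
# Sketch: feed three sentences to a BERT-style encoder.
# The checkpoint is an assumption; any encoder model works similarly.
import torch
from transformers import AutoTokenizer, AutoModel

model_ckpt = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

sentences = [
    "I took my dog for a walk",
    "Today is going to rain",
    "I took my cat for a walk",
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One hidden state per token: [batch_size, seq_len, hidden_size]
print(outputs.last_hidden_state.shape)
```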
14
00:00:34,830 --> 00:00:37,050
Reading the text, we can
see that walking the dog
15
00:00:37,050 --> 00:00:39,450
seems to be most similar
to walking the cat,
16
00:00:39,450 --> 00:00:41,350
but let's see if we can quantify this.
17
00:00:42,810 --> 00:00:44,040
The trick to doing the comparison
18
00:00:44,040 --> 00:00:45,630
is to compute a similarity metric
19
00:00:45,630 --> 00:00:48,210
between each pair of embedding vectors.
20
00:00:48,210 --> 00:00:51,120
These vectors usually live in
a very high-dimensional space,
21
00:00:51,120 --> 00:00:53,190
so a similarity metric can
be anything that measures
22
00:00:53,190 --> 00:00:55,740
some sort of distance between vectors.
23
00:00:55,740 --> 00:00:58,560
One very popular metric
is cosine similarity,
24
00:00:58,560 --> 00:01:00,390
which uses the angle between two vectors
25
00:01:00,390 --> 00:01:02,610
to measure how close they are.
26
00:01:02,610 --> 00:01:05,250
In this example, our
embedding vectors live in 3D
27
00:01:05,250 --> 00:01:07,110
and we can see that the
orange and gray vectors
28
00:01:07,110 --> 00:01:09,560
are close to each other
and have a smaller angle.
29
00:01:11,130 --> 00:01:12,510
Now one problem we have to deal with
30
00:01:12,510 --> 00:01:15,180
is that Transformer models
like BERT will actually return
31
00:01:15,180 --> 00:01:16,983
one embedding vector per token.
32
00:01:17,880 --> 00:01:20,700
For example, in the sentence,
"I took my dog for a walk,"
33
00:01:20,700 --> 00:01:23,853
we can expect several embedding
vectors, one for each word.
34
00:01:25,110 --> 00:01:27,870
For example, here we can
see that our model
35
00:01:27,870 --> 00:01:30,540
has produced 9 embedding
vectors per sentence,
36
00:01:30,540 --> 00:01:33,750
and each vector has 384 dimensions.
37
00:01:33,750 --> 00:01:36,210
But what we really want is
a single embedding vector
38
00:01:36,210 --> 00:01:37,353
for each sentence.
39
00:01:38,940 --> 00:01:42,060
To deal with this, we can use
a technique called pooling.
40
00:01:42,060 --> 00:01:43,050
The simplest pooling method
41
00:01:43,050 --> 00:01:44,520
is to just take the token embedding
42
00:01:44,520 --> 00:01:46,203
of the special CLS token.
43
00:01:47,100 --> 00:01:49,650
Alternatively, we can
average the token embeddings
44
00:01:49,650 --> 00:01:52,500
which is called mean pooling,
and this is what we do here.
45
00:01:53,370 --> 00:01:55,800
With mean pooling, the only
thing we need to make sure of
46
00:01:55,800 --> 00:01:58,410
is that we don't include the
padding tokens in the average,
47
00:01:58,410 --> 00:02:01,860
which is why you can see the
attention_mask being used here.
48
00:02:01,860 --> 00:02:05,100
This gives us a 384-dimensional
vector for each sentence
49
00:02:05,100 --> 00:02:06,600
which is exactly what we want.
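A minimal sketch of mean pooling with the attention mask, following the standard sentence-transformers recipe (it reuses `inputs` and `outputs` from the encoding sketch above):

```python
# Sketch: average the token embeddings, masking out padding tokens.
import torch

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state  # [batch, seq_len, hidden]
    # Expand the mask so padded positions contribute zero to the sum.
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    summed = torch.sum(token_embeddings * mask, dim=1)
    counts = torch.clamp(mask.sum(dim=1), min=1e-9)
    return summed / counts  # one vector per sentence: [batch, hidden]

sentence_embeddings = mean_pooling(outputs, inputs["attention_mask"])
print(sentence_embeddings.shape)  # e.g. torch.Size([3, 384])
```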
50
00:02:07,920 --> 00:02:09,810
And once we have our sentence embeddings,
51
00:02:09,810 --> 00:02:11,730
we can compute the cosine similarity
52
00:02:11,730 --> 00:02:13,113
for each pair of vectors.
53
00:02:13,993 --> 00:02:16,350
In this example, we use the
cosine_similarity function from scikit-learn
54
00:02:16,350 --> 00:02:19,140
and you can see that the sentence
"I took my dog for a walk"
55
00:02:19,140 --> 00:02:22,140
indeed has a strong overlap
with "I took my cat for a walk".
56
00:02:22,140 --> 00:02:23,240
Hooray! We've done it.
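That comparison might look like this with scikit-learn (a sketch reusing `sentence_embeddings` from the pooling step):

```python
# Sketch: pairwise cosine similarity between the three sentence embeddings.
from sklearn.metrics.pairwise import cosine_similarity

scores = cosine_similarity(sentence_embeddings.numpy())
print(scores.round(2))
# The dog/cat "walk" sentences should get the highest off-diagonal score.
```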
57
00:02:25,110 --> 00:02:27,180
We can actually take this
idea one step further
58
00:02:27,180 --> 00:02:29,220
by comparing the similarity
between a question
59
00:02:29,220 --> 00:02:31,170
and a corpus of documents.
60
00:02:31,170 --> 00:02:33,810
For example, suppose we embed every post
61
00:02:33,810 --> 00:02:35,430
in the Hugging Face forums.
62
00:02:35,430 --> 00:02:37,800
We can then ask a question, embed it,
63
00:02:37,800 --> 00:02:40,590
and check which forum
posts are most similar.
64
00:02:40,590 --> 00:02:42,750
This process is often
called semantic search,
65
00:02:42,750 --> 00:02:45,423
because it allows us to
compare queries with context.
66
00:02:47,040 --> 00:02:48,450
Creating a semantic search engine
67
00:02:48,450 --> 00:02:51,030
is actually quite simple
with the datasets library.
68
00:02:51,030 --> 00:02:53,340
First we need to embed all the documents.
69
00:02:53,340 --> 00:02:56,070
And in this example,
we take a small sample
70
00:02:56,070 --> 00:02:57,780
from the SQuAD dataset and apply
71
00:02:57,780 --> 00:03:00,180
the same embedding logic as before.
72
00:03:00,180 --> 00:03:02,280
This gives us a new
column called embeddings,
73
00:03:02,280 --> 00:03:04,530
which stores the embeddings
of every passage.
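A hedged sketch of that embedding step with the datasets library (the dataset split, sample size, and batch size are my own choices; it reuses the tokenizer, model, and mean_pooling from the sketches above):

```python
# Sketch: embed a small sample of SQuAD passages with Dataset.map.
from datasets import load_dataset

squad = load_dataset("squad", split="validation").shuffle(seed=42).select(range(100))

def embed(batch):
    inputs = tokenizer(batch["context"], padding=True, truncation=True,
                       return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return {"embeddings": mean_pooling(outputs, inputs["attention_mask"]).numpy()}

# Adds an "embeddings" column holding one vector per passage.
squad = squad.map(embed, batched=True, batch_size=16)
```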
74
00:03:05,880 --> 00:03:07,260
Once we have our embeddings,
75
00:03:07,260 --> 00:03:10,200
we need a way to find nearest
neighbors for a query.
76
00:03:10,200 --> 00:03:13,170
The datasets library provides
a special object called a FAISS index
77
00:03:13,170 --> 00:03:16,080
which allows you to quickly
compare embedding vectors.
78
00:03:16,080 --> 00:03:19,950
So we add the FAISS index,
embed a question and voila,
79
00:03:19,950 --> 00:03:21,870
we've now found the 3
most similar articles
80
00:03:21,870 --> 00:03:23,320
which might contain the answer.
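Those last steps might look like this (a sketch; `add_faiss_index` and `get_nearest_examples` are the datasets methods for this, and the question text is made up):

```python
# Sketch: build a FAISS index over the embeddings column and query it.
squad.add_faiss_index(column="embeddings")

question = "Why are model embeddings useful?"  # made-up example question
q_inputs = tokenizer(question, return_tensors="pt")
with torch.no_grad():
    q_outputs = model(**q_inputs)
q_embedding = mean_pooling(q_outputs, q_inputs["attention_mask"]).numpy()

# Retrieve the 3 most similar passages to the question embedding.
scores, samples = squad.get_nearest_examples("embeddings", q_embedding, k=3)
for score, context in zip(scores, samples["context"]):
    print(round(float(score), 2), context[:80])
```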
81
00:03:25,182 --> 00:03:27,849
(upbeat music)