
{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Vector Search on Hugging Face with the Hub as Backend\n", "\n", "Datasets on the Hugging Face Hub rely on parquet files. We can [interact with these files using DuckDB](https://huggingface.co/docs/hub/en/datasets-duckdb) as a fast in-memory database system. One of DuckDB's features is [vector similarity search](https://duckdb.org/docs/extensions/vss.html) which can be used with or without an index. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install dependencies" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install datasets duckdb sentence-transformers model2vec -q" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create embeddings for the dataset\n", "\n", "First, we need to create embeddings for the dataset to search over. We will use the `sentence-transformers` library to create embeddings for the dataset." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from sentence_transformers import SentenceTransformer\n", "from sentence_transformers.models import StaticEmbedding\n", "\n", "static_embedding = StaticEmbedding.from_model2vec(\"minishlab/potion-base-8M\")\n", "model = SentenceTransformer(modules=[static_embedding])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's load the [ai-blueprint/fineweb-bbc-news](https://huggingface.co/datasets/ai-blueprint/fineweb-bbc-news) dataset from the Hub. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from datasets import load_dataset\n", "\n", "ds = load_dataset(\"ai-blueprint/fineweb-bbc-news\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now create embeddings for the dataset. Normally, we might want to chunk our data into smaller batches to avoid losing precision, but for this example, we will just create embeddings for the full text of the dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def create_embeddings(batch):\n", " embeddings = model.encode(batch[\"text\"], convert_to_numpy=True)\n", " batch[\"embeddings\"] = embeddings.tolist()\n", " return batch\n", "\n", "ds = ds.map(create_embeddings, batched=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now upload our dataset with embeddings back to the Hub." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds.push_to_hub(\"ai-blueprint/fineweb-bbc-news-embeddings\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Vector Search the Hugging Face Hub\n", "\n", "We can now perform vector search on the dataset using `duckdb`. When doing so, we can either use an index or not. Searching **without** an index is slower but more precise, whereas searching **with** an index is faster but less precise. \n", "\n", "### Without an index\n", "\n", "To search without an index, we can use the `duckdb` library to connect to the dataset and perform a vector search. This is a slow operation, but normally works quick enough for small datasets up to let's say 100k rows. Meaning querying our dataset will be somewhat slower." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "85f5cca028da4602852f413d09697e8e", "version_major": 2, "version_minor": 0 }, "text/plain": [ "FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>url</th>\n", " <th>text</th>\n", " <th>embeddings</th>\n", " <th>distance</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>https://www.bbc.com/news/technology-51064369</td>\n", " <td>The last decade was a big one for artificial i...</td>\n", " <td>[-0.9206902980804443, 0.783940315246582, -2.00...</td>\n", " <td>0.281200</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>http://www.bbc.com/news/technology-25000756</td>\n", " <td>Singularity: The robots are coming to steal ou...</td>\n", " <td>[-1.5080468654632568, 0.7840216755867004, -1.1...</td>\n", " <td>0.365842</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>http://www.bbc.co.uk/news/technology-25000756</td>\n", " <td>Singularity: The robots are coming to steal ou...</td>\n", " <td>[-1.5080468654632568, 0.7840216755867004, -1.1...</td>\n", " <td>0.365842</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>https://www.bbc.co.uk/news/technology-37494863</td>\n", " <td>Google, Facebook, Amazon join forces on future...</td>\n", " <td>[-0.38261985778808594, 1.6644303798675537, -1....</td>\n", " <td>0.380820</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>https://www.bbc.co.uk/news/technology-37494863</td>\n", " <td>Google, Facebook, Amazon join forces on future...</td>\n", " <td>[-0.38261985778808594, 1.6644303798675537, -1....</td>\n", " <td>0.380820</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " url \\\n", "0 https://www.bbc.com/news/technology-51064369 \n", "1 http://www.bbc.com/news/technology-25000756 \n", "2 http://www.bbc.co.uk/news/technology-25000756 \n", "3 https://www.bbc.co.uk/news/technology-37494863 \n", "4 https://www.bbc.co.uk/news/technology-37494863 \n", "\n", " text \\\n", "0 The last decade was a big one for artificial i... \n", "1 Singularity: The robots are coming to steal ou... \n", "2 Singularity: The robots are coming to steal ou... \n", "3 Google, Facebook, Amazon join forces on future... \n", "4 Google, Facebook, Amazon join forces on future... \n", "\n", " embeddings distance \n", "0 [-0.9206902980804443, 0.783940315246582, -2.00... 0.281200 \n", "1 [-1.5080468654632568, 0.7840216755867004, -1.1... 0.365842 \n", "2 [-1.5080468654632568, 0.7840216755867004, -1.1... 0.365842 \n", "3 [-0.38261985778808594, 1.6644303798675537, -1.... 0.380820 \n", "4 [-0.38261985778808594, 1.6644303798675537, -1.... 
0.380820 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import duckdb\n", "from typing import List\n", "\n", "def similarity_search_without_duckdb_index(\n", " query: str,\n", " k: int = 5,\n", " dataset_name: str = \"ai-blueprint/fineweb-bbc-news-embeddings\",\n", " embedding_column: str = \"embeddings\",\n", "): \n", " # Use same model as used for indexing\n", " query_vector = model.encode(query)\n", " embedding_dim = model.get_sentence_embedding_dimension()\n", "\n", " sql = f\"\"\"\n", " SELECT \n", " *,\n", " array_cosine_distance(\n", " {embedding_column}::float[{embedding_dim}], \n", " {query_vector.tolist()}::float[{embedding_dim}]\n", " ) as distance\n", " FROM 'hf://datasets/{dataset_name}/**/*.parquet'\n", " ORDER BY distance\n", " LIMIT {k}\n", " \"\"\"\n", " return duckdb.sql(sql).to_df()\n", "\n", "similarity_search_without_duckdb_index(\"What is the future of AI?\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### With an index\n", "\n", "This approach creates a local copy of the dataset and uses this to create an index. This has some minor overhead but it will significantly speed up the search once you've created it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import duckdb\n", "\n", "def _setup_vss():\n", " duckdb.sql(\n", " query=\"\"\"\n", " INSTALL vss;\n", " LOAD vss;\n", " \"\"\"\n", " )\n", "def _drop_table(table_name):\n", " duckdb.sql(\n", " query=f\"\"\"\n", " DROP TABLE IF EXISTS {table_name};\n", " \"\"\"\n", " )\n", "\n", "def _create_table(dataset_name, table_name, embedding_column):\n", " duckdb.sql(\n", " query=f\"\"\"\n", " CREATE TABLE {table_name} AS \n", " SELECT *, {embedding_column}::float[{model.get_sentence_embedding_dimension()}] as {embedding_column}_float \n", " FROM 'hf://datasets/{dataset_name}/**/*.parquet';\n", " \"\"\"\n", " )\n", "\n", "def _create_index(table_name, embedding_column):\n", " duckdb.sql(\n", " query=f\"\"\"\n", " CREATE INDEX my_hnsw_index ON {table_name} USING HNSW ({embedding_column}_float) WITH (metric = 'cosine');\n", " \"\"\"\n", " )\n", "\n", "def create_index(dataset_name, table_name, embedding_column):\n", " _setup_vss()\n", " _drop_table(table_name)\n", " _create_table(dataset_name, table_name, embedding_column)\n", " _create_index(table_name, embedding_column)\n", "\n", "create_index(\n", " dataset_name=\"ai-blueprint/fineweb-bbc-news-embeddings\",\n", " table_name=\"fineweb_bbc_news_embeddings\",\n", " embedding_column=\"embeddings\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can perform a vector search with the index, which return the results instantly. 
" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>url</th>\n", " <th>text</th>\n", " <th>embeddings</th>\n", " <th>embeddings_float</th>\n", " <th>distance</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>https://www.bbc.com/news/technology-51064369</td>\n", " <td>The last decade was a big one for artificial i...</td>\n", " <td>[-0.9206902980804443, 0.783940315246582, -2.00...</td>\n", " <td>[-0.9206903, 0.7839403, -2.0030282, -0.889843,...</td>\n", " <td>0.281200</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>http://www.bbc.co.uk/news/technology-25000756</td>\n", " <td>Singularity: The robots are coming to steal ou...</td>\n", " <td>[-1.5080468654632568, 0.7840216755867004, -1.1...</td>\n", " <td>[-1.5080469, 0.7840217, -1.1112142, -0.8743323...</td>\n", " <td>0.365842</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>http://www.bbc.com/news/technology-25000756</td>\n", " <td>Singularity: The robots are coming to steal ou...</td>\n", " <td>[-1.5080468654632568, 0.7840216755867004, -1.1...</td>\n", " <td>[-1.5080469, 0.7840217, -1.1112142, -0.8743323...</td>\n", " <td>0.365842</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>https://www.bbc.co.uk/news/technology-37494863</td>\n", " <td>Google, Facebook, Amazon join forces on future...</td>\n", " <td>[-0.38261985778808594, 1.6644303798675537, -1....</td>\n", " <td>[-0.38261986, 1.6644304, -1.8754442, -0.761026...</td>\n", " <td>0.380820</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>https://www.bbc.co.uk/news/technology-37494863</td>\n", " <td>Google, Facebook, Amazon join forces on future...</td>\n", " <td>[-0.38261985778808594, 1.6644303798675537, -1....</td>\n", " <td>[-0.38261986, 1.6644304, -1.8754442, -0.761026...</td>\n", " <td>0.380820</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " url \\\n", "0 https://www.bbc.com/news/technology-51064369 \n", "1 http://www.bbc.co.uk/news/technology-25000756 \n", "2 http://www.bbc.com/news/technology-25000756 \n", "3 https://www.bbc.co.uk/news/technology-37494863 \n", "4 https://www.bbc.co.uk/news/technology-37494863 \n", "\n", " text \\\n", "0 The last decade was a big one for artificial i... \n", "1 Singularity: The robots are coming to steal ou... \n", "2 Singularity: The robots are coming to steal ou... \n", "3 Google, Facebook, Amazon join forces on future... \n", "4 Google, Facebook, Amazon join forces on future... \n", "\n", " embeddings \\\n", "0 [-0.9206902980804443, 0.783940315246582, -2.00... \n", "1 [-1.5080468654632568, 0.7840216755867004, -1.1... \n", "2 [-1.5080468654632568, 0.7840216755867004, -1.1... \n", "3 [-0.38261985778808594, 1.6644303798675537, -1.... \n", "4 [-0.38261985778808594, 1.6644303798675537, -1.... \n", "\n", " embeddings_float distance \n", "0 [-0.9206903, 0.7839403, -2.0030282, -0.889843,... 0.281200 \n", "1 [-1.5080469, 0.7840217, -1.1112142, -0.8743323... 0.365842 \n", "2 [-1.5080469, 0.7840217, -1.1112142, -0.8743323... 0.365842 \n", "3 [-0.38261986, 1.6644304, -1.8754442, -0.761026... 
0.380820 \n", "4 [-0.38261986, 1.6644304, -1.8754442, -0.761026... 0.380820 " ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def similarity_search_with_duckdb_index(\n", " query: str,\n", " k: int = 5,\n", " table_name: str = \"fineweb_bbc_news_embeddings\",\n", " embedding_column: str = \"embeddings\"\n", "):\n", " embedding = model.encode(query).tolist()\n", " return duckdb.sql(\n", " query=f\"\"\"\n", " SELECT *, array_cosine_distance({embedding_column}_float, {embedding}::FLOAT[{model.get_sentence_embedding_dimension()}]) as distance \n", " FROM {table_name}\n", " ORDER BY distance \n", " LIMIT {k};\n", " \"\"\"\n", " ).to_df()\n", "\n", "similarity_search_with_duckdb_index(\"What is the future of AI?\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The query reduces from 30 seconds to sub-second response times and does not require you to deploy a heavy-weight vector search engine, while storage is handled by the Hub." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "We have seen how to perform vector search on the Hub using `duckdb`. For small datasets <100k rows, we can perform vector search without an index using the Hub as a vector search backend, but for larger datasets, we should create an index with the `vss` extension while doing local search and using the Hub as a storage backend. \n", "\n", "## Learn more\n", "\n", "- [Vector Search on Hugging Face](https://huggingface.co/docs/hub/en/datasets-duckdb)\n", "- [Vector Search Indexing with DuckDB](https://duckdb.org/docs/extensions/vss.html)" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.11" } }, "nbformat": 4, "nbformat_minor": 2 }