notebooks/integrations/hugging-face/loading-model-from-hugging-face.ipynb (411 lines of code) (raw):
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"# NLP text search using hugging face transformer model\n",
"\n",
"[](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/integrations/hugging-face/loading-model-from-hugging-face.ipynb)\n",
"\n",
"The workbook implements NLP text search in Elasticsearch using a simple dataset consisting of Elastic blogs titles.\n",
"\n",
"You will index blogs documents, and using ingest pipeline generate text embeddings. By using NLP model you will query the documents using natural language over the the blogs documents.\n",
"\n",
"\n",
"## Prerequisities\n",
"Before we begin, create an elastic cloud deployment and [autoscale](https://www.elastic.co/guide/en/cloud/current/ec-autoscaling.html) to have least one machine learning (ML) node with enough (4GB) memory. Also ensure that the Elasticsearch cluster is running. \n",
"\n",
"If you don't already have an Elastic deployment, you can sign up for a free [Elastic Cloud trial](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook).\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zdzl8tmZfr3y"
},
"source": [
"## Install packages and import modules\n",
"Before you start you need to install all required Python dependencies."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "9NM_6fGFURcz",
"outputId": "53f1a78c-db7f-468e-c6af-bdf9be554473"
},
"outputs": [],
"source": [
"!python3 -m pip install sentence-transformers==2.7.0 \"eland<9\" \"elasticsearch<9\" transformers"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# import modules\n",
"from elasticsearch import Elasticsearch\n",
"from getpass import getpass\n",
"from urllib.request import urlopen\n",
"import json\n",
"from time import sleep"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "vKU9L8o2FodV"
},
"source": [
"## Deploy an NLP model\n",
"\n",
"We are using the [`eland`](https://www.elastic.co/guide/en/elasticsearch/client/eland/current/overview.html) tool to install a `text_embedding` model. For our model, We have used [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) to transform the search text into the dense vector. \n",
"\n",
"The model will transfer your search query into vector which will be used for the search over the set of documents stored in Elasticsearch. \n",
"\n",
"\n",
"## Install text embedding NLP model\n",
"\n",
"Using the [`eland_import_hub_model`](https://www.elastic.co/guide/en/elasticsearch/client/eland/current/machine-learning.html#ml-nlp-pytorch) script, download and install `all-MiniLM-L6-v2` transformer model. Setting the NLP `--task-type` as `text_embedding`. \n",
"\n",
"To get the cloud id, go to [Elastic cloud](https://cloud.elastic.co) and `On the deployment overview page, copy down the Cloud ID.`\n",
"\n",
"To authenticate your request, You could use [API key](https://www.elastic.co/guide/en/kibana/current/api-keys.html#create-api-key). Alternatively, you can use your cloud deployment username and password.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "vQkKn02j_FfJ",
"outputId": "99b9ffd4-6780-4167-dbd7-e0264369e40e"
},
"outputs": [],
"source": [
"# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id\n",
"ELASTIC_CLOUD_ID = getpass(\"Elastic Cloud ID: \")\n",
"\n",
"# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key\n",
"ELASTIC_API_KEY = getpass(\"Elastic Api Key: \")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!eland_import_hub_model --cloud-id $ELASTIC_CLOUD_ID --hub-model-id sentence-transformers/all-MiniLM-L6-v2 --task-type text_embedding --es-api-key $ELASTIC_API_KEY --start --clear-previous"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MGfkUDWDMkc4"
},
"source": [
"## Connect to Elasticsearch cluster\n",
"\n",
"Create a elasticsearch client instance with your deployment `Cloud Id` and `API Key`. In this example, we are using the `API_KEY` and `CLOUD_ID` value from previous step. \n",
"\n",
"Alternately you could use your deployment `Username` and `Password` to authenticate your instance."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "XGi175RbJhVQ",
"outputId": "14e7f403-06ba-4fa7-a3f4-4bcc271a1f5d"
},
"outputs": [],
"source": [
"es = Elasticsearch(\n",
" cloud_id=ELASTIC_CLOUD_ID, api_key=ELASTIC_API_KEY, request_timeout=600\n",
")\n",
"\n",
"es.info() # should return cluster info"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8FoZ5TBrIqOT"
},
"source": [
"## Create an Ingest pipeline\n",
"\n",
"We need to create a text embedding ingest pipeline to generate vector (text) embeddings for `title` field.\n",
"\n",
"The pipeline below is defining a processor for the [inference](https://www.elastic.co/guide/en/elasticsearch/reference/current/inference-processor.html) to the NLP model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "geY7WLh7Ky-k",
"outputId": "97c17b36-94d6-454f-b976-4c20e8e49edc"
},
"outputs": [],
"source": [
"# ingest pipeline definition\n",
"PIPELINE_ID = \"vectorize_blogs\"\n",
"\n",
"es.ingest.put_pipeline(\n",
" id=PIPELINE_ID,\n",
" processors=[\n",
" {\n",
" \"inference\": {\n",
" \"model_id\": \"sentence-transformers__all-minilm-l6-v2\",\n",
" \"target_field\": \"text_embedding\",\n",
" \"field_map\": {\"title\": \"text_field\"},\n",
" }\n",
" }\n",
" ],\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "IW-GIlH2OxB4"
},
"source": [
"## Create Index with mappings\n",
"\n",
"We will now create an elasticsearch index with correct mapping before we index documents. \n",
"We are adding `text_embedding` to include the `model_id` and `predicted_value` to store the embeddings.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "xAkc1OVcOxy3",
"outputId": "b2453634-89b8-48bc-ac65-a6a1c3b8170f"
},
"outputs": [],
"source": [
"# define index name\n",
"INDEX_NAME = \"blogs\"\n",
"\n",
"# flag to check if index has to be deleted before creating\n",
"SHOULD_DELETE_INDEX = True\n",
"\n",
"# define index mapping\n",
"INDEX_MAPPING = {\n",
" \"properties\": {\n",
" \"title\": {\n",
" \"type\": \"text\",\n",
" \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n",
" },\n",
" \"text_embedding\": {\n",
" \"properties\": {\n",
" \"is_truncated\": {\"type\": \"boolean\"},\n",
" \"model_id\": {\n",
" \"type\": \"text\",\n",
" \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n",
" },\n",
" \"predicted_value\": {\n",
" \"type\": \"dense_vector\",\n",
" \"dims\": 384,\n",
" \"index\": True,\n",
" \"similarity\": \"l2_norm\",\n",
" },\n",
" }\n",
" },\n",
" }\n",
"}\n",
"\n",
"INDEX_SETTINGS = {\n",
" \"index\": {\n",
" \"number_of_replicas\": \"1\",\n",
" \"number_of_shards\": \"1\",\n",
" \"default_pipeline\": PIPELINE_ID,\n",
" }\n",
"}\n",
"\n",
"# check if we want to delete index before creating the index\n",
"if SHOULD_DELETE_INDEX:\n",
" if es.indices.exists(index=INDEX_NAME):\n",
" print(\"Deleting existing %s\" % INDEX_NAME)\n",
" es.indices.delete(index=INDEX_NAME, ignore=[400, 404])\n",
"\n",
"print(\"Creating index %s\" % INDEX_NAME)\n",
"es.indices.create(\n",
" index=INDEX_NAME, mappings=INDEX_MAPPING, settings=INDEX_SETTINGS, ignore=[400, 404]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WOGsvnGveAoP"
},
"source": [
"## Index data to elasticsearch index\n",
"\n",
"Let's index sample blogs data using the ingest pipeline. \n",
"\n",
"Note: Before we begin indexing, ensure you have [started your trained model deployment](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-deploy-model.html)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"url = \"https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/notebooks/integrations/hugging-face/blogs.json\"\n",
"response = urlopen(url)\n",
"titles = json.loads(response.read())\n",
"\n",
"actions = []\n",
"for title in titles:\n",
" actions.append({\"index\": {\"_index\": \"blogs\"}})\n",
" actions.append(title)\n",
"es.bulk(index=\"blogs\", operations=actions)\n",
"sleep(5)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xPPHg8K8T3wY"
},
"source": [
"## Querying the dataset\n",
"The next step is to run a query to search for relevant blogs. The example query searches for `\"model_text\": \"how to track network connections\"` using the model we uploaded to Elasticsearch `sentence-transformers__all-minilm-l6-v2`.\n",
"\n",
"The process is one query even though it internally consists of two tasks. First, the query will generate an vector for your search text using the NLP model and then pass this vector to search over the dataset.\n",
"\n",
"As a result, the output shows the list of query documents sorted by their proximity to the search query. \n"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 566
},
"id": "c4G5V9wmU9C5",
"outputId": "c8f0cc24-5713-4560-8a5d-c42da562a670"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Brewing in Beats: Track network connections']\n",
"Score: 0.5917864\n",
"\n",
"['Machine Learning for Nginx Logs - Identifying Operational Issues with Your Website']\n",
"Score: 0.40109876\n",
"\n",
"['Data Visualization For Machine Learning']\n",
"Score: 0.39027885\n",
"\n",
"['Logstash Lines: Introduce integration plugins']\n",
"Score: 0.36899462\n",
"\n",
"['Keeping up with Kibana: This week in Kibana for November 29th, 2019']\n",
"Score: 0.35690257\n",
"\n"
]
}
],
"source": [
"INDEX_NAME = \"blogs\"\n",
"\n",
"source_fields = [\"id\", \"title\"]\n",
"\n",
"query = {\n",
" \"field\": \"text_embedding.predicted_value\",\n",
" \"k\": 5,\n",
" \"num_candidates\": 50,\n",
" \"query_vector_builder\": {\n",
" \"text_embedding\": {\n",
" \"model_id\": \"sentence-transformers__all-minilm-l6-v2\",\n",
" \"model_text\": \"how to track network connections\",\n",
" }\n",
" },\n",
"}\n",
"\n",
"response = es.search(index=INDEX_NAME, fields=source_fields, knn=query, source=False)\n",
"\n",
"\n",
"def show_results(results):\n",
" for result in results:\n",
" print(f'{result[\"fields\"][\"title\"]}\\nScore: {result[\"_score\"]}\\n')\n",
"\n",
"\n",
"show_results(response.body[\"hits\"][\"hits\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
},
"vscode": {
"interpreter": {
"hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}