notebooks/search/06-synonyms-api.ipynb (620 lines of code) (raw):
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "87773ce7"
},
"source": [
"# Synonyms API quick start\n",
"\n",
"<a target=\"_blank\" href=\"https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/search/06-synonyms-api.ipynb\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
"\n",
"This interactive notebook will introduce you to the Synonyms API ([blog post](https://www.elastic.co/blog/update-synonyms-elasticsearch-introducing-synonyms-api), [API documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/synonyms-apis.html)) using the official [Elasticsearch Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html). Synonyms allow you to enhance search relevancy by defining relationships between terms that have the similar meanings. In this notebook, you'll create & update synonyms sets, configure an index to use synonyms, and run queries that leverage synonyms for enhanced relevancy."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "a32202e2"
},
"source": [
"## Create Elastic Cloud deployment\n",
"\n",
"If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?onboarding_token=search&utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial.\n",
"\n",
"Once logged in to your Elastic Cloud account, go to the [Create deployment](https://cloud.elastic.co/deployments/create) page and select **Create deployment**. Leave all settings with their default values."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "52a6a607"
},
"source": [
"## Install packages and import modules\n",
"\n",
"To get started, we'll need to connect to our Elastic deployment using the Python client.\n",
"Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.\n",
"\n",
"First we need to install the `elasticsearch` Python client."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ffc5fa6f",
"outputId": "2afe8842-15be-4d34-9e0f-e4de7ffc7a13"
},
"outputs": [],
"source": [
"!pip install -qU \"elasticsearch<9\""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0241694c"
},
"source": [
"## Initialize the Elasticsearch client\n",
"\n",
"Now we can instantiate the [Elasticsearch python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/index.html), providing the cloud id and password in your deployment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "f38e0397",
"outputId": "33239952-fa18-46f0-b4ee-285b0b4054ee"
},
"outputs": [],
"source": [
"from elasticsearch import Elasticsearch\n",
"from getpass import getpass\n",
"\n",
"# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id\n",
"ELASTIC_CLOUD_ID = getpass(\"Elastic Cloud ID: \")\n",
"\n",
"# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key\n",
"ELASTIC_API_KEY = getpass(\"Elastic Api Key: \")\n",
"\n",
"# Create the client instance\n",
"client = Elasticsearch(\n",
" # For local development\n",
" # hosts=[\"http://localhost:9200\"]\n",
" cloud_id=ELASTIC_CLOUD_ID,\n",
" api_key=ELASTIC_API_KEY,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fcd165fa"
},
"source": [
"If you're running Elasticsearch locally or self-managed, you can pass in the Elasticsearch host instead. [Read more](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#_verifying_https_with_certificate_fingerprints_python_3_10_or_later) on how to connect to Elasticsearch locally."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Enable Telemetry\n",
"\n",
"Knowing that you are using this notebook helps us decide where to invest our efforts to improve our products. We would like to ask you that you run the following code to let us gather anonymous usage statistics. See [telemetry.py](https://github.com/elastic/elasticsearch-labs/blob/main/telemetry/telemetry.py) for details. Thank you!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!curl -O -s https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/telemetry/telemetry.py\n",
"from telemetry import enable_telemetry\n",
"\n",
"client = enable_telemetry(client, \"06-synonyms-api\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1462ebd8"
},
"source": [
"### Test the Client\n",
"Before you continue, confirm that the client has connected with this test."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "25c618eb",
"outputId": "9eb26926-d63e-478b-8aa1-8bdb2a5dfbd8"
},
"outputs": [],
"source": [
"print(client.info())"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_ROfAyq7CL60"
},
"source": [
"## Configure & populate the index\n",
"\n",
"Our client is set up and connected to our Elastic deployment. Now we need to configure the index that will store our test data and populate it with some documents. We'll use a small index of books with the following fields:\n",
"\n",
"- `title`\n",
"- `authors`\n",
"- `publish_date`\n",
"- `num_reviews`\n",
"- `publisher`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create synonyms set\n",
"\n",
"Let's create our initial synonyms set first."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"synonyms_set = [{\"id\": \"synonym-1\", \"synonyms\": \"js, javascript, java script\"}]\n",
"\n",
"client.synonyms.put_synonym(id=\"my-synonyms-set\", synonyms_set=synonyms_set)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-phOM4SOFopW"
},
"source": [
"### Configure the index\n",
"\n",
"Ensure that you do not have a previously created index with the name `book_index`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "pIl2dCpJGu1R",
"outputId": "294ae0c4-0cc0-45d8-ffd1-541115fdd31a"
},
"outputs": [],
"source": [
"client.indices.delete(index=\"book_index\", ignore_unavailable=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0fNVJ_JCHe04"
},
"source": [
"🔐 NOTE: at any time you can come back to this section and run the `delete` function above to remove your index and start from scratch."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "IRMTg7siGykU"
},
"source": [
"\n",
"\n",
"In order to use synonyms, we need to define a [custom analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html) that uses the [`synonym`](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html) or [`synonym_graph`](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-graph-tokenfilter.html) token filter. Let's create an index that's configured to use an appropriate custom analyzer.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "4AXB9IR8JjCT",
"outputId": "31d59878-88a8-4294-a727-0271d3890e1c"
},
"outputs": [],
"source": [
"settings = {\n",
" \"analysis\": {\n",
" \"analyzer\": {\n",
" \"my_custom_index_analyzer\": {\n",
" \"tokenizer\": \"standard\",\n",
" \"filter\": [\"lowercase\"],\n",
" },\n",
" \"my_custom_search_analyzer\": {\n",
" \"tokenizer\": \"standard\",\n",
" \"filter\": [\"lowercase\", \"my_synonym_filter\"],\n",
" },\n",
" },\n",
" \"filter\": {\n",
" \"my_synonym_filter\": {\n",
" \"type\": \"synonym_graph\",\n",
" \"synonyms_set\": \"my-synonyms-set\",\n",
" \"updateable\": True,\n",
" }\n",
" },\n",
" }\n",
"}\n",
"\n",
"mappings = {\n",
" \"properties\": {\n",
" \"title\": {\n",
" \"type\": \"text\",\n",
" \"analyzer\": \"my_custom_index_analyzer\",\n",
" \"search_analyzer\": \"my_custom_search_analyzer\",\n",
" },\n",
" \"summary\": {\n",
" \"type\": \"text\",\n",
" \"analyzer\": \"my_custom_index_analyzer\",\n",
" \"search_analyzer\": \"my_custom_search_analyzer\",\n",
" },\n",
" }\n",
"}\n",
"\n",
"client.indices.create(index=\"book_index\", mappings=mappings, settings=settings)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YYa3kdKvJtZW"
},
"source": [
"There are a few things to note in the configuration:\n",
"\n",
"- We are using the [`synonym_graph` token filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-graph-tokenfilter.html).\n",
"- We have defined two analyzers: `my_custom_index_analyzer` and `my_custom_search_analyzer`. `my_custom_search_analyzer` is used as a [search analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html).\n",
"- `my_synonym_filter` is used only in `my_custom_search_analyzer`.\n",
"\n",
"The `synonym_graph` token filter allows us to use multi-word synonyms. However, it is important to apply this filter only at search time, hence why we use it only in `my_custom_search_analyzer`. And since synonyms are only applied at search time, we can update them without reindexing.\n",
"\n",
"See [_The same, but different: Boosting the power of Elasticsearch with synonyms_](https://www.elastic.co/blog/boosting-the-power-of-elasticsearch-with-synonyms) for more background information about search-time synonyms."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "e6uvE1K9GeMm"
},
"source": [
"### Populate the index\n",
"\n",
"Run the following command to upload some test data, containing information about 10 popular programming books from this [dataset](https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/notebooks/search/data.json)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "qX2jo_TzVwqR",
"outputId": "5a749972-a960-4218-b2df-58060dee265b"
},
"outputs": [],
"source": [
"import json\n",
"from urllib.request import urlopen\n",
"\n",
"url = \"https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/notebooks/search/data.json\"\n",
"response = urlopen(url)\n",
"books = json.loads(response.read())\n",
"\n",
"operations = []\n",
"for book in books:\n",
" operations.append({\"index\": {\"_index\": \"book_index\"}})\n",
" operations.append(book)\n",
"client.bulk(index=\"book_index\", operations=operations, refresh=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "50ghTAEYV4Yu"
},
"source": [
"## Aside: Pretty printing Elasticsearch search results\n",
"\n",
"Your `search` API calls will return hard-to-read nested JSON.\n",
"We'll create a little function called `pretty_search_response` to return nice, human-readable outputs from our examples."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"id": "e1HgqDC4V_HW"
},
"outputs": [],
"source": [
"def pretty_search_response(response):\n",
" if len(response[\"hits\"][\"hits\"]) == 0:\n",
" print(\"Your search returned no results.\")\n",
" else:\n",
" for hit in response[\"hits\"][\"hits\"]:\n",
" id = hit[\"_id\"]\n",
" publication_date = hit[\"_source\"][\"publish_date\"]\n",
" score = hit[\"_score\"]\n",
" title = hit[\"_source\"][\"title\"]\n",
" summary = hit[\"_source\"][\"summary\"]\n",
" publisher = hit[\"_source\"][\"publisher\"]\n",
" num_reviews = hit[\"_source\"][\"num_reviews\"]\n",
" authors = hit[\"_source\"][\"authors\"]\n",
" pretty_output = f\"\\nID: {id}\\nPublication date: {publication_date}\\nTitle: {title}\\nSummary: {summary}\\nPublisher: {publisher}\\nReviews: {num_reviews}\\nAuthors: {authors}\\nScore: {score}\"\n",
" print(pretty_output)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "OGwvVLQMW6lA"
},
"source": [
"## Run queries\n",
"\n",
"Let's use our synonyms in some Elasticsearch queries. We'll start by searching for books about Javascript."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "KPvOrmTBYDet",
"outputId": "8d9f3de5-2508-4ca0-91b1-ece5e6099bea"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"ID: 3NfpXIsBGHjk6-WLlqOE\n",
"Publication date: 2018-12-04\n",
"Title: Eloquent JavaScript\n",
"Summary: A modern introduction to programming\n",
"Publisher: no starch press\n",
"Reviews: 38\n",
"Authors: ['marijn haverbeke']\n",
"Score: 20.307524\n",
"\n",
"ID: 29fpXIsBGHjk6-WLlqOE\n",
"Publication date: 2015-03-27\n",
"Title: You Don't Know JS: Up & Going\n",
"Summary: Introduction to JavaScript and programming as a whole\n",
"Publisher: oreilly\n",
"Reviews: 36\n",
"Authors: ['kyle simpson']\n",
"Score: 19.787104\n",
"\n",
"ID: 39fpXIsBGHjk6-WLlqOE\n",
"Publication date: 2008-05-15\n",
"Title: JavaScript: The Good Parts\n",
"Summary: A deep dive into the parts of JavaScript that are essential to writing maintainable code\n",
"Publisher: oreilly\n",
"Reviews: 51\n",
"Authors: ['douglas crockford']\n",
"Score: 17.064087\n"
]
}
],
"source": [
"response = client.search(\n",
" index=\"book_index\",\n",
" query={\n",
" \"multi_match\": {\n",
" \"query\": \"java script\",\n",
" \"fields\": [\n",
" \"title^10\",\n",
" \"summary\",\n",
" ],\n",
" }\n",
" },\n",
")\n",
"\n",
"pretty_search_response(response)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9KFJaht4Yxvh"
},
"source": [
"Notice that even though we searched for the term \"java script\", we got results containing the terms \"JS\" and \"JavaScript\". Our synonyms are working!\n",
"\n",
"Now let's try searching for books about AI."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "oj1ynL5nZz0u",
"outputId": "f1968d2c-83a5-4b3c-f397-44b16e7ab46e"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Your search returned no results.\n"
]
}
],
"source": [
"response = client.search(\n",
" index=\"book_index\",\n",
" query={\n",
" \"multi_match\": {\n",
" \"query\": \"AI\",\n",
" \"fields\": [\n",
" \"title^10\",\n",
" \"summary\",\n",
" ],\n",
" }\n",
" },\n",
")\n",
"\n",
"pretty_search_response(response)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RtXj_JwyZ3DZ"
},
"source": [
"We didn't get any results! There are some books that use the terms \"artificial intelligence\", but not \"AI\". Let's try using the Synonyms API to add a new synonym rule for \"AI\" so the previous query returns results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "8sZ4nkpzgwMy",
"outputId": "d425906a-3f6e-4dc2-89ed-ca6bbef70b0b"
},
"outputs": [],
"source": [
"client.synonyms.put_synonym_rule(\n",
" set_id=\"my-synonyms-set\",\n",
" rule_id=\"synonym-2\",\n",
" synonyms=\"ai, artificial intelligence\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KFgKAma1hMT_"
},
"source": [
"If we run the query again, we should now get some results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "KDx_V__QhIiy",
"outputId": "6d23e7f1-e129-4ee7-edf7-8e55ba1d0355"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"ID: 2dfpXIsBGHjk6-WLlqOE\n",
"Publication date: 2020-04-06\n",
"Title: Artificial Intelligence: A Modern Approach\n",
"Summary: Comprehensive introduction to the theory and practice of artificial intelligence\n",
"Publisher: pearson\n",
"Reviews: 39\n",
"Authors: ['stuart russell', 'peter norvig']\n",
"Score: 42.500813\n"
]
}
],
"source": [
"response = client.search(\n",
" index=\"book_index\",\n",
" query={\n",
" \"multi_match\": {\n",
" \"query\": \"AI\",\n",
" \"fields\": [\n",
" \"title^10\",\n",
" \"summary\",\n",
" ],\n",
" }\n",
" },\n",
")\n",
"\n",
"pretty_search_response(response)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"The Synonyms API allows you to dynamically create & modify the synonyms used in your search index in real time. After reading this notebook, you should have all you need to start integrating the Synonyms API into your search experience!"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3.12.3 64-bit",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
},
"vscode": {
"interpreter": {
"hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}