notebooks/enterprise-search/app-search-engine-exporter.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "zFIDykq6tpY8" }, "source": [ "# App Search Engine exporter to Elasticsearch\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/enterprise-search/app-search-engine-exporter.ipynb)\n", "\n", "This notebook explains the steps of exporting an App Search engine together with its configurations in Elasticsearch. This is not meant to be an exhaustive example for all App Search features as those will vary based on your instance, but is meant to give a sense of how you can export, migrate, and enhance your application.\n", "\n", "NOTE: This notebook is designed to work with Elasticsearch **8.18** or higher. If you are running this notebook against an older version of Elasticsearch, we note commands that will need to be modified.\n", "\n", "We will look at:\n", "\n", "- how to export synonyms\n", "- how to export curations\n", "- how to create a new index in Elasticsearch\n", "- how to add sparse vector fields\n", "- how to query the new Elasticsearch index\n", "\n", "## Setup\n", "\n", "Let's start by making sure our Elasticsearch and Enterprise Search clients are installed. We'll also use `getpass` to ensure we can allow secure user inputs for our IDs and keys to access our Elasticsearch instance.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "6-d_3fBqtmRa", "outputId": "a8ee102e-eaaa-45b5-b883-edb06a5bed4f" }, "outputs": [], "source": [ "# install packages\n", "import sys\n", "\n", "!{sys.executable} -m pip install -qU elasticsearch elastic-enterprise-search\n", "\n", "# import modules\n", "from getpass import getpass\n", "from elastic_enterprise_search import AppSearch\n", "from elasticsearch import Elasticsearch\n", "import json" ] }, { "cell_type": "markdown", "metadata": { "id": "wsldWRz7xSiZ" }, "source": [ "## Connect to Elasticsearch\n", "\n", "ℹ️ We're using an Elastic Cloud deployment of Elasticsearch for this notebook. If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?onboarding_token=search&utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial. This notebook is designed to be run against an Elasticsearch deployment running on version 8.18 or higher.\n", "\n", "We'll use the **Cloud ID** to identify our deployment, because we are using Elastic Cloud deployment. To find the Cloud ID for your deployment, go to https://cloud.elastic.co/deployments and select your deployment. \n", "\n", "You will also need your **API KEY** to access your deployment. You can [create a new API key](https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key) from the `Stack Management -> API keys` menu in Kibana. 
Be sure to copy your key and store it in a safe place as soon as it is created, as it will only be displayed upon creation.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id\n", "ELASTIC_CLOUD_ID = getpass(\"Elastic Cloud ID: \")\n", "\n", "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key\n", "ELASTIC_API_KEY = getpass(\"Elastic Api Key: \")\n", "\n", "elasticsearch = Elasticsearch(\n", " # For local development\n", " # hosts=[\"http://localhost:9200\"]\n", " cloud_id=ELASTIC_CLOUD_ID,\n", " api_key=ELASTIC_API_KEY,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Connect to App Search\n", "\n", "For this notebook we will need access to an App Search private key that can access the App Search engine we want to export.\n", "We will instantiate the Enterprise Search client with the provided credentials and then check that we are correctly authenticated to Enterprise Search by getting the App Search engine details.\n", "\n", "You can find your App Search endpoint and your private key from the `Credentials` menu inside your App Search instance in Kibana.\n", "\n", "Also note that here we define our `ENGINE_NAME`. For this example, we are using the `national-parks-demo` sample engine that is available within App Search." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "APP_SEARCH_ENDPOINT = getpass(\"App Search endpoint: \")\n", "APP_SEARCH_PRIVATE_KEY = getpass(\"App Search private key: \")\n", "\n", "app_search = AppSearch(APP_SEARCH_ENDPOINT, bearer_auth=APP_SEARCH_PRIVATE_KEY)\n", "\n", "# modify this with your own engine name\n", "ENGINE_NAME = \"national-parks-demo\"" ] }, { "cell_type": "markdown", "metadata": { "id": "FSsSGl--uqFT" }, "source": [ "## Export App Search synonyms in Elasticsearch\n", "\n", "To get started with our export, we will first export any [synonyms](https://www.elastic.co/guide/en/app-search/current/synonyms-guide.html) we have in our App Search engine.\n", "\n", "The resulting synonyms will be placed into a new [Elasticsearch synonym set](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-with-synonyms.html) named after our App Search engine and used later on by the `synonyms-filter` in our analyzers.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "kpV8K5jHvRK6" }, "outputs": [], "source": [ "elasticsearch.synonyms.put_synonym(id=ENGINE_NAME, synonyms_set=[])\n", "\n", "for synonym_set in app_search.list_synonym_sets(engine_name=ENGINE_NAME).body[\n", " \"results\"\n", "]:\n", " elasticsearch.synonyms.put_synonym_rule(\n", " set_id=ENGINE_NAME,\n", " rule_id=synonym_set[\"id\"],\n", " synonyms=\", \".join(synonym_set[\"synonyms\"]),\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a quick look at the synonyms we've migrated.
We'll do this via the `GET _synonyms` endpoint" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(json.dumps(elasticsearch.synonyms.get_synonym(id=ENGINE_NAME).body, indent=2))" ] }, { "cell_type": "markdown", "metadata": { "id": "lyc-DcjnvTgH" }, "source": [ "## Export App Search curations in Elasticsearch\n", "\n", "Next, we will export any curations that may be in our App Search engine.\n", "\n", "To export App Search curations we will use Elasticsearch [query rules](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-using-query-rules.html). The code below will create the necessary `query_rules` to achieve this. Note that there is a default soft limit of 100 curations for `query_rules` that can be configured up to a hard limit of 1,000.\n", "\n", "NOTE: This example outputs query rules requiring `exact` matches, which are case-sensitive. If you need typo tolerance, consider using `fuzzy`. If you need different case values consider adding multiple values to your criteria. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "query_rules = []\n", "\n", "for curation in app_search.list_curations(engine_name=ENGINE_NAME).body[\"results\"]:\n", " if curation[\"promoted\"]:\n", " query_rules.append(\n", " {\n", " \"rule_id\": curation[\"id\"] + \"-pinned\",\n", " \"type\": \"pinned\",\n", " \"criteria\": [\n", " {\n", " \"type\": \"exact\",\n", " \"metadata\": \"user_query\",\n", " \"values\": curation[\"queries\"],\n", " }\n", " ],\n", " \"actions\": {\"ids\": curation[\"promoted\"]},\n", " }\n", " )\n", " if curation[\"hidden\"]:\n", " query_rules.append(\n", " {\n", " \"rule_id\": curation[\"id\"] + \"-exclude\",\n", " \"type\": \"exclude\",\n", " \"criteria\": [\n", " {\n", " \"type\": \"exact\",\n", " \"metadata\": \"user_query\",\n", " \"values\": curation[\"queries\"],\n", " }\n", " ],\n", " \"actions\": {\"ids\": curation[\"hidden\"]},\n", " }\n", " )\n", "\n", "elasticsearch.query_rules.put_ruleset(ruleset_id=ENGINE_NAME, rules=query_rules)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a quick look at the query rules we've migrated. We'll do this via the `GET _query_rules/ENGINE_NAME` endpoint. Note that curations with both pinned and hidden documents will be represented as two rules in the ruleset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\n", " json.dumps(\n", " elasticsearch.query_rules.get_ruleset(ruleset_id=ENGINE_NAME).body, indent=2\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "yOKVxmSbvbt9" }, "source": [ "## Create a new Elasticsearch index\n", "\n", "We recommend reindexing your App Search engine data into a new Elasticsearch index instead of reusing the existing one. 
This allows you to update the index mapping to take advantage of modern features like semantic search and the newly created Elasticsearch synonym set.\n", "\n", "App Search has the following data types:\n", "\n", "- `text`\n", "- `number`\n", "- `date`\n", "- `geolocation`\n", " \n", "Each of these types is mapped to Elasticsearch field types.\n", "\n", "We can take a closer look at how App Search field types are mapped to Elasticsearch fields by using the [`GET mapping API`](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-get-mapping.html).\n", "For App Search engines, the associated Elasticsearch index name is `.ent-search-engine-documents-[ENGINE_NAME]`, e.g. `.ent-search-engine-documents-national-parks-demo` for the App Search sample engine `national-parks-demo`.\n", "One thing to notice is how App Search uses [multi-fields](https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html) in Elasticsearch: by creating subfields for each supported field type, App Search can quickly change a field's type without requiring reindexing:\n", "\n", "<details>\n", " <summary>An example schema can be found by clicking here</summary>\n", "\n", "```json\n", "\"[APP_SEARCH_ENGINE_FIELD_NAME]\": {\n", " \"type\": \"text\",\n", " \"fields\": {\n", " \"date\": {\n", " \"type\": \"date\",\n", " \"format\": \"strict_date_time||strict_date\",\n", " \"ignore_malformed\": true\n", " },\n", " \"delimiter\": {\n", " \"type\": \"text\",\n", " \"index_options\": \"freqs\",\n", " \"analyzer\": \"iq_text_delimiter\"\n", " },\n", " \"enum\": {\n", " \"type\": \"keyword\",\n", " \"ignore_above\": 2048\n", " },\n", " \"float\": {\n", " \"type\": \"double\",\n", " \"ignore_malformed\": true\n", " },\n", " \"joined\": {\n", " \"type\": \"text\",\n", " \"index_options\": \"freqs\",\n", " \"analyzer\": \"i_text_bigram\",\n", " \"search_analyzer\": \"q_text_bigram\"\n", " },\n", " \"location\": {\n", " \"type\": \"geo_point\",\n", " \"ignore_malformed\": true,\n", " \"ignore_z_value\": false\n", " },\n", " \"prefix\": {\n", " \"type\": \"text\",\n", " \"index_options\": \"docs\",\n", " \"analyzer\": \"i_prefix\",\n", " \"search_analyzer\": \"q_prefix\"\n", " },\n", " \"stem\": {\n", " \"type\": \"text\",\n", " \"analyzer\": \"iq_text_stem\"\n", " }\n", " },\n", " \"index_options\": \"freqs\",\n", " \"analyzer\": \"iq_text_base\"\n", "}\n", "```\n", "</details>\n", "\n", "In our case we can assume that we have a well-established schema and do not need all of the multi-fields.\n", "\n", "We can retrieve the field types of an App Search engine using the [Schema API](https://www.elastic.co/guide/en/app-search/current/schema.html) and then construct our mapping.\n", "\n", "Also note that below, we set up variables for our `SOURCE_INDEX` and `DEST_INDEX`. If you want your destination index to be named differently, you can edit it here, as these variables are used throughout the rest of the notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we'll define our source and destination indices. We'll also ensure that the destination index is deleted if it already exists, so that we start fresh."
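, "\n", "\n", "Before creating anything, if you'd like to see the multi-field mapping described above for your own engine, one optional check is to fetch it straight from the engine's backing index. This is a minimal sketch; it assumes your API key can read the hidden `.ent-search-engine-documents-*` indices:\n", "\n", "```python\n", "# optional: inspect how the App Search engine's backing index is mapped\n", "source_mapping = elasticsearch.indices.get_mapping(index=\".ent-search-engine-documents-\" + ENGINE_NAME)\n", "print(json.dumps(source_mapping.body, indent=2))\n", "```"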
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# define SOURCE_INDEX and DEST_INDEX which we will continue to reuse; feel free to adjust DEST_INDEX\n", "SOURCE_INDEX = \".ent-search-engine-documents-\" + ENGINE_NAME\n", "DEST_INDEX = \"new-\" + ENGINE_NAME\n", "\n", "# delete the index if it's already created\n", "if elasticsearch.indices.exists(index=DEST_INDEX):\n", " elasticsearch.indices.delete(index=DEST_INDEX)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we'll create our settings which includes filters and and analyzers to use for our text fields.\n", "\n", "These are similar to the Elasticsearch analyzers we use for App Search. The main difference is that we are also adding a synonyms filter so that we can\n", "leverage the Elasticsearch synonym set we created in a previous step. If you want a different mapping for text fields, feel free to modify.\n", "\n", "To start with, we'll define a number of filters that we can reuse in our analyzer itself. These include:\n", "* `front_ngram`: defines a front loaded [n-gram token filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenfilter.html) that can help create prefixes for terms.\n", "* `bigram_max_size`: defines a [maximum length](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-length-tokenfilter.html) for any bigram. In our example, we exclude any bigrams larger than 16 characters.\n", "* `en-stem-filter`: defines [a stemmer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stemmer-tokenfilter.html) for use with English text.\n", "* `bigram_joiner_unigrams`: a filter that [adds word n-grams](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-shingle-tokenfilter.html) into our token stream. 
This helps to expand the query to capture more context.\n", "* `delimiter`: a [word delimiter graph token filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-graph-tokenfilter.html) with the rules we've set on how to explicitly split tokens in our input.\n", "* `en-stop-words-filter`: a default [stop token filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-tokenfilter.html) to remove common English terms from our input.\n", "* `synonyms-filter`: a [synonym graph token filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-graph-tokenfilter.html) that allows us to reuse the synonym set that we've defined above.\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "settings_analysis_filters = {\n", " \"front_ngram\": {\"type\": \"edge_ngram\", \"min_gram\": \"1\", \"max_gram\": \"12\"},\n", " \"bigram_joiner\": {\n", " \"max_shingle_size\": \"2\",\n", " \"token_separator\": \"\",\n", " \"output_unigrams\": \"false\",\n", " \"type\": \"shingle\",\n", " },\n", " \"bigram_max_size\": {\"type\": \"length\", \"max\": \"16\", \"min\": \"0\"},\n", " \"en-stem-filter\": {\"name\": \"light_english\", \"type\": \"stemmer\"},\n", " \"bigram_joiner_unigrams\": {\n", " \"max_shingle_size\": \"2\",\n", " \"token_separator\": \"\",\n", " \"output_unigrams\": \"true\",\n", " \"type\": \"shingle\",\n", " },\n", " \"delimiter\": {\n", " \"split_on_numerics\": \"true\",\n", " \"generate_word_parts\": \"true\",\n", " \"preserve_original\": \"false\",\n", " \"catenate_words\": \"true\",\n", " \"generate_number_parts\": \"true\",\n", " \"catenate_all\": \"true\",\n", " \"split_on_case_change\": \"true\",\n", " \"type\": \"word_delimiter_graph\",\n", " \"catenate_numbers\": \"true\",\n", " \"stem_english_possessive\": \"true\",\n", " },\n", " \"en-stop-words-filter\": {\"type\": \"stop\", \"stopwords\": \"_english_\"},\n", " \"synonyms-filter\": {\n", " \"type\": \"synonym_graph\",\n", " \"synonyms_set\": ENGINE_NAME,\n", " \"updateable\": True,\n", " },\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll create the analyzers that use these filters. The various analyzers will be used in different parts of our field mapping for text, and will help us index and query our text in different ways. These include:\n", "\n", "* `iq_text_delimiter` is used for tokenizing and searching terms split on our specified delimiters in our text.\n", "* `i_prefix` and `q_prefix` define our indexing and query analyzers for creating prefix versions of our terms.\n", "* `iq_text_stem` is used to create and query on stemmed versions of our tokens.\n", "* `i_text_bigram` and `q_text_bigram` define our indexing and query analyzers for creating bigram terms.\n", "* `i_text_base` and `q_text_base` define the indexing and query analysis rules for general text."
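, "\n", "\n", "Once the destination index has been created later in this notebook, you can sanity-check any of these analyzers by running sample text through the [analyze API](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html). This is an optional sketch, using the `DEST_INDEX` and analyzer names defined in this notebook:\n", "\n", "```python\n", "# optional: after DEST_INDEX is created below, inspect the tokens an analyzer produces\n", "analysis = elasticsearch.indices.analyze(index=DEST_INDEX, analyzer=\"q_text_base\", text=\"National Parks\")\n", "print([token[\"token\"] for token in analysis[\"tokens\"]])\n", "```"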
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "settings_analyzer = {\n", " \"i_prefix\": {\n", " \"filter\": [\"cjk_width\", \"lowercase\", \"asciifolding\", \"front_ngram\"],\n", " \"tokenizer\": \"standard\",\n", " },\n", " \"iq_text_delimiter\": {\n", " \"filter\": [\n", " \"delimiter\",\n", " \"cjk_width\",\n", " \"lowercase\",\n", " \"asciifolding\",\n", " \"en-stop-words-filter\",\n", " \"en-stem-filter\",\n", " ],\n", " \"tokenizer\": \"whitespace\",\n", " },\n", " \"q_prefix\": {\n", " \"filter\": [\"cjk_width\", \"lowercase\", \"asciifolding\"],\n", " \"tokenizer\": \"standard\",\n", " },\n", " \"i_text_base\": {\n", " \"filter\": [\n", " \"cjk_width\",\n", " \"lowercase\",\n", " \"asciifolding\",\n", " \"en-stop-words-filter\",\n", " ],\n", " \"tokenizer\": \"standard\",\n", " },\n", " \"q_text_base\": {\n", " \"filter\": [\n", " \"cjk_width\",\n", " \"lowercase\",\n", " \"asciifolding\",\n", " \"en-stop-words-filter\",\n", " \"synonyms-filter\",\n", " ],\n", " \"tokenizer\": \"standard\",\n", " },\n", " \"iq_text_stem\": {\n", " \"filter\": [\n", " \"cjk_width\",\n", " \"lowercase\",\n", " \"asciifolding\",\n", " \"en-stop-words-filter\",\n", " \"en-stem-filter\",\n", " ],\n", " \"tokenizer\": \"standard\",\n", " },\n", " \"i_text_bigram\": {\n", " \"filter\": [\n", " \"cjk_width\",\n", " \"lowercase\",\n", " \"asciifolding\",\n", " \"en-stem-filter\",\n", " \"bigram_joiner\",\n", " \"bigram_max_size\",\n", " ],\n", " \"tokenizer\": \"standard\",\n", " },\n", " \"q_text_bigram\": {\n", " \"filter\": [\n", " \"cjk_width\",\n", " \"lowercase\",\n", " \"asciifolding\",\n", " \"synonyms-filter\",\n", " \"en-stem-filter\",\n", " \"bigram_joiner_unigrams\",\n", " \"bigram_max_size\",\n", " ],\n", " \"tokenizer\": \"standard\",\n", " },\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we'll combine our filters and our analyzer into a settings object that we can use to define our destination index's settings.\n", "\n", "More information on creating custom analyzers can be found in the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html)." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "settings = {\n", " \"analysis\": {\n", " \"filter\": settings_analysis_filters,\n", " \"analyzer\": settings_analyzer,\n", " }\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have our settings built for our analysis, we'll get the current schema from our App Search engine and use that to build the mappings for our destination index we will be migrating the data into.\n", "\n", "For any text fields, we'll explicitly define that mappings for how we want these fields to be stored. We define a number of fields here to emulate what App Search does underneath the hood. These include:\n", "* A `keyword` field that ignores any token greater than 2048 characters in length.\n", "* A `delimiter` field that captures any delimiters that we've defined in the above `delimiter` analysis.\n", "* A `joined` field that uses our bigram analysis from above. This will create pairs of joined tokens that can be used for phrase queries.\n", "* A `prefix` field that uses our prefix analysis from above. 
This is used for prefix matching, allowing partial matches as well as autocomplete-style queries.\n", "* A `stem` field that captures the stemmed versions of our tokens.\n", "\n", "Finally, the overall text field will be fully stored and analyzed using our base analyzer that we've defined above." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# get the App Search engine schema\n", "schema = app_search.get_schema(engine_name=ENGINE_NAME)\n", "\n", "# construct the Elasticsearch mapping\n", "mapping = {}\n", "\n", "for field_name in schema:\n", " field_type = schema[field_name]\n", "\n", " if field_type == \"date\":\n", " mapping[field_name] = {\n", " \"type\": \"date\",\n", " \"format\": \"strict_date_time||strict_date\",\n", " \"ignore_malformed\": True,\n", " }\n", " elif field_type == \"geolocation\":\n", " mapping[field_name] = {\"type\": \"geo_point\", \"ignore_z_value\": False}\n", " elif field_type == \"number\":\n", " mapping[field_name] = {\"type\": \"double\"}\n", " elif field_type == \"text\":\n", " # feel free to modify this with your own mapping for text fields\n", " mapping[field_name] = {\n", " \"fields\": {\n", " \"keyword\": {\"type\": \"keyword\", \"ignore_above\": 2048},\n", " \"delimiter\": {\n", " \"type\": \"text\",\n", " \"index_options\": \"freqs\",\n", " \"analyzer\": \"iq_text_delimiter\",\n", " },\n", " \"joined\": {\n", " \"type\": \"text\",\n", " \"index_options\": \"freqs\",\n", " \"analyzer\": \"i_text_bigram\",\n", " \"search_analyzer\": \"q_text_bigram\",\n", " },\n", " \"prefix\": {\n", " \"type\": \"text\",\n", " \"index_options\": \"docs\",\n", " \"analyzer\": \"i_prefix\",\n", " \"search_analyzer\": \"q_prefix\",\n", " },\n", " \"stem\": {\"type\": \"text\", \"analyzer\": \"iq_text_stem\"},\n", " },\n", " \"type\": \"text\",\n", " \"index_options\": \"freqs\",\n", " \"analyzer\": \"i_text_base\",\n", " \"search_analyzer\": \"q_text_base\",\n", " }" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now, we create our destination index that uses our mappings and analysis settings." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# and actually create our index\n", "elasticsearch.indices.create(\n", " index=DEST_INDEX, mappings={\"properties\": mapping}, settings=settings\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Add semantic text fields for semantic search (optional)\n", "\n", "One of the advantages of exporting our index directly to Elasticsearch is that we can easily perform semantic search with ELSER. To do this, we'll need to add a `semantic_text` field to our index. We will set up the `semantic_text` field using the default ELSER endpoint.\n", "\n", "Note that to use this feature, your cluster must be running at least version 8.15.0 and have at least one ML node set up with enough resources allocated to it.\n", "\n", "If you do not have an ELSER endpoint running, it will be automatically downloaded, deployed, and started for you when you use `semantic_text`. This means the first few commands may take a while as the model loads. For Elasticsearch versions below 8.17, you will need to create an inference endpoint and add it to the `semantic_text` mapping." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using semantic text fields for ingest and query\n", "\n", "First, we'll augment our text fields with `semantic_text` fields in our index.
We'll do this by creating a `semantic_text` field and adding a `copy_to` directive on the original source field to copy the text into our semantic text field.\n", "\n", "In the example below, we are using the `description` and `title` fields from our example index to add semantic search on those fields." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# by default we are adding a `semantic_text` field for the \"description\" and \"title\" fields in our schema\n", "# feel free to modify this list to only include the fields that are relevant\n", "SEMANTIC_TEXT_FIELDS = [\"description\", \"title\"]\n", "\n", "# add the semantic_text field to our mapping for each field defined\n", "for field_name in SEMANTIC_TEXT_FIELDS:\n", " semantic_field_name = field_name + \"_semantic\"\n", " mapping[semantic_field_name] = {\"type\": \"semantic_text\"}\n", "\n", "# and for our text fields, add a \"copy_to\" directive to copy the text to the semantic_text field\n", "for field_name in SEMANTIC_TEXT_FIELDS:\n", " semantic_field_name = field_name + \"_semantic\"\n", " mapping[field_name].update({\"copy_to\": semantic_field_name})\n", "\n", "elasticsearch.indices.put_mapping(index=DEST_INDEX, properties=mapping)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reindex the data\n", "Now that we have created the Elasticsearch index, it's time to reindex our data into the new index. If you are using the `semantic_text` fields defined above with a `_semantic` suffix, the reindexing process will automatically infer the sparse vector values with ELSER and use those for the vectors as the reindex takes place." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "reindex_task = elasticsearch.reindex(\n", " source={\"index\": SOURCE_INDEX},\n", " dest={\"index\": DEST_INDEX},\n", " wait_for_completion=False,\n", ")\n", "\n", "task_id = reindex_task[\"task\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that above in the reindex command, we set `wait_for_completion` to false.
Inference can take a while, and we run the risk of our command timing out.\n", "The call above returns a task whose progress we can watch via the `tasks` endpoint:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "task_response = elasticsearch.tasks.get(task_id=task_id)\n", "print(json.dumps(task_response.body, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Query the new Elasticsearch index\n", "\n", "We will demonstrate:\n", "\n", "- [how to replicate App Search queries](#how-to-build-app-search-like-queries)\n", "- [how to do semantic search using ELSER](#how-to-do-semantic-search-using-elser-with-semantic-text-fields)\n", "- [how to combine App Search queries and ELSER](#how-to-combine-app-search-queries-with-semantic-text)\n", "\n", "### How to build App Search-like queries\n", "\n", "App Search exposes a [search_explain API](https://www.elastic.co/guide/en/app-search/current/search-explain.html) that receives an App Search query and returns the Elasticsearch query built by App Search.\n", "\n", "```bash\n", "curl -X POST '${ENTERPRISE_SEARCH_BASE_URL}/api/as/v1/engines/national-parks-demo/search_explain' \\\n", "-H 'Content-Type: application/json' \\\n", "-H 'Authorization: Bearer private-xxxxxx' \\\n", "-d '{\n", " \"query\": \"park\"\n", "}'\n", "```\n", "\n", "From the output of the API call above, we can see the actual Elasticsearch query that will be used. Below, we use this query as a base to build our own App Search-like query using query rules and our Elasticsearch synonyms. The query is further enhanced by also searching the App Search-style multi-fields we created, such as the stemmed and prefix subfields.\n", "\n", "Let's walk through a bit of what is happening in the query below. First, we gather some preliminary information about the fields we want to query and return.\n", "1) We gather the fields we want for our result. This includes all the keys in the schema from above.\n", "2) Next, we gather all of the text fields in our schema.\n", "3) And finally, we gather the \"best fields\", which are those we want to query on using our stemmer." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "QUERY_STRING = \"park\"\n", "\n", "result_fields = list(schema.keys())\n", "\n", "text_fields = [field_name for field_name in schema if schema[field_name] == \"text\"]\n", "best_fields = [field_name + \".stem\" for field_name in text_fields]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, from our text fields, we create a set of fields with specified weights for our various analyzers.\n", "\n", "* For the text field itself, we weight this as neutral, with a `1.0`.\n", "* For any stem fields, we weight these _slightly_ less to pull in closely stemmed words in the query.\n", "* For any prefixes, we apply a minimal weight to ensure these do not dominate our scoring.\n", "* For any potential bigram phrase matches, we weight these as well with a `0.75`.\n", "* Finally, for our delimiter-analyzed terms, we weight these somewhere in the middle.\n", "\n", "These are the default weightings that App Search uses. Feel free to experiment with these values to find a balance that works for you."
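, "\n", "\n", "If you want to confirm which subfields are actually available to boost on the new index, the [field capabilities API](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-field-caps.html) can list them. This is an optional sketch using the `DEST_INDEX` defined earlier:\n", "\n", "```python\n", "# optional: list the fields (including multi-field subfields) present on the new index\n", "caps = elasticsearch.field_caps(index=DEST_INDEX, fields=\"*\")\n", "print(sorted(caps[\"fields\"].keys()))\n", "```"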
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "cross_fields = []\n", "\n", "for text_field in text_fields:\n", " cross_fields.append(text_field + \"^1.0\")\n", " cross_fields.append(text_field + \".stem^0.95\")\n", " cross_fields.append(text_field + \".prefix^0.1\")\n", " cross_fields.append(text_field + \".joined^0.75\")\n", " cross_fields.append(text_field + \".delimiter^0.4\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we're ready to create our actual payload for our query. This is analagous to the query that App Search uses when querying.\n", "\n", "Within this query, we first set an organic query rule. This defines a boolean query under the hood that allows a match to be found and scored either in our cross fields we defined above, or in the \"best fields\" as defined.\n", "\n", "For the results, we sort on our score descending as the primary sort, with the document id as the secondary.\n", "\n", "We apply highlights to returned text search descriptions, request a return size of the top 10 hits, and for each hit, return the result fields." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "app_search_query_payload = {\n", " \"query\": {\n", " \"rule\": {\n", " \"organic\": {\n", " \"bool\": {\n", " \"should\": [\n", " {\n", " \"multi_match\": {\n", " \"query\": QUERY_STRING,\n", " \"minimum_should_match\": \"1<-1 3<49%\",\n", " \"type\": \"cross_fields\",\n", " \"fields\": cross_fields,\n", " }\n", " },\n", " {\n", " \"multi_match\": {\n", " \"query\": QUERY_STRING,\n", " \"minimum_should_match\": \"1<-1 3<49%\",\n", " \"type\": \"best_fields\",\n", " \"fuzziness\": \"AUTO\",\n", " \"prefix_length\": 2,\n", " \"fields\": best_fields,\n", " }\n", " },\n", " ]\n", " }\n", " },\n", " \"ruleset_ids\": [ENGINE_NAME],\n", " \"match_criteria\": {\"user_query\": QUERY_STRING},\n", " }\n", " },\n", " \"sort\": [{\"_score\": \"desc\"}, {\"_doc\": \"desc\"}],\n", " \"highlight\": {\n", " \"fragment_size\": 300,\n", " \"type\": \"plain\",\n", " \"number_of_fragments\": 1,\n", " \"order\": \"score\",\n", " \"encoder\": \"html\",\n", " \"require_field_match\": False,\n", " \"fields\": {\"description\": {\"pre_tags\": [\"<em>\"], \"post_tags\": [\"</em>\"]}},\n", " },\n", " \"size\": 10,\n", " \"_source\": result_fields,\n", "}\n", "\n", "print(f\"Elasticsearch payload:\\n{json.dumps(app_search_query_payload, indent=2)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have our fully flushed out query, we can use that to perform the actual search:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results = elasticsearch.search(\n", " index=DEST_INDEX,\n", " query=app_search_query_payload[\"query\"],\n", " highlight=app_search_query_payload[\"highlight\"],\n", " source=app_search_query_payload[\"_source\"],\n", " sort=app_search_query_payload[\"sort\"],\n", " size=app_search_query_payload[\"size\"],\n", ")\n", "print(json.dumps(results.body, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How to do semantic search using ELSER with semantic text fields\n", "\n", "If you [enabled and reindexed your data with ELSER](#add-sparse_vector-fields-for-semantic-search-optional), we can now use this to do semantic search.\n", "For each `semantic_text` field type, we can define a [match query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html) to easily 
perform a semantic search on these fields.\n", "\n", "NOTE: For Elasticsearch versions prior to 8.18, a [semantic query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-semantic-query.html) should be used to perform a semantic search on these fields.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# replace with your own\n", "QUERY_STRING = \"best sunset view\"\n", "semantic_text_queries = []\n", "\n", "for field_name in SEMANTIC_TEXT_FIELDS:\n", " semantic_field_name = field_name + \"_semantic\"\n", " semantic_text_queries.append({\"match\": {semantic_field_name: QUERY_STRING}})\n", "\n", "semantic_query = {\"bool\": {\"should\": semantic_text_queries}}\n", "print(f\"Elasticsearch query:\\n{json.dumps(semantic_query, indent=2)}\\n\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results = elasticsearch.search(index=DEST_INDEX, query=semantic_query, min_score=1)\n", "print(f\"Query results:\\n{json.dumps(results.body, indent=2)}\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How to combine App Search queries with semantic text\n", "\n", "We will now provide an example of how to combine the previous two queries into a single query that applies both BM25 search and semantic search.\n", "In the previous examples, we have a `bool` query with `should` clauses.\n", "We will combine them into a single `bool` query and wrap this `bool` query in a `rule` query.\n", "The `rule` query is used to pin results based on the query string, similar to App Search curations.\n", "The high-level structure of the query is as follows:\n", "\n", "```json\n", "GET [DEST_INDEX]/_search\n", "{\n", " \"query\": {\n", " \"rule\": {\n", " \"organic\": {\n", " \"bool\": {\n", " \"should\": [\n", " // multi_match query with best_fields from App Search generated query\n", " // multi_match query with cross_fields from App Search generated query\n", " // match queries for semantic_text fields\n", " ]\n", " }\n", " }\n", " }\n", " }\n", "}\n", "```\n", "\n", "We are again using `min_score` to exclude less relevant results.\n", "In our example we are not boosting any of the `should` clauses, but this can be a way to boost ELSER results over BM25 results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "payload = app_search_query_payload.copy()\n", "\n", "for semantic_text_query in semantic_text_queries:\n", " payload[\"query\"][\"rule\"][\"organic\"][\"bool\"][\"should\"].append(semantic_text_query)\n", "\n", "print(f\"Elasticsearch payload:\\n{json.dumps(payload, indent=2)}\\n\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results = elasticsearch.search(\n", " index=DEST_INDEX,\n", " query=payload[\"query\"],\n", " highlight=payload[\"highlight\"],\n", " source=payload[\"_source\"],\n", " sort=payload[\"sort\"],\n", " size=payload[\"size\"],\n", " min_score=1,\n", ")\n", "\n", "print(f\"Semantic query results:\\n{json.dumps(results.body, indent=2)}\\n\")" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 0 }