notebooks/enterprise-search/elastic-crawler-to-open-crawler-migration.ipynb

{ "cells": [ { "cell_type": "markdown", "id": "89b4646f-6a71-44e0-97b9-846319bf0162", "metadata": {}, "source": [ "## Hello, future Open Crawler user!\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/enterprise-search/elastic-crawler-to-open-crawler-migration.ipynb)\n", "\n", "This notebook is designed to help you migrate your Elastic Crawler configurations to Open Crawler-friendly YAML!\n", "\n", "We recommend running each cell individually in a sequential fashion, as each cell is dependent on previous cells having been run. Furthermore, we recommend that you only run each cell once as re-running cells may result in errors or incorrect YAML files.\n", "\n", "### Setup\n", "First, let's start by making sure `elasticsearch` and other required dependencies are installed and imported by running the following cell:" ] }, { "cell_type": "code", "execution_count": null, "id": "da411d2f-9aff-46af-845a-5fe9be19ea3c", "metadata": {}, "outputs": [], "source": [ "!pip install elasticsearch\n", "\n", "from getpass import getpass\n", "from elasticsearch import Elasticsearch\n", "\n", "import os\n", "import json\n", "import yaml\n", "import pprint" ] }, { "cell_type": "markdown", "id": "f4131f88-9895-4c0e-8b0a-6ec7b3b45653", "metadata": {}, "source": [ "We are going to need a few things from your Elasticsearch deployment before we can migrate your configurations:\n", "- Your **Elasticsearch Endpoint URL**\n", "- Your **Elasticsearch Endpoint Port number**\n", "- An **API key**\n", "\n", "You can find your Endpoint URL and port number by visiting your Elasticsearch Overview page in Kibana.\n", "\n", "You can create a new API key from the Stack Management -> API keys menu in Kibana. Be sure to copy or write down your key in a safe place, as it will be displayed only once upon creation." ] }, { "cell_type": "code", "execution_count": null, "id": "08e6e3d2-62d3-4890-a6be-41fe0a931ef6", "metadata": {}, "outputs": [], "source": [ "ELASTIC_ENDPOINT = getpass(\"Elastic Endpoint: \")\n", "ELASTIC_PORT = getpass(\"Port\")\n", "API_KEY = getpass(\"Elastic Api Key: \")\n", "\n", "es_client = Elasticsearch(\n", " \":\".join([ELASTIC_ENDPOINT, ELASTIC_PORT]),\n", " api_key=API_KEY,\n", ")\n", "\n", "# ping ES to make sure we have positive connection\n", "es_client.info()[\"tagline\"]" ] }, { "cell_type": "markdown", "id": "85f99942-58ae-437d-a72b-70b8d1f4432c", "metadata": {}, "source": [ "Hopefully you received our tagline 'You Know, for Search'. If so, we are connected and ready to go!\n", "\n", "If not, please double-check your Cloud ID and API key that you provided above. " ] }, { "cell_type": "markdown", "id": "a55236e7-19dc-4f4c-92b9-d10848dd6af9", "metadata": {}, "source": [ "### Step 1: Acquire Basic Configurations\n", "\n", "First, we need to establish what Crawlers you have and their basic configuration details.\n", "This migration notebook will attempt to pull configurations for every distinct Crawler you have in your Elasticsearch instance." 
] }, { "cell_type": "code", "execution_count": null, "id": "0a698b05-e939-42a5-aa31-51b1b1883e6f", "metadata": {}, "outputs": [], "source": [ "# in-memory data structure that maintains current state of the configs we've pulled\n", "inflight_configuration_data = {}\n", "\n", "crawler_configurations = es_client.search(\n", " index=\".ent-search-actastic-crawler2_configurations_v2\",\n", ")\n", "\n", "crawler_counter = 1\n", "for configuration in crawler_configurations[\"hits\"][\"hits\"]:\n", " source = configuration[\"_source\"]\n", "\n", " # extract values\n", " crawler_oid = source[\"id\"]\n", " output_index = source[\"index_name\"]\n", "\n", " print(f\"{crawler_counter}. {output_index}\")\n", " crawler_counter += 1\n", "\n", " crawl_schedule = (\n", " []\n", " ) # either no schedule or a specific schedule - determined in Step 4\n", " if (\n", " source[\"use_connector_schedule\"] == False and source[\"crawl_schedule\"]\n", " ): # an interval schedule is being used\n", " print(\n", " f\" {output_index} uses an interval schedule, which is not supported in Open Crawler!\"\n", " )\n", "\n", " # populate a temporary hashmap\n", " temp_conf_map = {\"output_index\": output_index, \"schedule\": crawl_schedule}\n", " # pre-populate some necessary fields in preparation for upcoming steps\n", " temp_conf_map[\"domains_temp\"] = {}\n", " temp_conf_map[\"output_sink\"] = \"elasticsearch\"\n", " temp_conf_map[\"full_html_extraction_enabled\"] = False\n", " temp_conf_map[\"elasticsearch\"] = {\n", " \"host\": \"\",\n", " \"port\": \"\",\n", " \"api_key\": \"\",\n", " }\n", " # populate the in-memory data structure\n", " inflight_configuration_data[crawler_oid] = temp_conf_map" ] }, { "cell_type": "markdown", "id": "2804d02b-870d-4173-9c5f-6d5eb434d49b", "metadata": {}, "source": [ "**Before continuing, please verify in the output above that the correct number of Crawlers was found.**\n", "\n", "Now that we have some basic data about your Crawlers, let's use this information to get more configuration values!" ] }, { "cell_type": "markdown", "id": "2b9e2da7-853c-40bd-9ee1-02c4d92b3b43", "metadata": {}, "source": [ "### Step 2: URLs, Sitemaps, and Crawl Rules\n", "\n", "In the next cell, we will need to query Elasticsearch for information about each Crawler's domain URLs, seed URLs, sitemaps, and crawling rules." ] }, { "cell_type": "code", "execution_count": null, "id": "e1c64c3d-c8d7-4236-9ed9-c9b1cb5e7972", "metadata": {}, "outputs": [], "source": [ "crawler_ids_to_query = inflight_configuration_data.keys()\n", "\n", "crawler_counter = 1\n", "for crawler_oid in crawler_ids_to_query:\n", " # query ES to get the crawler's domain configurations\n", " crawler_domains = es_client.search(\n", " index=\".ent-search-actastic-crawler2_domains\",\n", " query={\"match\": {\"configuration_oid\": crawler_oid}},\n", " _source=[\n", " \"name\",\n", " \"configuration_oid\",\n", " \"id\",\n", " \"sitemaps\",\n", " \"crawl_rules\",\n", " \"seed_urls\",\n", " \"auth\",\n", " ],\n", " )\n", " print(f\"{crawler_counter}.) 
Crawler ID {crawler_oid}\")\n", " crawler_counter += 1\n", "\n", " # for each domain the Crawler has, grab its config values\n", " # and update the in-memory data structure\n", " for domain_info in crawler_domains[\"hits\"][\"hits\"]:\n", " source = domain_info[\"_source\"]\n", "\n", " # extract values\n", " domain_oid = str(source[\"id\"])\n", " domain_url = source[\"name\"]\n", " seed_urls = source[\"seed_urls\"]\n", " sitemap_urls = source[\"sitemaps\"]\n", " crawl_rules = source[\"crawl_rules\"]\n", "\n", " print(f\" Domain {domain_url} found!\")\n", "\n", " # transform seed, sitemap, and crawl rules into arrays\n", " seed_urls_list = []\n", " for seed_obj in seed_urls:\n", " seed_urls_list.append(seed_obj[\"url\"])\n", "\n", " sitemap_urls_list = []\n", " for sitemap_obj in sitemap_urls:\n", " sitemap_urls_list.append(sitemap_obj[\"url\"])\n", "\n", " crawl_rules_list = []\n", " for crawl_rules_obj in crawl_rules:\n", " crawl_rules_list.append(\n", " {\n", " \"policy\": crawl_rules_obj[\"policy\"],\n", " \"type\": crawl_rules_obj[\"rule\"],\n", " \"pattern\": crawl_rules_obj[\"pattern\"],\n", " }\n", " )\n", "\n", " # populate a temporary hashmap\n", " temp_domain_conf = {\"url\": domain_url}\n", " if seed_urls_list:\n", " temp_domain_conf[\"seed_urls\"] = seed_urls_list\n", " print(f\" Seed URLs found: {seed_urls_list}\")\n", " if sitemap_urls_list:\n", " temp_domain_conf[\"sitemap_urls\"] = sitemap_urls_list\n", " print(f\" Sitemap URLs found: {sitemap_urls_list}\")\n", " if crawl_rules_list:\n", " temp_domain_conf[\"crawl_rules\"] = crawl_rules_list\n", " print(f\" Crawl rules found: {crawl_rules_list}\")\n", "\n", " # populate the in-memory data structure\n", " inflight_configuration_data[crawler_oid][\"domains_temp\"][\n", " domain_oid\n", " ] = temp_domain_conf\n", " print()" ] }, { "cell_type": "markdown", "id": "575c00ac-7c84-465e-83d7-aa51f8e5310d", "metadata": {}, "source": [ "### Step 3: Extracting the Extraction Rules\n", "\n", "In the next cell, we will find any extraction rules you set for your Elastic Crawlers."
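, "\n", "For context on the translation the next cell performs: each Elastic Crawler rule stores a `content_from.value_type`, which maps to Open Crawler's `action` (`fixed` becomes `set`, `extracted` becomes `extract`); `multiple_objects_handling` maps to `join_as`; and `source_type` maps to `source`. Here is a rough sketch of that mapping with hypothetical, made-up values:\n", "\n", "```python\n", "# Hypothetical extraction rule as stored by Elastic Crawler (illustrative values only)\n", "elastic_rule = {\n", "    \"field_name\": \"author\",\n", "    \"selector\": \".author-name\",\n", "    \"multiple_objects_handling\": \"string\",\n", "    \"source_type\": \"html\",\n", "    \"content_from\": {\"value_type\": \"extracted\", \"value\": \"\"},\n", "}\n", "\n", "# ...and the equivalent Open Crawler-style rule the next cell would produce\n", "open_crawler_rule = {\n", "    \"action\": \"extract\",  # \"fixed\" -> \"set\", \"extracted\" -> \"extract\"\n", "    \"field_name\": \"author\",\n", "    \"selector\": \".author-name\",\n", "    \"join_as\": \"string\",\n", "    \"value\": \"\",\n", "    \"source\": \"html\",\n", "}\n", "```"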
] }, { "cell_type": "code", "execution_count": null, "id": "61a7df7a-72ad-4330-a30c-da319befd55c", "metadata": {}, "outputs": [], "source": [ "extraction_rules = es_client.search(\n", " index=\".ent-search-actastic-crawler2_extraction_rules\",\n", " _source=[\"configuration_oid\", \"domain_oid\", \"rules\", \"url_filters\"],\n", ")\n", "\n", "extr_count = 1\n", "for exr_rule in extraction_rules[\"hits\"][\"hits\"]:\n", " source = exr_rule[\"_source\"]\n", "\n", " config_oid = source[\"configuration_oid\"]\n", " domain_oid = source[\"domain_oid\"]\n", "\n", " # ensure the config and domain oids actually exist in our in-memory data structure\n", " if (\n", " config_oid in inflight_configuration_data\n", " and domain_oid in inflight_configuration_data[config_oid][\"domains_temp\"]\n", " ):\n", "\n", " # initialize extraction rulesets an empty array if it doesn't exist yet\n", " if (\n", " not \"extraction_rulesets\"\n", " in inflight_configuration_data[config_oid][\"domains_temp\"][domain_oid]\n", " ):\n", " inflight_configuration_data[config_oid][\"domains_temp\"][domain_oid][\n", " \"extraction_rulesets\"\n", " ] = []\n", "\n", " all_rules = source[\"rules\"]\n", " all_url_filters = source[\"url_filters\"]\n", "\n", " # extract url filters\n", " url_filters = []\n", " if all_url_filters:\n", " url_filters = [\n", " {\n", " \"type\": all_url_filters[0][\"filter\"],\n", " \"pattern\": all_url_filters[0][\"pattern\"],\n", " }\n", " ]\n", "\n", " # extract rulesets\n", " action_translation_map = {\n", " \"fixed\": \"set\",\n", " \"extracted\": \"extract\",\n", " }\n", "\n", " ruleset = []\n", " if all_rules:\n", " ruleset = [\n", " {\n", " \"action\": action_translation_map[\n", " all_rules[0][\"content_from\"][\"value_type\"]\n", " ],\n", " \"field_name\": all_rules[0][\"field_name\"],\n", " \"selector\": all_rules[0][\"selector\"],\n", " \"join_as\": all_rules[0][\"multiple_objects_handling\"],\n", " \"value\": all_rules[0][\"content_from\"][\"value\"],\n", " \"source\": all_rules[0][\"source_type\"],\n", " }\n", " ]\n", "\n", " temp_extraction_rulesets = {\n", " \"url_filters\": url_filters,\n", " \"rules\": ruleset,\n", " }\n", "\n", " print(\n", " f\"{extr_count}.) Crawler {config_oid} has extraction rules {temp_extraction_rulesets}\\n\"\n", " )\n", " extr_count += 1\n", "\n", " inflight_configuration_data[config_oid][\"domains_temp\"][domain_oid][\n", " \"extraction_rulesets\"\n", " ].append(temp_extraction_rulesets)" ] }, { "cell_type": "markdown", "id": "538fb054-1399-4b88-bd1e-fef116491421", "metadata": {}, "source": [ "### Step 4: Schedules\n", "\n", "In the next cell, we will gather any specific time schedules your Crawlers have set. Please note that _interval time schedules_ are not supported by Open Crawler and will be ignored." ] }, { "cell_type": "code", "execution_count": null, "id": "d880e081-f960-41c7-921e-26896f248eab", "metadata": {}, "outputs": [], "source": [ "def convert_quartz_to_cron(quartz_schedule: str) -> str:\n", " _, minutes, hours, day_of_month, month, day_of_week, year = (\n", " quartz_schedule.split(\" \") + [None]\n", " )[:7]\n", "\n", " # Day of week is 1-7 starting from Sunday in Quartz\n", " # and from Monday in regular Cron, adjust:\n", " # Days before: 1 - SUN, 2 - MON ... 7 - SAT\n", " # Days after: 1 - MON, 2 - TUE ... 7 - SUN\n", " if day_of_week.isnumeric():\n", " day_of_week = (int(day_of_week) - 2) % 7 + 1\n", "\n", " # ignore year\n", " repackaged_definition = f\"{minutes} {hours} {day_of_month} {month} {day_of_week} \"\n", "\n", " # ? 
" # ? comes from Quartz Cron; regular cron doesn't handle it well\n", " repackaged_definition = repackaged_definition.replace(\"?\", \"*\")\n", " return repackaged_definition\n", "\n", "\n", "# ---------------------------------------------------------------\n", "\n", "crawler_counter = 1\n", "for crawler_oid, crawler_config in inflight_configuration_data.items():\n", " output_index = crawler_config[\"output_index\"]\n", "\n", " existing_schedule_value = crawler_config[\"schedule\"]\n", "\n", " if not existing_schedule_value:\n", " # query ES to get this Crawler's specific time schedule\n", " schedules_result = es_client.search(\n", " index=\".elastic-connectors-v1\",\n", " query={\"match\": {\"index_name\": output_index}},\n", " _source=[\"index_name\", \"scheduling\"],\n", " )\n", " # update schedule field with cron expression if specific time scheduling is enabled\n", " if schedules_result[\"hits\"][\"hits\"][0][\"_source\"][\"scheduling\"][\"full\"][\n", " \"enabled\"\n", " ]:\n", " quartz_schedule = schedules_result[\"hits\"][\"hits\"][0][\"_source\"][\n", " \"scheduling\"\n", " ][\"full\"][\"interval\"]\n", " crawler_config[\"schedule\"] = convert_quartz_to_cron(quartz_schedule)\n", " print(\n", " f\"{crawler_counter}.) Crawler {output_index} has the schedule {crawler_config['schedule']}\"\n", " )\n", " crawler_counter += 1" ] }, { "cell_type": "markdown", "id": "b1586df2-283d-435f-9b08-ba9fad3a7e0a", "metadata": {}, "source": [ "### Step 5: Creating the Open Crawler YAML configuration files\n", "\n", "In this final step, we will create the actual YAML files you need to get up and running with Open Crawler!\n", "\n", "The next cell performs some final transformations on the in-memory data structure that has been keeping track of your configurations." ] }, { "cell_type": "code", "execution_count": null, "id": "dd70f102-33ee-4106-8861-0aa0f9a223a1", "metadata": {}, "outputs": [], "source": [ "# Final transform of the in-memory data structure to a form we can dump to YAML\n", "# for each crawler, collect all of its domain configurations into a list\n", "for crawler_oid, crawler_config in inflight_configuration_data.items():\n", " all_crawler_domains = []\n", "\n", " for domain_config in crawler_config[\"domains_temp\"].values():\n", " all_crawler_domains.append(domain_config)\n", " # create a new key called \"domains\" that points to a list of domain configs only - no domain_oid values as keys\n", " crawler_config[\"domains\"] = all_crawler_domains\n", " # delete the temporary domain key\n", " del crawler_config[\"domains_temp\"]\n", " print(f\"Transform for {crawler_oid} complete!\")" ] }, { "cell_type": "markdown", "id": "e611a486-e12f-4951-ab95-ca54241a7a06", "metadata": {}, "source": [ "#### **Wait! Before we continue on to creating our YAML files, we're going to need your input on a few things.**\n", "\n", "In the next cell, please enter the following details about the _Elasticsearch instance you will be using with Open Crawler_.
 This instance can be Elastic Cloud Hosted, Serverless, or a local instance.\n", "\n", "- The Elasticsearch endpoint URL\n", "- The port number of your Elasticsearch endpoint _(Optional, will default to 443 if left blank)_\n", "- An API key" ] }, { "cell_type": "code", "execution_count": null, "id": "213880cc-cbf3-40d9-8c7d-6fcf6428c16b", "metadata": {}, "outputs": [], "source": [ "ENDPOINT = getpass(\"Elasticsearch endpoint URL: \")\n", "PORT = getpass(\"[OPTIONAL] Elasticsearch endpoint port number: \")\n", "OUTPUT_API_KEY = getpass(\"Elasticsearch API key: \")\n", "\n", "# set the above values in each Crawler's configuration\n", "for crawler_config in inflight_configuration_data.values():\n", " crawler_config[\"elasticsearch\"][\"host\"] = ENDPOINT\n", " crawler_config[\"elasticsearch\"][\"port\"] = int(PORT) if PORT else 443\n", " crawler_config[\"elasticsearch\"][\"api_key\"] = OUTPUT_API_KEY\n", "\n", "# ping Elasticsearch to make sure we have a working connection\n", "es_client = Elasticsearch(\n", " \":\".join([ENDPOINT, PORT or \"443\"]),\n", " api_key=OUTPUT_API_KEY,\n", ")\n", "\n", "es_client.info()[\"tagline\"]" ] }, { "cell_type": "markdown", "id": "67dfc7c6-429e-42f0-ab08-2c84d72945cb", "metadata": {}, "source": [ "#### **This is the final step! You have two options here:**\n", "\n", "- The \"Write to YAML\" cell will create _n_ YAML files, one for each Crawler you have.\n", "- The \"Print to output\" cell will print each Crawler's configuration YAML in the Notebook, so you can copy-paste them into your Open Crawler YAML files manually.\n", "\n", "Feel free to run both! You can run Option 2 first to see the output before running Option 1 to save the configs into YAML files." ] }, { "cell_type": "markdown", "id": "7ca5ad33-364c-4d13-88fc-db19052363d5", "metadata": {}, "source": [ "#### Option 1: Write to YAML file" ] }, { "cell_type": "code", "execution_count": null, "id": "6adc53db-d781-4b72-a5f3-441364f354b8", "metadata": {}, "outputs": [], "source": [ "# Dump each Crawler's configuration into its own YAML file\n", "for crawler_config in inflight_configuration_data.values():\n", " base_dir = os.getcwd()\n", " file_name = (\n", " f\"{crawler_config['output_index']}-config.yml\" # autogen a custom filename\n", " )\n", " output_path = os.path.join(base_dir, file_name)\n", "\n", " if os.path.exists(base_dir):\n", " with open(output_path, \"w\") as file:\n", " yaml.safe_dump(crawler_config, file, sort_keys=False)\n", " print(f\" Wrote {file_name} to {output_path}\")" ] }, { "cell_type": "markdown", "id": "35c56a2b-4acd-47f5-90e3-9dd39fa4383f", "metadata": {}, "source": [ "#### Option 2: Print to output" ] }, { "cell_type": "code", "execution_count": null, "id": "525aabb8-0537-4ba6-8109-109490dddafe", "metadata": {}, "outputs": [], "source": [ "for crawler_config in inflight_configuration_data.values():\n", " yaml_out = yaml.safe_dump(crawler_config, sort_keys=False)\n", "\n", " print(f\"YAML config => {crawler_config['output_index']}-config.yml\\n--------\")\n", " print(yaml_out)\n", " print(\n", " \"--------------------------------------------------------------------------------\"\n", " )" ] }, { "cell_type": "markdown", "id": "dd4d18de-7b3b-4ebe-831b-c96bc55d6eb9", "metadata": {}, "source": [ "### Next Steps\n", "\n", "Now that the YAML files have been generated, you can visit the Open Crawler GitHub repository to learn more about how to deploy Open Crawler: https://github.com/elastic/crawler#quickstart\n", "\n", "If you find any problems with this Notebook, please feel free to create an 
issue in the elasticsearch-labs repository: https://github.com/elastic/elasticsearch-labs/issues" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.8" } }, "nbformat": 4, "nbformat_minor": 5 }