notebooks/esql/esql-getting-started.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "yfVx1CGq7dZU" }, "source": [ "<a target=\"_blank\" href=\"https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/esql/esql-getting-started.ipynb\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" ] }, { "cell_type": "markdown", "metadata": { "id": "rkpAlNln3JGH" }, "source": [ "# Getting started with ES|QL\n", "\n", "In this notebook you'll learn the basics of the Elasticsearch Query Language (ES|QL).\n", "You'll be using the official [Elasticsearch Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html).\n", "\n", "You'll learn how to:\n", "\n", "- Run an ES|QL query\n", "- Use processing commands\n", "- Sort a table\n", "- Query data\n", "- Chain processing commands\n", "- Compute values\n", "- Calculate statistics\n", "- Access columns\n", "- Create a histogram\n", "- Enrich data\n", "- Process data\n", "\n", "> ℹ️ ES|QL is generally available as of Elastic stack version **8.14.0**." ] }, { "cell_type": "markdown", "metadata": { "id": "HXktLuMB3JGI" }, "source": [ "## Create Elastic Cloud deployment\n", "\n", "If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?onboarding_token=search&utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial.\n", "\n", "Once logged in to your Elastic Cloud account, go to the [Create deployment](https://cloud.elastic.co/deployments/create) page and select **Create deployment**. Leave all settings with their default values." ] }, { "cell_type": "markdown", "metadata": { "id": "dpyq9sea3JGI" }, "source": [ "## Install packages and import modules\n", "\n", "To get started, we'll need to connect to our Elastic deployment using the Python client. Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.\n", "\n", "First we need to install the `elasticsearch` Python client." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "vG5NxAmpa-ZF", "outputId": "fa806171-8e14-4c6c-9b5d-362aadad09cf" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting elasticsearch\n", " Downloading elasticsearch-8.14.0-py3-none-any.whl (480 kB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m480.2/480.2 kB\u001b[0m \u001b[31m3.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hCollecting elastic-transport<9,>=8.13 (from elasticsearch)\n", " Downloading elastic_transport-8.13.1-py3-none-any.whl (64 kB)\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m64.5/64.5 kB\u001b[0m \u001b[31m5.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25hRequirement already satisfied: urllib3<3,>=1.26.2 in /usr/local/lib/python3.10/dist-packages (from elastic-transport<9,>=8.13->elasticsearch) (2.0.7)\n", "Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from elastic-transport<9,>=8.13->elasticsearch) (2024.6.2)\n", "Installing collected packages: elastic-transport, elasticsearch\n", "Successfully installed elastic-transport-8.13.1 elasticsearch-8.14.0\n" ] } ], "source": [ "# Install packages\n", "\n", "!pip install elasticsearch" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "b2vOwH2lbFFF" }, "outputs": [], "source": [ "# Import packages\n", "\n", "from elasticsearch import Elasticsearch\n", "from elasticsearch.helpers import bulk\n", "from getpass import getpass # standard library module for secure credential input" ] }, { "cell_type": "markdown", "metadata": { "id": "voinbxva3JGJ" }, "source": [ "## Initialize the Elasticsearch client\n", "\n", "Now we can instantiate the Elasticsearch Python client, providing the CLOUD ID and [API key](https://www.elastic.co/guide/en/kibana/current/api-keys.html#create-api-key) for your deployment.\n", "\n", "> ℹ️ If you're running Elasticsearch locally or on self-managed infrastructure, you'll need to pass in the Elasticsearch host instead. [Read the docs](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#_verifying_https_with_certificate_fingerprints_python_3_10_or_later) about how to connect to Elasticsearch locally." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "rBbdkhAIbtym", "outputId": "8bfde8e3-205c-4e48-fe2e-9bb11bac3ded" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Elastic Cloud ID: ··········\n", "Elastic API Key: ··········\n" ] } ], "source": [ "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id\n", "ELASTIC_CLOUD_ID = getpass(\"Elastic Cloud ID: \")\n", "\n", "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key\n", "ELASTIC_API_KEY = getpass(\"Elastic API Key: \")\n", "\n", "# Create the client instance\n", "client = Elasticsearch(\n", " # For local development\n", " # hosts=[\"http://localhost:9200\"]\n", " cloud_id=ELASTIC_CLOUD_ID,\n", " api_key=ELASTIC_API_KEY,\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "PMU9fyT07j3C" }, "source": [ "## Add sample data to Elasticsearch\n", "\n", "Before we index our sample dataset, let's create an index named `sample_data` with the correct [mappings](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html)." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "VZlVzRNP8IXo" }, "outputs": [], "source": [ "index_name = \"sample_data\"\n", "\n", "mappings = {\n", " \"mappings\": {\n", " \"properties\": {\"client_ip\": {\"type\": \"ip\"}, \"message\": {\"type\": \"keyword\"}}\n", " }\n", "}\n", "\n", "# Create the index\n", "if not client.indices.exists(index=index_name):\n", " client.indices.create(index=index_name, body=mappings)" ] }, { "cell_type": "markdown", "metadata": { "id": "QjuRVtqc8esR" }, "source": [ "Now we can index the data using the Elasticsearch Python client's [bulk helpers](https://elasticsearch-py.readthedocs.io/en/latest/helpers.html#bulk-helpers)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "gKYu3MHQ8te5", "outputId": "f5f744ec-0e1c-4256-9cc8-f5afb1a25ab6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Successfully indexed 7 documents.\n" ] } ], "source": [ "# Documents to be indexed\n", "documents = [\n", " {\n", " \"@timestamp\": \"2023-10-23T12:15:03.360Z\",\n", " \"client_ip\": \"172.21.2.162\",\n", " \"message\": \"Connected to 10.1.0.3\",\n", " \"event_duration\": 3450233,\n", " },\n", " {\n", " \"@timestamp\": \"2023-10-23T12:27:28.948Z\",\n", " \"client_ip\": \"172.21.2.113\",\n", " \"message\": \"Connected to 10.1.0.2\",\n", " \"event_duration\": 2764889,\n", " },\n", " {\n", " \"@timestamp\": \"2023-10-23T13:33:34.937Z\",\n", " \"client_ip\": \"172.21.0.5\",\n", " \"message\": \"Disconnected\",\n", " \"event_duration\": 1232382,\n", " },\n", " {\n", " \"@timestamp\": \"2023-10-23T13:51:54.732Z\",\n", " \"client_ip\": \"172.21.3.15\",\n", " \"message\": \"Connection error\",\n", " \"event_duration\": 725448,\n", " },\n", " {\n", " \"@timestamp\": \"2023-10-23T13:52:55.015Z\",\n", " \"client_ip\": \"172.21.3.15\",\n", " \"message\": \"Connection error\",\n", " \"event_duration\": 8268153,\n", " },\n", " {\n", " \"@timestamp\": \"2023-10-23T13:53:55.832Z\",\n", " \"client_ip\": \"172.21.3.15\",\n", " \"message\": \"Connection error\",\n", " \"event_duration\": 5033755,\n", " },\n", " {\n", " \"@timestamp\": \"2023-10-23T13:55:01.543Z\",\n", " \"client_ip\": \"172.21.3.15\",\n", " \"message\": \"Connected to 10.1.0.1\",\n", " \"event_duration\": 1756467,\n", " },\n", "]\n", "\n", "# Prepare the actions for the bulk API using list comprehension\n", "actions = [{\"_index\": index_name, \"_source\": doc} for doc in documents]\n", "\n", "# Perform the bulk index operation and capture the response\n", "success, failed = bulk(client, actions)\n", "\n", "if failed:\n", " print(f\"Some documents failed to index: {failed}\")\n", "else:\n", " print(f\"Successfully indexed {success} documents.\")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "id": "nkFLjdQzcPjQ" }, "outputs": [], "source": [ "# Suppress specific Elasticsearch warnings about default limit of [500] that pollute responses\n", "\n", "import warnings\n", "from elasticsearch import ElasticsearchWarning\n", "\n", "warnings.filterwarnings(\"ignore\", category=ElasticsearchWarning)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "id": "UC4KZFRYjzkc" }, "outputs": [], "source": [ "# Format response to return human-readable tables\n", "\n", "\n", "def format_response(response_data):\n", " column_names = [col[\"name\"] for col in response_data[\"columns\"]]\n", " column_widths = [\n", " max(\n", " len(name),\n", " max(\n", " (\n", " len(str(row[i]) if row[i] is not None else \"None\")\n", " for row in response_data[\"values\"]\n", " ),\n", " default=0,\n", " ),\n", " )\n", " for i, name in enumerate(column_names)\n", " ]\n", " row_format = \" | \".join([\"{:<\" + str(width) + \"}\" for width in column_widths])\n", " print(row_format.format(*column_names))\n", " print(\"-\" * sum(column_widths) + \"-\" * (len(column_widths) - 1) * 3)\n", " for row in response_data[\"values\"]:\n", " # Convert None values in the row to \"None\" before formatting\n", " formatted_row = [(str(cell) if cell is not None else \"None\") for cell in row]\n", " print(row_format.format(*formatted_row))" ] }, { "cell_type": "markdown", "metadata": { "id": "CQhFjEIp3JGL" }, "source": [ "## Your first ES|QL query\n", "\n", "Each ES|QL query starts with a [source command](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-commands.html#esql-source-commands \"Source commands\"). A source command produces a table, typically with data from Elasticsearch.\n", "\n", "The [`FROM`](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-commands.html#esql-from \"FROM\") source command returns a table with documents from a data stream, index, or alias. Each row in the resulting table represents a document. This query returns up to 500 documents from the `sample_data` index:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Jyb40LkJh6sN", "outputId": "01dd3b73-b0bf-423d-9856-8764c1361266" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "@timestamp | client_ip | event_duration | message \n", "--------------------------------------------------------------------------------\n", "2023-10-23T12:15:03.360Z | 172.21.2.162 | 3450233 | Connected to 10.1.0.3\n", "2023-10-23T12:27:28.948Z | 172.21.2.113 | 2764889 | Connected to 10.1.0.2\n", "2023-10-23T13:33:34.937Z | 172.21.0.5 | 1232382 | Disconnected \n", "2023-10-23T13:51:54.732Z | 172.21.3.15 | 725448 | Connection error \n", "2023-10-23T13:52:55.015Z | 172.21.3.15 | 8268153 | Connection error \n", "2023-10-23T13:53:55.832Z | 172.21.3.15 | 5033755 | Connection error \n", "2023-10-23T13:55:01.543Z | 172.21.3.15 | 1756467 | Connected to 10.1.0.1\n" ] } ], "source": [ "esql_query = \"FROM sample_data\"\n", "\n", "response = client.esql.query(query=esql_query)\n", "format_response(response)" ] }, { "cell_type": "markdown", "metadata": { "id": "QpBc6ABG3JGM" }, "source": [ "Each column corresponds to a field, and can be accessed by the name of that field.\n", "\n", "ℹ️ ES|QL keywords are case-insensitive. `FROM sample_data` is identical to `from sample_data`." ] }, { "cell_type": "markdown", "metadata": { "id": "B1P44sOfmehL" }, "source": [ "## Processing commands\n", "\n", "A source command can be followed by one or more [processing commands](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-commands.html#esql-processing-commands), separated by a pipe character: `|`. Processing commands change an input table by adding, removing, or changing rows and columns. Processing commands can perform filtering, projection, aggregation, and more.\n", "\n", "For example, you can use the `LIMIT` command to limit the number of rows that are returned, up to a maximum of 10,000 rows:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "VGhogxc4mEUW", "outputId": "d0b17451-92ad-4f0f-c4b0-56544e3520b3" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "@timestamp | client_ip | event_duration | message \n", "--------------------------------------------------------------------------------\n", "2023-10-23T12:15:03.360Z | 172.21.2.162 | 3450233 | Connected to 10.1.0.3\n", "2023-10-23T12:27:28.948Z | 172.21.2.113 | 2764889 | Connected to 10.1.0.2\n", "2023-10-23T13:33:34.937Z | 172.21.0.5 | 1232382 | Disconnected \n" ] } ], "source": [ "esql_query = \"\"\"\n", "FROM sample_data\n", "| LIMIT 3\n", "\"\"\"\n", "\n", "response = client.esql.query(query=esql_query)\n", "format_response(response)" ] }, { "cell_type": "markdown", "metadata": { "id": "RLPlQtYem4fc" }, "source": [ "### Sort a table\n", "\n", "Another processing command is the [`SORT`](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-commands.html#esql-sort) command. By default, the rows returned by `FROM` don’t have a defined sort order. Use the `SORT` command to sort rows on one or more columns:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "m5LOqw1kmGkb", "outputId": "e9ebb9cf-e295-4030-9062-4349cc7a65e7" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "@timestamp | client_ip | event_duration | message \n", "--------------------------------------------------------------------------------\n", "2023-10-23T13:55:01.543Z | 172.21.3.15 | 1756467 | Connected to 10.1.0.1\n", "2023-10-23T13:53:55.832Z | 172.21.3.15 | 5033755 | Connection error \n", "2023-10-23T13:52:55.015Z | 172.21.3.15 | 8268153 | Connection error \n", "2023-10-23T13:51:54.732Z | 172.21.3.15 | 725448 | Connection error \n", "2023-10-23T13:33:34.937Z | 172.21.0.5 | 1232382 | Disconnected \n", "2023-10-23T12:27:28.948Z | 172.21.2.113 | 2764889 | Connected to 10.1.0.2\n", "2023-10-23T12:15:03.360Z | 172.21.2.162 | 3450233 | Connected to 10.1.0.3\n" ] } ], "source": [ "esql_query = \"\"\"\n", "FROM sample_data\n", "| SORT @timestamp DESC\n", "\"\"\"\n", "\n", "response = client.esql.query(query=esql_query)\n", "format_response(response)" ] }, { "cell_type": "markdown", "metadata": { "id": "9YBqc9ftoNkb" }, "source": [ "### Query the data\n", "\n", "Use the [`WHERE`](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-commands.html#esql-where) command to query the data. For example, to find all events with a duration longer than 5ms:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "wR4Pt165oZ6i", "outputId": "578cc098-7bf0-4800-c654-783f898e7d82" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "@timestamp | client_ip | event_duration | message \n", "--------------------------------------------------------------------------\n", "2023-10-23T13:52:55.015Z | 172.21.3.15 | 8268153 | Connection error\n", "2023-10-23T13:53:55.832Z | 172.21.3.15 | 5033755 | Connection error\n" ] } ], "source": [ "esql_query = \"\"\"\n", "FROM sample_data\n", "| WHERE event_duration > 5000000\n", "\"\"\"\n", "\n", "response = client.esql.query(query=esql_query)\n", "format_response(response)" ] }, { "cell_type": "markdown", "metadata": { "id": "C_6-0a4ZopKn" }, "source": [ "`WHERE` supports several [operators](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-functions-operators.html#esql-operators).\n", "\n", "For example, you can use [`LIKE`](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-functions-operators.html#esql-like-operator) to run a wildcard query against the message column:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "SsU1Nsh1ouFs", "outputId": "bb6c169f-0f48-4f5b-e8cc-890e5d891402" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "@timestamp | client_ip | event_duration | message \n", "--------------------------------------------------------------------------------\n", "2023-10-23T12:15:03.360Z | 172.21.2.162 | 3450233 | Connected to 10.1.0.3\n", "2023-10-23T12:27:28.948Z | 172.21.2.113 | 2764889 | Connected to 10.1.0.2\n", "2023-10-23T13:55:01.543Z | 172.21.3.15 | 1756467 | Connected to 10.1.0.1\n" ] } ], "source": [ "esql_query = \"\"\"\n", "FROM sample_data\n", "| WHERE message LIKE \"Connected*\"\n", "\"\"\"\n", "\n", "response = client.esql.query(query=esql_query)\n", "format_response(response)" ] }, { "cell_type": "markdown", "metadata": { "id": "HjuCP76BEsDz" }, "source": [ "### More processing commands\n", "\n", "There are many other processing commands, like [`KEEP`](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-commands.html#esql-keep \"KEEP\") and [`DROP`](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-commands.html#esql-drop \"DROP\") to keep or drop columns, [`ENRICH`](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-commands.html#esql-enrich \"ENRICH\") to enrich a table with data from indices in Elasticsearch, and [`DISSECT`](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-commands.html#esql-dissect \"DISSECT\") and [`GROK`](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-commands.html#esql-grok \"GROK\") to process data. Refer to [Processing commands](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-commands.html#esql-processing-commands \"Processing commands\") for an overview." ] }, { "cell_type": "markdown", "metadata": { "id": "ESaKzB8HFMXj" }, "source": [ "## Chain processing commands\n", "\n", "You can chain processing commands, separated by a pipe character: `|`. Each\n", "processing command works on the output table of the previous command. The result\n", "of a query is the table produced by the final processing command.\n", "\n", "The following example first sorts the table on `@timestamp`, and next limits the\n", "result set to 3 rows:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ztm9lRiIFTnS", "outputId": "f54dc210-3fd5-405d-9bf5-68e28dc75305" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "@timestamp | client_ip | event_duration | message \n", "-------------------------------------------------------------------------------\n", "2023-10-23T13:55:01.543Z | 172.21.3.15 | 1756467 | Connected to 10.1.0.1\n", "2023-10-23T13:53:55.832Z | 172.21.3.15 | 5033755 | Connection error \n", "2023-10-23T13:52:55.015Z | 172.21.3.15 | 8268153 | Connection error \n" ] } ], "source": [ "esql_query = \"\"\"\n", "FROM sample_data\n", "| SORT @timestamp DESC\n", "| LIMIT 3\n", "\"\"\"\n", "\n", "response = client.esql.query(query=esql_query)\n", "format_response(response)" ] }, { "cell_type": "markdown", "metadata": { "id": "cogdcusMFgB3" }, "source": [ "> ℹ️ The order of processing commands is important. First limiting the result set to 3 rows before sorting those 3 rows would most likely return a result that is different than this example, where the sorting comes before the limit." ] }, { "cell_type": "markdown", "metadata": { "id": "vsJ5wFitFlow" }, "source": [ "## Compute values\n", "\n", "Use the [`EVAL`](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-commands.html#esql-eval \"EVAL\") command to append columns to a table, with calculated values. For example, the following query appends a `duration_ms` column. The values in the column are computed by dividing `event_duration` by 1,000,000. In other words: `event_duration` converted from nanoseconds to milliseconds." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "bxCjCNwxFx4e", "outputId": "885d60f6-661c-4a15-a76b-52d774e6bce7" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "@timestamp | client_ip | event_duration | message | duration_ms\n", "----------------------------------------------------------------------------------------------\n", "2023-10-23T12:15:03.360Z | 172.21.2.162 | 3450233 | Connected to 10.1.0.3 | 3.450233 \n", "2023-10-23T12:27:28.948Z | 172.21.2.113 | 2764889 | Connected to 10.1.0.2 | 2.764889 \n", "2023-10-23T13:33:34.937Z | 172.21.0.5 | 1232382 | Disconnected | 1.232382 \n", "2023-10-23T13:51:54.732Z | 172.21.3.15 | 725448 | Connection error | 0.725448 \n", "2023-10-23T13:52:55.015Z | 172.21.3.15 | 8268153 | Connection error | 8.268153 \n", "2023-10-23T13:53:55.832Z | 172.21.3.15 | 5033755 | Connection error | 5.033755 \n", "2023-10-23T13:55:01.543Z | 172.21.3.15 | 1756467 | Connected to 10.1.0.1 | 1.756467 \n" ] } ], "source": [ "esql_query = \"\"\"\n", "FROM sample_data\n", "| EVAL duration_ms = event_duration/1000000.0\n", "\"\"\"\n", "\n", "response = client.esql.query(query=esql_query)\n", "format_response(response)" ] }, { "cell_type": "markdown", "metadata": { "id": "OY6HN6PtF8zy" }, "source": [ "`EVAL` supports several [functions](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-functions-operators.html#esql-functions). For example, to round a number to the closest number with the specified number of digits, use the [`ROUND`](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-functions-operators.html#esql-round \"ROUND\") function:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "RTx1xw2ZF__8", "outputId": "6b203c0c-e0b6-4226-fd2d-3132300dae70" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "@timestamp | client_ip | event_duration | message | duration_ms\n", "----------------------------------------------------------------------------------------------\n", "2023-10-23T12:15:03.360Z | 172.21.2.162 | 3450233 | Connected to 10.1.0.3 | 3.5 \n", "2023-10-23T12:27:28.948Z | 172.21.2.113 | 2764889 | Connected to 10.1.0.2 | 2.8 \n", "2023-10-23T13:33:34.937Z | 172.21.0.5 | 1232382 | Disconnected | 1.2 \n", "2023-10-23T13:51:54.732Z | 172.21.3.15 | 725448 | Connection error | 0.7 \n", "2023-10-23T13:52:55.015Z | 172.21.3.15 | 8268153 | Connection error | 8.3 \n", "2023-10-23T13:53:55.832Z | 172.21.3.15 | 5033755 | Connection error | 5.0 \n", "2023-10-23T13:55:01.543Z | 172.21.3.15 | 1756467 | Connected to 10.1.0.1 | 1.8 \n" ] } ], "source": [ "esql_query = \"\"\"\n", "FROM sample_data\n", "| EVAL duration_ms = ROUND(event_duration/1000000.0, 1)\n", "\"\"\"\n", "\n", "response = client.esql.query(query=esql_query)\n", "format_response(response)" ] }, { "cell_type": "markdown", "metadata": { "id": "NW3NYqMLGI_O" }, "source": [ "## Calculate statistics\n", "\n", "You can also use ES|QL to aggregate your data. Use the [`STATS ... BY`](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-commands.html#esql-stats-by \"STATS ... BY\") command to calculate statistics.\n", "\n", "For example, to calculate the median duration:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "v-vamaNVGTwS", "outputId": "4c392179-57cd-4d1e-d0d4-3bd808574465" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "median_duration\n", "---------------\n", "2764889.0 \n" ] } ], "source": [ "esql_query = \"\"\"\n", "FROM sample_data\n", "| STATS median_duration = MEDIAN(event_duration)\n", "\"\"\"\n", "\n", "response = client.esql.query(query=esql_query)\n", "format_response(response)" ] }, { "cell_type": "markdown", "metadata": { "id": "Tyr8TURjGXZm" }, "source": [ "You can calculate multiple stats with one command:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "QLGT8GTvGZIg", "outputId": "cc5b3288-f53b-4162-9770-3519b93f367c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "median_duration | max_duration\n", "------------------------------\n", "2764889.0 | 8268153 \n" ] } ], "source": [ "esql_query = \"\"\"\n", "FROM sample_data\n", "| STATS median_duration = MEDIAN(event_duration), max_duration = MAX(event_duration)\n", "\"\"\"\n", "\n", "response = client.esql.query(query=esql_query)\n", "format_response(response)" ] }, { "cell_type": "markdown", "metadata": { "id": "ZeLti0ByGc9S" }, "source": [ "Use BY to group calculated stats by one or more columns. For example, to calculate the median duration per client IP:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "EWGMRhZrHH-c", "outputId": "0e1c48ff-0f67-45c7-b2ce-b1b35b616955" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "median_duration | client_ip \n", "------------------------------\n", "1232382.0 | 172.21.0.5 \n", "2764889.0 | 172.21.2.113\n", "3450233.0 | 172.21.2.162\n", "3395111.0 | 172.21.3.15 \n" ] } ], "source": [ "esql_query = \"\"\"\n", "FROM sample_data\n", "| STATS median_duration = MEDIAN(event_duration) BY client_ip\n", "\"\"\"\n", "\n", "response = client.esql.query(query=esql_query)\n", "format_response(response)" ] }, { "cell_type": "markdown", "metadata": { "id": "y3zS4-zLHO5I" }, "source": [ "## Access columns\n", "\n", "You can access columns by their name. If a name contains special characters, [it needs to be quoted](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-syntax.html#esql-identifiers \"Identifiers\") with backticks (`` ` ``).\n", "\n", "Assigning an explicit name to a column created by `EVAL` or `STATS` is optional. If you don’t provide a name, the new column name is equal to the function expression. For example:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "V62P4GEgHojF", "outputId": "d6eddbc0-005a-43f8-be12-697faa6c6972" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "@timestamp | client_ip | event_duration | message | event_duration/1000000.0\n", "-----------------------------------------------------------------------------------------------------------\n", "2023-10-23T12:15:03.360Z | 172.21.2.162 | 3450233 | Connected to 10.1.0.3 | 3.450233 \n", "2023-10-23T12:27:28.948Z | 172.21.2.113 | 2764889 | Connected to 10.1.0.2 | 2.764889 \n", "2023-10-23T13:33:34.937Z | 172.21.0.5 | 1232382 | Disconnected | 1.232382 \n", "2023-10-23T13:51:54.732Z | 172.21.3.15 | 725448 | Connection error | 0.725448 \n", "2023-10-23T13:52:55.015Z | 172.21.3.15 | 8268153 | Connection error | 8.268153 \n", "2023-10-23T13:53:55.832Z | 172.21.3.15 | 5033755 | Connection error | 5.033755 \n", "2023-10-23T13:55:01.543Z | 172.21.3.15 | 1756467 | Connected to 10.1.0.1 | 1.756467 \n" ] } ], "source": [ "esql_query = \"\"\"\n", "FROM sample_data\n", "| EVAL event_duration/1000000.0\n", "\"\"\"\n", "\n", "response = client.esql.query(query=esql_query)\n", "format_response(response)" ] }, { "cell_type": "markdown", "metadata": { "id": "Al_cDCIIHpsR" }, "source": [ "In this query, `EVAL` adds a new column named `event_duration/1000000.0`. Because its name contains special characters, to access this column, quote it with backticks:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Jv6y2SjLHtBA", "outputId": "df6ad16b-cbe6-4c6f-cd12-0521f95c01f0" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MEDIAN(`event_duration/1000000.0`)\n", "----------------------------------\n", "2.764889 \n" ] } ], "source": [ "esql_query = \"\"\"\n", "FROM sample_data\n", "| EVAL event_duration/1000000.0\n", "| STATS MEDIAN(`event_duration/1000000.0`)\n", "\"\"\"\n", "response = client.esql.query(query=esql_query)\n", "format_response(response)" ] }, { "cell_type": "markdown", "metadata": { "id": "ZKB3miTcH1St" }, "source": [ "## Create a histogram\n", "\n", "To track statistics over time, ES|QL enables you to create histograms using the [`BUCKET`](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-functions-operators.html#esql-bucket \"BUCKET\") function. `BUCKET` creates human-friendly bucket sizes and returns a value for each row that corresponds to the resulting bucket the row falls into.\n", "\n", "> ℹ️ The `BUCKET` function must be used together with the [`STATS ... BY`](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-commands.html#esql-stats-by \"STATS ... BY\") command. It replaces the `AUTO_BUCKET` function which was removed in 8.14.0.\n", "\n", "For example, the following query creates hourly buckets for the data on October 23rd and creates a histogram to count the number of events per hour." ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "WtKXFOfyH4zt", "outputId": "7767905b-393f-40f9-bd82-008778626524" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "COUNT(*) | bucket \n", "-----------------------------------\n", "2 | 2023-10-23T12:00:00.000Z\n", "5 | 2023-10-23T13:00:00.000Z\n" ] } ], "source": [ "esql_query = \"\"\"\n", "FROM sample_data\n", "| KEEP @timestamp\n", "| STATS COUNT(*) BY bucket = BUCKET(@timestamp, 24, \"2023-10-23T00:00:00Z\", \"2023-10-23T23:59:59Z\")\n", "\"\"\"\n", "response = client.esql.query(query=esql_query)\n", "format_response(response)" ] }, { "cell_type": "markdown", "metadata": { "id": "LKmuoDfRIH67" }, "source": [ "Or the median duration per hour:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "gDHT5BHNIHQM", "outputId": "6c4a8bc9-defe-4069-abf7-4bee4196cb41" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "median_duration | bucket \n", "------------------------------------------\n", "3107561.0 | 2023-10-23T12:00:00.000Z\n", "1756467.0 | 2023-10-23T13:00:00.000Z\n" ] } ], "source": [ "esql_query = \"\"\"\n", "FROM sample_data\n", "| KEEP @timestamp, event_duration\n", "| STATS median_duration = MEDIAN(event_duration) BY bucket = BUCKET(@timestamp, 24, \"2023-10-23T00:00:00Z\", \"2023-10-23T23:59:59Z\")\n", "\"\"\"\n", "\n", "response = client.esql.query(query=esql_query)\n", "format_response(response)" ] }, { "cell_type": "markdown", "metadata": { "id": "ETZb10tIIKyO" }, "source": [ "## Enrich data\n", "\n", "ES|QL enables you to [enrich](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-enrich-data.html \"Data enrichment\") a table with data from indices in Elasticsearch, using the [`ENRICH`](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-commands.html#esql-enrich \"ENRICH\") command.\n", "\n", "> ℹ️ Before you can use `ENRICH`, you first need to [create](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-enrich-data.html#esql-create-enrich-policy \"Create an enrich policy\") and [execute](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-enrich-data.html#esql-execute-enrich-policy \"Execute the enrich policy\") an [enrich policy](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-enrich-data.html#esql-enrich-policy).\n", "\n", "The following requests create and execute a policy called `clientip_policy`. The policy links an IP address to an environment (\"Development\", \"QA\", or \"Production\").\n", "\n" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "2QNKgM--IQtJ", "outputId": "ac3da396-08ec-45be-b736-4492e9bb3716" }, "outputs": [ { "data": { "text/plain": [ "ObjectApiResponse({'status': {'phase': 'COMPLETE'}})" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Define the mapping\n", "mapping = {\n", " \"mappings\": {\n", " \"properties\": {\"client_ip\": {\"type\": \"keyword\"}, \"env\": {\"type\": \"keyword\"}}\n", " }\n", "}\n", "\n", "# Create the index with the mapping\n", "client.indices.create(index=\"clientips\", body=mapping)\n", "\n", "# Prepare bulk data\n", "bulk_data = [\n", " {\"index\": {}},\n", " {\"client_ip\": \"172.21.0.5\", \"env\": \"Development\"},\n", " {\"index\": {}},\n", " {\"client_ip\": \"172.21.2.113\", \"env\": \"QA\"},\n", " {\"index\": {}},\n", " {\"client_ip\": \"172.21.2.162\", \"env\": \"QA\"},\n", " {\"index\": {}},\n", " {\"client_ip\": \"172.21.3.15\", \"env\": \"Production\"},\n", " {\"index\": {}},\n", " {\"client_ip\": \"172.21.3.16\", \"env\": \"Production\"},\n", "]\n", "\n", "# Bulk index the data\n", "client.bulk(index=\"clientips\", body=bulk_data)\n", "\n", "# Define the enrich policy\n", "policy = {\n", " \"match\": {\n", " \"indices\": \"clientips\",\n", " \"match_field\": \"client_ip\",\n", " \"enrich_fields\": [\"env\"],\n", " }\n", "}\n", "\n", "# Put the enrich policy\n", "client.enrich.put_policy(name=\"clientip_policy\", body=policy)\n", "\n", "# Execute the enrich policy and wait for completion\n", "client.enrich.execute_policy(name=\"clientip_policy\", wait_for_completion=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "GtauD2gHJ683" }, "source": [ "After creating and executing a policy, you can use it with the `ENRICH` command:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "IHER1VM2KsqV", "outputId": "82106c80-c4a7-47dc-a7d9-d503c13b2a44" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "@timestamp | event_duration | client_ip | env \n", "----------------------------------------------------------------------\n", "2023-10-23T12:15:03.360Z | 3450233 | 172.21.2.162 | QA \n", "2023-10-23T12:27:28.948Z | 2764889 | 172.21.2.113 | QA \n", "2023-10-23T13:33:34.937Z | 1232382 | 172.21.0.5 | Development\n", "2023-10-23T13:51:54.732Z | 725448 | 172.21.3.15 | Production \n", "2023-10-23T13:52:55.015Z | 8268153 | 172.21.3.15 | Production \n", "2023-10-23T13:53:55.832Z | 5033755 | 172.21.3.15 | Production \n", "2023-10-23T13:55:01.543Z | 1756467 | 172.21.3.15 | Production \n" ] } ], "source": [ "esql_query = \"\"\"\n", "FROM sample_data\n", "| KEEP @timestamp, client_ip, event_duration\n", "| EVAL client_ip = TO_STRING(client_ip)\n", "| ENRICH clientip_policy ON client_ip WITH env\n", "\"\"\"\n", "response = client.esql.query(query=esql_query)\n", "format_response(response)" ] }, { "cell_type": "markdown", "metadata": { "id": "rWrJ99ReKvRT" }, "source": [ "You can use the new `env` column that’s added by the `ENRICH` command in subsequent commands. For example, to calculate the median duration per environment:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "9Fyz2JkZK47P", "outputId": "1cdc5e84-daf7-4d87-eb4d-9d3c5e6cfdad" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "median_duration | env \n", "-----------------------------\n", "3107561.0 | QA \n", "1232382.0 | Development\n", "3395111.0 | Production \n" ] } ], "source": [ "esql_query = \"\"\"\n", "FROM sample_data\n", "| KEEP @timestamp, client_ip, event_duration\n", "| EVAL client_ip = TO_STRING(client_ip)\n", "| ENRICH clientip_policy ON client_ip WITH env\n", "| STATS median_duration = MEDIAN(event_duration) BY env\n", "\"\"\"\n", "response = client.esql.query(query=esql_query)\n", "format_response(response)" ] }, { "cell_type": "markdown", "metadata": { "id": "YonYg0UhK8Rk" }, "source": [ "For more about data enrichment with ES|QL, refer to [Data enrichment](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-enrich-data.html \"Data enrichment\")." ] }, { "cell_type": "markdown", "metadata": { "id": "Fsjoqo-eK9qX" }, "source": [ "## Process data\n", "\n", "Your data may contain unstructured strings that you want to [structure](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-process-data-with-dissect-and-grok.html \"Data processing with DISSECT and GROK\") to make it easier to analyze the data. For example, the sample data contains log messages like:\n", "\n", "```\n", "\"Connected to 10.1.0.3\"\n", "```\n", "\n", "By extracting the IP address from these messages, you can determine which IP has accepted the most client connections.\n", "\n", "To structure unstructured strings at query time, you can use the ES|QL [`DISSECT`](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-commands.html#esql-dissect \"DISSECT\") and [`GROK`](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-commands.html#esql-grok \"GROK\") commands. `DISSECT` works by breaking up a string using a delimiter-based pattern. `GROK` works similarly, but uses regular expressions. This makes `GROK` more powerful, but generally also slower.\n", "\n", "In this case, no regular expressions are needed, as the `message` is straightforward: \"Connected to \", followed by the server IP. To match this string, you can use the following `DISSECT` command:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "kwX_AFgeLEvQ", "outputId": "bf734a58-76eb-496a-817b-1fd398576526" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "@timestamp | client_ip | event_duration | message | server_ip\n", "--------------------------------------------------------------------------------------------\n", "2023-10-23T12:15:03.360Z | 172.21.2.162 | 3450233 | Connected to 10.1.0.3 | 10.1.0.3 \n", "2023-10-23T12:27:28.948Z | 172.21.2.113 | 2764889 | Connected to 10.1.0.2 | 10.1.0.2 \n", "2023-10-23T13:33:34.937Z | 172.21.0.5 | 1232382 | Disconnected | None \n", "2023-10-23T13:51:54.732Z | 172.21.3.15 | 725448 | Connection error | None \n", "2023-10-23T13:52:55.015Z | 172.21.3.15 | 8268153 | Connection error | None \n", "2023-10-23T13:53:55.832Z | 172.21.3.15 | 5033755 | Connection error | None \n", "2023-10-23T13:55:01.543Z | 172.21.3.15 | 1756467 | Connected to 10.1.0.1 | 10.1.0.1 \n" ] } ], "source": [ "esql_query = \"\"\"\n", "FROM sample_data\n", "| DISSECT message \"Connected to %{server_ip}\"\n", "\"\"\"\n", "response = client.esql.query(query=esql_query)\n", "format_response(response)" ] }, { "cell_type": "markdown", "metadata": { "id": "1llLaCrWLTk_" }, "source": [ "This adds a `server_ip` column to those rows that have a `message` that matches this pattern. For other rows, the value of `server_ip` is `null`.\n", "\n", "You can use the new `server_ip` column that’s added by the `DISSECT` command in subsequent commands. For example, to determine how many connections each server has accepted:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "8_uEN-H5LaXH", "outputId": "188babae-cf7a-4db8-800c-7bcf7576f8f5" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "COUNT(*) | server_ip\n", "--------------------\n", "1 | 10.1.0.3 \n", "1 | 10.1.0.2 \n", "1 | 10.1.0.1 \n" ] } ], "source": [ "esql_query = \"\"\"\n", "FROM sample_data\n", "| WHERE STARTS_WITH(message, \"Connected to\")\n", "| DISSECT message \"Connected to %{server_ip}\"\n", "| STATS COUNT(*) BY server_ip\n", "\"\"\"\n", "response = client.esql.query(query=esql_query)\n", "format_response(response)" ] }, { "cell_type": "markdown", "metadata": { "id": "MtyvTCjyL6aC" }, "source": [ "> ℹ️ To learn more about data processing with ES|QL, refer to [Data processing with DISSECT and GROK](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-process-data-with-dissect-and-grok.html \"Data processing with DISSECT and GROK\")." ] }, { "cell_type": "markdown", "metadata": { "id": "HRE5HF_SLdqY" }, "source": [ "## Learn more\n", "\n", "To learn more about ES|QL, refer to:\n", "- [_Learning ES|QL_](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-language.html \"Learning ES|QL\")\n", "- [_Using ES|QL_](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql-using.html \"Using ES|QL\")" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3.12.3 64-bit", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.12.3" }, "vscode": { "interpreter": { "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e" } } }, "nbformat": 4, "nbformat_minor": 0 }

notebooks/esql/esql-getting-started.ipynb (1,371 lines of code) (raw):