{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"*Reranking with a locally hosted reranker model from HuggingFace*"
],
"metadata": {
"id": "bW9q8qD_bPhY"
}
},
{
"cell_type": "markdown",
"source": [
"# Setup the notebook"
],
"metadata": {
"id": "BecBOzyDbWik"
}
},
{
"cell_type": "markdown",
"source": [
"## Install required libs"
],
"metadata": {
"id": "6ayhDP72bZAe"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "2Xz9uWQFbNkH"
},
"outputs": [],
"source": [
"!pip install -qqU elasticsearch\n",
"!pip install -qqU eland[pytorch]\n",
"!pip install -qqU datasets"
]
},
{
"cell_type": "markdown",
"source": [
"## Import the required python libraries"
],
"metadata": {
"id": "LgHQaJh0bmJQ"
}
},
{
"cell_type": "code",
"source": [
"import os\n",
"from elasticsearch import Elasticsearch, helpers, exceptions\n",
"from urllib.request import urlopen\n",
"from getpass import getpass\n",
"import json\n",
"import time\n",
"from datasets import load_dataset\n",
"import pandas as pd"
],
"metadata": {
"id": "CsL466H0bjNX"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Create an Elasticsearch Python client\n",
"\n"
],
"metadata": {
"id": "gsQ4XIpkbpd4"
}
},
{
"cell_type": "markdown",
"source": [
"## Free Trial\n",
"If you don't have an Elasticsearch cluster, or what one to test out. Head over to [cloud.elastic.co](https://cloud.elastic.co/registration?onboarding_token=search&cta=cloud-registration&tech=trial&plcmt=article%20content&pg=search-labs) and sign up. You can sign up and have a serverless project up and running in only a few mintues!"
],
"metadata": {
"id": "M_JppGte1Wc5"
}
},
{
"cell_type": "markdown",
"source": [
"We are using an Elastic Cloud cloud_id and deployment (cluster) API key.\n",
"\n",
"[See this guide](https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud) for finding the `cloud_id` and creating an `api_key`"
],
"metadata": {
"id": "gdsEjwhV1tr5"
}
},
{
"cell_type": "code",
"source": [
"cloud_id = getpass(prompt=\"Enter your Elasticsearch Cloud ID: \")\n",
"api_key = getpass(prompt=\"Enter your Elasticsearch API key: \")\n",
"\n",
"\n",
"es = Elasticsearch(cloud_id=cloud_id, api_key=api_key)\n",
"\n",
"try:\n",
" es.info()\n",
" print(\"Successfully connected to Elasticsearch!\")\n",
"except exceptions.ConnectionError as e:\n",
" print(f\"Error connecting to Elasticsearch: {e}\")"
],
"metadata": {
"id": "UY5WCB0HUVTb"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"# Ready Elasticsearch"
],
"metadata": {
"id": "jQdzhNbB_e3n"
}
},
{
"cell_type": "markdown",
"source": [
"## Hugging Face Reranking Model\n",
"Run this cell to:\n",
"- Use Eland's `eland_import_hub_model` command to upload the reranking model to Elasticsearch.\n",
"\n",
"For this example we've chosen the [`cross-encoder/ms-marco-MiniLM-L-6-v2`](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2) text similarity model.\n",
"<br><br>\n",
"**Note**:\n",
"While we are importing the model for use as a reranker, Eland and Elasticsearch do not have a dedicated rerank task type, so we still use `text_similarity`"
],
"metadata": {
"id": "5bsLLnqCfNKk"
}
},
{
"cell_type": "code",
"source": [
"model_id = \"cross-encoder/ms-marco-MiniLM-L-6-v2\"\n",
"\n",
"!eland_import_hub_model \\\n",
" --cloud-id $cloud_id \\\n",
" --es-api-key $api_key \\\n",
" --hub-model-id $model_id \\\n",
" --task-type text_similarity"
],
"metadata": {
"id": "J2MTEYrUfk9R"
},
"execution_count": null,
"outputs": []
},
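{
"cell_type": "markdown",
"source": [
"### Optional: confirm the model upload\n",
"\n",
"Before creating the inference endpoint, we can confirm the model landed in Elasticsearch. This is a minimal sanity check, assuming Eland's default naming: the Hugging Face id is lowercased and `/` is replaced with `__`."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Optional sanity check: confirm the model was uploaded.\n",
"# Assumes Eland's default naming (Hub id lowercased, \"/\" replaced with \"__\")\n",
"es_model_id = \"cross-encoder__ms-marco-minilm-l-6-v2\"\n",
"\n",
"models = es.ml.get_trained_models(model_id=es_model_id)\n",
"print(models.body[\"trained_model_configs\"][0][\"model_id\"])"
],
"metadata": {},
"execution_count": null,
"outputs": []
},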
{
"cell_type": "markdown",
"source": [
"## Create Inference Endpoint\n",
"Run this cell to:\n",
"- Create an inference Endpoint\n",
"- Deploy the reranking model we impoted in the previous section\n",
"We need to create an endpoint queries can use for reranking\n",
"\n",
"Key points about the `model_config`\n",
"- `service` - in this case `elasticsearch` will tell the inference API to use a locally hosted (in Elasticsearch) model\n",
"- `num_allocations` sets the number of allocations to 1\n",
" - Allocations are independent units of work for NLP tasks. Scaling this allows for an increase in concurrent throughput\n",
"- `num_threads` - sets the number of threads per allocation to 1\n",
" - Threads per allocation affect the number of threads used by each allocation during inference. Scaling this generally increased the speed of inference requests (to a point).\n",
"- `model_id` - This is the id of the model as it is named in Elasticsearch\n",
"\n"
],
"metadata": {
"id": "-rrQV6SAgWz8"
}
},
{
"cell_type": "code",
"source": [
"model_config = {\n",
" \"service\": \"elasticsearch\",\n",
" \"service_settings\": {\n",
" \"num_allocations\": 1,\n",
" \"num_threads\": 1,\n",
" \"model_id\": \"cross-encoder__ms-marco-minilm-l-6-v2\",\n",
" },\n",
" \"task_settings\": {\"return_documents\": True},\n",
"}\n",
"\n",
"inference_id = \"semantic-reranking\"\n",
"\n",
"create_endpoint = es.inference.put(\n",
" inference_id=inference_id, task_type=\"rerank\", body=model_config\n",
")\n",
"\n",
"create_endpoint.body"
],
"metadata": {
"id": "Abu084BYgWCE",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "d58cc940-281e-4d56-d6e6-040e4881e78a"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"{'inference_id': 'semantic-reranking',\n",
" 'task_type': 'rerank',\n",
" 'service': 'elasticsearch',\n",
" 'service_settings': {'num_allocations': 1,\n",
" 'num_threads': 1,\n",
" 'model_id': 'cross-encoder__ms-marco-minilm-l-6-v2'},\n",
" 'task_settings': {'return_documents': True}}"
]
},
"metadata": {},
"execution_count": 6
}
]
},
{
"cell_type": "markdown",
"source": [
"### Verify it was created\n",
"\n",
"- Run the two cells in this section to verify:\n",
"- The Inference Endpoint has been completed\n",
"- The model has been deployed\n",
"\n",
"You should see JSON output with information about the semantic endpoint"
],
"metadata": {
"id": "X8rQXMrHhMkS"
}
},
{
"cell_type": "code",
"source": [
"check_endpoint = es.inference.get(\n",
" inference_id=inference_id,\n",
")\n",
"\n",
"check_endpoint.body"
],
"metadata": {
"id": "n3Yk7rgYhP-N",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "d9e68225-5796-411e-964a-6db3be5541aa"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"{'endpoints': [{'inference_id': 'semantic-reranking',\n",
" 'task_type': 'rerank',\n",
" 'service': 'elasticsearch',\n",
" 'service_settings': {'num_allocations': 1,\n",
" 'num_threads': 1,\n",
" 'model_id': 'cross-encoder__ms-marco-minilm-l-6-v2'},\n",
" 'task_settings': {'return_documents': True}}]}"
]
},
"metadata": {},
"execution_count": 7
}
]
},
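{
"cell_type": "markdown",
"source": [
"Next, check that the underlying model has been deployed. This is a minimal sketch using the trained model stats API; the deployment `state` should read `started` once the endpoint has allocated the model."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Check the deployment state of the underlying model.\n",
"# The state should be \"started\" once the inference endpoint has allocated it.\n",
"model_stats = es.ml.get_trained_models_stats(\n",
"    model_id=\"cross-encoder__ms-marco-minilm-l-6-v2\"\n",
")\n",
"\n",
"for stats in model_stats.body[\"trained_model_stats\"]:\n",
"    deployment = stats.get(\"deployment_stats\", {})\n",
"    print(stats[\"model_id\"], \"-\", deployment.get(\"state\", \"not deployed\"))"
],
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Finally, a quick smoke test of the endpoint itself: rerank two toy passages against a query. This is a sketch assuming a client version that exposes `es.inference.inference` with a `query` parameter for rerank tasks; the passage about vector search should receive the higher relevance score."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Smoke test: rerank two toy passages against a query.\n",
"# Assumes a recent client exposing es.inference.inference with a `query` param.\n",
"smoke_test = es.inference.inference(\n",
"    inference_id=inference_id,\n",
"    task_type=\"rerank\",\n",
"    query=\"what is vector search?\",\n",
"    input=[\n",
"        \"Vector search finds similar documents using embeddings.\",\n",
"        \"The weather in Paris is mild in spring.\",\n",
"    ],\n",
")\n",
"\n",
"smoke_test.body"
],
"metadata": {},
"execution_count": null,
"outputs": []
},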
{
"cell_type": "markdown",
"source": [
"## Create the index mapping\n",
"\n",
"We are going to index the `title` and `abstract` from the dataset. "
],
"metadata": {
"id": "4vqimyNWAhWb"
}
},
{
"cell_type": "code",
"source": [
"index_name = \"arxiv-papers\"\n",
"\n",
"index_mapping = {\n",
" \"mappings\": {\n",
" \"properties\": {\"title\": {\"type\": \"text\"}, \"abstract\": {\"type\": \"text\"}}\n",
" }\n",
"}\n",
"\n",
"\n",
"try:\n",
" es.indices.create(index=index_name, body=index_mapping)\n",
" print(f\"Index '{index_name}' created successfully.\")\n",
"except exceptions.RequestError as e:\n",
" if e.error == \"resource_already_exists_exception\":\n",
" print(f\"Index '{index_name}' already exists.\")\n",
" else:\n",
" print(f\"Error creating index '{index_name}': {e}\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "4bc9ba0b-9be3-410a-d1d6-f1da04bbfec7",
"id": "DPADF_7ytTmR"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Index 'arxiv-papers' created successfully.\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"## Ready the dataset\n",
"We are going to use the [CShorten/ML-ArXiv-Papers](https://huggingface.co/datasets/CShorten/ML-ArXiv-Papers) dataset."
],
"metadata": {
"id": "FqQmaT5P-Nhx"
}
},
{
"cell_type": "markdown",
"source": [
"## Download Dataset\n",
"**Note** You may get a warning *The secret `HF_TOKEN` does not exist in your Colab secrets*.\n",
"\n",
"You can safely ignore this."
],
"metadata": {
"id": "aN0dbYO7oB47"
}
},
{
"cell_type": "code",
"source": [
"dataset = load_dataset(\"CShorten/ML-ArXiv-Papers\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "IVnpj5bBoEBL",
"outputId": "bc6371d9-d66f-482c-95f8-8cdb89e4f0ef"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: \n",
"The secret `HF_TOKEN` does not exist in your Colab secrets.\n",
"To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n",
"You will be able to reuse this secret in all of your notebooks.\n",
"Please note that authentication is recommended but still optional to access public models or datasets.\n",
" warnings.warn(\n"
]
}
]
},
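{
"cell_type": "markdown",
"source": [
"### Inspect the dataset\n",
"\n",
"An optional sanity check before indexing: print the size of the train split, plus the `title` and (truncated) `abstract` of the first record - the two fields we will index."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Optional: inspect the dataset before indexing\n",
"print(dataset[\"train\"].num_rows, \"rows\")\n",
"print(dataset[\"train\"][0][\"title\"])\n",
"print(dataset[\"train\"][0][\"abstract\"][:300])"
],
"metadata": {},
"execution_count": null,
"outputs": []
},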
{
"cell_type": "markdown",
"source": [
"### Index into Elasticsearch\n",
"\n",
"We will loop through the dataset and send batches of rows to Elasticsearch\n",
"- This may take a couple minutes depending on your cluster sizing."
],
"metadata": {
"id": "GQxDITCpAKWb"
}
},
{
"cell_type": "code",
"source": [
"def bulk_insert_elasticsearch(dataset, index_name, chunk_size=1000):\n",
" actions = []\n",
" for record in dataset:\n",
" action = {\n",
" \"_index\": index_name,\n",
" \"_source\": {\"title\": record[\"title\"], \"abstract\": record[\"abstract\"]},\n",
" }\n",
" actions.append(action)\n",
"\n",
" if len(actions) == chunk_size:\n",
" helpers.bulk(es, actions)\n",
" actions = []\n",
"\n",
" if actions:\n",
" helpers.bulk(es, actions)\n",
"\n",
"\n",
"bulk_insert_elasticsearch(dataset[\"train\"], index_name)"
],
"metadata": {
"id": "tDZ0qEbW-ozW"
},
"execution_count": null,
"outputs": []
},
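{
"cell_type": "markdown",
"source": [
"### Verify the document count\n",
"\n",
"A minimal check once the bulk load finishes: refresh the index so the new documents are searchable, then count them. The count should match the number of rows in the train split."
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Make the just-indexed documents searchable, then count them\n",
"es.indices.refresh(index=index_name)\n",
"\n",
"doc_count = es.count(index=index_name)[\"count\"]\n",
"print(f\"{doc_count} documents indexed\")"
],
"metadata": {},
"execution_count": null,
"outputs": []
},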
{
"cell_type": "markdown",
"source": [
"# Query with Reranking\n",
"\n",
"This contains a `text_similarity_reranker` retriever, which:\n",
"\n",
"- Uses a standard retriever to:\n",
" - Perform a lexical query against the `title` field\n",
"- Performs a reranking:\n",
" - Takes as input the top 100 results from the previous search\n",
" - `rank_window_size`: 100\n",
" - Takes as input the query\n",
" - `inference_text`: query\n",
"- Uses our previously created reranking API and model\n",
"\n"
],
"metadata": {
"id": "2bwvzLfRjJ2n"
}
},
{
"cell_type": "code",
"source": [
"query = \"sparse vector embedding\"\n",
"\n",
"# Query scored from score\n",
"response_scored = es.search(\n",
" index=\"arxiv-papers\",\n",
" body={\n",
" \"size\": 10,\n",
" \"retriever\": {\"standard\": {\"query\": {\"match\": {\"title\": query}}}},\n",
" \"fields\": [\"title\", \"abstract\"],\n",
" \"_source\": False,\n",
" },\n",
")\n",
"\n",
"# Query with Semantic Reranker\n",
"response_reranked = es.search(\n",
" index=\"arxiv-papers\",\n",
" body={\n",
" \"size\": 10,\n",
" \"retriever\": {\n",
" \"text_similarity_reranker\": {\n",
" \"retriever\": {\"standard\": {\"query\": {\"match\": {\"title\": query}}}},\n",
" \"field\": \"abstract\",\n",
" \"inference_id\": \"semantic-reranking\",\n",
" \"inference_text\": query,\n",
" \"rank_window_size\": 100,\n",
" }\n",
" },\n",
" \"fields\": [\"title\", \"abstract\"],\n",
" \"_source\": False,\n",
" },\n",
")"
],
"metadata": {
"id": "HWXQBS35jQ3n"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Print the table comparing the scored and reranked results"
],
"metadata": {
"id": "Hnam80Irbj6a"
}
},
{
"cell_type": "code",
"source": [
"titles_scored = [\n",
" paper[\"fields\"][\"title\"][0] for paper in response_scored.body[\"hits\"][\"hits\"]\n",
"]\n",
"titles_reranked = [\n",
" paper[\"fields\"][\"title\"][0] for paper in response_reranked.body[\"hits\"][\"hits\"]\n",
"]\n",
"\n",
"# Creating a DataFrame\n",
"df = pd.DataFrame(\n",
" {\"Scored Results\": titles_scored, \"Reranked Results\": titles_reranked}\n",
")\n",
"\n",
"df_styled = df.style.set_properties(**{\"text-align\": \"left\"}).set_caption(\n",
" f\"Comparison of Scored and Semantic Reranked Results - Query: '{query}'\"\n",
")\n",
"\n",
"# Display the table\n",
"df_styled"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 415
},
"id": "yTTNYCYcBtll",
"outputId": "0f0af538-13fd-4e62-e3d2-1dd185689904"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"<pandas.io.formats.style.Styler at 0x7dad78316950>"
],
"text/html": [
"<style type=\"text/css\">\n",
"#T_b97e6_row0_col0, #T_b97e6_row0_col1, #T_b97e6_row1_col0, #T_b97e6_row1_col1, #T_b97e6_row2_col0, #T_b97e6_row2_col1, #T_b97e6_row3_col0, #T_b97e6_row3_col1, #T_b97e6_row4_col0, #T_b97e6_row4_col1, #T_b97e6_row5_col0, #T_b97e6_row5_col1, #T_b97e6_row6_col0, #T_b97e6_row6_col1, #T_b97e6_row7_col0, #T_b97e6_row7_col1, #T_b97e6_row8_col0, #T_b97e6_row8_col1, #T_b97e6_row9_col0, #T_b97e6_row9_col1 {\n",
" text-align: left;\n",
"}\n",
"</style>\n",
"<table id=\"T_b97e6\" class=\"dataframe\">\n",
" <caption>Comparison of Scored and Semantic Reranked Results - Query: 'sparse vector embedding'</caption>\n",
" <thead>\n",
" <tr>\n",
" <th class=\"blank level0\" > </th>\n",
" <th id=\"T_b97e6_level0_col0\" class=\"col_heading level0 col0\" >Scored Results</th>\n",
" <th id=\"T_b97e6_level0_col1\" class=\"col_heading level0 col1\" >Reranked Results</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th id=\"T_b97e6_level0_row0\" class=\"row_heading level0 row0\" >0</th>\n",
" <td id=\"T_b97e6_row0_col0\" class=\"data row0 col0\" >Compact Speaker Embedding: lrx-vector</td>\n",
" <td id=\"T_b97e6_row0_col1\" class=\"data row0 col1\" >Scaling Up Sparse Support Vector Machines by Simultaneous Feature and\n",
" Sample Reduction</td>\n",
" </tr>\n",
" <tr>\n",
" <th id=\"T_b97e6_level0_row1\" class=\"row_heading level0 row1\" >1</th>\n",
" <td id=\"T_b97e6_row1_col0\" class=\"data row1 col0\" >Quantum Sparse Support Vector Machines</td>\n",
" <td id=\"T_b97e6_row1_col1\" class=\"data row1 col1\" >Spaceland Embedding of Sparse Stochastic Graphs</td>\n",
" </tr>\n",
" <tr>\n",
" <th id=\"T_b97e6_level0_row2\" class=\"row_heading level0 row2\" >2</th>\n",
" <td id=\"T_b97e6_row2_col0\" class=\"data row2 col0\" >Sparse Support Vector Infinite Push</td>\n",
" <td id=\"T_b97e6_row2_col1\" class=\"data row2 col1\" >Elliptical Ordinal Embedding</td>\n",
" </tr>\n",
" <tr>\n",
" <th id=\"T_b97e6_level0_row3\" class=\"row_heading level0 row3\" >3</th>\n",
" <td id=\"T_b97e6_row3_col0\" class=\"data row3 col0\" >The Sparse Vector Technique, Revisited</td>\n",
" <td id=\"T_b97e6_row3_col1\" class=\"data row3 col1\" >Minimum-Distortion Embedding</td>\n",
" </tr>\n",
" <tr>\n",
" <th id=\"T_b97e6_level0_row4\" class=\"row_heading level0 row4\" >4</th>\n",
" <td id=\"T_b97e6_row4_col0\" class=\"data row4 col0\" >L-Vector: Neural Label Embedding for Domain Adaptation</td>\n",
" <td id=\"T_b97e6_row4_col1\" class=\"data row4 col1\" >Free Gap Information from the Differentially Private Sparse Vector and\n",
" Noisy Max Mechanisms</td>\n",
" </tr>\n",
" <tr>\n",
" <th id=\"T_b97e6_level0_row5\" class=\"row_heading level0 row5\" >5</th>\n",
" <td id=\"T_b97e6_row5_col0\" class=\"data row5 col0\" >Spaceland Embedding of Sparse Stochastic Graphs</td>\n",
" <td id=\"T_b97e6_row5_col1\" class=\"data row5 col1\" >Interpolated Discretized Embedding of Single Vectors and Vector Pairs\n",
" for Classification, Metric Learning and Distance Approximation</td>\n",
" </tr>\n",
" <tr>\n",
" <th id=\"T_b97e6_level0_row6\" class=\"row_heading level0 row6\" >6</th>\n",
" <td id=\"T_b97e6_row6_col0\" class=\"data row6 col0\" >Sparse Signal Recovery in the Presence of Intra-Vector and Inter-Vector\n",
" Correlation</td>\n",
" <td id=\"T_b97e6_row6_col1\" class=\"data row6 col1\" >Attention Word Embedding</td>\n",
" </tr>\n",
" <tr>\n",
" <th id=\"T_b97e6_level0_row7\" class=\"row_heading level0 row7\" >7</th>\n",
" <td id=\"T_b97e6_row7_col0\" class=\"data row7 col0\" >Stable Sparse Subspace Embedding for Dimensionality Reduction</td>\n",
" <td id=\"T_b97e6_row7_col1\" class=\"data row7 col1\" >Binary Speaker Embedding</td>\n",
" </tr>\n",
" <tr>\n",
" <th id=\"T_b97e6_level0_row8\" class=\"row_heading level0 row8\" >8</th>\n",
" <td id=\"T_b97e6_row8_col0\" class=\"data row8 col0\" >Auto-weighted Mutli-view Sparse Reconstructive Embedding</td>\n",
" <td id=\"T_b97e6_row8_col1\" class=\"data row8 col1\" >NetSMF: Large-Scale Network Embedding as Sparse Matrix Factorization</td>\n",
" </tr>\n",
" <tr>\n",
" <th id=\"T_b97e6_level0_row9\" class=\"row_heading level0 row9\" >9</th>\n",
" <td id=\"T_b97e6_row9_col0\" class=\"data row9 col0\" >Embedding Words in Non-Vector Space with Unsupervised Graph Learning</td>\n",
" <td id=\"T_b97e6_row9_col1\" class=\"data row9 col1\" >Estimating Vector Fields on Manifolds and the Embedding of Directed\n",
" Graphs</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n"
]
},
"metadata": {},
"execution_count": 17
}
]
},
{
"cell_type": "markdown",
"source": [
"## Print out Title and Abstract\n",
"This will print the title and the abstract for the top 10 results after semantic reranking."
],
"metadata": {
"id": "A0HyNZoWyeun"
}
},
{
"cell_type": "code",
"source": [
"for paper in response_reranked.body[\"hits\"][\"hits\"]:\n",
" print(\n",
" f\"Title {paper['fields']['title'][0]} \\n Abstract: {paper['fields']['abstract'][0]}\"\n",
" )"
],
"metadata": {
"id": "4ZEx-46rn3in",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "e83318ad-ca42-4aa7-98d4-37c4428eb70a"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Title Compact Speaker Embedding: lrx-vector \n",
" Abstract: Deep neural networks (DNN) have recently been widely used in speaker\n",
"recognition systems, achieving state-of-the-art performance on various\n",
"benchmarks. The x-vector architecture is especially popular in this research\n",
"community, due to its excellent performance and manageable computational\n",
"complexity. In this paper, we present the lrx-vector system, which is the\n",
"low-rank factorized version of the x-vector embedding network. The primary\n",
"objective of this topology is to further reduce the memory requirement of the\n",
"speaker recognition system. We discuss the deployment of knowledge distillation\n",
"for training the lrx-vector system and compare against low-rank factorization\n",
"with SVD. On the VOiCES 2019 far-field corpus we were able to reduce the\n",
"weights by 28% compared to the full-rank x-vector system while keeping the\n",
"recognition rate constant (1.83% EER).\n",
"\n",
"Title Quantum Sparse Support Vector Machines \n",
" Abstract: We analyze the computational complexity of Quantum Sparse Support Vector\n",
"Machine, a linear classifier that minimizes the hinge loss and the $L_1$ norm\n",
"of the feature weights vector and relies on a quantum linear programming solver\n",
"instead of a classical solver. Sparse SVM leads to sparse models that use only\n",
"a small fraction of the input features in making decisions, and is especially\n",
"useful when the total number of features, $p$, approaches or exceeds the number\n",
"of training samples, $m$. We prove a $\\Omega(m)$ worst-case lower bound for\n",
"computational complexity of any quantum training algorithm relying on black-box\n",
"access to training samples; quantum sparse SVM has at least linear worst-case\n",
"complexity. However, we prove that there are realistic scenarios in which a\n",
"sparse linear classifier is expected to have high accuracy, and can be trained\n",
"in sublinear time in terms of both the number of training samples and the\n",
"number of features.\n",
"\n",
"Title Sparse Support Vector Infinite Push \n",
" Abstract: In this paper, we address the problem of embedded feature selection for\n",
"ranking on top of the list problems. We pose this problem as a regularized\n",
"empirical risk minimization with $p$-norm push loss function ($p=\\infty$) and\n",
"sparsity inducing regularizers. We leverage the issues related to this\n",
"challenging optimization problem by considering an alternating direction method\n",
"of multipliers algorithm which is built upon proximal operators of the loss\n",
"function and the regularizer. Our main technical contribution is thus to\n",
"provide a numerical scheme for computing the infinite push loss function\n",
"proximal operator. Experimental results on toy, DNA microarray and BCI problems\n",
"show how our novel algorithm compares favorably to competitors for ranking on\n",
"top while using fewer variables in the scoring function.\n",
"\n",
"Title The Sparse Vector Technique, Revisited \n",
" Abstract: We revisit one of the most basic and widely applicable techniques in the\n",
"literature of differential privacy - the sparse vector technique [Dwork et al.,\n",
"STOC 2009]. This simple algorithm privately tests whether the value of a given\n",
"query on a database is close to what we expect it to be. It allows to ask an\n",
"unbounded number of queries as long as the answer is close to what we expect,\n",
"and halts following the first query for which this is not the case.\n",
" We suggest an alternative, equally simple, algorithm that can continue\n",
"testing queries as long as any single individual does not contribute to the\n",
"answer of too many queries whose answer deviates substantially form what we\n",
"expect. Our analysis is subtle and some of its ingredients may be more widely\n",
"applicable. In some cases our new algorithm allows to privately extract much\n",
"more information from the database than the original.\n",
" We demonstrate this by applying our algorithm to the shifting heavy-hitters\n",
"problem: On every time step, each of $n$ users gets a new input, and the task\n",
"is to privately identify all the current heavy-hitters. That is, on time step\n",
"$i$, the goal is to identify all data elements $x$ such that many of the users\n",
"have $x$ as their current input. We present an algorithm for this problem with\n",
"improved error guarantees over what can be obtained using existing techniques.\n",
"Specifically, the error of our algorithm depends on the maximal number of times\n",
"that a single user holds a heavy-hitter as input, rather than the total number\n",
"of times in which a heavy-hitter exists.\n",
"\n",
"Title L-Vector: Neural Label Embedding for Domain Adaptation \n",
" Abstract: We propose a novel neural label embedding (NLE) scheme for the domain\n",
"adaptation of a deep neural network (DNN) acoustic model with unpaired data\n",
"samples from source and target domains. With NLE method, we distill the\n",
"knowledge from a powerful source-domain DNN into a dictionary of label\n",
"embeddings, or l-vectors, one for each senone class. Each l-vector is a\n",
"representation of the senone-specific output distributions of the source-domain\n",
"DNN and is learned to minimize the average L2, Kullback-Leibler (KL) or\n",
"symmetric KL distance to the output vectors with the same label through simple\n",
"averaging or standard back-propagation. During adaptation, the l-vectors serve\n",
"as the soft targets to train the target-domain model with cross-entropy loss.\n",
"Without parallel data constraint as in the teacher-student learning, NLE is\n",
"specially suited for the situation where the paired target-domain data cannot\n",
"be simulated from the source-domain data. We adapt a 6400 hours\n",
"multi-conditional US English acoustic model to each of the 9 accented English\n",
"(80 to 830 hours) and kids' speech (80 hours). NLE achieves up to 14.1%\n",
"relative word error rate reduction over direct re-training with one-hot labels.\n",
"\n",
"Title Spaceland Embedding of Sparse Stochastic Graphs \n",
" Abstract: We introduce a nonlinear method for directly embedding large, sparse,\n",
"stochastic graphs into low-dimensional spaces, without requiring vertex\n",
"features to reside in, or be transformed into, a metric space. Graph data and\n",
"models are prevalent in real-world applications. Direct graph embedding is\n",
"fundamental to many graph analysis tasks, in addition to graph visualization.\n",
"We name the novel approach SG-t-SNE, as it is inspired by and builds upon the\n",
"core principle of t-SNE, a widely used method for nonlinear dimensionality\n",
"reduction and data visualization. We also introduce t-SNE-$\\Pi$, a\n",
"high-performance software for 2D, 3D embedding of large sparse graphs on\n",
"personal computers with superior efficiency. It empowers SG-t-SNE with modern\n",
"computing techniques for exploiting in tandem both matrix structures and memory\n",
"architectures. We present elucidating embedding results on one synthetic graph\n",
"and four real-world networks.\n",
"\n",
"Title Sparse Signal Recovery in the Presence of Intra-Vector and Inter-Vector\n",
" Correlation \n",
" Abstract: This work discusses the problem of sparse signal recovery when there is\n",
"correlation among the values of non-zero entries. We examine intra-vector\n",
"correlation in the context of the block sparse model and inter-vector\n",
"correlation in the context of the multiple measurement vector model, as well as\n",
"their combination. Algorithms based on the sparse Bayesian learning are\n",
"presented and the benefits of incorporating correlation at the algorithm level\n",
"are discussed. The impact of correlation on the limits of support recovery is\n",
"also discussed highlighting the different impact intra-vector and inter-vector\n",
"correlations have on such limits.\n",
"\n",
"Title Stable Sparse Subspace Embedding for Dimensionality Reduction \n",
" Abstract: Sparse random projection (RP) is a popular tool for dimensionality reduction\n",
"that shows promising performance with low computational complexity. However, in\n",
"the existing sparse RP matrices, the positions of non-zero entries are usually\n",
"randomly selected. Although they adopt uniform sampling with replacement, due\n",
"to large sampling variance, the number of non-zeros is uneven among rows of the\n",
"projection matrix which is generated in one trial, and more data information\n",
"may be lost after dimension reduction. To break this bottleneck, based on\n",
"random sampling without replacement in statistics, this paper builds a stable\n",
"sparse subspace embedded matrix (S-SSE), in which non-zeros are uniformly\n",
"distributed. It is proved that the S-SSE is stabler than the existing matrix,\n",
"and it can maintain Euclidean distance between points well after dimension\n",
"reduction. Our empirical studies corroborate our theoretical findings and\n",
"demonstrate that our approach can indeed achieve satisfactory performance.\n",
"\n",
"Title Auto-weighted Mutli-view Sparse Reconstructive Embedding \n",
" Abstract: With the development of multimedia era, multi-view data is generated in\n",
"various fields. Contrast with those single-view data, multi-view data brings\n",
"more useful information and should be carefully excavated. Therefore, it is\n",
"essential to fully exploit the complementary information embedded in multiple\n",
"views to enhance the performances of many tasks. Especially for those\n",
"high-dimensional data, how to develop a multi-view dimension reduction\n",
"algorithm to obtain the low-dimensional representations is of vital importance\n",
"but chanllenging. In this paper, we propose a novel multi-view dimensional\n",
"reduction algorithm named Auto-weighted Mutli-view Sparse Reconstructive\n",
"Embedding (AMSRE) to deal with this problem. AMSRE fully exploits the sparse\n",
"reconstructive correlations between features from multiple views. Furthermore,\n",
"it is equipped with an auto-weighted technique to treat multiple views\n",
"discriminatively according to their contributions. Various experiments have\n",
"verified the excellent performances of the proposed AMSRE.\n",
"\n",
"Title Embedding Words in Non-Vector Space with Unsupervised Graph Learning \n",
" Abstract: It has become a de-facto standard to represent words as elements of a vector\n",
"space (word2vec, GloVe). While this approach is convenient, it is unnatural for\n",
"language: words form a graph with a latent hierarchical structure, and this\n",
"structure has to be revealed and encoded by word embeddings. We introduce\n",
"GraphGlove: unsupervised graph word representations which are learned\n",
"end-to-end. In our setting, each word is a node in a weighted graph and the\n",
"distance between words is the shortest path distance between the corresponding\n",
"nodes. We adopt a recent method learning a representation of data in the form\n",
"of a differentiable weighted graph and use it to modify the GloVe training\n",
"algorithm. We show that our graph-based representations substantially\n",
"outperform vector-based methods on word similarity and analogy tasks. Our\n",
"analysis reveals that the structure of the learned graphs is hierarchical and\n",
"similar to that of WordNet, the geometry is highly non-trivial and contains\n",
"subgraphs with different local topology.\n",
"\n"
]
}
]
}
]
}