notebooks/suggest_intent_data_prep.ipynb
{
"cells": [
{
"cell_type": "markdown",
"id": "771600b9-1422-4717-b119-8c45cafb6725",
"metadata": {},
"source": [
"Purpose of the notebook:\n",
"\n",
"Evaluate the current NER approach, which uses existing models supported by the Transformers.js library.\n",
"We see where it fails.\n",
"With the hypothesis that a classifier-based approach might be better, we prepare and label the data\n",
"(https://www.microsoft.com/en-us/download/details.aspx?id=58227)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6237221a-0d4c-43ec-b18d-95f99807d653",
"metadata": {},
"outputs": [],
"source": [
"## imports\n",
"\n",
"from transformers import pipeline\n",
"import pandas as pd\n",
"from tqdm import tqdm\n",
"from pprint import pprint\n",
"import random"
]
},
{
"cell_type": "markdown",
"id": "bd5974fe-47f3-41fd-b496-a25af8668a15",
"metadata": {},
"source": [
"#### Some examples of where the NER-based approach is failing"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "145a1821-f31a-48f9-9222-d2a7ab1270bf",
"metadata": {},
"outputs": [],
"source": [
"classifier = pipeline(\"zero-shot-classification\", model='typeform/mobilebert-uncased-mnli', device='cpu')\n",
"\n",
"\n",
"texts = [\n",
" \"what is democracy\",\n",
" \"restaurants in oakville\",\n",
" \"buy iphone\",\n",
" \"bank login\",\n",
" \"temperature in San Jose\",\n",
" \"wood floor buckling repair\",\n",
" \"wood floor cost estimator\",\n",
" \"panera bread menu price\",\n",
" \"how much is hbo now subscription\",\n",
" \"how much is a golden retriever puppy\",\n",
" \"how much is nebraska's sales tax\",\n",
" \"how much is donald trump jr worth\",\n",
" \"how much is a liposuction\",\n",
" \"does mushroom cause food allergy\",\n",
"]\n",
"\n",
"\n",
"\n",
"intent_labels_lkp = {\n",
" \"yelp_intent\": \"search for local service, food, home repair, maintenance, cost estimation excluding weather intents\",\n",
" # \"yelp_intent\": \"to discover, connect and transact with local businesses\",\n",
" \"information_intent\": \"search for general knowledge what some concept is and not related to weather, services, or products\",\n",
" \"weather_intent\": \"check weather conditions like forecast, temperature, radar, storms, or pollen\",\n",
" \"purchase_intent\": \"make an online purchase\",\n",
" \"navigation_intent\": \"navigate to a specific website\"\n",
"}\n",
"\n",
"intent_desc_lkp = {intent_desc: intent_key for intent_key, intent_desc in intent_labels_lkp.items()}\n",
"\n",
"# Refined intent labels\n",
"intent_labels = [\n",
" intent_labels_lkp[\"yelp_intent\"],\n",
" intent_labels_lkp[\"information_intent\"],\n",
" intent_labels_lkp[\"weather_intent\"],\n",
" intent_labels_lkp[\"purchase_intent\"],\n",
" intent_labels_lkp[\"navigation_intent\"],\n",
"]\n",
"\n",
"result = classifier(texts, candidate_labels=intent_labels)\n",
"# pprint(result)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6dd4e9fb-7ea9-48fc-8c81-c17b7efbd094",
"metadata": {},
"outputs": [],
"source": [
"# result_df = pd.DataFrame(result)\n",
"def prepare_df_from_result(result):\n",
"    updated_result = []\n",
"    for res in result:\n",
"        labels_and_scores = {'sequence': res['sequence']}\n",
"        for label, score in zip(res['labels'], res['scores']):\n",
"            labels_and_scores[intent_desc_lkp[label]] = score\n",
"        updated_result.append(labels_and_scores)\n",
"\n",
"    return pd.DataFrame(updated_result)\n",
"\n",
"updated_result_df = prepare_df_from_result(result)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "70d1a138-6fe9-4805-955c-0be03395e726",
"metadata": {},
"outputs": [],
"source": [
"updated_result_df"
]
},
{
"cell_type": "markdown",
"id": "969dc690-2cf1-421c-9ee4-fd6f6cb06367",
"metadata": {},
"source": [
"Some of the above results are a bit unclear. `does mushroom cause food allergy` is more of an information intent than a yelp intent.\n",
"There were many other such cases showing that NER alone may not be suitable for this problem; we need to solve intent classification for this use case."
]
},
{
"cell_type": "markdown",
"id": "6ac113fa-b54b-4722-a9b8-a20ef10580a5",
"metadata": {},
"source": [
"#### MS MARCO data\n",
"\n",
"This dataset can be downloaded from https://www.microsoft.com/en-us/download/details.aspx?id=58227"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "99ebeafb-40f7-429c-8714-653f0e930c0e",
"metadata": {},
"outputs": [],
"source": [
"marco_text_queries = set()\n",
"with open(\"../data/full_marco_sessions_ann_split.train.tsv\", \"r\") as f:\n",
" marco_texts = f.read().split('\\n')\n",
" for text in marco_texts:\n",
" for query in text.split(\"\\t\"):\n",
" if \"marco-gen-train\" not in query and len(query) >= 3:\n",
" marco_text_queries.add(query.lower())\n",
"\n",
"marco_text_queries_list = list(marco_text_queries)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "832b930e-92c3-4d2b-a4c2-365871082ac1",
"metadata": {},
"outputs": [],
"source": [
"len(marco_text_queries_list)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4db7dc89-d8c2-4c52-b95a-e0f44298b955",
"metadata": {},
"outputs": [],
"source": [
"## some example queries\n",
"\n",
"marco_text_queries_list[:50]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "67a278b9-c847-4297-8fe5-730ca56d1b1a",
"metadata": {},
"outputs": [],
"source": [
"marco_df = pd.DataFrame({\"sequence\": marco_text_queries_list})"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b6d7d8e6-5777-43d2-9594-aa40d51a9dd6",
"metadata": {},
"outputs": [],
"source": [
"def labeling_stats(df):\n",
"    if 'target' not in df.columns:\n",
" df['target'] = None\n",
" print(f\"Size of the dataset = {len(df)}\")\n",
" print(f\"Number of examples to be labeled = {df['target'].isna().sum()}\")\n",
" print(f\"Number of examples labeled = {(~df['target'].isna()).sum()}\")\n",
" print(\"Labels distributed as \\n\", df['target'].value_counts())\n",
"\n",
"\n",
"## Prints labeling stats\n",
"labeling_stats(marco_df)"
]
},
{
"cell_type": "markdown",
"id": "3e8a6921-c611-4216-8907-806c8f36211f",
"metadata": {},
"source": [
"#### Find potential ngram mappings for targets"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bb75e0fc-49e7-47e6-9dae-48179bf668a1",
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"from itertools import islice\n",
"\n",
"# Generalize function to extract n-grams\n",
"def extract_ngrams(query, n):\n",
" words = query.split()\n",
" ngrams = zip(*[islice(words, i, None) for i in range(n)]) # Generate n-grams\n",
" return [' '.join(ngram) for ngram in ngrams] # Join n-grams into a single string\n",
"\n",
"# Flatten the n-grams into a list and count them\n",
"def count_ngrams(queries_list, n):\n",
" all_ngrams = [ngram for query in queries_list for ngram in extract_ngrams(query, n)]\n",
" ngram_counter = Counter(all_ngrams)\n",
" return ngram_counter\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "79654bef-5fbd-4ece-a5dd-3aff12830b46",
"metadata": {},
"outputs": [],
"source": [
"def search_queries_by_words(search_text, to_be_labelled_sequence_list):\n",
" for query in to_be_labelled_sequence_list:\n",
" if search_text in query:\n",
" yield query"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6052b55a-3051-4ea8-b625-bf1f4a2a0f1b",
"metadata": {},
"outputs": [],
"source": [
"cnt = 0\n",
"for query in search_queries_by_words(\"24 hour\", marco_text_queries_list):\n",
"    if cnt >= 100: # Stop after 100 results\n",
" break\n",
" print(cnt + 1, query)\n",
" cnt += 1"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5dfda4f8-0327-4005-84b8-54b85d8a2f37",
"metadata": {},
"outputs": [],
"source": [
"\n",
"target_mapping = {\n",
" 'how do': 'information_intent',\n",
" 'weather in': 'weather_intent',\n",
" 'the weather': 'weather_intent',\n",
" 'hurricane': 'information_intent',\n",
" # 'tornado': 'weather_intent',\n",
" 'current temperature': 'weather_intent',\n",
" 'current weather': 'weather_intent',\n",
" 'weather forecast in': 'weather_intent',\n",
" 'temperature in': 'weather_intent',\n",
" # 'how much': 'purchase_intent', \n",
" # 'cost to': 'purchase_intent',\n",
" # 'where is': 'navigation_intent', \n",
" 'sign in ': 'navigation_intent',\n",
" 'signin ': 'navigation_intent',\n",
" 'login ': 'navigation_intent',\n",
" 'phone number': 'yelp_intent', \n",
" 'customer service': 'yelp_intent',\n",
" 'what are': 'information_intent',\n",
" 'what county is': 'information_intent',\n",
" 'what is a ': 'information_intent',\n",
" # 'what is': 'information_intent',\n",
" 'what does': 'information_intent',\n",
" 'what do': 'information_intent',\n",
" 'definition of': 'information_intent',\n",
" 'meaning': 'information_intent',\n",
" 'symptoms': 'yelp_intent',\n",
" 'zip code': 'information_intent',\n",
" 'zipcode': 'information_intent',\n",
" 'postal code': 'information_intent',\n",
" 'postalcode': 'information_intent',\n",
" 'area code': 'information_intent',\n",
" 'areacode': 'information_intent',\n",
" 'definition': 'information_intent',\n",
" 'define': 'information_intent',\n",
" 'what is the difference between': 'information_intent',\n",
" 'what is the purpose of': 'information_intent',\n",
" 'what is the function of': 'information_intent',\n",
" 'how long does it take': 'information_intent',\n",
" 'what is the name of': 'information_intent',\n",
" 'what is the population of': 'information_intent',\n",
" 'what is an example of': 'information_intent',\n",
" 'which of the following': 'information_intent',\n",
" 'what is the purpose': 'information_intent',\n",
" # 'what time zone is': 'information_intent',\n",
" 'what is the average': 'information_intent',\n",
" 'is in what county': 'information_intent',\n",
" 'calories in': 'information_intent',\n",
" # 'how many calories in': 'information_intent',\n",
" \"causes of\": 'information_intent',\n",
" 'visit': 'travel_intent',\n",
" 'travel to': 'travel_intent',\n",
" 'cruise': 'travel_intent',\n",
" 'tours': 'travel_intent',\n",
" 'mortgage rate': 'yelp_intent',\n",
" 'interest rate': 'yelp_intent',\n",
" 'price of': 'purchase_intent',\n",
" 'cost of living': 'information_intent',\n",
" 'does it cost': 'yelp_intent', \n",
" # 'what is the current': ?\n",
" 'what is the largest': 'information_intent',\n",
" 'what is the currency': 'information_intent',\n",
" 'how old do you': 'information_intent',\n",
" 'how long does a': 'information_intent',\n",
" # 'what time is it': 'information_intent',\n",
" 'what time': 'information_intent',\n",
" 'you have to be': 'information_intent',\n",
" 'do you need to': 'information_intent',\n",
" 'what is considered a': 'information_intent',\n",
" 'dialing code': 'information_intent',\n",
" 'side effects': 'information_intent',\n",
" 'stock market': 'information_intent',\n",
" 'how many calories': 'information_intent',\n",
" 'average salary for': 'information_intent',\n",
" 'how many grams': 'information_intent',\n",
" 'what foods are': 'information_intent',\n",
" 'how many ounces': 'information_intent',\n",
" 'how many carbs': 'information_intent',\n",
" 'what year was': 'information_intent',\n",
" 'how old is': 'information_intent',\n",
" 'how much is': 'information_intent',\n",
" 'what type of': 'information_intent',\n",
" 'how do i': 'information_intent',\n",
" 'what kind of': 'information_intent',\n",
" 'who is the': 'information_intent',\n",
" 'where is the': 'information_intent',\n",
" # 'different types of': 'information_intent',\n",
" 'types': 'information_intent',\n",
" 'what is': 'information_intent',\n",
" 'how do you': 'information_intent',\n",
" 'what was the': 'information_intent',\n",
" 'in the world': 'information_intent',\n",
" 'how long is': 'information_intent',\n",
" 'when was': 'information_intent',\n",
" 'when did': 'information_intent',\n",
" 'how far is': 'information_intent',\n",
" 'how tall is': 'information_intent',\n",
" 'what to do': 'information_intent',\n",
" 'how long': 'information_intent',\n",
" 'types of': 'information_intent',\n",
" 'who is': 'information_intent',\n",
" 'where is': 'information_intent',\n",
" 'what causes': 'information_intent',\n",
" 'stock price': 'information_intent',\n",
" 'difference between': 'information_intent',\n",
" 'social security': 'information_intent',\n",
" 'who was': 'information_intent',\n",
" 'net worth': 'information_intent',\n",
" 'cast of': 'information_intent',\n",
" 'how many': 'information_intent',\n",
" 'how does': 'information_intent',\n",
" 'how is': 'information_intent',\n",
" 'what did': 'information_intent',\n",
" 'good for': 'information_intent',\n",
" 'population of': 'information_intent',\n",
" 'can you': 'information_intent',\n",
" 'what can': 'information_intent',\n",
" 'how big': 'information_intent',\n",
" 'what size': 'information_intent',\n",
" 'average salary of': 'information_intent',\n",
" 'what year': 'information_intent',\n",
" 'part of': 'information_intent',\n",
" 'another word': 'information_intent',\n",
" 'who invented': 'information_intent',\n",
" 'what can you': 'information_intent',\n",
" 'how much money': 'information_intent',\n",
" 'what state': 'information_intent',\n",
" 'what county': 'information_intent',\n",
" 'in the us': 'information_intent',\n",
" 'how old': 'information_intent',\n",
" 'icd code': 'information_intent',\n",
" 'what city': 'information_intent',\n",
" 'can i': 'information_intent',\n",
" 'when is': 'information_intent',\n",
" 'how did': 'information_intent',\n",
" 'what to': 'information_intent',\n",
" 'the same': 'information_intent',\n",
" \"cleaning \": 'yelp_intent',\n",
" 'restaurant': 'yelp_intent',\n",
" 'recommendation': 'yelp_intent',\n",
" 'repair': 'yelp_intent',\n",
" 'parking': 'yelp_intent',\n",
" 'oil change': 'yelp_intent',\n",
" ' rental': 'yelp_intent',\n",
" 'auto ': 'yelp_intent',\n",
" 'dry clean': 'yelp_intent',\n",
" 'club': 'yelp_intent',\n",
" 'hotel': 'yelp_intent',\n",
" 'stores': 'yelp_intent',\n",
" 'shopping': 'yelp_intent',\n",
" ' shop ': 'yelp_intent',\n",
" ' shops ': 'yelp_intent',\n",
" ' mall ': 'yelp_intent',\n",
" 'furniture': 'yelp_intent',\n",
" 'crafts': 'yelp_intent',\n",
" 'clothing': 'yelp_intent',\n",
" 'benefits of': 'yelp_intent',\n",
" 'average cost': 'yelp_intent',\n",
" 'cost to install': 'yelp_intent',\n",
" 'contact number': 'yelp_intent',\n",
" 'what airport': 'travel_intent',\n",
" # 'flight': 'travel_intent',\n",
" 'cost for': 'yelp_intent',\n",
" 'do you': 'information_intent',\n",
" 'when does': 'information_intent',\n",
" 'why is': 'information_intent',\n",
" \"what's the\": 'information_intent',\n",
" 'what was': 'information_intent',\n",
" 'what language': 'information_intent',\n",
" 'should i': 'information_intent',\n",
" 'convert': 'information_intent',\n",
" 'medication': 'yelp_intent',\n",
" 'treatment': 'yelp_intent',\n",
" 'tv show': 'information_intent',\n",
" 'history': 'information_intent',\n",
" 'remedies': 'information_intent',\n",
" 'county is': 'information_intent',\n",
" 'synonym ': 'information_intent',\n",
" 'credit union number': 'yelp_intent',\n",
" 'credit union phone number': 'yelp_intent',\n",
" 'credit union hours': 'yelp_intent',\n",
" 'movie cast': 'information_intent',\n",
" 'average salary': 'information_intent',\n",
" 'example': 'information_intent',\n",
" 'blood pressure': 'information_intent',\n",
" 'credit card': 'navigation_intent',\n",
" 'time zone': 'information_intent',\n",
" 'time in': 'information_intent',\n",
" 'foods that': 'information_intent',\n",
" 'salary for': 'information_intent',\n",
" \"weather\": 'weather_intent',\n",
" \"weather forecast\": 'weather_intent',\n",
" \"windy\": 'weather_intent',\n",
" \"humidity\": 'weather_intent',\n",
" \"monsoon\": 'weather_intent',\n",
" \"flooding\": 'weather_intent',\n",
" \"rain in\": 'weather_intent',\n",
" \"storms\": 'weather_intent',\n",
" \"storm in\": 'weather_intent',\n",
" \"forcast\": 'weather_intent',\n",
" \"wether\": 'weather_intent',\n",
" \"wather\": 'weather_intent',\n",
" \"weahter\": 'weather_intent',\n",
" \"weater\": 'weather_intent',\n",
" \"weaher\": 'weather_intent',\n",
" \" vindy \": 'weather_intent',\n",
" \" sunny \": 'weather_intent',\n",
" \" rain \": 'weather_intent',\n",
" \"cloudy\": 'weather_intent',\n",
" \"air quality\": 'weather_intent',\n",
" \"thunderstorm\": 'weather_intent',\n",
" \"pollen\": 'weather_intent',\n",
" \"snow\": 'weather_intent',\n",
" \"blizzard\": 'weather_intent',\n",
" \"radar\": 'weather_intent',\n",
" \"tiempo\": 'weather_intent',\n",
" \"clima\": 'weather_intent',\n",
" \"doppler radar\": 'weather_intent',\n",
" \"local radar\": 'weather_intent',\n",
" \"local weather\": 'weather_intent',\n",
" # \"map\": 'weather_intent',\n",
" \"us weather radar\": 'weather_intent',\n",
" \"weather radar near me\": 'weather_intent',\n",
" \"radar near me\": 'weather_intent',\n",
" 'salary': 'information_intent',\n",
" 'cost to build': 'yelp_intent',\n",
" 'icd ': 'information_intent',\n",
" 'how often': 'information_intent',\n",
" 'get rid of': 'information_intent',\n",
" 'university of': 'navigation_intent',\n",
" 'windows 10': 'navigation_intent',\n",
" 'causes for': 'information_intent',\n",
" 'calculat': 'information_intent',\n",
" 'which is ': 'information_intent',\n",
" 'where are ': 'information_intent',\n",
" 'kelvin': 'information_intent',\n",
" 'celsius': 'information_intent',\n",
" 'fahrenheit': 'information_intent',\n",
" 'when ': 'information_intent',\n",
" 'benefit of': 'yelp_intent',\n",
" 'most common': 'information_intent',\n",
" 'which ': 'information_intent',\n",
" 'refers ': 'information_intent',\n",
" 'where does ': 'information_intent',\n",
" 'synonym': 'information_intent', \n",
" 'salaries': 'information_intent', \n",
" 'function of': 'information_intent', \n",
" 'cause of': 'information_intent', \n",
" 'effects of': 'information_intent', \n",
" 'used for': 'information_intent', \n",
" 'what color is': 'information_intent', \n",
" 'weight loss': 'yelp_intent', \n",
" 'where do': 'information_intent', \n",
" 'what foods': 'information_intent', \n",
" 'why': 'information_intent', \n",
" 'age of': 'information_intent', \n",
" 'who wrote': 'information_intent', \n",
" \"what's a\": 'information_intent', \n",
" \"how fast\": 'information_intent', \n",
" 'most popular': 'information_intent', \n",
" 'where': 'information_intent', \n",
" 'is used': 'information_intent', \n",
" 'doctors': 'yelp_intent', \n",
" 'who ': 'information_intent', \n",
" ' hours': 'yelp_intent',\n",
" 'schedule': 'information_intent', \n",
" 'what age': 'information_intent',\n",
" 'cheap': 'yelp_intent',\n",
" 'most expensive': 'information_intent',\n",
" 'size of': 'information_intent',\n",
" 'what exactly': 'information_intent',\n",
" 'ways to ': 'information_intent',\n",
" 'disorder': 'information_intent',\n",
" 'disease': 'information_intent',\n",
" 'felony': 'information_intent',\n",
" 'movie': 'information_intent',\n",
" 'cost of': 'yelp_intent',\n",
" 'what were': 'information_intent',\n",
" 'degree': 'information_intent',\n",
" 'what day': 'information_intent',\n",
" 'ways to': 'information_intent',\n",
" 'influen': 'information_intent',\n",
" 'importan': 'information_intent',\n",
" 'school': 'information_intent',\n",
" 'train': 'information_intent',\n",
" 'dimension': 'information_intent',\n",
" 'what makes': 'information_intent',\n",
" 'what food': 'information_intent',\n",
" 'normal range': 'information_intent',\n",
" 'requirements for': 'information_intent',\n",
" 'employment': 'information_intent',\n",
" 'support number': 'navigation_intent',\n",
" 'fax number': 'navigation_intent',\n",
" 'considered a': 'information_intent',\n",
" 'distance ': 'information_intent',\n",
" 'share price': 'information_intent',\n",
" 'stock': 'information_intent',\n",
" 'channel is': 'information_intent',\n",
" 'continent': 'information_intent',\n",
" 'what level': 'information_intent',\n",
" 'english to': 'translation_intent',\n",
" 'to english': 'translation_intent',\n",
" 'translat': 'translation_intent',\n",
" 'what currency': 'information_intent',\n",
" 'blood test': 'information_intent',\n",
" 'replacement cost': 'yelp_intent',\n",
" 'how tall': 'information_intent',\n",
" 'characteristics of': 'information_intent',\n",
" 'tracking number': 'navigation_intent',\n",
" 'to replace': 'yelp_intent',\n",
" 'pay for': 'information_intent',\n",
" 'calories': 'information_intent',\n",
" 'health': 'information_intent',\n",
" 'tax': 'information_intent',\n",
" 'deadline': 'information_intent',\n",
" 'insurance': 'information_intent',\n",
" 'cancel': 'navigation_intent',\n",
" 'address': 'navigation_intent',\n",
" 'healthy': 'yelp_intent',\n",
" 'diet': 'information_intent',\n",
" 'lyrics': 'information_intent',\n",
" 'iphone': 'purchase_intent',\n",
" 'cell phone': 'purchase_intent',\n",
" 'android phone': 'purchase_intent',\n",
" 'android': 'information_intent',\n",
" 'protein': 'information_intent',\n",
" 'how to': 'information_intent',\n",
" '401k': 'information_intent',\n",
" ' ira ': 'information_intent',\n",
" 'population': 'information_intent',\n",
" 'president': 'information_intent',\n",
" 'whats': 'information_intent',\n",
" \"what's\": 'information_intent',\n",
" 'benefits': 'information_intent',\n",
" ' pain ': 'yelp_intent',\n",
" 'installation cost': 'yelp_intent',\n",
" 'in spanish': 'translation_intent',\n",
" 'to spanish': 'translation_intent',\n",
" 'in french': 'translation_intent',\n",
" 'to french': 'translation_intent',\n",
" 'in japanese': 'translation_intent',\n",
" 'to japanese': 'translation_intent',\n",
" 'in chinese': 'translation_intent',\n",
" 'to chinese': 'translation_intent',\n",
" 'side effect': 'information_intent',\n",
" 'cost to': 'yelp_intent',\n",
" 'cost per': 'information_intent',\n",
" 'disney world': 'navigation_intent',\n",
" 'surgery cost': 'yelp_intent',\n",
" 'album': 'information_intent',\n",
" 'genre': 'information_intent',\n",
" 'much water': 'information_intent',\n",
" 'job': 'navigation_intent',\n",
" 'netflix': 'information_intent',\n",
" 'nutrient': 'information_intent',\n",
" 'amazon': 'navigation_intent',\n",
" 'music': 'information_intent',\n",
" 'caffeine': 'information_intent',\n",
" 'adoption': 'yelp_intent',\n",
" 'dogs': 'yelp_intent',\n",
" 'cats': 'yelp_intent',\n",
" 'countries': 'information_intent',\n",
" 'number of': 'information_intent',\n",
" 'related to': 'information_intent',\n",
" 'foods with': 'information_intent',\n",
"    'cuisine': 'yelp_intent',\n",
" 'italian': 'yelp_intent',\n",
" 'mediterranean': 'yelp_intent',\n",
" 'vietnamese': 'yelp_intent',\n",
" 'recipe': 'yelp_intent',\n",
" 'vegan': 'yelp_intent',\n",
" ' vegeta': 'yelp_intent',\n",
" ' meat': 'yelp_intent',\n",
" ' spice': 'yelp_intent',\n",
" ' beer': 'yelp_intent',\n",
" ' wine': 'yelp_intent',\n",
" ' fresh ': 'yelp_intent',\n",
" 'fruit': 'yelp_intent',\n",
" 'resort': 'travel_intent',\n",
" 'attraction': 'travel_intent',\n",
" 'installation': 'yelp_intent',\n",
" 'service': 'yelp_intent',\n",
" 'routing number': 'navigation_intent',\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5ff887ab-e970-46d6-8a20-b2e070751e0d",
"metadata": {},
"outputs": [],
"source": [
"print(\"key\", \"#examples\")\n",
"yelp_queries_set = set()\n",
"for key,val in target_mapping.items():\n",
" if val == 'yelp_intent':\n",
" cnt = 0\n",
" for query in search_queries_by_words(key, marco_text_queries_list):\n",
" yelp_queries_set.add(query)\n",
" cnt += 1\n",
"\n",
"        print(key, cnt)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "224fff1c-f1ac-41dc-9907-9185e7f24030",
"metadata": {},
"outputs": [],
"source": [
"yelp_queries = list(yelp_queries_set)\n",
"yelp_queries[:5]\n",
"\n",
"yelp_ngram_counter = count_ngrams(yelp_queries, 2)\n",
"yelp_most_common_ngrams = yelp_ngram_counter.most_common(100)\n",
"\n",
"# Display the yelp_most_common_ngrams\n",
"print(yelp_most_common_ngrams)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a56a3dc7-5288-4b49-8fe1-54ba224efd20",
"metadata": {},
"outputs": [],
"source": [
"print(\"key\", \"#examples\")\n",
"weather_queries_set = set()\n",
"for key,val in target_mapping.items():\n",
" if val == 'weather_intent':\n",
" cnt = 0\n",
" for query in search_queries_by_words(key, marco_text_queries_list):\n",
" weather_queries_set.add(query)\n",
" cnt += 1\n",
"\n",
"        print(key, cnt)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "64c93b81-f34b-415c-bde2-0976f6331f89",
"metadata": {},
"outputs": [],
"source": [
"weather_queries = list(weather_queries_set)\n",
"weather_queries[:5]\n",
"\n",
"weather_ngram_counter = count_ngrams(weather_queries, 2)\n",
"weather_most_common_ngrams = weather_ngram_counter.most_common(100)\n",
"\n",
"# Display the weather_most_common_ngrams\n",
"print(weather_most_common_ngrams)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2ecb5cff-106d-4fbd-981e-782d2ca69f8c",
"metadata": {},
"outputs": [],
"source": [
"weather_templates = [\n",
" # Original Patterns\n",
" (\"The weather in {}\", 0.539),\n",
" (\"What is the weather in {}\", 0.499),\n",
" (\"What's the weather in {}\", 0.046),\n",
" (\"Weather forecast in {}\", 0.039),\n",
" (\"What is the temperature in {}\", 0.033),\n",
" (\"The weather forecast for {}\", 0.034),\n",
" (\"Current weather in {}\", 0.023),\n",
" (\"Average weather in {}\", 0.022),\n",
" (\"What is the weather forecast for {}\", 0.014),\n",
" (\"Weather in {} in {}\", 0.011),\n",
" (\"How is the weather in {}\", 0.006),\n",
" (\"What is the climate of {}\", 0.009),\n",
" (\"Is the weather forecast for {}\", 0.005),\n",
" (\"Rain in {}\", 0.002),\n",
" (\"What is the weather like in {}\", 0.009),\n",
" (\"What is the climate in {}\", 0.001),\n",
" (\"The weather today in {}\", 0.001),\n",
" (\"What's the weather forecast for {}\", 0.002),\n",
" (\"What is the best weather in {}\", 0.001),\n",
" (\"Is the weather today in {}\", 0.001),\n",
" (\"Current temperature in {}\", 0.001),\n",
" (\"Storms in {}\", 0.0007),\n",
" (\"Humidity in {}\", 0.003),\n",
" (\"Windy in {}\", 0.0005),\n",
" (\"Snow in {}\", 0.009),\n",
" (\"Weather radar in {}\", 0.005),\n",
" (\"The temperature in {}\", 0.005),\n",
" (\"Weather like in {}\", 0.006),\n",
" (\"What's the temperature in {}\", 0.001),\n",
" (\"Is the weather like in {}\", 0.006),\n",
"\n",
" # # Additional Patterns (10% of original weight)\n",
" (\"weather {}\", 0.10 * 0.539),\n",
" (\"{} weather\", 0.10 * 0.539),\n",
" (\"temperature {}\", 0.10 * 0.033),\n",
" (\"{} temperature\", 0.10 * 0.033),\n",
"]\n",
"\n",
"# Expanding the typo variants further to include the common misspellings for \"weather\", \"temperature\", and \"forecast\"\n",
"extended_typo_variants = [\n",
" # Misspellings for \"weather\"\n",
" (\"The weathr in {}\", 0.20 * 0.539),\n",
" (\"What is the weathr in {}\", 0.20 * 0.499),\n",
" (\"What's the weathr in {}\", 0.20 * 0.046),\n",
" (\"Weathr forecast in {}\", 0.20 * 0.039),\n",
" (\"What is the weathr like in {}\", 0.20 * 0.009),\n",
" (\"The wether in {}\", 0.20 * 0.539),\n",
" (\"What is the wether in {}\", 0.20 * 0.499),\n",
" (\"What's the wether in {}\", 0.20 * 0.046),\n",
" (\"Wether forecast in {}\", 0.20 * 0.039),\n",
" (\"What is the wether like in {}\", 0.20 * 0.009),\n",
" (\"The weater in {}\", 0.20 * 0.539),\n",
" (\"What is the weater in {}\", 0.20 * 0.499),\n",
" (\"What's the weater in {}\", 0.20 * 0.046),\n",
" (\"Weater forecast in {}\", 0.20 * 0.039),\n",
" (\"What is the weater like in {}\", 0.20 * 0.009),\n",
" (\"The wather in {}\", 0.20 * 0.539),\n",
" (\"What is the wather in {}\", 0.20 * 0.499),\n",
" (\"What's the wather in {}\", 0.20 * 0.046),\n",
" (\"Wather forecast in {}\", 0.20 * 0.039),\n",
" (\"What is the wather like in {}\", 0.20 * 0.009),\n",
" (\"The weahter in {}\", 0.20 * 0.539),\n",
" (\"What is the weahter in {}\", 0.20 * 0.499),\n",
" (\"What's the weahter in {}\", 0.20 * 0.046),\n",
" (\"Weahter forecast in {}\", 0.20 * 0.039),\n",
" (\"What is the weahter like in {}\", 0.20 * 0.009),\n",
" (\"The weaher in {}\", 0.20 * 0.539),\n",
" (\"What is the weaher in {}\", 0.20 * 0.499),\n",
" (\"What's the weaher in {}\", 0.20 * 0.046),\n",
" (\"Weaher forecast in {}\", 0.20 * 0.039),\n",
" (\"What is the weaher like in {}\", 0.20 * 0.009),\n",
" (\"The waether in {}\", 0.20 * 0.539),\n",
" (\"What is the waether in {}\", 0.20 * 0.499),\n",
" (\"What's the waether in {}\", 0.20 * 0.046),\n",
" (\"Waether forecast in {}\", 0.20 * 0.039),\n",
" (\"What is the waether like in {}\", 0.20 * 0.009),\n",
"\n",
" # Misspellings for \"temperature\"\n",
" (\"What is the temprature in {}\", 0.20 * 0.033),\n",
" (\"What is the temperture in {}\", 0.20 * 0.033),\n",
" (\"What is the tempreture in {}\", 0.20 * 0.033),\n",
" (\"What is the tempratuer in {}\", 0.20 * 0.033),\n",
" (\"What is the tempratue in {}\", 0.20 * 0.033),\n",
" (\"What is the tempertuer in {}\", 0.20 * 0.033),\n",
" (\"What is the tempretuer in {}\", 0.20 * 0.033),\n",
" (\"What is the temprture in {}\", 0.20 * 0.033),\n",
"\n",
" # Misspellings for \"forecast\"\n",
" (\"Forcast in {}\", 0.20 * 0.039),\n",
" (\"What is the forcast for {}\", 0.20 * 0.034),\n",
" (\"Forcst in {}\", 0.20 * 0.039),\n",
" (\"What is the forcst for {}\", 0.20 * 0.034),\n",
" (\"Forescast in {}\", 0.20 * 0.039),\n",
" (\"What is the forescast for {}\", 0.20 * 0.034),\n",
" (\"Forecats in {}\", 0.20 * 0.039),\n",
" (\"What is the forecats for {}\", 0.20 * 0.034),\n",
" (\"Forcaste in {}\", 0.20 * 0.039),\n",
" (\"What is the forcaste for {}\", 0.20 * 0.034),\n",
" (\"Forecst in {}\", 0.20 * 0.039),\n",
" (\"What is the forecst for {}\", 0.20 * 0.034),\n",
" (\"Forecase in {}\", 0.20 * 0.039),\n",
" (\"What is the forecase for {}\", 0.20 * 0.034),\n",
" (\"Foercast in {}\", 0.20 * 0.039),\n",
" (\"What is the foercast for {}\", 0.20 * 0.034),\n",
"]\n",
"\n",
"# Combine original templates and the expanded typo variants\n",
"weather_templates_extended = weather_templates + extended_typo_variants\n",
"\n",
"\n",
"weather_templates_df = pd.DataFrame(weather_templates_extended, columns=['pattern', 'weight'])\n",
"weather_templates_df['weight'] = weather_templates_df['weight'] / weather_templates_df['weight'].sum()\n",
"weather_templates_df"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5f7a5dac-3cf3-4dbf-afbd-47a5da68b0c3",
"metadata": {},
"outputs": [],
"source": [
"weather_templates_df.head(50)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1a67329c-76ca-4373-ac9c-ed534fdc5c7e",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"import re"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ea6893d5-b299-46a0-ad10-90aec1b549ea",
"metadata": {},
"outputs": [],
"source": [
"url = \"https://en.m.wikipedia.org/wiki/List_of_television_stations_in_North_America_by_media_market\"\n",
"response = requests.get(url)\n",
"\n",
"if response.status_code == 200:\n",
" soup = BeautifulSoup(response.content, 'html.parser')\n",
" dma_heading = soup.find('h4', string='DMAs')\n",
" dma_list = dma_heading.find_next('ul')\n",
" \n",
" dma_data = []\n",
" if dma_list:\n",
" for li in dma_list.find_all('li'):\n",
" market_name = li.get_text(strip=True)\n",
"\n",
" # Split by dash (-) or en-dash (–) to handle cases like \"Dallas-Fort Worth\"\n",
" split_names = re.split(r'–|-', market_name)\n",
"\n",
" # Process each split name\n",
" for name in split_names:\n",
" # Remove the (#NUM) part using regex\n",
" name = re.sub(r'\\s*\\(#\\d+\\)', '', name).strip()\n",
"\n",
" # Check if there's a city in parentheses and split them\n",
" match = re.match(r'(.+?)\\s*\\((.+?)\\)', name)\n",
" if match:\n",
" main_city = match.group(1).strip()\n",
" parenthetical_city = match.group(2).strip()\n",
" dma_data.append(main_city) # Add the main city\n",
" dma_data.append(parenthetical_city) # Add the city in parentheses\n",
" else:\n",
" dma_data.append(name) \n",
"\n"
]
},
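{
"cell_type": "markdown",
"id": "b3f1a2c4-0d1e-4f2a-9b3c-1a2b3c4d5e6f",
"metadata": {},
"source": [
"Splitting market names on dashes and parentheses above can leave duplicate city names in `dma_data`. A small, order-preserving dedupe step (a sketch; it relies only on `dict` preserving insertion order in Python 3.7+):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c4d5e6f7-1a2b-4c3d-8e4f-2b3c4d5e6f70",
"metadata": {},
"outputs": [],
"source": [
"# Deduplicate while keeping first-seen order; dict keys preserve insertion order\n",
"dma_data = list(dict.fromkeys(dma_data))\n",
"len(dma_data)"
]
},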
{
"cell_type": "code",
"execution_count": null,
"id": "8fbf33d4-9444-4393-aedd-8d9018f62f5e",
"metadata": {},
"outputs": [],
"source": [
"len(dma_data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5b44b189-d1a9-43b2-999f-eca76b777f6a",
"metadata": {},
"outputs": [],
"source": [
"print(dma_data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "903a2437-a2dd-4ad7-a60c-f6572bc2387a",
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"# months\n",
"months = [\"January\", \"February\", \"March\", \"April\", \"May\", \"June\", \"July\", \"August\", \"September\", \"October\", \"November\", \"December\"]\n",
"\n",
"# Generate random queries, lowercasing a configurable fraction (default 30%)\n",
"def generate_queries_with_case(df, cities, months, num_queries=10, lower_case_prob=0.3):\n",
" queries = set()\n",
" cnt = 0\n",
" pattern_counter = Counter()\n",
" while cnt < num_queries:\n",
" # Choose a pattern based on the weights\n",
" pattern = random.choices(df['pattern'], weights=df['weight'], k=1)[0]\n",
" \n",
" # Replace placeholders in the pattern with a random city and/or month\n",
" city = random.choice(cities)\n",
" if \"{} in {}\" in pattern:\n",
" month = random.choice(months)\n",
" query = pattern.format(city, month)\n",
" else:\n",
" query = pattern.format(city)\n",
"\n",
" if pattern_counter.get(pattern, 0) > num_queries//10:\n",
" continue\n",
" pattern_counter.update([pattern])\n",
" \n",
" # Randomly convert the query to lowercase with the given probability\n",
" if random.random() < lower_case_prob:\n",
" query = query.lower()\n",
"\n",
" if query not in queries:\n",
" queries.add(query)\n",
" cnt += 1\n",
" \n",
" return list(queries), pattern_counter\n",
"\n",
"# Generate 10,000 sample weather queries, ~30% lowercased\n",
"sample_queries_with_case, pattern_counter = generate_queries_with_case(weather_templates_df, dma_data, months, num_queries=10000, lower_case_prob=0.3)\n",
"\n",
"print(len(sample_queries_with_case))\n",
"sample_queries_with_case[:10]\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2eb0604c-7b51-44a8-b59c-e0428cc2b893",
"metadata": {},
"outputs": [],
"source": [
"pattern_counter"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "27b3d603-8fae-4654-840a-72657a9cc2f8",
"metadata": {},
"outputs": [],
"source": [
"# sample_queries_with_case[1000:2000]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4bf13d65-0af9-4cf9-b7e1-92a018a5715b",
"metadata": {},
"outputs": [],
"source": [
"# sample_queries_with_case[:100]\n",
"weather_examples = pd.DataFrame(sample_queries_with_case, columns=['sequence'])\n",
"weather_examples['target'] = 'weather_intent'\n",
"weather_examples"
]
},
{
"cell_type": "markdown",
"id": "9d6bf011-ff0c-4586-99a0-8ab5ddcc5090",
"metadata": {},
"source": [
"#### Yelp examples"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8990d49c-5d1d-4157-be4d-301294984b11",
"metadata": {},
"outputs": [],
"source": [
"# Original Yelp Intent Templates\n",
"yelp_intent_templates = [\n",
" (\"What are the best restaurants in {}\", 0.12),\n",
" (\"Top-rated restaurants in {}\", 0.10),\n",
" (\"Popular coffee shops in {}\", 0.09),\n",
" (\"Best pizza places in {}\", 0.08),\n",
" (\"Best sushi places in {}\", 0.07),\n",
" (\"Cheap restaurants in {}\", 0.06),\n",
" (\"Best places to eat in {}\", 0.06),\n",
" (\"Restaurants near me in {}\", 0.05),\n",
" (\"What is the average cost of a meal in {}\", 0.04),\n",
" (\"Best Italian restaurants in {}\", 0.04),\n",
" (\"Best fast food restaurants in {}\", 0.04),\n",
" (\"Mexican restaurants in {}\", 0.03),\n",
" (\"Chinese food near me in {}\", 0.03),\n",
" (\"Best hotels in {}\", 0.03),\n",
" (\"Affordable hotels in {}\", 0.03),\n",
" (\"Best parks to visit in {}\", 0.02),\n",
" (\"Best attractions in {}\", 0.02),\n",
" (\"Popular things to do in {}\", 0.02),\n",
" (\"Best shopping centers in {}\", 0.02),\n",
" (\"Best gyms in {}\", 0.02),\n",
" (\"Top hair salons in {}\", 0.02),\n",
" (\"What are the best-rated dentists in {}\", 0.02),\n",
" (\"Local plumbers in {}\", 0.02),\n",
" (\"Popular electricians in {}\", 0.02),\n",
" (\"What is the phone number for a restaurant in {}\", 0.02),\n",
" (\"Phone number for hotels in {}\", 0.02),\n",
" (\"Top-rated cafes in {}\", 0.02),\n",
" (\"Best massage spas in {}\", 0.02),\n",
" (\"Grocery stores near me in {}\", 0.02),\n",
" (\"Where can I buy clothes in {}\", 0.01),\n",
" (\"Pharmacies near me in {}\", 0.01),\n",
" (\"Best bars in {}\", 0.01),\n",
" (\"Cocktail bars in {}\", 0.01),\n",
" (\"Family-friendly restaurants in {}\", 0.01),\n",
" (\"Kid-friendly restaurants in {}\", 0.01),\n",
" (\"Pet-friendly restaurants in {}\", 0.01),\n",
" (\"Vegan restaurants in {}\", 0.01),\n",
" (\"Best rooftop bars in {}\", 0.01),\n",
" (\"Top pizza delivery places in {}\", 0.01),\n",
" (\"Where can I get sushi in {}\", 0.01),\n",
" (\"Best food delivery services in {}\", 0.01),\n",
" (\"Catering services in {}\", 0.01),\n",
" (\"Top-rated bakeries in {}\", 0.01),\n",
" (\"Where can I find a gym in {}\", 0.01),\n",
" (\"Yoga studios near me in {}\", 0.01),\n",
" (\"What’s the cost of living in {}\", 0.01),\n",
" (\"How much does it cost to live in {}\", 0.01),\n",
" (\"Best places for nightlife in {}\", 0.01),\n",
" (\"Local car repair shops in {}\", 0.01),\n",
" (\"Best car rental services in {}\", 0.01)\n",
"]\n",
"\n",
"# Function to add typos to templates\n",
"def add_typos_to_template(template, typo_prob=0.1):\n",
" typos = {\n",
" \"restaurants\": [\"restarants\", \"resturants\", \"restrants\"],\n",
" \"best\": [\"bst\", \"besst\", \"bet\"],\n",
" \"popular\": [\"populer\", \"ppular\", \"poplar\"],\n",
"    \"coffee\": [\"cofee\", \"cofffe\", \"coffe\"],\n",
"    \"pizza\": [\"piza\", \"pzza\", \"pizzza\"],\n",
"    \"hotels\": [\"hoetls\", \"hotls\", \"hotell\"],\n",
" \"places\": [\"plces\", \"place\", \"palces\"],\n",
" \"attractions\": [\"attractons\", \"atrctions\", \"attractins\"],\n",
"    \"cheap\": [\"chep\", \"cheep\", \"chaep\"],\n",
" \"meal\": [\"mel\", \"meel\", \"male\"],\n",
" \"cost\": [\"cst\", \"cots\", \"cot\"],\n",
" \"living\": [\"lving\", \"livng\", \"livin\"],\n",
" \"yoga\": [\"yga\", \"yoaga\", \"ygoa\"],\n",
" \"food\": [\"fod\", \"fud\", \"fodd\"],\n",
" \"parks\": [\"praks\", \"parcs\", \"paks\"],\n",
" \"near\": [\"ner\", \"neer\", \"naer\"],\n",
"    \"bar\": [\"barr\", \"ber\", \"baer\"],\n",
" \"family\": [\"famly\", \"famliy\", \"faimly\"],\n",
" \"friendly\": [\"frindly\", \"frendly\", \"friendley\"]\n",
" }\n",
"\n",
" words = template.split()\n",
" for i, word in enumerate(words):\n",
" if word.lower().strip(\"{}\") in typos and random.random() < typo_prob:\n",
" words[i] = random.choice(typos[word.lower().strip(\"{}\")])\n",
" return \" \".join(words)\n",
"\n",
"# Extending the list with typos\n",
"extended_yelp_intent_templates = []\n",
"extended_yelp_intent_templates_set = set()\n",
"\n",
"for template, weight in yelp_intent_templates:\n",
" if template in extended_yelp_intent_templates_set:\n",
" continue\n",
" extended_yelp_intent_templates.append((template, weight))\n",
" extended_yelp_intent_templates_set.add(template)\n",
" \n",
"    # Add a typo variant 20% of the time\n",
" if random.random() < 0.2:\n",
" typo_template = add_typos_to_template(template)\n",
" typo_weight = weight * 0.2 # Typos occur less frequently, so reduce weight\n",
" if typo_template in extended_yelp_intent_templates_set:\n",
" continue\n",
" extended_yelp_intent_templates.append((typo_template, typo_weight))\n",
" extended_yelp_intent_templates_set.add(typo_template)\n",
"\n",
"# Convert to DataFrame for better readability\n",
"df_extended_yelp_intent_templates = pd.DataFrame(extended_yelp_intent_templates, columns=[\"pattern\", \"weight\"])\n",
"df_extended_yelp_intent_templates['weight'] = df_extended_yelp_intent_templates['weight'] / df_extended_yelp_intent_templates['weight'].sum()\n",
"df_extended_yelp_intent_templates"
]
},
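{
"cell_type": "markdown",
"id": "d5e6f708-2b3c-4d4e-9f50-3c4d5e6f7081",
"metadata": {},
"source": [
"A quick sanity check on the typo helper defined above (a sketch: forcing `typo_prob=1.0` replaces every word that has a typo variant, while `strip(\"{}\")` keeps the `{}` placeholder untouched):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e6f70819-3c4d-4e5f-a061-4d5e6f708192",
"metadata": {},
"outputs": [],
"source": [
"# Force a typo on every known word to eyeball the generated variants\n",
"for t in [\"Best restaurants in {}\", \"Cheap hotels in {}\"]:\n",
"    print(add_typos_to_template(t, typo_prob=1.0))"
]
},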
{
"cell_type": "code",
"execution_count": null,
"id": "d544569b-dace-4aff-a3c3-924124dbc596",
"metadata": {},
"outputs": [],
"source": [
"list(weather_templates_df['pattern'].values) + list(df_extended_yelp_intent_templates['pattern'].values)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d126b403-f7bc-49ee-aff6-b89b10bc7e63",
"metadata": {},
"outputs": [],
"source": [
"# Generate random Yelp queries, lowercasing a configurable fraction\n",
"def generate_yelp_queries_with_case(df, cities, num_queries=10, lower_case_prob=0.3):\n",
" queries = set()\n",
" cnt = 0\n",
" pattern_counter = Counter()\n",
" while cnt < num_queries:\n",
" # Choose a pattern based on the weights\n",
" pattern = random.choices(df['pattern'], weights=df['weight'], k=1)[0]\n",
" \n",
"        # Replace the placeholder in the pattern with a random city\n",
" city = random.choice(cities)\n",
" query = pattern.format(city)\n",
"\n",
" if pattern_counter.get(pattern, 0) > num_queries//10:\n",
" continue\n",
" pattern_counter.update([pattern])\n",
" \n",
" # Randomly convert the query to lowercase with the given probability\n",
" if random.random() < lower_case_prob:\n",
" query = query.lower()\n",
"\n",
" if query not in queries:\n",
" queries.add(query)\n",
" cnt += 1\n",
" \n",
" return list(queries), pattern_counter"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2ef143ad-f441-4ade-9b11-47b5fad2c2a0",
"metadata": {},
"outputs": [],
"source": [
"# Generate 10,000 sample Yelp queries, ~40% lowercased\n",
"sample_yelp_queries_with_case, pattern_counter = generate_yelp_queries_with_case(df_extended_yelp_intent_templates, dma_data, num_queries=10000, lower_case_prob=0.4)\n",
"\n",
"print(len(sample_yelp_queries_with_case))\n",
"sample_yelp_queries_with_case[:10]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0c5ed901-17fa-4d02-8e38-cee2339b9ab1",
"metadata": {},
"outputs": [],
"source": [
"sample_yelp_queries_with_case"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "77c83fc5-78f7-4788-bf82-46ad48e29c6d",
"metadata": {},
"outputs": [],
"source": [
"pattern_counter"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "01d72484-d62a-44ef-a43c-0323e1e0ba36",
"metadata": {},
"outputs": [],
"source": [
"yelp_examples = pd.DataFrame(sample_yelp_queries_with_case, columns=['sequence'])\n",
"yelp_examples['target'] = 'yelp_intent'\n",
"yelp_examples"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5101385f-496e-4ee8-89fe-bc2517eb3578",
"metadata": {},
"outputs": [],
"source": [
"def apply_target_mapping(df, target_mapping):\n",
" mapped_text_set = set()\n",
" for ngram in target_mapping.keys():\n",
" # mask = df['sequence'].apply(lambda text: ngram in text)\n",
" mask = df['sequence'].apply(lambda text: ngram in text and text not in mapped_text_set)\n",
" print(f'Number of matches found for \"{ngram}\" = {mask.sum()}')\n",
" print(f'size of mapped_text_set = {len(mapped_text_set)}')\n",
" df.loc[mask, 'target'] = target_mapping[ngram]\n",
" mapped_text_set.update(df.loc[mask, 'sequence'].values.tolist())\n",
" print()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f6322201-5190-42c5-9040-33df3c6f43a9",
"metadata": {},
"outputs": [],
"source": [
"to_be_labelled = marco_df.loc[marco_df['target'].isna()].copy()\n",
"labelled = marco_df.loc[~marco_df['target'].isna()].copy()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ae058794-cf68-49ce-9eb4-cf3f97ee1fe9",
"metadata": {},
"outputs": [],
"source": [
"manual_labelled = pd.read_csv(\"../data/manual_labels.csv\")\n",
"manual_labelled = manual_labelled.loc[~manual_labelled['target'].isna()]\n",
"print(len(manual_labelled))\n",
"print(manual_labelled['target'].value_counts())\n",
"manual_labelled_lkp = manual_labelled[['sequence','target']].set_index('sequence').to_dict()['target']\n",
"manual_labelled.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e415ba5f-f845-4944-a560-cede49ab3516",
"metadata": {},
"outputs": [],
"source": [
"def apply_manual_mapping(df, manual_labelled_lkp):\n",
" mask = df['sequence'].apply(lambda text: text in manual_labelled_lkp)\n",
" print(f'Number of matches found in manual labels = {mask.sum()}')\n",
" df.loc[mask, 'target'] = df.loc[mask, 'sequence'].map(manual_labelled_lkp)\n",
" print()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "32d57274-d2f3-4a31-8dae-62afa458bc02",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"\n",
"print(f\"Number of examples labeled = {len(labelled)}\")\n",
"print(f\"Number of examples to be labeled = {len(to_be_labelled)}\")\n",
"print(f\"Label stats \\n{labelled['target'].value_counts()}\\n\")\n",
"\n",
"# Step 3: Get most common n-grams for a given n\n",
"n = 2 # Change this to any n (e.g., 1 for unigrams, 3 for trigrams)\n",
"to_be_labelled_sequence_list = to_be_labelled['sequence'].values.tolist()\n",
"ngram_counter = count_ngrams(to_be_labelled_sequence_list, n)\n",
"most_common_ngrams = ngram_counter.most_common(100)\n",
"\n",
"# Display the most common n-grams\n",
"print(most_common_ngrams)\n",
"\n",
"# Example usage with a limit on the number of results\n",
"cnt = 0\n",
"for query in search_queries_by_words(\"5 star\", to_be_labelled_sequence_list):\n",
"    if cnt >= 100: # Stop after 100 results\n",
" break\n",
" print(cnt + 1, query)\n",
" cnt += 1\n",
"\n",
"apply_target_mapping(to_be_labelled, target_mapping)\n",
"apply_manual_mapping(to_be_labelled, manual_labelled_lkp)\n",
"labelled = pd.concat([labelled, to_be_labelled.loc[~to_be_labelled['target'].isna()], weather_examples, yelp_examples], axis=0)\n",
"to_be_labelled = to_be_labelled.loc[to_be_labelled['target'].isna()]\n",
"print()\n"
]
},
{
"cell_type": "markdown",
"id": "ecd6c05a-df86-46a6-98f4-37f794e7e8a2",
"metadata": {},
"source": [
"#### Skip this for manual labeling"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "92ea96db-4112-4492-99c9-5232eb9900c4",
"metadata": {},
"outputs": [],
"source": [
"## Only run this if a special list for the manual labeling process is needed; otherwise skip\n",
"\n",
"SKIP_MANUAL_LABEL_PREP = True\n",
"if not SKIP_MANUAL_LABEL_PREP:\n",
" special_list = set()\n",
" \n",
" cnt = 0\n",
" \n",
" for query in search_queries_by_words(\"how much\", to_be_labelled_sequence_list):\n",
"        if cnt >= 10000: # Stop after 10000 results\n",
" break\n",
" # print(cnt + 1, query)\n",
" cnt += 1\n",
" special_list.add(query)\n",
" \n",
" pd.DataFrame(special_list, columns=['sequence']).to_csv('special_list_manual_label.csv', index=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1ada08c3-7167-4680-abc1-fc7018798fcb",
"metadata": {},
"outputs": [],
"source": [
"to_be_labelled"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5ddf19be-438d-444d-8c46-313670da0a3b",
"metadata": {},
"outputs": [],
"source": [
"labelled['target'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "88b28463-32da-43f0-a123-780bca3786ff",
"metadata": {},
"outputs": [],
"source": [
"combined = pd.concat([labelled, to_be_labelled], axis=0).reset_index(drop=True)\n",
"print(len(combined))\n",
"combined"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "42cae81d-4caf-46d1-b3d4-3c1df799e736",
"metadata": {},
"outputs": [],
"source": [
"labelled.to_csv(\"../data/marco_train_v2.csv\", index=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "173c29d3-9f04-48e0-b2d9-b9bc89cb4d67",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from umap import UMAP\n",
"from sklearn.pipeline import make_pipeline \n",
"from embetter.text import SentenceEncoder\n",
"\n",
"\n",
"SKIP_ENCODING = False\n",
"if not SKIP_ENCODING:\n",
" # Build a sentence encoder pipeline with UMAP at the end.\n",
" enc = SentenceEncoder('all-MiniLM-L6-v2')\n",
" umap = UMAP()\n",
" \n",
" text_emb_pipeline = make_pipeline(\n",
" enc, umap\n",
" )\n",
" \n",
" # Load sentences\n",
" X = combined['sequence'].values.tolist()\n",
" \n",
" # Calculate embeddings \n",
" X_tfm = text_emb_pipeline.fit_transform(X)\n",
" \n",
" # Write to disk. Note! Text column must be named \"text\"\n",
" df = pd.DataFrame({\"text\": X})\n",
" df['x'] = X_tfm[:, 0]\n",
" df['y'] = X_tfm[:, 1]\n",
" df.to_csv(\"marco_ready.csv\", index=False)\n",
" df['target'] = combined['target'].fillna('unknown')\n",
"else:\n",
" df = pd.read_csv(\"marco_ready.csv\")\n",
" df['target'] = combined['target'].fillna('unknown')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "077564f3-bfd0-4df1-b113-6beefb047cbf",
"metadata": {},
"outputs": [],
"source": [
"combined"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cea4591a-868e-4060-8fe4-140c6f056b34",
"metadata": {},
"outputs": [],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "55e55825-e5fd-4f51-ad6f-7af163997410",
"metadata": {},
"outputs": [],
"source": [
"import plotly.express as px"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "747c92b0-a4f5-478a-ac7f-da882d2850c8",
"metadata": {},
"outputs": [],
"source": [
"fig_2d = px.scatter(\n",
" df, x='x', y='y',\n",
" color=df['target'], labels={'color': 'target'},\n",
" hover_name=\"text\",\n",
" opacity=0.3,\n",
"    title=\"MS MARCO web search query intent map\"\n",
")\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "664aa35b-d264-4c40-bcf4-1649fe3993bf",
"metadata": {},
"outputs": [],
"source": [
"fig_2d"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a00e78ee-abed-4fa8-bc71-29544e9d5f01",
"metadata": {},
"outputs": [],
"source": [
"fig_2d.write_html(\"../reports/web_search_intents.html\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "30619067-30d4-470e-acbb-81d286863c56",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}