use_cases/archive/wikification/notebook/wikification.ipynb

{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Wikification with OpenAI: Leveraging Language Models for Entity Linking and Annotation\n", "\n", "[Wikification](https://en.wikipedia.org/wiki/Wikification) is a fundamental task in [natural language processing](https://en.wikipedia.org/wiki/Natural_language_processing) that involves linking named entities in a given text to their corresponding entries in a knowledge base, such as Wikipedia. In this notebook, we will explore how to leverage [OpenAI's](https://en.wikipedia.org/wiki/OpenAI) powerful language models to perform wikification using the [LLM model](https://en.wikipedia.org/wiki/Language_model).\n", "Entity linking and annotation are crucial for a wide range of applications, including [information retrieval](https://en.wikipedia.org/wiki/Information_retrieval), [question answering systems](https://en.wikipedia.org/wiki/Question_answering), [text summarization](https://en.wikipedia.org/wiki/Text_summarization), and more. By linking entities to knowledge bases, we can enrich the understanding of text and facilitate further analysis.\n", "Throughout this notebook, we will guide you through the process of setting up the environment, installing the necessary libraries, and implementing the steps involved in performing wikification with [OpenAI's](https://en.wikipedia.org/wiki/OpenAI) [LLM model](https://en.wikipedia.org/wiki/Language_model)." ] }, { "cell_type": "code", "execution_count": 325, "metadata": {}, "outputs": [], "source": [ "from dotenv import load_dotenv\n", "from pathlib import Path\n", "import json\n", "import openai \n", "import os\n" ] }, { "cell_type": "code", "execution_count": 326, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 326, "metadata": {}, "output_type": "execute_result" } ], "source": [ "env_path = Path('../../../.env.sample') # Change with your .env file\n", "load_dotenv(dotenv_path=env_path,override=True) # Use a text-davinci-003 model" ] }, { "cell_type": "code", "execution_count": 327, "metadata": {}, "outputs": [], "source": [ "openai.api_type = \"azure\"\n", "openai.api_key = os.getenv('OPENAI_API_KEY') \n", "openai.api_base = os.getenv(\"OPENAI_API_BASE\")\n", "\n", "COMPLETIONS_MODEL = os.getenv(\"COMPLETIONS_MODEL\") " ] }, { "cell_type": "code", "execution_count": 329, "metadata": {}, "outputs": [], "source": [ "def wikifi_text(text,completions_model,max_tokens=300):\n", " prompt = \"\"\"You will read a text, and you will detect named entities contained in it \n", "--- \n", "TEXT \"\"\"+text+\"\"\" \n", "---\n", "Now you will connect the named entities to a corresponding Wikipedia Page. \n", "\n", "If there are several wikipedia pages corresponding, select wikipedia pages that have a logic connection\n", "\n", "e.g. If in the same sentence I have Italy and Roberto Baggio, Italy is probably referring to the National Soccer Team\n", "\n", "Finally you will return the result in a json format like the one below:\n", "\n", "{\n", " \"annotations\" : [\n", " { \n", " \"wikipedia_page\":\"HERE PUT THE WIKIPEDIA PAGE URL\",\n", " \"wikipedia_page_title\":\"HERE PUT THE WIKIPEDIA PAGE TITLE\",\n", " \"text\":\"HERE PUT THE DETECTED TEXT\"\n", " },\n", " ... 
]\n", "}\n", "\n", "Answer:\n", "\"\"\"\n", " result = openai.Completion.create(\n", " prompt=prompt,\n", " temperature=0.5,\n", " max_tokens=max_tokens,\n", " engine=completions_model\n", " )\n", " return json.loads(result['choices'][0]['text'].strip(\"\\n\").replace(\"\\n\",\"\"))\n" ] }, { "cell_type": "code", "execution_count": 330, "metadata": {}, "outputs": [], "source": [ "def print_wikification_result(wikification):\n", " print(wikification)\n", " for (index,ann) in enumerate(wikification['annotations']):\n", " print(f\"#{index}\")\n", " print(\" - \"+ann[\"text\"])\n", " print(\" - \"+ann[\"wikipedia_page_title\"])\n", " print(\" - \"+ann[\"wikipedia_page\"])\n", " print(\"\\n\")\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 331, "metadata": {}, "outputs": [], "source": [ "sample_text_1 = \"On this day 24 years ago Maradona scored his infamous Hand of God goal against England in the quarter-final of the 1986\"\n", "sample_text_2 = \"The Mona Lisa is a sixteenth century oil painting created by Leonardo. It's held at the Louvre in Paris.\"" ] }, { "cell_type": "code", "execution_count": 332, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'annotations': [{'wikipedia_page': 'https://en.wikipedia.org/wiki/Diego_Maradona', 'wikipedia_page_title': 'Diego Maradona', 'text': 'Maradona'}, {'wikipedia_page': 'https://en.wikipedia.org/wiki/1986_FIFA_World_Cup', 'wikipedia_page_title': '1986 FIFA World Cup', 'text': '1986'}, {'wikipedia_page': 'https://en.wikipedia.org/wiki/England_national_football_team', 'wikipedia_page_title': 'England national football team', 'text': 'England'}]}\n", "#0\n", " - Maradona\n", " - Diego Maradona\n", " - https://en.wikipedia.org/wiki/Diego_Maradona\n", "\n", "\n", "#1\n", " - 1986\n", " - 1986 FIFA World Cup\n", " - https://en.wikipedia.org/wiki/1986_FIFA_World_Cup\n", "\n", "\n", "#2\n", " - England\n", " - England national football team\n", " - https://en.wikipedia.org/wiki/England_national_football_team\n", "\n", "\n" ] } ], "source": [ "wikification_test_1 = wikifi_text(sample_text_1,completions_model=COMPLETIONS_MODEL)\n", "print_wikification_result(wikification_test_1)" ] }, { "cell_type": "code", "execution_count": 333, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'annotations': [{'wikipedia_page': 'https://en.wikipedia.org/wiki/Mona_Lisa', 'wikipedia_page_title': 'Mona Lisa', 'text': 'Mona Lisa'}, {'wikipedia_page': 'https://en.wikipedia.org/wiki/Leonardo_da_Vinci', 'wikipedia_page_title': 'Leonardo da Vinci', 'text': 'Leonardo'}, {'wikipedia_page': 'https://en.wikipedia.org/wiki/Louvre', 'wikipedia_page_title': 'Louvre', 'text': 'Louvre'}, {'wikipedia_page': 'https://en.wikipedia.org/wiki/Paris', 'wikipedia_page_title': 'Paris', 'text': 'Paris'}]}\n", "#0\n", " - Mona Lisa\n", " - Mona Lisa\n", " - https://en.wikipedia.org/wiki/Mona_Lisa\n", "\n", "\n", "#1\n", " - Leonardo\n", " - Leonardo da Vinci\n", " - https://en.wikipedia.org/wiki/Leonardo_da_Vinci\n", "\n", "\n", "#2\n", " - Louvre\n", " - Louvre\n", " - https://en.wikipedia.org/wiki/Louvre\n", "\n", "\n", "#3\n", " - Paris\n", " - Paris\n", " - https://en.wikipedia.org/wiki/Paris\n", "\n", "\n" ] } ], "source": [ "wikification_test_2 = wikifi_text(sample_text_2,completions_model=COMPLETIONS_MODEL)\n", "print_wikification_result(wikification_test_2)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "In case you're wondering... yes! 
The introduction of this notebook was wikified using the same approach! Let's try!" ] }, { "cell_type": "code", "execution_count": 334, "metadata": {}, "outputs": [], "source": [ "sample_text_3 = \"\"\"Wikification is a fundamental task in natural language processing that involves linking named entities in a given text to their corresponding entries in a knowledge base, such as Wikipedia. In this notebook, we will explore how to leverage OpenAI's powerful language models to perform wikification using the LLM model.\n", "Entity linking and annotation are crucial for a wide range of applications, including information retrieval, question answering systems, text summarization, and more. By linking entities to knowledge bases, we can enrich the understanding of text and facilitate further analysis.\n", "Throughout this notebook, we will guide you through the process of setting up the environment, installing the necessary libraries, and implementing the steps involved in performing wikification with OpenAI's LLM model.\"\"\"" ] }, { "cell_type": "code", "execution_count": 335, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'annotations': [{'wikipedia_page': 'https://en.wikipedia.org/wiki/Natural_language_processing', 'wikipedia_page_title': 'Natural language processing', 'text': 'natural language processing'}, {'wikipedia_page': 'https://en.wikipedia.org/wiki/Information_retrieval', 'wikipedia_page_title': 'Information retrieval', 'text': 'information retrieval'}, {'wikipedia_page': 'https://en.wikipedia.org/wiki/Question_answering', 'wikipedia_page_title': 'Question answering', 'text': 'question answering systems'}, {'wikipedia_page': 'https://en.wikipedia.org/wiki/Text_summarization', 'wikipedia_page_title': 'Text summarization', 'text': 'text summarization'}, {'wikipedia_page': 'https://en.wikipedia.org/wiki/OpenAI', 'wikipedia_page_title': 'OpenAI', 'text': \"OpenAI's\"}, {'wikipedia_page': 'https://en.wikipedia.org/wiki/Language_model', 'wikipedia_page_title': 'Language model', 'text': 'LLM model'}]}\n", "#0\n", " - natural language processing\n", " - Natural language processing\n", " - https://en.wikipedia.org/wiki/Natural_language_processing\n", "\n", "\n", "#1\n", " - information retrieval\n", " - Information retrieval\n", " - https://en.wikipedia.org/wiki/Information_retrieval\n", "\n", "\n", "#2\n", " - question answering systems\n", " - Question answering\n", " - https://en.wikipedia.org/wiki/Question_answering\n", "\n", "\n", "#3\n", " - text summarization\n", " - Text summarization\n", " - https://en.wikipedia.org/wiki/Text_summarization\n", "\n", "\n", "#4\n", " - OpenAI's\n", " - OpenAI\n", " - https://en.wikipedia.org/wiki/OpenAI\n", "\n", "\n", "#5\n", " - LLM model\n", " - Language model\n", " - https://en.wikipedia.org/wiki/Language_model\n", "\n", "\n" ] } ], "source": [ "wikification_test_3 = wikifi_text(sample_text_3,completions_model=COMPLETIONS_MODEL,max_tokens=1000)\n", "print_wikification_result(wikification_test_3)" ] }, { "cell_type": "code", "execution_count": 336, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Final Markdown text:\n", "\n", "Wikification is a fundamental task in [natural language processing](https://en.wikipedia.org/wiki/Natural_language_processing) that involves linking named entities in a given text to their corresponding entries in a knowledge base, such as Wikipedia. 
In this notebook, we will explore how to leverage [OpenAI's](https://en.wikipedia.org/wiki/OpenAI) powerful language models to perform wikification using the [LLM model](https://en.wikipedia.org/wiki/Language_model).\n", "Entity linking and annotation are crucial for a wide range of applications, including [information retrieval](https://en.wikipedia.org/wiki/Information_retrieval), [question answering systems](https://en.wikipedia.org/wiki/Question_answering), [text summarization](https://en.wikipedia.org/wiki/Text_summarization), and more. By linking entities to knowledge bases, we can enrich the understanding of text and facilitate further analysis.\n", "Throughout this notebook, we will guide you through the process of setting up the environment, installing the necessary libraries, and implementing the steps involved in performing wikification with [OpenAI's](https://en.wikipedia.org/wiki/OpenAI) [LLM model](https://en.wikipedia.org/wiki/Language_model).\n" ] } ], "source": [ "sample_text_3_wikified = sample_text_3\n", "for ann in wikification_test_3['annotations']:\n", " markdown_text= \"[\"+ann[\"text\"]+\"](\"+ann[\"wikipedia_page\"]+\")\"\n", " sample_text_3_wikified = sample_text_3_wikified.replace(ann[\"text\"],markdown_text)\n", " \n", "print(\"Final Markdown text:\\n\\n\"+sample_text_3_wikified)" ] } ], "metadata": { "kernelspec": { "display_name": ".env", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.3" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }