1_agentic-design-ptn/01_reflection/LangGraph/01.2_self-rag-eval-sdk.ipynb (785 lines of code) (raw):
{
"cells": [
{
"cell_type": "markdown",
"id": "c26ab996",
"metadata": {},
"source": [
"# Self-RAG with Azure AI Evaluation SDK\n",
"---\n",
"\n",
"In this notebook, we will demonstrate how to use the Azure AI Evaluation SDK.\n",
"\n",
"\n",
"> ✨ ***Note*** <br>\n",
"> 1. Please check the reference document before you get started - https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk <br>\n",
"> 2. Check the region support for the Azure AI Evaluation SDK. https://learn.microsoft.com/en-us/azure/ai-studio/concepts/evaluation-metrics-built-in?tabs=warning#region-support\n",
"\n",
"### What is Self-RAG?\n",
"\n",
"Self-RAG reflects on the retrieved documents and generated responses, and includes a self-evaluation process to improve the quality of the generated answers.\n",
"\n",
"Original paper says Self-RAG generates special tokens, termed \"reflection tokens,\" to determine if retrieval would enhance the response, allowing for on-demand retrieval integration. \n",
"But in practice, we can ignore reflection tokens and let LLM decides if each document is relevant or not.\n",
"\n",
"Corrective RAG (CRAG) is similar to Self-RAG, but Self-RAG focuses on self-reflection and self-evaluation, while CRAG focuses on refining the entire retrieval process including web search.\n",
"\n",
"- Self-RAG: Trains the LLM to be self-sufficient in managing retrieval and generation processes. By generating reflection tokens, the model controls its behavior during inference, deciding when to retrieve information and how to critique and improve its own responses, leading to more accurate and contextually appropriate outputs. \n",
"- CRAG: Focuses on refining the retrieval process by evaluating and correcting the retrieved documents before they are used in generation. It integrates additional retrievals, such as web searches, when initial retrievals are insufficient, ensuring that the generation is based on the most relevant and accurate information available.\n",
"\n",
"**Reference**\n",
"\n",
"- [Self-RAG paper](https://arxiv.org/abs/2310.11511) "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b6458235",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from dotenv import load_dotenv\n",
"from azure_genai_utils.tracer import get_langchain_api_key, set_langsmith\n",
"\n",
"load_dotenv(override=True)\n",
"\n",
"# If you want to trace your RAG API calls, please set the tracing=True. You need to have a valid Langchain API key.\n",
"langchain_key, has_langchain_key = get_langchain_api_key()\n",
"set_langsmith(\"[RAG Innv Lab] 1_Agentic-Design-Pattern\", tracing=False)\n",
"\n",
"azure_openai_chat_deployment_name = os.getenv(\"AZURE_OPENAI_CHAT_DEPLOYMENT_NAME\")\n",
"azure_openai_embedding_deployment_name = os.getenv(\"AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2ac665a7",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import pprint\n",
"from azure.identity import DefaultAzureCredential\n",
"from azure.ai.evaluation import evaluate\n",
"from azure.ai.evaluation import (\n",
" ContentSafetyEvaluator,\n",
" RelevanceEvaluator,\n",
" CoherenceEvaluator,\n",
" GroundednessEvaluator,\n",
" FluencyEvaluator,\n",
" SimilarityEvaluator,\n",
" F1ScoreEvaluator,\n",
" RetrievalEvaluator,\n",
")\n",
"\n",
"credential = DefaultAzureCredential()\n",
"\n",
"# Initialize Azure OpenAI conncetion with your environment variables\n",
"\n",
"model_config = {\n",
" \"azure_endpoint\": os.environ.get(\"AZURE_OPENAI_ENDPOINT\"),\n",
" \"api_key\": os.environ.get(\"AZURE_OPENAI_API_KEY\"),\n",
" \"azure_deployment\": os.environ.get(\"AZURE_OPENAI_DEPLOYMENT_NAME\"),\n",
" \"api_version\": os.environ.get(\"AZURE_OPENAI_API_VERSION\"),\n",
" \"type\": \"azure_openai\",\n",
"}\n",
"\n",
"pprint.pprint(model_config)"
]
},
{
"cell_type": "markdown",
"id": "5ec192a9",
"metadata": {},
"source": [
"<br>\n",
"\n",
"## 🧪 Step 1. Test and Construct each module\n",
"---\n",
"\n",
"Before building the entire the graph pipeline, we will test and construct each module separately.\n",
"\n",
"- **Retrieval Grader**\n",
"- **Answer Generator**\n",
"- **Groundedness Evaluator**\n",
"- **Relevance Evaluator**\n",
"- **Question Re-writer**\n",
"\n",
"### Construct Retrieval Chain based on PDF"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4cde195c",
"metadata": {},
"outputs": [],
"source": [
"from azure_genai_utils.rag.pdf import PDFRetrievalChain\n",
"\n",
"pdf_path = \"../../../sample-docs/AutoGen-paper.pdf\"\n",
"\n",
"pdf = PDFRetrievalChain(\n",
" source_uri=[pdf_path],\n",
" loader_type=\"PDFPlumber\",\n",
" model_name=azure_openai_chat_deployment_name,\n",
" embedding_name=azure_openai_embedding_deployment_name,\n",
" chunk_size=500,\n",
" chunk_overlap=50,\n",
").create_chain()\n",
"\n",
"pdf_retriever = pdf.retriever\n",
"pdf_chain = pdf.chain\n",
"\n",
"question = \"What is AutoGen's main features?\"\n",
"docs = pdf_retriever.invoke(question)\n",
"\n",
"# Non-streaming\n",
"# results = pdf_chain.invoke({\"chat_history\": \"\", \"question\": question, \"context\": docs})\n",
"\n",
"# Streaming\n",
"for text in pdf_chain.stream(\n",
" {\"chat_history\": \"\", \"question\": question, \"context\": docs}\n",
"):\n",
" print(text, end=\"\", flush=True)"
]
},
{
"cell_type": "markdown",
"id": "083d39ad",
"metadata": {},
"source": [
"### Define your LLM\n",
"\n",
"This hands-on only uses the `gpt-4o-mini`, but you can utilize multiple models in the pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "18d3a8f8",
"metadata": {},
"outputs": [],
"source": [
"from langchain_openai import AzureChatOpenAI\n",
"\n",
"llm = AzureChatOpenAI(model=azure_openai_chat_deployment_name, temperature=0)"
]
},
{
"cell_type": "markdown",
"id": "65ec10e5",
"metadata": {},
"source": [
"### Question-Retrieval Grader\n",
"\n",
"Construct a retrieval grader that evaluates the relevance of the retrieved documents to the input question. The retrieval grader should take the input question and the retrieved documents as input and output a relevance score for each document.<br>\n",
"Note that the retrieval grader should be able to handle **multiple documents** as input."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4c5c5203",
"metadata": {},
"outputs": [],
"source": [
"from pydantic import BaseModel, Field\n",
"from langchain_core.prompts import ChatPromptTemplate\n",
"\n",
"\n",
"class GradeDocuments(BaseModel):\n",
" \"\"\"Binary score for relevance check on retrieved documents.\"\"\"\n",
"\n",
" binary_score: str = Field(\n",
" description=\"Documents are relevant to the question, 'yes' or 'no'\"\n",
" )\n",
"\n",
"\n",
"# Custom class based evaluator for grading documents\n",
"class GradeDocumentsEvaluator:\n",
" system = \"\"\"You are a grader assessing relevance of a retrieved document to a user question. \\n \n",
" If the document contains keyword(s) or semantic meaning related to the user question, grade it as relevant. \\n\n",
" It does not need to be a stringent test. The goal is to filter out erroneous retrievals. \\n\n",
" Give a binary score 'yes' or 'no' score to indicate whether the document is relevant to the question.\"\"\"\n",
"\n",
" def __init__(self, llm_client):\n",
" self.llm_client = llm_client\n",
"\n",
" def __call__(self, *, question: str, document: str, **kwargs):\n",
" grade_prompt = ChatPromptTemplate.from_messages(\n",
" [\n",
" (\"system\", self.system),\n",
" (\n",
" \"human\",\n",
" \"Retrieved document: \\n\\n {document} \\n\\n User question: {question}\",\n",
" ),\n",
" ]\n",
" )\n",
" structured_llm_grader = self.llm_client.with_structured_output(GradeDocuments)\n",
" retrieval_grader = grade_prompt | structured_llm_grader\n",
" results = retrieval_grader.invoke({\"question\": question, \"document\": document})\n",
" return results\n",
"\n",
"\n",
"document_evaluator = GradeDocumentsEvaluator(llm)"
]
},
{
"cell_type": "markdown",
"id": "a5d56838",
"metadata": {},
"source": [
"Test the retrieval grader. For testing, we only show the result of the a single document, not the entire document set. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "60a08f33",
"metadata": {},
"outputs": [],
"source": [
"question = \"What is AutoGen's main features?\"\n",
"docs = pdf_retriever.invoke(question)\n",
"\n",
"retrieved_doc = docs[0].page_content\n",
"print(f\"[Retrieved Doc sample]\\n{retrieved_doc}\\n\")\n",
"print(document_evaluator(question=question, document=retrieved_doc))"
]
},
{
"cell_type": "markdown",
"id": "702b7f87",
"metadata": {},
"source": [
"### Answer Generator\n",
"\n",
"Construct a LLM Generation node. This is a Naive RAG chain that generates an answer based on the retrieved documents. \n",
"\n",
"We recommend you to use more advanced RAG chain for production"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d080f695",
"metadata": {},
"outputs": [],
"source": [
"from langchain import hub\n",
"from langchain_core.output_parsers import StrOutputParser\n",
"from langchain_core.prompts import load_prompt\n",
"\n",
"if has_langchain_key:\n",
" print(f\"Load prompt from LangChain Hub.\")\n",
" prompt = hub.pull(\"daekeun-ml/rag-baseline\")\n",
"else:\n",
" print(\"LANGCHAIN_API_KEY is not set. Load prompt from YAML file.\")\n",
" prompt = load_prompt(\"prompts/rag-baseline.yaml\")\n",
"\n",
"\n",
"def format_docs(docs):\n",
" return \"\\n\\n\".join(\n",
" [\n",
" f'<document><content>{doc.page_content}</content><source>{doc.metadata[\"source\"]}</source><page>{doc.metadata[\"page\"]+1}</page></document>'\n",
" for doc in docs\n",
" ]\n",
" )\n",
"\n",
"\n",
"rag_chain = prompt | llm | StrOutputParser()\n",
"generation = rag_chain.invoke({\"context\": format_docs(docs), \"question\": question})\n",
"print(generation)"
]
},
{
"cell_type": "markdown",
"id": "41d4050d",
"metadata": {},
"source": [
"### Groundedness Evaluator\n",
"\n",
"We can utilize Azure AI Evaluation API to evaluate the groundedness of the answer."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "608b59a5",
"metadata": {},
"outputs": [],
"source": [
"def get_groundedness_score(context, response):\n",
"\n",
" groundedness_eval = GroundednessEvaluator(model_config)\n",
" query_response = dict(\n",
" context=context,\n",
" response=response,\n",
" )\n",
"\n",
" # Running Groundedness Evaluator on a query and response pair\n",
" groundedness_score = groundedness_eval(**query_response)\n",
" return groundedness_score"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "be7d51b9",
"metadata": {},
"outputs": [],
"source": [
"display(get_groundedness_score(context=format_docs(docs), response=generation))"
]
},
{
"cell_type": "markdown",
"id": "54c773d2",
"metadata": {},
"source": [
"### Relevance Evaluator\n",
"\n",
"We can utilize Azure AI Evaluation API to evaluate the relevance of the answer."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "da799216",
"metadata": {},
"outputs": [],
"source": [
"def get_answer_relevace_score(query, response):\n",
"\n",
" relevance_eval = RelevanceEvaluator(model_config)\n",
" query_response = dict(\n",
" query=query,\n",
" response=response,\n",
" )\n",
" relevance_score = relevance_eval(**query_response)\n",
" return relevance_score"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2c84d9c3",
"metadata": {},
"outputs": [],
"source": [
"display(get_answer_relevace_score(query=question, response=generation))"
]
},
{
"cell_type": "markdown",
"id": "fd05275c",
"metadata": {},
"source": [
"### Question Re-writer\n",
"\n",
"Construct a `question_rewriter` node to rewrite the question based on the retrieved documents and the generated answer."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "806f6d04",
"metadata": {},
"outputs": [],
"source": [
"from langchain_core.prompts import ChatPromptTemplate\n",
"from langchain_core.output_parsers import StrOutputParser\n",
"\n",
"system = \"\"\"You a question re-writer that converts an input question to a better version that is optimized\n",
"for vectorstore retrieval. Look at the input and try to reason about the underlying semantic intent / meaning.\"\"\"\n",
"\n",
"re_write_prompt = ChatPromptTemplate.from_messages(\n",
" [\n",
" (\"system\", system),\n",
" (\n",
" \"human\",\n",
" \"Here is the initial question: \\n\\n {question} \\n Formulate an improved question.\",\n",
" ),\n",
" ]\n",
")\n",
"\n",
"question_rewriter = re_write_prompt | llm | StrOutputParser()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3e681966",
"metadata": {},
"outputs": [],
"source": [
"print(f\"[Original question] {question}\")\n",
"question_rewriter.invoke({\"question\": question})"
]
},
{
"cell_type": "markdown",
"id": "f08d720d",
"metadata": {},
"source": [
"<br>\n",
"\n",
"## 🧪 Step 2. Define the Graph\n",
"---\n",
"\n",
"### State Definition\n",
"\n",
"- `question`: Question from the user\n",
"- `generation`: Generated answer\n",
"- `documents`: Retrieved documents"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "52775711",
"metadata": {},
"outputs": [],
"source": [
"from typing import List\n",
"from typing_extensions import TypedDict, Annotated\n",
"\n",
"\n",
"class GraphState(TypedDict):\n",
" question: Annotated[str, \"Question\"]\n",
" generation: Annotated[str, \"LLM Generation\"]\n",
" documents: Annotated[List[str], \"Retrieved Documents\"]"
]
},
{
"cell_type": "markdown",
"id": "78a0836f",
"metadata": {},
"source": [
"### Define Nodes\n",
"\n",
"We will define the following nodes in the graph:\n",
"\n",
"- `retrieve`: Retrieve documents based on the user question.\n",
"- `grade_documents`: Generate an answer based on the retrieved documents and user question.\n",
"- `generate`: Grade documents based on their relevance to the user question.\n",
"- `rewrite_query`: Rewrite the user question to improve retrieval performance.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4cbd12b6",
"metadata": {},
"outputs": [],
"source": [
"def retrieve(state: GraphState):\n",
" \"\"\"\n",
" Retrieve documents based on the user question.\n",
" \"\"\"\n",
" print(\"==== [RETRIEVE] ====\")\n",
" question = state[\"question\"]\n",
"\n",
" documents = pdf_retriever.invoke(question)\n",
" return {\"documents\": documents}\n",
"\n",
"\n",
"def generate(state: GraphState):\n",
" \"\"\"Generate an answer based on the retrieved documents and user question.\"\"\"\n",
" print(\"==== [GENERATE] ====\")\n",
" question = state[\"question\"]\n",
" documents = state[\"documents\"]\n",
"\n",
" generation = rag_chain.invoke({\"context\": documents, \"question\": question})\n",
" return {\"generation\": generation}\n",
"\n",
"\n",
"def grade_documents(state: GraphState):\n",
" \"\"\"Grade documents based on their relevance to the user question.\"\"\"\n",
" print(\"==== [GRADE DOCUMENTS] ====\")\n",
" question = state[\"question\"]\n",
" documents = state[\"documents\"]\n",
"\n",
" filtered_docs = []\n",
" relevant_doc_count = 0\n",
"\n",
" for d in documents:\n",
" score = document_evaluator(question=question, document=d.page_content)\n",
" grade = score.binary_score\n",
" if grade == \"yes\":\n",
" # Add related documents to filtered_docs\n",
" print(\"==== GRADE: DOCUMENT RELEVANT ====\")\n",
" filtered_docs.append(d)\n",
" relevant_doc_count += 1\n",
" else:\n",
" print(\"==== GRADE: DOCUMENT NOT RELEVANT ====\")\n",
" continue\n",
" return {\"documents\": filtered_docs}\n",
"\n",
"\n",
"def rewrite_query(state: GraphState):\n",
" \"\"\"Rewrite the user question to improve retrieval performance.\"\"\"\n",
" print(\"\\n==== [REWRITE QUERY] ====\\n\")\n",
" question = state[\"question\"]\n",
"\n",
" better_question = question_rewriter.invoke({\"question\": question})\n",
" return {\"question\": better_question}"
]
},
{
"cell_type": "markdown",
"id": "1ef3da3f",
"metadata": {},
"source": [
"### Define Conditional Nodes\n",
"\n",
"- `decide_to_generate`: Decide whether to generate an answer based on the retrieved documents. \n",
"- `grade_generation_v_documents_and_question`: Grade the generated answer based on its relevance to the user question and the retrieved documents."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "87d1f724",
"metadata": {},
"outputs": [],
"source": [
"def decide_to_generate(state):\n",
" \"\"\"\n",
" Assess whether to generate an answer based on the relevance of the retrieved documents to the user question\n",
" \"\"\"\n",
" print(\"==== [ASSESS GRADED DOCUMENTS] ====\")\n",
" state[\"question\"]\n",
" filtered_documents = state[\"documents\"]\n",
"\n",
" if not filtered_documents:\n",
" # If all documents are not relevant to the question, rewrite the query\n",
" print(\n",
" \"==== [DECISION: ALL DOCUMENTS ARE NOT RELEVANT TO QUESTION, REWRITE QUERY] ====\"\n",
" )\n",
" return \"rewrite_query\"\n",
" else:\n",
" # If there are relevant documents, generate an answer\n",
" print(\"==== [DECISION: GENERATE] ====\")\n",
" return \"generate\"\n",
"\n",
"\n",
"def grade_generation_v_documents_and_question(state):\n",
" \"\"\"\n",
" Grade the relevance of the generated answer to the user question and retrieved documents.\n",
" \"\"\"\n",
" print(\"==== [CHECK HALLUCINATIONS] ====\")\n",
" question = state[\"question\"]\n",
" documents = state[\"documents\"]\n",
" generation = state[\"generation\"]\n",
"\n",
" groundedness_score = get_groundedness_score(context=documents, response=generation)\n",
" grade = groundedness_score[\"groundedness\"]\n",
" print(f\"Groundness_score (1-5; higher is better): {grade}\\n\")\n",
"\n",
" # Groundedness check\n",
" if grade >= 4:\n",
" print(\"==== [DECISION: GENERATION IS GROUNDED IN DOCUMENTS] ====\")\n",
" print(\"==== [GRADE GENERATION vs QUESTION] ====\")\n",
" relevance_score = get_answer_relevace_score(query=question, response=generation)\n",
" grade = relevance_score[\"relevance\"]\n",
" if grade >= 4:\n",
" print(\n",
" f\"==== [DECISION: GENERATED ANSWER ADDRESSES QUESTION, Relevance Score {grade}] ====\"\n",
" )\n",
" return \"relevant\"\n",
" else:\n",
" print(\"==== [DECISION: GENERATED ANSWER DOES NOT ADDRESS QUESTION] ====\")\n",
" return \"not relevant\"\n",
" else:\n",
" print(\"==== [DECISION: GENERATION IS NOT GROUNDED IN DOCUMENTS, RE-TRY] ====\")\n",
" return \"hallucination\""
]
},
{
"cell_type": "markdown",
"id": "ee9a8d0e",
"metadata": {},
"source": [
"### Construct the Graph"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "204852a3",
"metadata": {},
"outputs": [],
"source": [
"from langgraph.graph import END, StateGraph, START\n",
"from langgraph.checkpoint.memory import MemorySaver\n",
"\n",
"workflow = StateGraph(GraphState)\n",
"\n",
"# Node definition\n",
"workflow.add_node(\"retrieve\", retrieve)\n",
"workflow.add_node(\"grade_documents\", grade_documents)\n",
"workflow.add_node(\"generate\", generate)\n",
"workflow.add_node(\"rewrite_query\", rewrite_query)\n",
"\n",
"# Edge connections\n",
"workflow.add_edge(START, \"retrieve\")\n",
"workflow.add_edge(\"retrieve\", \"grade_documents\")\n",
"workflow.add_conditional_edges(\n",
" \"grade_documents\",\n",
" decide_to_generate,\n",
" {\n",
" \"rewrite_query\": \"rewrite_query\",\n",
" \"generate\": \"generate\",\n",
" },\n",
")\n",
"workflow.add_edge(\"rewrite_query\", \"retrieve\")\n",
"workflow.add_conditional_edges(\n",
" \"generate\",\n",
" grade_generation_v_documents_and_question,\n",
" {\n",
" \"hallucination\": \"generate\",\n",
" \"relevant\": END,\n",
" \"not relevant\": \"rewrite_query\",\n",
" },\n",
")\n",
"\n",
"# Compile the workflow\n",
"app = workflow.compile(checkpointer=MemorySaver())"
]
},
{
"cell_type": "markdown",
"id": "8ffa3bb2",
"metadata": {},
"source": [
"### Visualize the graph"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4cff2c2a",
"metadata": {},
"outputs": [],
"source": [
"from azure_genai_utils.graphs import visualize_langgraph\n",
"\n",
"visualize_langgraph(app, xray=True)"
]
},
{
"cell_type": "markdown",
"id": "2d8c305a",
"metadata": {},
"source": [
"<br>\n",
"\n",
"## 🧪 Step 3. Execute the Graph\n",
"---\n",
"\n",
"### Execute the graph"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bc37cfd2",
"metadata": {},
"outputs": [],
"source": [
"from langchain_core.runnables import RunnableConfig\n",
"from langgraph.errors import GraphRecursionError\n",
"from azure_genai_utils.messages import stream_graph, invoke_graph, random_uuid\n",
"\n",
"config = RunnableConfig(recursion_limit=10, configurable={\"thread_id\": random_uuid()})\n",
"\n",
"inputs = {\n",
" \"question\": \"What is AutoGen's main features?\",\n",
"}\n",
"\n",
"try:\n",
" stream_graph(\n",
" app,\n",
" inputs,\n",
" config,\n",
" [\"retrieve\", \"rewrite_query\", \"grade_documents\", \"generate\"],\n",
" )\n",
"except GraphRecursionError as recursion_error:\n",
" print(f\"GraphRecursionError: {recursion_error}\")"
]
},
{
"cell_type": "markdown",
"id": "218f12e3",
"metadata": {},
"source": [
"### Define the Failure Condition\n",
"\n",
"The below execution graph shows a recursive state where the graph keeps generating answers for non-related questions without providing a satisfactory response to the user.<br>\n",
"To prevent this, you can define a web search node that searches for related questions and provides a list of related questions to the user.\n",
"\n",
"Corrective-RAG (CRAG) is a similar approach that focuses on refining the entire retrieval process, including web search, to ensure that the generation is based on the most relevant and accurate information available."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "20b4db9c",
"metadata": {},
"outputs": [],
"source": [
"from langchain_core.runnables import RunnableConfig\n",
"from langgraph.errors import GraphRecursionError\n",
"from azure_genai_utils.messages import stream_graph, invoke_graph, random_uuid\n",
"\n",
"config = RunnableConfig(recursion_limit=10, configurable={\"thread_id\": random_uuid()})\n",
"\n",
"inputs = {\n",
" \"question\": \"Who is Daekeun?\",\n",
"}\n",
"\n",
"try:\n",
" stream_graph(\n",
" app,\n",
" inputs,\n",
" config,\n",
" [\"retrieve\", \"rewrite_query\", \"grade_documents\", \"generate\"],\n",
" )\n",
"except GraphRecursionError as recursion_error:\n",
" print(f\"GraphRecursionError: {recursion_error}\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv_agent",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}