notebooks/zh-CN/rag_with_knowledge_graphs

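{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Enhancing RAG Reasoning with Knowledge Graphs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_Authored by: [Diego Carpintero](https://github.com/dcarpintero)_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Knowledge graphs provide a way to model and store interlinked information in a format that both humans and machines can understand. These graphs consist of *nodes* and *edges*, which represent entities and their relationships, respectively. Unlike traditional databases, the inherent expressiveness of graphs allows for richer semantic understanding, while providing the flexibility to accommodate new entity types and relationships without being constrained by a fixed schema.\n", "\n", "By combining knowledge graphs with embeddings (vector search), we can leverage *multi-hop connectivity* and *contextual understanding of information* to enhance the reasoning and explainability of large language models (LLMs).\n", "\n", "This notebook explores the practical implementation of this approach, demonstrating how to:\n", "- build a knowledge graph of research publications in [Neo4j](https://neo4j.com/docs/) using a synthetic dataset,\n", "- project a subset of our data fields into a high-dimensional vector space using an [embedding model](https://python.langchain.com/v0.2/docs/integrations/text_embedding/),\n", "- construct a vector index on those embeddings to enable similarity search,\n", "- extract insights from our graph using natural language, with [LangChain](https://python.langchain.com/v0.2/docs/introduction/) converting user queries into [Cypher](https://neo4j.com/docs/cypher-manual/current/introduction/) statements:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " <img src=\"https://raw.githubusercontent.com/dcarpintero/generative-ai-101/main/static/knowledge-graphs.png\">\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Initialization" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install neo4j langchain langchain_openai langchain-community python-dotenv --quiet" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setting up a Neo4j instance\n", "\n", "We will create our knowledge graph using [Neo4j](https://neo4j.com/docs/), an open-source database management system that specializes in graph database technology.\n", "\n", "For a quick and easy setup, you can start a free instance on [Neo4j Aura](https://neo4j.com/product/auradb/).\n", "\n", "You can then set `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD` as environment variables using a `.env` file:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import dotenv\n", "dotenv.load_dotenv('.env', override=True)" ] }, { "cell_type": "markdown", "metadata": {}, 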
"source": [ "LangChain 提供了 `Neo4jGraph` 类来与 Neo4j 进行交互：" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import os\n", "from langchain_community.graphs import Neo4jGraph\n", "\n", "graph = Neo4jGraph(\n", " url=os.environ['NEO4J_URI'], \n", " username=os.environ['NEO4J_USERNAME'],\n", " password=os.environ['NEO4J_PASSWORD'],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 将数据集加载到图谱中\n", "\n", "以下示例演示了如何与我们的 `Neo4j` 数据库建立连接，并使用[合成数据](https://github.com/dcarpintero/generative-ai-101/blob/main/dataset/synthetic_articles.csv)填充它，这些数据包括研究文章及其作者。\n", "\n", "实体包括：\n", "- *研究人员*（Researcher）\n", "- *文章*（Article）\n", "- *主题*（Topic）\n", "\n", "关系包括：\n", "- *研究人员* --[PUBLISHED]--> *文章*\n", "- *文章* --[IN_TOPIC]--> *主题*" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from langchain_community.graphs import Neo4jGraph\n", "\n", "graph = Neo4jGraph()\n", "\n", "q_load_articles = \"\"\"\n", "LOAD CSV WITH HEADERS\n", "FROM 'https://raw.githubusercontent.com/dcarpintero/generative-ai-101/main/dataset/synthetic_articles.csv' \n", "AS row \n", "FIELDTERMINATOR ';'\n", "MERGE (a:Article {title:row.Title})\n", "SET a.abstract = row.Abstract,\n", " a.publication_date = date(row.Publication_Date)\n", "FOREACH (researcher in split(row.Authors, ',') | \n", " MERGE (p:Researcher {name:trim(researcher)})\n", " MERGE (p)-[:PUBLISHED]->(a))\n", "FOREACH (topic in [row.Topic] | \n", " MERGE (t:Topic {name:trim(topic)})\n", " MERGE (a)-[:IN_TOPIC]->(t))\n", "\"\"\"\n", "\n", "graph.query(q_load_articles)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "让我们检查节点和关系是否已正确初始化：" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Node properties:\n", "Article {title: STRING, abstract: 
STRING, publication_date: DATE, embedding: LIST}\n", "Researcher {name: STRING}\n", "Topic {name: STRING}\n", "Relationship properties:\n", "\n", "The relationships:\n", "(:Article)-[:IN_TOPIC]->(:Topic)\n", "(:Researcher)-[:PUBLISHED]->(:Article)\n" ] } ], "source": [ "graph.refresh_schema()\n", "print(graph.get_schema)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can inspect our knowledge graph in the Neo4j workspace:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " <img src=\"https://raw.githubusercontent.com/dcarpintero/generative-ai-101/main/static/kg_sample_00.png\">\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Building a vector index\n", "\n", "Now we build a vector index to efficiently search for relevant *articles* based on their *topic, title, and abstract*. This involves computing an embedding for each article from these fields. At query time, the system finds the articles most similar to the user's input using a similarity metric such as cosine similarity." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from langchain_community.vectorstores import Neo4jVector\n", "from langchain_openai import OpenAIEmbeddings\n", "\n", "vector_index = Neo4jVector.from_existing_graph(\n", " OpenAIEmbeddings(),\n", " url=os.environ['NEO4J_URI'],\n", " username=os.environ['NEO4J_USERNAME'],\n", " password=os.environ['NEO4J_PASSWORD'],\n", " index_name='articles',\n", " node_label=\"Article\",\n", " text_node_properties=['topic', 'title', 'abstract'],\n", " embedding_node_property='embedding',\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** To access OpenAI embedding models, you need to create an OpenAI account, obtain an API key, and set `OPENAI_API_KEY` as an environment variable. You can also experiment with other [embedding model](https://python.langchain.com/v0.2/docs/integrations/text_embedding/) integrations and compare the results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Similarity-based Q&A" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "LangChain's `RetrievalQA` creates a question-answering (QA) chain that uses the vector index above as a retriever." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from langchain.chains import RetrievalQA\n", "from langchain_openai import ChatOpenAI\n", "\n", "vector_qa = 
RetrievalQA.from_chain_type(\n", " llm=ChatOpenAI(),\n", " chain_type=\"stuff\",\n", " retriever=vector_index.as_retriever()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's ask '*which articles discuss how AI might affect our daily life?*':" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The articles that discuss how AI might affect our daily life are:\n", "\n", "1. **The Impact of AI on Employment: A Comprehensive Study**\n", " *Abstract:* This study analyzes the potential effects of AI on various job sectors and suggests policy recommendations to mitigate negative impacts.\n", "\n", "2. **The Societal Implications of Advanced AI: A Multidisciplinary Analysis**\n", " *Abstract:* Our study brings together experts from various fields to analyze the potential long-term impacts of advanced AI on society, economy, and culture.\n", "\n", "These two articles would provide insights into how AI could potentially impact our daily lives from different perspectives.\n" ] } ], "source": [ "r = vector_qa.invoke(\n", " {\"query\": \"which articles discuss how AI might affect our daily life? 
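include the article titles and abstracts.\"}\n", ")\n", "print(r['result'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reasoning over knowledge graphs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Knowledge graphs are well suited to making connections between entities, which enables the extraction of patterns and the discovery of new insights.\n", "\n", "This section demonstrates how to implement this process and integrate the results into an LLM pipeline using natural language queries." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Graph-Cypher-Chain with LangChain\n", "\n", "To construct expressive and efficient queries, `Neo4j` uses `Cypher`, a declarative query language inspired by SQL. `LangChain` provides the wrapper `GraphCypherQAChain`, an abstraction layer that allows querying graph databases using natural language, making it easier to integrate graph-based data retrieval into LLM pipelines.\n", "\n", "In practice, `GraphCypherQAChain`:\n", "- generates Cypher statements (queries for graph databases such as Neo4j) from user input (natural language) by applying in-context learning (prompt engineering),\n", "- executes those statements against the graph database, and\n", "- provides the results as context, grounding the LLM responses in accurate, up-to-date information.\n", "\n", "**Note:** This implementation executes model-generated graph queries, which carries inherent risks such as unintended access to or modification of sensitive data in the database. To mitigate these risks, scope your database connection permissions as narrowly as possible to the specific needs of your chain/agent. This reduces the risk but does not eliminate it." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "from langchain.chains import GraphCypherQAChain\n", "from langchain_openai import ChatOpenAI\n", "\n", "graph.refresh_schema()\n", "\n", "cypher_chain = GraphCypherQAChain.from_llm(\n", " cypher_llm = ChatOpenAI(temperature=0, model_name='gpt-4o'),\n", " qa_llm = ChatOpenAI(temperature=0, model_name='gpt-4o'), \n", " graph=graph,\n", " verbose=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Query samples using natural language" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note in the following examples how the results of the Cypher query execution are provided as context to the LLM:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **\"*How many articles has Emily Chen published?*\"**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, our question '*How many articles has Emily Chen published?*' is translated into the following Cypher query:\n", "\n", "```\n", "MATCH (r:Researcher {name: \"Emily Chen\"})-[:PUBLISHED]->(a:Article)\n", "RETURN COUNT(a) AS numberOfArticles\n", "```\n", "\n", "The query matches the `Researcher` node named 'Emily Chen' and traverses its `PUBLISHED` relationships to `Article` nodes. It then counts the `Article` nodes connected to 'Emily Chen'.\n", "\n", 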
"执行查询后，结果将作为上下文提供给 LLM，LLM 基于这个上下文来生成回答。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " <img src=\"https://raw.githubusercontent.com/dcarpintero/generative-ai-101/main/static/kg_sample_01.png\" width=\"40%\">\n", "" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\u001b[1m> Entering new GraphCypherQAChain chain...\u001b[0m\n", "Generated Cypher:\n", "\u001b[32;1m\u001b[1;3mcypher\n", "MATCH (r:Researcher {name: \"Emily Chen\"})-[:PUBLISHED]->(a:Article)\n", "RETURN COUNT(a) AS numberOfArticles\n", "\u001b[0m\n", "Full Context:\n", "\u001b[32;1m\u001b[1;3m[{'numberOfArticles': 7}]\u001b[0m\n", "\n", "\u001b[1m> Finished chain.\u001b[0m\n" ] }, { "data": { "text/plain": [ "{'query': 'How many articles has published Emily Chen?',\n", " 'result': 'Emily Chen has published 7 articles.'}" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the answer should be '7'\n", "cypher_chain.invoke(\n", " {\"query\": \"How many articles has published Emily Chen?\"}\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **\"*是否有任何一对研究人员共同发布了超过三篇文章？*\"**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在这个示例中，查询“*是否有任何一对研究人员共同发布了超过三篇文章？*”将结果转换为以下 Cypher 查询：\n", "\n", "```\n", "MATCH (r1:Researcher)-[:PUBLISHED]->(a:Article)<-[:PUBLISHED]-(r2:Researcher)\n", "WHERE r1 <> r2\n", "WITH r1, r2, COUNT(a) AS sharedArticles\n", "WHERE sharedArticles > 3\n", "RETURN r1.name, r2.name, sharedArticles\n", "```\n", "\n", "该查询首先从 `Researcher` 节点出发，遍历 `PUBLISHED` 关系，找到与之连接的 `Article` 节点，然后再次遍历，查找与另一位 `Researcher` 节点的连接。通过这种方式，查询找出那些共同发表了超过三篇文章的研究人员对。最终，结果将作为上下文提供给 LLM，LLM 会基于这个上下文生成回答。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " <img src=\"https://raw.githubusercontent.com/dcarpintero/generative-ai-101/main/static/kg_sample_02.png\">\n", "" ] }, { "cell_type": "code", 
"execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\u001b[1m> Entering new GraphCypherQAChain chain...\u001b[0m\n", "Generated Cypher:\n", "\u001b[32;1m\u001b[1;3mcypher\n", "MATCH (r1:Researcher)-[:PUBLISHED]->(a:Article)<-[:PUBLISHED]-(r2:Researcher)\n", "WHERE r1 <> r2\n", "WITH r1, r2, COUNT(a) AS sharedArticles\n", "WHERE sharedArticles > 3\n", "RETURN r1.name, r2.name, sharedArticles\n", "\u001b[0m\n", "Full Context:\n", "\u001b[32;1m\u001b[1;3m[{'r1.name': 'David Johnson', 'r2.name': 'Emily Chen', 'sharedArticles': 4}, {'r1.name': 'Robert Taylor', 'r2.name': 'Emily Chen', 'sharedArticles': 4}, {'r1.name': 'Emily Chen', 'r2.name': 'David Johnson', 'sharedArticles': 4}, {'r1.name': 'Emily Chen', 'r2.name': 'Robert Taylor', 'sharedArticles': 4}]\u001b[0m\n", "\n", "\u001b[1m> Finished chain.\u001b[0m\n" ] }, { "data": { "text/plain": [ "{'query': 'are there any pair of researchers who have published more than three articles together?',\n", " 'result': 'Yes, David Johnson and Emily Chen, as well as Robert Taylor and Emily Chen, have published more than three articles together.'}" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the answer should be David Johnson & Emily Chen, Robert Taylor & Emily Chen\n", "cypher_chain.invoke(\n", " {\"query\": \"are there any pair of researchers who have published more than three articles together?\"}\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### **\"*哪位研究人员与最多的同行合作过？*\"**\n", "\n", "让我们找出哪位研究人员与最多的同行合作过。 \n", "我们的查询“*哪位研究人员与最多的同行合作过？*”现在转换为以下 Cypher 查询：\n", "\n", "```\n", "MATCH (r:Researcher)-[:PUBLISHED]->(:Article)<-[:PUBLISHED]-(peer:Researcher)\n", "WITH r, COUNT(DISTINCT peer) AS peerCount\n", "RETURN r.name AS researcher, peerCount\n", "ORDER BY peerCount DESC\n", "LIMIT 1\n", "```\n", "\n", "在这个查询中，我们从所有 `Researcher` 节点出发，遍历它们的 `PUBLISHED` 关系，找到与之连接的 `Article` 节点。对于每个 
`Article` 节点，Neo4j 会继续回溯，查找那些也发表了同一篇文章的其他 `Researcher` 节点（同行）。通过这种方式，查询能够计算出每位研究人员与多少位同行合作，并按合作次数降序排列，最终返回合作最多的研究人员及其同行数量。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", " <img src=\"https://raw.githubusercontent.com/dcarpintero/generative-ai-101/main/static/kg_sample_03.png\">\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\u001b[1m> Entering new GraphCypherQAChain chain...\u001b[0m\n", "Generated Cypher:\n", "\u001b[32;1m\u001b[1;3mcypher\n", "MATCH (r1:Researcher)-[:PUBLISHED]->(:Article)<-[:PUBLISHED]-(r2:Researcher)\n", "WHERE r1 <> r2\n", "WITH r1, COUNT(DISTINCT r2) AS collaborators\n", "RETURN r1.name AS researcher, collaborators\n", "ORDER BY collaborators DESC\n", "LIMIT 1\n", "\u001b[0m\n", "Full Context:\n", "\u001b[32;1m\u001b[1;3m[{'researcher': 'David Johnson', 'collaborators': 6}]\u001b[0m\n", "\n", "\u001b[1m> Finished chain.\u001b[0m\n" ] }, { "data": { "text/plain": [ "{'query': 'Which researcher has collaborated with the most peers?',\n", " 'result': 'David Johnson has collaborated with 6 peers.'}" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# the answer should be 'David Johnson'\n", "cypher_chain.invoke(\n", " {\"query\": \"Which researcher has collaborated with the most peers?\"}\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 2 }
