notebooks/zh-CN/rag_llamaindex

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 用 LlamaIndex 构建一个 RAG 电子书库智能助手\n", "\n", "_作者: [Jonathan Jin](https://huggingface.co/jinnovation)_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 简介\n", "\n", "这份教程将指导你如何快速为你的电子书库创建一个基于 RAG 图书助手。\n", "就像图书馆的图书管理员帮你找书一样，这个助手也能帮你从你的电子书里找到你需要的书。\n", "\n", "## 要求\n", "这个助手要做得**轻巧**，**尽量在本地运行**，而且**不要用太多其他的东西**。我们会尽量用免费的开源软件，选择那种在**本地普通电脑上，比如 M1 型号的 MacBook 上就能运行的模型**。\n", "\n", "## 组件\n", "我们的解决方案将包括以下组件：\n", "- [LlamaIndex]，一个基于LLM的应用数据框架，与 [LangChain] 不同，它是专门为 RAG 设计的；\n", "- [Ollama]，一个简单易用的工具，可以让你在本地运行语言模型，比如Llama 2；\n", "- [`BAAI/bge-base-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5) 嵌入模型，它的表现[相当好，并且大小适中](https://huggingface.co/spaces/mteb/leaderboard)；\n", "- [Llama 2]，我们将通过 [Ollama] 运行它。\n", "\n", "[LlamaIndex]: https://docs.llamaindex.ai/en/stable/index.html\n", "[LangChain]: https://python.langchain.com/docs/get_started/introduction\n", "[Ollama]: https://ollama.com/\n", "[Llama 2]: https://ollama.com/library/llama2\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 依赖\n", "\n", "首先安装依赖库" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install -q \\\n", " llama-index \\\n", " EbookLib \\\n", " html2text \\\n", " llama-index-embeddings-huggingface \\\n", " llama-index-llms-ollama" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!brew install ollama" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 设置测试书库\n", "\n", "我们接下来要弄个测试用的“书库”。\n", "\n", "简单点说，我们的“书库”就是一个放有 `.epub` 格式电子书文件的**文件夹**。这个方法很容易就能扩展到像 Calibre 那种带有个 `metadata.db` 数据库文件的书库。怎么扩展这个问题，我们留给读者自己思考。😇\n", "\n", "现在，我们先从[古腾堡计划网站](https://www.gutenberg.org/)下载两本`.epub`格式的电子书放到我们的书库里。\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!mkdir -p \".test/library/jane-austen\"\n", "!mkdir -p \".test/library/victor-hugo\"\n", "!wget https://www.gutenberg.org/ebooks/1342.epub.noimages -O \".test/library/jane-austen/pride-and-prejudice.epub\"\n", "!wget https://www.gutenberg.org/ebooks/135.epub.noimages -O \".test/library/victor-hugo/les-miserables.epub\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 用 LlamaIndex 构建 RAG\n", "\n", "使用 LlamaIndex 的 RAG 主要包括以下三个阶段：\n", "\n", "1. **加载**，在这个阶段你告诉 LlamaIndex 你的数据在哪里以及如何加载它；\n", "2. **索引**，在这个阶段你扩充加载的数据以方便查询，例如使用向量嵌入；\n", "3. **查询**，在这个阶段你配置一个 LLM 作为你索引数据的查询接口。\n", "\n", "以上解释仅是对 LlamaIndex 可实现功能的表面说明。要想了解更多深入细节，我强烈推荐阅读 LlamaIndex 文档中的[\"高级概念\"页面](https://docs.llamaindex.ai/en/stable/getting_started/concepts.html)。\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 加载\n", "\n", "好的，我们首先从**加载**阶段开始。\n", "\n", "之前提到，LlamaIndex 是专为 RAG 这种混合检索生成模型设计的。这一点从它的[`SimpleDirectoryReader`](https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader.html)功能就可以明显看出，它能**神奇地**免费支持很多种文件类型。幸运的是， `.epub` 文件格式也在支持范围内。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from llama_index.core import SimpleDirectoryReader\n", "\n", "loader = SimpleDirectoryReader(\n", " input_dir=\"./.test/\",\n", " recursive=True,\n", " required_exts=[\".epub\"],\n", ")\n", "\n", "documents = loader.load_data()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`SimpleDirectoryReader.load_data()` 将我们的电子书转换成一组 LlamaIndex 可以处理的 [`Document`s](https://docs.llamaindex.ai/en/stable/api/llama_index.core.schema.Document.html)。\n", "\n", "这里有一个重要的事情要注意，就是这个阶段的文档**还没有被分块**——这将在索引阶段进行。继续往下看...\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 索引\n", "\n", "\n", "在把数据**加载**进来之后，接下来我们要做的是**建立索引**。这样我们的 RAG 系统就能找到与用户查询相关的信息，然后把这些信息传给语言模型（LLM），以便它能够**增强**回答的内容。同时，这一步也将对文档进行分块。\n", "\n", "在 LlamaIndex 中，[`VectorStoreIndex`](https://docs.llamaindex.ai/en/stable/module_guides/indexing/vector_store_index.html) 是用来建立索引的一个“默认”工具。这个工具默认使用一个简单、基于内存的字典来保存索引，但随着你的使用规模扩大，LlamaIndex 还支持\n", "[多种向量存储解决方案](https://docs.llamaindex.ai/en/stable/module_guides/storing/vector_stores.html)。\n", "\n", "<Tip>\n", "LlamaIndex 默认的块大小是 1024 个字符，块与块之间有 20 个字符的重叠。如果需要了解更多细节，可以查看 [LlamaIndex 的文档](https://docs.llamaindex.ai/en/stable/optimizing/basic_strategies/basic_strategies.html#chunk-sizes)。\n", "</Tip>\n", "\n", "如前所述，我们选择使用 [`BAAI/bge-small-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5) 生成嵌入，以避免使用默认的 OpenAI（特别是 gpt-3.5-turbo）模型，因为我们需要一个轻量级、可在本地运行的完整解决方案。\n", "\n", "幸运的是，LlamaIndex 可以通过 `HuggingFaceEmbedding` 类方便地从 Hugging Face 获取嵌入模型，因此我们将使用它。\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n", "\n", "embedding_model = HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "我们将把这个模型传递给 `VectorStoreIndex`，作为我们的嵌入模型，以绕过 OpenAI 的默认行为。" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from llama_index.core import VectorStoreIndex\n", "\n", "index = VectorStoreIndex.from_documents(\n", " documents,\n", " embed_model=embedding_model,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 查询\n", "\n", "现在我们要完成 RAG 智能助手的最后一部分——设置查询接口。\n", "\n", "在这个教程中，我们将使用 Llama 2 作为语言模型，但我建议你试试不同的模型，看看哪个能给出最好的回答。\n", "\n", "首先，我们需要在一个新的终端窗口中启动 Ollama 服务器。不过，[Ollama 的 Python 客户端](https://github.com/ollama/ollama-python)不支持直接启动和关闭服务器，所以我们需要在 Python 环境之外操作。\n", "\n", "打开一个新的终端窗口，输入命令：`ollama serve`。等我们这里操作完成后，别忘了关闭服务器！\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "现在，让我们将 Llama 2 连接到 LlamaIndex，并使用它作为我们查询引擎的基础。" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from llama_index.llms.ollama import Ollama\n", "\n", "llama = Ollama(\n", " model=\"llama2\",\n", " request_timeout=40.0,\n", ")\n", "\n", "query_engine = index.as_query_engine(llm=llama)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 最终结果 \n", "\n", "有了这些，我们的基本的 RAG 电子书库智能助手就设置好了，我们可以开始询问有关我们电子书库的问题了。例如：\n" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Based on the context provided, there are two books available:\n", "\n", "1. \"Pride and Prejudice\" by Jane Austen\n", "2. \"Les Misérables\" by Victor Hugo\n", "\n", "The context used to derive this answer includes:\n", "\n", "* The file path for each book, which provides information about the location of the book files on the computer.\n", "* The titles of the books, which are mentioned in the context as being available for reading.\n", "* A list of words associated with each book, such as \"epub\" and \"notebooks\", which provide additional information about the format and storage location of each book.\n" ] } ], "source": [ "print(query_engine.query(\"What are the titles of all the books available? Show me the context used to derive your answer.\"))" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The main character of 'Pride and Prejudice' is Elizabeth Bennet.\n" ] } ], "source": [ "print(query_engine.query(\"Who is the main character of 'Pride and Prejudice'?\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 总结和未来可能的提升\n", "\n", "\n", "我们成功地展示了如何创建一个完全在本地运行的基本 RAG 的电子书库智能助手，甚至在苹果的 Apple silicon Macs 上也能运行。在这个过程中，我们还全面了解了 LlamaIndex 是如何帮助我们简化建立基于 RAG 的应用程序的。\n", "\n", "尽管如此，我们其实只是接触到了一些皮毛。下面是一些关于如何改进和在这个基础上进一步发展的想法。\n", "\n", "### 强制引用\n", "\n", "为了避免图书馆员的虚构响应，我们怎样才能要求它为其回答提供引用？\n", "\n", "### 使用扩充的元数据\n", "\n", "像 [Calibre](https://calibre-ebook.com/) 这样的电子书库管理工具会为电子书创建更多的元数据。这些元数据可以提供一些在书中文本里找不到的信息，比如出版商或版本。我们怎样才能扩展我们的 RAG 流程，使其也能利用那些不是 .epub 文件的额外信息源呢？\n", "\n", "\n", "### 高效索引\n", "\n", "如果我们把这里做的所有东西写成一个脚本或可执行程序，那么每次运行这个脚本时，它都会重新索引我们的图书馆。对于只有两个文件的微型测试库来说，这样还行，但对于稍大一点的图书馆来说，每次都重新索引会让用户感到非常烦恼。我们怎样才能让索引持久化，并且只在图书馆内容有重要变化时，比如添加了新书，才去更新索引呢？" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.8" } }, "nbformat": 4, "nbformat_minor": 2 }

notebooks/zh-CN/rag_llamaindex_librarian.ipynb (340 lines of code) (raw):