notebooks/zh-CN/rag_llamaindex_librarian.ipynb (340 lines of code) (raw):

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 用 LlamaIndex 构建一个 RAG 电子书库智能助手\n", "\n", "_作者: [Jonathan Jin](https://huggingface.co/jinnovation)_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 简介\n", "\n", "这份教程将指导你如何快速为你的电子书库创建一个基于 RAG 图书助手。\n", "就像图书馆的图书管理员帮你找书一样,这个助手也能帮你从你的电子书里找到你需要的书。\n", "\n", "## 要求\n", "这个助手要做得**轻巧**,**尽量在本地运行**,而且**不要用太多其他的东西**。我们会尽量用免费的开源软件,选择那种在**本地普通电脑上,比如 M1 型号的 MacBook 上就能运行的模型**。\n", "\n", "## 组件\n", "我们的解决方案将包括以下组件:\n", "- [LlamaIndex],一个基于LLM的应用数据框架,与 [LangChain] 不同,它是专门为 RAG 设计的;\n", "- [Ollama],一个简单易用的工具,可以让你在本地运行语言模型,比如Llama 2;\n", "- [`BAAI/bge-base-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5) 嵌入模型,它的表现[相当好,并且大小适中](https://huggingface.co/spaces/mteb/leaderboard);\n", "- [Llama 2],我们将通过 [Ollama] 运行它。\n", "\n", "[LlamaIndex]: https://docs.llamaindex.ai/en/stable/index.html\n", "[LangChain]: https://python.langchain.com/docs/get_started/introduction\n", "[Ollama]: https://ollama.com/\n", "[Llama 2]: https://ollama.com/library/llama2\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 依赖\n", "\n", "首先安装依赖库" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install -q \\\n", " llama-index \\\n", " EbookLib \\\n", " html2text \\\n", " llama-index-embeddings-huggingface \\\n", " llama-index-llms-ollama" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!brew install ollama" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 设置测试书库\n", "\n", "我们接下来要弄个测试用的“书库”。\n", "\n", "简单点说,我们的“书库”就是一个放有 `.epub` 格式电子书文件的**文件夹**。这个方法很容易就能扩展到像 Calibre 那种带有个 `metadata.db` 数据库文件的书库。怎么扩展这个问题,我们留给读者自己思考。😇\n", "\n", "现在,我们先从[古腾堡计划网站](https://www.gutenberg.org/)下载两本`.epub`格式的电子书放到我们的书库里。\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!mkdir -p \".test/library/jane-austen\"\n", "!mkdir -p \".test/library/victor-hugo\"\n", "!wget https://www.gutenberg.org/ebooks/1342.epub.noimages -O \".test/library/jane-austen/pride-and-prejudice.epub\"\n", "!wget https://www.gutenberg.org/ebooks/135.epub.noimages -O \".test/library/victor-hugo/les-miserables.epub\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 用 LlamaIndex 构建 RAG\n", "\n", "使用 LlamaIndex 的 RAG 主要包括以下三个阶段:\n", "\n", "1. **加载**,在这个阶段你告诉 LlamaIndex 你的数据在哪里以及如何加载它;\n", "2. **索引**,在这个阶段你扩充加载的数据以方便查询,例如使用向量嵌入;\n", "3. **查询**,在这个阶段你配置一个 LLM 作为你索引数据的查询接口。\n", "\n", "以上解释仅是对 LlamaIndex 可实现功能的表面说明。要想了解更多深入细节,我强烈推荐阅读 LlamaIndex 文档中的[\"高级概念\"页面](https://docs.llamaindex.ai/en/stable/getting_started/concepts.html)。\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 加载\n", "\n", "好的,我们首先从**加载**阶段开始。\n", "\n", "之前提到,LlamaIndex 是专为 RAG 这种混合检索生成模型设计的。这一点从它的[`SimpleDirectoryReader`](https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader.html)功能就可以明显看出,它能**神奇地**免费支持很多种文件类型。幸运的是, `.epub` 文件格式也在支持范围内。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from llama_index.core import SimpleDirectoryReader\n", "\n", "loader = SimpleDirectoryReader(\n", " input_dir=\"./.test/\",\n", " recursive=True,\n", " required_exts=[\".epub\"],\n", ")\n", "\n", "documents = loader.load_data()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`SimpleDirectoryReader.load_data()` 将我们的电子书转换成一组 LlamaIndex 可以处理的 [`Document`s](https://docs.llamaindex.ai/en/stable/api/llama_index.core.schema.Document.html)。\n", "\n", "这里有一个重要的事情要注意,就是这个阶段的文档**还没有被分块**——这将在索引阶段进行。继续往下看...\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 索引\n", "\n", "\n", "在把数据**加载**进来之后,接下来我们要做的是**建立索引**。这样我们的 RAG 系统就能找到与用户查询相关的信息,然后把这些信息传给语言模型(LLM),以便它能够**增强**回答的内容。同时,这一步也将对文档进行分块。\n", "\n", "在 LlamaIndex 中,[`VectorStoreIndex`](https://docs.llamaindex.ai/en/stable/module_guides/indexing/vector_store_index.html) 是用来建立索引的一个“默认”工具。这个工具默认使用一个简单、基于内存的字典来保存索引,但随着你的使用规模扩大,LlamaIndex 还支持\n", "[多种向量存储解决方案](https://docs.llamaindex.ai/en/stable/module_guides/storing/vector_stores.html)。\n", "\n", "<Tip>\n", "LlamaIndex 默认的块大小是 1024 个字符,块与块之间有 20 个字符的重叠。如果需要了解更多细节,可以查看 [LlamaIndex 的文档](https://docs.llamaindex.ai/en/stable/optimizing/basic_strategies/basic_strategies.html#chunk-sizes)。\n", "</Tip>\n", "\n", "如前所述,我们选择使用 [`BAAI/bge-small-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5) 生成嵌入,以避免使用默认的 OpenAI(特别是 gpt-3.5-turbo)模型,因为我们需要一个轻量级、可在本地运行的完整解决方案。\n", "\n", "幸运的是,LlamaIndex 可以通过 `HuggingFaceEmbedding` 类方便地从 Hugging Face 获取嵌入模型,因此我们将使用它。\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n", "\n", "embedding_model = HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "我们将把这个模型传递给 `VectorStoreIndex`,作为我们的嵌入模型,以绕过 OpenAI 的默认行为。" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from llama_index.core import VectorStoreIndex\n", "\n", "index = VectorStoreIndex.from_documents(\n", " documents,\n", " embed_model=embedding_model,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 查询\n", "\n", "现在我们要完成 RAG 智能助手的最后一部分——设置查询接口。\n", "\n", "在这个教程中,我们将使用 Llama 2 作为语言模型,但我建议你试试不同的模型,看看哪个能给出最好的回答。\n", "\n", "首先,我们需要在一个新的终端窗口中启动 Ollama 服务器。不过,[Ollama 的 Python 客户端](https://github.com/ollama/ollama-python)不支持直接启动和关闭服务器,所以我们需要在 Python 环境之外操作。\n", "\n", "打开一个新的终端窗口,输入命令:`ollama serve`。等我们这里操作完成后,别忘了关闭服务器!\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "现在,让我们将 Llama 2 连接到 LlamaIndex,并使用它作为我们查询引擎的基础。" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from llama_index.llms.ollama import Ollama\n", "\n", "llama = Ollama(\n", " model=\"llama2\",\n", " request_timeout=40.0,\n", ")\n", "\n", "query_engine = index.as_query_engine(llm=llama)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 最终结果 \n", "\n", "有了这些,我们的基本的 RAG 电子书库智能助手就设置好了,我们可以开始询问有关我们电子书库的问题了。例如:\n" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Based on the context provided, there are two books available:\n", "\n", "1. \"Pride and Prejudice\" by Jane Austen\n", "2. \"Les Misérables\" by Victor Hugo\n", "\n", "The context used to derive this answer includes:\n", "\n", "* The file path for each book, which provides information about the location of the book files on the computer.\n", "* The titles of the books, which are mentioned in the context as being available for reading.\n", "* A list of words associated with each book, such as \"epub\" and \"notebooks\", which provide additional information about the format and storage location of each book.\n" ] } ], "source": [ "print(query_engine.query(\"What are the titles of all the books available? Show me the context used to derive your answer.\"))" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The main character of 'Pride and Prejudice' is Elizabeth Bennet.\n" ] } ], "source": [ "print(query_engine.query(\"Who is the main character of 'Pride and Prejudice'?\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 总结和未来可能的提升\n", "\n", "\n", "我们成功地展示了如何创建一个完全在本地运行的基本 RAG 的电子书库智能助手,甚至在苹果的 Apple silicon Macs 上也能运行。在这个过程中,我们还全面了解了 LlamaIndex 是如何帮助我们简化建立基于 RAG 的应用程序的。\n", "\n", "尽管如此,我们其实只是接触到了一些皮毛。下面是一些关于如何改进和在这个基础上进一步发展的想法。\n", "\n", "### 强制引用\n", "\n", "为了避免图书馆员的虚构响应,我们怎样才能要求它为其回答提供引用?\n", "\n", "### 使用扩充的元数据\n", "\n", "像 [Calibre](https://calibre-ebook.com/) 这样的电子书库管理工具会为电子书创建更多的元数据。这些元数据可以提供一些在书中文本里找不到的信息,比如出版商或版本。我们怎样才能扩展我们的 RAG 流程,使其也能利用那些不是 .epub 文件的额外信息源呢?\n", "\n", "\n", "### 高效索引\n", "\n", "如果我们把这里做的所有东西写成一个脚本或可执行程序,那么每次运行这个脚本时,它都会重新索引我们的图书馆。对于只有两个文件的微型测试库来说,这样还行,但对于稍大一点的图书馆来说,每次都重新索引会让用户感到非常烦恼。我们怎样才能让索引持久化,并且只在图书馆内容有重要变化时,比如添加了新书,才去更新索引呢?" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.8" } }, "nbformat": 4, "nbformat_minor": 2 }