{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 用 LlamaIndex 构建一个 RAG 电子书库智能助手\n",
"\n",
"_作者: [Jonathan Jin](https://huggingface.co/jinnovation)_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 简介\n",
"\n",
"这份教程将指导你如何快速为你的电子书库创建一个基于 RAG 图书助手。\n",
"就像图书馆的图书管理员帮你找书一样,这个助手也能帮你从你的电子书里找到你需要的书。\n",
"\n",
"## 要求\n",
"这个助手要做得**轻巧**,**尽量在本地运行**,而且**不要用太多其他的东西**。我们会尽量用免费的开源软件,选择那种在**本地普通电脑上,比如 M1 型号的 MacBook 上就能运行的模型**。\n",
"\n",
"## 组件\n",
"我们的解决方案将包括以下组件:\n",
"- [LlamaIndex],一个基于LLM的应用数据框架,与 [LangChain] 不同,它是专门为 RAG 设计的;\n",
"- [Ollama],一个简单易用的工具,可以让你在本地运行语言模型,比如Llama 2;\n",
"- [`BAAI/bge-base-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5) 嵌入模型,它的表现[相当好,并且大小适中](https://huggingface.co/spaces/mteb/leaderboard);\n",
"- [Llama 2],我们将通过 [Ollama] 运行它。\n",
"\n",
"[LlamaIndex]: https://docs.llamaindex.ai/en/stable/index.html\n",
"[LangChain]: https://python.langchain.com/docs/get_started/introduction\n",
"[Ollama]: https://ollama.com/\n",
"[Llama 2]: https://ollama.com/library/llama2\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 依赖\n",
"\n",
"首先安装依赖库"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install -q \\\n",
" llama-index \\\n",
" EbookLib \\\n",
" html2text \\\n",
" llama-index-embeddings-huggingface \\\n",
" llama-index-llms-ollama"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!brew install ollama"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 设置测试书库\n",
"\n",
"我们接下来要弄个测试用的“书库”。\n",
"\n",
"简单点说,我们的“书库”就是一个放有 `.epub` 格式电子书文件的**文件夹**。这个方法很容易就能扩展到像 Calibre 那种带有个 `metadata.db` 数据库文件的书库。怎么扩展这个问题,我们留给读者自己思考。😇\n",
"\n",
"现在,我们先从[古腾堡计划网站](https://www.gutenberg.org/)下载两本`.epub`格式的电子书放到我们的书库里。\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!mkdir -p \".test/library/jane-austen\"\n",
"!mkdir -p \".test/library/victor-hugo\"\n",
"!wget https://www.gutenberg.org/ebooks/1342.epub.noimages -O \".test/library/jane-austen/pride-and-prejudice.epub\"\n",
"!wget https://www.gutenberg.org/ebooks/135.epub.noimages -O \".test/library/victor-hugo/les-miserables.epub\""
]
},
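{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, we can list the library's contents to confirm that both downloads landed where we expect:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!ls -R .test/library"
]
},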
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 用 LlamaIndex 构建 RAG\n",
"\n",
"使用 LlamaIndex 的 RAG 主要包括以下三个阶段:\n",
"\n",
"1. **加载**,在这个阶段你告诉 LlamaIndex 你的数据在哪里以及如何加载它;\n",
"2. **索引**,在这个阶段你扩充加载的数据以方便查询,例如使用向量嵌入;\n",
"3. **查询**,在这个阶段你配置一个 LLM 作为你索引数据的查询接口。\n",
"\n",
"以上解释仅是对 LlamaIndex 可实现功能的表面说明。要想了解更多深入细节,我强烈推荐阅读 LlamaIndex 文档中的[\"高级概念\"页面](https://docs.llamaindex.ai/en/stable/getting_started/concepts.html)。\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 加载\n",
"\n",
"好的,我们首先从**加载**阶段开始。\n",
"\n",
"之前提到,LlamaIndex 是专为 RAG 这种混合检索生成模型设计的。这一点从它的[`SimpleDirectoryReader`](https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader.html)功能就可以明显看出,它能**神奇地**免费支持很多种文件类型。幸运的是, `.epub` 文件格式也在支持范围内。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from llama_index.core import SimpleDirectoryReader\n",
"\n",
"loader = SimpleDirectoryReader(\n",
" input_dir=\"./.test/\",\n",
" recursive=True,\n",
" required_exts=[\".epub\"],\n",
")\n",
"\n",
"documents = loader.load_data()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`SimpleDirectoryReader.load_data()` 将我们的电子书转换成一组 LlamaIndex 可以处理的 [`Document`s](https://docs.llamaindex.ai/en/stable/api/llama_index.core.schema.Document.html)。\n",
"\n",
"这里有一个重要的事情要注意,就是这个阶段的文档**还没有被分块**——这将在索引阶段进行。继续往下看...\n",
"\n"
]
},
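{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before moving on to indexing, it's worth peeking at what the loader produced. As a minimal sanity check, we can count the resulting `Document`s and inspect the file metadata, e.g. the source path, attached to the first one:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(f\"Loaded {len(documents)} document(s)\")\n",
"print(documents[0].metadata)"
]
},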
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 索引\n",
"\n",
"\n",
"在把数据**加载**进来之后,接下来我们要做的是**建立索引**。这样我们的 RAG 系统就能找到与用户查询相关的信息,然后把这些信息传给语言模型(LLM),以便它能够**增强**回答的内容。同时,这一步也将对文档进行分块。\n",
"\n",
"在 LlamaIndex 中,[`VectorStoreIndex`](https://docs.llamaindex.ai/en/stable/module_guides/indexing/vector_store_index.html) 是用来建立索引的一个“默认”工具。这个工具默认使用一个简单、基于内存的字典来保存索引,但随着你的使用规模扩大,LlamaIndex 还支持\n",
"[多种向量存储解决方案](https://docs.llamaindex.ai/en/stable/module_guides/storing/vector_stores.html)。\n",
"\n",
"<Tip>\n",
"LlamaIndex 默认的块大小是 1024 个字符,块与块之间有 20 个字符的重叠。如果需要了解更多细节,可以查看 [LlamaIndex 的文档](https://docs.llamaindex.ai/en/stable/optimizing/basic_strategies/basic_strategies.html#chunk-sizes)。\n",
"</Tip>\n",
"\n",
"如前所述,我们选择使用 [`BAAI/bge-small-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5) 生成嵌入,以避免使用默认的 OpenAI(特别是 gpt-3.5-turbo)模型,因为我们需要一个轻量级、可在本地运行的完整解决方案。\n",
"\n",
"幸运的是,LlamaIndex 可以通过 `HuggingFaceEmbedding` 类方便地从 Hugging Face 获取嵌入模型,因此我们将使用它。\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n",
"\n",
"embedding_model = HuggingFaceEmbedding(model_name=\"BAAI/bge-small-en-v1.5\")"
]
},
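{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick smoke test of the embedding model, we can embed a short string and check the dimensionality of the resulting vector; for the `bge-small` family this should be 384:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Embed a short test string and inspect the vector's dimensionality.\n",
"vector = embedding_model.get_text_embedding(\"The quick brown fox\")\n",
"print(len(vector))"
]
},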
{
"cell_type": "markdown",
"metadata": {},
"source": [
"我们将把这个模型传递给 `VectorStoreIndex`,作为我们的嵌入模型,以绕过 OpenAI 的默认行为。"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from llama_index.core import VectorStoreIndex\n",
"\n",
"index = VectorStoreIndex.from_documents(\n",
" documents,\n",
" embed_model=embedding_model,\n",
")"
]
},
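{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you'd like chunking behavior other than the defaults mentioned in the tip above, one option is to pass a custom `SentenceSplitter` via `transformations`. This is a sketch with illustrative rather than tuned chunk sizes, and it builds a separate index so as not to disturb the one we just created:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from llama_index.core.node_parser import SentenceSplitter\n",
"\n",
"# Illustrative values; tune chunk_size and chunk_overlap for your library.\n",
"splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)\n",
"\n",
"custom_index = VectorStoreIndex.from_documents(\n",
"    documents,\n",
"    embed_model=embedding_model,\n",
"    transformations=[splitter],\n",
")"
]
},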
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 查询\n",
"\n",
"现在我们要完成 RAG 智能助手的最后一部分——设置查询接口。\n",
"\n",
"在这个教程中,我们将使用 Llama 2 作为语言模型,但我建议你试试不同的模型,看看哪个能给出最好的回答。\n",
"\n",
"首先,我们需要在一个新的终端窗口中启动 Ollama 服务器。不过,[Ollama 的 Python 客户端](https://github.com/ollama/ollama-python)不支持直接启动和关闭服务器,所以我们需要在 Python 环境之外操作。\n",
"\n",
"打开一个新的终端窗口,输入命令:`ollama serve`。等我们这里操作完成后,别忘了关闭服务器!\n"
]
},
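{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the server running, we also need the Llama 2 weights available locally. If you haven't pulled them before, you can do so via Ollama's CLI; note that this downloads several gigabytes on first run:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!ollama pull llama2"
]
},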
{
"cell_type": "markdown",
"metadata": {},
"source": [
"现在,让我们将 Llama 2 连接到 LlamaIndex,并使用它作为我们查询引擎的基础。"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from llama_index.llms.ollama import Ollama\n",
"\n",
"llama = Ollama(\n",
" model=\"llama2\",\n",
" request_timeout=40.0,\n",
")\n",
"\n",
"query_engine = index.as_query_engine(llm=llama)"
]
},
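{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before querying the index, it's worth confirming that the model responds at all. A minimal check, assuming the server is up and the model has been pulled:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Plain completion, bypassing the index, to verify the Ollama connection.\n",
"print(llama.complete(\"In one sentence, who wrote 'Les Miserables'?\"))"
]
},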
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 最终结果 \n",
"\n",
"有了这些,我们的基本的 RAG 电子书库智能助手就设置好了,我们可以开始询问有关我们电子书库的问题了。例如:\n"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Based on the context provided, there are two books available:\n",
"\n",
"1. \"Pride and Prejudice\" by Jane Austen\n",
"2. \"Les Misérables\" by Victor Hugo\n",
"\n",
"The context used to derive this answer includes:\n",
"\n",
"* The file path for each book, which provides information about the location of the book files on the computer.\n",
"* The titles of the books, which are mentioned in the context as being available for reading.\n",
"* A list of words associated with each book, such as \"epub\" and \"notebooks\", which provide additional information about the format and storage location of each book.\n"
]
}
],
"source": [
"print(query_engine.query(\"What are the titles of all the books available? Show me the context used to derive your answer.\"))"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The main character of 'Pride and Prejudice' is Elizabeth Bennet.\n"
]
}
],
"source": [
"print(query_engine.query(\"Who is the main character of 'Pride and Prejudice'?\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 总结和未来可能的提升\n",
"\n",
"\n",
"我们成功地展示了如何创建一个完全在本地运行的基本 RAG 的电子书库智能助手,甚至在苹果的 Apple silicon Macs 上也能运行。在这个过程中,我们还全面了解了 LlamaIndex 是如何帮助我们简化建立基于 RAG 的应用程序的。\n",
"\n",
"尽管如此,我们其实只是接触到了一些皮毛。下面是一些关于如何改进和在这个基础上进一步发展的想法。\n",
"\n",
"### 强制引用\n",
"\n",
"为了避免图书馆员的虚构响应,我们怎样才能要求它为其回答提供引用?\n",
"\n",
"### 使用扩充的元数据\n",
"\n",
"像 [Calibre](https://calibre-ebook.com/) 这样的电子书库管理工具会为电子书创建更多的元数据。这些元数据可以提供一些在书中文本里找不到的信息,比如出版商或版本。我们怎样才能扩展我们的 RAG 流程,使其也能利用那些不是 .epub 文件的额外信息源呢?\n",
"\n",
"\n",
"### 高效索引\n",
"\n",
"如果我们把这里做的所有东西写成一个脚本或可执行程序,那么每次运行这个脚本时,它都会重新索引我们的图书馆。对于只有两个文件的微型测试库来说,这样还行,但对于稍大一点的图书馆来说,每次都重新索引会让用户感到非常烦恼。我们怎样才能让索引持久化,并且只在图书馆内容有重要变化时,比如添加了新书,才去更新索引呢?"
]
}
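,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is that sketch, using LlamaIndex's built-in persistence helpers. It reuses a saved index across runs but does not yet detect changes to the library's contents; the persistence directory name is an arbitrary choice for this example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"from llama_index.core import StorageContext, load_index_from_storage\n",
"\n",
"# Arbitrary location for this example; any writable directory works.\n",
"PERSIST_DIR = \".test/index\"\n",
"\n",
"if os.path.exists(PERSIST_DIR):\n",
"    # Reuse the previously built index instead of re-embedding everything.\n",
"    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)\n",
"    index = load_index_from_storage(storage_context, embed_model=embedding_model)\n",
"else:\n",
"    # First run: build the index and save it to disk for next time.\n",
"    index = VectorStoreIndex.from_documents(documents, embed_model=embedding_model)\n",
"    index.storage_context.persist(persist_dir=PERSIST_DIR)"
]
}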
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}