seed/make_qa_multimodal_pdf_oss.ipynb (672 lines of code) (raw):

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Generate QnA synthetic dataset from a Complex PDF using Unstructured\n", "\n", "### Overview\n", "\n", "We process the PDF by dividing it into three parts.\n", "\n", "- **Text-heavy** - Text-heavy PDF can be processed with open source without the need to use toolkits like Azure AI Document Intelligence or Unstructured.\n", "- **Image-heavy** - Image-heavy PDF can be converted the entire page to images and let a multimodal LLM like GPT-4o summarize each page.\n", "- **Mixed** - After reading the document with Azure AI Document Intelligence, we replace the image descriptions inside the figure tags with text summarized by a multimodal LLM. (Often the image descriptions are blank or have only a short caption.)\n", "\n", "![summary](../imgs/summary-creating-qna-pdf.png)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "from dotenv import load_dotenv\n", "load_dotenv()\n", "\n", "aoai_api_endpoint = os.getenv(\"AZURE_OPENAI_ENDPOINT\")\n", "aoai_api_key = os.getenv(\"AZURE_OPENAI_API_KEY\")\n", "aoai_api_version = os.getenv(\"AZURE_OPENAI_API_VERSION\")\n", "aoai_deployment_name = os.getenv(\"AZURE_OPENAI_DEPLOYMENT_NAME\")\n", "\n", "if not aoai_api_version:\n", " aoai_api_version = os.getenv(\"OPENAI_API_VERSION\")\n", "if not aoai_deployment_name:\n", " aoai_deployment_name = os.getenv(\"DEPLOYMENT_NAME\")\n", " \n", "print(f\"aoai_api_endpoint: {aoai_api_endpoint}\")\n", "print(f\"aoai_api_key: {aoai_api_key}\")\n", "print(f\"aoai_api_version: {aoai_api_version}\")\n", "print(f\"aoai_deployment_name: {aoai_deployment_name}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Read & Preprocess PDF file\n", "\n", "---\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split the PDFs into individual pages\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import shutil, random\n", "import openai\n", "from unstructured.cleaners.core import clean_bullets, clean_extra_whitespace, remove_punctuation\n", "from langchain_community.document_loaders import UnstructuredFileLoader, UnstructuredMarkdownLoader, UnstructuredAPIFileLoader\n", "from langchain_community.document_loaders.csv_loader import CSVLoader, UnstructuredCSVLoader\n", "from util.common_utils import get_language_code\n", "\n", "raw_data_dir = \"../raw_data\"\n", "splitted_raw_data_dir = \"splitted_raw_data\"\n", "file_path = f\"{raw_data_dir}/pdf/en-imagenet-training-wrote-by-daekeun.pdf\"\n", "\n", "DOMAIN = \"Distributed training on Cloud\"\n", "LANGUAGE = \"English\" # You can change your language here. e.g., \"Korean\", \"Japanese\", \"Chinese\"\n", "LANGUAGE_CODE = get_language_code(LANGUAGE)\n", "print(f\"Domain: {DOMAIN}, Language: {LANGUAGE}, Language Code: {LANGUAGE_CODE}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(Optional) Only use a poration of the PDF documents for testing. If there are a lot of pages or partial processing is required, cut and save only some pages.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import fitz\n", "\n", "# Open the first PDF document\n", "doc1 = fitz.open(file_path)\n", "split_pages = [(1, 15)]\n", "\n", "for idx, s in enumerate(split_pages):\n", " # Create a new empty PDF document\n", " doc2 = fitz.open()\n", "\n", " # Insert the first 2 pages of doc1 into doc2\n", " doc2.insert_pdf(doc1, from_page=s[0], to_page=s[1])\n", "\n", " # Save the modified document\n", " doc2.save(f\"{raw_data_dir}/part{idx}.pdf\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from util.common_utils import delete_folder_and_make_folder\n", "from util.preprocess import remove_short_sentences, remove_small_images, analyze_pdf_page_content, split_pdf\n", "\n", "#file_path = f\"{raw_data_dir}/part0.pdf\"\n", "analyzed_pdf_result = analyze_pdf_page_content(file_path)\n", "delete_folder_and_make_folder(splitted_raw_data_dir) \n", "\n", "print(\"### PDF Content Analysis Result:\")\n", "for content_type, pages in analyzed_pdf_result.items():\n", " print(f\"{content_type} pages: {pages}\")\n", " split_pdf(file_path, f\"{splitted_raw_data_dir}/{content_type}.pdf\", pages)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Case 1: Mixed page (Images and text mixed appropriately)\n", "\n", "After reading the document with UnstructuredFileLoader, we replace the image descriptions inside the figure tags with text summarized by a multimodal LLM. (Often the image descriptions are blank or have only a short caption.)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "pdf_mixed_path = f\"{splitted_raw_data_dir}/Mixed.pdf\"\n", "\n", "chunk_size = 1500\n", "new_after_n_chars = 1200\n", "combine_text_under_n_chars = 1000\n", "chunk_overlap = 100\n", "max_tokens = 1024\n", "image_dir = \"./images\"\n", "\n", "loader = UnstructuredFileLoader(\n", " file_path=pdf_mixed_path,\n", "\n", " chunking_strategy = \"by_title\",\n", " mode=\"elements\",\n", "\n", " extract_image_block_types=[\"Image\", \"Table\"],\n", " hi_res_model_name=\"yolox_quantized\", #\"detectron2_onnx\", \"yolox\", \"yolox_quantized\"\n", "\n", " extract_images_in_pdf=True,\n", " skip_infer_table_types='[]', # ['pdf', 'jpg', 'png', 'xls', 'xlsx', 'heic']\n", " #skip_infer_table_types=True, ## enable to get table as html using tabletrasformer\n", "\n", " extract_image_block_output_dir=image_dir,\n", " extract_image_block_to_payload=False, ## False: to save image\n", "\n", " max_characters=chunk_size,\n", " new_after_n_chars=new_after_n_chars,\n", " combine_text_under_n_chars=combine_text_under_n_chars, # 이 문자 수 이하의 텍스트는 결합\n", "\n", " languages= [\"kor+eng\"],\n", "\n", " post_processors=[clean_bullets, clean_extra_whitespace, remove_punctuation]\n", ")\n", "docs = loader.load()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "images = remove_small_images(image_dir, image_dim_thres=16)\n", "tables, texts = [], []\n", "\n", "for doc in docs:\n", " category = doc.metadata[\"category\"]\n", " if category == \"Table\": tables.append(doc)\n", " else: texts.append(doc)\n", "\n", "print (f' # texts: {len(texts)} \\n # tables: {len(tables)} \\n # images: {len(images)}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Summarize images\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from langchain.schema.output_parser import StrOutputParser\n", "from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, SystemMessagePromptTemplate\n", "from langchain_openai import AzureChatOpenAI\n", "\n", "llm = AzureChatOpenAI(\n", " temperature=0, \n", " max_tokens=max_tokens,\n", " openai_api_version=aoai_api_version,\n", " azure_deployment=aoai_deployment_name \n", ")\n", "\n", "system_prompt = \"You are an assistant tasked with describing table or image, specialized in Smartphone product.\"\n", "system_message_template = SystemMessagePromptTemplate.from_template(system_prompt)\n", "human_prompt = [\n", " {\n", " \"type\": \"image_url\",\n", " \"image_url\": {\n", " \"url\": \"data:image/png;base64,\" + \"{image_base64}\",\n", " },\n", " },\n", " {\n", " \"type\": \"text\",\n", " \"text\": '''Given image, give a concise summary in Korean. Don't insert any XML tag such as <text> and </text> when answering.'''\n", " },\n", "]\n", "human_message_template = HumanMessagePromptTemplate.from_template(human_prompt)\n", "\n", "prompt = ChatPromptTemplate.from_messages(\n", " [\n", " system_message_template,\n", " human_message_template\n", " ]\n", ")\n", "\n", "summarize_chain = prompt | llm | StrOutputParser()\n", "#summarize_chain = {\"image_base64\": lambda x:x} | prompt | llm_text | StrOutputParser()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "from util.preprocess import encode_image_base64\n", "#images = glob(os.path.join(image_path, \"*.jpg\"))\n", "base64_images = [encode_image_base64(img_path) for img_path in images]\n", "image_summaries = summarize_chain.batch(base64_images, {\"max_concurrency\": 3})\n", "image_summaries = remove_short_sentences(image_summaries)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from util.preprocess import split_text_using_tiktoken\n", "\n", "texts_tiktoken = split_text_using_tiktoken(texts, chunk_size, chunk_overlap)\n", "\n", "mixed_chunks = image_summaries + texts_tiktoken\n", "print(\"Length of splits (mixed case): \" + str(len(mixed_chunks)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Case 2: Text-heavy\n", "\n", "Text-heavy PDFs can be processed with open source without the need to use toolkits like Azure AI Document Intelligence or Unstructured.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if \"Text\" in analyzed_pdf_result:\n", "\n", " from langchain_community.document_loaders.pdf import PyMuPDFLoader\n", " from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter\n", "\n", " pdf_text_path = f\"{splitted_raw_data_dir}/Text.pdf\"\n", " loader = PyMuPDFLoader(pdf_text_path)\n", " documents = loader.load()\n", "\n", " text_splitter = RecursiveCharacterTextSplitter(\n", " chunk_size=1200, \n", " chunk_overlap=200\n", " )\n", "\n", " text_chunks = text_splitter.split_documents(documents)\n", "\n", " for idx, chunk in enumerate(text_chunks):\n", " print(f\"Chunk {idx}\\n{chunk}\")\n", " print(\"=\"*80)\n", " if idx == 2:\n", " break\n", "\n", " text_chunks = [d.page_content for d in text_chunks]\n", " print(\"Length of splits (text-heay case): \" + str(len(text_chunks)))\n", "else:\n", " text_chunks = []" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Case 3: Image-heavy\n", "\n", "Image-heavy PDF can be converted the entire page to images and let a multimodal LLM like GPT-4o summarize each page.\n", "\n", "### Preprocess Image\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if \"Image\" in analyzed_pdf_result:\n", " import fitz\n", " from glob import glob\n", "\n", " image_dir = \"./pdf_image_tmp\"\n", " delete_folder_and_make_folder(image_dir) \n", "\n", " pdf_image_path = f\"{splitted_raw_data_dir}/Image.pdf\"\n", " doc = fitz.open(pdf_image_path)\n", " #clip_x, clip_y = 10, 45\n", " clip_x, clip_y = 10, 10\n", "\n", " for i, page in enumerate(doc):\n", " x, y, w, h = page.rect\n", " clip = fitz.Rect(x+clip_x, y+clip_y, w-clip_x, h-clip_y)\n", " page.set_cropbox(clip)\n", " pix = page.get_pixmap()\n", " pix.save(f\"{image_dir}/page_{i:03d}.jpg\")\n", "\n", " images = sorted(glob(os.path.join(image_dir, \"*.jpg\")))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from langchain.schema.output_parser import StrOutputParser\n", "from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, SystemMessagePromptTemplate\n", "from langchain_openai import AzureChatOpenAI\n", "\n", "max_tokens = 1024\n", "llm = AzureChatOpenAI(\n", " temperature=0, \n", " max_tokens=max_tokens,\n", " openai_api_version=aoai_api_version,\n", " azure_deployment=aoai_deployment_name \n", ")\n", "\n", "human_prompt_main = f\"Given image, give a concise summary in {LANGUAGE}. Don't insert any XML tag such as <text> and </text> when answering.\"\n", "\n", "system_prompt = \"You are an assistant tasked with describing table or image, specialized in Smartphone product.\"\n", "system_message_template = SystemMessagePromptTemplate.from_template(system_prompt)\n", "human_prompt = [\n", " {\n", " \"type\": \"image_url\",\n", " \"image_url\": {\n", " \"url\": \"data:image/png;base64,\" + \"{image_base64}\",\n", " },\n", " },\n", " {\n", " \"type\": \"text\",\n", " \"text\": human_prompt_main\n", " },\n", "]\n", "human_message_template = HumanMessagePromptTemplate.from_template(human_prompt)\n", "\n", "prompt = ChatPromptTemplate.from_messages(\n", " [\n", " system_message_template,\n", " human_message_template\n", " ]\n", ")\n", "\n", "summarize_chain = prompt | llm | StrOutputParser()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "if \"Image\" in analyzed_pdf_result:\n", " from util.preprocess import encode_image_base64\n", " #images = glob(os.path.join(image_path, \"*.jpg\"))\n", " base64_images = [encode_image_base64(img_path) for img_path in images]\n", " image_summaries = summarize_chain.batch(base64_images, {\"max_concurrency\": 8})\n", " image_summaries = remove_short_sentences(image_summaries)\n", " print(\"Length of image_summaries (image-heavy case): \" + str(len(image_summaries)))\n", "else:\n", " image_summaries = []\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Construct QnA Pairs\n", "\n", "---\n", "\n", "### Option 1.\n", "\n", "Leverage the `azure-ai-generative` package. The `QADataGenerator` class in this package makes it easy to generate QnA synthetic questions. However, using this class as is has the disadvantage of not being able to use custom prompts, so we inherited from it and created the `CustomQADataGenerator` class.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from util.qa import CustomQADataGenerator\n", "model_config = {\n", " \"deployment\": aoai_deployment_name,\n", " \"model\": \"gpt-4o-mini\",\n", " \"max_tokens\": 2000,\n", "}\n", "\n", "qa_generator = CustomQADataGenerator(model_config=model_config, templates_dir=f\"./prompt_template/{LANGUAGE_CODE}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import asyncio\n", "from collections import Counter\n", "from typing import Dict\n", "import os\n", "from azure.ai.generative.synthetic.qa import QAType\n", "concurrency = 6 # number of concurrent calls\n", "sem = asyncio.Semaphore(concurrency)\n", "\n", "#qa_type = QAType.CONVERSATION\n", "qa_type = QAType.LONG_ANSWER\n", "\n", "async def generate_async(text: str) -> Dict:\n", " async with sem:\n", " return await qa_generator.generate_async(\n", " text=text,\n", " qa_type=qa_type,\n", " num_questions=3, # Number of questions to generate per text\n", " )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "input_batch = mixed_chunks + text_chunks + image_summaries\n", "results = await asyncio.gather(*[generate_async(text) for text in input_batch], return_exceptions=True)\n", "\n", "question_answer_list = []\n", "for result in results:\n", " if isinstance(result, Exception):\n", " raise result # exception raised inside generate_async()\n", " question_answer_list.append(result[\"question_answers\"])\n", "\n", "print(\"Successfully generated QAs\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "question_answer_list[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Option 2.\n", "\n", "You write the entire sequence of code to create a QnA dataset without using a separate toolkit.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "aoai_api_endpoint = os.getenv(\"AZURE_OPENAI_ENDPOINT\")\n", "aoai_api_key = os.getenv(\"AZURE_OPENAI_API_KEY\")\n", "aoai_api_version = os.getenv(\"AZURE_OPENAI_API_VERSION\")\n", "aoai_deployment_name = os.getenv(\"AZURE_OPENAI_DEPLOYMENT_NAME\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from langchain_openai import AzureChatOpenAI\n", "from langchain_core.runnables import RunnablePassthrough, RunnableLambda\n", "from langchain_core.output_parsers import JsonOutputParser\n", "from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler\n", "\n", "from util.qa_pair import get_qna_prompt_template, QAPair\n", "\n", "llm = AzureChatOpenAI(\n", " temperature=0, \n", " max_tokens=1024,\n", " openai_api_version=aoai_api_version,\n", " azure_deployment=aoai_deployment_name \n", ")\n", "\n", "parser = JsonOutputParser(pydantic_object=QAPair)\n", "prompt = get_qna_prompt_template(LANGUAGE)\n", "\n", "chain = prompt | llm | parser" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "input_batch = []\n", "\n", "for doc in mixed_chunks:\n", " dic = {\"context\": doc, \"domain\": DOMAIN, \"num_questions\": \"3\"}\n", " input_batch.append(dic)\n", "\n", "for doc in text_chunks:\n", " dic = {\"context\": doc, \"domain\": DOMAIN, \"num_questions\": \"3\"}\n", " input_batch.append(dic)\n", "\n", "for doc in image_summaries:\n", " dic = {\"context\": doc, \"domain\": DOMAIN, \"num_questions\": \"3\"}\n", " input_batch.append(dic) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "qa_pair = chain.batch(input_batch, {\"max_concurrency\": 5})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Save to jsonl\n", "\n", "---\n", "\n", "If you want to augment dataset, you can try Evovle-Instruct or other data augmentation techniques.<br>\n", "Please refer to `../evolve-instruct` and `../glan-instruct` for more details.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "from util.common_utils import convert_to_oai_format, save_jsonl\n", "\n", "output_dir = './dataset'\n", "os.makedirs(output_dir, exist_ok=True)\n", "\n", "system_prompt_msg = f\"\"\"You are the SME (Subject Matter Expert) in {DOMAIN}. Please answer the questions accurately. If the question is in {LANGUAGE}, write your answer in {LANGUAGE}.\"\"\"\n", "\n", "save_filename = \"advertising\"\n", "oai_qa_pair = convert_to_oai_format(question_answer_list, system_prompt_msg=system_prompt_msg)\n", "\n", "#save_jsonl(qa_pair, f\"{output_dir}/{save_filename}.jsonl\")\n", "save_jsonl(oai_qa_pair, f\"{output_dir}/{save_filename}-oai.jsonl\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Clean up\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!rm -rf {splitted_raw_data_dir} pdf_image_tmp pdf_mixed_tmp outputs_tmp images" ] } ], "metadata": { "kernelspec": { "display_name": "py312-dev", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.2" } }, "nbformat": 4, "nbformat_minor": 4 }