vi/5_vision_language_models/notebooks/vlm_usage

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Processing images and text with Vision Language Models (VLMs)\n",
    "\n",
    "This notebook demonstrates how to use the `HuggingFaceTB/SmolVLM-Instruct` model, quantized to **4 bits**, for several multimodal tasks:\n",
    "\n",
    "- Visual Question Answering (VQA): answering questions based on the content of an image.\n",
    "\n",
    "- Text recognition (OCR): extracting and interpreting text in images.\n",
    "\n",
    "- Video description: describing a video by analyzing its frames in sequence.\n",
    "\n",
    "By structuring prompts effectively, you can use the model for many applications, such as scene understanding, document analysis, and dynamic visual reasoning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install the requirements in Google Colab\n",
    "# !pip install transformers datasets trl huggingface_hub bitsandbytes\n",
    "\n",
    "# Authenticate to Hugging Face\n",
    "from huggingface_hub import notebook_login\n",
    "notebook_login()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "`low_cpu_mem_usage` was None, now default to True since model is quantized.\n",
      "You shouldn't move a model that is dispatched using accelerate hooks.\n",
      "Some kwargs in processor config are unused and will not have any effect: image_seq_len. \n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'longest_edge': 1536}\n"
     ]
    }
   ],
   "source": [
    "import torch, PIL\n",
    "from transformers import AutoProcessor, AutoModelForVision2Seq, BitsAndBytesConfig\n",
    "from transformers.image_utils import load_image\n",
    "\n",
    "device = (\n",
    "    \"cuda\"\n",
    "    if torch.cuda.is_available()\n",
    "    else \"mps\" if torch.backends.mps.is_available() else \"cpu\"\n",
    ")\n",
    "\n",
    "quantization_config = BitsAndBytesConfig(load_in_4bit=True)\n",
    "model_name = \"HuggingFaceTB/SmolVLM-Instruct\"\n",
    "model = AutoModelForVision2Seq.from_pretrained(\n",
    "    model_name,\n",
    "    quantization_config=quantization_config,\n",
    ").to(device)\n",
    "processor = AutoProcessor.from_pretrained(\"HuggingFaceTB/SmolVLM-Instruct\")\n",
    "\n",
    "print(processor.image_processor.size)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Processing images\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's start by generating a caption and answering questions about an image. We will also explore processing multiple images.\n",
    "\n",
    "### 1. Captioning a single image"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"https://cdn.pixabay.com/photo/2024/11/20/09/14/christmas-9210799_1280.jpg\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<img src=\"https://cdn.pixabay.com/photo/2024/11/23/08/18/christmas-9218404_1280.jpg\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from IPython.display import Image, display\n",
    "\n",
    "image_url1 = \"https://cdn.pixabay.com/photo/2024/11/20/09/14/christmas-9210799_1280.jpg\"\n",
    "display(Image(url=image_url1))\n",
    "\n",
    "image_url2 = \"https://cdn.pixabay.com/photo/2024/11/23/08/18/christmas-9218404_1280.jpg\"\n",
    "display(Image(url=image_url2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/duydl/Miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/nn/modules.py:451: UserWarning: Input type into Linear4bit is torch.float16, but bnb_4bit_compute_dtype=torch.float32 (default). This will lead to slow inference or training speed.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['User:<image>Can you describe the image?\\nAssistant: The image is a scene of a person walking in a forest. The person is wearing a coat and a cap. The person is holding the hand of another person. The person is walking on a path. The path is covered with dry leaves. The background of the image is a forest with trees.']\n"
     ]
    }
   ],
   "source": [
    "# Load one image\n",
    "image1 = load_image(image_url1)\n",
    "\n",
    "# Create the input messages\n",
    "messages = [\n",
    "    {\n",
    "        \"role\": \"user\",\n",
    "        \"content\": [\n",
    "            {\"type\": \"image\"},\n",
    "            {\"type\": \"text\", \"text\": \"Can you describe the image?\"}\n",
    "        ]\n",
    "    },\n",
    "]\n",
    "\n",
    "# Prepare the inputs\n",
    "prompt = processor.apply_chat_template(messages, add_generation_prompt=True)\n",
    "inputs = processor(text=prompt, images=[image1], return_tensors=\"pt\")\n",
    "inputs = inputs.to(device)\n",
    "\n",
    "# Generate the output\n",
    "generated_ids = model.generate(**inputs, max_new_tokens=500)\n",
    "generated_texts = processor.batch_decode(\n",
    "    generated_ids,\n",
    "    skip_special_tokens=True,\n",
    ")\n",
    "\n",
    "print(generated_texts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Comparing multiple images\n",
    "The model can process and compare multiple images. Let's identify the theme that the two images have in common."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['User:<image>What event do they both represent?\\nAssistant: Christmas.']\n"
     ]
    }
   ],
   "source": [
    "# Load the second image\n",
    "image2 = load_image(image_url2)\n",
    "\n",
    "# Create the input messages\n",
    "messages = [\n",
    "    # {\n",
    "    #     \"role\": \"user\",\n",
    "    #     \"content\": [\n",
    "    #         {\"type\": \"image\"},\n",
    "    #         {\"type\": \"image\"},\n",
    "    #         {\"type\": \"text\", \"text\": \"Can you describe the two images?\"}\n",
    "    #     ]\n",
    "    # },\n",
    "    {\n",
    "        \"role\": \"user\",\n",
    "        \"content\": [\n",
    "            {\"type\": \"image\"},\n",
    "            {\"type\": \"image\"},\n",
    "            {\"type\": \"text\", \"text\": \"What event do they both represent?\"}\n",
    "        ]\n",
    "    },\n",
    "]\n",
    "\n",
    "# Prepare the inputs\n",
    "prompt = processor.apply_chat_template(messages, add_generation_prompt=True)\n",
    "inputs = processor(text=prompt, images=[image1, image2], return_tensors=\"pt\")\n",
    "inputs = inputs.to(device)\n",
    "\n",
    "# Generate the output\n",
    "generated_ids = model.generate(**inputs, max_new_tokens=500)\n",
    "generated_texts = processor.batch_decode(\n",
    "    generated_ids,\n",
    "    skip_special_tokens=True,\n",
    ")\n",
    "\n",
    "print(generated_texts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 🔠 Text recognition (OCR)\n",
    "A VLM can also recognize and interpret text in images, which makes it suitable for tasks such as document analysis.\n",
    "You can also experiment with images that contain denser text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"https://cdn.pixabay.com/photo/2020/11/30/19/23/christmas-5792015_960_720.png\"/>"
      ],
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['User:<image>What is written?\\nAssistant: MERRY CHRISTMAS AND A HAPPY NEW YEAR']\n"
     ]
    }
   ],
   "source": [
    "document_image_url = \"https://cdn.pixabay.com/photo/2020/11/30/19/23/christmas-5792015_960_720.png\"\n",
    "display(Image(url=document_image_url))\n",
    "\n",
    "# Load the document image\n",
    "document_image = load_image(document_image_url)\n",
    "\n",
    "# Create the input message for analysis\n",
    "messages = [\n",
    "    {\n",
    "        \"role\": \"user\",\n",
    "        \"content\": [\n",
    "            {\"type\": \"image\"},\n",
    "            {\"type\": \"text\", \"text\": \"What is written?\"}\n",
    "        ]\n",
    "    }\n",
    "]\n",
    "\n",
    "# Prepare the inputs\n",
    "prompt = processor.apply_chat_template(messages, add_generation_prompt=True)\n",
    "inputs = processor(text=prompt, images=[document_image], return_tensors=\"pt\")\n",
    "inputs = inputs.to(device)\n",
    "\n",
    "# Generate the output\n",
    "generated_ids = model.generate(**inputs, max_new_tokens=500)\n",
    "generated_texts = processor.batch_decode(\n",
    "    generated_ids,\n",
    "    skip_special_tokens=True,\n",
    ")\n",
    "\n",
    "print(generated_texts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Processing videos\n",
    "Vision Language Models (VLMs) can process videos indirectly by extracting keyframes and reasoning over them in temporal order. Although VLMs lack the temporal modeling of dedicated video models, they can still:\n",
    "\n",
    "- Describe actions or events by analyzing sequentially sampled frames.\n",
    "\n",
    "- Answer questions about a video based on representative keyframes.\n",
    "\n",
    "- Summarize video content by combining the textual descriptions of multiple frames.\n",
    "\n",
    "Let's try it on an example:\n",
    "\n",
    "<video width=\"640\" height=\"360\" controls>\n",
    "  <source src=\"https://cdn.pixabay.com/video/2023/10/28/186794-879050032_large.mp4\" type=\"video/mp4\">\n",
    "  Your browser does not support the video tag.\n",
    "</video>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install opencv-python"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.display import Video\n",
    "import cv2\n",
    "import numpy as np\n",
    "\n",
    "def extract_frames(video_path, max_frames=50, target_size=None):\n",
    "    cap = cv2.VideoCapture(video_path)\n",
    "    if not cap.isOpened():\n",
    "        raise ValueError(f\"Could not open video: {video_path}\")\n",
    "\n",
    "    # Sample up to max_frames frame indices, evenly spaced across the video\n",
    "    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))\n",
    "    frame_indices = np.linspace(0, total_frames - 1, max_frames, dtype=int)\n",
    "\n",
    "    frames = []\n",
    "    for idx in frame_indices:\n",
    "        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)\n",
    "        ret, frame = cap.read()\n",
    "        if ret:\n",
    "            # OpenCV returns BGR arrays; convert to an RGB PIL image\n",
    "            frame = PIL.Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))\n",
    "            if target_size:\n",
    "                frames.append(resize_and_crop(frame, target_size))\n",
    "            else:\n",
    "                frames.append(frame)\n",
    "    cap.release()\n",
    "    return frames\n",
    "\n",
    "def resize_and_crop(image, target_size):\n",
    "    # Scale the shorter side to target_size, then center-crop to a square\n",
    "    width, height = image.size\n",
    "    scale = target_size / min(width, height)\n",
    "    image = image.resize((int(width * scale), int(height * scale)), PIL.Image.Resampling.LANCZOS)\n",
    "    left = (image.width - target_size) // 2\n",
    "    top = (image.height - target_size) // 2\n",
    "    return image.crop((left, top, left + target_size, top + target_size))\n",
    "\n",
    "# Video link\n",
    "video_link = \"https://cdn.pixabay.com/video/2023/10/28/186794-879050032_large.mp4\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Response: User: Following are the frames of a video in temporal order.<image>Describe what the woman is doing.\n",
      "Assistant: The woman is hanging an ornament on a Christmas tree.\n"
     ]
    }
   ],
   "source": [
    "question = \"Describe what the woman is doing.\"\n",
    "\n",
    "def generate_response(model, processor, frames, question):\n",
    "\n",
    "    # One image placeholder per frame, placed between the instruction and the question\n",
    "    image_tokens = [{\"type\": \"image\"} for _ in frames]\n",
    "    messages = [\n",
    "        {\n",
    "            \"role\": \"user\",\n",
    "            \"content\": [{\"type\": \"text\", \"text\": \"Following are the frames of a video in temporal order.\"}, *image_tokens, {\"type\": \"text\", \"text\": question}]\n",
    "        }\n",
    "    ]\n",
    "    inputs = processor(\n",
    "        text=processor.apply_chat_template(messages, add_generation_prompt=True),\n",
    "        images=frames,\n",
    "        return_tensors=\"pt\"\n",
    "    ).to(model.device)\n",
    "\n",
    "    outputs = model.generate(\n",
    "        **inputs, max_new_tokens=100, num_beams=5, temperature=0.7, do_sample=True, use_cache=True\n",
    "    )\n",
    "    return processor.decode(outputs[0], skip_special_tokens=True)\n",
    "\n",
    "# Extract frames from the video\n",
    "frames = extract_frames(video_link, max_frames=15, target_size=384)\n",
    "\n",
    "processor.image_processor.size = (384, 384)\n",
    "processor.image_processor.do_resize = False\n",
    "# Generate the response\n",
    "response = generate_response(model, processor, frames, question)\n",
    "\n",
    "# Display the result\n",
    "# print(\"Question:\", question)\n",
    "print(\"Response:\", response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 💐 You're done!\n",
    "This notebook demonstrated how to use Vision Language Models (VLMs), including how to format prompts for multimodal tasks. By following the steps outlined here, you can experiment with VLMs and their applications.\n",
    "\n",
    "### Next steps to explore:\n",
    "- Experiment with more VLM use cases.\n",
    "\n",
    "- Collaborate with peers by reviewing their pull requests (PRs).\n",
    "\n",
    "- Contribute to improving this course material by opening an issue or submitting a PR that introduces new use cases, examples, or concepts.\n",
    "\n",
    "Happy exploring! 🌟"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "py310",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
