misc/using_citations.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Citations \n", "\n", "The Anthropic API features citation support that enables Claude to provide detailed citations when answering questions about documents. Citations are a valuable affordance in many LLM-powered applications to help users track and verify the sources of information in responses.\n", "\n", "Citations are supported on:\n", "* `claude-3-5-sonnet-20241022`\n", "* `claude-3-5-haiku-20241022`\n", "\n", "The citations feature is an alternative to prompt-based citation techniques. Using this feature has the following advantages:\n", "- Prompt-based techniques often require Claude to output full quotes from the source document it intends to cite. This increases output tokens and therefore cost.\n", "- The citation feature will not return citations pointing to documents or locations that were not provided as valid sources.\n", "- In our testing, the citation feature generated citations with higher recall and precision than prompt-based techniques.\n", "\n", "The documentation for citations can be found [here](https://docs.anthropic.com/en/docs/build-with-claude/citations)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup\n", "\n", "First, let's install the required libraries and initialize our Anthropic client. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install anthropic --quiet" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "import anthropic\n", "import os\n", "import json\n", "\n", "ANTHROPIC_API_KEY = os.environ.get(\"ANTHROPIC_API_KEY\")\n", "# ANTHROPIC_API_KEY = \"\" # Put your API key here!\n", "\n", "client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Document Types\n", "\n", "Citations support three different document types. 
The type of citation returned depends on the type of document being cited:\n", "\n", "* Plain text document citation → char location format\n", "* PDF document citation → page location format\n", "* Custom content document citation → content block location format\n", "\n", "We will explore working with each of these in the examples below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plain Text Documents\n", "\n", "With plain text document citations you provide your document as raw text to the model. You can provide one or multiple documents. This text is automatically chunked into sentences. The model will cite these sentences as appropriate. The model can cite multiple sentences together in a single citation but will not cite text smaller than a sentence.\n", "\n", "Along with the generated text, the API response includes structured data for all citations. \n", "\n", "Let's see a complete example using a help center chatbot for a made-up company, PetWorld." ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "================================================================================\n", "Raw response:\n", "================================================================================\n", "{\n", " \"content\": [\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"Based on the documentation, I can explain why you don't see tracking yet: \"\n", " },\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"You'll receive an email with your tracking number once your order ships. If you don't receive a tracking number within 48 hours of your order confirmation, please contact our customer support team for assistance.\",\n", " \"citations\": [\n", " {\n", " \"type\": \"char_location\",\n", " \"cited_text\": \"Once your order ships, you'll receive an email with a tracking number. 
\",\n", " \"document_title\": \"Order Tracking Information\"\n", " },\n", " {\n", " \"type\": \"char_location\",\n", " \"cited_text\": \"If you haven't received a tracking number within 48 hours of your order confirmation, please contact our customer support team.\",\n", " \"document_title\": \"Order Tracking Information\"\n", " }\n", " ]\n", " },\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"\\n\\nSince you just checked out, your order likely hasn't shipped yet. Once it ships, you'll receive the tracking information via email.\"\n", " }\n", " ]\n", "}\n" ] } ], "source": [ "# Read all help center articles and create a list of documents\n", "articles_dir = './data/help_center_articles'\n", "documents = []\n", "\n", "for filename in sorted(os.listdir(articles_dir)):\n", "    if filename.endswith('.txt'):\n", "        with open(os.path.join(articles_dir, filename), 'r') as f:\n", "            content = f.read()\n", "            # Split into title and body\n", "            title_line, body = content.split('\\n', 1)\n", "            title = title_line.replace('title: ', '')\n", "            documents.append({\n", "                \"type\": \"document\",\n", "                \"source\": {\n", "                    \"type\": \"text\",\n", "                    \"media_type\": \"text/plain\",\n", "                    \"data\": body\n", "                },\n", "                \"title\": title,\n", "                \"citations\": {\"enabled\": True}\n", "            })\n", "\n", "QUESTION = \"I just checked out, where is my order tracking number? Track package is not available on the website yet for my order.\"\n", "\n", "response = client.messages.create(\n", "    model=\"claude-3-5-sonnet-latest\",\n", "    temperature=0.0,\n", "    max_tokens=1024,\n", "    system='You are a customer support bot working for PetWorld. Your task is to provide short, helpful answers to user questions. Since you are in a chat interface avoid providing extra details. 
You will be given access to PetWorld\\'s help center articles to help you answer questions.',\n", "    messages=[\n", "        {\n", "            \"role\": \"user\",\n", "            \"content\": documents\n", "        },\n", "        {\n", "            \"role\": \"user\",\n", "            \"content\": [{\"type\": \"text\", \"text\": f'Here is the user\\'s question: {QUESTION}'}]\n", "        },\n", "\n", "    ]\n", ")\n", "\n", "def visualize_raw_response(response):\n", "    raw_response = {\"content\": []}\n", "\n", "    print(\"\\n\" + \"=\"*80 + \"\\nRaw response:\\n\" + \"=\"*80)\n", "\n", "    for content in response.content:\n", "        if content.type == \"text\":\n", "            block = {\n", "                \"type\": \"text\",\n", "                \"text\": content.text\n", "            }\n", "            if hasattr(content, 'citations') and content.citations:\n", "                block[\"citations\"] = []\n", "                for citation in content.citations:\n", "                    citation_dict = {\n", "                        \"type\": citation.type,\n", "                        \"cited_text\": citation.cited_text,\n", "                        \"document_title\": citation.document_title,\n", "                    }\n", "                    if citation.type == \"page_location\":\n", "                        citation_dict.update({\n", "                            \"start_page_number\": citation.start_page_number,\n", "                            \"end_page_number\": citation.end_page_number\n", "                        })\n", "                    block[\"citations\"].append(citation_dict)\n", "            raw_response[\"content\"].append(block)\n", "\n", "    return json.dumps(raw_response, indent=2)\n", "\n", "print(visualize_raw_response(response))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Visualizing Citations\n", "By leveraging the citation data, we can create UIs that:\n", "\n", "1. Show users exactly where information comes from\n", "2. Link directly to source documents\n", "3. Highlight cited text in context\n", "4. 
Build trust through transparent sourcing\n", "\n", "Below is a simple visualization function that transforms Claude's structured citations into a readable format with numbered references, similar to academic papers.\n", "\n", "The function takes Claude's response object and outputs:\n", "- Text with numbered citation markers (e.g., \"The answer [1] includes this fact [2]\")\n", "- A numbered reference list showing each cited text and its source document" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "================================================================================\n", "Formatted response:\n", "================================================================================\n", "Based on the documentation, I can explain why you don't see tracking yet: You'll receive an email with your tracking number once your order ships. If you don't receive a tracking number within 48 hours of your order confirmation, please contact our customer support team for assistance. [1] [2]\n", "\n", "Since you just checked out, your order likely hasn't shipped yet. 
Once it ships, you'll receive the tracking information via email.\n", "\n", "[1] \"Once your order ships, you'll receive an email with a tracking number.\" found in \"Order Tracking Information\"\n", "[2] \"If you haven't received a tracking number within 48 hours of your order confirmation, please contact our customer support team.\" found in \"Order Tracking Information\"\n" ] } ], "source": [ "def visualize_citations(response):\n", "    \"\"\"\n", "    Takes a response object and returns a string with numbered citations.\n", "    Example output: \"here is the plain text answer [1][2] here is some more text [3]\"\n", "    with a list of citations below.\n", "    \"\"\"\n", "    # Dictionary to store unique citations\n", "    citations_dict = {}\n", "    citation_counter = 1\n", "\n", "    # Final formatted text\n", "    formatted_text = \"\"\n", "    citations_list = []\n", "\n", "    print(\"\\n\" + \"=\"*80 + \"\\nFormatted response:\\n\" + \"=\"*80)\n", "\n", "    for content in response.content:\n", "        if content.type == \"text\":\n", "            text = content.text\n", "            if hasattr(content, 'citations') and content.citations:\n", "                # Sort citations by their position in the source document\n", "                def get_sort_key(citation):\n", "                    if hasattr(citation, 'start_char_index'):\n", "                        return citation.start_char_index\n", "                    elif hasattr(citation, 'start_page_number'):\n", "                        return citation.start_page_number\n", "                    elif hasattr(citation, 'start_block_index'):\n", "                        return citation.start_block_index\n", "                    return 0  # fallback\n", "\n", "                sorted_citations = sorted(content.citations, key=get_sort_key)\n", "\n", "                # Process each citation\n", "                for citation in sorted_citations:\n", "                    doc_title = citation.document_title\n", "                    cited_text = citation.cited_text.replace('\\n', ' ').replace('\\r', ' ')\n", "                    # Remove any multiple spaces that might have been created\n", "                    cited_text = ' '.join(cited_text.split())\n", "\n", "                    # Create a unique key for this citation\n", "                    citation_key = f\"{doc_title}:{cited_text}\"\n", "\n", "                    # If this 
is a new citation, add it to our dictionary\n", "                    if citation_key not in citations_dict:\n", "                        citations_dict[citation_key] = citation_counter\n", "                        citations_list.append(f\"[{citation_counter}] \\\"{cited_text}\\\" found in \\\"{doc_title}\\\"\")\n", "                        citation_counter += 1\n", "\n", "                    # Add the citation number to the text\n", "                    citation_num = citations_dict[citation_key]\n", "                    text += f\" [{citation_num}]\"\n", "\n", "            formatted_text += text\n", "\n", "    # Combine the formatted text with the citations list\n", "    final_output = formatted_text + \"\\n\\n\" + \"\\n\".join(citations_list)\n", "    return final_output\n", "\n", "formatted_response = visualize_citations(response)\n", "print(formatted_response)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### PDF Documents\n", "\n", "When working with PDFs, Claude can provide citations that reference specific page numbers, making it easy to track information sources. Here's how PDF citations work:\n", "\n", "- PDF document content is provided as base64-encoded data\n", "- Text is automatically chunked into sentences\n", "- Citations include page numbers (1-indexed) where the information was found\n", "- The model can cite multiple sentences together in a single citation but won't cite text smaller than a sentence\n", "- While images are processed, only text content can be cited at this time\n", "\n", "Below is an example using the Constitutional AI paper to demonstrate PDF citations:" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "================================================================================\n", "Raw response:\n", "================================================================================\n", "{\n", " \"content\": [\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"Based on the paper, here are the key aspects of Constitutional AI:\\n\\n\"\n", " },\n", " {\n", " \"type\": \"text\",\n", 
" \"text\": \"Constitutional AI is a method for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, hence the name \\\"Constitutional AI\\\".\",\n", " \"citations\": [\n", " {\n", " \"type\": \"page_location\",\n", " \"cited_text\": \"We experiment with methods for training a harmless AI assistant through self\\u0002improvement, without any human labels identifying harmful outputs. The only human\\r\\noversight is provided through a list of rules or principles, and so we refer to the method as\\r\\n\\u2018Constitutional AI\\u2019. \",\n", " \"document_title\": \"Constitutional AI Paper\",\n", " \"start_page_number\": 1,\n", " \"end_page_number\": 2\n", " }\n", " ]\n", " },\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"\\n\\nThe process involves two main phases:\\n\\n1. Supervised Learning Phase:\\n\"\n", " },\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"In this phase, they sample from an initial model, generate self-critiques and revisions, and then finetune the original model on revised responses.\",\n", " \"citations\": [\n", " {\n", " \"type\": \"page_location\",\n", " \"cited_text\": \"In the supervised phase we sample from an initial model, then generate\\r\\nself-critiques and revisions, and then finetune the original model on revised responses. \",\n", " \"document_title\": \"Constitutional AI Paper\",\n", " \"start_page_number\": 1,\n", " \"end_page_number\": 2\n", " }\n", " ]\n", " },\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"\\n\\n2. 
Reinforcement Learning Phase:\\n\"\n", " },\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"In this phase, they:\\n- Sample from the finetuned model\\n- Use a model to evaluate which of two samples is better\\n- Train a preference model from this dataset of AI preferences\\n- Use \\\"RL from AI Feedback\\\" (RLAIF)\",\n", " \"citations\": [\n", " {\n", " \"type\": \"page_location\",\n", " \"cited_text\": \"In\\r\\nthe RL phase, we sample from the finetuned model, use a model to evaluate which of the\\r\\ntwo samples is better, and then train a preference model from this dataset of AI prefer\\u0002ences. We then train with RL using the preference model as the reward signal, i.e. we\\r\\nuse \\u2018RL from AI Feedback\\u2019 (RLAIF). \",\n", " \"document_title\": \"Constitutional AI Paper\",\n", " \"start_page_number\": 1,\n", " \"end_page_number\": 2\n", " }\n", " ]\n", " },\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"\\n\\nThe key outcomes are:\\n\\n\"\n", " },\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"- They are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them\\n- Both the SL and RL methods can leverage chain-of-thought style reasoning to improve human-judged performance and transparency of AI decision making\\n- These methods make it possible to control AI behavior more precisely and with far fewer human labels\",\n", " \"citations\": [\n", " {\n", " \"type\": \"page_location\",\n", " \"cited_text\": \"As a result we are able to train a harmless but non\\u0002evasive AI assistant that engages with harmful queries by explaining its objections to them.\\r\\nBoth the SL and RL methods can leverage chain-of-thought style reasoning to improve the\\r\\nhuman-judged performance and transparency of AI decision making. 
These methods make\\r\\nit possible to control AI behavior more precisely and with far fewer human labels.\\r\\n\",\n", " \"document_title\": \"Constitutional AI Paper\",\n", " \"start_page_number\": 1,\n", " \"end_page_number\": 2\n", " }\n", " ]\n", " },\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"\\n\\n\"\n", " },\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"The ultimate goal is not to completely remove human supervision, but rather to make it more efficient, transparent and targeted. While this work reduces reliance on human supervision for harmlessness, they still relied on human supervision in the form of helpfulness labels. The researchers expect it is possible to achieve helpfulness and instruction-following without human feedback, starting from only a pretrained LM and extensive prompting, but leave this for future work.\",\n", " \"citations\": [\n", " {\n", " \"type\": \"page_location\",\n", " \"cited_text\": \"By removing human feedback labels for harmlessness, we have moved further away from reliance on human\\r\\nsupervision, and closer to the possibility of a self-supervised approach to alignment. However, in this work\\r\\nwe still relied on human supervision in the form of helpfulness labels. We expect it is possible to achieve help\\u0002fulness and instruction-following without human feedback, starting from only a pretrained LM and extensive\\r\\nprompting, but we leave this for future work.\\r\\nOur ultimate goal is not to remove human supervision entirely, but to make it more efficient, transparent, and\\r\\ntargeted. 
\",\n", " \"document_title\": \"Constitutional AI Paper\",\n", " \"start_page_number\": 15,\n", " \"end_page_number\": 16\n", " }\n", " ]\n", " }\n", " ]\n", "}\n", "\n", "================================================================================\n", "Formatted response:\n", "================================================================================\n", "Based on the paper, here are the key aspects of Constitutional AI:\n", "\n", "Constitutional AI is a method for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, hence the name \"Constitutional AI\". [1]\n", "\n", "The process involves two main phases:\n", "\n", "1. Supervised Learning Phase:\n", "In this phase, they sample from an initial model, generate self-critiques and revisions, and then finetune the original model on revised responses. [2]\n", "\n", "2. Reinforcement Learning Phase:\n", "In this phase, they:\n", "- Sample from the finetuned model\n", "- Use a model to evaluate which of two samples is better\n", "- Train a preference model from this dataset of AI preferences\n", "- Use \"RL from AI Feedback\" (RLAIF) [3]\n", "\n", "The key outcomes are:\n", "\n", "- They are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them\n", "- Both the SL and RL methods can leverage chain-of-thought style reasoning to improve human-judged performance and transparency of AI decision making\n", "- These methods make it possible to control AI behavior more precisely and with far fewer human labels [4]\n", "\n", "The ultimate goal is not to completely remove human supervision, but rather to make it more efficient, transparent and targeted. While this work reduces reliance on human supervision for harmlessness, they still relied on human supervision in the form of helpfulness labels. 
The researchers expect it is possible to achieve helpfulness and instruction-following without human feedback, starting from only a pretrained LM and extensive prompting, but leave this for future work. [5]\n", "\n", "[1] \"We experiment with methods for training a harmless AI assistant through self\u0002improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as ‘Constitutional AI’.\" found in \"Constitutional AI Paper\"\n", "[2] \"In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses.\" found in \"Constitutional AI Paper\"\n", "[3] \"In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI prefer\u0002ences. We then train with RL using the preference model as the reward signal, i.e. we use ‘RL from AI Feedback’ (RLAIF).\" found in \"Constitutional AI Paper\"\n", "[4] \"As a result we are able to train a harmless but non\u0002evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.\" found in \"Constitutional AI Paper\"\n", "[5] \"By removing human feedback labels for harmlessness, we have moved further away from reliance on human supervision, and closer to the possibility of a self-supervised approach to alignment. However, in this work we still relied on human supervision in the form of helpfulness labels. 
We expect it is possible to achieve help\u0002fulness and instruction-following without human feedback, starting from only a pretrained LM and extensive prompting, but we leave this for future work. Our ultimate goal is not to remove human supervision entirely, but to make it more efficient, transparent, and targeted.\" found in \"Constitutional AI Paper\"\n" ] } ], "source": [ "import base64\n", "import json\n", "\n", "# Read and encode the PDF\n", "pdf_path = 'data/Constitutional AI.pdf'\n", "with open(pdf_path, \"rb\") as f:\n", "    pdf_data = base64.b64encode(f.read()).decode()\n", "\n", "pdf_response = client.messages.create(\n", "    model=\"claude-3-5-sonnet-latest\",\n", "    temperature=0.0,\n", "    max_tokens=1024,\n", "    messages=[\n", "        {\n", "            \"role\": \"user\",\n", "            \"content\": [\n", "                {\n", "                    \"type\": \"document\",\n", "                    \"source\": {\n", "                        \"type\": \"base64\",\n", "                        \"media_type\": \"application/pdf\",\n", "                        \"data\": pdf_data\n", "                    },\n", "                    \"title\": \"Constitutional AI Paper\",\n", "                    \"citations\": {\"enabled\": True}\n", "                },\n", "                {\n", "                    \"type\": \"text\",\n", "                    \"text\": \"What is the main idea of Constitutional AI?\"\n", "                }\n", "            ]\n", "        }\n", "    ]\n", ")\n", "\n", "print(visualize_raw_response(pdf_response))\n", "print(visualize_citations(pdf_response))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Custom Content Documents\n", "\n", "While plain text documents are automatically chunked into sentences, custom content documents give you complete control over citation granularity. This API shape allows you to:\n", "\n", "* Define your own chunks of any size\n", "* Control the minimum citation unit\n", "* Optimize for documents that don't work well with sentence chunking\n", "\n", "In the example below, we use the same help center articles as the plain text example above, but instead of allowing sentence-level citations, we'll treat each article as a single chunk. 
This demonstrates how the choice of document type affects citation behavior and granularity. You will notice that the `cited_text` is the entire article in contrast to a sentence from the source article." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "================================================================================\n", "Raw response:\n", "================================================================================\n", "{\n", " \"content\": [\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"You should receive an email with your tracking number once your order ships. If it's been less than 48 hours since your order confirmation, please wait as the tracking number may not be available yet. If you haven't received a tracking number after 48 hours, please contact our customer support team for assistance.\",\n", " \"citations\": [\n", " {\n", " \"type\": \"content_block_location\",\n", " \"cited_text\": \"Once your order ships, you'll receive an email with a tracking number. To track your package, log in to your PetWorld account and go to \\\"Order History.\\\" Click on the order you want to track and select \\\"Track Package.\\\" This will show you the current status and estimated delivery date. You can also enter the tracking number directly on our shipping partner's website for more detailed information. If you haven't received a tracking number within 48 hours of your order confirmation, please contact our customer support team.\",\n", " \"document_title\": \"Order Tracking Information\"\n", " }\n", " ]\n", " }\n", " ]\n", "}\n", "\n", "================================================================================\n", "Formatted response:\n", "================================================================================\n", "You should receive an email with your tracking number once your order ships. 
If it's been less than 48 hours since your order confirmation, please wait as the tracking number may not be available yet. If you haven't received a tracking number after 48 hours, please contact our customer support team for assistance. [1]\n", "\n", "[1] \"Once your order ships, you'll receive an email with a tracking number. To track your package, log in to your PetWorld account and go to \"Order History.\" Click on the order you want to track and select \"Track Package.\" This will show you the current status and estimated delivery date. You can also enter the tracking number directly on our shipping partner's website for more detailed information. If you haven't received a tracking number within 48 hours of your order confirmation, please contact our customer support team.\" found in \"Order Tracking Information\"\n" ] } ], "source": [ "# Read all help center articles and create a list of custom content documents\n", "articles_dir = './data/help_center_articles'\n", "documents = []\n", "\n", "for filename in sorted(os.listdir(articles_dir)):\n", "    if filename.endswith('.txt'):\n", "        with open(os.path.join(articles_dir, filename), 'r') as f:\n", "            content = f.read()\n", "            # Split into title and body\n", "            title_line, body = content.split('\\n', 1)\n", "            title = title_line.replace('title: ', '')\n", "\n", "            documents.append({\n", "                \"type\": \"document\",\n", "                \"source\": {\n", "                    \"type\": \"content\",\n", "                    \"content\": [\n", "                        {\"type\": \"text\", \"text\": body}\n", "                    ]\n", "                },\n", "                \"title\": title,\n", "                \"citations\": {\"enabled\": True}\n", "            })\n", "\n", "QUESTION = \"I just checked out, where is my order tracking number? Track package is not available on the website yet for my order.\"\n", "\n", "custom_content_response = client.messages.create(\n", "    model=\"claude-3-5-sonnet-latest\",\n", "    temperature=0.0,\n", "    max_tokens=1024,\n", "    system='You are a customer support bot working for PetWorld. 
Your task is to provide short, helpful answers to user questions. Since you are in a chat interface avoid providing extra details. You will be given access to PetWorld\\'s help center articles to help you answer questions.',\n", "    messages=[\n", "        {\n", "            \"role\": \"user\",\n", "            \"content\": documents\n", "        },\n", "        {\n", "            \"role\": \"user\",\n", "            \"content\": [{\"type\": \"text\", \"text\": f'Here is the user\\'s question: {QUESTION}'}]\n", "        }\n", "    ]\n", ")\n", "\n", "print(visualize_raw_response(custom_content_response))\n", "print(visualize_citations(custom_content_response))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using the Context Field\n", "\n", "The `context` field allows you to provide additional information about a document that Claude can use when generating responses, but that won't be cited. This is useful for:\n", "\n", "* Providing metadata about the document (e.g., publication date, author)\n", "* [Contextual retrieval](https://www.anthropic.com/news/contextual-retrieval)\n", "* Including usage instructions or context that shouldn't be directly cited\n", "\n", "In the example below, we provide a loyalty program article with a warning in the context field. Notice how Claude can use the information in the context to inform its response but the context field content is not available for citation." 
] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "================================================================================\n", "Raw response:\n", "================================================================================\n", "{\n", " \"content\": [\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"Let me explain PetWorld's loyalty program based on the provided information:\\n\\n\"\n", " },\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"PetWorld's loyalty program is straightforward - you earn 1 point for every dollar you spend. These points can be redeemed once you reach 100 points, which will get you a $5 reward that you can use on your next purchase.\",\n", " \"citations\": [\n", " {\n", " \"type\": \"char_location\",\n", " \"cited_text\": \"PetWorld offers a loyalty program where customers earn 1 point for every dollar spent. Once you accumulate 100 points, you'll receive a $5 reward that can be used on your next purchase. \",\n", " \"document_title\": \"Loyalty Program Details\"\n", " }\n", " ]\n", " },\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"\\n\\n\"\n", " },\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"Points have an expiration period of 12 months from the date they are earned.\",\n", " \"citations\": [\n", " {\n", " \"type\": \"char_location\",\n", " \"cited_text\": \"Points expire 12 months after they are earned. 
\",\n", " \"document_title\": \"Loyalty Program Details\"\n", " }\n", " ]\n", " },\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"\\n\\n\"\n", " },\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"You can easily keep track of your points by either checking your account dashboard or contacting customer service.\",\n", " \"citations\": [\n", " {\n", " \"type\": \"char_location\",\n", " \"cited_text\": \"You can check your point balance in your account dashboard or by asking customer service.\",\n", " \"document_title\": \"Loyalty Program Details\"\n", " }\n", " ]\n", " },\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"\\n\\nPlease note that since this information is from an article that hasn't been updated in 12 months, some details of the program may have changed. It would be best to verify the current terms with PetWorld directly.\"\n", " }\n", " ]\n", "}\n", "\n", "================================================================================\n", "Formatted response:\n", "================================================================================\n", "Let me explain PetWorld's loyalty program based on the provided information:\n", "\n", "PetWorld's loyalty program is straightforward - you earn 1 point for every dollar you spend. These points can be redeemed once you reach 100 points, which will get you a $5 reward that you can use on your next purchase. [1]\n", "\n", "Points have an expiration period of 12 months from the date they are earned. [2]\n", "\n", "You can easily keep track of your points by either checking your account dashboard or contacting customer service. [3]\n", "\n", "Please note that since this information is from an article that hasn't been updated in 12 months, some details of the program may have changed. It would be best to verify the current terms with PetWorld directly.\n", "\n", "[1] \"PetWorld offers a loyalty program where customers earn 1 point for every dollar spent. 
Once you accumulate 100 points, you'll receive a $5 reward that can be used on your next purchase.\" found in \"Loyalty Program Details\"\n", "[2] \"Points expire 12 months after they are earned.\" found in \"Loyalty Program Details\"\n", "[3] \"You can check your point balance in your account dashboard or by asking customer service.\" found in \"Loyalty Program Details\"\n" ] } ], "source": [ "import json\n", "\n", "# Create a document with a context field\n", "document = {\n", " \"type\": \"document\",\n", " \"source\": {\n", " \"type\": \"text\",\n", " \"media_type\": \"text/plain\",\n", " \"data\": \"PetWorld offers a loyalty program where customers earn 1 point for every dollar spent. Once you accumulate 100 points, you'll receive a $5 reward that can be used on your next purchase. Points expire 12 months after they are earned. You can check your point balance in your account dashboard or by asking customer service.\"\n", " },\n", " \"title\": \"Loyalty Program Details\",\n", " \"context\": \"WARNING: This article has not been updated in 12 months. Content may be out of date. Be sure to inform the user this content may be incorrect after providing guidance.\",\n", " \"citations\": {\"enabled\": True}\n", "}\n", "\n", "QUESTION = \"How does PetWorld's loyalty program work? When do points expire?\"\n", "\n", "context_response = client.messages.create(\n", " model=\"claude-3-5-sonnet-latest\",\n", " temperature=0.0,\n", " max_tokens=1024,\n", " messages=[\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " document,\n", " {\n", " \"type\": \"text\",\n", " \"text\": QUESTION\n", " }\n", " ]\n", " }\n", " ]\n", ")\n", "\n", "print(visualize_raw_response(context_response))\n", "print(visualize_citations(context_response))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### PDF Highlighting\n", "\n", "One limitation of PDF citations is that only page numbers are returned. 
You can use a third-party library to match the returned cited text against the page contents and highlight it. This cell demonstrates PDF citation highlighting using Claude and PyMuPDF, creating a new annotated PDF:" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "================================================================================\n", "Raw response:\n", "================================================================================\n", "{\n", " \"content\": [\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"According to the letter, \"\n", " },\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"Amazon's total revenue grew 12% year-over-year (\\\"YoY\\\") from $514B to $575B in 2023\",\n", " \"citations\": [\n", " {\n", " \"type\": \"page_location\",\n", " \"cited_text\": \"In 2023, Amazon\\u2019s total revenue grew 12% year-over-year (\\u201cYoY\\u201d) from $514B to $575B. 
\",\n", " \"document_title\": \"Amazon 2023 Shareholder Letter\",\n", " \"start_page_number\": 1,\n", " \"end_page_number\": 2\n", " }\n", " ]\n", " },\n", " {\n", " \"type\": \"text\",\n", " \"text\": \".\\n\\nBreaking this down by segment:\\n\"\n", " },\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"\\n- North America revenue increased 12% YoY from $316B to $353B\\n- International revenue grew 11% YoY from $118B to $131B \\n- AWS revenue increased 13% YoY from $80B to $91B\",\n", " \"citations\": [\n", " {\n", " \"type\": \"page_location\",\n", " \"cited_text\": \"By segment, North\\r\\nAmerica revenue increased 12% YoY from $316B to $353B, International revenue grew 11% YoY from\\r\\n$118B to $131B, and AWS revenue increased 13% YoY from $80B to $91B.\\r\\n\",\n", " \"document_title\": \"Amazon 2023 Shareholder Letter\",\n", " \"start_page_number\": 1,\n", " \"end_page_number\": 2\n", " }\n", " ]\n", " }\n", " ]\n", "}\n", "Found cited text on page 1\n", "Found cited text on page 1\n", "\n", "Created highlighted PDF at: data/Amazon-com-Inc-2023-Shareholder-Letter-highlighted.pdf\n" ] } ], "source": [ "import fitz # PyMuPDF\n", "\n", "# Setup paths and read PDF\n", "pdf_path = 'data/Amazon-com-Inc-2023-Shareholder-Letter.pdf'\n", "output_pdf_path = 'data/Amazon-com-Inc-2023-Shareholder-Letter-highlighted.pdf'\n", "\n", "# Read and encode the PDF\n", "with open(pdf_path, \"rb\") as f:\n", " pdf_data = base64.b64encode(f.read()).decode()\n", "\n", "response = client.messages.create(\n", " model=\"claude-3-5-sonnet-latest\",\n", " max_tokens=1024,\n", " temperature=0,\n", " messages=[\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\n", " \"type\": \"document\",\n", " \"source\": {\n", " \"type\": \"base64\",\n", " \"media_type\": \"application/pdf\",\n", " \"data\": pdf_data\n", " },\n", " \"title\": \"Amazon 2023 Shareholder Letter\",\n", " \"citations\": {\"enabled\": True}\n", " },\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"What was 
Amazon's total revenue in 2023 and how much did it grow year-over-year?\"\n", " }\n", " ]\n", " }\n", " ]\n", ")\n", "\n", "print(visualize_raw_response(response))\n", "\n", "# Collect PDF citations\n", "pdf_citations = []\n", "for content in response.content:\n", " if hasattr(content, 'citations') and content.citations:\n", " for citation in content.citations:\n", " if citation.type == \"page_location\":\n", " pdf_citations.append(citation)\n", "\n", "doc = fitz.open(pdf_path)\n", "\n", "# Process each citation\n", "for citation in pdf_citations:\n", " if citation.type == \"page_location\":\n", " # Remove U+0002 control characters from the cited text before searching\n", " text_to_find = citation.cited_text.replace('\\u0002', '')\n", " start_page = citation.start_page_number - 1 # Convert to 0-based index\n", " # end_page_number is exclusive, so the last cited page has 0-based index end_page_number - 2\n", " end_page = citation.end_page_number - 2\n", " \n", " # Process each page in the citation range\n", " for page_num in range(start_page, end_page + 1):\n", " page = doc[page_num]\n", " \n", " text_instances = page.search_for(text_to_find.strip())\n", " \n", " if text_instances:\n", " print(f\"Found cited text on page {page_num + 1}\")\n", " for inst in text_instances:\n", " highlight = page.add_highlight_annot(inst)\n", " highlight.set_colors({\"stroke\":(1, 1, 0)}) # Yellow highlight\n", " highlight.update()\n", " else:\n", " print(f\"{text_to_find} not found on page {page_num + 1}\")\n", "\n", "# Save the new PDF\n", "doc.save(output_pdf_path)\n", "doc.close()\n", "\n", "print(f\"\\nCreated highlighted PDF at: {output_pdf_path}\")" ] } ], "metadata": { "kernelspec": { "display_name": "py311", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.11" } }, "nbformat": 4, "nbformat_minor": 4 }
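
The page-index arithmetic in the highlighting loop is easy to get wrong, so here is a minimal sketch of it in isolation. It assumes that `start_page_number` is 1-based and `end_page_number` is exclusive, which matches the sample output above (a citation with `start_page_number` 1 and `end_page_number` 2 is found only on page 1). The helper name `citation_page_indices` is hypothetical and is not part of the Anthropic SDK or PyMuPDF.

```python
def citation_page_indices(start_page_number: int, end_page_number: int) -> range:
    """Map a 1-based, end-exclusive page span to 0-based page indices.

    Assumption: the span follows the convention seen in the sample output,
    where (start=1, end=2) refers to the first page only.
    """
    # Shifting both bounds by one keeps the end exclusive, which is
    # exactly what range() expects.
    return range(start_page_number - 1, end_page_number - 1)


# A citation covering only the first page of the PDF
print(list(citation_page_indices(1, 2)))
```

This yields the same pages as the notebook's `range(start_page, end_page + 1)` with `end_page = end_page_number - 2`, but keeps the exclusive end explicit instead of folding it into a `- 2` offset.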
