misc/prompt_caching.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Prompt caching through the Anthropic API\n", "\n", "Prompt caching allows you to store and reuse context within your prompt. This makes it more practical to include additional information in your prompt—such as detailed instructions and example responses—which helps improve every response Claude generates.\n", "\n", "In addition, by fully leveraging prompt caching, you can reduce latency by more than 2x and costs by up to 90%. This can generate significant savings when building solutions that repeatedly pass the same detailed document content to Claude.\n", "\n", "In this cookbook, we will demonstrate how to use prompt caching in a single turn and across a multi-turn conversation.\n" ] }, { "cell_type": "markdown", "metadata": { "vscode": { "languageId": "plaintext" } }, "source": [ "## Setup\n", "\n", "First, let's set up our environment with the necessary imports and initializations:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install anthropic bs4 --quiet" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import anthropic\n", "import time\n", "import requests\n", "from bs4 import BeautifulSoup\n", "\n", "client = anthropic.Anthropic()\n", "MODEL_NAME = \"claude-3-5-sonnet-20241022\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's fetch some text content to use in our examples. We'll use the text of Pride and Prejudice by Jane Austen, which is roughly 187,000 tokens long." ] },
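{ "cell_type": "markdown", "metadata": {}, "source": [ "One caveat worth knowing before we start: a prompt prefix must exceed a minimum length before it can be cached (1024 tokens for Claude 3.5 Sonnet at the time of writing), and cached prefixes have a minimum five-minute lifetime that is refreshed each time the cache is read. Our book is far above the length threshold, and we make our calls back to back, so both constraints are easy to satisfy here; check the prompt caching documentation for the current limits." ] },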
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fetched 737525 characters from the book.\n", "First 500 characters:\n", "The Project Gutenberg eBook of Pride and Prejudice\n", "This ebook is for the use of anyone anywhere in the United States and\n", "most other parts of the world at no cost and with almost no restrictions\n", "whatsoever. You may copy it, give it away or re-use it under the terms\n", "of the Project Gutenberg License included with this ebook or online\n", "at www.gutenberg.org. If you are not located in the United States,\n", "you will have to check the laws of the country where you are located\n", "before using this eBook.\n", "Title:\n" ] } ], "source": [ "def fetch_article_content(url):\n", " response = requests.get(url)\n", " soup = BeautifulSoup(response.content, 'html.parser')\n", " \n", " # Remove script and style elements\n", " for script in soup([\"script\", \"style\"]):\n", " script.decompose()\n", " \n", " # Get text\n", " text = soup.get_text()\n", " \n", " # Break into lines and remove leading and trailing space on each\n", " lines = (line.strip() for line in text.splitlines())\n", " # Break multi-headlines into a line each (split on double spaces)\n", " chunks = (phrase.strip() for line in lines for phrase in line.split(\"  \"))\n", " # Drop blank lines\n", " text = '\\n'.join(chunk for chunk in chunks if chunk)\n", " \n", " return text\n", "\n", "# Fetch the content of the book\n", "book_url = \"https://www.gutenberg.org/cache/epub/1342/pg1342.txt\"\n", "book_content = fetch_article_content(book_url)\n", "\n", "print(f\"Fetched {len(book_content)} characters from the book.\")\n", "print(\"First 500 characters:\")\n", "print(book_content[:500])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 1: Single turn\n", "\n", "Let's demonstrate prompt caching with a large document, comparing the performance and cost of cached and non-cached API calls." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 1: Non-cached API Call\n", "\n", "First, let's make an API call that cannot benefit from an existing cache. Because the request includes a \"cache_control\" breakpoint, this first call writes the prompt to the cache, so our subsequent API calls can benefit from prompt caching.\n", "\n", "We will ask for a short output string to keep the response time low, since the benefit of prompt caching applies only to the input processing time." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Non-cached API call time: 20.37 seconds\n", "Non-cached API call input tokens: 17\n", "Non-cached API call output tokens: 8\n", "\n", "Summary (non-cached):\n", "[TextBlock(text='Pride and Prejudice', type='text')]\n" ] } ], "source": [ "def make_non_cached_api_call():\n", " messages = [\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"<book>\" + book_content + \"</book>\",\n", " \"cache_control\": {\"type\": \"ephemeral\"}\n", " },\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"What is the title of this book? Only output the title.\"\n", " }\n", " ]\n", " }\n", " ]\n", "\n", " start_time = time.time()\n", " response = client.messages.create(\n", " model=MODEL_NAME,\n", " max_tokens=300,\n", " messages=messages,\n", " extra_headers={\"anthropic-beta\": \"prompt-caching-2024-07-31\"}\n", " )\n", " end_time = time.time()\n", "\n", " return response, end_time - start_time\n", "\n", "non_cached_response, non_cached_time = make_non_cached_api_call()\n", "\n", "print(f\"Non-cached API call time: {non_cached_time:.2f} seconds\")\n", "print(f\"Non-cached API call input tokens: {non_cached_response.usage.input_tokens}\")\n", "print(f\"Non-cached API call output tokens: {non_cached_response.usage.output_tokens}\")\n", "\n", "print(\"\\nSummary (non-cached):\")\n", "print(non_cached_response.content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 2: Cached API Call\n", "\n", "Now, let's make a cached API call. The request is identical: it carries the same \"cache_control\": {\"type\": \"ephemeral\"} attribute on the content object and the same \"prompt-caching-2024-07-31\" beta header. Because the previous call already wrote the prompt prefix to the cache, this call reads it from the cache instead of reprocessing it.\n", "\n", "To keep the output latency constant, we will ask Claude the same question as before. Note that this question is not part of the cached content." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cached API call time: 2.92 seconds\n", "Cached API call input tokens: 17\n", "Cached API call output tokens: 8\n", "\n", "Summary (cached):\n", "[TextBlock(text='Pride and Prejudice', type='text')]\n" ] } ], "source": [ "def make_cached_api_call():\n", " messages = [\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"<book>\" + book_content + \"</book>\",\n", " \"cache_control\": {\"type\": \"ephemeral\"}\n", " },\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"What is the title of this book? Only output the title.\"\n", " }\n", " ]\n", " }\n", " ]\n", "\n", " start_time = time.time()\n", " response = client.messages.create(\n", " model=MODEL_NAME,\n", " max_tokens=300,\n", " messages=messages,\n", " extra_headers={\"anthropic-beta\": \"prompt-caching-2024-07-31\"}\n", " )\n", " end_time = time.time()\n", "\n", " return response, end_time - start_time\n", "\n", "cached_response, cached_time = make_cached_api_call()\n", "\n", "print(f\"Cached API call time: {cached_time:.2f} seconds\")\n", "print(f\"Cached API call input tokens: {cached_response.usage.input_tokens}\")\n", "print(f\"Cached API call output tokens: {cached_response.usage.output_tokens}\")\n", "\n", "print(\"\\nSummary (cached):\")\n", "print(cached_response.content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the cached API call took only 2.92 seconds compared to 20.37 seconds for the non-cached API call (your exact timings will vary between runs). This is a significant improvement in overall latency due to caching." ] },
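{ "cell_type": "markdown", "metadata": {}, "source": [ "We can also estimate what the two calls cost from their usage objects. The sketch below is illustrative, not authoritative: the helper estimate_input_cost and the constant BASE_INPUT_PRICE_PER_MTOK are ours, and we assume a base input price of $3 per million tokens for this model, with cache writes billed at 1.25x the base input rate and cache hits at 0.1x, matching Anthropic's published prompt caching pricing at the time of writing. Check the current pricing page before relying on these numbers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Assumed pricing (verify against Anthropic's current pricing page):\n", "# base input at $3 per million tokens, cache writes at 1.25x base,\n", "# cache hits at 0.1x base.\n", "BASE_INPUT_PRICE_PER_MTOK = 3.00\n", "\n", "def estimate_input_cost(usage):\n", "    # Weight each token category by its assumed price multiplier\n", "    base = usage.input_tokens\n", "    cache_write = getattr(usage, 'cache_creation_input_tokens', 0) or 0\n", "    cache_read = getattr(usage, 'cache_read_input_tokens', 0) or 0\n", "    weighted = base + 1.25 * cache_write + 0.10 * cache_read\n", "    return weighted * BASE_INPUT_PRICE_PER_MTOK / 1_000_000\n", "\n", "print(f\"Estimated input cost (cache write call): ${estimate_input_cost(non_cached_response.usage):.4f}\")\n", "print(f\"Estimated input cost (cache read call): ${estimate_input_cost(cached_response.usage):.4f}\")" ] },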
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Example 2: Multi-turn Conversation with Incremental Caching\n", "\n", "Now, let's look at a multi-turn conversation where we add cache breakpoints as the conversation progresses." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Turn 1:\n", "User: What is the title of this novel?\n", "Assistant: The title of this novel is \"Pride and Prejudice\" by Jane Austen.\n", "User input tokens: 4\n", "Output tokens: 22\n", "Input tokens (cache read): 0\n", "Input tokens (cache write): 187354\n", "0.0% of input prompt cached (4 tokens)\n", "Time taken: 20.37 seconds\n", "\n", "Turn 2:\n", "User: Who are Mr. and Mrs. Bennet?\n", "Assistant: Mr. and Mrs. Bennet are the parents of five daughters (Jane, Elizabeth, Mary, Kitty, and Lydia) in Pride and Prejudice. \n", "\n", "Mr. Bennet is an intelligent but detached father who often retreats to his library to avoid his wife's dramatics. He has a satirical wit and tends to be amused by the follies of others, including his own family members. He shows particular fondness for his second daughter Elizabeth, who shares his sharp mind and wit.\n", "\n", "Mrs. Bennet is a woman primarily focused on getting her five daughters married to wealthy men. She is described as having \"poor nerves\" and is often anxious, dramatic, and somewhat foolish. 
Her main goal in life is to see her daughters well-married, particularly because their family estate is entailed to a male heir (Mr. Collins), meaning her daughters will be left with little financial security after Mr. Bennet's death. She is often described as lacking sophistication and good judgment, which sometimes embarrasses her more sensible daughters, particularly Elizabeth.\n", "\n", "Their marriage is portrayed as an ill-matched one, where Mr. Bennet married Mrs. Bennet in his youth because of her beauty, only to discover they were incompatible in terms of intellect and character. This serves as a cautionary tale about marrying without proper consideration of character compatibility.\n", "User input tokens: 4\n", "Output tokens: 297\n", "Input tokens (cache read): 187354\n", "Input tokens (cache write): 36\n", "100.0% of input prompt cached (187358 tokens)\n", "Time taken: 7.53 seconds\n", "\n", "Turn 3:\n", "User: What is Netherfield Park?\n", "Assistant: Netherfield Park is a large estate near the Bennet family home of Longbourn in the novel. It becomes significant to the plot when it is rented by Mr. Bingley, a wealthy young man who moves into the neighborhood. \n", "\n", "The arrival of Mr. Bingley at Netherfield sets much of the novel's plot in motion, as he quickly develops a romantic interest in Jane Bennet, the eldest Bennet daughter. It's also through Netherfield that Elizabeth Bennet first encounters Mr. Darcy, who is staying there as Mr. Bingley's friend.\n", "\n", "Netherfield serves as an important setting for several key scenes in the novel, including:\n", "- The ball where Mr. Darcy first slights Elizabeth\n", "- Jane's illness and subsequent stay at Netherfield (where Elizabeth comes to nurse her)\n", "- Various social interactions between the Bennets and the Bingley-Darcy party\n", "\n", "The estate symbolizes wealth and social status in the novel, and its occupancy by Mr. Bingley represents the possibility of social and financial advancement for the Bennet family through marriage. When Bingley suddenly leaves Netherfield, it creates significant disappointment and disruption in the hopes of the Bennet family, particularly for Jane.\n", "User input tokens: 4\n", "Output tokens: 289\n", "Input tokens (cache read): 187390\n", "Input tokens (cache write): 308\n", "100.0% of input prompt cached (187394 tokens)\n", "Time taken: 6.76 seconds\n", "\n", "Turn 4:\n", "User: What is the main theme of this novel?\n", "Assistant: The main theme of \"Pride and Prejudice\" is the interplay between pride and prejudice in relationships, particularly as shown through the central romance between Elizabeth Bennet and Mr. Darcy. However, there are several important related themes:\n", "\n", "1. Pride and Prejudice:\n", "- Darcy's pride in his social position initially makes him appear arrogant and disdainful\n", "- Elizabeth's prejudice against Darcy based on first impressions and Wickham's false account\n", "- Both characters must overcome these flaws to find happiness together\n", "\n", "2. Marriage and Social Class:\n", "- The pressure on young women to marry well for financial security\n", "- The conflict between marrying for love versus social advantage\n", "- Different types of marriages are portrayed (Elizabeth/Darcy, Jane/Bingley, Lydia/Wickham, Charlotte/Collins)\n", "\n", "3. 
Reputation and Social Expectations:\n", "- The importance of reputation in Regency society\n", "- How behavior reflects on family honor\n", "- The restrictions placed on women in this period\n", "\n", "4. Personal Growth and Self-Knowledge:\n", "- Elizabeth and Darcy both learn to recognize their own faults\n", "- The importance of overcoming first impressions\n", "- Character development through experience and reflection\n", "\n", "5. Family and Society:\n", "- The role of family connections in determining social status\n", "- The impact of family behavior on individual prospects\n", "- The balance between\n", "User input tokens: 4\n", "Output tokens: 300\n", "Input tokens (cache read): 187698\n", "Input tokens (cache write): 301\n", "100.0% of input prompt cached (187702 tokens)\n", "Time taken: 7.13 seconds\n" ] } ], "source": [ "class ConversationHistory:\n", " def __init__(self):\n", " # Initialize an empty list to store conversation turns\n", " self.turns = []\n", "\n", " def add_turn_assistant(self, content):\n", " # Add an assistant's turn to the conversation history\n", " self.turns.append({\n", " \"role\": \"assistant\",\n", " \"content\": [\n", " {\n", " \"type\": \"text\",\n", " \"text\": content\n", " }\n", " ]\n", " })\n", "\n", " def add_turn_user(self, content):\n", " # Add a user's turn to the conversation history\n", " self.turns.append({\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\n", " \"type\": \"text\",\n", " \"text\": content\n", " }\n", " ]\n", " })\n", "\n", " def get_turns(self):\n", " # Retrieve conversation turns with specific formatting\n", " result = []\n", " user_turns_processed = 0\n", " # Iterate through turns in reverse order\n", " for turn in reversed(self.turns):\n", " if turn[\"role\"] == \"user\" and user_turns_processed < 1:\n", " # Add the last user turn with ephemeral cache control\n", " result.append({\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\n", " \"type\": \"text\",\n", " \"text\": turn[\"content\"][0][\"text\"],\n", " \"cache_control\": {\"type\": \"ephemeral\"}\n", " }\n", " ]\n", " })\n", " user_turns_processed += 1\n", " else:\n", " # Add other turns as they are\n", " result.append(turn)\n", " # Return the turns in the original order\n", " return list(reversed(result))\n",
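"\n", "# How the incremental caching below works: the system prompt holding the full\n", "# book gets its own cache breakpoint, and get_turns() marks only the newest\n", "# user message as ephemeral. On each request the API can read back the longest\n", "# previously cached prefix (the book plus all earlier turns) and only writes\n", "# the short new suffix, which is why the stats printed below show large cache\n", "# reads and small cache writes from turn 2 onward.\n",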
"\n", "# Initialize the conversation history\n", "conversation_history = ConversationHistory()\n", "\n", "# System message containing the book content\n", "# Note: 'book_content' was defined earlier in this notebook\n", "system_message = f\"<file_contents> {book_content} </file_contents>\"\n", "\n", "# Predefined questions for our simulation\n", "questions = [\n", " \"What is the title of this novel?\",\n", " \"Who are Mr. and Mrs. Bennet?\",\n", " \"What is Netherfield Park?\",\n", " \"What is the main theme of this novel?\"\n", "]\n", "\n", "def simulate_conversation():\n", " for i, question in enumerate(questions, 1):\n", " print(f\"\\nTurn {i}:\")\n", " print(f\"User: {question}\")\n", " \n", " # Add user input to conversation history\n", " conversation_history.add_turn_user(question)\n", "\n", " # Record the start time for performance measurement\n", " start_time = time.time()\n", "\n", " # Make an API call to the assistant\n", " response = client.messages.create(\n", " model=MODEL_NAME,\n", " extra_headers={\n", " \"anthropic-beta\": \"prompt-caching-2024-07-31\"\n", " },\n", " max_tokens=300,\n", " system=[\n", " {\"type\": \"text\", \"text\": system_message, \"cache_control\": {\"type\": \"ephemeral\"}},\n", " ],\n", " messages=conversation_history.get_turns(),\n", " )\n", "\n", " # Record the end time\n", " end_time = time.time()\n", "\n", " # Extract the assistant's reply\n", " assistant_reply = response.content[0].text\n", " print(f\"Assistant: {assistant_reply}\")\n", "\n", " # Print token usage information\n", " input_tokens = response.usage.input_tokens\n", " output_tokens = response.usage.output_tokens\n", " input_tokens_cache_read = getattr(response.usage, 'cache_read_input_tokens', '---')\n", " input_tokens_cache_create = getattr(response.usage, 'cache_creation_input_tokens', '---')\n", " print(f\"User input tokens: {input_tokens}\")\n", " print(f\"Output tokens: {output_tokens}\")\n", " print(f\"Input tokens (cache read): {input_tokens_cache_read}\")\n", " print(f\"Input tokens (cache write): {input_tokens_cache_create}\")\n", "\n", " # Calculate and print the elapsed time\n", " elapsed_time = end_time - start_time\n", "\n", " # Calculate the percentage of input prompt cached\n", " total_input_tokens = input_tokens + (int(input_tokens_cache_read) if input_tokens_cache_read != '---' else 0)\n", " percentage_cached = (int(input_tokens_cache_read) / total_input_tokens * 100 if input_tokens_cache_read != '---' and total_input_tokens > 0 else 0)\n", "\n", " print(f\"{percentage_cached:.1f}% of input prompt cached ({total_input_tokens} tokens)\")\n", " print(f\"Time taken: {elapsed_time:.2f} seconds\")\n", "\n", " # Add assistant's reply to conversation history\n", " conversation_history.add_turn_assistant(assistant_reply)\n", "\n", "# Run the simulated conversation\n", "simulate_conversation()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see in this example, response times dropped from about 20 seconds on the first turn to roughly 7 seconds on subsequent turns once the cache was in place, while maintaining the same level of quality across the answers. Most of the remaining latency is the time it takes to generate each response, which prompt caching does not affect.\n", "\n", "And since nearly 100% of input tokens were read from the cache on subsequent turns, with the cache breakpoint moving forward as the conversation grew, the large input prefix was processed almost instantly on every turn." ] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 2 }