misc/building_moderation

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Building a moderation filter with Claude\n", "This guide will show you how to use Claude to build a content moderation filter for user-generated text. The key idea is to define the moderation rules and categories directly in the prompt, allowing for easy customization and experimentation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic Approach\n", "The basic approach is to provide Claude with a prompt that describes the categories you want to filter for (e.g. \"ALLOW\" and \"BLOCK\"), along with detailed descriptions or examples of what kinds of content should fall into each category. Then, you insert the user-generated text to be classified as part of the prompt, and ask Claude to categorize it based on the provided guidelines.\n", "\n", "Here's an example prompt structure:\n", "\n", "```text\n", "You are a content moderation expert tasked with categorizing user-generated text based on the following guidelines:\n", "\n", "BLOCK CATEGORY:\n", "- [Description or examples of content that should be blocked]\n", "\n", "ALLOW CATEGORY:\n", "- [Description or examples of content that is allowed]\n", "\n", "Here is the user-generated text to categorize:\n", "<user_text>{{USER_TEXT}}</user_text>\n", "\n", "Based on the guidelines above, classify this text as either ALLOW or BLOCK. Return nothing else.\n", "```\n", "\n", "To use this, you would replace `{{USER_TEXT}}` with the actual user-generated text to be classified, and then send the prompt to Claude using the Anthropic API. Claude's response should be either \"ALLOW\" or \"BLOCK\", indicating how the text should be handled based on your provided guidelines." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example usage\n", "Here's some example Python code that demonstrates how to use this approach:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install anthropic" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from anthropic import Anthropic\n", "client = Anthropic()\n", "MODEL_NAME = \"claude-3-haiku-20240307\"\n", "\n", "def moderate_text(user_text, guidelines):\n", " prompt_template = \"\"\"\n", " You are a content moderation expert tasked with categorizing user-generated text based on the following guidelines:\n", "\n", " {guidelines}\n", "\n", " Here is the user-generated text to categorize:\n", " <user_text>{user_text}</user_text>\n", "\n", " Based on the guidelines above, classify this text as either ALLOW or BLOCK. Return nothing else.\n", " \"\"\"\n", "\n", " # Format the prompt with the user text\n", " prompt = prompt_template.format(user_text=user_text, guidelines=guidelines)\n", "\n", " # Send the prompt to Claude and get the response\n", " response = client.messages.create(\n", " model=MODEL_NAME,\n", " max_tokens=10,\n", " messages=[{\"role\": \"user\", \"content\": prompt}]\n", " ).content[0].text\n", "\n", " return response" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here's an example of how you could use this function to moderate an array of user comments:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Comment: This movie was great, I really enjoyed it. The main actor really killed it!\n", "Classification: ALLOW\n", "\n", "Comment: Delete this post now or you better hide. I am coming after you and your family.\n", "Classification: BLOCK\n", "\n", "Comment: Stay away from the 5G cellphones!! They are using 5G to control you.\n", "Classification: BLOCK\n", "\n", "Comment: Thanks for the helpful information!\n", "Classification: ALLOW\n", "\n" ] } ], "source": [ "example_guidelines = '''BLOCK CATEGORY:\n", " - Promoting violence, illegal activities, or hate speech\n", " - Explicit sexual content\n", " - Harmful misinformation or conspiracy theories\n", "\n", " ALLOW CATEGORY:\n", " - Most other content is allowed, as long as it is not explicitly disallowed\n", "'''\n", "\n", "user_comments = [\n", " \"This movie was great, I really enjoyed it. The main actor really killed it!\",\n", " \"Delete this post now or you better hide. I am coming after you and your family.\",\n", " \"Stay away from the 5G cellphones!! They are using 5G to control you.\",\n", " \"Thanks for the helpful information!\",\n", "]\n", "\n", "for comment in user_comments:\n", " classification = moderate_text(comment, example_guidelines)\n", " print(f\"Comment: {comment}\\nClassification: {classification}\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Customization\n", "\n", "One of the key benefits of this approach is that you can easily customize the moderation rules by modifying the descriptions or examples provided in the prompt for the \"BLOCK\" and \"ALLOW\" categories. This allows you to fine-tune the filtering to suit your specific needs or preferences.\n", "\n", "For example, if you wanted to Claude to moderate a rollercoaster enthusiast forum and ensure posts stay on topic, you could update the \"ALLOW\" and \"BLOCK\" category descriptions accordingly:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Title: Top 10 Wildest Inversions on Steel Coasters\n", "Classification: ALLOW\n", "\n", "Title: My Review of the New RMC Raptor Coaster at Cedar Point\n", "Classification: ALLOW\n", "\n", "Title: Best Places to Buy Cheap Hiking Gear\n", "Classification: BLOCK\n", "\n", "Title: Rumor: Is Six Flags Planning a Giga Coaster for 2025?\n", "Classification: ALLOW\n", "\n", "Title: My Thoughts on the Latest Marvel Movie\n", "Classification: BLOCK\n", "\n" ] } ], "source": [ "rollercoaster_guidelines = '''BLOCK CATEGORY:\n", "- Content that is not related to rollercoasters, theme parks, or the amusement industry\n", "- Explicit violence, hate speech, or illegal activities\n", "- Spam, advertisements, or self-promotion\n", "\n", "ALLOW CATEGORY:\n", "- Discussions about rollercoaster designs, ride experiences, and park reviews\n", "- Sharing news, rumors, or updates about new rollercoaster projects\n", "- Respectful debates about the best rollercoasters, parks, or ride manufacturers\n", "- Some mild profanity or crude language, as long as it is not directed at individuals\n", "'''\n", "\n", "post_titles = [\n", " \"Top 10 Wildest Inversions on Steel Coasters\",\n", " \"My Review of the New RMC Raptor Coaster at Cedar Point\",\n", " \"Best Places to Buy Cheap Hiking Gear\",\n", " \"Rumor: Is Six Flags Planning a Giga Coaster for 2025?\",\n", " \"My Thoughts on the Latest Marvel Movie\",\n", "]\n", "\n", "for title in post_titles:\n", " classification = moderate_text(title, rollercoaster_guidelines)\n", " print(f\"Title: {title}\\nClassification: {classification}\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Improving Performance with Chain of Thought (CoT)\n", "\n", "One technique that can enhance Claude's content moderation capabilities is \"chain-of-thought\" (CoT) prompting. This approach encourages Claude to break down its reasoning process into a step-by-step chain of thoughts, rather than just providing the final output.\n", "\n", "To leverage chain of thought for moderation, you can modify your prompt to explicitly instruct Claude to break down its process into clear steps inside `<thinking>` tags. Here's an example:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<thinking>\n", "The post appears to be promoting a band rather than discussing rollercoasters, theme parks, or the amusement industry. This falls under the \"spam, advertisements, or self-promotion\" category, which is grounds for blocking the post.\n", "</thinking>\n", "\n", "<output>BLOCK</output>\n" ] } ], "source": [ "cot_prompt = '''You are a content moderation expert tasked with categorizing user-generated text based on the following guidelines:\n", "\n", "BLOCK CATEGORY:\n", "- Content that is not related to rollercoasters, theme parks, or the amusement industry\n", "- Explicit violence, hate speech, or illegal activities\n", "- Spam, advertisements, or self-promotion\n", "\n", "ALLOW CATEGORY:\n", "- Discussions about rollercoaster designs, ride experiences, and park reviews\n", "- Sharing news, rumors, or updates about new rollercoaster projects\n", "- Respectful debates about the best rollercoasters, parks, or ride manufacturers\n", "- Some mild profanity or crude language, as long as it is not directed at individuals\n", "\n", "First, inside of <thinking> tags, identify any potentially concerning aspects of the post based on the guidelines below and consider whether those aspects are serious enough to block the post or not. Finally, classify this text as either ALLOW or BLOCK inside <output> tags. Return nothing else.\n", "\n", "Given those instructions, here is the post to categorize:\n", "\n", "<user_post>{user_post}</user_post>'''\n", "\n", "user_post = \"Introducing my new band - Coaster Shredders. Check us out on YouTube!!\"\n", "\n", "response = client.messages.create(\n", " model=MODEL_NAME,\n", " max_tokens=1000,\n", " messages=[{\"role\": \"user\", \"content\": cot_prompt.format(user_post=user_post)}]\n", " ).content[0].text\n", "\n", "print(response)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Improving Performance with Examples\n", "Another technique for improving performance is by adding a few examples to the prompt, you provide Claude with some initial training data or \"few-shot learning\" to better understand the desired categorization. This can be especially helpful for nuanced or ambiguous cases where the category boundaries may not be entirely clear from the text descriptions alone. Here's an example of how you could modify the prompt template to include examples:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ALLOW\n" ] } ], "source": [ "examples_prompt = '''You are a content moderation expert tasked with categorizing user-generated text based on the following guidelines:\n", "\n", "BLOCK CATEGORY:\n", "- Content that is not related to rollercoasters, theme parks, or the amusement industry\n", "- Explicit violence, hate speech, or illegal activities\n", "- Spam, advertisements, or self-promotion\n", "\n", "ALLOW CATEGORY:\n", "- Discussions about rollercoaster designs, ride experiences, and park reviews\n", "- Sharing news, rumors, or updates about new rollercoaster projects\n", "- Respectful debates about the best rollercoasters, parks, or ride manufacturers\n", "- Some mild profanity or crude language, as long as it is not directed at individuals\n", "\n", "Here are some examples:\n", "<examples>\n", "Text: I'm selling weight loss products, check my link to buy!\n", "Category: BLOCK\n", "\n", "Text: I hate my local park, the operations and customer service are terrible. I wish that place would just burn down.\n", "Category: BLOCK\n", "\n", "Text: Did anyone ride the new RMC raptor Trek Plummet 2 yet? I've heard it's insane!\n", "Category: ALLOW\n", "\n", "Text: Hercs > B&Ms. That's just facts, no cap! Arrow > Intamin for classic woodies too.\n", "Category: ALLOW\n", "</examples>\n", "\n", "Given those examples, here is the user-generated text to categorize:\n", "<user_text>{user_text}</user_text>\n", "\n", "Based on the guidelines above, classify this text as either ALLOW or BLOCK. Return nothing else.'''\n", "\n", "user_post = \"Why Boomerang Coasters Ain't It (Don't @ Me)\"\n", "\n", "response = client.messages.create(\n", " model=MODEL_NAME,\n", " max_tokens=1000,\n", " messages=[{\"role\": \"user\", \"content\": examples_prompt.format(user_text=user_post)}]\n", " ).content[0].text\n", "\n", "print(response)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 2 }

misc/building_moderation_filter.ipynb (354 lines of code) (raw):