
{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "XvOluiKaG0Hb" }, "source": [ "# End-to-End Baseline Summarization\n", "\n", "In this notebook you will use the conversation profile configured earlier in the lab to summarize chat transcripts with redacted PII. You will need the integration ID of the conversation profile you created earlier to complete this lab.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "mLtMeOqgHVEc" }, "source": [ "# Install required libraries and authenticate GCP credentials" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mq2u5TNpGzHQ", "tags": [] }, "outputs": [], "source": [ "! pip install -q google-cloud-storage google-cloud-dlp google-cloud-dialogflow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__IMPORTANT:__ Restart the kernel by going to __Kernel__ > __Restart Kernel__ before moving forward. You do not need to run the installation cell again after restarting." ] }, { "cell_type": "markdown", "metadata": { "id": "n6KG5mj1I-hr" }, "source": [ "## Configure Google Cloud credentials\n", "\n", "__Note:__ Replace `project-name` with your Project ID. If you are running this notebook in a Google Colab environment, you will also need to authenticate your user account first."
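] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next cell sketches one way to authenticate in Colab; it assumes the standard `google.colab.auth` helper and is not needed on a Vertex AI Workbench instance." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Colab only (a sketch): uncomment these lines to authenticate your user account.\n", "# They are not needed on Vertex AI Workbench.\n", "# from google.colab import auth\n", "# auth.authenticate_user()"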
] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "LF0QStZrHzkq", "tags": [] }, "outputs": [], "source": [ "PROJECT_NAME='project-name'\n", "\n", "!gcloud config set project $PROJECT_NAME" ] }, { "cell_type": "markdown", "metadata": { "id": "zgNLnpuPLK9T" }, "source": [ "## Import required libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "aPhHyXo0LFSw", "tags": [] }, "outputs": [], "source": [ "import datetime\n", "import json\n", "\n", "import pandas as pd\n", "from google.cloud import storage\n", "import google.cloud.dlp_v2\n", "from google.cloud import dialogflow_v2beta1 as dialogflow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Replace the value of the `CONV_PROFILE_ID` variable with the integration ID you recorded earlier.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "v5gWFyL2LSIe", "tags": [] }, "outputs": [], "source": [ "CONV_PROFILE_ID = \"projects/project-name/locations/global/conversationProfiles/profile-id\"\n", "GCS_BUCKET_URI = \"gs://summarization_integration_test_data\"\n", "GCS_BUCKET_NAME = GCS_BUCKET_URI.split(\"//\")[1]\n", "TRANSCRIPTS_INPUT_FOLDER_PREFIX = \"data\"\n", "SUPPORTED_FILE_FORMATS = [\"json\"]\n", "\n", "project_id = PROJECT_NAME\n", "location = \"global\"\n", "project_path = '/'.join(CONV_PROFILE_ID.split('/')[:4])\n", "conversation_profile_id = CONV_PROFILE_ID" ] }, { "cell_type": "markdown", "metadata": { "id": "g6vdll99TXgG" }, "source": [ "# Step 1: Run PII redaction on chat transcripts" ] }, { "cell_type": "markdown", "metadata": { "id": "NqEYIb8eTqYg" }, "source": [ "## Utility Functions" ] }, { "cell_type": "markdown", "metadata": { "id": "JyQA7VHUBW2C" }, "source": [ "Before summarizing transcripts, you will redact potentially sensitive information found
in the transcripts. This will lower the risk of accidental data leakage.\n", "\n", "**Note**: `INFO_TYPES` should be fine-tuned to fit the customer's requirements. The `INFO_TYPES` list in the cell below is a default setting and is subject to the developer's discretion. To fine-tune `INFO_TYPES`, refer to https://cloud.google.com/dlp/docs/infotypes-reference\n", "\n", "First, instantiate a client for the Data Loss Prevention (DLP) API and define a function (`redact_dlp`) that redacts sensitive information." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "nEJ8d9aST8Cg", "tags": [] }, "outputs": [], "source": [ "dlp = google.cloud.dlp_v2.DlpServiceClient()\n", "INFO_TYPES = [\"AGE\",\"CREDIT_CARD_NUMBER\",\"CREDIT_CARD_TRACK_NUMBER\",\"DATE\",\"DATE_OF_BIRTH\",\n", " \"DOMAIN_NAME\",\"EMAIL_ADDRESS\",\"FEMALE_NAME\",\"MALE_NAME\",\"FIRST_NAME\",\"GENDER\",\n", " \"GENERIC_ID\",\"IP_ADDRESS\",\"LAST_NAME\",\"LOCATION\",\"PERSON_NAME\",\"PHONE_NUMBER\",\n", " \"STREET_ADDRESS\"]\n", "\n", "def redact_dlp(input_str, replacement_str=r\"[redacted]\"):\n", "\n", " inspect_config = {\"info_types\": [{\"name\": info_type} for info_type in INFO_TYPES]}\n", " deidentify_config = {\n", " \"info_type_transformations\": {\n", " \"transformations\": [\n", " {\n", " \"primitive_transformation\": {\n", " \"replace_config\": {\n", " \"new_value\": {\"string_value\": replacement_str}\n", " }\n", " }\n", " }\n", " ]\n", " }\n", " }\n", " item = {\"value\": input_str}\n", " response = dlp.deidentify_content(\n", " request={\n", " \"parent\": \"projects/{}\".format(PROJECT_NAME),\n", " \"deidentify_config\": deidentify_config,\n", " \"inspect_config\": inspect_config,\n", " \"item\": item,\n", " }\n", " )\n", "\n", " return str(response.item.value).strip()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, define a function (`parse_chat_transcripts`) that parses each chat transcript and applies the DLP redaction to every conversation turn. 
The code following the `parse_chat_transcripts` definition reads each transcript from Cloud Storage, redacts the text of every conversation turn with the DLP API, and collects the results into a list. It will take a couple of minutes to process the 150 transcripts in the Cloud Storage location used for this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "G4AKaXJgq0FM", "tags": [] }, "outputs": [], "source": [ "storage_client = storage.Client()\n", "\n", "def parse_chat_transcripts(file_name, chat_transcript):\n", " result_list = []\n", " conversation_entries_list = chat_transcript['entries']\n", " for index, conversation_entry in enumerate(conversation_entries_list):\n", " result_dict = {}\n", " result_dict['conversation_id'] = file_name\n", " result_dict['turn_id'] = index\n", " result_dict['role'] = conversation_entry['role']\n", " result_dict['text'] = redact_dlp(conversation_entry['text'])\n", " result_list.append(result_dict)\n", " return result_list\n", "\n", "INPUT_TRANSCRIPT_FILES_GCS_PATHS = storage_client.list_blobs(GCS_BUCKET_NAME, prefix=TRANSCRIPTS_INPUT_FOLDER_PREFIX)\n", "index = 1\n", "all_transcripts = []\n", "_bucket = storage_client.get_bucket(GCS_BUCKET_NAME)\n", "\n", "for chat_file_name in INPUT_TRANSCRIPT_FILES_GCS_PATHS:\n", " if (str(chat_file_name.name).split(\"/\")[1] != '') and (str(chat_file_name.name).split(\"/\")[1].split(\".\")[-1] in SUPPORTED_FILE_FORMATS):\n", " try:\n", " _blob = _bucket.blob(chat_file_name.name)\n", " with _blob.open(mode='r') as f:\n", " chat = json.load(f)\n", " temp = parse_chat_transcripts(str(chat_file_name.name).split(\"/\")[1].split(\".\")[0], chat)\n", " all_transcripts.extend(temp)\n", " if index % 10 == 0:\n", " print(f\"Conversations Processed :: {str(index)}\")\n", " index += 1\n", "\n", " except Exception as e:\n", " print(\"Exception Occurred for Chat: {} \\n {}\".format(chat_file_name.name, e))\n", " continue" ] }, { "cell_type": "markdown", 
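"metadata": {}, "source": [ "As an optional sanity check, you can run `redact_dlp` on a single made-up string before working with the full set of transcripts; the exact spans that get redacted depend on the DLP service's detection, so the output may vary." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# The sample text below is made up; DLP detection results may vary.\n", "print(redact_dlp(\"Hi, my name is John Smith and my email is john@example.com\"))" ] }, { "cell_type": "markdown",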
"metadata": { "id": "thdjQUD2rodv" }, "source": [ "Before applying the baseline summarization model, explore the preprocessed and redacted output from one of the conversations. Here you will convert the `all_transcripts` list into a Pandas dataframe and then look at one of the conversations. Note the portions of the conversation that were redacted by the DLP API." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "eval_df = pd.DataFrame(all_transcripts)\n", "mask = eval_df['conversation_id'] == '034' # Update to view other conversations\n", "eval_df[mask]" ] }, { "cell_type": "markdown", "metadata": { "id": "c2j9vBCnXK17" }, "source": [ "# Step 2: Generate summaries from Baseline Summarization Model\n", "\n", "In this step you will generate summaries for the redacted transcripts from the previous step, after defining a sequence of helper functions for the required API calls. The comments in the code give a rough description of each helper function."
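] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Optionally, you can first check how many turns each conversation contains. This matters because the helper functions below push messages to the API in batches of at most 200 (the API errored out on longer conversations)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Distribution of turns per conversation in the redacted dataframe.\n", "eval_df.groupby('conversation_id').size().describe()"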
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "-BOSlFllaV2J", "tags": [] }, "outputs": [], "source": [ "# Function to create a conversation for a given conversation profile\n", "\n", "def create_conversation(client: dialogflow.ConversationsClient, project_id: str,\n", " conversation_profile_id: str):\n", "\n", " conversation = dialogflow.Conversation()\n", " conversation.conversation_profile = conversation_profile_id\n", "\n", " request = dialogflow.CreateConversationRequest(\n", " parent=project_id,\n", " conversation=conversation,\n", " )\n", " response = client.create_conversation(request=request)\n", " return response\n", "\n", "# Function to create a participant for a conversation (with a given conversation_id) with a specific role\n", "\n", "def create_participant(client: dialogflow.ParticipantsClient, conversation_id,\n", " role: dialogflow.Participant.Role):\n", "\n", " request = dialogflow.CreateParticipantRequest(\n", " parent=conversation_id,\n", " participant=dialogflow.Participant(role=role),\n", " )\n", " response = client.create_participant(request=request)\n", "\n", " return response\n", "\n", "# Function to suggest a conversation summary using the configured conversation profile.\n", "\n", "def suggest_conversation_summary(client: dialogflow.ConversationsClient,\n", " conversation_id: str):\n", "\n", " request = dialogflow.SuggestConversationSummaryRequest(\n", " conversation=conversation_id,)\n", " response = client.suggest_conversation_summary(request=request)\n", "\n", " return response\n", "\n", "# Function to complete a conversation with a given conversation id.\n", "\n", "def complete_conversation(client: dialogflow.ConversationsClient,\n", " conversation_id: str):\n", "\n", " request = dialogflow.CompleteConversationRequest(name=conversation_id,)\n", " response = client.complete_conversation(request)\n", "\n", " return response\n", "\n", "# Function to return a summary for a conversation using a specific 
conversation profile\n", "# using the earlier helper functions.\n", "\n", "def get_summary(\n", " conversations_client: dialogflow.ConversationsClient,\n", " participants_client: dialogflow.ParticipantsClient,\n", " project_id: str,\n", " conversation_profile_id: str,\n", " conversation,\n", "):\n", "\n", " create_conversation_response = create_conversation(\n", " client=conversations_client,\n", " project_id=project_id,\n", " conversation_profile_id=conversation_profile_id,\n", " )\n", " conversation_id = create_conversation_response.name\n", "\n", " create_end_user_participant_response = create_participant(\n", " client=participants_client,\n", " conversation_id=conversation_id,\n", " role=dialogflow.Participant.Role.END_USER,\n", " )\n", " end_user_participant_id = create_end_user_participant_response.name\n", "\n", " create_human_agent_participant_response = create_participant(\n", " client=participants_client,\n", " conversation_id=conversation_id,\n", " role=dialogflow.Participant.Role.HUMAN_AGENT,\n", " )\n", " human_agent_participant_id = create_human_agent_participant_response.name\n", "\n", " batch_request = dialogflow.BatchCreateMessagesRequest()\n", " batch_request.parent = conversation_id\n", " turn_count = 0\n", " for role, text in conversation:\n", " if turn_count > 199: # The API errors out if a batch contains more than 200 messages\n", " # Push the accumulated messages into the conversation\n", " batch_response = conversations_client.batch_create_messages(request=batch_request)\n", "\n", " # Re-initializing the batch request to continue adding messages\n", " batch_request = dialogflow.BatchCreateMessagesRequest()\n", " batch_request.parent = conversation_id\n", "\n", " turn_count = 0\n", "\n", " participant_id = human_agent_participant_id if role == 'AGENT' else end_user_participant_id\n", "\n", " # Batch creating the conversation messages\n", " requests = dialogflow.CreateMessageRequest()\n", " requests.parent = conversation_id\n", " requests.message.content = text\n", " 
requests.message.participant = participant_id\n", " requests.message.send_time = datetime.datetime.now()\n", "\n", " batch_request.requests.append(requests)\n", " turn_count += 1\n", "\n", " batch_create_message_response = conversations_client.batch_create_messages(request=batch_request)\n", " suggest_conversation_summary_response = suggest_conversation_summary(\n", " client=conversations_client,\n", " conversation_id=conversation_id,\n", " )\n", "\n", " return suggest_conversation_summary_response" ] }, { "cell_type": "markdown", "metadata": { "id": "VHOXewKXs6KZ" }, "source": [ "Now call the Summarization API on each redacted transcript, collecting the generated summary together with the full conversation string." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UG8Qi6zicZr2", "tags": [] }, "outputs": [], "source": [ "conversations_client = dialogflow.ConversationsClient()\n", "participants_client = dialogflow.ParticipantsClient()\n", "results = []\n", "\n", "for conversation_id in eval_df['conversation_id'].unique():\n", "\n", " #print(f'Running inference for: {conversation_id}')\n", " \n", " conversation = []\n", " conversation_df = eval_df.loc[(eval_df['conversation_id'] == conversation_id)]\n", "\n", " for idx in conversation_df.index:\n", "\n", " conversation.append((conversation_df.loc[idx, 'role'], conversation_df.loc[idx, 'text']))\n", "\n", " get_summary_response = get_summary(\n", " conversations_client=conversations_client,\n", " participants_client=participants_client,\n", " project_id=project_path,\n", " conversation_profile_id=conversation_profile_id,\n", " conversation=conversation,\n", " )\n", "\n", " conversation_string = '\\n'.join(\n", " (f'{role}: {text}' for role, text in conversation))\n", " results.append({\n", " 'transcript_id': conversation_id,\n", " 'full_conversation': conversation_string,\n", " 'summary': get_summary_response.summary.text\n", " })\n", "\n", " if int(conversation_id) % 10 == 0:\n", " print(f'{int(conversation_id)} 
conversations have been summarized')" ] }, { "cell_type": "markdown", "metadata": { "id": "hlmNZLofrzeA" }, "source": [ "Now you can explore the output from the baseline summarization model for the conversation (`034`) that you looked at earlier." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import pprint\n", "\n", "summ_df = pd.DataFrame(results)\n", "mask = summ_df['transcript_id'] == '034'\n", "pprint.pprint(summ_df[mask].iloc[0]['summary'])" ] } ], "metadata": { "colab": { "provenance": [] }, "environment": { "kernel": "conda-root-py", "name": "workbench-notebooks.m113", "type": "gcloud", "uri": "gcr.io/deeplearning-platform-release/workbench-notebooks:m113" }, "kernelspec": { "display_name": "Python 3 (ipykernel) (Local)", "language": "python", "name": "conda-root-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 4 }