# End-to-End Baseline Summarization

In this notebook you will use the configured conversation profile from earlier in the lab to perform summarization of chat transcripts with redacted PII. You will need the integration ID of your conversation profile created earlier to complete this lab.


# Installing required libraries and Authenticating GCP Credentials

In [None]:
! pip install -q google-cloud-storage google-cloud-dlp google-cloud-dialogflow

__IMPORTANT:__ Restart the notebook kernel by selecting __Kernel__ and then __Restart Kernel__ before moving forward. You do not need to run the first cell again after the package installation completes.

## Configure Google Cloud credentials

__Note:__ Replace `project-name` with your Project ID. You will need to uncomment the commented lines first if you are running this notebook in a Google Colab environment.

In [None]:
PROJECT_NAME='project-name' 

!gcloud config set project $PROJECT_NAME

## Import required libraries

In [None]:
from typing import Dict, List
import csv
import glob
import json
import time
import re
import pandas as pd
import pickle
from google.cloud import storage
import google.cloud.dlp
from google.cloud import dialogflow_v2beta1 as dialogflow
import datetime

Replace the value of the `CONV_PROFILE_ID` variable with the integration ID you recorded earlier.


In [None]:
CONV_PROFILE_ID = "projects/project-name/locations/global/conversationProfiles/profile-id"
GCS_BUCKET_URI = "gs://summarization_integration_test_data" 
GCS_BUCKET_NAME = GCS_BUCKET_URI.split("//")[1]
TRANSCRIPTS_INPUT_FOLDER_PREFIX = "data" 
SUPPORTED_FILE_FORMATS = ["json"]

project_id = PROJECT_NAME
location = "global"
project_path = '/'.join(CONV_PROFILE_ID.split('/')[:4])
conversation_profile_id = CONV_PROFILE_ID
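The string handling in the cell above can be checked offline before any API call is made. The following is a minimal sketch that uses the placeholder values from the cell. The helper names `bucket_name_from_uri` and `parent_from_profile` are hypothetical and are not part of any Google Cloud library.

```python
def bucket_name_from_uri(gcs_uri: str) -> str:
    """Return the bucket name from a gs:// URI (hypothetical helper)."""
    if not gcs_uri.startswith("gs://"):
        raise ValueError(f"Not a Cloud Storage URI: {gcs_uri}")
    # Drop the scheme, then keep everything before the first "/".
    return gcs_uri[len("gs://"):].split("/")[0]


def parent_from_profile(profile_name: str) -> str:
    """Return 'projects/<project>/locations/<location>' from a profile name."""
    parts = profile_name.split("/")
    expected_keys = ("projects", "locations", "conversationProfiles")
    if len(parts) != 6 or tuple(parts[0::2]) != expected_keys:
        raise ValueError(f"Unexpected conversation profile name: {profile_name}")
    # The first four segments identify the project and location.
    return "/".join(parts[:4])


profile = "projects/project-name/locations/global/conversationProfiles/profile-id"
print(bucket_name_from_uri("gs://summarization_integration_test_data"))
print(parent_from_profile(profile))
```

Validating the profile name early gives a clearer error than a failed API request if the integration ID was pasted incompletely.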

# Step 1: Run PII redaction on chat transcripts

## Utility Functions

Before summarizing transcripts, you will redact potentially sensitive information found in them. This lowers the risk of accidental data leakage.

**Note**: `INFO_TYPES` should be fine-tuned to fit the customer's requirements. The `INFO_TYPES` list in the cell below is the default setting and can be changed at the developer's discretion. To fine-tune `INFO_TYPES`, refer to https://cloud.google.com/dlp/docs/infotypes-reference

First, instantiate a client to interact with the Data Loss Prevention (DLP) API and define a function (`redact_dlp`) to redact sensitive information.

In [None]:
dlp = google.cloud.dlp_v2.DlpServiceClient()
INFO_TYPES = ["AGE","CREDIT_CARD_NUMBER","CREDIT_CARD_TRACK_NUMBER","DATE","DATE_OF_BIRTH",
           "DOMAIN_NAME","EMAIL_ADDRESS","FEMALE_NAME","MALE_NAME","FIRST_NAME","GENDER",
           "GENERIC_ID","IP_ADDRESS","LAST_NAME","LOCATION","PERSON_NAME","PHONE_NUMBER",
           "STREET_ADDRESS"]

def redact_dlp(input_str,replacement_str=r"[redacted]"):

    inspect_config = {"info_types": [{"name": info_type} for info_type in INFO_TYPES]}
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {
                    "primitive_transformation": {
                        "replace_config": {
                            "new_value": {"string_value": replacement_str}
                        }
                    }
                }
            ]
        }
    }
    item = {"value": input_str}
    response = dlp.deidentify_content(
        request={
            "parent" :"projects/{}".format(PROJECT_NAME),
            "deidentify_config": deidentify_config,
            "inspect_config": inspect_config,
            "item": item,
        }
    )

    return str(response.item.value).strip()
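To test downstream cells without calling the DLP API, a local stand-in with the same shape as `redact_dlp` (string in, redacted string out) can be useful. The sketch below is regex-based and is not the DLP API. The function name `redact_local` and both patterns are assumptions made for illustration, and they catch far fewer cases than the DLP detectors.

```python
import re

# Hypothetical patterns for illustration only; the DLP API uses its own detectors.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact_local(input_str: str, replacement_str: str = "[redacted]") -> str:
    """Replace every pattern match with replacement_str (offline stand-in)."""
    redacted = input_str
    for pattern in PATTERNS.values():
        redacted = pattern.sub(replacement_str, redacted)
    return redacted.strip()


print(redact_local("Reach me at jane.doe@example.com or 555-123-4567."))
# Reach me at [redacted] or [redacted].
```

Because the stand-in returns the same `[redacted]` marker by default, code that inspects redacted text behaves the same way with either function.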

Next, define a function (`parse_chat_transcripts`) that parses a chat transcript and applies `redact_dlp` to the text of each turn. The code following the function definition reads the transcripts from Cloud Storage and collects the redacted turns in a list (`all_transcripts`), which is converted into a Pandas DataFrame in a later cell. It will take a couple of minutes to process the 150 transcripts in the Cloud Storage location used for this notebook.

In [None]:
storage_client = storage.Client()

def parse_chat_transcripts(file_name, chat_transcript):
  result_list = []
  conversation_entries_list = chat_transcript['entries']
  for index, conversation_entry in enumerate(conversation_entries_list):
    result_dict = {}
    result_dict['conversation_id']=file_name
    result_dict['turn_id'] = index
    result_dict['role'] = conversation_entry['role']
    result_dict['text'] = redact_dlp(conversation_entry['text'])
    result_list.append(result_dict)
  return result_list

INPUT_TRANSCRIPT_FILES_GCS_PATHS = storage_client.list_blobs(GCS_BUCKET_NAME, prefix= TRANSCRIPTS_INPUT_FOLDER_PREFIX)
index = 1
all_transcripts = []
_bucket = storage_client.get_bucket(GCS_BUCKET_NAME)

for chat_file_name in INPUT_TRANSCRIPT_FILES_GCS_PATHS:
  # Blob names look like "<prefix>/<file>"; skip folder placeholders and unsupported formats
  name_parts = str(chat_file_name.name).split("/")
  if len(name_parts) < 2 or name_parts[1] == '':
    continue
  file_name = name_parts[1]
  if file_name.split(".")[-1] not in SUPPORTED_FILE_FORMATS:
    continue
  try:
    _blob = _bucket.blob(chat_file_name.name)
    with _blob.open(mode='r') as f:
      chat = json.load(f)
    all_transcripts.extend(parse_chat_transcripts(file_name.split(".")[0], chat))
    if index % 10 == 0:
      print(f"Conversations Processed :: {index}")
    index += 1
  except Exception as e:
    # Report the failure and move on to the next transcript
    print("Exception Occurred for Chat: {} \n {}".format(chat_file_name.name, e))
    continue
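The parsing logic above can be exercised without Cloud Storage or the DLP API. The sketch below assumes the transcript shape that `parse_chat_transcripts` reads (an `entries` list whose items have `role` and `text` keys). The names `transcript_id`, `parse_entries`, and `fake_redact` are hypothetical, the sample transcript is synthetic, and the `CUSTOMER` role value is an assumption.

```python
from typing import Callable, Dict, List, Optional, Sequence


def transcript_id(blob_name: str, supported: Sequence[str] = ("json",)) -> Optional[str]:
    """Return '034' for 'data/034.json', or None if the blob should be skipped."""
    parts = blob_name.split("/")
    if len(parts) < 2 or parts[1] == "":
        return None  # folder placeholder such as "data/"
    file_name = parts[1]
    if file_name.split(".")[-1] not in supported:
        return None  # unsupported extension
    return file_name.split(".")[0]


def parse_entries(file_name: str, chat_transcript: Dict,
                  redact: Callable[[str], str] = lambda text: text) -> List[Dict]:
    """Flatten one transcript into one row per turn, redacting the text."""
    return [
        {
            "conversation_id": file_name,
            "turn_id": turn_id,
            "role": entry["role"],
            "text": redact(entry["text"]),
        }
        for turn_id, entry in enumerate(chat_transcript["entries"])
    ]


def fake_redact(text: str) -> str:
    """Stand-in for redact_dlp so the sketch runs offline."""
    return text.replace("Alice", "[redacted]")


sample = {
    "entries": [
        {"role": "AGENT", "text": "Hello Alice, how can I help?"},
        {"role": "CUSTOMER", "text": "My order is late."},
    ]
}
rows = parse_entries(transcript_id("data/034.json"), sample, redact=fake_redact)
for row in rows:
    print(row)
```

Passing the redaction function as an argument keeps the parser testable on its own; in the notebook, `redact_dlp` would take that role.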

Before applying the baseline summarization model, you should explore the preprocessed and redacted output from one of the conversations. Here you will convert `all_transcripts` into a Pandas DataFrame and then look at one of the conversations. Note the portions of the conversation that were redacted by the DLP API.

In [None]:
eval_df = pd.DataFrame(all_transcripts)
mask = eval_df['conversation_id']=='034' #Update to view other conversations
eval_df[mask]
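Beyond reading a single conversation, it can help to count how many redactions each conversation received. The sketch below works on a list of row dictionaries with the same keys as `all_transcripts`; the rows shown are synthetic and the function name `count_redactions` is hypothetical. It assumes the default `[redacted]` replacement string.

```python
from collections import Counter
from typing import Dict, List


def count_redactions(rows: List[Dict], marker: str = "[redacted]") -> Dict[str, int]:
    """Count occurrences of the redaction marker per conversation."""
    counts: Counter = Counter()
    for row in rows:
        counts[row["conversation_id"]] += row["text"].count(marker)
    return dict(counts)


# Synthetic rows in the same shape as all_transcripts.
rows = [
    {"conversation_id": "001", "turn_id": 0, "role": "CUSTOMER",
     "text": "My name is [redacted] and I live in [redacted]."},
    {"conversation_id": "001", "turn_id": 1, "role": "AGENT",
     "text": "Thanks, [redacted]."},
    {"conversation_id": "002", "turn_id": 0, "role": "CUSTOMER",
     "text": "Where is my order?"},
]
print(count_redactions(rows))
# {'001': 3, '002': 0}
```

A conversation with an unexpectedly high or zero count is a reasonable candidate for manual review when tuning `INFO_TYPES`.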

# Step 2: Generate summaries from Baseline Summarization Model

In this step you will generate summaries for the redacted transcripts from the previous steps after defining a sequence of helper functions to work through the appropriate steps. The comments in the code give a rough description of each of the helper functions being created.

In [None]:
# Function to create a conversation for a given conversation profile

def create_conversation(client: dialogflow.ConversationsClient, project_id: str,
                        conversation_profile_id: str):

  conversation = dialogflow.Conversation()
  conversation.conversation_profile = conversation_profile_id

  request = dialogflow.CreateConversationRequest(
      parent=project_id,
      conversation=conversation,
  )
  response = client.create_conversation(request=request)
  return response

# Function to create a participant for a conversation (with a given conversation_id) with a specific role

def create_participant(client: dialogflow.ParticipantsClient, conversation_id,
                       role: dialogflow.Participant.Role):

  request = dialogflow.CreateParticipantRequest(
      parent=conversation_id,
      participant=dialogflow.Participant(role=role),
  )
  response = client.create_participant(request=request)

  return response

# Function to suggest a conversation summary using the configured conversation profile.

def suggest_conversation_summary(client: dialogflow.ConversationsClient,
                                 conversation_id: str):

  request = dialogflow.SuggestConversationSummaryRequest(
      conversation=conversation_id,)
  response = client.suggest_conversation_summary(request=request)

  return response

# Function to complete a conversation with a given conversation id.

def complete_conversation(client: dialogflow.ConversationsClient,
                          conversation_id: str):

  request = dialogflow.CompleteConversationRequest(name=conversation_id,)
  response = client.complete_conversation(request=request)

  return response

# Function to return a summary for a conversation using a specific conversation profile
# using the earlier helper functions.

def get_summary(
    conversations_client: dialogflow.ConversationsClient,
    participants_client: dialogflow.ParticipantsClient,
    project_id: str,
    conversation_profile_id: str,
    conversation,
):

  create_conversation_response = create_conversation(
      client=conversations_client,
      project_id=project_id,
      conversation_profile_id=conversation_profile_id,
  )
  conversation_id = create_conversation_response.name

  create_end_user_participant_response = create_participant(
      client=participants_client,
      conversation_id=conversation_id,
      role=dialogflow.Participant.Role.END_USER,
  )
  end_user_participant_id = create_end_user_participant_response.name

  create_human_agent_participant_response = create_participant(
      client=participants_client,
      conversation_id=conversation_id,
      role=dialogflow.Participant.Role.HUMAN_AGENT,
  )
  human_agent_participant_id = create_human_agent_participant_response.name

  batch_request = dialogflow.BatchCreateMessagesRequest()
  batch_request.parent = conversation_id
  turn_count = 0
  for role, text in conversation:
    if turn_count > 199: # API was erroring out if the conv length is more than 200
      # Pushing first 200 messages into the conversation
      batch_response = conversations_client.batch_create_messages(request=batch_request)

      # re-initializing batch request to continue updating messages
      batch_request = dialogflow.BatchCreateMessagesRequest()
      batch_request.parent = conversation_id

      turn_count = 0

    participant_id = human_agent_participant_id if role == 'AGENT' else end_user_participant_id

    #Batch creating Conversation
    requests = dialogflow.CreateMessageRequest()
    requests.parent = conversation_id
    requests.message.content = text
    requests.message.participant = participant_id
    requests.message.send_time = datetime.datetime.now()

    batch_request.requests.append(requests)
    turn_count += 1

  batch_create_message_response = conversations_client.batch_create_messages(request=batch_request)
  suggest_conversation_summary_response = suggest_conversation_summary(
      client=conversations_client,
      conversation_id=conversation_id,
  )

  return suggest_conversation_summary_response
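The batching inside `get_summary` sends at most 200 messages per request, based on the limit noted in its code comment. The same idea can be isolated and tested offline. The following is a minimal sketch; `chunk_turns` and `MAX_BATCH` are hypothetical names, and the turns are synthetic.

```python
from typing import Iterator, List, Sequence, Tuple

MAX_BATCH = 200  # limit taken from the comment in get_summary


def chunk_turns(turns: Sequence[Tuple[str, str]],
                size: int = MAX_BATCH) -> Iterator[List[Tuple[str, str]]]:
    """Yield consecutive slices of at most `size` turns, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(turns), size):
        yield list(turns[start:start + size])


# 450 synthetic (role, text) turns split into batches.
turns = [("AGENT" if i % 2 == 0 else "CUSTOMER", f"message {i}") for i in range(450)]
batch_sizes = [len(batch) for batch in chunk_turns(turns)]
print(batch_sizes)
# [200, 200, 50]
```

One difference from `get_summary`: this sketch yields no batch at all for an empty conversation, whereas the loop in `get_summary` always sends a final request.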

Now call the Summarization API for each redacted transcript and store each summary alongside its transcript ID and full conversation string.

In [None]:
conversations_client = dialogflow.ConversationsClient()
participants_client = dialogflow.ParticipantsClient()
results = []

for conversation_id in eval_df['conversation_id'].unique():

  #print(f'Running inference for: {conversation_id}')
  
  conversation = []
  conversation_df = eval_df.loc[(eval_df['conversation_id'] == conversation_id)]

  for idx in conversation_df.index:

    conversation.append((conversation_df.loc[idx, 'role'], conversation_df.loc[idx, 'text']))

  get_summary_response = get_summary(
      conversations_client=conversations_client,
      participants_client=participants_client,
      project_id=project_path,
      conversation_profile_id=conversation_profile_id,
      conversation=conversation,
  )

  conversation_string = '\n'.join(
      (f'{role}: {text}' for role, text in conversation))
  results.append({
      'transcript_id': conversation_id,
      'full_conversation': conversation_string,
      'summary': get_summary_response.summary.text
  })

  # Report progress by the number of summaries produced, not by the file name
  if len(results) % 10 == 0:
    print(f'{len(results)} conversations have been summarized')
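The loop above regroups per-turn rows into `(role, text)` pairs for each conversation and then joins them into one string. The sketch below shows the same reshaping with the standard library only, on synthetic rows. The function names are hypothetical, and unlike the loop above, this version sorts by `conversation_id` and `turn_id` rather than relying on row order.

```python
from typing import Dict, List, Tuple


def group_conversations(rows: List[Dict]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each conversation_id to its ordered list of (role, text) turns."""
    grouped: Dict[str, List[Tuple[str, str]]] = {}
    ordered = sorted(rows, key=lambda r: (r["conversation_id"], r["turn_id"]))
    for row in ordered:
        grouped.setdefault(row["conversation_id"], []).append((row["role"], row["text"]))
    return grouped


def to_conversation_string(conversation: List[Tuple[str, str]]) -> str:
    """Format turns as 'ROLE: text' lines, one per turn."""
    return "\n".join(f"{role}: {text}" for role, text in conversation)


# Synthetic rows, deliberately out of order.
rows = [
    {"conversation_id": "002", "turn_id": 0, "role": "CUSTOMER", "text": "Hi"},
    {"conversation_id": "001", "turn_id": 1, "role": "CUSTOMER", "text": "It is late."},
    {"conversation_id": "001", "turn_id": 0, "role": "AGENT", "text": "How can I help?"},
]
grouped = group_conversations(rows)
print(to_conversation_string(grouped["001"]))
```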

Now you can explore the output from the baseline summarization model for the conversation (`034`) that you looked at earlier.

In [None]:
import pprint

summ_df = pd.DataFrame(results)
mask = summ_df['transcript_id']=='034'
pprint.pprint(summ_df[mask].iloc[0]['summary'])