data-labelling/labeling_adjustment_job

{ "cells": [ { "cell_type": "markdown", "id": "a56a7e16", "metadata": {}, "source": [ "# Labeling Adjustment Job Adaptation\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/ground_truth_labeling_jobs|labeling_adjustment_job_adaptation|labeling_adjustment_job_adaptation.ipynb)\n", "\n", "---" ] }, { "cell_type": "markdown", "id": "a56a7e16", "metadata": {}, "source": [ "\n", "## Labeling Adjustment Jobs\n", "\n", "This notebook focuses on creating Labeling Adjustment Jobs in SageMaker Ground Truth.\n", "\n", "More details about using and creating label adjustment jobs can be found in the official documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/sms-verification-data.html\n", "\n", "## Customer use case description\n", "\n", "The example provided here uses a bounding box labeling job for multi-object detection on image data.\n", "\n", "After a customer has labeled their dataset for object detection and trained their first models, business requirements and priorities might change. Some objects that the model was originally meant to detect might become irrelevant and should be removed, while additional labels might need to be added to the dataset.\n", "\n", "This requires re-labeling the original dataset with a labeling adjustment job that displays the existing labels we want to keep and leaves out the labels that are no longer in scope. 
The current SageMaker Ground Truth UI lets us remove unwanted labels from the labeling workforce UI before launching a labeling adjustment job, which also removes those labels visually from each image displayed to the labeling team.\n", "\n", "However, jobs launched this way fail during the consolidation stage on every image whose labels have not been adjusted by the labeling team. To avoid this issue, we need to process the existing output manifest file and remove all unwanted labels from it directly before launching the labeling adjustment job.\n", "\n", "The script provided in this notebook takes as input a set of labels to remove and the name of the labeling job whose output manifest file should be adjusted. It generates a cleaned output manifest file, with only the target labels removed from the latest labeling job, that can be used to safely launch a label adjustment job." ] }, { "cell_type": "markdown", "id": "13912066", "metadata": {}, "source": [ "### Function code" ] }, { "cell_type": "code", "execution_count": 3, "id": "d38f5df4", "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import os\n", "import botocore\n", "import json\n", "\n", "sagemaker_client = boto3.client(\"sagemaker\")\n", "s3_client = boto3.client(\"s3\")\n", "\n", "##### Helper function for communication with aws services (sagemaker and s3)\n", "def get_labeling_job_output_manifest_file_location(\n", "    labeling_job_name: str, sagemaker_client: botocore.client\n", ") -> str:\n", "    \"\"\"\n", "    # ref: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.describe_labeling_job\n", "    \"\"\"\n", "    s3_output_location = sagemaker_client.describe_labeling_job(LabelingJobName=labeling_job_name)[\n", "        \"OutputConfig\"\n", "    ][\"S3OutputPath\"]\n", "    manifest_file_relative_path_from_output_location = \"{}/manifests/output/output.manifest\".format(\n", 
"        labeling_job_name\n", "    )\n", "    output_manifest_absolute_path = os.path.join(\n", "        s3_output_location, manifest_file_relative_path_from_output_location\n", "    )\n", "\n", "    return output_manifest_absolute_path\n", "\n", "\n", "def get_labeling_job_attribute_name(\n", "    labeling_job_name: str, sagemaker_client: botocore.client\n", ") -> str:\n", "    \"\"\"\n", "    # ref: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.describe_labeling_job\n", "    \"\"\"\n", "    labeling_job_attribute_name = sagemaker_client.describe_labeling_job(\n", "        LabelingJobName=labeling_job_name\n", "    )[\"LabelAttributeName\"]\n", "    return labeling_job_attribute_name\n", "\n", "\n", "def split_bucket_key_from_s3_path(s3_full_path: str) -> (str, str):\n", "    \"\"\"\n", "    full s3 path in format: s3://BUCKET/KEY\n", "    \"\"\"\n", "    split_location = s3_full_path[5:].find(\"/\") + 5\n", "    return s3_full_path[5:split_location], s3_full_path[split_location + 1 :]\n", "\n", "\n", "def read_s3_file(file_path: str, s3_client: botocore.client):\n", "    bucket_name, key = split_bucket_key_from_s3_path(file_path)\n", "    response = s3_client.get_object(Bucket=bucket_name, Key=key)[\"Body\"].read()\n", "    return response\n", "\n", "\n", "def save_file_to_s3(file_path: str, object_to_save, s3_client: botocore.client):\n", "    bucket_name, key = split_bucket_key_from_s3_path(file_path)\n", "    s3_client.put_object(Body=object_to_save, Bucket=bucket_name, Key=key)\n", "\n", "\n", "#### Helper functions to process the output.manifest file and cleanup unnecessary labels\n", "def get_class_ids_for_removable_labels(label_annotations_metadata, labels_to_remove):\n", "    class_ids_for_removable_labels = []\n", "    for label in labels_to_remove:\n", "        for key, value in label_annotations_metadata[\"class-map\"].items():\n", "            if value == label:\n", "                class_ids_for_removable_labels.append(key)\n", "                del label_annotations_metadata[\"class-map\"][key]\n", "                break\n", "    return 
label_annotations_metadata, class_ids_for_removable_labels\n", "\n", "\n", "def clean_up_annotations(label_annotations, class_ids_to_remove):\n", "    removed_annotation_positions = []\n", "    new_annotations_list = []\n", "    for i in range(len(label_annotations[\"annotations\"])):\n", "        if str(label_annotations[\"annotations\"][i][\"class_id\"]) in class_ids_to_remove:\n", "            removed_annotation_positions.append(i)\n", "        else:\n", "            new_annotations_list.append(label_annotations[\"annotations\"][i])\n", "    label_annotations[\"annotations\"] = new_annotations_list\n", "    return label_annotations, removed_annotation_positions\n", "\n", "\n", "def clean_up_metadata(label_annotations_metadata, removed_marked_labels_positions):\n", "    for i in range(len(removed_marked_labels_positions)):\n", "        del label_annotations_metadata[\"objects\"][removed_marked_labels_positions[i] - i]\n", "\n", "    label_annotations_metadata[\"adjustment-status\"] = \"adjusted\"\n", "    return label_annotations_metadata\n", "\n", "\n", "#### Main function to remove all the unnecessary labels from manifest file\n", "def remove_labels_from_output_manifest_file(\n", "    remove_labels: list, marked_labels: list, labeling_job_attribute_name: str\n", "):\n", "    \"\"\"\n", "    remove_labels (list[str]): list of labels we want to remove from output.manifest file\n", "    marked_labels (list[marked_labels_per_document]): content of output.manifest file marked labels per document\n", "        format of marked_labels_per_document:\n", "            'all_keys': ['source-ref', 'category', 'category-metadata','chain-job-name','chain-job-name-metadata']\n", "            'category' (chain-job-name): ['image_size', 'annotations']\n", "            'category-metadata' (chain-job-name-metadata): ['objects', 'class-map', 'type', 'human-annotated', 'creation-date', 'job-name', 'adjustment-status']\n", "    labeling_job_attribute_name (str): name of the labeling job attribute to find adequate annotations and annotations_meta data to be adjusted\n", "    \"\"\"\n", "    nmb_keys_previous = 
len(list(marked_labels[0].keys()))\n", "    total_nmb_of_removed_marked_labels = 0\n", "\n", "    for label in marked_labels:\n", "        nmb_keys = len(list(label.keys()))\n", "        if nmb_keys_previous != nmb_keys:\n", "            # a bare `assert \"message\"` always passes, so raise explicitly\n", "            raise AssertionError(\n", "                \"Label does not have same amount of keys as others! This is unexpected behaviour since each should have same amount of jobs run...\"\n", "            )\n", "\n", "        latest_annotations_name = labeling_job_attribute_name\n", "        latest_annotations_metadata_name = \"{}-metadata\".format(labeling_job_attribute_name)\n", "\n", "        (\n", "            label[latest_annotations_metadata_name],\n", "            class_ids_to_remove,\n", "        ) = get_class_ids_for_removable_labels(\n", "            label[latest_annotations_metadata_name], remove_labels\n", "        )\n", "\n", "        # every labeling job class-map should have one label mentioned only once, but not every class needs to be present\n", "        assert len(class_ids_to_remove) <= len(remove_labels)\n", "\n", "        label[latest_annotations_name], removed_marked_labels_positions = clean_up_annotations(\n", "            label[latest_annotations_name], class_ids_to_remove\n", "        )\n", "        label[latest_annotations_metadata_name] = clean_up_metadata(\n", "            label[latest_annotations_metadata_name], removed_marked_labels_positions\n", "        )\n", "        total_nmb_of_removed_marked_labels += len(removed_marked_labels_positions)\n", "\n", "    # this logs the total number of marked labels that have been removed from your manifest file\n", "    # you can compare it with the number of labels of the removed types\n", "    # that you expected in the input manifest file\n", "    print(\"In total we have removed {} marked labels.\".format(total_nmb_of_removed_marked_labels))\n", "    return marked_labels\n", "\n", "\n", "def main_function(\n", "    labeling_job_name, remove_labels, path_to_save_results_to, sagemaker_client, s3_client\n", "):\n", "    output_file_path = get_labeling_job_output_manifest_file_location(\n", "        labeling_job_name, sagemaker_client\n", "    )\n", "    output_file_content = 
read_s3_file(output_file_path, s3_client)\n", "\n", "    labels = []\n", "    for line in output_file_content.splitlines():\n", "        labels.append(json.loads(line))\n", "\n", "    cleaned_labels = remove_labels_from_output_manifest_file(\n", "        remove_labels, labels, get_labeling_job_attribute_name(labeling_job_name, sagemaker_client)\n", "    )\n", "    # you can uncomment this to generate a smaller output file for testing\n", "    # cleaned_labels = cleaned_labels[:15]\n", "\n", "    # function to save back all the marked labels to cleaned up manifest file\n", "    output_manifest_cleaned_content = \"\"\n", "    for clean_label in cleaned_labels:\n", "        output_manifest_cleaned_content = (\n", "            output_manifest_cleaned_content + json.dumps(clean_label) + \"\\n\"\n", "        )\n", "\n", "    save_file_to_s3(path_to_save_results_to, output_manifest_cleaned_content, s3_client)" ] }, { "cell_type": "markdown", "id": "098d85e8", "metadata": {}, "source": [ "### Parameter setup and script execution" ] }, { "cell_type": "code", "execution_count": 4, "id": "ea0a5772", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "In total we have removed 4 marked labels.\n" ] } ], "source": [ "#### program execution\n", "\n", "# these are input parameters to adjust\n", "labeling_job_name = \"<name_of_the_labeling_job_you_want_to_run_label_adjustment_job_for>\"\n", "remove_labels = [\"<label_1>\", \"<label_2>\", \"<label_3>\", \"<label_4>\"]\n", "\n", "path_to_save_results_to = (\n", "    \"s3://<bucket_dst>/<path_you_want_your_cleaned_output_manifest_file_saved_to>/output.manifest\"\n", ")\n", "\n", "main_function(\n", "    labeling_job_name, remove_labels, path_to_save_results_to, sagemaker_client, s3_client\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. 
The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/ground_truth_labeling_jobs|labeling_adjustment_job_adaptation|labeling_adjustment_job_adaptation.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/ground_truth_labeling_jobs|labeling_adjustment_job_adaptation|labeling_adjustment_job_adaptation.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/ground_truth_labeling_jobs|labeling_adjustment_job_adaptation|labeling_adjustment_job_adaptation.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/ground_truth_labeling_jobs|labeling_adjustment_job_adaptation|labeling_adjustment_job_adaptation.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/ground_truth_labeling_jobs|labeling_adjustment_job_adaptation|labeling_adjustment_job_adaptation.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/ground_truth_labeling_jobs|labeling_adjustment_job_adaptation|labeling_adjustment_job_adaptation.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/ground_truth_labeling_jobs|labeling_adjustment_job_adaptation|labeling_adjustment_job_adaptation.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/ground_truth_labeling_jobs|labeling_adjustment_job_adaptation|labeling_adjustment_job_adaptation.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/ground_truth_labeling_jobs|labeling_adjustment_job_adaptation|labeling_adjustment_job_adaptation.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/ground_truth_labeling_jobs|labeling_adjustment_job_adaptation|labeling_adjustment_job_adaptation.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/ground_truth_labeling_jobs|labeling_adjustment_job_adaptation|labeling_adjustment_job_adaptation.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/ground_truth_labeling_jobs|labeling_adjustment_job_adaptation|labeling_adjustment_job_adaptation.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/ground_truth_labeling_jobs|labeling_adjustment_job_adaptation|labeling_adjustment_job_adaptation.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/ground_truth_labeling_jobs|labeling_adjustment_job_adaptation|labeling_adjustment_job_adaptation.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/ground_truth_labeling_jobs|labeling_adjustment_job_adaptation|labeling_adjustment_job_adaptation.ipynb)\n" ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 5 }
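
The per-record clean-up that the notebook's helper functions perform can be tried locally without AWS access. The following is a minimal sketch under stated assumptions, not the notebook's own code: the sample record, the `category` attribute name, the label names, and the helper `remove_labels_from_record` are all hypothetical, and the record layout assumes the bounding box fields named in the notebook's docstring (`annotations`, `objects`, `class-map`).

```python
import json

# Hypothetical manifest line; "category" stands in for the job's LabelAttributeName.
sample_line = json.dumps(
    {
        "source-ref": "s3://example-bucket/images/0001.jpg",
        "category": {
            "image_size": [{"width": 640, "height": 480, "depth": 3}],
            "annotations": [
                {"class_id": 0, "left": 10, "top": 20, "width": 50, "height": 40},
                {"class_id": 1, "left": 100, "top": 80, "width": 30, "height": 30},
                {"class_id": 0, "left": 200, "top": 150, "width": 60, "height": 45},
            ],
        },
        "category-metadata": {
            "objects": [{"confidence": 0.9}, {"confidence": 0.8}, {"confidence": 0.7}],
            "class-map": {"0": "car", "1": "bicycle"},
            "type": "groundtruth/object-detection",
            "human-annotated": "yes",
            "job-name": "labeling-job/example",
        },
    }
)


def remove_labels_from_record(record, attribute_name, labels_to_remove):
    """Drop the given labels from one manifest record, modifying it in place."""
    annotations = record[attribute_name]
    metadata = record[attribute_name + "-metadata"]

    # class-map keys are strings, while annotation class_id values are integers.
    ids_to_remove = {
        class_id
        for class_id, name in metadata["class-map"].items()
        if name in labels_to_remove
    }
    # Positions of the boxes to keep; "objects" is aligned with "annotations" by position.
    keep = [
        i
        for i, box in enumerate(annotations["annotations"])
        if str(box["class_id"]) not in ids_to_remove
    ]

    annotations["annotations"] = [annotations["annotations"][i] for i in keep]
    metadata["objects"] = [metadata["objects"][i] for i in keep]
    metadata["class-map"] = {
        class_id: name
        for class_id, name in metadata["class-map"].items()
        if class_id not in ids_to_remove
    }
    metadata["adjustment-status"] = "adjusted"
    return record


cleaned = remove_labels_from_record(json.loads(sample_line), "category", ["car"])
print(json.dumps(cleaned["category"]["annotations"]))
print(cleaned["category-metadata"]["class-map"])
```

Selecting the positions to keep once and applying them to both `annotations` and `objects` keeps the two lists aligned, which is the same invariant the notebook maintains by deleting from `objects` with an index offset.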

data-labelling/labeling_adjustment_job_adaptation.ipynb (333 lines of code) (raw):