sagemaker/08_distributed_summarization_bart

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Huggingface Sagemaker-sdk - Distributed Training Demo\n", "### Distributed Summarization with `transformers` scripts + `Trainer` and `samsum` dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. [Tutorial](#Tutorial) \n", "2. [Set up a development environment and install sagemaker](#Set-up-a-development-environment-and-install-sagemaker)\n", "    1. [Installation](#Installation) \n", "    2. [Development environment](#Development-environment) \n", "    3. [Permissions](#Permissions) \n", "3. [Choose 🤗 Transformers `examples/` script](#Choose-%F0%9F%A4%97-Transformers-examples/-script) \n", "4. [Configure distributed training and hyperparameters](#Configure-distributed-training-and-hyperparameters) \n", "5. [Create a `HuggingFace` estimator and start training](#Create-a-HuggingFace-estimator-and-start-training) \n", "6. [Deploying the endpoint](#Deploying-the-endpoint) \n", "7. [Upload the fine-tuned model to huggingface.co](#Upload-the-fine-tuned-model-to-huggingface.co)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial\n", "\n", "We will use the new [Hugging Face DLCs](https://github.com/aws/deep-learning-containers/tree/master/huggingface) and [Amazon SageMaker extension](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html#huggingface-estimator) to run distributed training of a Seq2Seq transformer model on the `summarization` task using the `transformers` and `datasets` libraries, then upload it to [huggingface.co](http://huggingface.co) and test it.\n", "\n", "As our [distributed training strategy](https://huggingface.co/transformers/sagemaker.html#distributed-training-data-parallel) we are going to use [SageMaker Data Parallelism](https://aws.amazon.com/blogs/aws/managed-data-parallelism-in-amazon-sagemaker-simplifies-training-on-large-datasets/), which has been built into the [Trainer](https://huggingface.co/transformers/main_classes/trainer.html) API. 
To use data parallelism we only have to define the `distribution` parameter in our `HuggingFace` estimator.\n", "\n", "```python\n", "# configuration for running training on smdistributed Data Parallel\n", "distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}\n", "```\n", "\n", "In this tutorial, we will use an Amazon SageMaker Notebook Instance for running our training job. You can learn [how to set up a Notebook Instance here](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html).\n", "\n", "**What we are going to do:**\n", "\n", "- Set up a development environment and install sagemaker\n", "- Choose 🤗 Transformers `examples/` script\n", "- Configure distributed training and hyperparameters\n", "- Create a `HuggingFace` estimator and start training\n", "- Upload the fine-tuned model to [huggingface.co](http://huggingface.co)\n", "- Test inference\n", "\n", "### Model and Dataset\n", "\n", "We are going to fine-tune [facebook/bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn) on the [samsum](https://huggingface.co/datasets/samsum) dataset. *\"BART is sequence-to-sequence model trained with denoising as pretraining objective.\"* [[REF](https://github.com/pytorch/fairseq/blob/master/examples/bart/README.md)]\n", "\n", "The `samsum` dataset contains about 16k messenger-like conversations with summaries. \n", "\n", "```python\n", "{'id': '13818513',\n", " 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.',\n", " 'dialogue': \"Amanda: I baked cookies. 
Do you want some?\\r\\nJerry: Sure!\\r\\nAmanda: I'll bring you tomorrow :-)\"}\n", "```\n", "\n", "_**NOTE: You can run this demo in SageMaker Studio, on your local machine, or on SageMaker Notebook Instances**_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Set up a development environment and install sagemaker" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installation\n", "\n", "_**Note:** The use of Jupyter is optional: we could also launch SageMaker Training jobs from anywhere we have an SDK installed, connectivity to the cloud and appropriate permissions, such as a laptop, another IDE or a task scheduler like Airflow or AWS Step Functions._" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install \"sagemaker>=2.48.0\" --upgrade\n", "#!apt install git-lfs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | sudo bash\n", "!sudo yum install git-lfs -y\n", "!git lfs install" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Development environment " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import sagemaker.huggingface" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Permissions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. 
You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html)._" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "import boto3\n", "sess = sagemaker.Session()\n", "# sagemaker session bucket -> used for uploading data, models and logs\n", "# sagemaker will automatically create this bucket if it does not exist\n", "sagemaker_session_bucket=None\n", "if sagemaker_session_bucket is None and sess is not None:\n", "    # set to default bucket if a bucket name is not given\n", "    sagemaker_session_bucket = sess.default_bucket()\n", "\n", "try:\n", "    role = sagemaker.get_execution_role()\n", "except ValueError:\n", "    # outside SageMaker there is no execution role, so look one up by name via IAM\n", "    iam = boto3.client('iam')\n", "    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']\n", "\n", "sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)\n", "\n", "print(f\"sagemaker role arn: {role}\")\n", "print(f\"sagemaker bucket: {sess.default_bucket()}\")\n", "print(f\"sagemaker session region: {sess.boto_region_name}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Choose 🤗 Transformers `examples/` script\n", "\n", "The [🤗 Transformers repository](https://github.com/huggingface/transformers/tree/master/examples) contains several `examples/` scripts for fine-tuning models on tasks from `language-modeling` to `token-classification`. In our case, we are using `run_summarization.py` from the `pytorch/summarization/` examples. 
\n", "\n", "_**Note**: you can follow this tutorial unchanged to train your model with a different example script._\n", "\n", "Since the `HuggingFace` Estimator has git support built in, we can specify a [training script that is stored in a GitHub repository](https://sagemaker.readthedocs.io/en/stable/overview.html#use-scripts-stored-in-a-git-repository) as `entry_point` and `source_dir`.\n", "\n", "We are going to use the `transformers 4.26.0` DLC, which means we need to configure `v4.26.0` as the branch to pull the compatible example scripts." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.26.0'} # v4.26.0 refers to the `transformers_version` you use in the estimator." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure distributed training and hyperparameters\n", "\n", "Next, we will define our `hyperparameters` and configure our distributed training strategy. As hyperparameters, we can define any [Seq2SeqTrainingArguments](https://huggingface.co/transformers/main_classes/trainer.html#seq2seqtrainingarguments) and the ones defined in [run_summarization.py](https://github.com/huggingface/transformers/tree/master/examples/seq2seq#sequence-to-sequence-training-and-evaluation). 
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# hyperparameters, which are passed into the training job\n", "hyperparameters={'per_device_train_batch_size': 4,\n", " 'per_device_eval_batch_size': 4,\n", " 'model_name_or_path': 'facebook/bart-large-cnn',\n", " 'dataset_name': 'samsum',\n", " 'do_train': True,\n", " 'do_eval': True,\n", " 'do_predict': True,\n", " 'predict_with_generate': True,\n", " 'output_dir': '/opt/ml/model',\n", " 'num_train_epochs': 3,\n", " 'learning_rate': 5e-5,\n", " 'seed': 7,\n", " 'fp16': True,\n", " }\n", "\n", "# configuration for running training on smdistributed Data Parallel\n", "distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a `HuggingFace` estimator and start training" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.huggingface import HuggingFace\n", "\n", "# create the Estimator\n", "huggingface_estimator = HuggingFace(\n", " entry_point='run_summarization.py', # script\n", " source_dir='./examples/pytorch/summarization', # relative path to example\n", " git_config=git_config,\n", " instance_type='ml.p3dn.24xlarge',\n", " instance_count=2,\n", " transformers_version='4.26.0',\n", " pytorch_version='1.13.1',\n", " py_version='py39',\n", " role=role,\n", " hyperparameters = hyperparameters,\n", " distribution = distribution\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# starting the train job\n", "huggingface_estimator.fit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deploying the endpoint\n", "\n", "To deploy our endpoint, we call `deploy()` on our HuggingFace estimator object, passing in our desired number of instances and instance type." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictor = huggingface_estimator.deploy(initial_instance_count=1, instance_type=\"ml.g4dn.xlarge\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, we use the returned predictor object to call the endpoint." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "conversation = '''Jeff: Can I train a 🤗 Transformers model on Amazon SageMaker? \n", " Philipp: Sure you can use the new Hugging Face Deep Learning Container. \n", " Jeff: ok.\n", " Jeff: and how can I get started? \n", " Jeff: where can I find documentation? \n", " Philipp: ok, ok you can find everything here. https://huggingface.co/blog/the-partnership-amazon-sagemaker-and-hugging-face \n", " '''\n", "\n", "data = {\"inputs\": conversation}\n", "\n", "predictor.predict(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we delete the endpoint again." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "predictor.delete_endpoint()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Upload the fine-tuned model to [huggingface.co](http://huggingface.co)\n", "\n", "We can download our model from Amazon S3 and unzip it using the following snippet.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import tarfile\n", "from sagemaker.s3 import S3Downloader\n", "\n", "local_path = 'my_bart_model'\n", "\n", "os.makedirs(local_path, exist_ok = True)\n", "\n", "# download model from S3\n", "S3Downloader.download(\n", "    s3_uri=huggingface_estimator.model_data, # s3 uri where the trained model is located\n", "    local_path=local_path, # local path where *.tar.gz is saved\n", "    sagemaker_session=sess # sagemaker session used for training the model\n", ")\n", "\n", "# unzip model, then remove the archive\n", "with tarfile.open(f\"{local_path}/model.tar.gz\", \"r:gz\") as tar:\n", 
"    tar.extractall(path=local_path)\n", "os.remove(f\"{local_path}/model.tar.gz\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we upload our model to [huggingface.co](http://huggingface.co) we need to create a `model_card`. The `model_card` describes the model and includes the hyperparameters, the results, and the dataset used for training. To create a `model_card` we create a `README.md` in our `local_path`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "# read eval and test results \n", "with open(f\"{local_path}/eval_results.json\") as f:\n", "    eval_results_raw = json.load(f)\n", "    eval_results={}\n", "    eval_results[\"eval_rouge1\"] = eval_results_raw[\"eval_rouge1\"]\n", "    eval_results[\"eval_rouge2\"] = eval_results_raw[\"eval_rouge2\"]\n", "    eval_results[\"eval_rougeL\"] = eval_results_raw[\"eval_rougeL\"]\n", "    eval_results[\"eval_rougeLsum\"] = eval_results_raw[\"eval_rougeLsum\"]\n", "\n", "with open(f\"{local_path}/test_results.json\") as f:\n", "    test_results_raw = json.load(f)\n", "    test_results={}\n", "    test_results[\"test_rouge1\"] = test_results_raw[\"test_rouge1\"]\n", "    test_results[\"test_rouge2\"] = test_results_raw[\"test_rouge2\"]\n", "    test_results[\"test_rougeL\"] = test_results_raw[\"test_rougeL\"]\n", "    test_results[\"test_rougeLsum\"] = test_results_raw[\"test_rougeLsum\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After we extract all the metrics we want to include, we are going to create our `README.md`. 
In addition to the automated generation of the results table, we add the metrics manually to the `metadata` of our model card under `model-index`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(eval_results)\n", "print(test_results)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "\n", "MODEL_CARD_TEMPLATE = \"\"\"\n", "---\n", "language: en\n", "tags:\n", "- sagemaker\n", "- bart\n", "- summarization\n", "license: apache-2.0\n", "datasets:\n", "- samsum\n", "model-index:\n", "- name: {model_name}\n", "  results:\n", "  - task: \n", "      name: Abstractive Text Summarization\n", "      type: abstractive-text-summarization\n", "    dataset:\n", "      name: \"SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization\" \n", "      type: samsum\n", "    metrics:\n", "    - name: Validation ROUGE-1\n", "      type: rouge-1\n", "      value: 42.621\n", "    - name: Validation ROUGE-2\n", "      type: rouge-2\n", "      value: 21.9825\n", "    - name: Validation ROUGE-L\n", "      type: rouge-l\n", "      value: 33.034\n", "    - name: Test ROUGE-1\n", "      type: rouge-1\n", "      value: 41.3174\n", "    - name: Test ROUGE-2\n", "      type: rouge-2\n", "      value: 20.8716\n", "    - name: Test ROUGE-L\n", "      type: rouge-l\n", "      value: 32.1337\n", "widget:\n", "- text: | \n", "    Jeff: Can I train a 🤗 Transformers model on Amazon SageMaker? \n", "    Philipp: Sure you can use the new Hugging Face Deep Learning Container. \n", "    Jeff: ok.\n", "    Jeff: and how can I get started? \n", "    Jeff: where can I find documentation? \n", "    Philipp: ok, ok you can find everything here. 
https://huggingface.co/blog/the-partnership-amazon-sagemaker-and-hugging-face \n", "---\n", "## `{model_name}`\n", "This model was trained using Amazon SageMaker and the new Hugging Face Deep Learning container.\n", "For more information look at:\n", "- [🤗 Transformers Documentation: Amazon SageMaker](https://huggingface.co/transformers/sagemaker.html)\n", "- [Example Notebooks](https://github.com/huggingface/notebooks/tree/master/sagemaker)\n", "- [Amazon SageMaker documentation for Hugging Face](https://docs.aws.amazon.com/sagemaker/latest/dg/hugging-face.html)\n", "- [Python SDK SageMaker documentation for Hugging Face](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/index.html)\n", "- [Deep Learning Container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-training-containers)\n", "## Hyperparameters\n", " {hyperparameters}\n", "## Usage\n", " from transformers import pipeline\n", " summarizer = pipeline(\"summarization\", model=\"philschmid/{model_name}\")\n", " conversation = '''Jeff: Can I train a 🤗 Transformers model on Amazon SageMaker? \n", " Philipp: Sure you can use the new Hugging Face Deep Learning Container. \n", " Jeff: ok.\n", " Jeff: and how can I get started? \n", " Jeff: where can I find documentation? \n", " Philipp: ok, ok you can find everything here. 
https://huggingface.co/blog/the-partnership-amazon-sagemaker-and-hugging-face \n", " '''\n", " summarizer(conversation)\n", "## Results\n", "| key | value |\n", "| --- | ----- |\n", "{eval_table}\n", "{test_table}\n", "\"\"\"\n", "\n", "# Generate model card (todo: add more data from Trainer)\n", "model_card = MODEL_CARD_TEMPLATE.format(\n", "    model_name=f\"{hyperparameters['model_name_or_path'].split('/')[1]}-{hyperparameters['dataset_name']}\",\n", "    hyperparameters=json.dumps(hyperparameters, indent=4, sort_keys=True),\n", "    eval_table=\"\\n\".join(f\"| {k} | {v} |\" for k, v in eval_results.items()),\n", "    test_table=\"\\n\".join(f\"| {k} | {v} |\" for k, v in test_results.items()),\n", ")\n", "with open(f\"{local_path}/README.md\", \"w\") as f:\n", "    f.write(model_card)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once our unzipped model and model card are located in `my_bart_model`, we can either use the `huggingface_hub` SDK to create a repository and upload them to [huggingface.co](http://huggingface.co), or go to https://huggingface.co/new, create a new repository, and upload them there."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from getpass import getpass\n", "from huggingface_hub import HfApi, Repository\n", "\n", "hf_username = \"philschmid\" # your username on huggingface.co\n", "hf_email = \"philipp@huggingface.co\" # email used for commit\n", "repository_name = f\"{hyperparameters['model_name_or_path'].split('/')[1]}-{hyperparameters['dataset_name']}\" # repository name on huggingface.co\n", "password = getpass(\"Enter your password:\") # creates a prompt for entering password\n", "\n", "# get hf token\n", "token = HfApi().login(username=hf_username, password=password)\n", "\n", "# create repository\n", "repo_url = HfApi().create_repo(token=token, name=repository_name, exist_ok=True)\n", "\n", "# create a Repository instance\n", "model_repo = Repository(use_auth_token=token,\n", " clone_from=repo_url,\n", " local_dir=local_path,\n", " git_user=hf_username,\n", " git_email=hf_email)\n", "\n", "# push model to the hub\n", "model_repo.push_to_hub()\n", "\n", "print(f\"https://huggingface.co/{hf_username}/{repository_name}\")" ] } ], "metadata": { "instance_type": "ml.t3.medium", "interpreter": { "hash": "c281c456f1b8161c8906f4af2c08ed2c40c50136979eaae69688b01f70e9f4a9" }, "kernelspec": { "display_name": "conda_pytorch_p39", "language": "python", "name": "conda_pytorch_p39" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.15" } }, "nbformat": 4, "nbformat_minor": 4 }

sagemaker/08_distributed_summarization_bart_t5/sagemaker-notebook.ipynb