sagemaker/08_distributed_summarization_bart_t5/sagemaker-notebook.ipynb (570 lines of code) (raw):
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Huggingface Sagemaker-sdk - Distributed Training Demo\n",
"### Distributed Summarization with `transformers` scripts + `Trainer` and `samsum` dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. [Tutorial](#Tutorial) \n",
"2. [Set up a development environment and install sagemaker](#Set-up-a-development-environment-and-install-sagemaker)\n",
" 1. [Installation](#Installation) \n",
" 2. [Development environment](#Development-environment) \n",
" 3. [Permissions](#Permissions) \n",
"4. [Choose 🤗 Transformers `examples/` script](#Choose-%F0%9F%A4%97-Transformers-examples/-script) \n",
"1. [Configure distributed training and hyperparameters](#Configure-distributed-training-and-hyperparameters) \n",
"2. [Create a `HuggingFace` estimator and start training](#Create-a-HuggingFace-estimator-and-start-training) \n",
"3. [Upload the fine-tuned model to huggingface.co](#Upload-the-fine-tuned-model-to-huggingface.co)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial\n",
"\n",
"We will use the new [Hugging Face DLCs](https://github.com/aws/deep-learning-containers/tree/master/huggingface) and [Amazon SageMaker extension](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html#huggingface-estimator) to train a distributed Seq2Seq-transformer model on `summarization` using the `transformers` and `datasets` libraries and upload it afterwards to [huggingface.co](http://huggingface.co) and test it.\n",
"\n",
"As [distributed training strategy](https://huggingface.co/transformers/sagemaker.html#distributed-training-data-parallel) we are going to use [SageMaker Data Parallelism](https://aws.amazon.com/blogs/aws/managed-data-parallelism-in-amazon-sagemaker-simplifies-training-on-large-datasets/), which has been built into the [Trainer](https://huggingface.co/transformers/main_classes/trainer.html) API. To use data-parallelism we only have to define the `distribution` parameter in our `HuggingFace` estimator.\n",
"\n",
"```python\n",
"# configuration for running training on smdistributed Data Parallel\n",
"distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}\n",
"```\n",
"\n",
"In this tutorial, we will use an Amazon SageMaker Notebook Instance for running our training job. You can learn [here how to set up a Notebook Instance](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html).\n",
"\n",
"**What are we going to do:**\n",
"\n",
"- Set up a development environment and install sagemaker\n",
"- Chose 🤗 Transformers `examples/` script\n",
"- Configure distributed training and hyperparameters\n",
"- Create a `HuggingFace` estimator and start training\n",
"- Upload the fine-tuned model to [huggingface.co](http://huggingface.co)\n",
"- Test inference\n",
"\n",
"### Model and Dataset\n",
"\n",
"We are going to fine-tune [facebook/bart-base](https://huggingface.co/facebook/bart-base) on the [samsum](https://huggingface.co/datasets/samsum) dataset. *\"BART is sequence-to-sequence model trained with denoising as pretraining objective.\"* [[REF](https://github.com/pytorch/fairseq/blob/master/examples/bart/README.md)]\n",
"\n",
"The `samsum` dataset contains about 16k messenger-like conversations with summaries. \n",
"\n",
"```python\n",
"{'id': '13818513',\n",
" 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.',\n",
" 'dialogue': \"Amanda: I baked cookies. Do you want some?\\r\\nJerry: Sure!\\r\\nAmanda: I'll bring you tomorrow :-)\"}\n",
"```\n",
"\n",
"_**NOTE: You can run this demo in Sagemaker Studio, your local machine or Sagemaker Notebook Instances**_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Set up a development environment and install sagemaker"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installation\n",
"\n",
"_**Note:** The use of Jupyter is optional: We could also launch SageMaker Training jobs from anywhere we have an SDK installed, connectivity to the cloud and appropriate permissions, such as a Laptop, another IDE or a task scheduler like Airflow or AWS Step Functions._"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install \"sagemaker>=2.48.0\" --upgrade\n",
"#!apt install git-lfs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | sudo bash\n",
"!sudo yum install git-lfs -y\n",
"!git lfs install"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Development environment "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import sagemaker.huggingface"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Permissions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_If you are going to use Sagemaker in a local environment. You need access to an IAM Role with the required permissions for Sagemaker. You can find [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) more about it._"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sagemaker\n",
"import boto3\n",
"sess = sagemaker.Session()\n",
"# sagemaker session bucket -> used for uploading data, models and logs\n",
"# sagemaker will automatically create this bucket if it not exists\n",
"sagemaker_session_bucket=None\n",
"if sagemaker_session_bucket is None and sess is not None:\n",
" # set to default bucket if a bucket name is not given\n",
" sagemaker_session_bucket = sess.default_bucket()\n",
"\n",
"try:\n",
" role = sagemaker.get_execution_role()\n",
"except ValueError:\n",
" iam = boto3.client('iam')\n",
" role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']\n",
"\n",
"sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)\n",
"\n",
"print(f\"sagemaker role arn: {role}\")\n",
"print(f\"sagemaker bucket: {sess.default_bucket()}\")\n",
"print(f\"sagemaker session region: {sess.boto_region_name}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Choose 🤗 Transformers `examples/` script\n",
"\n",
"The [🤗 Transformers repository](https://github.com/huggingface/transformers/tree/master/examples) contains several `examples/`scripts for fine-tuning models on tasks from `language-modeling` to `token-classification`. In our case, we are using the `run_summarization.py` from the `seq2seq/` examples. \n",
"\n",
"_**Note**: you can use this tutorial identical to train your model on a different examples script._\n",
"\n",
"Since the `HuggingFace` Estimator has git support built-in, we can specify a [training script that is stored in a GitHub repository](https://sagemaker.readthedocs.io/en/stable/overview.html#use-scripts-stored-in-a-git-repository) as `entry_point` and `source_dir`.\n",
"\n",
"We are going to use the `transformers 4.4.2` DLC which means we need to configure the `v4.4.2` as the branch to pull the compatible example scripts."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.26.0'} # v4.6.1 is referring to the `transformers_version` you use in the estimator."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configure distributed training and hyperparameters\n",
"\n",
"Next, we will define our `hyperparameters` and configure our distributed training strategy. As hyperparameter, we can define any [Seq2SeqTrainingArguments](https://huggingface.co/transformers/main_classes/trainer.html#seq2seqtrainingarguments) and the ones defined in [run_summarization.py](https://github.com/huggingface/transformers/tree/master/examples/seq2seq#sequence-to-sequence-training-and-evaluation). "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# hyperparameters, which are passed into the training job\n",
"hyperparameters={'per_device_train_batch_size': 4,\n",
" 'per_device_eval_batch_size': 4,\n",
" 'model_name_or_path': 'facebook/bart-large-cnn',\n",
" 'dataset_name': 'samsum',\n",
" 'do_train': True,\n",
" 'do_eval': True,\n",
" 'do_predict': True,\n",
" 'predict_with_generate': True,\n",
" 'output_dir': '/opt/ml/model',\n",
" 'num_train_epochs': 3,\n",
" 'learning_rate': 5e-5,\n",
" 'seed': 7,\n",
" 'fp16': True,\n",
" }\n",
"\n",
"# configuration for running training on smdistributed Data Parallel\n",
"distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create a `HuggingFace` estimator and start training"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sagemaker.huggingface import HuggingFace\n",
"\n",
"# create the Estimator\n",
"huggingface_estimator = HuggingFace(\n",
" entry_point='run_summarization.py', # script\n",
" source_dir='./examples/pytorch/summarization', # relative path to example\n",
" git_config=git_config,\n",
" instance_type='ml.p3dn.24xlarge',\n",
" instance_count=2,\n",
" transformers_version='4.26.0',\n",
" pytorch_version='1.13.1',\n",
" py_version='py39',\n",
" role=role,\n",
" hyperparameters = hyperparameters,\n",
" distribution = distribution\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# starting the train job\n",
"huggingface_estimator.fit()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deploying the endpoint\n",
"\n",
"To deploy our endpoint, we call `deploy()` on our HuggingFace estimator object, passing in our desired number of instances and instance type."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"predictor = huggingface_estimator.deploy(1,\"ml.g4dn.xlarge\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, we use the returned predictor object to call the endpoint."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"conversation = '''Jeff: Can I train a 🤗 Transformers model on Amazon SageMaker? \n",
" Philipp: Sure you can use the new Hugging Face Deep Learning Container. \n",
" Jeff: ok.\n",
" Jeff: and how can I get started? \n",
" Jeff: where can I find documentation? \n",
" Philipp: ok, ok you can find everything here. https://huggingface.co/blog/the-partnership-amazon-sagemaker-and-hugging-face \n",
" '''\n",
"\n",
"data= {\"inputs\":conversation}\n",
"\n",
"predictor.predict(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we delete the endpoint again."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"predictor.delete_endpoint()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Upload the fine-tuned model to [huggingface.co](http://huggingface.co)\n",
"\n",
"We can download our model from Amazon S3 and unzip it using the following snippet.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import tarfile\n",
"from sagemaker.s3 import S3Downloader\n",
"\n",
"local_path = 'my_bart_model'\n",
"\n",
"os.makedirs(local_path, exist_ok = True)\n",
"\n",
"# download model from S3\n",
"S3Downloader.download(\n",
" s3_uri=huggingface_estimator.model_data, # s3 uri where the trained model is located\n",
" local_path=local_path, # local path where *.targ.gz is saved\n",
" sagemaker_session=sess # sagemaker session used for training the model\n",
")\n",
"\n",
"# unzip model\n",
"tar = tarfile.open(f\"{local_path}/model.tar.gz\", \"r:gz\")\n",
"tar.extractall(path=local_path)\n",
"tar.close()\n",
"os.remove(f\"{local_path}/model.tar.gz\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before we are going to upload our model to [huggingface.co](http://huggingface.co) we need to create a `model_card`. The `model_card` describes the model includes hyperparameters, results and which dataset was used for training. To create a `model_card` we create a `README.md` in our `local_path`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# read eval and test results \n",
"with open(f\"{local_path}/eval_results.json\") as f:\n",
" eval_results_raw = json.load(f)\n",
" eval_results={}\n",
" eval_results[\"eval_rouge1\"] = eval_results_raw[\"eval_rouge1\"]\n",
" eval_results[\"eval_rouge2\"] = eval_results_raw[\"eval_rouge2\"]\n",
" eval_results[\"eval_rougeL\"] = eval_results_raw[\"eval_rougeL\"]\n",
" eval_results[\"eval_rougeLsum\"] = eval_results_raw[\"eval_rougeLsum\"]\n",
"\n",
"with open(f\"{local_path}/test_results.json\") as f:\n",
" test_results_raw = json.load(f)\n",
" test_results={}\n",
" test_results[\"test_rouge1\"] = test_results_raw[\"test_rouge1\"]\n",
" test_results[\"test_rouge2\"] = test_results_raw[\"test_rouge2\"]\n",
" test_results[\"test_rougeL\"] = test_results_raw[\"test_rougeL\"]\n",
" test_results[\"test_rougeLsum\"] = test_results_raw[\"test_rougeLsum\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After we extract all the metrics we want to include we are going to create our `README.md`. Additionally to the automated generation of the results table we add the metrics manually to the `metadata` of our model card under `model-index`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(eval_results)\n",
"print(test_results)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"\n",
"MODEL_CARD_TEMPLATE = \"\"\"\n",
"---\n",
"language: en\n",
"tags:\n",
"- sagemaker\n",
"- bart\n",
"- summarization\n",
"license: apache-2.0\n",
"datasets:\n",
"- samsum\n",
"model-index:\n",
"- name: {model_name}\n",
" results:\n",
" - task: \n",
" name: Abstractive Text Summarization\n",
" type: abstractive-text-summarization\n",
" dataset:\n",
" name: \"SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization\" \n",
" type: samsum\n",
" metrics:\n",
" - name: Validation ROGUE-1\n",
" type: rogue-1\n",
" value: 42.621\n",
" - name: Validation ROGUE-2\n",
" type: rogue-2\n",
" value: 21.9825\n",
" - name: Validation ROGUE-L\n",
" type: rogue-l\n",
" value: 33.034\n",
" - name: Test ROGUE-1\n",
" type: rogue-1\n",
" value: 41.3174\n",
" - name: Test ROGUE-2\n",
" type: rogue-2\n",
" value: 20.8716\n",
" - name: Test ROGUE-L\n",
" type: rogue-l\n",
" value: 32.1337\n",
"widget:\n",
"- text: | \n",
" Jeff: Can I train a 🤗 Transformers model on Amazon SageMaker? \n",
" Philipp: Sure you can use the new Hugging Face Deep Learning Container. \n",
" Jeff: ok.\n",
" Jeff: and how can I get started? \n",
" Jeff: where can I find documentation? \n",
" Philipp: ok, ok you can find everything here. https://huggingface.co/blog/the-partnership-amazon-sagemaker-and-hugging-face \n",
"---\n",
"## `{model_name}`\n",
"This model was trained using Amazon SageMaker and the new Hugging Face Deep Learning container.\n",
"For more information look at:\n",
"- [🤗 Transformers Documentation: Amazon SageMaker](https://huggingface.co/transformers/sagemaker.html)\n",
"- [Example Notebooks](https://github.com/huggingface/notebooks/tree/master/sagemaker)\n",
"- [Amazon SageMaker documentation for Hugging Face](https://docs.aws.amazon.com/sagemaker/latest/dg/hugging-face.html)\n",
"- [Python SDK SageMaker documentation for Hugging Face](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/index.html)\n",
"- [Deep Learning Container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-training-containers)\n",
"## Hyperparameters\n",
" {hyperparameters}\n",
"## Usage\n",
" from transformers import pipeline\n",
" summarizer = pipeline(\"summarization\", model=\"philschmid/{model_name}\")\n",
" conversation = '''Jeff: Can I train a 🤗 Transformers model on Amazon SageMaker? \n",
" Philipp: Sure you can use the new Hugging Face Deep Learning Container. \n",
" Jeff: ok.\n",
" Jeff: and how can I get started? \n",
" Jeff: where can I find documentation? \n",
" Philipp: ok, ok you can find everything here. https://huggingface.co/blog/the-partnership-amazon-sagemaker-and-hugging-face \n",
" '''\n",
" summarizer(conversation)\n",
"## Results\n",
"| key | value |\n",
"| --- | ----- |\n",
"{eval_table}\n",
"{test_table}\n",
"\"\"\"\n",
"\n",
"# Generate model card (todo: add more data from Trainer)\n",
"model_card = MODEL_CARD_TEMPLATE.format(\n",
" model_name=f\"{hyperparameters['model_name_or_path'].split('/')[1]}-{hyperparameters['dataset_name']}\",\n",
" hyperparameters=json.dumps(hyperparameters, indent=4, sort_keys=True),\n",
" eval_table=\"\\n\".join(f\"| {k} | {v} |\" for k, v in eval_results.items()),\n",
" test_table=\"\\n\".join(f\"| {k} | {v} |\" for k, v in test_results.items()),\n",
")\n",
"with open(f\"{local_path}/README.md\", \"w\") as f:\n",
" f.write(model_card)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After we have our unzipped model and model card located in `my_bart_model` we can use the either `huggingface_hub` SDK to create a repository and upload it to [huggingface.co](http://huggingface.co) or go to https://huggingface.co/new an create a new repository and upload it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from getpass import getpass\n",
"from huggingface_hub import HfApi, Repository\n",
"\n",
"hf_username = \"philschmid\" # your username on huggingface.co\n",
"hf_email = \"philipp@huggingface.co\" # email used for commit\n",
"repository_name = f\"{hyperparameters['model_name_or_path'].split('/')[1]}-{hyperparameters['dataset_name']}\" # repository name on huggingface.co\n",
"password = getpass(\"Enter your password:\") # creates a prompt for entering password\n",
"\n",
"# get hf token\n",
"token = HfApi().login(username=hf_username, password=password)\n",
"\n",
"# create repository\n",
"repo_url = HfApi().create_repo(token=token, name=repository_name, exist_ok=True)\n",
"\n",
"# create a Repository instance\n",
"model_repo = Repository(use_auth_token=token,\n",
" clone_from=repo_url,\n",
" local_dir=local_path,\n",
" git_user=hf_username,\n",
" git_email=hf_email)\n",
"\n",
"# push model to the hub\n",
"model_repo.push_to_hub()\n",
"\n",
"print(f\"https://huggingface.co/{hf_username}/{repository_name}\")"
]
}
],
"metadata": {
"instance_type": "ml.t3.medium",
"interpreter": {
"hash": "c281c456f1b8161c8906f4af2c08ed2c40c50136979eaae69688b01f70e9f4a9"
},
"kernelspec": {
"display_name": "conda_pytorch_p39",
"language": "python",
"name": "conda_pytorch_p39"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.15"
}
},
"nbformat": 4,
"nbformat_minor": 4
}