In [None]:
# @title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the "License")

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# 🌦️ Weather forecasting -- _Training_

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/GoogleCloudPlatform/python-docs-samples/blob/main/people-and-planet-ai/weather-forecasting/notebooks/3-training.ipynb)

This sample is broken into the following notebooks:

* [![Open in Colab](https://github.com/googlecolab/open_in_colab/raw/main/images/icon16.png) **🧭 Overview**](https://colab.research.google.com/github/GoogleCloudPlatform/python-docs-samples/blob/main/people-and-planet-ai/weather-forecasting/notebooks/1-overview.ipynb):
  Go through what we want to achieve, and explore the data we want to use as _inputs and outputs_ for our model.

* [![Open in Colab](https://github.com/googlecolab/open_in_colab/raw/main/images/icon16.png) **🗄️ Create the dataset**](https://colab.research.google.com/github/GoogleCloudPlatform/python-docs-samples/blob/main/people-and-planet-ai/weather-forecasting/notebooks/2-dataset.ipynb):
  Use [Apache Beam](https://beam.apache.org/) to fetch data from [Earth Engine](https://earthengine.google.com/) in parallel, and create a dataset for our model in [Dataflow](https://cloud.google.com/dataflow).

* ![Open in Colab](https://github.com/googlecolab/open_in_colab/raw/main/images/icon16.png) **🧠 Train the model**:
  Build a simple _Fully Convolutional Network_ in [PyTorch](https://pytorch.org/) and train it in [Vertex AI](https://cloud.google.com/vertex-ai/docs/training/custom-training) with the dataset we created.

* [![Open in Colab](https://github.com/googlecolab/open_in_colab/raw/main/images/icon16.png) **🔮 Model predictions**](https://colab.research.google.com/github/GoogleCloudPlatform/python-docs-samples/blob/main/people-and-planet-ai/weather-forecasting/notebooks/4-predictions.ipynb):
  Get predictions from the model with data it has never seen before.

This sample leverages geospatial satellite and precipitation data from [Google Earth Engine](https://earthengine.google.com/).
Using satellite imagery, you'll build and train a model for rain "nowcasting", i.e. predicting the amount of rainfall for a given geospatial region and time in the immediate future.

* ⏲️ **Time estimate**: ~40 minutes
* 💰 **Cost estimate**: [a few cents on Vertex AI](https://cloud.google.com/vertex-ai/pricing#custom-trained_models)

💚 This is one of many **machine learning how-to samples** inspired by **real climate solutions** aired on the [People and Planet AI 🎥 series](https://www.youtube.com/playlist?list=PLIivdWyY5sqI-llB35Dcb187ZG155Rs_7).

# 🎬 Before you begin

Let's start by cloning the GitHub repository, and installing some dependencies.

In [None]:
# Now let's get the code from GitHub and navigate to the sample.
!git clone https://github.com/GoogleCloudPlatform/python-docs-samples.git
%cd python-docs-samples/people-and-planet-ai/weather-forecasting

The [`weather-model`](../serving/weather-model) local package contains the model definition and the training script.
This ensures we use the same model definition for both training and predictions.


In [None]:
# Upgrade `setuptools` to install packages from pyproject.toml files.
!pip install --quiet --upgrade --no-warn-conflicts pip setuptools

# We need `build` and `virtualenv` to build the local packages.
!pip install --quiet build virtualenv

# Install the `weather-model` local package.
!pip install google-cloud-aiplatform serving/weather-model

> **🛑 Restart the runtime 🛑**

Colab already comes with many dependencies pre-loaded.
In order to ensure everything runs as expected, we **_must_ restart the runtime**. This allows Colab to load the latest versions of the libraries.

!["Runtime" > "Restart runtime"](images/restart-runtime.png)

In [None]:
# Alternatively, restart the runtime by ending the process.
exit()

After restarting the runtime, let's navigate back into the sample directory.

In [None]:
%cd python-docs-samples/people-and-planet-ai/weather-forecasting

[Errno 2] No such file or directory: 'python-docs-samples/people-and-planet-ai/weather-forecasting'
/content/python-docs-samples/people-and-planet-ai/weather-forecasting/python-docs-samples/people-and-planet-ai/weather-forecasting


## ☁️ My Google Cloud resources

Make sure you have followed these steps to configure your Google Cloud project:

1. Enable the APIs: _Vertex AI_

  <button>

  [Click here to enable the APIs](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com)
  </button>

1. Create or use an existing Cloud Storage bucket.

  <button>

  [Click here to create a new Cloud Storage bucket](https://console.cloud.google.com/storage/create-bucket)
  </button>

Once you have everything ready, you can go ahead and fill in your Google Cloud resources in the following code cell.
Make sure you run it!

In [None]:
from __future__ import annotations

import os
from google.colab import auth

# Please fill in these values.
project = ""  # @param {type:"string"}
bucket = ""  # @param {type:"string"}
location = "us-central1"  # @param {type:"string"}

# Quick input validations.
assert project, "⚠️ Please provide a Google Cloud project ID"
assert bucket, "⚠️ Please provide a Cloud Storage bucket name"
assert not bucket.startswith(
    "gs://"
), f"⚠️ Please remove the gs:// prefix from the bucket name: {bucket}"
assert location, "⚠️ Please provide a Google Cloud location"

# Authenticate to Colab.
auth.authenticate_user()

# Set GOOGLE_CLOUD_PROJECT for google.auth.default().
os.environ["GOOGLE_CLOUD_PROJECT"] = project

# Set the gcloud project for other gcloud commands.
!gcloud config set project {project}

# 🧠 Train the model locally

We need our model for both training and for prediction.
So we created the local [`weather-model`](../serving/weather-model) module.
It contains [`weather/model.py`](../serving/weather-model/weather/model.py) where the model is defined, and [`weather/trainer.py`](../serving/weather-model/weather/trainer.py) where all the training code lives.

## 📖 Read the dataset

Unfortunately, PyTorch cannot read files from Cloud Storage out of the box.
Fortunately, Vertex AI uses [Cloud Storage FUSE](https://cloud.google.com/blog/products/ai-machine-learning/cloud-storage-file-system-ai-training) to mount and access Cloud Storage files as if they were local files.

For now, let's download the data files we created in the [🗄️ **Create the dataset**](https://colab.research.google.com/github/GoogleCloudPlatform/python-docs-samples/blob/main/people-and-planet-ai/weather-forecasting/notebooks/2-dataset.ipynb) notebook to have them locally.

In [None]:
data_path_gcs = f"gs://{bucket}/weather/data"

!mkdir -p data-training
!gsutil -m cp {data_path_gcs}/* data-training

First, we need to load the dataset to feed it to the model.
To read a dataset in PyTorch, we could manually write a subclass of `torch.utils.data.Dataset`, but we're going to use [Hugging Face 🤗 Datasets](https://huggingface.co/docs/datasets/main/en/index), a high-level interface that makes datasets easier to work with.

Our data files are compressed NumPy files, which we can easily load with NumPy.
To load them into a 🤗 Dataset, we can use [`Dataset.from_dict`](https://huggingface.co/docs/datasets/main/en/loading#python-dictionary) and pass it a dictionary containing all the file names of our data files.
Then, we use [`Dataset.map`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) to read the data files and process the examples in parallel.
Additionally, we _augment_ the data by rotating and flipping each example.
To split our dataset into training and testing/validation subsets, we use [`Dataset.train_test_split`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split).

In [`weather/trainer.py`](../serving/weather-model/weather/trainer.py) we defined the `read_dataset` function, which loads our data files and returns a 🤗 Dataset with train/test splits.
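
To make the augmentation step concrete, here is a minimal NumPy sketch of rotating and flipping one example. The `augment` helper is a hypothetical name used only for illustration, and the actual `read_dataset` implementation may augment differently. It assumes each example is a `(width, height, channels)` array, and applies the same transformation to inputs and labels so they stay aligned.

```python
import numpy as np


def augment(inputs: np.ndarray, labels: np.ndarray):
    """Yield rotated and flipped copies of one (width, height, channels) example."""
    for flip in (False, True):
        for k in range(4):  # 0, 90, 180 and 270 degree rotations
            x = np.rot90(inputs, k, axes=(0, 1))
            y = np.rot90(labels, k, axes=(0, 1))
            if flip:
                # Flip along the first spatial axis, on inputs and labels alike.
                x, y = np.flip(x, axis=0), np.flip(y, axis=0)
            yield x, y


# Hypothetical example with the same shapes as the dataset in this notebook.
inputs = np.arange(5 * 5 * 52, dtype=np.float32).reshape(5, 5, 52)
labels = np.arange(5 * 5 * 2, dtype=np.float32).reshape(5, 5, 2)

augmented = list(augment(inputs, labels))
print(len(augmented), augmented[0][0].shape, augmented[0][1].shape)
```

Four rotations combined with an optional flip give eight variants per example, and none of them changes the shape of the arrays.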

In [None]:
from weather.trainer import read_dataset

data_path = "data-training"
train_test_ratio = 0.9  # 90% train, 10% test

# Read the dataset with train/test splits.
dataset = read_dataset(data_path, train_test_ratio)

In [None]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['inputs', 'labels'],
        num_rows: 3069
    })
    test: Dataset({
        features: ['inputs', 'labels'],
        num_rows: 341
    })
})


> 💡 For more information on loading data into a 🤗 Dataset, refer to the [Loading data](https://huggingface.co/docs/datasets/main/en/loading) guide.

🤗 Datasets allow for random access just like PyTorch Datasets.

Let's see the shapes of the first training example from the `train` split.
When we access an example, we get an `{'inputs': list, 'labels': list}` dictionary, where each value is a [Python list](https://docs.python.org/3/library/stdtypes.html#list).
We can then convert them into [PyTorch tensors](https://pytorch.org/docs/stable/tensors.html) for further use.

In [None]:
import torch

train_dataset = dataset["train"]
example = train_dataset[0]  # random access the first element

print(f"inputs: {torch.as_tensor(example['inputs']).shape}")
print(f"labels: {torch.as_tensor(example['labels']).shape}")

inputs: torch.Size([5, 5, 52])
labels: torch.Size([5, 5, 2])


The _inputs_ have the shape `(width, height, num_inputs)`, where each input is the value of an Earth Engine band.

The _outputs_ have the shape `(width, height, num_outputs)`, where each output is a prediction.
We're predicting for 2 and 6 hours into the future, so we get 2 outputs.

## 📓 Define the model

First we define our model, which is a very simple _Fully Convolutional Network_.
The input data can contain very large numbers, but machine learning models generally work better with small numbers, roughly between -1 and 1.
So in [`weather/model.py`](../serving/weather-model/weather/model.py) we defined a `Normalization` layer which applies [Z-Score](https://developers.google.com/machine-learning/data-prep/transform/normalization#z-score) normalization to all the model's inputs as a first step.
But we need to provide it with the [_mean_](https://en.wikipedia.org/wiki/Mean) and [_standard deviation_](https://en.wikipedia.org/wiki/Standard_deviation) from the training dataset.

A model always processes _batches_ of inputs, so we always get an extra _first_ dimension.
This means that for all the layers in the model, our inputs have the shape `(batch, width, height, num_inputs)`, and our outputs have the shape `(batch, width, height, num_outputs)`.

We need to calculate the mean and standard deviation for each input, so each band is normalized within its own range.
Both the mean and standard deviation must have the shape `(1, 1, 1, num_inputs)`, which allows them to _broadcast_ to any batch size, width and height, as long as the `num_inputs` match.

In [None]:
import numpy as np

# Let's get the mean and standard deviation.
data = np.array(dataset["train"]["inputs"], np.float32)
mean = data.mean(axis=(0, 1, 2))[None, None, None, :]
std = data.std(axis=(0, 1, 2))[None, None, None, :]

print(f"mean: {mean.shape}")
print(f"std:  {std.shape}")

mean: (1, 1, 1, 52)
std:  (1, 1, 1, 52)
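
As a rough illustration of why the singleton dimensions matter, the following NumPy sketch normalizes a batch of any size with per-band statistics. The data here is random and purely hypothetical. Only the shapes mirror the dataset in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the training inputs: (examples, width, height, num_inputs).
data = rng.normal(loc=2000.0, scale=300.0, size=(16, 5, 5, 52)).astype(np.float32)

# Reduce over examples, width and height, keeping one value per band.
mean = data.mean(axis=(0, 1, 2))[None, None, None, :]
std = data.std(axis=(0, 1, 2))[None, None, None, :]

# The singleton dimensions broadcast against any batch size, width and height.
batch = data[:3]
normalized = (batch - mean) / std
print(mean.shape, batch.shape, normalized.shape)

# Normalizing the full dataset gives roughly zero mean and unit variance per band.
full = (data - mean) / std
print(np.abs(full.mean(axis=(0, 1, 2))).max(), full.std(axis=(0, 1, 2)).mean())
```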


Let's see how the normalization works for a sample of an example's inputs.

In [None]:
import torch

from weather.model import Normalization

normalization = Normalization(mean, std)

sample = lambda x: x[0, 0, 0, 10:15].detach().numpy()

print(f"mean: {sample(normalization.mean)}")
print(f"std:  {sample(normalization.std)}")
print("-" * 40)

example = dataset["train"][0]
example_inputs = torch.as_tensor([example["inputs"]])
normalized_inputs = normalization(example_inputs)
print(f"inputs:     {sample(example_inputs)}")
print(f"normalized: {sample(normalized_inputs)}")

mean: [2202.3132 2355.514  2328.052  2470.9158 2687.0806]
std:  [256.82922 324.5936  332.1437  480.68338 351.21927]
----------------------------------------
inputs:     [2295. 2514. 2534. 2774. 2957.]
normalized: [0.36088872 0.48826003 0.6200569  0.6305278  0.76852113]


After applying the `Normalization` layer, we get small numbers much closer to the range between -1 and 1. They don't have to be _exactly_ within the range, just close enough.
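
The printed values can be reproduced by hand with the Z-Score formula. This sketch assumes the `Normalization` layer computes `(inputs - mean) / std` with no extra terms such as an epsilon. The numbers are copied from the output above.

```python
import numpy as np

# Values copied from the output above (bands 10 to 14 of the first example).
mean = np.array([2202.3132, 2355.514, 2328.052, 2470.9158, 2687.0806])
std = np.array([256.82922, 324.5936, 332.1437, 480.68338, 351.21927])
inputs = np.array([2295.0, 2514.0, 2534.0, 2774.0, 2957.0])

# Z-Score: subtract the mean, then divide by the standard deviation.
normalized = (inputs - mean) / std
print(normalized.round(4))
```

Under that assumption, the result should agree with the `normalized` line above to about four decimal places.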

Another thing to note is that our data is in a channels-last format, like `(width, height, channels)`.
But PyTorch expects channels-first format in the convolutional layers, like `(channels, width, height)`.
We still want to pass our inputs in a channels-last format and want the predictions back as channels-last for convenience, but we must convert them to channels-first for PyTorch convolutional layers to work.

In [`weather/model.py`](../serving/weather-model/weather/model.py) we define the `MoveDim` layer, which works similarly to [`torch.movedim`](https://pytorch.org/docs/stable/generated/torch.movedim.html) so the model can move the channels dimension as needed.


In [None]:
from weather.model import MoveDim

# We move the channels/last dimension (-1) to the second index (1),
# since the first (0) is for the batch dimension.
to_channels_first = MoveDim(-1, 1)
channels_first = to_channels_first(normalized_inputs)

print(f"normalized:     {normalized_inputs.shape}")
print(f"channels-first: {channels_first.shape}")

normalized:     torch.Size([1, 5, 5, 52])
channels-first: torch.Size([1, 52, 5, 5])
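
For readers more familiar with NumPy, `np.moveaxis` performs the same kind of axis move. This is only an analogy to show what happens to the shape, not how `MoveDim` is implemented.

```python
import numpy as np

# Hypothetical batch in channels-last format: (batch, width, height, channels).
channels_last = np.arange(1 * 5 * 5 * 52, dtype=np.float32).reshape(1, 5, 5, 52)

# Move the last axis to index 1, right after the batch dimension.
channels_first = np.moveaxis(channels_last, -1, 1)
print(channels_first.shape)

# Moving axis 1 back to the end restores the original layout.
restored = np.moveaxis(channels_first, 1, -1)
print(restored.shape)
```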


The model then passes the data through a
[2D Convolutional layer](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html) for downsampling, and then through a
[2D DeConvolutional layer](https://pytorch.org/docs/stable/generated/torch.nn.ConvTranspose2d.html) for upsampling, so we end up with images the same size as the input image.
We use a [`ReLU`](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html) activation function in between all hidden layers since it's typically a good general-purpose activation function.

The Conv2D and DeConv2D layers form a very simple Fully Convolutional Network architecture. Since both use the same _kernel size_, with the default stride of 1 and no padding, the outputs have the same `(width, height)` as the inputs.

In [None]:
num_inputs = 52
num_hidden1 = 64
num_hidden2 = 128
kernel_size = (3, 3)

fully_convolutional_layers = torch.nn.Sequential(
    torch.nn.Conv2d(num_inputs, num_hidden1, kernel_size),
    torch.nn.ReLU(),
    torch.nn.ConvTranspose2d(num_hidden1, num_hidden2, kernel_size),
    torch.nn.ReLU(),
)

fcn_outputs = fully_convolutional_layers(channels_first)
print(f"FCN outputs: {fcn_outputs.shape}")

FCN outputs: torch.Size([1, 128, 5, 5])
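
The spatial sizes can be checked with the standard output-size formulas for convolutions and transposed convolutions. The sketch below is simplified to dilation 1 and no output padding, which matches the defaults used above. The helper names are made up for this example.

```python
def conv2d_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output size of a convolution along one spatial dimension (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose2d_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output size of a transposed convolution (dilation 1, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


width = 5
hidden = conv2d_size(width, kernel=3)  # the convolution shrinks 5 to 3
output = conv_transpose2d_size(hidden, kernel=3)  # the transposed convolution grows 3 back to 5
print(width, hidden, output)  # prints: 5 3 5
```

With stride 1 and no padding, the convolution removes `kernel - 1` pixels per dimension and the transposed convolution adds them back, which is why matching kernel sizes preserve the image size.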


Now, let's convert the results back into channels-last format with `MoveDim`.

In [None]:
to_channels_last = MoveDim(1, -1)
channels_last = to_channels_last(fcn_outputs)

print(f"channels-last: {channels_last.shape}")

channels-last: torch.Size([1, 5, 5, 128])


For the last layer, we use a [`Linear`](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) layer with the number of outputs we want.
Since we can't have negative precipitation, we pass the model's outputs through a final `ReLU` activation function.

In [None]:
num_outputs = 2

linear = torch.nn.Linear(num_hidden2, num_outputs)
relu = torch.nn.ReLU()

with torch.no_grad():
    raw_predictions = linear(channels_last)
    predictions = relu(raw_predictions)

print(f"predictions: {predictions.shape}")
print(predictions[0, 0, 0])

predictions: torch.Size([1, 5, 5, 2])
tensor([0.0650, 0.0010])
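
The reason for moving back to channels-last first is that a linear layer only transforms the last dimension. Here is a small NumPy sketch of the same computation, using random, hypothetical weights instead of the ones PyTorch initializes.

```python
import numpy as np

rng = np.random.default_rng(0)

num_hidden, num_outputs = 128, 2

# Hypothetical activations and randomly initialized weights.
x = rng.normal(size=(1, 5, 5, num_hidden))
weight = rng.normal(size=(num_outputs, num_hidden))
bias = rng.normal(size=num_outputs)

# A linear layer only transforms the last dimension: y = x @ W^T + b.
raw = x @ weight.T + bias

# ReLU clamps negative values to zero, so predictions are never negative.
predictions = np.maximum(raw, 0.0)
print(raw.shape, predictions.min())
```

Every pixel is transformed independently with the same weights, so the `(batch, width, height)` dimensions pass through unchanged.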


In [`weather/model.py`](../serving/weather-model/weather/model.py) we defined the `WeatherModel` and `WeatherConfig` classes.

The `WeatherModel` class inherits from [`PreTrainedModel`](https://huggingface.co/docs/transformers/main/en/main_classes/model) to make it compatible with [🤗 Transformers](https://huggingface.co/docs/transformers/main/en/index).

The model definition includes the loss function, so it knows how good or bad its predictions are.
We could use any regression loss function like [Mean Absolute Error (L1)](https://pytorch.org/docs/stable/generated/torch.nn.L1Loss.html) or [Mean Squared Error (L2)](https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html).
PyTorch provides a [Smooth L1 Loss](https://pytorch.org/docs/stable/generated/torch.nn.SmoothL1Loss.html), which uses a squared (L2-like) term when the absolute error is below a threshold called `beta`, and an L1 term otherwise.
It's less sensitive to outliers than L2, so we'll use that.
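
To see how the three losses treat an outlier, here is a NumPy sketch of the Smooth L1 formula with `beta=1.0` and mean reduction, which are the PyTorch defaults. The two errors below are made-up values: a small one of 0.5 and an outlier of 3.0.

```python
import numpy as np


def smooth_l1(pred: np.ndarray, target: np.ndarray, beta: float = 1.0) -> float:
    """Quadratic for errors below beta, linear for errors at or above it."""
    diff = np.abs(pred - target)
    loss = np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta)
    return float(loss.mean())


pred = np.array([0.5, 3.0])
target = np.array([0.0, 0.0])

smooth = smooth_l1(pred, target)
l1 = float(np.abs(pred - target).mean())
l2 = float(((pred - target) ** 2).mean())
print(smooth, l1, l2)  # prints: 1.3125 1.75 4.625
```

The outlier dominates the L2 loss because its error is squared, while Smooth L1 only grows linearly with it.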

To create a `WeatherModel`, we have to pass it a `WeatherConfig`.
The `WeatherConfig` contains all the model's hyperparameters, and we must also pass the _mean_ and _standard deviation_ from the training dataset for the normalization layer.
We defined `WeatherModel.create`, which takes in the training dataset inputs and returns a `WeatherModel` with the right `WeatherConfig`.

In [None]:
from weather.model import WeatherModel

model = WeatherModel.create(dataset["train"]["inputs"])
print(model)

WeatherModel(
  (layers): Sequential(
    (0): Normalization()
    (1): MoveDim()
    (2): Conv2d(52, 64, kernel_size=(3, 3), stride=(1, 1))
    (3): ReLU()
    (4): ConvTranspose2d(64, 128, kernel_size=(3, 3), stride=(1, 1))
    (5): ReLU()
    (6): MoveDim()
    (7): Linear(in_features=128, out_features=2, bias=True)
    (8): ReLU()
  )
)


The model outputs a `{'loss': torch.Tensor, 'logits': torch.Tensor}` dictionary during training, and a `{'logits': torch.Tensor}` dictionary during prediction.
This is what 🤗 Transformers expects for [model outputs](https://huggingface.co/docs/transformers/main/en/main_classes/output).

Remember that we _must_ pass a _batch_ of inputs to the model, not a single input.

In [None]:
example = dataset["test"]
inputs_batch = torch.as_tensor(example["inputs"][:1])
labels_batch = torch.as_tensor(example["labels"][:1])

# We pass the labels as well to get the loss, but it's optional.
# If we don't pass the labels, we simply won't get the loss.
# The predictions are under the 'logits' key.
with torch.no_grad():
    predictions = model(inputs_batch, labels_batch)

print(f"inputs:      {inputs_batch.shape}")
print(f"labels:      {labels_batch.shape}")
print(f"loss:        {predictions['loss']}")
print(f"predictions: {predictions['logits'].shape}")
print("-" * 40)
print(f"sample labels:      {labels_batch[0, 0, 0]}")
print(f"sample predictions: {predictions['logits'][0, 0, 0]}")

inputs:      torch.Size([1, 5, 5, 52])
labels:      torch.Size([1, 5, 5, 2])
loss:        0.009296745993196964
predictions: torch.Size([1, 5, 5, 2])
----------------------------------------
sample labels:      tensor([0., 0.])
sample predictions: tensor([0.0797, 0.0000])


These predictions don't look great because we haven't trained our model.
Fortunately, since we've made our model compatible with 🤗 Transformers, we can simply use [`Trainer`](https://huggingface.co/docs/transformers/main/en/main_classes/trainer), which takes care of all the training steps, automatically optimizes the whole process, and uses accelerators like GPUs if available.

## 👟 Train the model

We have to define the number of times we want the model to go through the training dataset; this is called the number of _epochs_.
We also have to define the _batch size_ we want to use during training and testing. This can have a big impact on how fast the model trains. As a rule of thumb, the larger the better, as long as it fits into memory.
We define all these parameters with [`TrainingArguments`](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments).

Then we pass the model, the `TrainingArguments`, and the training and testing datasets into the `Trainer`.
Finally we can train the model with [`Trainer.train`](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train).

In [None]:
from transformers import TrainingArguments, Trainer

epochs = 5
batch_size = 512

# Define our training job.
training_args = TrainingArguments(
    output_dir="checkpoints",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=epochs,
    logging_strategy="epoch",
    evaluation_strategy="epoch",
)
trainer = Trainer(
    model,
    training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)

# Run the training job.
trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
  Num examples = 3069
  Num Epochs = 5
  Instantaneous batch size per device = 512
  Total train batch size (w. parallel, distributed & accumulation) = 512
  Gradient Accumulation steps = 1
  Total optimization steps = 30
  Number of trainable parameters = 104234
Could not estimate the number of tokens of the input, floating-point operations will not be computed


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 1.2889        | 1.016647        |
| 2     | 1.2793        | 1.00968         |
| 3     | 1.2717        | 1.004657        |
| 4     | 1.2667        | 1.001499        |
| 5     | 1.2636        | 1.000306        |


***** Running Evaluation *****
  Num examples = 341
  Batch size = 512
***** Running Evaluation *****
  Num examples = 341
  Batch size = 512
***** Running Evaluation *****
  Num examples = 341
  Batch size = 512
***** Running Evaluation *****
  Num examples = 341
  Batch size = 512
***** Running Evaluation *****
  Num examples = 341
  Batch size = 512


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=30, training_loss=1.2740394274393718, metrics={'train_runtime': 23.7216, 'train_samples_per_second': 646.878, 'train_steps_per_second': 1.265, 'total_flos': 0.0, 'train_loss': 1.2740394274393718, 'epoch': 5.0})
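
Two numbers in the training log can be sanity-checked with simple arithmetic. The step count assumes one device, no gradient accumulation, and that the last partial batch is kept. The parameter count assumes the normalization layer stores its mean and standard deviation as parameters (one value each per band). That is a guess that happens to account for the reported total, not something confirmed here.

```python
import math

# Optimization steps: batches per epoch times the number of epochs.
num_examples, batch_size, epochs = 3069, 512, 5
steps = math.ceil(num_examples / batch_size) * epochs
print(steps)  # prints: 30

# Parameters per layer: weights plus biases.
conv = 52 * 64 * 3 * 3 + 64  # Conv2d(52, 64, kernel_size=(3, 3))
deconv = 64 * 128 * 3 * 3 + 128  # ConvTranspose2d(64, 128, kernel_size=(3, 3))
linear = 128 * 2 + 2  # Linear(128, 2)

# Assumption: mean and std are stored as parameters, 52 values each.
normalization = 52 + 52

total = conv + deconv + linear + normalization
print(total)  # prints: 104234
```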

> 💡 Both losses should go down every epoch, and they should be roughly similar.
> If the training loss goes down, but the testing loss stays flat or goes up, it might be a sign that the model is [overfitting](https://developers.google.com/machine-learning/crash-course/generalization/peril-of-overfitting), meaning that it's memorizing the training dataset instead of learning to generalize.
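
As a rough illustration of that rule of thumb, the sketch below flags a run where the training loss improved but the validation loss ended above its best value. The `looks_overfit` helper is hypothetical and deliberately crude. In practice you would look at the full curves rather than a single boolean.

```python
def looks_overfit(train_losses: list[float], eval_losses: list[float]) -> bool:
    """Flag runs where training loss fell but validation loss ended above its best value."""
    train_improving = train_losses[-1] < train_losses[0]
    eval_worsening = eval_losses[-1] > min(eval_losses)
    return train_improving and eval_worsening


# Losses from the training run above: both keep decreasing.
train = [1.2889, 1.2793, 1.2717, 1.2667, 1.2636]
evals = [1.016647, 1.00968, 1.004657, 1.001499, 1.000306]
print(looks_overfit(train, evals))  # prints: False

# Made-up run where the validation loss turns back up.
print(looks_overfit([1.0, 0.8, 0.6], [1.0, 0.9, 1.1]))  # prints: True
```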

## 💾 Save and load the model

After the model has finished training, we can save it with [`Trainer.save_model`](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.save_model).



In [None]:
trainer.save_model("model")

!ls -lh model

Saving model checkpoint to model
Configuration saved in model/config.json
Model weights saved in model/pytorch_model.bin


total 420K
-rw-r--r-- 1 root root 3.4K Jan 11 21:33 config.json
-rw-r--r-- 1 root root 410K Jan 11 21:33 pytorch_model.bin
-rw-r--r-- 1 root root 3.4K Jan 11 21:33 training_args.bin


Now that the trained model is saved, we can load it anywhere else.
We can load a 🤗 Transformers model with [`PreTrainedModel.from_pretrained`](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained), in our case with `WeatherModel.from_pretrained`.
This loads all the model's hyperparameters as well as the _mean_ and _standard deviation_ for the normalization layer.

In [None]:
from weather.model import WeatherModel

model = WeatherModel.from_pretrained("model")
print(model)

loading configuration file model/config.json
Model config WeatherConfig {
  "architectures": [
    "WeatherModel"
  ],
  "kernel_size": [
    3,
    3
  ],
  "mean": [
    [
      [
        [
          0.965579092502594,
          2.3415911197662354,
          6.150100231170654,
          476.72564697265625,
          421.8377380371094,
          521.5245971679688,
          109.100830078125,
          300.76141357421875,
          262.6136474609375,
          5461.68310546875,
          2202.313232421875,
          2355.513916015625,
          2328.052001953125,
          2470.915771484375,
          2687.08056640625,
          2737.617919921875,
          2684.49365234375,
          2650.0927734375,
          2816.9892578125,
          509.75927734375,
          451.73077392578125,
          535.8512573242188,
          140.81637573242188,
          276.422607421875,
          257.4959411621094,
          4964.77197265625,
          2143.988037109375,
          2276.671630859375,
   

WeatherModel(
  (layers): Sequential(
    (0): Normalization()
    (1): MoveDim()
    (2): Conv2d(52, 64, kernel_size=(3, 3), stride=(1, 1))
    (3): ReLU()
    (4): ConvTranspose2d(64, 128, kernel_size=(3, 3), stride=(1, 1))
    (5): ReLU()
    (6): MoveDim()
    (7): Linear(in_features=128, out_features=2, bias=True)
    (8): ReLU()
  )
)


# ☁️ Train the model in Vertex AI

> ⚠️ Training in Vertex AI doesn't currently work due to an underlying issue when using the Hugging Face `Trainer` API in Vertex AI. For more information, see [#9272](https://github.com/GoogleCloudPlatform/python-docs-samples/issues/9272).

For this example we're training on a very small dataset for a very small number of epochs.
This means we don't have a representative number of examples and the model hasn't seen the data enough times, so it won't perform very well.

Training on larger datasets for a large number of epochs can take a lot of time, so it might be a good idea to do the training in the cloud.
[Vertex AI](https://cloud.google.com/vertex-ai) is a great option, and even allows us to use hardware accelerators like GPUs.
There are [PyTorch pre-built containers](https://cloud.google.com/vertex-ai/docs/training/pre-built-containers#pytorch) which include PyTorch and many common libraries, so we don't need to build a custom container.

The model and trainer are defined in the [`serving/weather-model`](../serving/weather-model) module.
To run it in Vertex AI, we must build the package, copy it to Cloud Storage, and launch a custom training job with [`CustomPythonPackageTrainingJob`](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.CustomPythonPackageTrainingJob).

In [None]:
# Build the `weather-model` package.
!python -m build serving/weather-model

In [None]:
!ls -lh serving/weather-model/dist

total 16K
-rw-r--r-- 1 root root 5.9K Jan 11 18:29 weather_model-1.0.0-py3-none-any.whl
-rw-r--r-- 1 root root 4.3K Jan 11 18:29 weather-model-1.0.0.tar.gz


In [None]:
# Stage the `weather-model` package in Cloud Storage.
!gsutil cp serving/weather-model/dist/weather-model-1.0.0.tar.gz gs://{bucket}/weather/

In Vertex AI, we can access Cloud Storage files directly as if they were local files via Cloud Storage FUSE.
Cloud Storage files are available under `/gcs` followed by your bucket and file path.
To learn more, see the [Cloud Storage as a File System in AI Training](https://cloud.google.com/blog/products/ai-machine-learning/cloud-storage-file-system-ai-training) blog post.
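
The path mapping described above is a simple prefix swap. The `to_fuse_path` helper below is a hypothetical convenience for illustration, and the bucket name in the example is a placeholder.

```python
def to_fuse_path(gcs_uri: str) -> str:
    """Convert a gs://bucket/path URI into its /gcs/bucket/path FUSE equivalent."""
    prefix = "gs://"
    if not gcs_uri.startswith(prefix):
        raise ValueError(f"Expected a gs:// URI, got: {gcs_uri}")
    return "/gcs/" + gcs_uri[len(prefix):]


# Placeholder bucket name, replace it with your own.
fuse_path = to_fuse_path("gs://my-bucket/weather/data")
print(fuse_path)  # prints: /gcs/my-bucket/weather/data
```

These paths only exist inside the Vertex AI training environment, which is why the local part of this notebook copies the files with `gsutil` instead.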

In [None]:
from google.cloud import aiplatform

epochs = 100
timeout_min = 60  # 1 hour

# Cloud Storage paths.
data_path = f"/gcs/{bucket}/weather/data"
model_path = f"/gcs/{bucket}/weather/model"

aiplatform.init(project=project, location=location, staging_bucket=bucket)

# Launch the custom training job.
job = aiplatform.CustomPythonPackageTrainingJob(
    display_name="weather-forecasting",
    python_package_gcs_uri=f"gs://{bucket}/weather/weather-model-1.0.0.tar.gz",
    python_module_name="weather.trainer",
    container_uri="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-2.py310:latest",
)
job.run(
    machine_type="n1-highmem-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    args=[
        f"--data-path={data_path}",
        f"--model-path={model_path}",
        f"--epochs={epochs}",
    ],
    timeout=timeout_min * 60,  # in seconds
)

> 💡 Look at your Vertex AI training jobs: https://console.cloud.google.com/vertex-ai/training/custom-jobs

# ⛳️ What's next?

* [![Open in Colab](https://github.com/googlecolab/open_in_colab/raw/main/images/icon16.png) **🔮 Model predictions**](https://colab.research.google.com/github/GoogleCloudPlatform/python-docs-samples/blob/main/people-and-planet-ai/weather-forecasting/notebooks/4-predictions.ipynb):
  Get predictions from the model with data it has never seen before.