This notebook shows how to deploy a vision model from 🤗 Transformers (written in TensorFlow) to [Vertex AI](https://cloud.google.com/vertex-ai). Deploying on Vertex AI has several benefits:

* Vertex AI provides support for autoscaling, authorization, and authentication out of the box.
* You can maintain multiple versions of a model and control the traffic split between them.
* It is fully managed: you do not provision or maintain the serving infrastructure yourself.

This notebook uses code from [this official GCP example](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/vertex_endpoints/optimized_tensorflow_runtime/bert_optimized_online_prediction.ipynb).

This tutorial uses the following billable components of Google Cloud:

* Vertex AI
* Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

## Initial setup

First, authenticate so that Colab can access your GCP resources.

In [None]:
from google.colab import auth
auth.authenticate_user()

In [None]:
# Storage bucket
GCS_BUCKET = "gs://[GCS-BUCKET-NAME]"
REGION = "us-central1"

In [None]:
!gsutil mb -l $REGION $GCS_BUCKET

Creating gs://hf-tf-vision/...


In [None]:
# Install Vertex AI SDK and transformers
!pip install --upgrade google-cloud-aiplatform transformers -q

## Initial imports

In [None]:
from transformers import ViTImageProcessor, TFViTForImageClassification
import tensorflow as tf
import tempfile
import requests
import base64
import json
import os

2022-07-17 05:01:47.593465: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.


In [None]:
import transformers

print(tf.__version__)
print(transformers.__version__)

2.9.0-rc2
4.20.1


## Save the model locally

We will work with a [Vision Transformer B-16 model provided by 🤗 Transformers](https://huggingface.co/docs/transformers/main/en/model_doc/vit). We will first initialize it, load the model weights, and then save it locally as a [SavedModel](https://www.tensorflow.org/guide/saved_model) resource.

In [None]:
# saved_model=True also exports the model in the TensorFlow SavedModel format.
LOCAL_MODEL_DIR = "vit"
model = TFViTForImageClassification.from_pretrained("google/vit-base-patch16-224")
model.save_pretrained(LOCAL_MODEL_DIR, saved_model=True)

All model checkpoint layers were used when initializing TFViTForImageClassification.

All the layers of TFViTForImageClassification were initialized from the model checkpoint at google/vit-base-patch16-224.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFViTForImageClassification for predictions without further training.


INFO:tensorflow:Assets written to: vit/saved_model/1/assets



In [None]:
# Inspect the input and output signatures of the model
!saved_model_cli show --dir {LOCAL_MODEL_DIR}/saved_model/1 --all


MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['__saved_model_init_op']:
  The given SavedModel SignatureDef contains the following input(s):
  The given SavedModel SignatureDef contains the following output(s):
    outputs['__saved_model_init_op'] tensor_info:
        dtype: DT_INVALID
        shape: unknown_rank
        name: NoOp
  Method name is: 

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['pixel_values'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, -1, -1, -1)
        name: serving_default_pixel_values:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['logits'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 1000)
        name: StatefulPartitionedCall:0
  Method name is: tensorflow/serving/predict

Concrete Functions:
  Function Name: '__call__'
    Option #1
      Callable with:
        Argument #1
          DType

## Embedding pre- and post-processing ops inside the model

ML models usually require some pre- and post-processing of the input data and predicted results. It is therefore a good idea to ship a model that already includes these steps, which also helps reduce training/serving skew.

For our model we need:

* Data normalization, resizing, and transposition as the preprocessing ops.
* Mapping the predicted logits to ImageNet-1k classes as the post-processing ops. 
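
To make the preprocessing arithmetic concrete, here is a minimal pure-Python sketch (no TensorFlow) of the scale-then-normalize step and the channel-first transposition. It assumes a per-channel mean and standard deviation of 0.5, as in the ViT processor configuration used in this notebook. The helper names `normalize_pixel` and `to_channel_first` are hypothetical and exist only for illustration.

```python
def normalize_pixel(value, mean=0.5, std=0.5):
    """Map an 8-bit pixel value to the range the model expects."""
    scaled = value / 255  # [0, 255] -> [0, 1]
    return (scaled - mean) / std  # [0, 1] -> [-1, 1] when mean = std = 0.5


def to_channel_first(image):
    """Transpose a nested-list image from (H, W, C) to (C, H, W)."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    return [
        [[image[h][w][c] for w in range(width)] for h in range(height)]
        for c in range(channels)
    ]


print(normalize_pixel(0), normalize_pixel(255))  # -1.0 1.0
```

The TensorFlow functions in the following cells perform the same steps on whole tensors at once.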

In [None]:
processor = ViTImageProcessor()
processor

ViTFeatureExtractor {
  "do_normalize": true,
  "do_resize": true,
  "feature_extractor_type": "ViTFeatureExtractor",
  "image_mean": [
    0.5,
    0.5,
    0.5
  ],
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "resample": 2,
  "size": 224
}

In [None]:
CONCRETE_INPUT = "pixel_values"
SIZE = processor.size["height"]
INPUT_SHAPE = (SIZE, SIZE, 3)

In [None]:
def normalize_img(img, mean=processor.image_mean, std=processor.image_std):
    # Scale to the value range of [0, 1] first and then normalize.
    img = img / 255
    mean = tf.constant(mean)
    std = tf.constant(std)
    return (img - mean) / std

def preprocess(string_input):
    decoded = tf.io.decode_jpeg(string_input, channels=3)
    resized = tf.image.resize(decoded, size=(SIZE, SIZE))
    normalized = normalize_img(resized)
    normalized = tf.transpose(normalized, (2, 0, 1)) # Since HF models are channel-first.
    return normalized


@tf.function(input_signature=[tf.TensorSpec([None], tf.string)])
def preprocess_fn(string_input):
    decoded_images = tf.map_fn(
        preprocess, string_input, fn_output_signature=tf.float32,
    )
    return {CONCRETE_INPUT: decoded_images}


def model_exporter(model: tf.keras.Model):
    m_call = tf.function(model.call).get_concrete_function(
        tf.TensorSpec(
            shape=[None, 3, SIZE, SIZE], dtype=tf.float32, name=CONCRETE_INPUT
        )
    )

    @tf.function(input_signature=[tf.TensorSpec([None], tf.string)])
    def serving_fn(string_input):
        labels = tf.constant(
            list(model.config.id2label.values()), dtype=tf.string
        )
        images = preprocess_fn(string_input)

        predictions = m_call(**images)
        indices = tf.argmax(predictions.logits, axis=1)
        pred_source = tf.gather(params=labels, indices=indices)
        probs = tf.nn.softmax(predictions.logits, axis=1)
        pred_confidence = tf.reduce_max(probs, axis=1)
        return {"label": pred_source, "confidence": pred_confidence}

    return serving_fn
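
The post-processing inside `serving_fn` amounts to an argmax over the logits plus a softmax to obtain a confidence score. A minimal pure-Python sketch for a single example follows. The three-class `labels` list and the logit values are made up for illustration, and the helper names `softmax` and `postprocess` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, labels):
    """Return the top label and its softmax probability."""
    probs = softmax(logits)
    index = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[index], "confidence": probs[index]}


# Hypothetical labels and logits, not real model output.
result = postprocess([0.1, 2.0, -1.0], ["tabby", "Egyptian cat", "lynx"])
print(result["label"])
```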

In [None]:
# To deploy the model on Vertex AI we must have the model in a storage bucket.
tf.saved_model.save(
    model,
    os.path.join(GCS_BUCKET, LOCAL_MODEL_DIR),
    signatures={"serving_default": model_exporter(model)},
)

Instructions for updating:
back_prop=False is deprecated. Consider using tf.stop_gradient instead.
Instead of:
results = tf.map_fn(fn, elems, back_prop=False)
Use:
results = tf.nest.map_structure(tf.stop_gradient, tf.map_fn(fn, elems))


Instructions for updating:
Use fn_output_signature instead


INFO:tensorflow:Assets written to: gs://hf-tf-vision/vit/assets



**Notes on making the model accept string inputs**:

When images are sent via REST or gRPC requests, the size of the request payload can grow quickly with the resolution of the images. This is why it is good practice to send compressed image bytes (for example, a base64-encoded JPEG) and let the model decode them, rather than sending raw pixel arrays.
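
As a rough illustration, the sketch below compares the size of a JSON payload that holds one number per pixel value with one that holds the same bytes as a single base64 string, for a 224x224 RGB image. It uses synthetic pixel data and only the standard library. The outer `instances` list is an assumption about the JSON request shape, while the `string_input` key and the `{"b64": ...}` wrapper mirror the request built later in this notebook. A real JPEG would typically be smaller still, since this sketch encodes uncompressed bytes.

```python
import base64
import json

HEIGHT, WIDTH, CHANNELS = 224, 224, 3

# Synthetic, uncompressed image bytes (a stand-in for real image data).
raw = bytes((i * 7) % 256 for i in range(HEIGHT * WIDTH * CHANNELS))

# Option 1: send every pixel value as a JSON number.
pixels_payload = json.dumps({"instances": [{"pixel_values": list(raw)}]})

# Option 2: send the bytes as one base64 string.
encoded = base64.b64encode(raw).decode("utf-8")
b64_payload = json.dumps({"instances": [{"string_input": {"b64": encoded}}]})

print(len(pixels_payload), len(b64_payload))
```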

## Deployment on Vertex AI

[This resource](https://cloud.google.com/vertex-ai/docs/general/general-concepts) introduces the Vertex AI concepts used below.

In [None]:
from google.cloud.aiplatform import gapic as aip

In [None]:
# Deployment hardware
DEPLOY_COMPUTE = "n1-standard-8"
DEPLOY_GPU = aip.AcceleratorType.NVIDIA_TESLA_T4
# Replace with your GCP project ID.
PROJECT_ID = "GCP-PROJECT-ID"

In [None]:
# Initialize clients.
API_ENDPOINT = f"{REGION}-aiplatform.googleapis.com"
PARENT = f"projects/{PROJECT_ID}/locations/{REGION}"

client_options = {"api_endpoint": API_ENDPOINT}
model_service_client = aip.ModelServiceClient(client_options=client_options)
endpoint_service_client = aip.EndpointServiceClient(client_options=client_options)
prediction_service_client = aip.PredictionServiceClient(client_options=client_options)

In [None]:
# Upload the model to Vertex AI. 
tf28_gpu_model_dict = {
    "display_name": "ViT Base TF2.8 GPU model",
    "artifact_uri": f"{GCS_BUCKET}/{LOCAL_MODEL_DIR}",
    "container_spec": {
        "image_uri": "us-docker.pkg.dev/vertex-ai/prediction/tf2-gpu.2-8:latest",
    },
}
tf28_gpu_model = (
    model_service_client.upload_model(parent=PARENT, model=tf28_gpu_model_dict)
    .result(timeout=180)
    .model
)
tf28_gpu_model

'projects/29880397572/locations/us-central1/models/7235960789184544768'
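
The upload call returns a fully qualified resource name. If you need the individual parts, for example the numeric model ID, a small helper such as the hypothetical `parse_resource_name` below can split it. This is only a sketch, and it assumes the `projects/.../locations/.../<collection>/<id>` layout seen in the output above.

```python
def parse_resource_name(name):
    """Split 'projects/P/locations/L/models/M' into a dict of its parts."""
    parts = name.split("/")
    if len(parts) % 2 != 0:
        raise ValueError(f"Unexpected resource name: {name!r}")
    # Even positions are collection names, odd positions are their IDs.
    return dict(zip(parts[0::2], parts[1::2]))


info = parse_resource_name(
    "projects/29880397572/locations/us-central1/models/7235960789184544768"
)
print(info["models"])
```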

In [None]:
# Create an Endpoint for the model.
tf28_gpu_endpoint_dict = {
    "display_name": "ViT Base TF2.8 GPU endpoint",
}
tf28_gpu_endpoint = (
    endpoint_service_client.create_endpoint(
        parent=PARENT, endpoint=tf28_gpu_endpoint_dict
    )
    .result(timeout=300)
    .name
)
tf28_gpu_endpoint

'projects/29880397572/locations/us-central1/endpoints/7116769330687115264'

In [None]:
# Deploy the model to the Endpoint.
tf28_gpu_deployed_model_dict = {
    "model": tf28_gpu_model,
    "display_name": "ViT Base TF2.8 GPU deployed model",
    "dedicated_resources": {
        "min_replica_count": 1,
        "max_replica_count": 1,
        "machine_spec": {
            "machine_type": DEPLOY_COMPUTE,
            "accelerator_type": DEPLOY_GPU,
            "accelerator_count": 1,
        },
    },
}

tf28_gpu_deployed_model = endpoint_service_client.deploy_model(
    endpoint=tf28_gpu_endpoint,
    deployed_model=tf28_gpu_deployed_model_dict,
    traffic_split={"0": 100},
).result()
tf28_gpu_deployed_model

deployed_model {
  id: "5163311002082607104"
}
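
In the `deploy_model` call above, `traffic_split` maps deployed-model IDs to traffic percentages, and the key `"0"` stands for the model being deployed in that request. The percentages must add up to 100. The hypothetical helper below sketches that check in plain Python; it is not part of the Vertex AI SDK, and the second model ID in the usage example is made up.

```python
def validate_traffic_split(traffic_split):
    """Raise ValueError unless the percentages are valid and sum to 100."""
    for model_id, percent in traffic_split.items():
        if not isinstance(percent, int) or not 0 <= percent <= 100:
            raise ValueError(f"Invalid percentage for {model_id!r}: {percent!r}")
    total = sum(traffic_split.values())
    if total != 100:
        raise ValueError(f"Traffic percentages sum to {total}, expected 100")
    return traffic_split


# "0" refers to the model being deployed; the other ID is made up.
validate_traffic_split({"0": 100})
validate_traffic_split({"0": 20, "1234567890": 80})
```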

## Make a prediction request

In [None]:
# Generate sample data. 
import base64

image_path = tf.keras.utils.get_file(
    "image.jpg", "http://images.cocodataset.org/val2017/000000039769.jpg"
)
# Use a descriptive name to avoid shadowing the built-in `bytes`.
image_bytes = tf.io.read_file(image_path)
b64str = base64.b64encode(image_bytes.numpy()).decode("utf-8")

In [None]:
# Look up the input key name of the serving signature.
pushed_model_location = os.path.join(GCS_BUCKET, LOCAL_MODEL_DIR)
loaded = tf.saved_model.load(pushed_model_location)
serving_input = list(
    loaded.signatures["serving_default"].structured_input_signature[1].keys()
)[0]
print("Serving function input:", serving_input)

Serving function input: string_input


In [None]:
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value


def predict_image(image, endpoint, serving_input):
    # The format of each instance should conform to the deployed model's prediction input schema.
    instances_list = [{serving_input: {"b64": image}}]
    instances = [json_format.ParseDict(s, Value()) for s in instances_list]

    print(
        prediction_service_client.predict(
            endpoint=endpoint,
            instances=instances,
        )
    )


predict_image(b64str, tf28_gpu_endpoint, serving_input)

predictions {
  struct_value {
    fields {
      key: "confidence"
      value {
        number_value: 0.896659553
      }
    }
    fields {
      key: "label"
      value {
        string_value: "Egyptian cat"
      }
    }
  }
}
deployed_model_id: "5163311002082607104"
model: "projects/29880397572/locations/us-central1/models/7235960789184544768"
model_display_name: "ViT Base TF2.8 GPU model"



## Cleaning up resources

In [None]:
def cleanup(endpoint, model_name, deployed_model_id):
    response = endpoint_service_client.undeploy_model(
        endpoint=endpoint, deployed_model_id=deployed_model_id
    )
    print("running undeploy_model operation:", response.operation.name)
    print(response.result())

    response = endpoint_service_client.delete_endpoint(name=endpoint)
    print("running delete_endpoint operation:", response.operation.name)
    print(response.result())

    response = model_service_client.delete_model(name=model_name)
    print("running delete_model operation:", response.operation.name)
    print(response.result())


cleanup(tf28_gpu_endpoint, tf28_gpu_model, tf28_gpu_deployed_model.deployed_model.id)

running undeploy_model operation: projects/29880397572/locations/us-central1/endpoints/7116769330687115264/operations/6837774371172384768

running delete_endpoint operation: projects/29880397572/locations/us-central1/operations/7182299742666227712

running delete_model operation: projects/29880397572/locations/us-central1/operations/1269073431928766464



In [None]:
!gsutil rm -r $GCS_BUCKET

Removing gs://hf-tf-vision/vit/#1658034189039614...
Removing gs://hf-tf-vision/vit/assets/#1658034196731689...                      
Removing gs://hf-tf-vision/vit/saved_model.pb#1658034197598270...               
Removing gs://hf-tf-vision/vit/variables/#1658034189325867...                   
/ [4 objects]                                                                   
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m rm ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Removing gs://hf-tf-vision/vit/variables/variables.data-00000-of-00001#1658034195624888...
Removing gs://hf-tf-vision/vit/variables/variables.index#1658034195904828...    
/ [6 objects]                                                                   
Operation completed over 6 objects.                                              
Removing gs://hf-tf-vision/...
