# 1.1 Quick Start: AlphaFold Inference with Vertex Pipelines

This quick start notebook demonstrates how to configure and run the inference pipeline using a multimer protein.

## Install and import required packages

In [None]:
%cd /home/jupyter/vertex-ai-alphafold-inference-pipeline/src
%pip install .
%cd /home/jupyter/vertex-ai-alphafold-inference-pipeline
%pip install -U -r requirements.txt

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import os
from google.cloud import aiplatform as vertex_ai
from kfp.v2 import compiler

from src.utils import compile_utils
from src.utils import fasta_utils

## Configure environment settings

Change the values of the following parameters to reflect your environment.

- `PROJECT_ID` - Project ID of your environment
- `ZONE` - GCP zone where your resources are located and will be deployed.
- `BUCKET_NAME` - GCS bucket to use for Vertex AI staging. Must be in the same region as `ZONE`.
- `FILESTORE_ID` - Instance ID of your Filestore instance


In [None]:
PROJECT_ID = '<YOUR PROJECT ID>'  # Change to your project ID
ZONE = '<YOUR ZONE>'   # Change to your zone (example: us-central1-c)
BUCKET_NAME = '<YOUR BUCKET NAME>'  # Change to your bucket name
FILESTORE_ID = '<YOUR FILESTORE ID>' # Change to your Filestore ID

In [None]:
REGION = '-'.join(ZONE.split(sep='-')[:-1])
FILESTORE_IP, FILESTORE_NETWORK = compile_utils.get_filestore_info(
    project_id=PROJECT_ID, instance_id=FILESTORE_ID, location=ZONE)
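
As a quick illustration of the zone-to-region derivation in the cell above, the same expression can be wrapped in a small helper. The name `zone_to_region` is hypothetical and is not part of the repository.

```python
def zone_to_region(zone: str) -> str:
    """Drop the trailing zone suffix (for example '-c') to obtain the region."""
    return '-'.join(zone.split(sep='-')[:-1])


print(zone_to_region('us-central1-c'))  # us-central1
```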

If you set up the sandbox environment using the provided Terraform configuration, you do not need to change the settings below. Otherwise, make sure that they are consistent with your environment.

- `FILESTORE_SHARE` - Filestore share with AlphaFold reference databases
- `FILESTORE_MOUNT_PATH` - Mount path for Filestore fileshare
- `MODEL_PARAMS` - GCS location of AlphaFold model parameters. The pipelines are configured to retrieve the parameters from the `<MODEL_PARAMS>/params` folder.


In [None]:
FILESTORE_SHARE = '/datasets'
FILESTORE_MOUNT_PATH = '/mnt/nfs/alphafold'
MODEL_PARAMS = f'gs://{BUCKET_NAME}'
IMAGE_URI = f'{REGION}-docker.pkg.dev/{PROJECT_ID}/alpha-kfp/alphafold-components'

## Configure and run the Inference pipeline

There are two types of parameters that can be used to customize Vertex AI Pipelines: **compile time and runtime**.  
 - The compile time parameters must be set before compiling the pipeline code. These parameters are used to control settings like CPU/GPU configuration of compute nodes and the Filestore instance settings.
 - The runtime parameters can be supplied when starting a pipeline run. They include a sequence to fold, model presets (monomer or multimer), the maximum date for template searches and more.

The pipelines have been designed to retrieve compile time parameters from environment variables. This makes it easy to integrate a pipeline compilation step with CI/CD systems.
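
The pattern of reading compile time parameters from environment variables can be sketched as follows. This is a minimal illustration, not the repository's actual code: the helper `read_setting` is hypothetical, and the IP address is a placeholder value.

```python
import os
from typing import Mapping, Optional


def read_setting(name: str,
                 default: Optional[str] = None,
                 env: Mapping[str, str] = os.environ) -> str:
    """Return a compile time setting, falling back to a default if one is given."""
    value = env.get(name, default)
    if value is None:
        raise KeyError(f'Required environment variable {name} is not set')
    return value


# Example with an explicit mapping instead of the real environment.
example_env = {'NFS_SERVER': '10.0.0.2', 'NFS_PATH': '/datasets'}
nfs_server = read_setting('NFS_SERVER', env=example_env)
# Not set in example_env, so the default is used.
predict_machine = read_setting('PREDICT_MACHINE_TYPE', 'g2-standard-12', env=example_env)
```

A CI/CD job could export the same variables before invoking the compiler, which keeps the pipeline source free of environment-specific values.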

By default, the pipeline uses a `c2-standard-16` node to run the feature engineering step and `g2-standard-12` nodes with NVIDIA L4 GPUs to run prediction and relaxation. This hardware configuration is optimal for folding smaller proteins, roughly 1000 residues or fewer. For now, you will use the default settings.

### Set compile time parameters

At a minimum, you must configure:
- the settings of the Filestore instance that hosts the genetic databases,
- the URI of the Docker image that packages the custom KFP components, and
- the GCS location of the AlphaFold parameters.

In [None]:
os.environ['ALPHAFOLD_COMPONENTS_IMAGE'] = IMAGE_URI
os.environ['NFS_SERVER'] = FILESTORE_IP
os.environ['NFS_PATH'] = FILESTORE_SHARE
os.environ['NETWORK'] = FILESTORE_NETWORK
os.environ['MODEL_PARAMS_GCS_LOCATION'] = MODEL_PARAMS

If you are working with larger proteins that demand GPUs with more memory or processing power, you can change the default settings and reconfigure the pipeline to use nodes with, for example, NVIDIA A100 GPUs for prediction and relaxation.

In that case, uncomment the following cell and redefine the default values (remember that the default is set to L4 GPUs).

To review how to set these parameters properly, refer to the following documentation:  
https://cloud.google.com/vertex-ai/docs/pipelines/machine-types

In [None]:
# # Instance (VM) configuration to run the data pipeline
# os.environ['DATA_PIPELINE_MACHINE_TYPE'] = 'c2-standard-16'

# # Instance (VM) configuration to run model prediction
# os.environ['PREDICT_MACHINE_TYPE'] = 'g2-standard-12' 
# os.environ['PREDICT_ACCELERATOR_TYPE'] = 'NVIDIA_L4' 

# # Instance (VM) configuration to run protein relaxation
# os.environ['RELAX_MACHINE_TYPE'] = 'g2-standard-12' 
# os.environ['RELAX_ACCELERATOR_TYPE'] = 'NVIDIA_L4'
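
As a sketch of what an A100 override might look like, the snippet below pairs the `a2-highgpu-1g` machine type with the `NVIDIA_TESLA_A100` accelerator type, which is how Vertex AI names that combination. The helper `apply_overrides` is hypothetical. Whether the pipeline needs further settings (for example an accelerator count) is not covered here, and A100 availability and quota in your region are assumptions you should verify.

```python
import os
from typing import Dict, MutableMapping

# Environment variable names come from the cell above; the values are an
# assumed A100 configuration and should be checked against your quota.
A100_OVERRIDES: Dict[str, str] = {
    'PREDICT_MACHINE_TYPE': 'a2-highgpu-1g',
    'PREDICT_ACCELERATOR_TYPE': 'NVIDIA_TESLA_A100',
    'RELAX_MACHINE_TYPE': 'a2-highgpu-1g',
    'RELAX_ACCELERATOR_TYPE': 'NVIDIA_TESLA_A100',
}


def apply_overrides(env: MutableMapping[str, str],
                    overrides: Dict[str, str]) -> MutableMapping[str, str]:
    """Copy the override values into the given environment mapping."""
    env.update(overrides)
    return env


# To apply to the real environment before compiling, uncomment:
# apply_overrides(os.environ, A100_OVERRIDES)
```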

Set the maximum number of prediction jobs (using GPU resources) that can be run in parallel.  
For multimers, if you have the necessary GPU quota, we recommend increasing this value to 25.

In [None]:
os.environ['PARALLELISM'] = '5'

### Compile the pipeline

In [None]:
from src.pipelines.alphafold_inference_pipeline import alphafold_inference_pipeline as pipeline

pipeline_name = 'universal-pipeline'
compiler.Compiler().compile(
    pipeline_func=pipeline,
    package_path=f'{pipeline_name}.json')

### Configure runtime parameters

At a minimum, you need to configure the GCS location of your sequence, the maximum date for template searches, and the project and region in which to run the pipeline. With the default settings, the pipeline will run monomer inference using the small version of BFD.

**Note about multimer sequences**: When processing multimer sequences, the `num_multimer_predictions_per_model` parameter controls how many predictions are run for each model. The default value has been set to 5, which is the same as in the [run_alphafold.py](https://github.com/deepmind/alphafold/blob/main/run_alphafold.py) script.
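
To see how this parameter relates to the `PARALLELISM` setting, the arithmetic can be sketched as below. It assumes five model variants per preset and one prediction per model for monomers, as in `run_alphafold.py`; the function name is hypothetical.

```python
NUM_MODELS = 5  # Assumption: five model variants per preset.


def total_predictions(is_monomer: bool,
                      num_multimer_predictions_per_model: int = 5) -> int:
    """Estimate how many prediction jobs a single sequence produces."""
    per_model = 1 if is_monomer else num_multimer_predictions_per_model
    return NUM_MODELS * per_model


print(total_predictions(is_monomer=True))   # 5
print(total_predictions(is_monomer=False))  # 25
```

Under these assumptions, a multimer run produces 25 prediction jobs, so a parallelism of 25 would let all of them run at once if GPU quota allows.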

#### Copy the sample sequence to a GCS location

You can find a few sample sequences in the `sequences` folder.

In [None]:
sequence = '1S78.fasta' # Copy your FASTA file to the 'sequences' folder and reference its name here.

is_monomer, sequences = fasta_utils.validate_fasta_file(
    os.path.join('sequences', sequence))
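
The monomer versus multimer decision generally comes down to how many sequences the FASTA file contains. The following is a minimal sketch of that idea, not the implementation of `fasta_utils.validate_fasta_file`; both helper names and the sample sequences are made up for illustration.

```python
from typing import List, Tuple


def parse_fasta(fasta_text: str) -> List[Tuple[str, str]]:
    """Return (description, sequence) pairs from FASTA-formatted text."""
    records = []
    description, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # A new header closes the previous record, if any.
            if description is not None:
                records.append((description, ''.join(chunks)))
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        records.append((description, ''.join(chunks)))
    return records


def looks_like_monomer(fasta_text: str) -> bool:
    """Treat a file with exactly one sequence as a monomer."""
    return len(parse_fasta(fasta_text)) == 1


sample = '>chain_A\nMKTAYIAK\nQRQISFVK\n>chain_B\nGSHMLEDP\n'
print(looks_like_monomer(sample))  # False
```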

In [None]:
gcs_sequence_path = f'gs://{BUCKET_NAME}/fasta/{sequence}'
! gsutil cp sequences/{sequence} {gcs_sequence_path}
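
If you prefer to stay in Python rather than shell out to `gsutil`, the copy can be sketched with the `google-cloud-storage` client. This assumes that library is installed and that your credentials can write to the bucket; `split_gcs_uri` and `upload_file` are hypothetical helper names.

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path/to/object' into (bucket, object path)."""
    if not uri.startswith('gs://'):
        raise ValueError(f'Not a GCS URI: {uri}')
    bucket, _, blob_path = uri[len('gs://'):].partition('/')
    if not bucket or not blob_path:
        raise ValueError(f'GCS URI must include a bucket and an object path: {uri}')
    return bucket, blob_path


def upload_file(local_path: str, gcs_uri: str, project: str) -> None:
    """Upload a local file to the given GCS URI."""
    # Imported lazily so the parsing helper works without the library installed.
    from google.cloud import storage

    bucket_name, blob_path = split_gcs_uri(gcs_uri)
    client = storage.Client(project=project)
    client.bucket(bucket_name).blob(blob_path).upload_from_filename(local_path)


# Example call (requires access to the bucket):
# upload_file(f'sequences/{sequence}', gcs_sequence_path, PROJECT_ID)
```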

#### Define AlphaFold inference parameters

In [None]:
max_template_date = '2030-01-01'
use_small_bfd = True    # 'True' uses only a reduced portion of the BFD database. Set to 'False' to use the full BFD database.
num_multimer_predictions_per_model = 5  # Number of predictions per model for multimer model preset
is_run_relax = 'relax'   # Whether or not to run the relaxation process. To skip the relaxation step, pass an empty string ''.

In [None]:
params = {
    'sequence_path': gcs_sequence_path,
    'max_template_date': max_template_date,
    'model_preset': 'monomer' if is_monomer else 'multimer',
    'project': PROJECT_ID,
    'region': REGION,
    'use_small_bfd': use_small_bfd,
    'num_multimer_predictions_per_model': num_multimer_predictions_per_model,
    'is_run_relax': is_run_relax
}
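
Before submitting, it can help to catch malformed values locally. The check below is a hedged sketch: it assumes `max_template_date` follows the `YYYY-MM-DD` format used by `run_alphafold.py` and only accepts the preset and relaxation values this notebook passes. The function `check_runtime_params` is hypothetical.

```python
from datetime import datetime


def check_runtime_params(params: dict) -> None:
    """Raise ValueError if a runtime parameter has an unexpected value."""
    # strptime raises ValueError when the date does not match the format.
    datetime.strptime(params['max_template_date'], '%Y-%m-%d')
    if params['model_preset'] not in ('monomer', 'multimer'):
        raise ValueError(f"Unexpected model_preset: {params['model_preset']}")
    if params['is_run_relax'] not in ('relax', ''):
        raise ValueError("is_run_relax must be 'relax' or an empty string")


example_params = {
    'max_template_date': '2030-01-01',
    'model_preset': 'multimer',
    'is_run_relax': 'relax',
}
check_runtime_params(example_params)  # Passes silently.
```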

### Submit a pipeline run

We recommend annotating pipeline runs with at least two labels. The first label groups multiple pipeline runs into a single experiment. The second label identifies a given run within the experiment. Labels make it easier to discover and analyze pipeline runs in large-scale settings. The third notebook, which demonstrates how to analyze pipeline runs, relies on these labels.
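
Google Cloud label values are restricted to lowercase letters, digits, underscores, and dashes, with a maximum length of 63 characters. A small normalizer, sketched here under the hypothetical name `to_label_value`, can guard against file names or experiment names that break those rules.

```python
import re


def to_label_value(text: str) -> str:
    """Lowercase the text, replace disallowed characters, and cap the length."""
    cleaned = re.sub(r'[^a-z0-9_-]', '-', text.lower())
    return cleaned[:63]


print(to_label_value('1S78-multimer-experiment'))  # 1s78-multimer-experiment
print(to_label_value('1S78.fasta'))                # 1s78-fasta
```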

You will be able to monitor the run using the link printed by executing the cell.

In [None]:
vertex_ai.init(
    project=PROJECT_ID,
    location=REGION,
    staging_bucket=f'gs://{BUCKET_NAME}/staging'
)

In [None]:
experiment_id = '1S78-multimer-experiment'
labels = {'experiment_id': experiment_id.lower(), 'sequence_id': sequence.split(sep='.')[0].lower()}

pipeline_job = vertex_ai.PipelineJob(
    display_name=pipeline_name,
    template_path=f'{pipeline_name}.json',
    pipeline_root=f'gs://{BUCKET_NAME}/pipeline_runs/{pipeline_name}',
    parameter_values=params,
    enable_caching=True,
    labels=labels
)

pipeline_job.run(sync=False)
pipeline_job.wait_for_resource_creation()

In [None]:
# Check the state of the pipeline
pipeline_job.state
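
If you want the notebook to block until the run finishes, one option is to poll the state. The sketch below is generic and takes a callable that returns the state name. It assumes the Vertex AI `PipelineState` enum names shown in `TERMINAL_STATES`, and `wait_for_terminal_state` is a hypothetical helper.

```python
import time
from typing import Callable

# Assumed names of the terminal values of the Vertex AI PipelineState enum.
TERMINAL_STATES = {
    'PIPELINE_STATE_SUCCEEDED',
    'PIPELINE_STATE_FAILED',
    'PIPELINE_STATE_CANCELLED',
}


def wait_for_terminal_state(get_state_name: Callable[[], str],
                            poll_seconds: float = 60.0,
                            max_polls: int = 1000,
                            sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the run reaches a terminal state and return that state's name."""
    for _ in range(max_polls):
        name = get_state_name()
        if name in TERMINAL_STATES:
            return name
        sleep(poll_seconds)
    raise TimeoutError('Pipeline did not reach a terminal state in time')


# Example call against the job created above:
# final_state = wait_for_terminal_state(lambda: pipeline_job.state.name)
```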