1-alphafold-quick-start.ipynb (404 lines of code) (raw):
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Quick Start: AlphaFold Inference with Vertex Pipelines\n",
"\n",
"This quick start notebook demonstrates how to configure and run the inference pipeline using a monomer protein."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install and import required packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%cd /home/jupyter/vertex-ai-alphafold-inference-pipeline/src\n",
"%pip install .\n",
"%cd /home/jupyter/vertex-ai-alphafold-inference-pipeline\n",
"%pip install -U -r requirements.txt"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%load_ext autoreload\n",
"%autoreload 2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from google.cloud import aiplatform as vertex_ai\n",
"from kfp.v2 import compiler\n",
"\n",
"from src.utils import compile_utils\n",
"from src.utils import fasta_utils"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configure environment settings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Change the values of the following parameters to reflect your environment.\n",
"\n",
"- `PROJECT_ID` - Project ID of your environment\n",
"- `ZONE`- GCP Zone where your resources will be deployed and are located.\n",
"- `BUCKET_NAME` - GCS bucket to use for Vertex AI staging. Must be in the same region of ZONE.\n",
"- `FILESTORE_ID` - Instance ID of your Filestore instance\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"PROJECT_ID = '<YOUR PROJECT ID>' # Change to your project ID\n",
"ZONE = '<YOUR ZONE>' # Change to your zone (example: us-central1-c)\n",
"BUCKET_NAME = '<YOUR BUCKET NAME>' # Change to your bucket name\n",
"FILESTORE_ID = '<YOUR FILESTORE ID>' # Change to your Filestore ID\n",
"AR_REPO_NAME = '<YOUR ARTIFACT REGISTRY REPOSITORY NAME>' # Change to your Artifact Registry repository name"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"REGION = '-'.join(ZONE.split(sep='-')[:-1])\n",
"FILESTORE_IP, FILESTORE_NETWORK = compile_utils.get_filestore_info(\n",
" project_id=PROJECT_ID, instance_id=FILESTORE_ID, location=ZONE)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you set up the sandbox environment using the provided Terraform configuration you do not need to change the below settings. Otherwise make sure that they are consistent with your environment.\n",
"\n",
"- `FILESTORE_SHARE` - Filestore share with AlphaFold reference databases\n",
"- `FILESTORE_MOUNT_PATH` - Mount path for Filestore fileshare\n",
"- `MODEL_PARAMS` - GCS location of AlphaFold model parameters. The pipelines are configured to retrieve the parameters from the `<MODEL_PARAMS>/params` folder.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"FILESTORE_SHARE = '/datasets'\n",
"FILESTORE_MOUNT_PATH = '/mnt/nfs/alphafold'\n",
"MODEL_PARAMS = f'gs://{BUCKET_NAME}'\n",
"IMAGE_URI = f'{REGION}-docker.pkg.dev/{PROJECT_ID}/{AR_REPO_NAME}/alphafold-components'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configure and run the Inference pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are two types of parameters that can be used to customize Vertex AI Pipelines: **compile time and runtime**. \n",
" - The compile time parameters must be set before compiling the pipeline code. These parameters are used to control settings like CPU/GPU configuration of compute nodes and the Filestore instance settings.\n",
" - The runtime parameters can be supplied when starting a pipeline run. They include a sequence to fold, model presets (monomer or multimer), the maximum date for template searches and more.\n",
"\n",
"The pipelines have been designed to retrieve compile time parameters from environment variables. This makes it easy to integrate a pipeline compilation step with CI/CD systems.\n",
"\n",
"By default, the pipeline uses a `c2-standard-16` node to run the feature engineering step and `g2-standard-12` nodes with NVIDIA L4 GPUs to run prediction and relaxation. For now, you will use the default settings. This hardware configuration is optimal for folding smaller proteins, roughly 1000 residues or fewer. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Set compile time parameters\n",
"\n",
"At minimum you have to configure:\n",
"- the settings of your Filestore instance that hosts genetic databases, \n",
"- the URI of the docker image that packages custom KFP components, and \n",
"- the GCS location of AlphaFold parameters."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"os.environ['ALPHAFOLD_COMPONENTS_IMAGE'] = IMAGE_URI\n",
"os.environ['NFS_SERVER'] = FILESTORE_IP\n",
"os.environ['NFS_PATH'] = FILESTORE_SHARE\n",
"os.environ['NETWORK'] = FILESTORE_NETWORK\n",
"os.environ['MODEL_PARAMS_GCS_LOCATION'] = MODEL_PARAMS"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you are working with larger proteins that demand GPUs with more memory/processing power, you can change the default settings and reconfigure the pipeline to use nodes with, for example, NVIDIA A100 GPU for prediction and relaxation.\n",
"\n",
"If that's the case, please update the following cell and redefine the default values (remember that the default is set to L4 GPUs).\n",
"\n",
"To review how to properlly set these parameters, please refer to the following documentation: \n",
"https://cloud.google.com/vertex-ai/docs/pipelines/machine-types"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# # Instance (VM) configuration to run the data pipeline\n",
"os.environ['DATA_PIPELINE_MACHINE_TYPE'] = 'c2-standard-16'\n",
"\n",
"# Instance (VM) configuration to run model prediction\n",
"os.environ['PREDICT_MACHINE_TYPE'] = 'g2-standard-12' \n",
"os.environ['PREDICT_ACCELERATOR_TYPE'] = 'NVIDIA_L4' \n",
"os.environ['PREDICT_ACCELERATOR_COUNT'] = '1' \n",
"\n",
"# Instance (VM) configuration to run protein relaxation\n",
"os.environ['RELAX_MACHINE_TYPE'] = 'g2-standard-12' \n",
"os.environ['RELAX_ACCELERATOR_TYPE'] = 'NVIDIA_L4'\n",
"os.environ['RELAX_ACCELERATOR_COUNT'] = '1' \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Set the maximum number of prediction jobs (using GPU resources) that can be run in parallel. \n",
"For multimers, if you have the necessary GPU quota, we recommend increasing this value to 25."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"os.environ['PARALLELISM'] = '5'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Compile the pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from src.pipelines.alphafold_inference_pipeline import alphafold_inference_pipeline as pipeline\n",
"\n",
"pipeline_name = 'universal-pipeline'\n",
"compiler.Compiler().compile(\n",
" pipeline_func=pipeline,\n",
" package_path=f'{pipeline_name}.json')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure runtime parameters\n",
"\n",
"At minimum you need to configure a GCS location of your sequence, the maximum date for template searches and a project and region where to run the pipeline. With the default settings, the pipeline will run monomer inference using the small version of BFD.\n",
"\n",
"**Note about multimer sequences**: When processing multimer sequences, the `num_multimer_predictions_per_model` parameter controls how many predictions are run for each model. The default value has been set to 5, which is the same as in the [run_alphafold.py](https://github.com/deepmind/alphafold/blob/main/run_alphafold.py) script.\n",
"\n",
"#### Copy the sample sequence to a GCS location\n",
"\n",
"You can find a few sample sequences in the `sequences` folder."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sequence = 'T1050.fasta' # Copy your FASTA file to the 'sequences' folder and reference its name here.\n",
"\n",
"is_monomer, sequences = fasta_utils.validate_fasta_file(\n",
" os.path.join('sequences', sequence))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"gcs_sequence_path = f'gs://{BUCKET_NAME}/fasta/{sequence}'\n",
"! gsutil cp sequences/{sequence} {gcs_sequence_path}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Define Alphafold training parameters"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"max_template_date = '2030-01-01'\n",
"use_small_bfd = True # 'True' will only use a portion of the BDF database. Set to 'False' if you want to use the full BFD database.\n",
"num_multimer_predictions_per_model = 5 # Number of predictions per model for multimer model preset\n",
"is_run_relax = 'relax' # Wheather or not to run relaxation process. If you don't need to run the relaxation step, pass an empty string ''."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"params = {\n",
" 'sequence_path': gcs_sequence_path,\n",
" 'max_template_date': max_template_date,\n",
" 'model_preset': 'monomer' if is_monomer else 'multimer',\n",
" 'project': PROJECT_ID,\n",
" 'region': REGION,\n",
" 'use_small_bfd': use_small_bfd,\n",
" 'num_multimer_predictions_per_model': num_multimer_predictions_per_model,\n",
" 'is_run_relax': is_run_relax\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Submit a pipeline run"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We recommend annotating pipeline runs with at least two labels. The first label groups multiple pipeline runs into a single experiment. The second label identifies a given run within the experiment. Annotating with labels helps to discover and analyze pipeline runs in large scale settings. The third notebook that demonstrates how to analyze pipeline runs depends on the labels. \n",
"\n",
"You will be able to monitor the run using the link printed by executing the cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"vertex_ai.init(\n",
" project=PROJECT_ID,\n",
" location=REGION,\n",
" staging_bucket=f'gs://{BUCKET_NAME}/staging'\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"experiment_id = 'T1050-experiment'\n",
"labels = {'experiment_id': experiment_id.lower(), 'sequence_id': sequence.split(sep='.')[0].lower()}\n",
"\n",
"pipeline_job = vertex_ai.PipelineJob(\n",
" display_name=pipeline_name,\n",
" template_path=f'{pipeline_name}.json',\n",
" pipeline_root=f'gs://{BUCKET_NAME}/pipeline_runs/{pipeline_name}',\n",
" parameter_values=params,\n",
" enable_caching=True,\n",
" labels=labels\n",
")\n",
"\n",
"pipeline_job.run(sync=False)\n",
"pipeline_job.wait_for_resource_creation()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check the state of the pipeline\n",
"pipeline_job.state"
]
}
],
"metadata": {
"environment": {
"kernel": "python3",
"name": "common-cpu.m92",
"type": "gcloud",
"uri": "gcr.io/deeplearning-platform-release/base-cpu:m92"
},
"kernelspec": {
"display_name": "Python 3.8.15 ('alphafold')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.15"
},
"vscode": {
"interpreter": {
"hash": "5d4ba5cd9e03f1df519c5604fda4f32e7a6119f61e07e901bf30b393ad1ef277"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}