notebooks/generative_ai/finetuning/translation

{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Copyright 2024 Google LLC\n", "#\n", "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Fine-tuning Gemini to translate multiple languages to English using Vertex AI, Dataproc and Apache Iceberg\n", "\n", "This notebooks demonstrates how to fine-tune Gemini to perform translation tasks from multiple languages (DE,ES,FR,IT,PL,PT,RU,SV,UK,ZH) to English. \n", "It follows the steps: \n", " - Reads the dataset from a GCS bucket\n", " - Save it in the Iceberg format using Dataproc Serverless\n", " - Use Vertex AI Supervised fine-tuning job to fine-tune Gemini with the dataset\n", " - Register the model in Vertex AI Model Registry\n", "\n", "<br>\n", "<table align=\"left\">\n", "\n", " <td>\n", " <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/ai-ml-recipes/blob/main/notebooks/generative_ai/finetuning/translation_finetuning.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Colab logo\"> Run in Colab\n", " </a>\n", " </td>\n", " <td>\n", " <a href=\"https://github.com/GoogleCloudPlatform/ai-ml-recipes/blob/main/notebooks/generative_ai/finetuning/translation_finetuning.ipynb\">\n", " <img src=\"https://cloud.google.com/ml-engine/images/github-logo-32px.png\" alt=\"GitHub logo\">\n", " View on GitHub\n", " </a>\n", " </td>\n", " <td>\n", " <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/ai-ml-recipes/main/notebooks/generative_ai/finetuning/translation_finetuning.ipynb\">\n", " <img src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" alt=\"Vertex AI logo\">\n", " Open in Vertex AI Workbench\n", " </a>\n", " </td>\n", " <td>\n", " <a href=\"https://console.cloud.google.com/bigquery/import?url=https://github.com/GoogleCloudPlatform/ai-ml-recipes/blob/main/notebooks/generative_ai/finetuning/translation_finetuning.ipynb\">\n", " <img src=\"https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTW1gvOovVlbZAIZylUtf5Iu8-693qS1w5NJw&s\" alt=\"BQ logo\" width=\"35\">\n", " Open in BQ Studio\n", " </a>\n", " </td>\n", "\n", "</table>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install google-spark-connect -q" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "from typing import List\n", "import pandas as pd\n", "import time\n", "import json\n", "\n", "from google.cloud import bigquery, storage\n", "from google.cloud.dataproc_v1 import Session, SparkConnectConfig\n", "from google.cloud.spark_connect import GoogleSparkSession\n", "from pyspark.sql.types import LongType, StringType, DoubleType\n", "\n", "type_mapping = {\n", " LongType: 'long',\n", " StringType: 'string',\n", " DoubleType: 'double'\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Config" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "PROJECT_ID = \"PROJECT_ID\"\n", "REGION = \"REGION\"\n", "RUNTIME_TEMPLATE = \"RUNTIME_TEMPLATE\"\n", "\n", "JSON_FILES_GCS_URI = \"gs://dataproc-metastore-public-binaries/wikipedia_translated_clusters/*\"\n", "BUCKET_NAME = \"BUCKET\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To run this notebook you must create a [Dataproc Serverless Runtime Template](https://cloud.google.com/dataproc-serverless/docs/quickstarts/jupyterlab-sessions) version 3.0" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Creating Spark session. It may take few minutes.\n", "Interactive Session Detail View: https://console.cloud.google.com/dataproc/interactive/us-east1/sc-20250416-190912-rf2ni8?project=dataproc-notebook-demo\n" ] } ], "source": [ "session_config = Session()\n", "session_config.spark_connect_session = SparkConnectConfig()\n", "session_config.session_template = f'projects/{PROJECT_ID}/locations/{REGION}/sessionTemplates/{RUNTIME_TEMPLATE}'\n", "\n", "spark = GoogleSparkSession.builder.projectId(PROJECT_ID).location(REGION).googleSessionConfig(session_config).getOrCreate()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Read input dataset" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [], "source": [ "raw_dataset = spark.read.json(JSON_FILES_GCS_URI)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Transform the dataset" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [], "source": [ "from pyspark.sql.functions import explode, array, lit, col, struct, desc, concat\n", "from pyspark.sql.types import StringType\n", "\n", "# Step 1: Get all column names from the DataFrame\n", "columns = raw_dataset.columns\n", "\n", "# Step 2: Create an array of structs with the column name and its content\n", "exploded_df = raw_dataset.select(\n", " explode(\n", " array([\n", " struct(\n", " lit(column).alias(\"topic\"),\n", " col(f\"`{column}`.title\").alias(\"title\"),\n", " col(f\"`{column}`.intro\").alias(\"intro\"),\n", " col(f\"`{column}`.translated_intro\").alias(\"translated_intro\")\n", " )\n", " for column in columns\n", " ])\n", " ).alias(\"exploded\")\n", ")\n", "\n", "# Step 3: Extract the fields from the struct and add the prompt column\n", "transformed_df = exploded_df.select(\n", " col(\"exploded.title\").alias(\"title\"),\n", " lit(\"Translate this intro to english: \").alias(\"prompt\"),\n", " col(\"exploded.intro\").alias(\"intro\"),\n", " col(\"exploded.translated_intro\").alias(\"translated_intro\")\n", ")\n", "\n", "# Step 4: Drop rows where any value is null\n", "transformed_df = transformed_df.dropna(how=\"any\")\n", "\n", "# Step 5: Sort by title in descending order (Z to A)\n", "transformed_df = transformed_df.orderBy(desc(\"title\"))" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- title: string (nullable = true)\n", " |-- prompt: string (nullable = false)\n", " |-- intro: string (nullable = true)\n", " |-- translated_intro: string (nullable = true)\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "I0000 00:00:1744841625.386354 1156645 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "+---------------------+---------------------------------+----------------------------------------------------------------------------+--------------------------------------------------+\n", "| title| prompt| intro| translated_intro|\n", "+---------------------+---------------------------------+----------------------------------------------------------------------------+--------------------------------------------------+\n", "| Ättestupa|Translate this intro to english: | Еттеступа (швед. Ättestupa) — обряд убивання ст...|Ättestupa (Swedish: Ättestupa) is a ritual of k...|\n", "| Ättestupa|Translate this intro to english: | Ättestupa (även ättestörta eller ättestapul) är...|Pedigree (also genitalia or pedigree) is a trad...|\n", "| Ättestupa|Translate this intro to english: | Аттеступа (швед. Ättestupa) — название ряда про...|Attestupa (Swedish Ättestupa) is the name of a ...|\n", "| Ättestupa|Translate this intro to english: | Ättestupa (término escandinavo para precipicio ...|Ättestupa (Scandinavian term for clan cliff) is...|\n", "| Ättestupa|Translate this intro to english: | Ättestupa est la pratique mythique du gérontici...|Ättestupa is the mythical practice of gerontici...|\n", "|Zumbo's Just Desserts|Translate this intro to english: |赞波的甜点大赛（英语：Zumbo's Just Desserts）是澳大利亚的七号电视网中关于...|Zumbo's Just Desserts (English: Zumbo's Just De...|\n", "| Zoë Kravitz|Translate this intro to english: | Zoë Isabella Kravitz (ur. 1 grudnia 1988 w Los ...|Zoë Isabella Kravitz (born December 1, 1988 in ...|\n", "| Zoë Kravitz|Translate this intro to english: | Zoë Isabella Kravitz, född 1 december 1988 i Lo...|Zoë Isabella Kravitz, born December 1, 1988 in ...|\n", "| Zoë Kravitz|Translate this intro to english: | Zoë Isabella Kravitz (Los Angeles, 1º dicembre ...|Zoë Isabella Kravitz (Los Angeles, December 1, ...|\n", "| Zoë Kravitz|Translate this intro to english: | Zoë Isabella Kravitz (* 1. Dezember 1988 in Los...|Zoë Isabella Kravitz (born December 1, 1988 in ...|\n", "| Zoë Kravitz|Translate this intro to english: | Зои Изабелла Кра́виц (англ. Zoë Isabella Kravit...|Zoë Isabella Kravitz (born December 1, 1988, Lo...|\n", "| Zoë Kravitz|Translate this intro to english: | Zoë Isabella Kravitz (Los Ángeles, California; ...|Zoë Isabella Kravitz (Los Angeles, California; ...|\n", "| Zoë Kravitz|Translate this intro to english: | Zoë Isabella Kravitz (Los Angeles, 1 de dezembr...|Zoë Isabella Kravitz (Los Angeles, December 1, ...|\n", "| Zoë Kravitz|Translate this intro to english: | 佐伊·伊莎贝拉·克拉维茨（英语：Zoë Isabella Kravitz，1988年12月1日...|Zoë Isabella Kravitz (English: Zoë Isabella Kra...|\n", "| Zoë Kravitz|Translate this intro to english: | Зої Ізабелла Кравітц (англ. Zoë Isabella Kravit...|Zoë Isabella Kravitz (born December 1, 1988 in ...|\n", "| Zoë Kravitz|Translate this intro to english: | Zoë Kravitz est une actrice, chanteuse et manne...|Zoë Kravitz is an American actress, singer and ...|\n", "| Zoya Akhtar|Translate this intro to english: | Zoya Akhtar (Mumbai, 14 ottobre 1972) è una reg...|Zoya Akhtar (Mumbai, 14 October 1972) is an Ind...|\n", "| Zoya Akhtar|Translate this intro to english: | Зоя Ахтар (англ. Zoya Akhtar; род. 9 января 197...|Zoya Akhtar (born January 9, 1974, Bombay) is a...|\n", "| Zoya Akhtar|Translate this intro to english: | Zoya Akhtar (Bombay, Maharashtra, 14 de octubre...|Zoya Akhtar (Bombay, Maharashtra, October 14, 1...|\n", "| Zoya Akhtar|Translate this intro to english: | Zoya Akhtar (en devanagari ज़ोया अख़्तर) est un...|Zoya Akhtar (in Devanagari ज़ोया अख़्तर) is an ...|\n", "| Zoroastrianism|Translate this intro to english: | Zaratusztrianizm (zaratustryzm, zoroastryzm, zo...|Zoroastrianism (Zoroastrianism, Zoroastrianism,...|\n", "| Zoroastrianism|Translate this intro to english: | Zoroastrism (även mazdaism efter namnet på gude...|Zoroastrianism (also Mazdaism after the name of...|\n", "| Zoroastrianism|Translate this intro to english: | Lo zoroastrismo (definito anche zoroastrianesim...|Zoroastrianism (also called Zoroastrianism or M...|\n", "| Zoroastrianism|Translate this intro to english: | Der Zoroastrismus bzw. Zarathustrismus (auch: M...|Zoroastrianism or Zoroastrianism (also: Mazdais...|\n", "| Zoroastrianism|Translate this intro to english: | Зороастри́зм (авест. vahvī- daēnā- māzdayasna- ...|Zoroastrianism (avest. Vahvī- daēnā- māzdayasna...|\n", "| Zoroastrianism|Translate this intro to english: | El zoroastrismo, por el nombre de su fundador e...|Zoroastrianism, by the name of its founder and ...|\n", "| Zoroastrianism|Translate this intro to english: | 祆教（波斯语：زرتشتی‌گری‎；英语：Zoroastrianism），音译为琐罗亚斯德教...|Zoroastrianism (Persian: زرتشتی‌گری‎; English: ...|\n", "| Zoroastrianism|Translate this intro to english: | O zoroastrismo, masdaísmo, masdeísmo/mazdeísmo ...|Zoroastrianism, Masdaism, Masdeism / Mazdeism o...|\n", "| Zoroastrianism|Translate this intro to english: | Зороастри́зм або Маздеїзм, іноді Заратустріанст...|Zoroastrianism or Mazdeism, sometimes Zarathust...|\n", "| Zoroastrianism|Translate this intro to english: | Le zoroastrisme est une religion qui tire son n...|Zoroastrianism is a religion that takes its nam...|\n", "| Zooey Deschanel|Translate this intro to english: | Zooey Claire Deschanel (ur. 17 stycznia 1980 w ...|Zooey Claire Deschanel (born January 17, 1980 i...|\n", "| Zooey Deschanel|Translate this intro to english: | Zooey Claire Deschanel, född 17 januari 1980 i ...|Zooey Claire Deschanel, born January 17, 1980 i...|\n", "| Zooey Deschanel|Translate this intro to english: | Zooey Claire Deschanel (AFI: [ˈzoʊ.iː klɛə(ɹ) d...|Zooey Claire Deschanel (AFI: [ˈzoʊ.iː klɛə (ɹ) ...|\n", "| Zooey Deschanel|Translate this intro to english: | Zooey Claire Deschanel (* 17. Januar 1980 in Lo...|Zooey Claire Deschanel (born January 17, 1980 i...|\n", "| Zooey Deschanel|Translate this intro to english: | Зо́уи Клэр Дешане́ль (англ. Zooey Claire Descha...|Zooey Claire Deschanel (born January 17, 1980, ...|\n", "| Zooey Deschanel|Translate this intro to english: | Zooey Claire Deschanel (Los Ángeles, California...|Zooey Claire Deschanel (born January 17, 1980) ...|\n", "| Zooey Deschanel|Translate this intro to english: | 佐伊·克莱儿·丹斯切尔（英语：Zooey Claire Deschanel，1980年1月17...|Zoey Claire Deschanel (English: Zooey Claire De...|\n", "| Zooey Deschanel|Translate this intro to english: | Zooey Claire Deschanel (Los Angeles, Califórnia...|Zooey Claire Deschanel (Los Angeles, Calif., Ja...|\n", "| Zooey Deschanel|Translate this intro to english: | Зоуї Клер Дешанель (англ. Zooey Claire Deschane...|Zooey Claire Deschanel (born 17 January 1980) i...|\n", "| Zooey Deschanel|Translate this intro to english: | Zooey Deschanel [ˈzoʊi ˌdeɪʃəˈnɛl] est une actr...|Zooey Deschanel [ˈzoʊi ˌdeɪʃəˈnɛl] is an Americ...|\n", "+---------------------+---------------------------------+----------------------------------------------------------------------------+--------------------------------------------------+\n", "only showing top 40 rows\n", "\n" ] } ], "source": [ "# Display the schema of the resulting DataFrame\n", "transformed_df.printSchema()\n", "\n", "# Show a sample of the resulting DataFrame\n", "transformed_df.show(40, 50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create Apache Iceberg table in catalog" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ICEBERG_CATALOG = \"CATALOG\"\n", "ICEBERG_SCHEMA = \"default\"\n", "ICERBERG_TABLE = \"wikipedia_translated_docs\"" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'(title string, prompt string, intro string, translated_intro string)'" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset_schema = f\"({', '.join([f'{field.name} {type_mapping.get(type(field.dataType), str(field.dataType))}' for field in transformed_df.schema.fields])})\"\n", "dataset_schema" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "transformed_df.write.format(\"iceberg\").save(f\"{ICEBERG_CATALOG}.{ICEBERG_SCHEMA}.{ICERBERG_TABLE}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Read Apache Iceberg table" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------------+--------------------+----------------------------------+--------------------+\n", "| title| prompt| intro| translated_intro|\n", "+--------------------+--------------------+----------------------------------+--------------------+\n", "| Ättestupa|Translate this in...| Еттеступа (швед. ...|Ättestupa (Swedis...|\n", "| Ättestupa|Translate this in...| Ättestupa est la ...|Ättestupa is the ...|\n", "| Ättestupa|Translate this in...| Ättestupa (även ä...|Pedigree (also ge...|\n", "| Ättestupa|Translate this in...| Аттеступа (швед. ...|Attestupa (Swedis...|\n", "| Ättestupa|Translate this in...| Ättestupa (términ...|Ättestupa (Scandi...|\n", "|Zumbo's Just Dess...|Translate this in...| 赞波的甜点大赛（英语：Zumbo'...|Zumbo's Just Dess...|\n", "| Zoë Kravitz|Translate this in...| Зої Ізабелла Крав...|Zoë Isabella Krav...|\n", "| Zoë Kravitz|Translate this in...| Zoë Isabella Krav...|Zoë Isabella Krav...|\n", "| Zoë Kravitz|Translate this in...| Zoë Kravitz est u...|Zoë Kravitz is an...|\n", "| Zoë Kravitz|Translate this in...| Zoë Isabella Krav...|Zoë Isabella Krav...|\n", "| Zoë Kravitz|Translate this in...| Zoë Isabella Krav...|Zoë Isabella Krav...|\n", "| Zoë Kravitz|Translate this in...| Зои Изабелла Кра́...|Zoë Isabella Krav...|\n", "| Zoë Kravitz|Translate this in...|佐伊·伊莎贝拉·克拉维茨（英语：Z...|Zoë Isabella Krav...|\n", "| Zoë Kravitz|Translate this in...| Zoë Isabella Krav...|Zoë Isabella Krav...|\n", "| Zoë Kravitz|Translate this in...| Zoë Isabella Krav...|Zoë Isabella Krav...|\n", "| Zoë Kravitz|Translate this in...| Zoë Isabella Krav...|Zoë Isabella Krav...|\n", "| Zoya Akhtar|Translate this in...| Zoya Akhtar (en d...|Zoya Akhtar (in D...|\n", "| Zoya Akhtar|Translate this in...| Зоя Ахтар (англ. ...|Zoya Akhtar (born...|\n", "| Zoya Akhtar|Translate this in...| Zoya Akhtar (Bomb...|Zoya Akhtar (Bomb...|\n", "| Zoya Akhtar|Translate this in...| Zoya Akhtar (Mumb...|Zoya Akhtar (Mumb...|\n", "+--------------------+--------------------+----------------------------------+--------------------+\n", "only showing top 20 rows\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "I0000 00:00:1744842967.222738 1156653 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers\n" ] } ], "source": [ "iceberg_df = spark.read.table(f\"{ICEBERG_CATALOG}.{ICEBERG_SCHEMA}.{ICERBERG_TABLE}\")\n", "iceberg_df.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Generate dataset for finetuning" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "I0000 00:00:1744843154.616600 1156649 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+\n", "| input_prompt| expected_model_output|\n", "+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+\n", "|Translate this intro to english: Еттеступа (швед. Ättestupa) — обряд убивання старих членів племе...| Ättestupa (Swedish: Ättestupa) is a ritual of killing old members of the tribe among the Vikings.|\n", "|Translate this intro to english: Ättestupa est la pratique mythique du géronticide à l'époque de ...|Ättestupa is the mythical practice of geronticide at the time of Nordic prehistory, during which ...|\n", "|Translate this intro to english: Ättestupa (även ättestörta eller ättestapul) är en tradition ell...|Pedigree (also genitalia or pedigree) is a tradition or legend that says that the elderly in Nord...|\n", "|Translate this intro to english: Аттеступа (швед. Ättestupa) — название ряда пропастей на террито...|Attestupa (Swedish Ättestupa) is the name of a number of abysses in Scandinavia. Also, this is th...|\n", "|Translate this intro to english: Ättestupa (término escandinavo para precipicio del clan) es un t...|Ättestupa (Scandinavian term for clan cliff) is a term associated with multiple cliffs in Sweden,...|\n", "+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+\n", "only showing top 5 rows\n", "\n" ] } ], "source": [ "from pyspark.sql.functions import concat, col\n", "\n", "finetune_dataset = iceberg_df.select(\n", " concat(col(\"prompt\"), col(\"intro\")).alias(\"input_prompt\"),\n", " col(\"translated_intro\").alias(\"expected_model_output\")\n", ")\n", "\n", "finetune_dataset.show(5, 100)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "finetune_dataset_pandas = finetune_dataset.toPandas()\n", "\n", "train_set = finetune_dataset_pandas.sample(frac=0.8, random_state=42) \n", "test_set = finetune_dataset_pandas.drop(train_set.index)" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [], "source": [ "def generate_records(df: pd.DataFrame) -> List:\n", " \n", " records = []\n", "\n", " for index, row in df.iterrows():\n", "\n", " input_prompt = row['input_prompt']\n", " expected_model_output = row['expected_model_output']\n", "\n", " record = {\n", " \"contents\": [\n", " { \"role\": \"user\", \"parts\": [ { \"text\": input_prompt } ] },\n", " { \"role\": \"model\", \"parts\": [ { \"text\": expected_model_output } ] } ] \n", " }\n", "\n", " records.append(record)\n", " \n", " return records" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_records = generate_records(train_set)[:1000] # Select 1000 training records for fine tuning training\n", "val_records = generate_records(test_set)[:300] # Select 300 eval records for fine tuning evaluation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "TRAIN_FILE_NAME = \"wikipedia_translated/records/fine-tuning-train-dataset.jsonl\"\n", "VAL_FILE_NAME = \"wikipedia_translated/records/fine-tuning-val-dataset.jsonl\"" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [], "source": [ "def upload_gcs(records: List, file_name: str, bucket_name: str = BUCKET_NAME, project_id: str = PROJECT_ID) -> str:\n", " \n", " storage_client = storage.Client(project=project_id)\n", " bucket = storage_client.bucket(bucket_name)\n", " blob = bucket.blob(file_name)\n", "\n", " jsonl_data = \"\\n\".join(json.dumps(item) for item in records)\n", " blob.upload_from_string(jsonl_data)\n", " \n", " uri = f\"gs://{bucket_name}/{file_name}\"\n", "\n", " return uri" ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [], "source": [ "uri_train = upload_gcs(train_records, TRAIN_FILE_NAME)\n", "uri_val = upload_gcs(val_records, VAL_FILE_NAME)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Run finetuning job on Vertex AI" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [], "source": [ "from google import genai\n", "from google.genai import types" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "client = genai.Client(\n", " vertexai=True,\n", " project=PROJECT_ID,\n", " location=REGION\n", ")" ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [], "source": [ "GEMINI_MODEL = \"gemini-2.0-flash-001\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tuned_model_name = \"tuned_gemini_translation\"\n", "\n", "tuning_job = client.tunings.tune(\n", " base_model = GEMINI_MODEL,\n", " training_dataset = types.TuningDataset(gcs_uri = uri_train),\n", " config=types.CreateTuningJobConfig(\n", " epoch_count= 8,\n", " tuned_model_display_name=tuned_model_name,\n", " adapter_size = \"ADAPTER_SIZE_FOUR\",\n", " learning_rate_multiplier = 0.8,\n", " validation_dataset = types.TuningDataset(gcs_uri=uri_val)\n", " )\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "while not tuning_job.has_ended:\n", " time.sleep(30)\n", " tuning_job = client.tunings.get(name=tuning_job.name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(tuning_job.tuned_model.model)\n", "print(tuning_job.tuned_model.endpoint)\n", "print(tuning_job.experiment)" ] }, { "cell_type": "code", "execution_count": 126, "metadata": {}, "outputs": [], "source": [ "def predict(prompt, model):\n", "\n", " contents = [\n", " types.Content(\n", " role=\"user\",\n", " parts=[\n", " types.Part.from_text(text=prompt)\n", " ]\n", " )\n", " ]\n", "\n", " generate_content_config = types.GenerateContentConfig(\n", " temperature = 0.5,\n", " max_output_tokens = 128,\n", " response_mime_type = \"text/plain\",\n", " safety_settings = [types.SafetySetting(\n", " category = 'HARM_CATEGORY_UNSPECIFIED',\n", " threshold = 'BLOCK_ONLY_HIGH',\n", " )]\n", " )\n", " \n", " response = client.models.generate_content(model=model,\n", " contents=contents,\n", " config=generate_content_config)\n", " return response.text" ] }, { "cell_type": "code", "execution_count": 124, "metadata": {}, "outputs": [], "source": [ "input_prompt = \"\"\"Translate this intro to english: You (Estilizado como YOU - Sendo nomeado no Brasil como Você, em Portugal como Tu) é uma série de televisão americana de suspense psicológico desenvolvida por Greg Berlanti e Sera Gamble.\n", "Produzido pela Warner Horizon Television, em associação com Alloy Entertainment e A&E Studios. A série é baseada no romance de 2014 de mesmo nome de Caroline Kepnes.\n", "A primeira temporada segue Joe Goldberg, gerente de uma livraria de Nova York e um serial killer que se apaixona por uma cliente chamada Guinevere Beck e rapidamente desenvolve uma obsessão extrema, tóxica e delirante.\n", "A segunda temporada segue Joe enquanto ele se muda para Los Angeles e se apaixona por Love Quinn, chef e sócia de uma rede de produtos naturais.\n", "A primeira temporada, que foi lançada em 2018, é estrelada por Penn Badgley, Elizabeth Lail, Luca Padovan, Zach Cherry e Shay Mitchell.\n", "Para a segunda temporada, Ambyr Childers foi promovida a regular na série, juntando-se aos recém-escalados Victoria Pedretti, James Scully, Jenna Ortega e Carmela Zumbado.\n", "A série estreou na Lifetime em 9 de setembro de 2018 nos Estados Unidos e transmitida internacionalmente pela Netflix em 26 de dezembro de 2018.\n", "A série atraiu um público limitado na Lifetime antes de se tornar mais popular e um sucesso crítico para a Netflix, com mais de 43 milhões de espectadores tendo transmitido a primeira temporada após sua estreia no serviço de streaming.\n", "A Lifetime anunciou que You foi renovada para uma segunda temporada baseada no romance seguinte de Kepnes, Hidden Bodies, em 26 de julho de 2018, antes da estreia da série.\n", "Em dezembro de 2018, foi anunciado que a série mudaria para a Netflix como um título Original Netflix.\n", "A segunda temporada foi lançada exclusivamente na Netflix em 26 de dezembro de 2019.\n", "Em janeiro de 2020, a série foi renovada para uma terceira temporada pela Netflix, que conta com Badgley e Pedretti reprisando seus papéis.\n", "No dia 30 de agosto de 2021, foi confirmado que a terceira temporada irá estrear dia 15 de outubro de 2021.\n", "Em outubro de 2021, antes da estreia da terceira temporada, a série foi renovada para uma quarta temporada.\"\"\"" ] }, { "cell_type": "code", "execution_count": 127, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:root:AFC is enabled with max remote calls: 10.\n", "INFO:root:AFC remote call 1 is done.\n" ] }, { "data": { "text/plain": [ "'You (Stylized as YOU - Named in Brazil as Você, in Portugal as Tu) is an American psychological thriller television series developed by Greg Berlanti and Sera Gamble.\\nProduced by Warner Horizon Television, in association with Alloy Entertainment and A&E Studios. The series is based on the 2014 novel of the same name by Caroline Kepnes.\\nThe first season follows Joe Goldberg, manager of a New York bookstore and serial killer who falls in love with a customer named Guinevere Beck and quickly develops an extreme, toxic and delusional obsession.\\nThe second season follows Joe as he moves to Los Angeles and falls in'" ] }, "execution_count": 127, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predict(input_prompt, tuning_job.tuned_model.endpoint)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.6" } }, "nbformat": 4, "nbformat_minor": 2 }

notebooks/generative_ai/finetuning/translation_finetuning.ipynb (734 lines of code) (raw):