In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Analyze a codebase with Gemini in Vertex AI

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/code/analyze_codebase.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fgemini%2Fuse-cases%2Fcode%2Fanalyze_codebase.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>    
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/gemini/use-cases/code/analyze_codebase.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo"><br> Open in Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/code/analyze_codebase.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://goo.gle/40yTNki">
      <img width="32px" src="https://cdn.qwiklabs.com/assets/gcp_cloud-e3a77215f0b8bfa9b3f611c0d2208c7e8708ed31.svg" alt="Google Cloud logo"><br> Open in  Cloud Skills Boost
    </a>
  </td>
</table>

<div style="clear: both;"></div>


| | |
|-|-|
|Author(s) | [Eric Dong](https://github.com/gericdong), [Holt Skinner](https://github.com/holtskinner), [Aakash Gouda](https://github.com/aksstar)|

## Overview

Gemini has a long context window of up to 1 million tokens, which lets it analyze, classify, and summarize large amounts of content within a single prompt.

With its long-context reasoning, Gemini can analyze an entire codebase for deeper insights.

In this tutorial, you will learn how to analyze an entire codebase with Gemini 2.0 and prompt the model to:

- **Analyze**: Summarize codebases effortlessly.
- **Guide**: Generate clear developer getting-started documentation.
- **Debug**: Uncover critical bugs and provide fixes.
- **Enhance**: Implement new features and improve reliability and security.


## Getting Started

### Install Google Gen AI SDK for Python and other libraries

In addition to the [Google Gen AI SDK for Python](https://cloud.google.com/vertex-ai/generative-ai/docs/sdks/overview), we will be using [Gitingest](https://gitingest.com/) to load the repository into the prompt.


In [None]:
%pip install --upgrade --quiet google-genai gitingest gitpython PyGithub

### Restart runtime (Colab only)

To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which restarts the current kernel.

The restart might take a minute or longer. After it's restarted, continue to the next step.

In [None]:
import sys

if "google.colab" in sys.modules:
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

<div class="alert alert-block alert-warning">
<b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. ⚠️</b>
</div>


### Authenticate your notebook environment (Colab only)

If you are running this notebook on Google Colab, run the following cell to authenticate your environment.


In [None]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information and create client

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

In [None]:
import os

PROJECT_ID = "[your-project-id]"  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}
if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    PROJECT_ID = str(os.environ.get("GOOGLE_CLOUD_PROJECT"))

LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1")

from google import genai

client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)
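Note that `str(os.environ.get("GOOGLE_CLOUD_PROJECT"))` evaluates to the literal string `"None"` when the variable is unset, so a missing project only surfaces later as an API error. The following is a minimal sketch of a stricter lookup. The helper name `resolve_project_id` is hypothetical and is not part of the SDK or of this notebook.

```python
import os
from typing import Mapping, Optional


def resolve_project_id(explicit: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the explicit project ID, else GOOGLE_CLOUD_PROJECT, else raise.

    `resolve_project_id` is a hypothetical helper written for illustration.
    """
    env = os.environ if env is None else env
    placeholder = "[your-project-id]"

    # Prefer a value the user typed into the notebook form.
    if explicit and explicit != placeholder:
        return explicit

    # Fall back to the environment, but fail early if nothing is set.
    project = env.get("GOOGLE_CLOUD_PROJECT")
    if not project:
        raise ValueError(
            "Set PROJECT_ID or the GOOGLE_CLOUD_PROJECT environment variable."
        )
    return project
```

You could then pass `resolve_project_id(PROJECT_ID)` as the `project` argument when creating the client.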

### Import libraries

In [2]:
import os
import shutil

from IPython.core.interactiveshell import InteractiveShell
from IPython.display import Markdown, display
import git
from github import Github
from gitingest import ingest

InteractiveShell.ast_node_interactivity = "all"

from google.genai.types import CreateCachedContentConfig, GenerateContentConfig
import nest_asyncio

nest_asyncio.apply()

## Cloning a codebase

You will use the [Online Boutique](https://github.com/GoogleCloudPlatform/microservices-demo) repo as an example in this notebook. Online Boutique is a cloud-first microservices demo application. The application is a web-based e-commerce app where users can browse items, add them to the cart, and purchase them. It consists of 11 microservices written in multiple languages.

In [3]:
# The GitHub repository URL
repo_url = "https://github.com/GoogleCloudPlatform/microservices-demo"  # @param {type:"string"}

# The location to clone the repo
repo_dir = "./repo"

#### Define helper functions for processing GitHub repository

In [32]:
def clone_repo(repo_url: str, repo_dir: str) -> None:
    """Shallow clone a GitHub repository."""
    if os.path.exists(repo_dir):
        shutil.rmtree(repo_dir)
    os.makedirs(repo_dir, exist_ok=True)
    # Keep the two most recent commits so get_git_diff() has a pair to compare
    git.Repo.clone_from(repo_url, repo_dir, depth=2)


def get_github_issue(owner: str, repo: str, issue_number: int) -> str | None:
    """
    Fetch the contents of a GitHub issue.

    Args:
        owner (str): The owner of the repository.
        repo (str): The name of the repository.
        issue_number (int): The issue number to retrieve.

    Returns:
        str | None: The issue body if found. If fetching fails, the error
            is printed and None is returned instead of raising.
    """
    g = Github()

    try:
        repository = g.get_repo(f"{owner}/{repo}")
        issue = repository.get_issue(number=issue_number)
        return issue.body
    except Exception as error:
        print(f"Error fetching issue: {error}")
        return None


def get_git_diff(repo_dir: str) -> str:
    """Return the diff introduced by the latest commit on the main branch."""
    repo = git.Repo(repo_dir)
    branch_name = "main"

    # A list of commit IDs (SHA-1 hashes) in reverse chronological order (newest first)
    commit_ids = [commit.hexsha for commit in repo.iter_commits(branch_name)]
    if len(commit_ids) >= 2:
        # Diff from the previous commit to the newest one, so added lines show as "+"
        return repo.git.diff(commit_ids[1], commit_ids[0])
    return ""

### Create an index and extract content of a codebase

Clone the repo, then build an index of its files and extract the contents of the code and text files.

Gitingest will extract all of the contents of the files into a long string and create a directory tree of the files.

In [None]:
clone_repo(repo_url, repo_dir)

_, tree, content = ingest(repo_dir)
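Before sending the extracted text to the model, it can help to check roughly whether it fits in the context window. The sketch below uses a heuristic of about 4 characters per token, which is an assumption made for illustration and not a figure from this notebook. The function names are hypothetical. For an exact figure, the Gen AI SDK offers `client.models.count_tokens`.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Roughly estimate a token count from the character count.

    The 4-characters-per-token ratio is an assumed rule of thumb;
    real tokenization varies with language and content.
    """
    return int(len(text) / chars_per_token)


def fits_context(text: str, limit: int = 1_000_000) -> bool:
    """Return True if the estimated token count is within the limit."""
    return estimate_tokens(text) <= limit


# Example with a synthetic string standing in for the ingested content
sample = "a" * 400
print(estimate_tokens(sample), fits_context(sample))
```

In the notebook, you would call `fits_context(tree + content)` after the `ingest` step.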

## Analyzing the codebase with Gemini

With its long-context reasoning, Gemini can process the codebase and answer questions about it.

### Load the Gemini model

Learn more about the [Gemini API models on Vertex AI](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#gemini-models).

In [19]:
MODEL_ID = "gemini-2.0-flash-001"  # @param {type:"string"}

### Create a context cache

We will create a [context cache](https://cloud.google.com/vertex-ai/generative-ai/docs/context-cache/context-cache-overview) of the codebase so we don't have to send the entire context with every request, saving processing time and cost.

**Note**: Context caching is only available for stable models with fixed versions (for example, `gemini-2.0-flash-001`). You must include the version postfix (for example, the `-001`).

For more information, see [Available Gemini stable model versions](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/model-versioning#stable-versions-available).

In [None]:
system_instruction = "You are a coding expert. Your mission is to answer all code related questions with given context and instructions."

contents = [
    """
    Context:
    - The entire codebase is provided below.
    - Here is directory tree of all of the files in the codebase:
    """,
    tree,
    """
    - Then each of the files are concatenated together. You will find all of the code you need:
    """,
    content,
]

cached_content = client.caches.create(
    model=MODEL_ID,
    config=CreateCachedContentConfig(
        contents=contents,
        system_instruction=system_instruction,
        ttl="3600s",
    ),
)
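The `ttl` value above is a string of whole seconds followed by `s`. If you prefer to express the cache lifetime as a `timedelta`, a small conversion helper like the following sketch may be convenient. The name `ttl_string` is hypothetical and not part of the SDK.

```python
from datetime import timedelta


def ttl_string(duration: timedelta) -> str:
    """Convert a timedelta to a '<seconds>s' string, as used for `ttl` above."""
    seconds = int(duration.total_seconds())
    if seconds <= 0:
        raise ValueError("The cache lifetime must be positive.")
    return f"{seconds}s"


# One hour, matching the value used when creating the cache
print(ttl_string(timedelta(hours=1)))
```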

### 1. Summarizing the codebase

Generate a summary of the codebase.

In [None]:
question = """
  Give me a summary of this codebase, and tell me the top 3 things that I can learn from it.
"""

# Generate text using non-streaming method
response = client.models.generate_content(
    model=MODEL_ID,
    contents=question,
    # Use the cached content
    config=GenerateContentConfig(
        cached_content=cached_content.name,
    ),
)

# Print generated text and usage metadata
display(Markdown(response.text))
print(response.usage_metadata)
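The usage metadata printed above can also show how much of the prompt was served from the cache. The sketch below assumes the metadata object exposes `prompt_token_count` and `cached_content_token_count`, and that the prompt count includes the cached tokens. Verify both assumptions against the output you see. A `SimpleNamespace` stands in for the real response here so the example runs without calling the API.

```python
from types import SimpleNamespace


def cached_fraction(usage) -> float:
    """Return the share of prompt tokens that came from cached content.

    Missing or None fields are treated as zero.
    """
    prompt = getattr(usage, "prompt_token_count", None) or 0
    cached = getattr(usage, "cached_content_token_count", None) or 0
    return cached / prompt if prompt else 0.0


# Stand-in object with made-up numbers, for illustration only
example_usage = SimpleNamespace(
    prompt_token_count=1000,
    cached_content_token_count=900,
)
print(f"Cached share of prompt: {cached_fraction(example_usage):.0%}")
```

In the notebook, you would pass `response.usage_metadata` instead of the stand-in object.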

### 2. Creating a developer getting started guide

Generate a getting started guide for developers. This sample uses the streaming option to generate the content.

In [None]:
question = """
  Provide a getting started guide to onboard new developers to the codebase.
"""


# Generate text using streaming method
responses = client.models.generate_content_stream(
    model=MODEL_ID,
    contents=question,
    config=GenerateContentConfig(
        cached_content=cached_content.name,
    ),
)

for response in responses:
    print(response.text, end="")
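Printing chunks as they arrive is good for interactive use, but you may also want the full text afterwards, for example to render it as Markdown. The following sketch collects streamed chunks into one string and skips chunks whose `text` is empty or `None`. The helper name `collect_stream` is hypothetical, and the fake chunks stand in for real streaming responses.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Join the `text` of each streamed chunk, skipping empty ones."""
    parts = []
    for chunk in chunks:
        text = getattr(chunk, "text", None)
        if text:
            parts.append(text)
    return "".join(parts)


# Fake chunks for illustration; a real stream comes from generate_content_stream
fake_chunks = [
    SimpleNamespace(text="# Getting "),
    SimpleNamespace(text=None),
    SimpleNamespace(text="started"),
]
print(collect_stream(fake_chunks))
```

In the notebook, `display(Markdown(collect_stream(responses)))` would render the guide once the stream finishes.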

### 3. Finding bugs

Find the top 3 most severe issues in the codebase.

In [None]:
question = """
  Find the top 3 most severe issues in the codebase.
"""

responses = client.models.generate_content_stream(
    model=MODEL_ID,
    contents=question,
    config=GenerateContentConfig(
        cached_content=cached_content.name,
    ),
)

for response in responses:
    print(response.text, end="")

### 4. Fixing a bug

Find the most severe issue in the codebase that can be fixed and provide a code fix for it.


In [None]:
question = """
  Find the most severe bug in the codebase that you can provide a code fix for.
"""

responses = client.models.generate_content_stream(
    model=MODEL_ID,
    contents=question,
    config=GenerateContentConfig(
        cached_content=cached_content.name,
    ),
)

for response in responses:
    print(response.text, end="")

### 5. Implementing a feature request using Function Calling

Get the feature request text from a GitHub Issue URL.

We will use [Function Calling](https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/function-calling) to extract the feature request data from the prompt, then call the GitHub API to retrieve the contents.

Note: We can't use the previously created cached content, because tools cannot be added at runtime when using cached content, and the other prompts in this notebook do not need this tool.

In [None]:
FEATURE_REQUEST_URL = (
    "https://github.com/GoogleCloudPlatform/microservices-demo/issues/2205"
)

question = f"What is the feature request in the following GitHub issue: {FEATURE_REQUEST_URL}"

response = client.models.generate_content(
    model=MODEL_ID,
    contents=question,
    # Use the function as a tool
    config=GenerateContentConfig(
        tools=[get_github_issue],
    ),
)

issue_description = response.text
display(Markdown(f"# Feature Request\n{issue_description}"))
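With Function Calling, the model extracts the owner, repository, and issue number from the URL before `get_github_issue` runs. For comparison, the same extraction can be done deterministically with a regular expression. This is a sketch of an alternative, not the approach this notebook uses, and the names `ISSUE_URL_PATTERN` and `parse_issue_url` are hypothetical.

```python
import re
from typing import Tuple

# Matches URLs of the form https://github.com/<owner>/<repo>/issues/<number>
ISSUE_URL_PATTERN = re.compile(
    r"^https://github\.com/(?P<owner>[^/]+)/(?P<repo>[^/]+)/issues/(?P<number>\d+)/?$"
)


def parse_issue_url(url: str) -> Tuple[str, str, int]:
    """Split a GitHub issue URL into (owner, repo, issue_number)."""
    match = ISSUE_URL_PATTERN.match(url.strip())
    if match is None:
        raise ValueError(f"Not a GitHub issue URL: {url}")
    return match["owner"], match["repo"], int(match["number"])


owner, repo, number = parse_issue_url(
    "https://github.com/GoogleCloudPlatform/microservices-demo/issues/2205"
)
print(owner, repo, number)
```

The resulting tuple can be passed directly to the helper as `get_github_issue(owner, repo, number)`.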

Use the GitHub Issue text to implement the feature request.

In [None]:
# Combine feature request content and cached code content
question = f"""Implement the following feature request
{issue_description}
"""

response = client.models.generate_content(
    model=MODEL_ID,
    contents=question,
    config=GenerateContentConfig(
        cached_content=cached_content.name,
    ),
)

# Display the generated code response
display(Markdown(response.text))

### 6. Creating a troubleshooting guide

Create a troubleshooting guide to help resolve common issues.

In [None]:
question = """
    Provide a troubleshooting guide to help resolve common issues.
"""

responses = client.models.generate_content_stream(
    model=MODEL_ID,
    contents=question,
    config=GenerateContentConfig(
        cached_content=cached_content.name,
    ),
)

for response in responses:
    print(response.text, end="")

### 7. Making the app more reliable

Recommend best practices to make the application more reliable.


In [None]:
question = """
  How can I make this application more reliable? Consider best practices from https://www.r9y.dev/
"""

responses = client.models.generate_content_stream(
    model=MODEL_ID,
    contents=question,
    config=GenerateContentConfig(
        cached_content=cached_content.name,
    ),
)

for response in responses:
    print(response.text, end="")

### 8. Making the app more secure

Recommend best practices to make the application more secure.

In [None]:
question = """
  How can you secure the application?
"""

responses = client.models.generate_content_stream(
    model=MODEL_ID,
    contents=question,
    config=GenerateContentConfig(
        cached_content=cached_content.name,
    ),
)

for response in responses:
    print(response.text, end="")

### 9. Learning the codebase

Create a quiz about the concepts used in the codebase.

In [None]:
question = """
  Create a quiz about the concepts used in the codebase to help me solidify my understanding.
"""

responses = client.models.generate_content_stream(
    model=MODEL_ID,
    contents=question,
    config=GenerateContentConfig(
        cached_content=cached_content.name,
    ),
)

for response in responses:
    print(response.text, end="")

### 10. Creating a quickstart tutorial

Create an end-to-end quickstart tutorial for a specific component.


In [None]:
question = """
  Please write an end-to-end quickstart tutorial that introduces AlloyDB,
  shows how to configure it with the CartService,
  and highlights key capabilities of AlloyDB in context of the Online Boutique application.
"""

responses = client.models.generate_content_stream(
    model=MODEL_ID,
    contents=question,
    config=GenerateContentConfig(
        cached_content=cached_content.name,
    ),
)

for response in responses:
    print(response.text, end="")

### 11. Creating a Git Changelog Generator

Summarize the changes made between Git commits and highlight the most important ones.

In [None]:
diff_text = get_git_diff(repo_dir)
question = f"""
    Given the git diff output below, summarize the important changes made.
```diff
{diff_text}
```
"""

responses = client.models.generate_content_stream(
    model=MODEL_ID,
    contents=question,
    config=GenerateContentConfig(
        cached_content=cached_content.name,
    ),
)

for response in responses:
    print(response.text, end="")
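For large diffs, it may be useful to split the output by file so that each file's changes can be summarized or filtered separately. The sketch below splits on the `diff --git` header lines that Git emits. The helper name `split_diff_by_file` is hypothetical, and the parsing is simplified: it assumes file paths do not contain the sequence ` b/`.

```python
from typing import Dict


def split_diff_by_file(diff_text: str) -> Dict[str, str]:
    """Split unified git diff output into one section per changed file."""
    sections: Dict[str, str] = {}
    current_path = None
    current_lines = []

    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            # Store the previous file's section before starting a new one
            if current_path is not None:
                sections[current_path] = "\n".join(current_lines)
            # Header looks like: diff --git a/<path> b/<path>
            current_path = line.split(" b/", 1)[-1]
            current_lines = [line]
        elif current_path is not None:
            current_lines.append(line)

    if current_path is not None:
        sections[current_path] = "\n".join(current_lines)
    return sections


# Synthetic diff for illustration only
sample_diff = (
    "diff --git a/app.py b/app.py\n"
    "--- a/app.py\n"
    "+++ b/app.py\n"
    "+print('hello')\n"
    "diff --git a/README.md b/README.md\n"
    "+New line\n"
)
print(list(split_diff_by_file(sample_diff)))
```

In the notebook, `split_diff_by_file(diff_text)` would return a dictionary keyed by file path, which you could loop over to build one prompt per file.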

## Conclusion

In this tutorial, you've learned how to use Gemini to analyze a codebase and prompt the model to:

- Summarize codebases effortlessly.
- Generate clear developer getting-started documentation.
- Uncover critical bugs and provide fixes.
- Implement new features and improve reliability and security.
- Understand changes made between Git commits.