Yes, you absolutely can, and it's generally recommended for getting the best context and generative answers from Vertex AI Search when dealing with software projects.
The key is to consider how Vertex AI Search (and the underlying LLM) will interpret the relationship between your source code and its accompanying documentation.
Here are the main strategies, from simplest to most effective:
Strategy 1: Ingest Everything Raw (Simplest)
This is the most straightforward approach.
- Store all files in GCS: Upload your source code files (`.py`, `.java`, `.cpp`, etc.) and your markdown documentation files (`.md`, `README.md`, `DESIGN.md`, etc.) to the same Google Cloud Storage (GCS) bucket, maintaining their directory structure from your repository (a minimal upload sketch follows this list).
- Example GCS structure:

  ```
  gs://your-repo-docs-bucket/
  ├── src/
  │   ├── feature_a/
  │   │   ├── __init__.py
  │   │   └── main.py
  │   └── common/
  │       └── helpers.py
  ├── docs/
  │   ├── architecture.md
  │   └── deployment.md
  ├── README.md
  ├── DESIGN.md
  └── requirements.txt
  ```
- Create a Data Store: Point your Vertex AI Search Data Store to this GCS bucket, enabling "Generative AI Features."
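If it helps to see the mechanics, here is a minimal sketch of the upload step using the `google-cloud-storage` client. The local path, bucket name, and file-type filter are assumptions to adapt to your setup; you could equally use `gsutil rsync` or any other sync mechanism.

```python
# Minimal sketch: mirror a local repository checkout into a GCS bucket,
# preserving the repo-relative paths. Bucket name, local path, and the
# file-type filter are illustrative assumptions.
from pathlib import Path
from google.cloud import storage

REPO_ROOT = Path("./my-repo")              # hypothetical local checkout
BUCKET_NAME = "your-repo-docs-bucket"      # hypothetical bucket name
INCLUDE_SUFFIXES = {".py", ".java", ".cpp", ".md", ".txt"}

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

for path in REPO_ROOT.rglob("*"):
    if path.is_file() and path.suffix in INCLUDE_SUFFIXES:
        # Keep the repository-relative path as the object name.
        blob = bucket.blob(path.relative_to(REPO_ROOT).as_posix())
        blob.upload_from_filename(str(path))
        print(f"Uploaded {path} -> gs://{BUCKET_NAME}/{blob.name}")
```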
How it works:
Vertex AI Search will ingest both the code and the markdown files, chunk them, and create embeddings for all of them. When a user queries, it will perform a semantic search across all these ingested chunks and retrieve the most relevant ones, regardless of whether they came from a .py file or a .md file. The LLM will then use all retrieved relevant chunks (code and docs) as context to generate an answer.
Pros:
- Extremely simple to set up.
- No pre-processing required.
Cons:
- Contextual gaps: If a query is very specific to a code function, the search might retrieve only the code and miss a crucial explanation in a separate markdown file, even if the two are fundamentally related. The system doesn't explicitly know that `src/feature_a/main.py` is explained by `docs/feature_a_design.md` unless their semantic content is very similar within the retrieved chunks.
- The LLM gets individual chunks, not necessarily the entire related code-and-doc set together.
Strategy 2: Pre-process and Combine for Enhanced Context
This strategy involves a pre-processing step where you intelligently combine related source code and markdown documentation into single, richer files. This ensures that when a relevant "logical unit" (e.g., a feature, a module, a microservice) is retrieved, the LLM has access to both its code and documentation simultaneously.
The Core Idea: Create "synthetic documents" for Vertex AI Search, where each synthetic document represents a cohesive unit that includes both code and its corresponding markdown.
Steps:
1. Define Logical Units: Decide how you want to combine your code and docs. Common logical units are:
   - Feature/Module: All code files for `feature_A` along with its `feature_A.md` or `README.md` from its directory.
   - Microservice: All code and documentation related to a specific microservice.
   - Repository Section: For smaller repos, you might combine several related files.
2. Write a Pre-processing Script: Develop a script (e.g., in Python; a minimal sketch is shown after these steps) that:
   - Iterates through your repository/source tree.
   - Identifies logical units.
   - For each unit:
     - Reads relevant source code files.
     - Reads relevant markdown files that specifically document that code (e.g., a `README.md` in the same directory, or a `design.md` linked to it).
     - Combines their content into a single file. Use clear delimiters or headings within this combined file to help the LLM distinguish between code and prose. Markdown's code blocks are excellent for this.
     - Saves this combined content as a new `.md`, `.txt`, or `.html` file (e.g., `feature_x_auth_module.md`) to a designated GCS "ingestion" bucket.

   Example of combined content for a logical unit:

   ````markdown
   # Feature X - User Authentication Module

   ## Overview Documentation
   This module handles user authentication, including login, logout, and session management.
   It integrates with the `external_auth_provider` for federated identity.

   Key components: `auth_api.py`, `session_manager.py`.
   For more details, see the overall `docs/architecture.md`.

   ## auth_api.py (Source Code)
   ```python
   # src/auth/auth_api.py
   import session_manager
   import external_auth_provider

   def login(username, password):
       # ... authentication logic ...
       token = external_auth_provider.authenticate(username, password)
       session = session_manager.create_session(token)
       return session

   # ... other functions ...
   ```

   ## session_manager.py (Source Code)
   ```python
   # src/auth/session_manager.py
   import datetime

   def create_session(token):
       # ... creates and stores session ...
       return {
           "session_id": "...",
           "expires": datetime.datetime.now() + datetime.timedelta(hours=1),
       }

   # ...
   ```
   ````
3. Ingest into Vertex AI Search: Point your Vertex AI Search Data Store to this new GCS bucket containing your combined files.
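For illustration, here is a minimal sketch of such a pre-processing script. It assumes one logical unit per immediate subdirectory of `src/`, that markdown docs live alongside the code they describe, and a hypothetical output bucket name; adapt the grouping rule and paths to your repository layout.

```python
# Minimal sketch of a combine-and-upload pre-processing script. Assumptions:
# one logical unit per immediate subdirectory of src/, markdown docs sit next
# to the code they describe, and the output bucket name is hypothetical.
from pathlib import Path
from google.cloud import storage

REPO_ROOT = Path("./my-repo")                 # hypothetical local checkout
OUTPUT_BUCKET = "your-combined-docs-bucket"   # hypothetical ingestion bucket
CODE_SUFFIXES = {".py", ".java", ".cpp"}
FENCE = "`" * 3                               # markdown code fence

def build_combined_doc(unit_dir: Path) -> str:
    """Merge one unit's markdown docs and source files into a single markdown string."""
    parts = [f"# {unit_dir.name} (combined code + docs)"]
    for md in sorted(unit_dir.rglob("*.md")):
        parts.append(f"## {md.name} (Documentation)\n\n{md.read_text(encoding='utf-8')}")
    for src in sorted(p for p in unit_dir.rglob("*") if p.suffix in CODE_SUFFIXES):
        rel = src.relative_to(REPO_ROOT)
        code = src.read_text(encoding="utf-8")
        parts.append(f"## {rel} (Source Code)\n\n{FENCE}\n{code}\n{FENCE}")
    return "\n\n".join(parts)

client = storage.Client()
bucket = client.bucket(OUTPUT_BUCKET)

for unit_dir in sorted((REPO_ROOT / "src").iterdir()):
    if not unit_dir.is_dir():
        continue
    combined = build_combined_doc(unit_dir)
    blob = bucket.blob(f"combined/{unit_dir.name}.md")
    blob.upload_from_string(combined, content_type="text/markdown")
    print(f"Wrote gs://{OUTPUT_BUCKET}/{blob.name}")
```

In practice you would refine the grouping rule (e.g., follow explicit links from design docs to code) rather than relying purely on directory layout.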
Pros:
- Stronger contextual understanding: When a user queries about a specific feature, the retrieved chunk is more likely to contain both the relevant code and its direct documentation, providing a richer context to the LLM.
- Reduced hallucination: The LLM is less likely to guess or make inferences if it has the direct explanation of the code available in the same context.
- Improved relevance: Semantic search is more effective when related information is grouped.
Cons:
- Requires initial scripting and ongoing maintenance of the pre-processing pipeline.
- You need to decide on the best "logical unit" for combining. If units are too large, they might exceed Vertex AI Search's internal chunking limits for individual documents, or the LLM's context window (see the size-guard sketch below).
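Because those limits matter, a simple size guard over your combined files can catch units that grew too large. This is only a sketch; the threshold is an assumed budget, not a documented Vertex AI Search limit, so check the current quotas for your data store type.

```python
# Minimal sketch: flag combined documents that may be too large to chunk well.
# The threshold is an assumed budget, not a documented Vertex AI Search limit.
from pathlib import Path

MAX_BYTES = 2_500_000  # assumption: adjust to the current quota for your data store

def oversized_units(combined_dir: Path) -> list[Path]:
    """Return combined markdown files whose size exceeds the assumed budget."""
    return [p for p in combined_dir.glob("*.md") if p.stat().st_size > MAX_BYTES]

for path in oversized_units(Path("./combined")):
    print(f"Consider splitting {path.name} ({path.stat().st_size} bytes) into smaller units")
```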
Strategy 3: Hybrid Approach
You can combine elements of Strategy 1 and Strategy 2:
- Ingest Raw Code and Main Documentation: Use Strategy 1 for your entire raw codebase and broad architectural markdown files (e.g., `docs/architecture.md`). This provides a general knowledge base.
- Ingest Combined Feature Documents: Use Strategy 2 for specific, tightly coupled feature/module-level code and their dedicated `README.md` or design docs. This provides deep, linked context for specific components.
- Combine Data Stores (if needed): You can create multiple data stores in Vertex AI Search (one for raw files, one for combined documents) and connect them all to a single Search Application. Vertex AI Search and its LLM will then draw information from all connected data sources; a hedged sketch of this setup appears after the pros and cons below.
Pros:
- Offers the flexibility of semantic search over all raw code, while providing deep, linked context for specific, important components.
Cons:
- More complex to set up and manage multiple pipelines and data stores.
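If you go this route, the multi-data-store wiring can be done in the console or via the API. Below is a hedged sketch using the `google-cloud-discoveryengine` Python client; the project, data store IDs, and engine ID are hypothetical, and field names can vary between API versions, so treat it as a starting point rather than a definitive recipe.

```python
# Hedged sketch: attach two existing data stores (IDs are hypothetical) to one
# search app (engine) with the google-cloud-discoveryengine client. Verify the
# field names against the client library version you install.
from google.cloud import discoveryengine_v1 as discoveryengine

PROJECT_ID = "your-project-id"          # hypothetical project
LOCATION = "global"

client = discoveryengine.EngineServiceClient()
parent = f"projects/{PROJECT_ID}/locations/{LOCATION}/collections/default_collection"

engine = discoveryengine.Engine(
    display_name="repo-code-and-docs-search",
    solution_type=discoveryengine.SolutionType.SOLUTION_TYPE_SEARCH,
    industry_vertical=discoveryengine.IndustryVertical.GENERIC,
    # One data store for raw files, one for the combined feature documents.
    data_store_ids=["raw-repo-datastore", "combined-docs-datastore"],
)

operation = client.create_engine(
    parent=parent,
    engine=engine,
    engine_id="repo-search-app",        # hypothetical app ID
)
print(operation.result())  # blocks until the long-running operation finishes
```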
A few additional considerations:
- Chunking: Vertex AI Search automatically chunks documents. For code, try to ensure your pre-processed combined documents don't get broken up in ways that separate a code block from its immediate explanation. Using Markdown headings and code blocks helps the internal chunking process.
- Source Code Quality: Well-commented code, good naming conventions, and docstrings will significantly improve the quality of answers, as the LLM will draw heavily from natural language within the code.
- Updating your KB: Set up automated pipelines (e.g., Cloud Build, Cloud Functions, or even just a cron job on a VM) to run your pre-processing script and re-upload files to GCS whenever your source code or documentation changes significantly. Vertex AI Search can then be configured to re-index (a sketch of triggering a re-import via the API follows this list).
- Security: Ensure strict IAM permissions on your GCS buckets containing sensitive source code and documentation.
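As one example of the re-index step, the sketch below triggers an incremental document import from GCS with the `google-cloud-discoveryengine` client after your pipeline re-uploads the combined files. The project ID, data store ID, and bucket path are hypothetical placeholders.

```python
# Hedged sketch: trigger an incremental re-import after new combined files land
# in GCS. Project, data store, and bucket path are hypothetical placeholders.
from google.cloud import discoveryengine_v1 as discoveryengine

PROJECT_ID = "your-project-id"
DATA_STORE_ID = "combined-docs-datastore"

client = discoveryengine.DocumentServiceClient()
parent = client.branch_path(
    project=PROJECT_ID,
    location="global",
    data_store=DATA_STORE_ID,
    branch="default_branch",
)

request = discoveryengine.ImportDocumentsRequest(
    parent=parent,
    gcs_source=discoveryengine.GcsSource(
        input_uris=["gs://your-combined-docs-bucket/combined/*.md"],
        data_schema="content",  # treat each file as unstructured content
    ),
    # INCREMENTAL re-imports changed documents without wiping the data store.
    reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
)

operation = client.import_documents(request=request)
print(operation.result())  # waits for the import to complete
```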
In summary, for most projects where understanding the relationship between code and its docs is critical, Strategy 2 (Pre-process and Combine for Enhanced Context) is the recommended path. It strikes a good balance between ease of use and contextual accuracy for the LLM.