| Title | Paper Link | Model | Brief Description | Document Link |
| --- | --- | --- | --- | --- |
| Repository-Level Prompt Generation for Large Language Models of Code | Link | Codex | A system that generates context-specific prompts for LLM-based code completion, drawing on information from the entire repository. | HTML |
| RepoFusion: Training Code Models to Understand Your Repository | Link | CodeGen-16B-multi | Uses repository-level information to improve code understanding and generation. | HTML |
| CrossCodeEval | Link | CodeGen, SantaCoder, StarCoder, GPT-3.5-turbo | A benchmark dataset focused on using cross-file context to improve the quality of code generation. | HTML |
| RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation | Link | GPT-3.5-Turbo, CodeGen, UniXcoder | An iterative retrieval-and-generation approach to repository-level code completion. | PDF |
| CoCoMIC: Code Completion By Jointly Modeling In-file and Cross-file Context | Link | CodeGen-350M-Mono Python | Code completion that jointly models in-file and cross-file context using a graph-based representation. | HTML |
| Entity-Augmented Code Generation | Link | T5, LLaMA | Project-specific code generation focused on collecting and using entities from the codebase. | HTML |
| DevEval: Evaluating Code Generation in Practical Software Projects | Link | GPT-4, GPT-3.5-turbo, CodeLLaMa, StarCoder | Proposes DevEval, a benchmark of 2,690 samples from 119 practical projects across 10 domains. DevEval aligns with practical projects in multiple dimensions (real program distributions, sufficient dependencies, project-scale contexts). Five popular LLMs are assessed on DevEval to reveal their actual code generation abilities, with the aim of facilitating code generation in practical projects. | PDF |
| SKCODER: A Sketch-based Approach for Automatic Code Generation | [Link](https://arxiv.org/pdf/2302.06144.pdf) | GraphCodeBERT, CodeT5-base | A sketch-based approach to automatic code generation: relevant parts are extracted from similar code and edited according to the natural-language description, mimicking how developers reuse code. | PDF |
| CERT: Continual Pre-Training on Sketches for Library-Oriented Code Generation | Link | PYCODEGPT-CERT, CODEGEN-CERT, GPT-3, Codex | Continual pre-training on sketches for library-oriented code generation. | PDF |

Concept Overview

This approach involves the creation of a dataset comprising pairs of queries and their relevant chunks (code snippets or documentation), which serve as the foundation for training a Cross-Encoder model like BERT for relevance scoring. The aim is to enhance the accuracy of code generation tasks by leveraging these relevance scores.

Query and Relevant Chunk Pairs

  • Data Preparation: Construct pairs of queries and corresponding relevant chunks. These chunks can be code snippets, documentation excerpts, or any other relevant textual content.
  • Dataset Role: These pairs form the dataset on which the Cross-Encoder model will be trained. The dataset is designed to capture the relevance of each chunk to its associated query.
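
As a minimal sketch, one such pair might look like the following; the field names and the 0-1 relevance scale are illustrative assumptions, not a prescribed schema.

```python
# Illustrative sketch of a single training example; field names and the
# 0-1 relevance scale are assumptions, not a prescribed schema.
example = {
    "query": 'def parse_config(path: str) -> dict:\n    """Load a YAML config file into a dict."""',
    "chunk": "def load_yaml(path):\n    with open(path) as f:\n        return yaml.safe_load(f)",
    "relevance": 0.9,  # how useful this chunk is for implementing the query
}
```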

Cross-Encoder for Relevance Scoring

  • Model Configuration: Utilize a Cross-Encoder architecture, such as BERT, to process the query-chunk pairs. Each pair is concatenated using special tokens, following the format [CLS] query [SEP] chunk [SEP] (see the tokenizer sketch after this list).

  • Processing Input: The Cross-Encoder processes this combined input, encoding the relationship between the query and the chunk within the model's architecture.
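
A minimal sketch of preparing this input with the Hugging Face tokenizer (an assumed dependency): passing the query and chunk as a sentence pair yields the [CLS] query [SEP] chunk [SEP] layout automatically for BERT-style models.

```python
# Minimal sketch: sentence-pair encoding produces [CLS] query [SEP] chunk [SEP].
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
query = "Load a YAML config file into a dict."
chunk = "def load_yaml(path):\n    with open(path) as f:\n        return yaml.safe_load(f)"

encoded = tokenizer(query, chunk, truncation=True, max_length=512, return_tensors="pt")
print(tokenizer.decode(encoded["input_ids"][0]))  # [CLS] ... [SEP] ... [SEP]
```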

Generating Relevance Scores

  • Score Extraction: The output of the model corresponding to the [CLS] token is used to derive a relevance score. This score is obtained by passing the output through one or more fully connected layers, often coupled with an activation function.
  • Interpretation of Scores: The relevance score quantifies how relevant the model perceives the chunk to be in relation to the query. It's a measure of suitability for the code generation task based on the query.
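
One possible scoring head over the [CLS] output is sketched below; the single linear layer and the sigmoid (for a 0-1 score) are assumptions about one reasonable setup, not the only option.

```python
# Sketch of a relevance-scoring head on top of BERT's [CLS] representation.
import torch
from torch import nn
from transformers import AutoModel

class CrossEncoderScorer(nn.Module):
    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_state = outputs.last_hidden_state[:, 0]             # [CLS] representation
        return torch.sigmoid(self.head(cls_state)).squeeze(-1)  # relevance score per pair
```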

Relevance Score Approaches

  1. LLM-Generated Code Evaluation: The relevance score can be based on how effectively an LLM generates the final function code using the context provided by the chunk.
  2. LLM-Assisted Chunk Selection (Distillation): Alternatively, an LLM can be tasked with selecting relevant chunks, and the relevance score is derived from the LLM’s selection. This can be implemented by prompting the LLM to rate a chunk's usefulness on a scale of 1 to 5, or through pairwise ranking that determines which of two chunks is more relevant for implementing the function described by the query (a hypothetical prompt is sketched below).
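
For the distillation variant, a prompt could look like the following; the wording and the 1-5 scale are illustrative assumptions, not taken from any of the cited papers.

```python
# Hypothetical ranking prompt for the distillation variant; wording and
# the 1-5 scale are illustrative assumptions.
RANKING_PROMPT = """You are building a code-retrieval training set.

Function to implement:
{query}

Candidate context chunk:
{chunk}

On a scale of 1 (useless) to 5 (essential), how useful is this chunk
for implementing the function? Answer with a single integer."""

def build_ranking_prompt(query: str, chunk: str) -> str:
    return RANKING_PROMPT.format(query=query, chunk=chunk)
```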

Training BERT for Retrieval

  • Dataset Utilization: Train the BERT Cross-Encoder on the dataset, where chunks are associated with their relevance scores. These scores reflect how closely the LLM-generated function aligns with the 'ground truth' or the intended output.
  • Training Approaches:
    1. Separate Training: Initially, pre-train the BERT model using the dataset with established relevance scores.
    2. Integrated Training Cycle: Alternatively, implement a training loop where BERT assigns relevance scores to each chunk; an LLM then generates the function code using these chunks, and the deviation from the expected output (loss) is backpropagated through BERT (with the LLM kept frozen).
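
A sketch of the separate-training option (1) follows: the cross-encoder's score is regressed onto the precomputed relevance labels. It assumes the CrossEncoderScorer sketched earlier and a DataLoader yielding tokenized pairs with a "relevance" field; the hyperparameters are illustrative.

```python
# Sketch of "separate training": regress scores onto relevance labels.
import torch
from torch import nn

def train_scorer(model, loader, epochs=3, lr=2e-5, device="cpu"):
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for batch in loader:
            scores = model(batch["input_ids"].to(device),
                           batch["attention_mask"].to(device))
            loss = loss_fn(scores, batch["relevance"].to(device).float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```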

Notes on the Training Process

  • The model can be initially trained focusing on the direct association of chunks and their relevance (approach 1), followed by a more integrated approach where the model's predictions are directly used to guide code generation and its optimization (approach 2).
The overall pipeline is summarized in the following diagram:

```mermaid
graph TD
    A[Start] --> B[Prepare Query-Chunk Pairs]
    B --> C["Format Pairs for BERT<br> [CLS] query [SEP] chunk [SEP]"]
    C --> D[Train BERT as Cross-Encoder<br>on Prepared Pairs]
    D --> E{Relevance Score<br>Approach}
    E -->|LLM-Generated Code Evaluation| F[Use LLM to Generate Code<br>Based on Chunk Context]
    E -->|LLM-Assisted Chunk Selection| G[LLM Ranks Chunks<br>Scale 1-5 or Pairwise Ranking]
    F --> H[Calculate Relevance Score<br>Based on Code Quality]
    G --> I[Assign Relevance Score<br>Based on LLM Selection]
    H --> J[Backpropagation<br>to Optimize BERT]
    I --> J
    J --> K[Re-trained BERT<br>Ready for Deployment]
    K --> L[End]
```

Entity-Augmented Code Generation
  1. Challenge: The paper highlights LLMs' tendency to hallucinate and their difficulty in using external information sources.

  2. Novel Task: The model generates a function body from a docstring and a set of project-level functions.

  3. Issue with Similar Entity Names: The paper addresses the challenge of distinguishing between entities with similar names, which can lead to incorrect token generation.

  4. Proposed Solution: An entity retriever is integrated directly into the LLM decoder, enabling scalable entity retrieval while minimizing the risk of context contamination.

Example

Figure 1: We propose a novel task of code generation using external entities. In this example, the LLM generates code based on the function name and existing project functions.

Architecture

Figure 3: The proposed model architecture. The retriever produces embeddings of all the entities associated with the sample using cross-attention with the last input token representations from the generator. These entity embeddings augment the generator vocabulary and form a so-called dynamic vocabulary. The generator produces the output token by token using this dynamic vocabulary, deciding at each step either to generate a regular token or to insert an entity.


The paper’s key contributions can be summarized as follows:

  • We introduce a novel architecture for entity-augmented decoding with scalable inference and a plug-and-play design.
  • We rigorously study the proposed model's performance.
  • We publish a new dataset for project-level code generation.

Project-Aware Code Generation Task

Objective: Generate a function within a project that best satisfies a given query, evaluated against a target function in the same project using similarity or API-matching metrics.

Inputs:

  1. Function Description (Query) $Q$: A detailed description or specification of the function that needs to be generated (e.g., a docstring, a function signature, or partially completed function code).
  2. Project Context $C$: The entire codebase of the project, excluding the function that needs to be generated (in particular, references to one or more existing functions within the project that are relevant to the requested function).

Output:

  • Generated Function Code $G$: Code that fulfills the function request and aligns with the style, conventions, and existing functionality of the project.

Similarity Metrics:

  • Functional Similarity: The degree to which the generated function performs a similar task or achieves a similar outcome as the target function(s).
  • API Matching: The extent to which the generated code effectively uses existing APIs and project-specific libraries.
  • Code Style Consistency: The similarity in coding style, including naming conventions, formatting, and structure, between the generated code and the existing project codebase (optional).

Mathematical Formulation:

Let $Q$ represent the function description (query), $C$ the project context, and $G$ the generated function code. The task is to find the $G$ that, given $C$, maximizes its relevance to $Q$ under the metric $M$. This is formally represented as:

$$ \max_{G} M(G, Q, C) $$

where $M(G, Q, C)$ is a metric that computes the relevance score of $G$ with respect to $Q$ in the context $C$.

Metrics

  1. BLEU (Bilingual Evaluation Understudy):

    • Usage: Machine translation, text generation tasks, code generation.
    • Strengths: Good at measuring the n-gram overlap between generated text and reference text, providing a quantitative measure of similarity.
    • Limitations: Lacks sensitivity to the meaning or semantics of the text; primarily focuses on surface-level textual similarities.
  2. ROUGE (Recall-Oriented Understudy for Gisting Evaluation):

    • Usage: Commonly used in summarization tasks, and adaptable for evaluating code generation by measuring overlap with reference code.
    • Strengths: Offers multiple variations (e.g., ROUGE-N, ROUGE-L) to assess different aspects of the generated text, like recall and precision.
    • Limitations: Like BLEU, it focuses more on surface-level text similarities and may not capture semantic accuracy effectively.
  3. BERTScore:

    • Usage: Used in evaluating text generation tasks where semantic understanding is critical, including code generation.
    • Strengths: Leverages contextual embeddings from BERT, allowing it to assess semantic similarity more effectively than BLEU or ROUGE.
    • Limitations: Can be computationally intensive and may require fine-tuning for specific domains or types of text.
  4. METEOR (Metric for Evaluation of Translation with Explicit Ordering):

    • Usage: Used in machine translation and increasingly in other text generation contexts.
    • Strengths: Combines exact word matches with stemming and synonymy, leading to more nuanced evaluation than BLEU.
    • Limitations: More complex to compute and can be sensitive to the choice of linguistic resources (e.g., synonym databases).
  5. Perplexity:

    • Usage: Widely used in language modeling, including in the generation of both code and natural language texts.
    • Strengths: Measures how well a probability model predicts a sample, useful for evaluating the fluency of generated text.
    • Limitations: Does not directly measure the semantic accuracy or relevance of the generated text.
  6. Edit Distance (Levenshtein Distance):

    • Usage: Applicable in scenarios where the exactness of generated text to a reference is crucial, including code.
    • Strengths: Provides a clear numerical value indicating how many edits are needed to match the generated text to a reference.
    • Limitations: More focused on surface-level accuracy and does not account for semantic meaning.
  7. CodeBLEU:

    • Usage: Specifically designed for code generation tasks.
    • Strengths: Adapts the concept of BLEU to code by considering syntactic and semantic aspects, as well as code-specific features.
    • Limitations: Relatively new and might require further validation across different programming languages and contexts.
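
As a small illustration of the surface-level metrics above, BLEU and edit distance can be computed as follows; NLTK is an assumed dependency and the compared snippets are made up.

```python
# Illustrative only: BLEU via NLTK and a hand-written Levenshtein distance.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "def add ( a , b ) : return a + b".split()
candidate = "def add ( x , y ) : return x + y".split()
bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(bleu, levenshtein("return a + b", "return x + y"))
```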

Dataset

Languages: Python, Java, C++, JavaScript, ...

Dataset Description:

The dataset must contain entire projects (full repositories), so that it can be used for testing project-aware code generation.

Project Filtering:

Projects included in the dataset undergo rigorous filtering based on the following criteria:

  • Star Count: Projects with a minimum number of stars, indicating popularity and community interest.
  • Fork Count: Projects with a minimum number of forks, signifying collaboration and contributions.
  • Commit Regularity: Projects with consistent commit activity over time, ensuring relevance.
  • Post-Cutoff Date: Projects that have had commits made after a specified cutoff date, ensuring up-to-date content.
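
A sketch of this filtering step is shown below; the thresholds, the cutoff date, and the shape of the repository metadata are assumptions for illustration only.

```python
# Sketch of repository filtering; thresholds, cutoff date, and the `repo`
# metadata fields are assumptions, not dataset specifications.
from datetime import datetime

MIN_STARS = 100                      # assumed popularity threshold
MIN_FORKS = 10                       # assumed collaboration threshold
CUTOFF_DATE = datetime(2023, 1, 1)   # assumed training-data cutoff

def keep_project(repo: dict) -> bool:
    return (repo["stars"] >= MIN_STARS
            and repo["forks"] >= MIN_FORKS
            and repo["commits_per_month"] >= 1       # rough regularity check
            and repo["last_commit"] > CUTOFF_DATE)   # post-cutoff activity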

Function Selection for Generation (Masking):

The dataset identifies specific functions within the selected projects that are suitable candidates for code generation. These functions are chosen based on their relevance to the project's objectives and alignment with the function request.

API Utilization Analysis:

For each project in the dataset, an analysis is performed to assess the utilization of APIs (Application Programming Interfaces). This analysis involves:

  • Counting the number of API functions used within the project.
  • Identifying and categorizing the API functions based on their relevance to the project's functionality.
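
A rough sketch of such an analysis for Python sources using the standard ast module is shown below; alias handling and scoping are simplified, and this is not the dataset's actual tooling.

```python
# Counts attribute calls on imported module names, e.g. `requests.get(...)`.
# Scoping and alias resolution are simplified; illustrative only.
import ast
from collections import Counter

def count_api_calls(source: str) -> Counter:
    tree = ast.parse(source)
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imported.update(alias.asname or alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imported.add(node.module.split(".")[0])
    calls = Counter()
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id in imported):
            calls[f"{node.func.value.id}.{node.func.attr}"] += 1
    return calls

print(count_api_calls("import requests\nr = requests.get('https://example.com')"))
```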

Task of repository-level code completion

The paper proposes RepoCoder, a system for repository-level code completion that combines iterative retrieval and generation. It includes a benchmark (RepoEval) and employs a pre-trained LLM and a similarity-based retriever.

  • Experimental results indicate that RepoCoder significantly improves over the In-File completion baseline by more than 10% in all settings and consistently outperforms the vanilla retrieval-augmented code completion approach.

Incorporating Retrieval-Augmented Generation

Problems:

Problem with Traditional Tools: Traditional code completion tools focus mainly on the current file, ignoring the broader context of the entire repository. RepoCoder addresses this by considering inter-file dependencies and conventions.

  1. Customized Conventions Challenge: While each code repository often has unique naming conventions and coding styles aiding readability and maintainability, creating repository-level code completion tools that can adapt to these diverse and customized conventions remains a challenging, unresolved issue.

  2. Limitations of Static Analysis: Tools based on static code analysis and heuristic rules, although effective in parsing specific repository contexts, fall short in providing versatile code completion across different file sections, particularly for varying-length completions.

  3. Generalization Hurdle in Language Models: While language models trained on labeled data show promise in specific scenarios, they struggle to generalize effectively to new, unseen repositories without undergoing retraining.

Example:


  1. Contextual Information Use: RepoCoder effectively utilizes scattered, relevant information across different files in a repository, overcoming a key limitation of conventional code completion tools.

  2. RepoCoder's Framework Details: RepoCoder's framework combines a similarity-based retriever, which identifies relevant code snippets in the repository, with a sophisticated LLM for code generation. This combination significantly enhances code completion by incorporating broader repository context.

Pipeline

The figure illustrates the iterative retrieval-generation process used in RepoCoder. It starts with an incomplete code snippet and follows these steps:

  1. First Iteration:

    • Use the incomplete code snippet to retrieve relevant code from the repository.
    • Generate a completion based on this retrieval.
  2. Second Iteration:

    • Combine the generated completion from the first iteration with the original code snippet to create a new, enriched prompt.
    • Use this new prompt to retrieve more relevant code and generate a refined completion.
  3. Subsequent Iterations:

    • Repeat the process, each time enhancing the prompt with the newly generated code from the previous iteration.
    • This leads to progressively improved code completions.
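
The loop can be summarized with the following sketch; `retrieve` and `generate` are placeholders standing in for a similarity-based retriever and an LLM, not RepoCoder's actual components.

```python
# Pseudocode-style sketch of iterative retrieval-generation.
def repo_level_complete(incomplete_code, retrieve, generate, iterations=2):
    query = incomplete_code
    completion = ""
    for _ in range(iterations):
        snippets = retrieve(query)                          # search the repository
        prompt = "\n\n".join(snippets) + "\n\n" + incomplete_code
        completion = generate(prompt)                       # draft / refine a completion
        query = incomplete_code + "\n" + completion         # enrich the next query
    return completion
```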