DeepSoftwareAnalytics/D-2Code

D-2Code

D²-Code is a context optimization framework designed for repository-level code generation tasks.
It adopts a two-stage pipeline:

  • 🧩 Phase I — Diversification: Selects relevant yet diverse code context snippets from the initially retrieved candidates using K-means clustering and an Upper Confidence Bound (UCB) strategy to reduce redundancy.
  • 🧹 Phase II — DeNoising: Removes irrelevant and noisy code fragments at the block level based on Transformer attention weights, producing clean and concise contexts to improve generation quality.

📁 Project Structure

d2code/
├── __init__.py          # Package initializer
├── utils.py             # Utility functions
├── clustering.py        # K-means clustering implementation
├── diversify.py         # Phase I: Diversification
├── denoise.py           # Phase II: Denoiser
├── d2code_pipeline.py   # Main pipeline
└── README.md            # Documentation

🧩 Core Components

🔹 DiversifySelector

A diversification selector based on the UCB algorithm, using a penalty reward method.

Penalty method logic:

  • Compute the similarity between each candidate document and the query
  • Subtract the candidate's maximum similarity to the already selected contexts (this penalizes redundancy)
  • Use the difference as the reward, which balances relevance against diversity
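
The penalty logic above can be sketched in a few lines. This is a minimal illustration, not the code in diversify.py: it assumes cosine similarity over embedding vectors, and the names `cosine` and `penalty_reward` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def penalty_reward(candidate, query, selected):
    """Relevance to the query minus the highest similarity to any selected context."""
    relevance = cosine(candidate, query)
    # With nothing selected yet there is no redundancy to penalize.
    redundancy = max((cosine(candidate, s) for s in selected), default=0.0)
    return relevance - redundancy

query = [1.0, 0.0]
selected = [[0.8, 0.6]]   # one context has already been chosen
duplicate = [0.8, 0.6]    # identical to the selected context
novel = [0.8, -0.6]       # equally relevant to the query, but points elsewhere

print(penalty_reward(duplicate, query, selected))
print(penalty_reward(novel, query, selected))
```

Both candidates are equally similar to the query, but the duplicate receives a lower reward because it repeats what has already been selected.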

🔹 Denoiser

A code denoiser based on Transformer attention, supporting block granularity.

Block granularity characteristics:

  • Split code into blocks by blank lines
  • Compute block-level attention weights
  • Remove blocks with low attention weights
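
As a rough sketch of block granularity, the snippet below splits code on blank lines and drops low-scoring blocks. The per-block scores are supplied by hand here, and the fixed `threshold` cutoff is an assumption: the README does not say how "low" is decided. `split_blocks` and `drop_low_blocks` are hypothetical names.

```python
import re

def split_blocks(code):
    """Split source text into blocks separated by one or more blank lines."""
    return [b for b in re.split(r"\n\s*\n", code.strip("\n")) if b.strip()]

def drop_low_blocks(blocks, scores, threshold):
    """Keep blocks whose score is at least `threshold`, preserving their order."""
    return [b for b, s in zip(blocks, scores) if s >= threshold]

code = "import os\n\ndef f():\n    return 1\n\n\n# unused helper\nx = 2\n"
blocks = split_blocks(code)

# Hand-written stand-ins for block-level attention scores.
scores = [0.4, 0.9, 0.05]
kept = drop_low_blocks(blocks, scores, threshold=0.1)
print("\n\n".join(kept))
```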

Top-k aggregation:

  • Select the top 10% of attention weights within the block
  • Average them to obtain the block's score
  • This weights the score toward the block's most important tokens
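
Top-k aggregation can be illustrated as follows. This sketch assumes each block comes with a list of per-token attention weights; the function name `block_score`, the rounding of the 10% cutoff, and the minimum of one token are assumptions rather than details taken from denoise.py.

```python
def block_score(token_weights, top_fraction=0.10):
    """Mean of the highest `top_fraction` of token attention weights in a block."""
    if not token_weights:
        return 0.0
    # Always keep at least one token so short blocks still get a score.
    k = max(1, round(len(token_weights) * top_fraction))
    top = sorted(token_weights, reverse=True)[:k]
    return sum(top) / k

# 20 tokens: two attract strong attention, the rest very little.
weights = [0.9, 0.8] + [0.01] * 18
plain_mean = sum(weights) / len(weights)

print(block_score(weights))  # driven by the two strongest tokens
print(plain_mean)            # diluted by the 18 weak tokens
```

Averaging only the strongest weights keeps a block with a few highly attended tokens from being scored down by the many unimportant tokens around them.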

⚡ Usage

🔁 Batch processing

# Batch processing
results = pipeline.process_batch(data_list, diversify_config, denoise_config, 'output.jsonl')

📂 File-based processing

# Read from file and process
results = pipeline.process_from_file('input.jsonl', 'output.jsonl', diversify_config, denoise_config)

⚙️ Processing Flow

🧩 Stage 1: Diversification

  1. Cluster input contexts using K-means
  2. Use the UCB algorithm with the penalty reward method to select the optimal set of contexts
  3. Return a diversified list of contexts
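
One way to read the three steps above is to treat each cluster as a bandit arm. The sketch below is an illustration under that assumption, not the implementation in diversify.py: cluster labels are taken as given (as if produced by the K-means step), embeddings are assumed to be unit-length so a dot product serves as similarity, and `ucb_select`, `ucb_value`, and the exploration constant `c` are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ucb_value(mean, pulls, step, c):
    """UCB1-style score: untried arms first, then mean reward plus an exploration bonus."""
    if pulls == 0:
        return float("inf")
    return mean + c * math.sqrt(math.log(step) / pulls)

def ucb_select(embeddings, relevance, labels, budget, c=1.0):
    """Select up to `budget` item indices, one per step, from the arm with the highest UCB."""
    # Group item indices by cluster; each cluster acts as one arm.
    arms = {}
    for idx, label in enumerate(labels):
        arms.setdefault(label, []).append(idx)
    pulls = {label: 0 for label in arms}
    means = {label: 0.0 for label in arms}
    selected = []

    def reward(i):
        # Penalty reward: relevance minus max similarity to already selected items.
        penalty = max((dot(embeddings[i], embeddings[j]) for j in selected), default=0.0)
        return relevance[i] - penalty

    for step in range(1, budget + 1):
        live = [a for a in arms if arms[a]]  # clusters that still have candidates
        if not live:
            break
        arm = max(live, key=lambda a: ucb_value(means[a], pulls[a], step, c))
        best = max(arms[arm], key=reward)
        r = reward(best)  # computed before `best` joins `selected`
        arms[arm].remove(best)
        selected.append(best)
        pulls[arm] += 1
        means[arm] += (r - means[arm]) / pulls[arm]  # running average of rewards
    return selected

embeddings = [[1, 0], [1, 0], [0, 1], [0, 1]]
relevance = [0.9, 0.88, 0.5, 0.4]   # similarity of each item to the query
labels = [0, 0, 1, 1]               # cluster id per item
picked = ucb_select(embeddings, relevance, labels, budget=2)
print(picked)
```

Ranking by relevance alone would return the two near-identical items from cluster 0; here the second pick comes from the other cluster.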

🧹 Stage 2: Denoising

  1. Select top-k contexts from diversified_contexts
  2. Extract context texts and the query
  3. Analyze relevance using Transformer attention
  4. Perform denoising based on block granularity and top-k aggregation
  5. Construct the final prompt
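
The flow above might be wired together roughly as follows. Several details are assumptions because the README does not specify them: top-k is taken as the first `top_k` entries of the diversified list, the prompt is built by joining the cleaned contexts and the query with blank lines, and `denoise` is a caller-supplied stand-in for the attention-based denoiser. `build_prompt` is a hypothetical name.

```python
def build_prompt(diversified_contexts, query, top_k, denoise):
    """Take the first top_k contexts, denoise each one, and append the query."""
    topk = diversified_contexts[:top_k]
    cleaned = [denoise(c["context"], query) for c in topk]
    # Skip contexts that the denoiser emptied completely.
    cleaned = [text for text in cleaned if text.strip()]
    return "\n\n".join(cleaned + [query])

contexts = [
    {"context": "def a(): pass"},
    {"context": "def b(): pass"},
    {"context": "def c(): pass"},
]

# Identity function used in place of a real denoiser.
prompt = build_prompt(contexts, "# complete: call a and b", top_k=2,
                      denoise=lambda text, q: text)
print(prompt)
```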

Data Formats

📥 Input format

{
    "contexts": [
        {
            "context": "code snippet",
            "score": 0.95,
            "data": [{"embedding": [0.1, 0.2, ...]}]
        }
    ],
    "query": "query text",
    "metadata": {...}
}
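
For file-based processing, records in this format are presumably stored one JSON object per line. The sketch below writes and reads such a file with the standard library only. The helper names are hypothetical, the embedding values are made-up placeholders, and `metadata` is left empty because its contents are not specified here.

```python
import json
import os
import tempfile

def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def read_jsonl(path):
    """Read a JSONL file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def embeddings_of(record):
    """Collect the embedding vector stored under each context's `data` field."""
    return [c["data"][0]["embedding"] for c in record["contexts"]]

record = {
    "contexts": [
        {
            "context": "def add(a, b):\n    return a + b",
            "score": 0.95,
            "data": [{"embedding": [0.1, 0.2, 0.3]}],  # placeholder values
        }
    ],
    "query": "def add_three(a, b, c):",
    "metadata": {},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "input.jsonl")
    write_jsonl(path, [record])
    loaded = read_jsonl(path)

print(embeddings_of(loaded[0]))
```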

📤 Output format

{
    "original_contexts": [...],
    "diversified_contexts": [...],
    "topk_contexts": [...],
    "denoised_contexts": [...],
    "final_prompt": "final prompt",
    "query": "query text",
    "metadata": {...}
}

✨ D²-Code Highlights

Feature | Description
🧠 Diversification | Ensures semantic diversity of selected contexts via UCB + penalty reward
🧽 Denoising | Removes irrelevant code using attention with block granularity & top-k aggregation
🧩 Unified pipeline | End-to-end processing flow combining both stages

⚠️ Notes

  1. Ensure sufficient GPU memory to load the model
  2. You can adjust configuration parameters based on your needs
  3. For large-scale processing, it is recommended to use batch mode
