D-2Code

D²-Code is a context optimization framework designed for repository-level code generation tasks.
It adopts a two-stage pipeline:

🧩 Phase I — Diversification: Selects relevant yet diverse code context snippets from the initially retrieved candidates using K-means clustering and an Upper Confidence Bound (UCB) strategy to reduce redundancy.
🧹 Phase II — DeNoising: Removes irrelevant and noisy code fragments at the block level based on Transformer attention weights, producing clean and concise contexts to improve generation quality.

📁 Project Structure

d2code/
├── __init__.py          # Package initializer
├── utils.py             # Utility functions
├── clustering.py        # K-means clustering implementation
├── diversify.py         # Phase I: Diversification
├── denoise.py           # Phase II: Denoising
├── d2code_pipeline.py   # Main pipeline
└── README.md            # Documentation

🧩 Core Components

🔹 DiversifySelector

A diversification selector based on the UCB algorithm, using a penalty reward method.

Penalty method logic:

Compute similarity between documents and the query
Subtract the maximum similarity with already selected contexts (penalize redundancy)
Use the resulting reward to balance relevance against diversity
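
The penalty logic above can be sketched as follows. This is a minimal illustration, not the code in diversify.py: it assumes cosine similarity over embedding vectors, and the names `cosine`, `penalty_reward`, and `penalty_weight` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def penalty_reward(candidate, query, selected, penalty_weight=1.0):
    """Relevance to the query minus the maximum similarity to selected items.

    `penalty_weight` is an assumed knob; the README does not name one.
    """
    relevance = cosine(candidate, query)
    if not selected:
        # Nothing has been selected yet, so there is no redundancy to penalize.
        return relevance
    redundancy = max(cosine(candidate, s) for s in selected)
    return relevance - penalty_weight * redundancy
```

A candidate that duplicates an already selected context gets a reward near zero even when it is highly relevant, which is what pushes the selection toward diverse snippets.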

🔹 Denoiser

A code denoiser based on Transformer attention, supporting block granularity.

Block granularity characteristics:

Split code into blocks by blank lines
Compute block-level attention weights
Remove blocks with low attention weights
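
A minimal sketch of the block-level steps above, assuming the per-block scores have already been computed. The function names and the fixed `threshold` are hypothetical; the README does not state how "low" is decided in denoise.py.

```python
import re


def split_into_blocks(code):
    """Split source text into blocks separated by one or more blank lines."""
    blocks = re.split(r"\n\s*\n", code.strip("\n"))
    return [b for b in blocks if b.strip()]


def drop_low_attention_blocks(blocks, scores, threshold):
    """Keep blocks whose score is at least `threshold`, preserving order."""
    if len(blocks) != len(scores):
        raise ValueError("blocks and scores must have the same length")
    return [b for b, s in zip(blocks, scores) if s >= threshold]
```

For example, a snippet with three blank-line-separated blocks and scores `[0.5, 0.9, 0.1]` keeps only the first two blocks at a threshold of `0.3`.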

Top-k aggregation:

Select the top 10% of attention weights within each block
Average them to obtain the block's score
This focuses the score on the block's most important tokens
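
The aggregation can be sketched as below, assuming each block comes with a flat list of per-token attention weights. `topk_block_score` and `top_fraction` are hypothetical names, and rounding the 10% down with a minimum of one token is an assumption.

```python
def topk_block_score(token_weights, top_fraction=0.10):
    """Average the largest `top_fraction` of a block's token attention weights."""
    if not token_weights:
        return 0.0
    # Always keep at least one token so that short blocks still get a score.
    k = max(1, int(len(token_weights) * top_fraction))
    top = sorted(token_weights, reverse=True)[:k]
    return sum(top) / k
```

Compared with a plain mean, this score is not diluted by the many low-weight tokens in a long block: a block with two strongly attended tokens out of twenty scores on those two tokens alone.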

⚡ Usage

🔁 Batch processing

# Process a list of records and write the results to output.jsonl
results = pipeline.process_batch(data_list, diversify_config, denoise_config, 'output.jsonl')

📂 File-based processing

# Read from file and process
results = pipeline.process_from_file('input.jsonl', 'output.jsonl', diversify_config, denoise_config)

⚙️ Processing Flow

🧩 Stage 1: Diversification

Cluster input contexts using K-means
Use the UCB algorithm with the penalty reward method to select the optimal set of contexts
Return a diversified list of contexts
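
One way to read the steps above is to treat each K-means cluster as a bandit arm: at every step, pick the cluster with the highest UCB value, then take its best candidate under the penalty reward. The sketch below follows that reading and is an assumption, not the actual logic of diversify.py. Cluster labels are taken as given (for example, produced by clustering.py), and `ucb_select`, `budget`, and the exploration constant `c` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def ucb_select(embeddings, labels, query, budget, c=1.0):
    """Pick up to `budget` item indices, treating each cluster as a UCB arm."""
    arms = sorted(set(labels))
    pulls = {arm: 0 for arm in arms}
    total_reward = {arm: 0.0 for arm in arms}
    remaining = set(range(len(embeddings)))
    selected = []

    def reward(i):
        # Penalty reward: relevance minus max similarity to selected items.
        relevance = cosine(embeddings[i], query)
        redundancy = max(
            (cosine(embeddings[i], embeddings[j]) for j in selected),
            default=0.0,
        )
        return relevance - redundancy

    for step in range(1, budget + 1):
        # Only clusters that still have unselected items can be pulled.
        live = [a for a in arms if any(labels[i] == a for i in remaining)]
        if not live:
            break

        def ucb(arm):
            if pulls[arm] == 0:
                return math.inf  # try every cluster at least once
            mean = total_reward[arm] / pulls[arm]
            return mean + c * math.sqrt(math.log(step) / pulls[arm])

        arm = max(live, key=ucb)
        candidates = [i for i in sorted(remaining) if labels[i] == arm]
        best = max(candidates, key=reward)
        gained = reward(best)  # computed before `best` joins `selected`
        pulls[arm] += 1
        total_reward[arm] += gained
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two near-duplicate snippets in one cluster and a third snippet in another, a budget of two returns one snippet from each cluster rather than both duplicates.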

🧹 Stage 2: Denoising

Select top-k contexts from diversified_contexts
Extract context texts and the query
Analyze relevance using Transformer attention
Perform denoising based on block granularity and top-k aggregation
Construct the final prompt
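
The first and last steps of this stage can be sketched as below. Ranking by the `score` field of the input format is an assumption, and `select_topk`, `build_prompt`, and the prompt layout (snippets first, query last) are hypothetical; the README does not specify the actual template.

```python
def select_topk(contexts, k):
    """Return the k contexts with the highest `score`, best first."""
    return sorted(contexts, key=lambda c: c["score"], reverse=True)[:k]


def build_prompt(denoised_texts, query):
    """Join non-empty denoised snippets and append the query (assumed layout)."""
    parts = [t.strip("\n") for t in denoised_texts if t.strip()]
    return "\n\n".join(parts + [query])
```

Snippets that were emptied completely by denoising are skipped, so they do not leave stray blank sections in the prompt.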

Data Formats

📥 Input format

{
    "contexts": [
        {
            "context": "code snippet",
            "score": 0.95,
            "data": [{"embedding": [0.1, 0.2, ...]}]
        }
    ],
    "query": "query text",
    "metadata": {...}
}
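
Before running the pipeline, it can help to check that each record has the fields shown above. The sketch below is a hypothetical helper, not part of the package. It assumes one JSON object per line in the `.jsonl` files, and it treats `query`, `contexts`, and a first `embedding` entry as required and `metadata` as optional.

```python
import json

REQUIRED_CONTEXT_KEYS = ("context", "score", "data")


def validate_record(record):
    """Raise ValueError if a record lacks the fields of the input format."""
    if not isinstance(record.get("query"), str):
        raise ValueError("record needs a string 'query'")
    contexts = record.get("contexts")
    if not isinstance(contexts, list) or not contexts:
        raise ValueError("record needs a non-empty 'contexts' list")
    for i, ctx in enumerate(contexts):
        missing = [k for k in REQUIRED_CONTEXT_KEYS if k not in ctx]
        if missing:
            raise ValueError(f"context {i} is missing {missing}")
        if not ctx["data"] or "embedding" not in ctx["data"][0]:
            raise ValueError(f"context {i} has no embedding")


def parse_jsonl_lines(lines):
    """Parse an iterable of JSONL lines (e.g. an open file), skipping blanks."""
    return [json.loads(line) for line in lines if line.strip()]
```

Passing an open file handle to `parse_jsonl_lines` yields a list of records that can then be validated one by one.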

📤 Output format

{
    "original_contexts": [...],
    "diversified_contexts": [...],
    "topk_contexts": [...],
    "denoised_contexts": [...],
    "final_prompt": "final prompt",
    "query": "query text",
    "metadata": {...}
}

✨ D²-Code Highlights

Feature	Description
🧠 Diversification	Ensures semantic diversity of selected contexts via UCB + penalty reward
🧽 Denoising	Removes irrelevant code using attention with block granularity & top-k aggregation
🧩 Unified pipeline	End-to-end processing flow combining both stages

⚠️ Notes

Ensure sufficient GPU memory to load the model
You can adjust configuration parameters based on your needs
For large-scale processing, it is recommended to use batch mode
