D²-Code is a context optimization framework designed for repository-level code generation tasks.
It adopts a two-stage pipeline:
- 🧩 Phase I — Diversification: Selects relevant yet diverse code context snippets from the initially retrieved candidates using K-means clustering and an Upper Confidence Bound (UCB) strategy to reduce redundancy.
- 🧹 Phase II — DeNoising: Removes irrelevant and noisy code fragments at the block level based on Transformer attention weights, producing clean and concise contexts to improve generation quality.

Project structure:

d2code/
├── __init__.py # Package initializer
├── utils.py # Utility functions
├── clustering.py # K-means clustering implementation
├── diversify.py # Phase I: Diversification
├── denoise.py # Phase II: Denoising
├── d2code_pipeline.py # Main pipeline
└── README.md # Documentation

Phase I uses a diversification selector based on the UCB algorithm with a penalty reward method.
Penalty method logic:
- Compute similarity between documents and the query
- Subtract the maximum similarity with already selected contexts (penalize redundancy)
- Balance relevance and diversity
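
The penalty logic above can be sketched as follows. This is a minimal illustration, not the package's actual implementation: the function names (`cosine`, `penalized_reward`, `greedy_select`) are hypothetical, cosine similarity is an assumed similarity measure, and the selection loop is a plain greedy one that leaves out the UCB exploration term.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def penalized_reward(query_vec, cand_vec, selected_vecs):
    """Relevance to the query minus the maximum similarity to selected contexts."""
    relevance = cosine(query_vec, cand_vec)
    # With nothing selected yet there is no redundancy to penalize.
    redundancy = max((cosine(cand_vec, s) for s in selected_vecs), default=0.0)
    return relevance - redundancy


def greedy_select(query_vec, cand_vecs, k):
    """Hypothetical greedy loop: repeatedly take the candidate with the best reward."""
    selected = []
    remaining = list(range(len(cand_vecs)))
    while remaining and len(selected) < k:
        chosen = [cand_vecs[j] for j in selected]
        best = max(
            remaining,
            key=lambda i: penalized_reward(query_vec, cand_vecs[i], chosen),
        )
        selected.append(best)
        remaining.remove(best)
    return selected
```

In this sketch a near-duplicate of an already selected context gets a reward close to zero or below, so a less relevant but different context can be preferred over it.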

Phase II uses a code denoiser based on Transformer attention, operating at block granularity.
Block granularity characteristics:
- Split code into blocks by blank lines
- Compute block-level attention weights
- Remove blocks with low attention weights
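
As a rough sketch of the block-level steps above, the code below splits a snippet on blank lines and drops the lowest-scoring blocks. It assumes the per-block attention scores have already been computed elsewhere. The names `split_blocks` and `drop_low_attention_blocks` and the `keep_ratio` parameter are hypothetical, since the removal criterion is not specified here.

```python
import math
import re


def split_blocks(code):
    """Split code into blocks separated by one or more blank lines."""
    blocks = re.split(r"\n\s*\n", code.strip("\n"))
    return [b for b in blocks if b.strip()]


def drop_low_attention_blocks(blocks, scores, keep_ratio=0.5):
    """Keep the highest-scoring blocks, preserving their original order.

    keep_ratio is an assumed knob: the fraction of blocks to retain.
    """
    if len(blocks) != len(scores):
        raise ValueError("blocks and scores must have the same length")
    n_keep = max(1, math.ceil(len(blocks) * keep_ratio))
    ranked = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [block for i, block in enumerate(blocks) if i in keep]
```

Preserving the original order of the surviving blocks keeps the remaining code readable, which matters when the result is pasted into a prompt.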
Top-k aggregation:
- Select the top 10% of attention weights
- Average them as the block's score
- Focus more on important tokens
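
The aggregation described above might look like the following sketch. It assumes the top 10% is taken over the token-level attention weights inside a single block and that at least one weight is always kept. The function name `topk_mean` and its `top_fraction` parameter are illustrative, not part of the package.

```python
import math


def topk_mean(weights, top_fraction=0.10):
    """Score a block as the mean of its largest attention weights."""
    if not weights:
        return 0.0
    # Always keep at least one weight so short blocks still get a score.
    k = max(1, math.ceil(len(weights) * top_fraction))
    top = sorted(weights, reverse=True)[:k]
    return sum(top) / k
```

Compared with averaging over every token, this keeps a block with a few highly attended tokens from being diluted by many low-weight tokens around them.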
# Batch processing
results = pipeline.process_batch(data_list, diversify_config, denoise_config, 'output.jsonl')
# Read from file and process
results = pipeline.process_from_file('input.jsonl', 'output.jsonl', diversify_config, denoise_config)

- Cluster input contexts using K-means
- Use the UCB algorithm with the penalty reward method to select the optimal set of contexts
- Return a diversified list of contexts
- Select top-k contexts from `diversified_contexts`
- Extract context texts and the query
- Analyze relevance using Transformer attention
- Perform denoising based on block granularity and top-k aggregation
- Construct the final prompt
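
To make the clustering and UCB selection steps in the list above more concrete, here is one possible sketch that treats each cluster as a bandit arm and applies the standard UCB1 rule. It is an assumption about how the pieces could fit together, not the package's code: `ucb_select` is a hypothetical name, the cluster assignments are taken as given (for example from K-means), and the rewards are held fixed for simplicity instead of being recomputed with the redundancy penalty after each pick.

```python
import math


def ucb_select(clusters, rewards, budget, c=1.0):
    """Pick up to `budget` candidates, choosing a cluster (arm) by UCB1 each round.

    clusters: list of lists of candidate indices, one list per cluster.
    rewards:  reward per candidate index (assumed precomputed).
    c:        exploration weight (assumed parameter).
    """
    # Within each cluster, serve candidates from highest to lowest reward.
    remaining = [
        sorted(members, key=lambda i: rewards[i], reverse=True)
        for members in clusters
    ]
    pulls = [0] * len(clusters)
    totals = [0.0] * len(clusters)
    selected = []
    t = 0
    while len(selected) < budget and any(remaining):
        t += 1
        best_arm, best_score = None, -math.inf
        for arm, members in enumerate(remaining):
            if not members:
                continue  # this cluster is exhausted
            if pulls[arm] == 0:
                score = math.inf  # try every cluster at least once
            else:
                mean = totals[arm] / pulls[arm]
                bonus = c * math.sqrt(2 * math.log(t) / pulls[arm])
                score = mean + bonus
            if score > best_score:
                best_arm, best_score = arm, score
        idx = remaining[best_arm].pop(0)
        selected.append(idx)
        pulls[best_arm] += 1
        totals[best_arm] += rewards[idx]
    return selected
```

Because every cluster is visited once before any is revisited, the first picks are spread across clusters, and later picks drift toward clusters whose members have earned higher rewards.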

Input format:

{
"contexts": [
{
"context": "code snippet",
"score": 0.95,
"data": [{"embedding": [0.1, 0.2, ...]}]
}
],
"query": "query text",
"metadata": {...}
}

Output format:

{
"original_contexts": [...],
"diversified_contexts": [...],
"topk_contexts": [...],
"denoised_contexts": [...],
"final_prompt": "final prompt",
"query": "query text",
"metadata": {...}
}

| Feature | Description |
|---|---|
| 🧠 Diversification | Ensures semantic diversity of selected contexts via UCB + penalty reward |
| 🧽 Denoising | Removes irrelevant code using attention with block granularity & top-k aggregation |
| 🧩 Unified pipeline | End-to-end processing flow combining both stages |
- Ensure sufficient GPU memory to load the model
- You can adjust configuration parameters based on your needs
- For large-scale processing, it is recommended to use batch mode
