phillijm/TransformerAssistedLLMCodeSum

Transformer-Assisted LLM-Based Source Code Summarisation: to Enable More Secure Software Development

Transformer-assisted LLM-Based Source Code Summarisation, using CodeSumBART to provide one-shot examples to prompt LLMs to generate source code summaries.

This work was presented at NLPAICS 2026.

We trained a CodeSumBART model on the Funcom dataset, cleaned using an updated version of JavaDatasetCleaner. We then used this model to generate method-summary predictions for a 10% evaluation split of this dataset. These predictions are used to prompt a Large Language Model to generate improved summaries.
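As a rough illustration of the prompting step described above, the sketch below combines a Java method with a CodeSumBART prediction into a single prompt string. The function name `build_one_shot_prompt` and the prompt wording are assumptions made for this example; the prompt actually used is defined in run.py and may differ.

```python
def build_one_shot_prompt(method_source: str, draft_summary: str) -> str:
    """Combine a method and a draft summary into one prompt (hypothetical wording)."""
    return (
        "Summarise the following Java method in one sentence.\n"
        "A smaller model suggested this summary, which may be inaccurate:\n"
        f"{draft_summary.strip()}\n\n"
        "Method:\n"
        f"{method_source.strip()}\n\n"
        "Improved summary:"
    )


if __name__ == "__main__":
    # Toy input; real inputs would come from the CodeSumBART output file.
    method = "public int add(int a, int b) { return a + b; }"
    print(build_one_shot_prompt(method, "adds two integers"))
```

The resulting string would then be passed to whichever LLM is being evaluated.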

Useful Files

  • CodeSumBART-ForGeneration.py: script to use CodeSumBART to generate outputs in the format needed for our LLM prompting script.
  • full_csb.ckpt: a CodeSumBART model trained on the full Funcom dataset. You will need to download or generate this separately, due to the model's size.
  • getResults.py: a script which converts all of the LLM's output TSV files into an Excel spreadsheet.
  • run.py: the script which uses an LLM to summarise source code.
  • runWithShortResponses.py: as above, but with the model prompted to generate only human-length responses.
  • getAverageLengthsOfHumanSummaries.py: a script to find the average length of summaries written by humans in our dataset.
  • getBleu.py: a script to get the BLEU-4 score for summaries generated by our model.
  • run.sh: a template script to call the run.py script.
  • requirements.txt: the Python package requirements for running the model.
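For readers unfamiliar with the two measurements mentioned in the list above (average human summary length and BLEU-4), here is a minimal, dependency-free sketch of both. It is not the code in getAverageLengthsOfHumanSummaries.py or getBleu.py: the function names, whitespace tokenisation, single reference per hypothesis, and absence of smoothing are all assumptions made for illustration, so scores may differ from those produced by the repository's scripts.

```python
import math
from collections import Counter


def average_token_length(summaries):
    """Mean number of whitespace-separated tokens per summary."""
    if not summaries:
        raise ValueError("need at least one summary")
    return sum(len(s.split()) for s in summaries) / len(summaries)


def _ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu4(references, hypotheses):
    """Unsmoothed corpus-level BLEU-4, one reference per hypothesis."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must align")
    matches, totals = [0] * 4, [0] * 4
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        ref_len += len(ref_tokens)
        hyp_len += len(hyp_tokens)
        for n in range(1, 5):
            ref_counts = _ngrams(ref_tokens, n)
            hyp_counts = _ngrams(hyp_tokens, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / 4
    # Brevity penalty: only applied when the hypotheses are not longer than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

A typical use would be to pass the human-written summaries as `references` and the model outputs as `hypotheses`, in the same order.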

Machine Requirements

This code is designed to run on UCREL's Hex HPC, using Slurm.

Recommended system requirements:

  • Nvidia A5000, 24GB
  • 128GB RAM
  • 128GB Storage (SSD or HDD)
  • 2GHz minimum multi-core CPU

@misc{UcrelHex,
    title        = {{UCREL - Hex}; A shared, hybrid multiprocessor system},
    author       = {Vidler, John AND Rayson, Paul},
    abstract     = {Hex is a collection of GPU equipped hosts onto which single- multi-
                    or GPU-processor jobs can be executed hosted at Lancaster University,
                    UK as part of the School of Computing and Communications and the
                    UCREL group.},
    howpublished = {\url{https://github.com/UCREL/hex}},
    note         = {Accessed: 2026}
}

About

Replication package for "Transformer-Assisted LLM-Based Source Code Summarisation: to Enable More Secure Software Development", presented at NLPAICS 2026
