An end-to-end pipeline for building food metagenome-derived protein sequence databases and performing metaproteomic database searches.
- 1. Data Acquisition
- 2. Source-Specific Preprocessing
- 3. Gene Prediction
- 4. Functional Annotation
- 5. Database Construction
- 6. Metaproteomic Database Search
- 7. References
cFMD data can be downloaded both from Zenodo (https://zenodo.org/records/13285428) and GitHub (https://github.com/SegataLab/cFMD.git)
MiFoDB data can be downloaded from Zenodo (https://zenodo.org/records/10881265)
MGnify data can be downloaded from their website (https://www.ebi.ac.uk/metagenomics)
Uniprot data can be downloaded with the script: uniprot_search.py
python uniprot_search.py
python uniprot_search.py --output-dir ./data/uniprotNotes
- The primary categorical framework of this study is structured around cFMD data.
- From MGnify studies, only contig files belonging to the following categories were downloaded: dairy, fermented_beverages, and fermented_vegetables
- Duplicate studies across cFMD, MiFoDB, and MGnify data sources were identified and removed to ensure data integrity.
- Search parameters for the UniProt database script were configured to align with the metadata parameters of the cFMD dataset.
cFMD and MiFoDB data is already classified into eukaryotic and prokaryotic genomes and requires no additional preprocessing. MGnify need source-specific steps before entering the common analysis pipeline.
MGnify contigs are unclassified and require EukRep for euk/prok separation. Eukaryotic contigs additionally need MMseqs2 to determine organism identity for Augustus mapping.
Installing EukRep:
conda create -y -n eukrep -c bioconda python=3.7.16
conda activate eukrep
pip install scikit-learn==0.19.2
pip install EukRepUsing EukRep:
EukRep -i input.fasta -o euk_output.fasta --prokarya prok_output.fastaNotes
- -i: Input FASTA file (metagenomic contigs)
- -o: Output FASTA file for eukaryotic contigs
- --prokarya: Output FASTA file for prokaryotic contigs
Installing MMseqs2:
conda install -c conda-forge -c bioconda mmseqs2Using MMseqs2:
mmseqs easy-taxonomy query.fasta target_db output_prefix tmp_dir \
--threads 16 \
--split-memory-limit 110G \
--lca-mode 3 \
--search-type 3Notes
- query.fasta: Input FASTA file (eukaryotic contigs)
- target_db: Reference database for taxonomy assignment (UniRef50 in this case)
- output_prefix: Prefix for output files (generates _report and _tophit_report)
- tmp_dir: Temporary directory for intermediate files
- --threads: Number of CPU threads
- --split-memory-limit: Maximum memory per split (prevents crashes on large databases)
- --lca-mode 3: LCA algorithm mode: top-hit based approach for taxonomic assignment
- --search-type 3: Nucleotide-to-protein search (translated search)
After preprocessing, all three sources follow the same two-track pipeline: Prodigal + smORFinder for prokaryotic data, Augustus for eukaryotic data. The commands below are run independently for each source (cFMD, MiFoDB, MGnify).
Installing Prodigal:
git clone https://github.com/hyattpd/Prodigal.git
cd ~/Prodigal
make installUsing Prodigal:
prodigal -i input.fasta \
-o output.genes.gbk \
-a output.proteins.faa \
-d output.genes.fna \
-p single -c -mNotes
- -i: Input FASTA file
- -o: Gene coordinates output in GenBank format (.gbk)
- -a: Predicted protein sequences (.faa)
- -d: Predicted gene nucleotide sequences (.fna)
- -p single: Single genome mode (assumes each file is one organism)
- -c: Report only closed (complete) ORFs with both start and stop codons
- -m: Do not include upstream sequence before the start codon
Installing smORFinder:
conda create -n smorfinder python=3.8.20
conda activate smorfinder
pip install smorfinderUsing smORFinder:
smorf single input.fastaNotes
- single: Single genome mode (one genome per run)
- input.fasta: Input FASTA file (nucleotide sequences)
For cFMD and MiFoDB, organism models are already known. For MGnify, the MMseqs2 results from step 2.2 are used to determine the appropriate species model.
Installing Augustus:
sudo apt install augustus augustus-data augustus-docUsing Augustus:
augustus --species=SPECIES_MODEL \
--strand=both \
--genemodel=complete \
--gff3=on \
--protein=on \
--codingseq=on \
input.fastaNotes
- --species: Augustus species model for gene prediction (e.g., saccharomyces_cerevisiae_S288C, aspergillus_nidulans). Determined by taxonomy mapping (MMseqs2 for MGnify, pre-existing mapping for cFMD/MiFoDB).
- --strand=both: Predict genes on both forward and reverse strands
- --genemodel=complete: Report only complete gene models (with start and stop codons)
- --gff3=on: Output in GFF3 format--protein=onInclude predicted protein sequences in the output
- --codingseq=on: Include coding nucleotide sequences in the output
Extract protein sequences from GFF output:
perl getAnnoFasta.pl augustus_output.gff3Notes
- augustus_output.gff3: Augustus GFF3 output file containing gene predictions with embedded protein and coding sequences
- outputs: .aa -> Predicted protein sequences (renamed to .faa), .codingseq -> Coding nucleotide sequences (renamed to .cds.fna)
Installing EggNOG-mapper:
conda create -n emapper -c bioconda -c conda-forge python=3.10.19
conda activate emapper
conda install eggnog-mapperInstalling necessary databases:
aria2c -c -x 8 -s 8 -k 1M http://eggnog5.embl.de/download/emapperdb-5.0.2/eggnog.db.gz
aria2c -c -x 8 -s 8 -k 1M http://eggnog5.embl.de/download/emapperdb-5.0.2/eggnog_proteins.dmnd.gz
aria2c -c -x 8 -s 8 -k 1M http://eggnog5.embl.de/download/emapperdb-5.0.2/eggnog.taxa.tar.gzUsing EggNOG-mapper:
emapper.py \
-i input.faa \
-o output_prefix \
--output_dir output_dir \
--data_dir eggnog_db_dir \
--scratch_dir tmp_dir \
--cpu 8 \
-m diamond \
--override \
--dmnd_ignore_warningsNotes
- -i: Input protein FASTA file
- -o: Output file prefix
- --output_dir: Directory for output files
- --data_dir: Path to EggNOG database files (eggnog.db, eggnog_proteins.dmnd, eggnog.taxa.db)
- --scratch_dir: Temporary directory for intermediate files
- --cpu: Number of CPU threads
- -m diamond: Search method: use Diamond for homology search
- --override: Overwrite existing output files
- --dmnd_ignore_warnings: Tolerate ambiguous amino acids (e.g., X) in input sequences
- Created output files: Functional annotation results (GO, KEGG, COG, etc.) -> .emapper.annotations, Diamond search hits -> .emapper.hits, Best matching seed orthologs -> .emapper.seed_orthologs
Three scripts for building, annotating, and summarising per-category FASTA databases.
All scripts accept the same --list argument: a plain text file (.txt) or Excel file (.xlsx) where each line / first-column cell contains the path to one input directory.
The last component of each path is used as the category name.
/data/fermented_dairy
/data/fermented_vegetables
/data/meat
To rename a category (e.g. merge two folder names into one), edit the CATEGORY_MAP dict at the top of each script.
Merges all .faa / .faa.gz files within each category into a single deduplicated FASTA file.
Every header is prefixed with the source filename so sequences can be traced back.
Output: <output_dir>/<category>_db.faa
Header format: >samplename_proteins__original_header
python build_category_dbs.py --list dirs.txt --output db/
python build_category_dbs.py --list dirs.xlsx --output db/ --gzipNotes
- Deduplication is exact sequence match (MD5, case-insensitive), across all files within a category.
.xlsxinput requirespip install pandas openpyxl; plain.txthas no extra dependencies.--gzipproduces.faa.gzoutput files.
Reads the *_db.faa files produced in step 01 and resolves per-sequence metadata
(source, cell type, sequence type) by matching header prefixes against the original directory list.
Output:
<output_dir>/metadata/<category>_metadata.tsv<output_dir>/metadata/all_metadata.tsv<output_dir>/summary.xlsx(pivot-table report; skipped with--no-excel)
python build_metadata.py --list dirs.txt --db db/ --output results/
python build_metadata.py --list dirs.txt --db db/ --output results/ --no-excel --resumeNotes
- Metadata inference rules (source, cell_type, type) are defined in the
parse_path()function — edit it to match your directory structure. --resumeskips categories whose TSV already exists, useful for large datasets.--no-excelskips the summary report entirely..xlsxinput and Excel output requirepip install pandas openpyxl.
Merges eggNOG-mapper output files (.emapper.annotations) into per-category TSV files.
Query IDs are prefixed to match the FASTA headers from step 01.
Optionally filters records to keep only sequences present in the deduplicated databases.
Output: <output_dir>/<category>_annotations.tsv
# Without filtering
python build_annotation_dbs.py --list dirs.txt --output annotation_dbs/
# With filtering against step-01 databases
python build_annotation_dbs.py --list dirs.txt --output annotation_dbs/ --db db/ --gzip --resumeNotes
- When
--dbis provided, only annotations whose prefixed query ID exists in the corresponding*_db.faaare written; unmatched records are counted and reported. --resumeskips categories whose output file already exists.--gzipproduces.tsv.gzoutput files..xlsxinput requirespip install pandas openpyxl; no other extra dependencies.
Download ProteoWizard MSConvert from https://proteowizard.sourceforge.io/download.html
- We downloaded Pu-erh tea samples from https://www.iprox.cn//page/subproject.html?id=IPX0001388001 that belongs to Zhao et al. (2019).
- We tested two of our databases: all_db and fermented_beverages, and two different databases: a general database prepared from UniProt with uniprot_general_db.py, and a database with metagenomic origin prepared from UniProt with uniprot_metagenomic_db.py.
To prepare uniprot_general database:
python uniprot_general_db.py
python uniprot_general_db.py --output food_db.faa
python uniprot_general_db.py --output food_db.faa --include-unreviewed
python uniprot_general_db.py --chunk-size 30To prepare uniprot_metagenomic database:
python puerh_uniprot_db.py
python puerh_uniprot_db.py --include-unreviewed
python puerh_uniprot_db.py --output custom_name.faaTo convert raw files to mgf files:
Choose your raw files
Select your output format as mgf
On the Filters part: choose Peak Picking
Add the filter and click startSince our databases are too big for MS-GF+, we needed to split them. Download Fasta-File-Splitter from https://github.com/PNNL-Comp-Mass-Spec/Fasta-File-Splitter/releases/.
In the Fasta_File_Splitter directory:
./FastaFileSplitter /I:input.faa /O:./output/ /MB:400Notes
- /I: Input - source .faa / .fasta file path
- /O: Output - directory where split files will be saved
- /MB: Megabytes - split by target file size (Sine MS-GF+ works with maximum of 500 MB, we chose to split our database file 400 MB parts.)
Installing MS-GF+:
conda create -n msgf -c bioconda python=3.14.
conda activate msgf
conda install bioconda::msgf_plusBuild Suffix Array (DB indexing):
msgf_plus edu.ucsd.msjava.msdbsearch.BuildSA -d protein_db.fastaNotes
- -d: Protein FASTA database to index
Database Search: Indexes the database before search. Run once per FASTA part.
msgf_plus -s spectra.mgf \
-d protein_db.fasta \
-t 6ppm \
-tda 1 \
-m 1 \
-inst 1 \
-e 1 \
-o results.mzidNotes
- -s: Input spectrum file (MGF format)
- -d: Protein FASTA database
- -t 6ppm: Precursor mass tolerance: 6 ppm
- -tda 1: Target-decoy search: concatenated (automatic decoy generation)
- -m 1: Fragmentation method: CID
- -inst 1: Instrument type: Orbitrap/FTICR/Lumos
- -e 1: Enzyme: Trypsin
- -o: Output file (.mzid format)
To merge mzid files, download the MzidMerger from https://github.com/PNNL-Comp-Mass-Spec/MzidMerger/releases.
Merge split mzid results: Merges per-part mzid files into a single mzid per sample (needed because the database was split into parts).
dotnet MzidMerger.dll \
-inDir mzid_dir \
-filter "sample__db_chunk_*.mzid" \
-out merged.mzidConvert mzid to TSV: Converts mzid output to readable TSV format.
java -Xmx20G -cp MSGFPlus.jar \
edu.ucsd.msjava.ui.MzIDToTsv \
-i input.mzid \
-o output.tsv \
-showDecoy 1Convert TSV files to PSM files:
tsv_to_inference.py: Splits MSGF+ .tsv output into target and decoy PSM files formatted for Percolator input.
python tsv_to_inference.py <input.tsv> <output_dir>For protein inference, we used Py Protein Inference.
conda create -n pyprotein python=3.10
conda activate pyprotein
pip install pyproteininferenceUsage:
protein_inference_cli.py \
-t tmp_dir/sample_target.txt \
-d tmp_dir/sample_decoy.txt \
-y params_inference.yaml \
-l output/sample_inference.csvNotes
- Inference settings are controlled via params_inference.yaml. Key parameters: FDR threshold, parsimony solver, PSM scoring strategy, and decoy symbol.
- Carlino, N., Blanco-Míguez, A., Punčochář, M., Mengoni, C., Pinto, F., Tatti, A., ... & Yap, M. (2024). Unexplored microbial diversity from 2,500 food metagenomes and links with the human microbiome. Cell, 187(20), 5775-5795. Github: https://github.com/SegataLab/cFMD
- Caffrey, E. B., Olm, M. R., Kothe, C. I., Wastyk, H. C., Evans, J. D., & Sonnenburg, J. L. (2025). MiFoDB, a workflow for microbial food metagenomic characterization, enables high-resolution analysis of fermented food microbial dynamics. Msystems, 10(9), e00141-25. Website: https://mifodb.readthedocs.io/en/latest/
- Richardson, L., Allen, B., Baldi, G., Beracochea, M., Bileschi, M. L., Burdett, T., ... & Finn, R. D. (2023). MGnify: the microbiome sequence data analysis resource in 2023. Nucleic acids research, 51(D1), D753-D759. Website: https://www.ebi.ac.uk/metagenomics
- UniProt Consortium. (2019). UniProt: a worldwide hub of protein knowledge. Nucleic acids research, 47(D1), D506-D515.
- West, P. T., Probst, A. J., Grigoriev, I. V., Thomas, B. C., & Banfield, J. F. (2018). Genome-reconstruction for eukaryotes from complex natural microbial communities. Genome research, 28(4), 569-580. Github: https://github.com/patrickwest/EukRep
- Steinegger, M., & Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature biotechnology, 35(11), 1026-1028. Github: https://github.com/soedinglab/mmseqs2
- Hyatt, D., Chen, G. L., LoCascio, P. F., Land, M. L., Larimer, F. W., & Hauser, L. J. (2010). Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC bioinformatics, 11(1), 119. Github: https://github.com/hyattpd/Prodigal
- Durrant, M. G., & Bhatt, A. S. (2021). Automated prediction and annotation of small open reading frames in microbial genomes. Cell host & microbe, 29(1), 121-131. Github: https://github.com/bhattlab/SmORFinder
- Stanke, M., Diekhans, M., Baertsch, R., & Haussler, D. (2008). Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics, 24(5), 637-644. Github: https://github.com/Gaius-Augustus/Augustus
- Cantalapiedra, C. P., Hernández-Plaza, A., Letunic, I., Bork, P., & Huerta-Cepas, J. (2021). eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Molecular biology and evolution, 38(12), 5825-5829. Github: https://github.com/eggnogdb/eggnog-mapper
- Huerta-Cepas, J., Szklarczyk, D., Heller, D., Hernández-Plaza, A., Forslund, S. K., Cook, H., ... & Bork, P. (2019). eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic acids research, 47(D1), D309-D314.
- Buchfink, B., Reuter, K., & Drost, H. G. (2021). Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature methods, 18(4), 366-368.
- Adusumilli, R., & Mallick, P. (2017). Data conversion with ProteoWizard msConvert. In Proteomics: methods and protocols (pp. 339-368). New York, NY: Springer New York. Website: https://proteowizard.sourceforge.io/index.html
- Zhao, M., Su, X. Q., Nian, B., Chen, L. J., Zhang, D. L., Duan, S. M., ... & Ma, Y. (2019). Integrated meta-omics approaches to understand the microbiome of spontaneous fermentation of traditional Chinese pu-erh tea. Msystems, 4(6), 10-1128.
- Monroe, Matthew. (2018, February 21). Fasta File Splitter. [Computer software]. https://doi.org/10.11578/dc.20180319.27. Github: https://github.com/PNNL-Comp-Mass-Spec/Fasta-File-Splitter
- Kim, S., & Pevzner, P. A. (2014). MS-GF+ makes progress towards a universal database search tool for proteomics. Nature communications, 5(1), 5277. Github: https://github.com/MSGFPlus/msgfplus
- Gibbons, Brayson & Monroe, Matthew. Mzid Merger. [Computer software]. Github: https://github.com/PNNL-Comp-Mass-Spec/MzidMerger
- Hinkle, T. B., & Bakalarski, C. E. (2025). Comprehensive Protein Inference Analysis with PyProteinInference Elucidates Biological Understanding of Tandem Mass Spectrometry Data. Journal of proteome research, 24(4), 2135–2140. https://doi.org/10.1021/acs.jproteome.4c00734. Github: https://github.com/thinkle12/pyproteininference?tab=readme-ov-file