FoodProt Analysis Pipeline

An end-to-end pipeline for building food metagenome-derived protein sequence databases and performing metaproteomic database searches.

1. Data Acquisition

cFMD data can be downloaded both from Zenodo (https://zenodo.org/records/13285428) and GitHub (https://github.com/SegataLab/cFMD.git)

MiFoDB data can be downloaded from Zenodo (https://zenodo.org/records/10881265)

MGnify data can be downloaded from their website (https://www.ebi.ac.uk/metagenomics)

Uniprot data can be downloaded with the script: uniprot_search.py

python uniprot_search.py
python uniprot_search.py --output-dir ./data/uniprot

Notes

The primary categorical framework of this study is structured around cFMD data.
From MGnify studies, only contig files belonging to the following categories were downloaded: dairy, fermented_beverages, and fermented_vegetables
Duplicate studies across cFMD, MiFoDB, and MGnify data sources were identified and removed to ensure data integrity.
Search parameters for the UniProt database script were configured to align with the metadata parameters of the cFMD dataset.

2. Source-Specific Preprocessing

cFMD and MiFoDB data is already classified into eukaryotic and prokaryotic genomes and requires no additional preprocessing. MGnify need source-specific steps before entering the common analysis pipeline.

2.1 MGnify Preprocessing

MGnify contigs are unclassified and require EukRep for euk/prok separation. Eukaryotic contigs additionally need MMseqs2 to determine organism identity for Augustus mapping.

Installing EukRep:

conda create -y -n eukrep -c bioconda python=3.7.16
conda activate eukrep
pip install scikit-learn==0.19.2
pip install EukRep

Using EukRep:

EukRep -i input.fasta -o euk_output.fasta --prokarya prok_output.fasta

Notes

-i: Input FASTA file (metagenomic contigs)
-o: Output FASTA file for eukaryotic contigs
--prokarya: Output FASTA file for prokaryotic contigs

Installing MMseqs2:

conda install -c conda-forge -c bioconda mmseqs2

Using MMseqs2:

mmseqs easy-taxonomy query.fasta target_db output_prefix tmp_dir \
       --threads 16 \
       --split-memory-limit 110G \
       --lca-mode 3 \
       --search-type 3

Notes

query.fasta: Input FASTA file (eukaryotic contigs)
target_db: Reference database for taxonomy assignment (UniRef50 in this case)
output_prefix: Prefix for output files (generates _report and _tophit_report)
tmp_dir: Temporary directory for intermediate files
--threads: Number of CPU threads
--split-memory-limit: Maximum memory per split (prevents crashes on large databases)
--lca-mode 3: LCA algorithm mode: top-hit based approach for taxonomic assignment
--search-type 3: Nucleotide-to-protein search (translated search)

3. Gene Prediction

After preprocessing, all three sources follow the same two-track pipeline: Prodigal + smORFinder for prokaryotic data, Augustus for eukaryotic data. The commands below are run independently for each source (cFMD, MiFoDB, MGnify).

3.1 Prokaryotic: Prodigal

Installing Prodigal:

git clone https://github.com/hyattpd/Prodigal.git
cd ~/Prodigal
make install

Using Prodigal:

prodigal -i input.fasta \
         -o output.genes.gbk \
         -a output.proteins.faa \
         -d output.genes.fna \
         -p single -c -m

Notes

-i: Input FASTA file
-o: Gene coordinates output in GenBank format (.gbk)
-a: Predicted protein sequences (.faa)
-d: Predicted gene nucleotide sequences (.fna)
-p single: Single genome mode (assumes each file is one organism)
-c: Report only closed (complete) ORFs with both start and stop codons
-m: Do not include upstream sequence before the start codon

3.2 Prokaryotic: smORFinder

Installing smORFinder:

conda create -n smorfinder python=3.8.20
conda activate smorfinder
pip install smorfinder

Using smORFinder:

smorf single input.fasta

Notes

single: Single genome mode (one genome per run)
input.fasta: Input FASTA file (nucleotide sequences)

3.3 Eukaryotic: Augustus

For cFMD and MiFoDB, organism models are already known. For MGnify, the MMseqs2 results from step 2.2 are used to determine the appropriate species model.

Installing Augustus:

sudo apt install augustus augustus-data augustus-doc

Using Augustus:

augustus --species=SPECIES_MODEL \
        --strand=both \
        --genemodel=complete \
        --gff3=on \
        --protein=on \
        --codingseq=on \
        input.fasta

Notes

--species: Augustus species model for gene prediction (e.g., saccharomyces_cerevisiae_S288C, aspergillus_nidulans). Determined by taxonomy mapping (MMseqs2 for MGnify, pre-existing mapping for cFMD/MiFoDB).
--strand=both: Predict genes on both forward and reverse strands
--genemodel=complete: Report only complete gene models (with start and stop codons)
--gff3=on: Output in GFF3 format--protein=onInclude predicted protein sequences in the output
--codingseq=on: Include coding nucleotide sequences in the output

Extract protein sequences from GFF output:

perl getAnnoFasta.pl augustus_output.gff3

Notes

augustus_output.gff3: Augustus GFF3 output file containing gene predictions with embedded protein and coding sequences
outputs: .aa -> Predicted protein sequences (renamed to .faa), .codingseq -> Coding nucleotide sequences (renamed to .cds.fna)

4. Functional Annotation

Installing EggNOG-mapper:

conda create -n emapper -c bioconda -c conda-forge python=3.10.19
conda activate emapper
conda install eggnog-mapper

Installing necessary databases:

aria2c -c -x 8 -s 8 -k 1M http://eggnog5.embl.de/download/emapperdb-5.0.2/eggnog.db.gz
aria2c -c -x 8 -s 8 -k 1M http://eggnog5.embl.de/download/emapperdb-5.0.2/eggnog_proteins.dmnd.gz
aria2c -c -x 8 -s 8 -k 1M http://eggnog5.embl.de/download/emapperdb-5.0.2/eggnog.taxa.tar.gz

Using EggNOG-mapper:

emapper.py \
    -i input.faa \
    -o output_prefix \
    --output_dir output_dir \
    --data_dir eggnog_db_dir \
    --scratch_dir tmp_dir \
    --cpu 8 \
    -m diamond \
    --override \
    --dmnd_ignore_warnings

Notes

-i: Input protein FASTA file
-o: Output file prefix
--output_dir: Directory for output files
--data_dir: Path to EggNOG database files (eggnog.db, eggnog_proteins.dmnd, eggnog.taxa.db)
--scratch_dir: Temporary directory for intermediate files
--cpu: Number of CPU threads
-m diamond: Search method: use Diamond for homology search
--override: Overwrite existing output files
--dmnd_ignore_warnings: Tolerate ambiguous amino acids (e.g., X) in input sequences
Created output files: Functional annotation results (GO, KEGG, COG, etc.) -> .emapper.annotations, Diamond search hits -> .emapper.hits, Best matching seed orthologs -> .emapper.seed_orthologs

5. Database Construction

Three scripts for building, annotating, and summarising per-category FASTA databases.

Input: directory list file

All scripts accept the same --list argument: a plain text file (.txt) or Excel file (.xlsx) where each line / first-column cell contains the path to one input directory. The last component of each path is used as the category name.

/data/fermented_dairy
/data/fermented_vegetables
/data/meat

To rename a category (e.g. merge two folder names into one), edit the CATEGORY_MAP dict at the top of each script.

01 - Build category databases

Merges all .faa / .faa.gz files within each category into a single deduplicated FASTA file. Every header is prefixed with the source filename so sequences can be traced back.

Output: <output_dir>/<category>_db.faa
Header format: >samplename_proteins__original_header

python build_category_dbs.py --list dirs.txt --output db/
python build_category_dbs.py --list dirs.xlsx --output db/ --gzip

Notes

Deduplication is exact sequence match (MD5, case-insensitive), across all files within a category.
.xlsx input requires pip install pandas openpyxl; plain .txt has no extra dependencies.
--gzip produces .faa.gz output files.

02 - Build metadata

Reads the *_db.faa files produced in step 01 and resolves per-sequence metadata (source, cell type, sequence type) by matching header prefixes against the original directory list.

Output:

<output_dir>/metadata/<category>_metadata.tsv
<output_dir>/metadata/all_metadata.tsv
<output_dir>/summary.xlsx (pivot-table report; skipped with --no-excel)

python build_metadata.py --list dirs.txt --db db/ --output results/
python build_metadata.py --list dirs.txt --db db/ --output results/ --no-excel --resume

Notes

Metadata inference rules (source, cell_type, type) are defined in the parse_path() function — edit it to match your directory structure.
--resume skips categories whose TSV already exists, useful for large datasets.
--no-excel skips the summary report entirely.
.xlsx input and Excel output require pip install pandas openpyxl.

03 - Build annotation databases

Merges eggNOG-mapper output files (.emapper.annotations) into per-category TSV files. Query IDs are prefixed to match the FASTA headers from step 01. Optionally filters records to keep only sequences present in the deduplicated databases.

Output: <output_dir>/<category>_annotations.tsv

# Without filtering
python build_annotation_dbs.py --list dirs.txt --output annotation_dbs/
 
# With filtering against step-01 databases
python build_annotation_dbs.py --list dirs.txt --output annotation_dbs/ --db db/ --gzip --resume

Notes

When --db is provided, only annotations whose prefixed query ID exists in the corresponding *_db.faa are written; unmatched records are counted and reported.
--resume skips categories whose output file already exists.
--gzip produces .tsv.gz output files.
.xlsx input requires pip install pandas openpyxl; no other extra dependencies.

6. Metaproteomic Database Search

Download ProteoWizard MSConvert from https://proteowizard.sourceforge.io/download.html

We downloaded Pu-erh tea samples from https://www.iprox.cn//page/subproject.html?id=IPX0001388001 that belongs to Zhao et al. (2019).
We tested two of our databases: all_db and fermented_beverages, and two different databases: a general database prepared from UniProt with uniprot_general_db.py, and a database with metagenomic origin prepared from UniProt with uniprot_metagenomic_db.py.

To prepare uniprot_general database:

python uniprot_general_db.py
python uniprot_general_db.py --output food_db.faa
python uniprot_general_db.py --output food_db.faa --include-unreviewed
python uniprot_general_db.py --chunk-size 30

To prepare uniprot_metagenomic database:

python puerh_uniprot_db.py
python puerh_uniprot_db.py --include-unreviewed
python puerh_uniprot_db.py --output custom_name.faa

To convert raw files to mgf files:

Choose your raw files
Select your output format as mgf
On the Filters part: choose Peak Picking
Add the filter and click start

Since our databases are too big for MS-GF+, we needed to split them. Download Fasta-File-Splitter from https://github.com/PNNL-Comp-Mass-Spec/Fasta-File-Splitter/releases/.

In the Fasta_File_Splitter directory:

./FastaFileSplitter /I:input.faa /O:./output/ /MB:400

Notes

/I: Input - source .faa / .fasta file path
/O: Output - directory where split files will be saved
/MB: Megabytes - split by target file size (Sine MS-GF+ works with maximum of 500 MB, we chose to split our database file 400 MB parts.)

Installing MS-GF+:

conda create -n msgf -c bioconda python=3.14.
conda activate msgf
conda install bioconda::msgf_plus

Build Suffix Array (DB indexing):

msgf_plus edu.ucsd.msjava.msdbsearch.BuildSA -d protein_db.fasta

Notes

-d: Protein FASTA database to index

Database Search: Indexes the database before search. Run once per FASTA part.

msgf_plus -s spectra.mgf \
          -d protein_db.fasta \
          -t 6ppm \
          -tda 1 \
          -m 1 \
          -inst 1 \
          -e 1 \
          -o results.mzid

Notes

-s: Input spectrum file (MGF format)
-d: Protein FASTA database
-t 6ppm: Precursor mass tolerance: 6 ppm
-tda 1: Target-decoy search: concatenated (automatic decoy generation)
-m 1: Fragmentation method: CID
-inst 1: Instrument type: Orbitrap/FTICR/Lumos
-e 1: Enzyme: Trypsin
-o: Output file (.mzid format)

To merge mzid files, download the MzidMerger from https://github.com/PNNL-Comp-Mass-Spec/MzidMerger/releases.

Merge split mzid results: Merges per-part mzid files into a single mzid per sample (needed because the database was split into parts).

dotnet MzidMerger.dll \
       -inDir mzid_dir \
       -filter "sample__db_chunk_*.mzid" \
       -out merged.mzid

Convert mzid to TSV: Converts mzid output to readable TSV format.

java -Xmx20G -cp MSGFPlus.jar \
     edu.ucsd.msjava.ui.MzIDToTsv \
     -i input.mzid \
     -o output.tsv \
     -showDecoy 1

Convert TSV files to PSM files: tsv_to_inference.py: Splits MSGF+ .tsv output into target and decoy PSM files formatted for Percolator input.

python tsv_to_inference.py <input.tsv> <output_dir>

For protein inference, we used Py Protein Inference.

conda create -n pyprotein python=3.10
conda activate pyprotein
pip install pyproteininference

Usage:

protein_inference_cli.py \
    -t tmp_dir/sample_target.txt \
    -d tmp_dir/sample_decoy.txt \
    -y params_inference.yaml \
    -l output/sample_inference.csv

Notes

Inference settings are controlled via params_inference.yaml. Key parameters: FDR threshold, parsimony solver, PSM scoring strategy, and decoy symbol.

7. References

Carlino, N., Blanco-Míguez, A., Punčochář, M., Mengoni, C., Pinto, F., Tatti, A., ... & Yap, M. (2024). Unexplored microbial diversity from 2,500 food metagenomes and links with the human microbiome. Cell, 187(20), 5775-5795. Github: https://github.com/SegataLab/cFMD
Caffrey, E. B., Olm, M. R., Kothe, C. I., Wastyk, H. C., Evans, J. D., & Sonnenburg, J. L. (2025). MiFoDB, a workflow for microbial food metagenomic characterization, enables high-resolution analysis of fermented food microbial dynamics. Msystems, 10(9), e00141-25. Website: https://mifodb.readthedocs.io/en/latest/
Richardson, L., Allen, B., Baldi, G., Beracochea, M., Bileschi, M. L., Burdett, T., ... & Finn, R. D. (2023). MGnify: the microbiome sequence data analysis resource in 2023. Nucleic acids research, 51(D1), D753-D759. Website: https://www.ebi.ac.uk/metagenomics
UniProt Consortium. (2019). UniProt: a worldwide hub of protein knowledge. Nucleic acids research, 47(D1), D506-D515.
West, P. T., Probst, A. J., Grigoriev, I. V., Thomas, B. C., & Banfield, J. F. (2018). Genome-reconstruction for eukaryotes from complex natural microbial communities. Genome research, 28(4), 569-580. Github: https://github.com/patrickwest/EukRep
Steinegger, M., & Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature biotechnology, 35(11), 1026-1028. Github: https://github.com/soedinglab/mmseqs2
Hyatt, D., Chen, G. L., LoCascio, P. F., Land, M. L., Larimer, F. W., & Hauser, L. J. (2010). Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC bioinformatics, 11(1), 119. Github: https://github.com/hyattpd/Prodigal
Durrant, M. G., & Bhatt, A. S. (2021). Automated prediction and annotation of small open reading frames in microbial genomes. Cell host & microbe, 29(1), 121-131. Github: https://github.com/bhattlab/SmORFinder
Stanke, M., Diekhans, M., Baertsch, R., & Haussler, D. (2008). Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics, 24(5), 637-644. Github: https://github.com/Gaius-Augustus/Augustus
Cantalapiedra, C. P., Hernández-Plaza, A., Letunic, I., Bork, P., & Huerta-Cepas, J. (2021). eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Molecular biology and evolution, 38(12), 5825-5829. Github: https://github.com/eggnogdb/eggnog-mapper
Huerta-Cepas, J., Szklarczyk, D., Heller, D., Hernández-Plaza, A., Forslund, S. K., Cook, H., ... & Bork, P. (2019). eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic acids research, 47(D1), D309-D314.
Buchfink, B., Reuter, K., & Drost, H. G. (2021). Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature methods, 18(4), 366-368.
Adusumilli, R., & Mallick, P. (2017). Data conversion with ProteoWizard msConvert. In Proteomics: methods and protocols (pp. 339-368). New York, NY: Springer New York. Website: https://proteowizard.sourceforge.io/index.html
Zhao, M., Su, X. Q., Nian, B., Chen, L. J., Zhang, D. L., Duan, S. M., ... & Ma, Y. (2019). Integrated meta-omics approaches to understand the microbiome of spontaneous fermentation of traditional Chinese pu-erh tea. Msystems, 4(6), 10-1128.
Monroe, Matthew. (2018, February 21). Fasta File Splitter. [Computer software]. https://doi.org/10.11578/dc.20180319.27. Github: https://github.com/PNNL-Comp-Mass-Spec/Fasta-File-Splitter
Kim, S., & Pevzner, P. A. (2014). MS-GF+ makes progress towards a universal database search tool for proteomics. Nature communications, 5(1), 5277. Github: https://github.com/MSGFPlus/msgfplus
Gibbons, Brayson & Monroe, Matthew. Mzid Merger. [Computer software]. Github: https://github.com/PNNL-Comp-Mass-Spec/MzidMerger
Hinkle, T. B., & Bakalarski, C. E. (2025). Comprehensive Protein Inference Analysis with PyProteinInference Elucidates Biological Understanding of Tandem Mass Spectrometry Data. Journal of proteome research, 24(4), 2135–2140. https://doi.org/10.1021/acs.jproteome.4c00734. Github: https://github.com/thinkle12/pyproteininference?tab=readme-ov-file

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

FoodProt Analysis Pipeline

Table of Contents

1. Data Acquisition

2. Source-Specific Preprocessing

2.1 MGnify Preprocessing

3. Gene Prediction

3.1 Prokaryotic: Prodigal

3.2 Prokaryotic: smORFinder

3.3 Eukaryotic: Augustus

4. Functional Annotation

5. Database Construction

Input: directory list file

01 - Build category databases

02 - Build metadata

03 - Build annotation databases

6. Metaproteomic Database Search

7. References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
LICENSE		LICENSE
README.md		README.md
build_annotation_dbs.py		build_annotation_dbs.py
build_category_dbs.py		build_category_dbs.py
build_metadata.py		build_metadata.py
params_inference.yaml		params_inference.yaml
tsv_to_inference.py		tsv_to_inference.py
uniprot_general_db.py		uniprot_general_db.py
uniprot_metagenomic_db.py		uniprot_metagenomic_db.py
uniprot_search.py		uniprot_search.py

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

FoodProt Analysis Pipeline

Table of Contents

1. Data Acquisition

2. Source-Specific Preprocessing

2.1 MGnify Preprocessing

3. Gene Prediction

3.1 Prokaryotic: Prodigal

3.2 Prokaryotic: smORFinder

3.3 Eukaryotic: Augustus

4. Functional Annotation

5. Database Construction

Input: directory list file

01 - Build category databases

02 - Build metadata

03 - Build annotation databases

6. Metaproteomic Database Search

7. References

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages