Skip to content

arikanlab/FoodProt

Repository files navigation

FoodProt Analysis Pipeline

An end-to-end pipeline for building food metagenome-derived protein sequence databases and performing metaproteomic database searches.

Table of Contents


1. Data Acquisition

cFMD data can be downloaded both from Zenodo (https://zenodo.org/records/13285428) and GitHub (https://github.com/SegataLab/cFMD.git)

MiFoDB data can be downloaded from Zenodo (https://zenodo.org/records/10881265)

MGnify data can be downloaded from their website (https://www.ebi.ac.uk/metagenomics)

Uniprot data can be downloaded with the script: uniprot_search.py

python uniprot_search.py
python uniprot_search.py --output-dir ./data/uniprot

Notes

  • The primary categorical framework of this study is structured around cFMD data.
  • From MGnify studies, only contig files belonging to the following categories were downloaded: dairy, fermented_beverages, and fermented_vegetables
  • Duplicate studies across cFMD, MiFoDB, and MGnify data sources were identified and removed to ensure data integrity.
  • Search parameters for the UniProt database script were configured to align with the metadata parameters of the cFMD dataset.

2. Source-Specific Preprocessing

cFMD and MiFoDB data is already classified into eukaryotic and prokaryotic genomes and requires no additional preprocessing. MGnify need source-specific steps before entering the common analysis pipeline.

2.1 MGnify Preprocessing

MGnify contigs are unclassified and require EukRep for euk/prok separation. Eukaryotic contigs additionally need MMseqs2 to determine organism identity for Augustus mapping.

Installing EukRep:

conda create -y -n eukrep -c bioconda python=3.7.16
conda activate eukrep
pip install scikit-learn==0.19.2
pip install EukRep

Using EukRep:

EukRep -i input.fasta -o euk_output.fasta --prokarya prok_output.fasta

Notes

  • -i: Input FASTA file (metagenomic contigs)
  • -o: Output FASTA file for eukaryotic contigs
  • --prokarya: Output FASTA file for prokaryotic contigs

Installing MMseqs2:

conda install -c conda-forge -c bioconda mmseqs2

Using MMseqs2:

mmseqs easy-taxonomy query.fasta target_db output_prefix tmp_dir \
       --threads 16 \
       --split-memory-limit 110G \
       --lca-mode 3 \
       --search-type 3

Notes

  • query.fasta: Input FASTA file (eukaryotic contigs)
  • target_db: Reference database for taxonomy assignment (UniRef50 in this case)
  • output_prefix: Prefix for output files (generates _report and _tophit_report)
  • tmp_dir: Temporary directory for intermediate files
  • --threads: Number of CPU threads
  • --split-memory-limit: Maximum memory per split (prevents crashes on large databases)
  • --lca-mode 3: LCA algorithm mode: top-hit based approach for taxonomic assignment
  • --search-type 3: Nucleotide-to-protein search (translated search)

3. Gene Prediction

After preprocessing, all three sources follow the same two-track pipeline: Prodigal + smORFinder for prokaryotic data, Augustus for eukaryotic data. The commands below are run independently for each source (cFMD, MiFoDB, MGnify).

3.1 Prokaryotic: Prodigal

Installing Prodigal:

git clone https://github.com/hyattpd/Prodigal.git
cd ~/Prodigal
make install

Using Prodigal:

prodigal -i input.fasta \
         -o output.genes.gbk \
         -a output.proteins.faa \
         -d output.genes.fna \
         -p single -c -m

Notes

  • -i: Input FASTA file
  • -o: Gene coordinates output in GenBank format (.gbk)
  • -a: Predicted protein sequences (.faa)
  • -d: Predicted gene nucleotide sequences (.fna)
  • -p single: Single genome mode (assumes each file is one organism)
  • -c: Report only closed (complete) ORFs with both start and stop codons
  • -m: Do not include upstream sequence before the start codon

3.2 Prokaryotic: smORFinder

Installing smORFinder:

conda create -n smorfinder python=3.8.20
conda activate smorfinder
pip install smorfinder

Using smORFinder:

smorf single input.fasta

Notes

  • single: Single genome mode (one genome per run)
  • input.fasta: Input FASTA file (nucleotide sequences)

3.3 Eukaryotic: Augustus

For cFMD and MiFoDB, organism models are already known. For MGnify, the MMseqs2 results from step 2.2 are used to determine the appropriate species model.

Installing Augustus:

sudo apt install augustus augustus-data augustus-doc

Using Augustus:

augustus --species=SPECIES_MODEL \
        --strand=both \
        --genemodel=complete \
        --gff3=on \
        --protein=on \
        --codingseq=on \
        input.fasta

Notes

  • --species: Augustus species model for gene prediction (e.g., saccharomyces_cerevisiae_S288C, aspergillus_nidulans). Determined by taxonomy mapping (MMseqs2 for MGnify, pre-existing mapping for cFMD/MiFoDB).
  • --strand=both: Predict genes on both forward and reverse strands
  • --genemodel=complete: Report only complete gene models (with start and stop codons)
  • --gff3=on: Output in GFF3 format--protein=onInclude predicted protein sequences in the output
  • --codingseq=on: Include coding nucleotide sequences in the output

Extract protein sequences from GFF output:

perl getAnnoFasta.pl augustus_output.gff3

Notes

  • augustus_output.gff3: Augustus GFF3 output file containing gene predictions with embedded protein and coding sequences
  • outputs: .aa -> Predicted protein sequences (renamed to .faa), .codingseq -> Coding nucleotide sequences (renamed to .cds.fna)

4. Functional Annotation

Installing EggNOG-mapper:

conda create -n emapper -c bioconda -c conda-forge python=3.10.19
conda activate emapper
conda install eggnog-mapper

Installing necessary databases:

aria2c -c -x 8 -s 8 -k 1M http://eggnog5.embl.de/download/emapperdb-5.0.2/eggnog.db.gz
aria2c -c -x 8 -s 8 -k 1M http://eggnog5.embl.de/download/emapperdb-5.0.2/eggnog_proteins.dmnd.gz
aria2c -c -x 8 -s 8 -k 1M http://eggnog5.embl.de/download/emapperdb-5.0.2/eggnog.taxa.tar.gz

Using EggNOG-mapper:

emapper.py \
    -i input.faa \
    -o output_prefix \
    --output_dir output_dir \
    --data_dir eggnog_db_dir \
    --scratch_dir tmp_dir \
    --cpu 8 \
    -m diamond \
    --override \
    --dmnd_ignore_warnings

Notes

  • -i: Input protein FASTA file
  • -o: Output file prefix
  • --output_dir: Directory for output files
  • --data_dir: Path to EggNOG database files (eggnog.db, eggnog_proteins.dmnd, eggnog.taxa.db)
  • --scratch_dir: Temporary directory for intermediate files
  • --cpu: Number of CPU threads
  • -m diamond: Search method: use Diamond for homology search
  • --override: Overwrite existing output files
  • --dmnd_ignore_warnings: Tolerate ambiguous amino acids (e.g., X) in input sequences
  • Created output files: Functional annotation results (GO, KEGG, COG, etc.) -> .emapper.annotations, Diamond search hits -> .emapper.hits, Best matching seed orthologs -> .emapper.seed_orthologs

5. Database Construction

Three scripts for building, annotating, and summarising per-category FASTA databases.

Input: directory list file

All scripts accept the same --list argument: a plain text file (.txt) or Excel file (.xlsx) where each line / first-column cell contains the path to one input directory. The last component of each path is used as the category name.

/data/fermented_dairy
/data/fermented_vegetables
/data/meat

To rename a category (e.g. merge two folder names into one), edit the CATEGORY_MAP dict at the top of each script.

01 - Build category databases

Merges all .faa / .faa.gz files within each category into a single deduplicated FASTA file. Every header is prefixed with the source filename so sequences can be traced back.

Output: <output_dir>/<category>_db.faa
Header format: >samplename_proteins__original_header

python build_category_dbs.py --list dirs.txt --output db/
python build_category_dbs.py --list dirs.xlsx --output db/ --gzip

Notes

  • Deduplication is exact sequence match (MD5, case-insensitive), across all files within a category.
  • .xlsx input requires pip install pandas openpyxl; plain .txt has no extra dependencies.
  • --gzip produces .faa.gz output files.

02 - Build metadata

Reads the *_db.faa files produced in step 01 and resolves per-sequence metadata (source, cell type, sequence type) by matching header prefixes against the original directory list.

Output:

  • <output_dir>/metadata/<category>_metadata.tsv
  • <output_dir>/metadata/all_metadata.tsv
  • <output_dir>/summary.xlsx (pivot-table report; skipped with --no-excel)
python build_metadata.py --list dirs.txt --db db/ --output results/
python build_metadata.py --list dirs.txt --db db/ --output results/ --no-excel --resume

Notes

  • Metadata inference rules (source, cell_type, type) are defined in the parse_path() function — edit it to match your directory structure.
  • --resume skips categories whose TSV already exists, useful for large datasets.
  • --no-excel skips the summary report entirely.
  • .xlsx input and Excel output require pip install pandas openpyxl.

03 - Build annotation databases

Merges eggNOG-mapper output files (.emapper.annotations) into per-category TSV files. Query IDs are prefixed to match the FASTA headers from step 01. Optionally filters records to keep only sequences present in the deduplicated databases.

Output: <output_dir>/<category>_annotations.tsv

# Without filtering
python build_annotation_dbs.py --list dirs.txt --output annotation_dbs/
 
# With filtering against step-01 databases
python build_annotation_dbs.py --list dirs.txt --output annotation_dbs/ --db db/ --gzip --resume

Notes

  • When --db is provided, only annotations whose prefixed query ID exists in the corresponding *_db.faa are written; unmatched records are counted and reported.
  • --resume skips categories whose output file already exists.
  • --gzip produces .tsv.gz output files.
  • .xlsx input requires pip install pandas openpyxl; no other extra dependencies.

6. Metaproteomic Database Search

Download ProteoWizard MSConvert from https://proteowizard.sourceforge.io/download.html

  • We downloaded Pu-erh tea samples from https://www.iprox.cn//page/subproject.html?id=IPX0001388001 that belongs to Zhao et al. (2019).
  • We tested two of our databases: all_db and fermented_beverages, and two different databases: a general database prepared from UniProt with uniprot_general_db.py, and a database with metagenomic origin prepared from UniProt with uniprot_metagenomic_db.py.

To prepare uniprot_general database:

python uniprot_general_db.py
python uniprot_general_db.py --output food_db.faa
python uniprot_general_db.py --output food_db.faa --include-unreviewed
python uniprot_general_db.py --chunk-size 30

To prepare uniprot_metagenomic database:

python puerh_uniprot_db.py
python puerh_uniprot_db.py --include-unreviewed
python puerh_uniprot_db.py --output custom_name.faa

To convert raw files to mgf files:

Choose your raw files
Select your output format as mgf
On the Filters part: choose Peak Picking
Add the filter and click start

Since our databases are too big for MS-GF+, we needed to split them. Download Fasta-File-Splitter from https://github.com/PNNL-Comp-Mass-Spec/Fasta-File-Splitter/releases/.

In the Fasta_File_Splitter directory:

./FastaFileSplitter /I:input.faa /O:./output/ /MB:400

Notes

  • /I: Input - source .faa / .fasta file path
  • /O: Output - directory where split files will be saved
  • /MB: Megabytes - split by target file size (Sine MS-GF+ works with maximum of 500 MB, we chose to split our database file 400 MB parts.)

Installing MS-GF+:

conda create -n msgf -c bioconda python=3.14.
conda activate msgf
conda install bioconda::msgf_plus

Build Suffix Array (DB indexing):

msgf_plus edu.ucsd.msjava.msdbsearch.BuildSA -d protein_db.fasta

Notes

  • -d: Protein FASTA database to index

Database Search: Indexes the database before search. Run once per FASTA part.

msgf_plus -s spectra.mgf \
          -d protein_db.fasta \
          -t 6ppm \
          -tda 1 \
          -m 1 \
          -inst 1 \
          -e 1 \
          -o results.mzid

Notes

  • -s: Input spectrum file (MGF format)
  • -d: Protein FASTA database
  • -t 6ppm: Precursor mass tolerance: 6 ppm
  • -tda 1: Target-decoy search: concatenated (automatic decoy generation)
  • -m 1: Fragmentation method: CID
  • -inst 1: Instrument type: Orbitrap/FTICR/Lumos
  • -e 1: Enzyme: Trypsin
  • -o: Output file (.mzid format)

To merge mzid files, download the MzidMerger from https://github.com/PNNL-Comp-Mass-Spec/MzidMerger/releases.

Merge split mzid results: Merges per-part mzid files into a single mzid per sample (needed because the database was split into parts).

dotnet MzidMerger.dll \
       -inDir mzid_dir \
       -filter "sample__db_chunk_*.mzid" \
       -out merged.mzid

Convert mzid to TSV: Converts mzid output to readable TSV format.

java -Xmx20G -cp MSGFPlus.jar \
     edu.ucsd.msjava.ui.MzIDToTsv \
     -i input.mzid \
     -o output.tsv \
     -showDecoy 1

Convert TSV files to PSM files: tsv_to_inference.py: Splits MSGF+ .tsv output into target and decoy PSM files formatted for Percolator input.

python tsv_to_inference.py <input.tsv> <output_dir>

For protein inference, we used Py Protein Inference.

conda create -n pyprotein python=3.10
conda activate pyprotein
pip install pyproteininference

Usage:

protein_inference_cli.py \
    -t tmp_dir/sample_target.txt \
    -d tmp_dir/sample_decoy.txt \
    -y params_inference.yaml \
    -l output/sample_inference.csv

Notes

  • Inference settings are controlled via params_inference.yaml. Key parameters: FDR threshold, parsimony solver, PSM scoring strategy, and decoy symbol.

7. References

  • Carlino, N., Blanco-Míguez, A., Punčochář, M., Mengoni, C., Pinto, F., Tatti, A., ... & Yap, M. (2024). Unexplored microbial diversity from 2,500 food metagenomes and links with the human microbiome. Cell, 187(20), 5775-5795. Github: https://github.com/SegataLab/cFMD
  • Caffrey, E. B., Olm, M. R., Kothe, C. I., Wastyk, H. C., Evans, J. D., & Sonnenburg, J. L. (2025). MiFoDB, a workflow for microbial food metagenomic characterization, enables high-resolution analysis of fermented food microbial dynamics. Msystems, 10(9), e00141-25. Website: https://mifodb.readthedocs.io/en/latest/
  • Richardson, L., Allen, B., Baldi, G., Beracochea, M., Bileschi, M. L., Burdett, T., ... & Finn, R. D. (2023). MGnify: the microbiome sequence data analysis resource in 2023. Nucleic acids research, 51(D1), D753-D759. Website: https://www.ebi.ac.uk/metagenomics
  • UniProt Consortium. (2019). UniProt: a worldwide hub of protein knowledge. Nucleic acids research, 47(D1), D506-D515.
  • West, P. T., Probst, A. J., Grigoriev, I. V., Thomas, B. C., & Banfield, J. F. (2018). Genome-reconstruction for eukaryotes from complex natural microbial communities. Genome research, 28(4), 569-580. Github: https://github.com/patrickwest/EukRep
  • Steinegger, M., & Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature biotechnology, 35(11), 1026-1028. Github: https://github.com/soedinglab/mmseqs2
  • Hyatt, D., Chen, G. L., LoCascio, P. F., Land, M. L., Larimer, F. W., & Hauser, L. J. (2010). Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC bioinformatics, 11(1), 119. Github: https://github.com/hyattpd/Prodigal
  • Durrant, M. G., & Bhatt, A. S. (2021). Automated prediction and annotation of small open reading frames in microbial genomes. Cell host & microbe, 29(1), 121-131. Github: https://github.com/bhattlab/SmORFinder
  • Stanke, M., Diekhans, M., Baertsch, R., & Haussler, D. (2008). Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics, 24(5), 637-644. Github: https://github.com/Gaius-Augustus/Augustus
  • Cantalapiedra, C. P., Hernández-Plaza, A., Letunic, I., Bork, P., & Huerta-Cepas, J. (2021). eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Molecular biology and evolution, 38(12), 5825-5829. Github: https://github.com/eggnogdb/eggnog-mapper
  • Huerta-Cepas, J., Szklarczyk, D., Heller, D., Hernández-Plaza, A., Forslund, S. K., Cook, H., ... & Bork, P. (2019). eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic acids research, 47(D1), D309-D314.
  • Buchfink, B., Reuter, K., & Drost, H. G. (2021). Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature methods, 18(4), 366-368.
  • Adusumilli, R., & Mallick, P. (2017). Data conversion with ProteoWizard msConvert. In Proteomics: methods and protocols (pp. 339-368). New York, NY: Springer New York. Website: https://proteowizard.sourceforge.io/index.html
  • Zhao, M., Su, X. Q., Nian, B., Chen, L. J., Zhang, D. L., Duan, S. M., ... & Ma, Y. (2019). Integrated meta-omics approaches to understand the microbiome of spontaneous fermentation of traditional Chinese pu-erh tea. Msystems, 4(6), 10-1128.
  • Monroe, Matthew. (2018, February 21). Fasta File Splitter. [Computer software]. https://doi.org/10.11578/dc.20180319.27. Github: https://github.com/PNNL-Comp-Mass-Spec/Fasta-File-Splitter
  • Kim, S., & Pevzner, P. A. (2014). MS-GF+ makes progress towards a universal database search tool for proteomics. Nature communications, 5(1), 5277. Github: https://github.com/MSGFPlus/msgfplus
  • Gibbons, Brayson & Monroe, Matthew. Mzid Merger. [Computer software]. Github: https://github.com/PNNL-Comp-Mass-Spec/MzidMerger
  • Hinkle, T. B., & Bakalarski, C. E. (2025). Comprehensive Protein Inference Analysis with PyProteinInference Elucidates Biological Understanding of Tandem Mass Spectrometry Data. Journal of proteome research, 24(4), 2135–2140. https://doi.org/10.1021/acs.jproteome.4c00734. Github: https://github.com/thinkle12/pyproteininference?tab=readme-ov-file

About

FoodProt: food-oriented protein sequence databases and scripts for food metaproteomics

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages