
METHODS

Quality Control and De Novo Assembly

Raw paired-end metagenomic reads were processed with BBDuk v38.90 (Bushnell, 2014) to remove adapter sequences, contaminants, and low-quality bases. Quality-filtered reads from all samples were then co-assembled with MEGAHIT v1.2.9 (Li et al., 2015).

Gene Prediction and Abundance Estimation

Open reading frames (ORFs) were predicted from the assembled contigs using Prodigal v2.6.3 (Hyatt et al., 2010) in metagenomic mode. Abundances of predicted genes and contigs were estimated by aligning each sample's reads back to the co-assembly with Bowtie2 v2.4.4 (Langmead & Salzberg, 2012) and applying coverage-based quantification.

Taxonomic and Functional Annotation

Predicted protein sequences were annotated taxonomically by aligning them against the NCBI NR database using DIAMOND v2.1.8 (Buchfink et al., 2021), followed by lowest common ancestor (LCA) assignment at the contig level. Functional annotations were assigned using multiple curated databases: KEGG (Kanehisa et al., 2016), COG, Pfam, eggNOG (Huerta-Cepas et al., 2019), CAZy (Lombard et al., 2014), BacMet (Pal et al., 2014), ARGs-OAP (Yang et al., 2016), CARD (Jia et al., 2017), and NCycDB (Tu et al., 2019).
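The idea behind LCA assignment can be sketched in a few lines of Python. This is an illustration only, not the pipeline's implementation: it assumes each hit is given as a root-to-leaf lineage list, and the function name and example lineages are hypothetical. Real LCA tools typically also filter hits by score or identity before taking the consensus.

```python
def lowest_common_ancestor(lineages):
    """Return the longest rank-by-rank prefix shared by all lineages."""
    if not lineages:
        return []
    shared = []
    # zip stops at the shortest lineage, so the result never exceeds it.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


# Hypothetical lineages of three hits for the ORFs on one contig.
hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Pseudomonadales"],
    ["Bacteria", "Proteobacteria", "Alphaproteobacteria"],
]
print(lowest_common_ancestor(hits))  # ['Bacteria', 'Proteobacteria']
```

The hits disagree below the phylum level, so the assignment falls back to the deepest rank they all share.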

Metagenomic Binning and Genome Refinement

Contigs were binned into metagenome-assembled genomes (MAGs) using three complementary tools: MaxBin v2.2.7 (Wu et al., 2016), MetaBAT2 v2.15 (Kang et al., 2019), and CONCOCT v1.1.0 (Alneberg et al., 2014). Binning results were integrated using DAS Tool v1.1.5 (Sieber et al., 2018). The quality of the recovered MAGs was assessed with CheckM v1.2.2 (Parks et al., 2015), which estimates completeness and contamination from lineage-specific marker genes.
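As a rough illustration of how marker genes translate into completeness and contamination, the sketch below uses a deliberately simplified calculation. It assumes a flat list of single-copy markers with observed copy numbers per bin; CheckM itself groups markers into collocated sets and selects them per lineage, so its numbers will differ. The function name and marker identifiers are hypothetical.

```python
def bin_quality(marker_counts):
    """Simplified completeness/contamination (percent) for one bin.

    marker_counts maps each expected single-copy marker to the number of
    copies found in the bin. This is NOT CheckM's exact algorithm.
    """
    total = len(marker_counts)
    if total == 0:
        raise ValueError("at least one marker gene is required")
    present = sum(1 for copies in marker_counts.values() if copies >= 1)
    extra = sum(copies - 1 for copies in marker_counts.values() if copies > 1)
    completeness = 100.0 * present / total
    contamination = 100.0 * extra / total
    return completeness, contamination


# Hypothetical bin: one marker missing, one marker present twice.
counts = {"m1": 1, "m2": 1, "m3": 0, "m4": 2, "m5": 1}
completeness, contamination = bin_quality(counts)
print(completeness, contamination)
```

Missing markers lower completeness, while duplicated single-copy markers suggest that sequence from more than one genome ended up in the bin.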

Metabolic Pathway Reconstruction

Functional genes were mapped to metabolic pathways using MinPath v1.4 (Ye & Doak, 2009), enabling parsimonious pathway inference based on KEGG and MetaCyc identifiers.

Statistical Analyses and Visualization

Taxonomic and functional profiles were subjected to ecological and statistical analyses. Alpha- and beta-diversity metrics were calculated to assess community structure. Visualizations included bar plots, heatmaps, and correlation matrices. Differential abundance was tested using Metastats (White et al., 2009), the Wilcoxon rank-sum test, the t-test, DESeq2 v1.38.3 (Love et al., 2014), and LEfSe (Segata et al., 2011); PERMANOVA was used for multivariate comparisons.

CITATION

Alneberg J, Bjarnason BS, de Bruijn I, et al. Binning metagenomic contigs by coverage and composition. Nat Methods. 2014;11:1144–6.
Buchfink B, Reuter K, Drost HG. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods. 2021;18:366–8.
Bushnell B. BBMap: a fast, accurate, splice-aware aligner. Lawrence Berkeley National Lab. 2014.
Hyatt D, Chen GL, LoCascio PF, et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010;11:119.
Jia B, Raphenya AR, Alcock B, et al. CARD 2017: comprehensive antibiotic resistance database. Nucleic Acids Res. 2017;45:D566–73.
Kang DD, Li F, Kirton E, et al. MetaBAT 2: adaptive binning algorithm for genome reconstruction. PeerJ. 2019;7:e7359.
Kanehisa M, Sato Y, Kawashima M, et al. KEGG as a reference for gene/protein annotation. Nucleic Acids Res. 2016;44:D457–62.
Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9:357–9.
Li D, Liu CM, Luo R, et al. MEGAHIT: ultra-fast metagenomic assembly. Bioinformatics. 2015;31:1674–6.
Liu B, Zheng D, Jin Q, et al. VFDB 2019: comparative pathogenomic platform. Nucleic Acids Res. 2019;47:D687–92.
Lombard V, Golaconda Ramulu H, Drula E, et al. The CAZy database. Nucleic Acids Res. 2014;42:D490–5.
Love MI, Huber W, Anders S. DESeq2: fold change estimation for RNA-seq. Genome Biol. 2014;15:550.
Pal C, Bengtsson-Palme J, Rensing C, et al. BacMet: biocide and metal resistance genes database. Nucleic Acids Res. 2014;42:D737–43.
Parks DH, Imelfort M, Skennerton CT, et al. CheckM: quality assessment of microbial genomes. Genome Res. 2015;25:1043–55.
Segata N, Izard J, Waldron L, et al. Metagenomic biomarker discovery. Genome Biol. 2011;12:R60.
Sieber CMK, Probst AJ, Sharrar A, et al. Genome recovery from metagenomes using DAS Tool. Nat Microbiol. 2018;3:836–43.
Tamames J, Puente-Sánchez F. SqueezeMeta pipeline for metagenomic analysis. Front Microbiol. 2019;9:3349.
Tu Q, Lin L, Cheng L, et al. NCycDB: nitrogen cycling gene database. Bioinformatics. 2019;35:1040–8.
Urban M, Cuzick A, Seager J, et al. PHI-base: pathogen–host interactions database. Nucleic Acids Res. 2020;48:D613–20.
White JR, Nagarajan N, Pop M. Statistical methods for differentially abundant features. PLoS Comput Biol. 2009;5:e1000352.
Wu YW, Simmons BA, Singer SW. MaxBin 2.0: binning algorithm for metagenomes. Bioinformatics. 2016;32:605–7.
Yang Y, Jiang X, Chai B, et al. ARGs-OAP: pipeline for ARG detection in metagenomes. Bioinformatics. 2016;32:2346–51.
Ye Y, Doak TG. Parsimony approach to pathway inference. PLoS Comput Biol. 2009;5:e1000465.

RESULTS INTERPRETATION

📁 0.AnalysisFiles

This folder contains the core sequence outputs of assembly and gene prediction. Project.fna, Project.fasta, and Project.faa contain the nucleotide sequences, assembled contigs, and protein sequences, respectively. Project.gff provides feature annotations that map ORFs and gene locations onto the contigs. The QUAST subdirectory reports assembly quality metrics such as N50, GC content, and contig statistics, giving an overview of assembly contiguity and quality.
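For readers unfamiliar with N50, the following Python sketch shows how the statistic is defined. It is an illustration with made-up contig lengths, not QUAST's code, and the function name is hypothetical.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total of lengths, taken
    from longest to shortest, first reaches half of the assembly size."""
    if not contig_lengths:
        raise ValueError("no contigs supplied")
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths in bp (total 300, so the threshold is 150).
lengths = [100, 80, 50, 30, 20, 10, 10]
print(n50(lengths))  # 80
```

A higher N50 means that more of the assembly sits in long contigs; it describes contiguity and says nothing on its own about correctness.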

📁 1.Statistics

This directory summarizes overall mapping and annotation efficiency. Mapping_Stats.xls and mapping_summary.html report how well the sequencing reads aligned to the assembled contigs. ORF_Stats.xls gives open reading frame counts per sample, while Annotation_summary.html provides a high-level overview of how sequences matched functional databases such as KEGG and COG.

📁 2.Taxa

This folder provides taxonomic classification results from kingdom down to species. Each rank is represented by .html and .png plots, along with legends and abundance tables (relative_abundance_*.csv). The Rank_Specific_Abundance subfolder holds rank-wise abundance matrices. Krona.html offers an interactive circular taxonomic chart for browsing microbial community composition.

📁 3.KEGG and 📁 4.COG

These folders hold the functional profiling results. Each contains raw count matrices (*_Raw_Count.xls), normalized abundance tables (RPKM, TPM), and corresponding interactive .html summaries. The data represent the predicted functional potential of the microbiome, with annotations mapped to KEGG orthologs or COG categories, and are useful for pathway reconstruction and metabolic inference.
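The relationship between the raw counts and the RPKM and TPM tables can be sketched as follows. This is a minimal Python illustration with made-up numbers. It assumes the library size is the sum of the counts shown, whereas a pipeline may use the total number of mapped reads per sample. The function names and feature identifiers are hypothetical.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of feature per million reads."""
    total_reads = sum(counts.values())
    if total_reads == 0:
        raise ValueError("no reads counted")
    return {f: counts[f] * 1e9 / (lengths_bp[f] * total_reads) for f in counts}


def tpm(counts, lengths_bp):
    """Transcripts-per-million style scaling: length-normalize first,
    then rescale so that the values of one sample sum to one million."""
    rate = {f: counts[f] / lengths_bp[f] for f in counts}
    scale = sum(rate.values())
    if scale == 0:
        raise ValueError("no reads counted")
    return {f: rate[f] / scale * 1e6 for f in counts}


# Hypothetical sample: the longer feature collects proportionally more reads.
counts = {"featureA": 100, "featureB": 300}
lengths_bp = {"featureA": 1000, "featureB": 3000}
print(rpkm(counts, lengths_bp))
print(tpm(counts, lengths_bp))
```

Both features end up with equal normalized values here because their read counts scale exactly with their lengths. TPM values always sum to one million within a sample, which makes them easier to compare across samples than RPKM.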

📁 5.Alpha-diversity

This folder contains within-sample (alpha) diversity metrics for the taxonomic and functional profiles. Diversity indices (e.g., Shannon, Simpson) are provided in .xls files, with .png and .html visualizations for the COG, KEGG, and Taxa datasets. These indices help assess the richness and evenness of the microbial and functional communities in each sample.
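The two indices named above can be computed from a vector of counts as in the Python sketch below. It is an illustration only: it assumes the natural-log form of Shannon and the Gini-Simpson form (1 minus the sum of squared proportions) of Simpson, and the report's tables may use a different base or variant. Function names and counts are hypothetical.

```python
import math


def shannon(counts):
    """Shannon index H' = -sum(p * ln p) over features with non-zero counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no counts")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson(counts):
    """Gini-Simpson index 1 - sum(p^2); higher values mean more even samples."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no counts")
    return 1.0 - sum((c / total) ** 2 for c in counts)


even = [25, 25, 25, 25]     # four equally abundant taxa
skewed = [97, 1, 1, 1]      # same richness, one dominant taxon
print(shannon(even), simpson(even))
print(shannon(skewed), simpson(skewed))
```

Both samples contain four taxa, but the even sample scores higher on both indices, which is how these metrics capture evenness in addition to richness.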

📁 6.Rarefaction

Rarefaction curves for the KEGG, COG, and Taxa datasets are stored here in .html format. These plots show whether sequencing depth was sufficient, that is, whether the observed diversity has leveled off rather than being limited by under-sampling. They serve as a quality control check for the diversity metrics and comparative analyses.
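A rarefaction curve can be illustrated with the classical analytical formula for expected richness in a random subsample drawn without replacement. The Python sketch below is an assumption about how such curves can be computed, not a description of the pipeline, which may use repeated random subsampling instead. The function name and counts are hypothetical, and exact binomials become slow for very deep samples.

```python
from math import comb


def expected_richness(counts, depth):
    """Expected number of distinct features in a random subsample of
    `depth` reads drawn without replacement from `counts`."""
    total = sum(counts)
    if not 0 <= depth <= total:
        raise ValueError("depth must lie between 0 and the total count")
    all_draws = comb(total, depth)
    # For each feature: 1 - P(feature is absent from the subsample).
    return sum(1 - comb(total - c, depth) / all_draws for c in counts if c > 0)


# Hypothetical sample of 100 reads spread over five taxa.
counts = [50, 30, 15, 4, 1]
curve = [(d, expected_richness(counts, d)) for d in range(0, 101, 10)]
for depth, richness in curve:
    print(depth, round(richness, 2))
```

A curve that flattens well before the full depth suggests that additional sequencing would reveal few new features; a curve still rising at the end points to under-sampling.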

📁 7.Ordination

This folder includes beta-diversity and multivariate analyses such as PCoA (2D and 3D), PERMANOVA, and pairwise comparisons for the KEGG, COG, and Taxa data. These analyses help identify patterns in community composition across groups, with .xls outputs for eigenvalues and distance matrices and .html visualizations of the principal coordinate plots.
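The distance matrices that feed PCoA and PERMANOVA can be illustrated with a small Python sketch. It assumes Bray-Curtis dissimilarity, a common choice for abundance data; the report does not state which distance was used, so treat the metric, the function names, and the sample values as assumptions.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors (0 = identical)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    denominator = sum(a) + sum(b)
    if denominator == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / denominator


def distance_matrix(samples):
    """Symmetric matrix of pairwise dissimilarities, keyed by sample name."""
    names = list(samples)
    return {
        i: {j: bray_curtis(samples[i], samples[j]) for j in names}
        for i in names
    }


# Hypothetical abundance profiles over three features.
samples = {"S1": [10, 0, 5], "S2": [5, 5, 5], "S3": [10, 0, 5]}
matrix = distance_matrix(samples)
print(matrix["S1"]["S2"], matrix["S1"]["S3"])
```

PCoA then places samples in a low-dimensional space that preserves these pairwise distances as closely as possible, and PERMANOVA tests whether between-group distances exceed within-group distances.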

📁 8.Correlation

This folder provides correlation heatmaps for the functional profiles. Pearson correlation analyses are visualized in .html and .pdf formats, highlighting co-occurrence and similarity patterns between KEGG and COG functions across samples.
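Each cell of such a heatmap is a Pearson correlation coefficient between two abundance profiles. A minimal Python sketch of the calculation follows; the function name and the example profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two profiles of equal length (at least 2 values)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return covariance / (spread_x * spread_y)


# Hypothetical abundances of three functions across four samples.
func_a = [1.0, 2.0, 3.0, 4.0]
func_b = [2.0, 4.0, 6.0, 8.0]   # rises together with func_a
func_c = [8.0, 6.0, 4.0, 2.0]   # falls as func_a rises
print(pearson(func_a, func_b), pearson(func_a, func_c))
```

Values near 1 indicate functions that rise and fall together across samples, values near -1 indicate opposing trends, and values near 0 indicate no linear relationship.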

📁 9.Heatmap

Functional abundance heatmaps in this folder display the top 10, 20, 50, and 100 most abundant KEGG and COG features. Both normalized abundance tables (*_Normalized.xls) and high-resolution plots (.html, .pdf) are included, allowing visual comparison of key functional features across all samples.

📁 10.Venn

This folder contains Venn diagrams comparing shared and unique features across sample groups for the COG, KEGG, and Taxa datasets. The data are available as .xls tables and .html visualizations, offering a clear overview of group-specific and overlapping microbial or functional signatures.
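The regions of a Venn diagram correspond to features grouped by exactly which sample groups contain them. The Python sketch below illustrates that bookkeeping; the group names, feature identifiers, and function name are hypothetical, and the report's own criterion for calling a feature present in a group is not stated here.

```python
def venn_regions(groups):
    """Map each combination of groups to the features found in exactly
    that combination (one entry per non-empty Venn region)."""
    regions = {}
    all_features = set().union(*groups.values())
    for feature in all_features:
        members = frozenset(
            name for name, features in groups.items() if feature in features
        )
        regions.setdefault(members, set()).add(feature)
    return regions


# Hypothetical feature sets detected in two sample groups.
groups = {
    "Control": {"K1", "K2", "K3"},
    "Treated": {"K2", "K3", "K4"},
}
regions = venn_regions(groups)
for members, features in sorted(regions.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(members), sorted(features))
```

Features in a single-group region are candidates for group-specific signatures, while the shared region represents the common core.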