Genome Annotation

RESULTS

METHODS

Genome Assembly and Annotation

Raw reads were subjected to quality control based on their sequencing platform. Illumina reads were processed using BBDuk v38.90 (Bushnell, 2014) to remove adapters and low-quality sequences. Nanopore reads, when available, were additionally trimmed and demultiplexed using Porechop v0.2.4. For datasets containing both Illumina and PacBio reads, SPAdes v3.15.3 (Bankevich et al., 2012) was run in hybrid assembly mode to generate contigs and scaffolds. In cases where only long-read data (PacBio or Nanopore) were provided, assemblies were generated using Flye v2.9 (Kolmogorov et al., 2019). All assemblies were polished using Pilon v1.24 (Walker et al., 2014) with mapped Illumina reads, when available.
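The tool selection described above can be sketched as a small decision function. This is only an illustration of the stated rules, not the pipeline's actual code: the function name `plan_assembly` and its boolean parameters are hypothetical, and read combinations the text does not describe (for example Illumina-only data) are deliberately rejected rather than guessed.

```python
from typing import List


def plan_assembly(has_illumina: bool, has_pacbio: bool, has_nanopore: bool) -> List[str]:
    """Return the ordered tool steps implied by the methods text (hypothetical helper)."""
    steps = []
    if has_illumina:
        steps.append("BBDuk")      # adapter and low-quality trimming of Illumina reads
    if has_nanopore:
        steps.append("Porechop")   # trimming and demultiplexing of Nanopore reads

    if has_illumina and has_pacbio:
        steps.append("SPAdes (hybrid)")   # hybrid mode for Illumina + PacBio
    elif not has_illumina and (has_pacbio or has_nanopore):
        steps.append("Flye")              # long-read-only assembly
    else:
        # The methods text does not say which assembler handles this combination.
        raise ValueError("read combination not described in the methods")

    if has_illumina:
        steps.append("Pilon")      # polishing needs mapped Illumina reads
    return steps


print(plan_assembly(True, True, False))   # ['BBDuk', 'SPAdes (hybrid)', 'Pilon']
print(plan_assembly(False, False, True))  # ['Porechop', 'Flye']
```

Raising an error for undescribed combinations keeps the sketch faithful to the text instead of inventing a fallback.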

Open reading frames (ORFs) were predicted from assembled scaffolds using Prodigal v2.6.3 (Hyatt et al., 2010). Functional annotation was performed using eggNOG-mapper v2 (Cantalapiedra et al., 2021), referencing multiple databases including KEGG Orthology (KO), COG, and Pfam. Additionally, DIAMOND BLASTP was used to align predicted proteins against specialized databases such as VFDB (Liu et al., 2019), CARD (Jia et al., 2017), and MvirDB (Zhou et al., 2007) to identify virulence factors, antimicrobial resistance genes, and mobile elements. Secretion system proteins were predicted using MacSyFinder TXSScan (Abby et al., 2016). Genome completeness was evaluated using BUSCO v5.4.6 (Manni et al., 2021) with appropriate lineage datasets. Genomic architecture was visualized using Circos v0.69-9 (Krzywinski et al., 2009). Plasmid contigs were predicted using Plasmer (Liu et al., 2023), a machine learning-based classifier. Biological pathways were reconstructed with MinPath (Ye & Doak, 2009) from protein family predictions, using KEGG and MetaCyc (Caspi et al., 2020) as reference pathway databases.

Citations

Bushnell B. BBMap: fast, accurate aligner. Lawrence Berkeley Natl Lab. 2014.
Bankevich A, et al. SPAdes: genome assembly algorithm. J Comput Biol. 2012;19:455–77.
Kolmogorov M, et al. Flye: long-read genome assembler. Nat Biotechnol. 2019;37:540–6.
Walker BJ, et al. Pilon: assembly improvement tool. PLoS One. 2014;9:e112963.
Hyatt D, et al. Prodigal: gene prediction for prokaryotes. BMC Bioinformatics. 2010;11:119.
Cantalapiedra CP, et al. eggNOG-mapper v2. Mol Biol Evol. 2021;38:5825–9.
Liu B, et al. VFDB 2019. Nucleic Acids Res. 2019;47:D687–92.
Jia B, et al. CARD 2017. Nucleic Acids Res. 2017;45:D566–73.
Zhou CE, et al. MvirDB: virulence database. Nucleic Acids Res. 2007;35:D391–4.
Abby SS, et al. MacSyFinder: secretion system modeling. PLoS One. 2016;11:e0155559.
Manni M, et al. BUSCO: genome completeness tool. Nat Protoc. 2021;16:566–80.
Krzywinski M, et al. Circos: circular genome visualization. Genome Res. 2009;19:1639–45.
Liu Z, et al. Plasmer: plasmid classifier. Bioinformatics. 2023;39:btad380.
Caspi R, et al. The MetaCyc database of metabolic pathways and enzymes - 2019 update. Nucleic Acids Res. 2020;48:D445–53.
Ye Y, Doak TG. A parsimony approach to biological pathway reconstruction/inference for genomes and metagenomes. PLoS Comput Biol. 2009;5:e1000465.

RESULTS SUMMARY

QUAST RESULTS

PLASMID DETECTION

Plasmer Predicted Class
contig_2	chromosome
contig_1	chromosome

## No plasmid was detected.
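A minimal sketch of how the two-column class table above could be checked for plasmid contigs. The helper name `plasmid_contigs` is hypothetical, and the label `plasmid` is an assumption: the table in this report only shows the `chromosome` label.

```python
from typing import List


def plasmid_contigs(table_text: str) -> List[str]:
    """Return contig names whose predicted class is 'plasmid' (label assumed)."""
    hits = []
    for line in table_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and '##' messages
        fields = line.split()
        if len(fields) != 2:
            continue                      # skip title lines that are not contig/class pairs
        contig, predicted_class = fields
        if predicted_class.lower() == "plasmid":
            hits.append(contig)
    return hits


# The two rows reported above.
report = "contig_2\tchromosome\ncontig_1\tchromosome\n"
print(plasmid_contigs(report))  # prints []
```

An empty list corresponds to the message that no plasmid was found.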

GENOME VISUALIZATION

ANNOTATION

Prediction Data Summary
Category	Value
contigs	2
bases	4534017
CDS	4685
rRNA	36
tRNA	116
tmRNA	1

MINPATH(KEGG)

fam_total:
The total number of gene families expected in a complete version of the pathway (according to the model). This number represents the theoretical full complement of families required for that pathway.

fam_found:
The number of gene families that were actually detected in your dataset. These are the families from your annotation that matched parts of the pathway model.

Ratio (fam_found/fam_total):
This ratio indicates the completeness of the pathway as represented in your data. A ratio near 1 means that most of the pathway’s gene families were found, suggesting a likely complete pathway. A lower ratio implies the pathway might be incomplete or missing key components.

predicted:
This column indicates whether MinPath has predicted the pathway as present in your dataset (Yes if the minimal criteria are met, No otherwise).

Caveats:

The presence of many gene families does not guarantee that the metabolic pathway is fully functional or active.
Incomplete genome assemblies or suboptimal annotation may reduce the detected family counts, thereby reducing the ratio.
Therefore, these metrics should be interpreted in the context of other biological evidence.
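As a worked illustration of the ratio defined above, the following sketch computes fam_found/fam_total for two pathways. The pathway names and counts are hypothetical and are not taken from this report; the function name `pathway_completeness` is likewise only illustrative.

```python
def pathway_completeness(fam_found: int, fam_total: int) -> float:
    """Return fam_found / fam_total, the completeness ratio described above."""
    if fam_total <= 0:
        raise ValueError("fam_total must be positive")
    return fam_found / fam_total


# Hypothetical pathways and counts, for illustration only.
examples = {"pathway_A": (8, 10), "pathway_B": (3, 12)}
for name, (found, total) in examples.items():
    ratio = pathway_completeness(found, total)
    print(f"{name}: {found}/{total} = {ratio:.2f}")
# pathway_A: 8/10 = 0.80
# pathway_B: 3/12 = 0.25
```

Under the reading given above, pathway_A would look largely complete and pathway_B would not, subject to the same caveats about assembly and annotation quality.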

MINPATH(METACYC)

KRAKEN2 RESULTS

Kraken2 Results
Percentage	Reads	Taxon_ID	Rank	NCBI_ID	Taxonomy
100	2	0	R	1	root
100	2	0	1	131567	cellular organisms
100	2	0	NA	2	Bacteria
100	2	0	D1	1783272	Terrabacteria group
100	2	0	P	1239	Bacillota
100	2	0	C	91061	Bacilli
100	2	0	O	1385	Bacillales
100	2	0	F	186817	Bacillaceae
100	2	0	G	2837508	Rossellomorea
50	1	1	S	218284	Rossellomorea vietnamensis
50	1	0	G1	2837526	unclassified Rossellomorea
50	1	1	S	3118174	Rossellomorea sp. y25

Percentage: This column gives the percentage of all reads in the dataset that were assigned to the clade rooted at this taxon, that is, to the taxon itself or to any of its descendants. A higher percentage indicates a greater proportion of reads assigned within that clade.

Reads: This column gives the number of reads (sequences) assigned to the clade rooted at this taxon. It is the raw count behind the Percentage column.

Taxon_ID: Despite its name, this column is not an identifier. In the standard Kraken2 report layout, the third column gives the number of reads assigned directly to this taxon, excluding reads assigned to its descendants. In the table above, for example, the genus Rossellomorea has 2 reads in its clade but 0 assigned directly, because both reads were classified at species level.

Rank: The rank specifies the taxonomic level at which the reads were classified. Common ranks include:

R for root, D for domain, P for phylum, C for class, O for order, F for family, G for genus, S for species. A code followed by a number (for example D1 or G1) marks an intermediate level below the lettered rank.

NCBI_ID: This column contains the unique identifier from the NCBI Taxonomy Database for the taxon. Each taxonomic group is associated with a specific NCBI Taxonomy ID, which can be used to find more detailed information about the organism in NCBI’s taxonomy database.

Taxonomy: This column lists the full taxonomic name of the organism (e.g., Salmonella enterica subsp. enterica). It helps in identifying the specific taxon to which the reads have been assigned.
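The column layout can be made concrete with a short parsing sketch over a few rows copied from the table above. It assumes the third column follows the standard Kraken2 report layout (reads assigned directly to the taxon); the names `KrakenRow` and `parse_rows` are hypothetical.

```python
from typing import List, NamedTuple


class KrakenRow(NamedTuple):
    percentage: float   # share of reads in the clade rooted at this taxon
    clade_reads: int    # reads assigned to the taxon or any descendant
    direct_reads: int   # reads assigned to the taxon itself (assumed meaning of column 3)
    rank: str
    ncbi_id: int
    name: str


def parse_rows(text: str) -> List[KrakenRow]:
    """Parse tab-separated report rows (without the header line)."""
    rows = []
    for line in text.strip().splitlines():
        pct, clade, direct, rank, ncbi_id, name = line.split("\t")
        rows.append(KrakenRow(float(pct), int(clade), int(direct),
                              rank, int(ncbi_id), name.strip()))
    return rows


TABLE = "\n".join([
    "100\t2\t0\tR\t1\troot",
    "100\t2\t0\tG\t2837508\tRossellomorea",
    "50\t1\t1\tS\t218284\tRossellomorea vietnamensis",
    "50\t1\t0\tG1\t2837526\tunclassified Rossellomorea",
    "50\t1\t1\tS\t3118174\tRossellomorea sp. y25",
])

rows = parse_rows(TABLE)
species = [r.name for r in rows if r.rank == "S"]
print(species)  # ['Rossellomorea vietnamensis', 'Rossellomorea sp. y25']

# Directly assigned reads across all taxa add up to the root's clade count.
print(sum(r.direct_reads for r in rows) == rows[0].clade_reads)  # True
```

The final check shows why the clade and direct counts differ: each read is counted once directly, and again in every ancestor's clade total.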

BUSCO RESULTS
