Genome Annotation


RESULTS

METHODS

Genome Assembly and Annotation

Raw reads underwent platform-specific quality control. Illumina reads were processed using BBDuk v38.90 (Bushnell, 2014) to remove adapters and low-quality sequences. Nanopore reads, when available, were trimmed and demultiplexed using Porechop v0.2.4. For datasets containing both Illumina and PacBio reads, SPAdes v3.15.3 (Bankevich et al., 2012) was run in hybrid assembly mode to generate contigs and scaffolds. When only long-read data (PacBio or Nanopore) were provided, assemblies were generated using Flye v2.9 (Kolmogorov et al., 2019). When Illumina reads were available, assemblies were polished with Pilon v1.24 (Walker et al., 2014) using the mapped Illumina reads.
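The tool selection described above can be summarized as a small decision function. The sketch below is illustrative only: the function name `plan_assembly` and its boolean flags are hypothetical, and read-type combinations that the text does not cover (for example, Illumina-only data) are left undecided rather than guessed.

```python
def plan_assembly(has_illumina: bool, has_pacbio: bool, has_nanopore: bool) -> dict:
    """Map the available read types to the tools named in the methods text.

    Hypothetical helper for illustration; it only returns tool names and
    does not run anything.
    """
    qc = []
    if has_illumina:
        qc.append("BBDuk")      # adapter and low-quality trimming
    if has_nanopore:
        qc.append("Porechop")   # trimming and demultiplexing

    if has_illumina and has_pacbio:
        assembler = "SPAdes (hybrid)"
    elif not has_illumina and (has_pacbio or has_nanopore):
        assembler = "Flye"
    else:
        assembler = None        # combination not described in the text

    # Pilon polishing needs mapped Illumina reads.
    polish = "Pilon" if (assembler is not None and has_illumina) else None
    return {"qc": qc, "assembler": assembler, "polish": polish}


print(plan_assembly(has_illumina=True, has_pacbio=True, has_nanopore=False))
print(plan_assembly(has_illumina=False, has_pacbio=False, has_nanopore=True))
```

Returning `None` for uncovered combinations keeps the sketch faithful to the text instead of implying a rule the methods do not state.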

Open reading frames (ORFs) were predicted from assembled scaffolds using Prodigal v2.6.3 (Hyatt et al., 2010). Functional annotation was performed using eggNOG-mapper v2 (Cantalapiedra et al., 2021), referencing multiple databases including KEGG Orthology (KO), COG, and Pfam. Additionally, DIAMOND BLASTP was used to align predicted proteins against specialized databases such as VFDB (Liu et al., 2019), CARD (Jia et al., 2017), and MvirDB (Zhou et al., 2007) to identify virulence factors, antimicrobial resistance genes, and mobile elements. Secretion system proteins were predicted using MacSyFinder TXSScan (Abby et al., 2016). Genome completeness was evaluated using BUSCO v5.4.6 (Manni et al., 2021) with appropriate lineage datasets. Genomic architecture was visualized using Circos v0.69-9 (Krzywinski et al., 2009). Plasmid prediction was performed using Plasmer (Liu et al., 2023), a deep-learning-based classifier for plasmid contigs. Biological pathways were reconstructed from the protein family predictions using MinPath (Ye & Doak, 2009), with KEGG and MetaCyc (Caspi et al., 2020) as the pathway references.

Citations

  1. Bushnell B. BBMap: fast, accurate aligner. Lawrence Berkeley Natl Lab. 2014.
  2. Bankevich A, et al. SPAdes: genome assembly algorithm. J Comput Biol. 2012;19:455–77.
  3. Kolmogorov M, et al. Flye: long-read genome assembler. Nat Biotechnol. 2019;37:540–6.
  4. Walker BJ, et al. Pilon: assembly improvement tool. PLoS One. 2014;9:e112963.
  5. Hyatt D, et al. Prodigal: gene prediction for prokaryotes. BMC Bioinformatics. 2010;11:119.
  6. Cantalapiedra CP, et al. eggNOG-mapper v2. Mol Biol Evol. 2021;38:5825–9.
  7. Liu B, et al. VFDB 2019. Nucleic Acids Res. 2019;47:D687–92.
  8. Jia B, et al. CARD 2017. Nucleic Acids Res. 2017;45:D566–73.
  9. Zhou CE, et al. MvirDB: virulence database. Nucleic Acids Res. 2007;35:D391–4.
  10. Abby SS, et al. MacSyFinder: secretion system modeling. PLoS One. 2016;11:e0155559.
  11. Manni M, et al. BUSCO: genome completeness tool. Nat Protoc. 2021;16:566–80.
  12. Krzywinski M, et al. Circos: circular genome visualization. Genome Res. 2009;19:1639–45.
  13. Liu Z, et al. Plasmer: plasmid classifier. Bioinformatics. 2023;39:btad380.
  14. Caspi R, et al. The MetaCyc database of metabolic pathways and enzymes: 2019 update. Nucleic Acids Res. 2020;48:D445–53.
  15. Ye Y, Doak TG. A parsimony approach to biological pathway reconstruction/inference for genomes and metagenomes. PLoS Comput Biol. 2009;5:e1000465.

RESULTS SUMMARY

QUAST RESULTS

PLASMID DETECTION

Plasmer Predicted Class
contig_2 chromosome
contig_1 chromosome
No plasmids were detected.
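The summary line above follows directly from the per-contig classes. As a minimal sketch, assuming each row is a contig name followed by its predicted class and that plasmid contigs carry the label `plasmid` (an assumption about the label text, since no such row appears in this report), the check could look like this:

```python
# Rows copied from the table above: contig name, predicted class.
table_text = """\
contig_2 chromosome
contig_1 chromosome"""


def plasmid_contigs(text: str) -> list:
    """Return the contigs whose predicted class is 'plasmid' (assumed label)."""
    hits = []
    for line in text.splitlines():
        contig, predicted_class = line.split()
        if predicted_class.lower() == "plasmid":
            hits.append(contig)
    return hits


hits = plasmid_contigs(table_text)
print(", ".join(hits) if hits else "No plasmids were detected.")
```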

GENOME VISUALIZATION

ANNOTATION

Prediction Data Summary
Category Value
contigs 2
bases 4534017
CDS 4685
rRNA 36
tRNA 116
tmRNA 1

MINPATH(KEGG)

  • fam_total:
    The total number of gene families expected in a complete version of the pathway, according to the pathway model. This is the theoretical full complement of families required for that pathway.

  • fam_found:
    The number of gene families that were actually detected in your dataset. These are the families from your annotation that matched parts of the pathway model.

  • Ratio (fam_found/fam_total):
    This ratio indicates the completeness of the pathway as represented in your data. A ratio near 1 means that most of the pathway’s gene families were found, suggesting a likely complete pathway. A lower ratio implies the pathway might be incomplete or missing key components.

  • predicted:
    This column indicates whether MinPath has predicted the pathway as present in your dataset (Yes if the minimal criteria are met, No otherwise).
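To make the relationship between these columns concrete, the sketch below computes the ratio for a few rows and flags pathways that are predicted as present despite a low ratio. The pathway names and counts are hypothetical placeholders, not values from this report, and the 0.5 cut-off is an arbitrary assumption rather than a MinPath setting.

```python
from dataclasses import dataclass


@dataclass
class PathwayRow:
    pathway: str
    fam_total: int
    fam_found: int
    predicted: bool

    @property
    def ratio(self) -> float:
        # fam_found / fam_total; guard against an empty pathway model
        return self.fam_found / self.fam_total if self.fam_total else 0.0


# Hypothetical rows for illustration; not taken from this report.
rows = [
    PathwayRow("example_pathway_A", fam_total=20, fam_found=18, predicted=True),
    PathwayRow("example_pathway_B", fam_total=12, fam_found=3, predicted=True),
    PathwayRow("example_pathway_C", fam_total=8, fam_found=0, predicted=False),
]

LOW_RATIO = 0.5  # assumed cut-off for "worth a closer look"

for row in sorted(rows, key=lambda r: r.ratio, reverse=True):
    flag = "check manually" if row.predicted and row.ratio < LOW_RATIO else ""
    predicted = "Yes" if row.predicted else "No"
    print(f"{row.pathway}\t{row.ratio:.2f}\t{predicted}\t{flag}")
```

A pathway such as the second placeholder row, predicted as present with only a quarter of its families found, is the kind of case where the ratio and the prediction should be read together.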

Caveats:

  • The presence of many gene families does not guarantee that the metabolic pathway is fully functional or active.
  • Incomplete genome assemblies or suboptimal annotation may reduce the detected family counts, thereby reducing the ratio.
  • Therefore, these metrics should be interpreted in the context of other biological evidence.

MINPATH(METACYC)

KRAKEN2 RESULTS

Kraken2 Results
Percentage Reads Taxon_ID Rank NCBI_ID Taxonomy
100 2 0 R 1 root
100 2 0 1 131567 cellular organisms
100 2 0 NA 2 Bacteria
100 2 0 D1 1783272 Terrabacteria group
100 2 0 P 1239 Bacillota
100 2 0 C 91061 Bacilli
100 2 0 O 1385 Bacillales
100 2 0 F 186817 Bacillaceae
100 2 0 G 2837508 Rossellomorea
50 1 1 S 218284 Rossellomorea vietnamensis
50 1 0 G1 2837526 unclassified Rossellomorea
50 1 1 S 3118174 Rossellomorea sp. y25

Percentage: The percentage of all reads in the dataset that fall within the clade rooted at this taxon, that is, reads assigned to the taxon itself or to any of its descendants. A higher percentage means a larger share of the reads belongs to that clade.

Reads: The number of reads (sequences) in that clade. This is the raw count behind the percentage.

Taxon_ID: Despite its label, this column is not an identifier. In the standard Kraken2 report layout it holds the number of reads assigned directly to the taxon, excluding reads assigned to its descendants. In the table above, for example, Rossellomorea has 2 reads in its clade but 0 assigned directly, because both reads were classified at species level.

Rank: The taxonomic level at which the reads were classified. Common rank codes include:

R for root, D for domain, P for phylum, C for class, O for order, F for family, G for genus, S for species. A numeric suffix (for example, D1 or G1) marks an intermediate rank below the lettered one.

NCBI_ID: The unique identifier of the taxon in the NCBI Taxonomy Database. It can be used to look up more detailed information about the organism in NCBI’s taxonomy database.

Taxonomy: The scientific name of the taxon (e.g., Rossellomorea vietnamensis), identifying the group to which the reads were assigned.
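As a worked example of reading these columns, the sketch below parses the last four rows of the table above and lists the species-level assignments. It assumes whitespace-separated fields in the order shown, with the third column holding the count of reads assigned directly to the taxon, as in the standard Kraken2 report layout. The function name `parse_rows` and the dictionary keys are hypothetical.

```python
# Rows copied from the Kraken2 table above (genus level and below).
report_text = """\
100 2 0 G 2837508 Rossellomorea
50 1 1 S 218284 Rossellomorea vietnamensis
50 1 0 G1 2837526 unclassified Rossellomorea
50 1 1 S 3118174 Rossellomorea sp. y25"""


def parse_rows(text: str) -> list:
    """Split each row into five fixed fields plus a free-text taxon name."""
    rows = []
    for line in text.splitlines():
        # maxsplit=5 keeps names that contain spaces in one piece
        pct, clade, direct, rank, ncbi_id, name = line.split(maxsplit=5)
        rows.append({
            "percentage": float(pct),
            "clade_reads": int(clade),
            "direct_reads": int(direct),
            "rank": rank,
            "ncbi_id": int(ncbi_id),
            "name": name,
        })
    return rows


rows = parse_rows(report_text)
species = [(r["name"], r["percentage"]) for r in rows if r["rank"] == "S"]
for name, pct in species:
    print(f"{name}: {pct:.0f}% of reads")
```

Here the two sequences are split evenly between two Rossellomorea species, so the genus is supported by all reads while neither species is.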

BUSCO RESULTS

___