Genome Assembly and Annotation
Raw reads underwent platform-specific quality control. Illumina reads were processed using BBDuk v38.90 (Bushnell, 2014) to remove adapters and low-quality sequences. Nanopore reads, when available, were additionally trimmed and demultiplexed using Porechop v0.2.4. For datasets containing both Illumina and PacBio reads, SPAdes v3.15.3 (Bankevich et al., 2012) was run in hybrid assembly mode to generate contigs and scaffolds. When only long-read data (PacBio or Nanopore) were provided, assemblies were generated using Flye v2.9 (Kolmogorov et al., 2019). Where Illumina reads were available, assemblies were polished using Pilon v1.24 (Walker et al., 2014) with the mapped Illumina reads.
Open reading frames (ORFs) were predicted from assembled scaffolds using Prodigal v2.6.3 (Hyatt et al., 2010). Functional annotation was performed using eggNOG-mapper v2 (Cantalapiedra et al., 2021), referencing multiple databases including KEGG Orthology (KO), COG, and Pfam. Additionally, DIAMOND BLASTP was used to align predicted proteins against specialized databases such as VFDB (Liu et al., 2019), CARD (Jia et al., 2017), and MvirDB (Zhou et al., 2007) to identify virulence factors, antimicrobial resistance genes, and mobile elements. Secretion system proteins were predicted using MacSyFinder TXSScan (Abby et al., 2016). Genome completeness was evaluated using BUSCO v5.4.6 (Manni et al., 2021) with appropriate lineage datasets. Genomic architecture was visualized using Circos v0.69-9 (Krzywinski et al., 2009). Plasmid prediction was performed using Plasmer (Liu et al., 2023), a deep learning-based classifier for plasmid contigs. Biological pathways were reconstructed with MinPath (Ye & Doak, 2009) from the protein family predictions, using KEGG and MetaCyc (Caspi et al., 2020) as reference pathway collections.
Contig | Prediction |
---|---|
contig_2 | chromosome |
contig_1 | chromosome |
## No plasmid is detected.
Category | Value |
---|---|
contigs | 2 |
bases | 4534017 |
CDS | 4685 |
rRNA | 36 |
tRNA | 116 |
tmRNA | 1 |
fam_total:
The total number of gene families expected in a complete version of the
pathway (according to the model). This number represents the theoretical
full complement of families required for that pathway.
fam_found:
The number of gene families that were actually detected in your dataset.
These are the families from your annotation that matched parts of the
pathway model.
Ratio (fam_found/fam_total):
This ratio indicates the completeness of the pathway as represented in
your data. A ratio near 1 means that most of the pathway’s gene families
were found, suggesting a likely complete pathway. A lower ratio implies
the pathway might be incomplete or missing key components.
predicted:
This column indicates whether MinPath has predicted the pathway as
present in your dataset (Yes if the minimal criteria are met, No
otherwise).
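
The relationship between these columns can be sketched in a few lines of Python. This is only an illustration of how the ratio is derived from `fam_found` and `fam_total`; the `PathwayRow` class, the pathway labels, and all counts below are hypothetical and are not taken from this report or from MinPath itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathwayRow:
    """One row of a pathway table (hypothetical container, not a MinPath API)."""
    pathway: str      # pathway label (hypothetical)
    fam_total: int    # families expected in the complete pathway
    fam_found: int    # families actually detected in the annotation
    predicted: bool   # whether the pathway was predicted as present

    @property
    def ratio(self) -> float:
        # Completeness ratio fam_found / fam_total; 0.0 if the model is empty.
        if self.fam_total == 0:
            return 0.0
        return self.fam_found / self.fam_total


# Hypothetical example rows; these values are NOT results from the report.
rows = [
    PathwayRow("example_pathway_A", fam_total=10, fam_found=9, predicted=True),
    PathwayRow("example_pathway_B", fam_total=8, fam_found=2, predicted=False),
]

# List pathways from most to least complete.
for row in sorted(rows, key=lambda r: r.ratio, reverse=True):
    status = "Yes" if row.predicted else "No"
    print(f"{row.pathway}\t{row.fam_found}/{row.fam_total}\t{row.ratio:.2f}\t{status}")
```

Sorting by the ratio is just one convenient way to read such a table; the ratio and the `predicted` flag are separate columns and are best considered together.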
Caveats:
Percentage | Reads | Taxon_ID | Rank | NCBI_ID | Taxonomy |
---|---|---|---|---|---|
100 | 2 | 0 | R | 1 | root |
100 | 2 | 0 | 1 | 131567 | cellular organisms |
100 | 2 | 0 | NA | 2 | Bacteria |
100 | 2 | 0 | D1 | 1783272 | Terrabacteria group |
100 | 2 | 0 | P | 1239 | Bacillota |
100 | 2 | 0 | C | 91061 | Bacilli |
100 | 2 | 0 | O | 1385 | Bacillales |
100 | 2 | 0 | F | 186817 | Bacillaceae |
100 | 2 | 0 | G | 2837508 | Rossellomorea |
50 | 1 | 1 | S | 218284 | Rossellomorea vietnamensis |
50 | 1 | 0 | G1 | 2837526 | unclassified Rossellomorea |
50 | 1 | 1 | S | 3118174 | Rossellomorea sp. y25 |
Percentage: This column indicates the percentage of reads assigned to a taxon or any of its descendants, relative to the total number of reads in the dataset. A higher percentage indicates a greater proportion of reads assigned to that taxon.
Reads: This column represents the number of reads (sequences) that were classified under the corresponding taxon, including reads assigned to any taxon below it. It shows the raw count of sequences attributed to each taxonomic category.
Taxon_ID: Despite its name, this column is a count rather than an identifier. It gives the number of reads that Kraken2 assigned directly to this taxon, excluding reads assigned to its descendants. For example, a genus row can show 0 here while its species rows each show 1.
Rank: The rank specifies the taxonomic level at which the reads were classified. Common ranks include:
R for root, D for domain, P for phylum, C for class, O for order, F for family, G for genus, S for species. A numeric suffix (for example D1 or G1) marks a taxon that sits below the lettered rank without having a standard rank of its own; the number gives its distance from that rank.
NCBI_ID: This column contains the unique identifier from the NCBI Taxonomy Database for the taxon. Each taxonomic group is associated with a specific NCBI Taxonomy ID, which can be used to find more detailed information about the organism in NCBI’s taxonomy database.
Taxonomy: This column lists the full taxonomic name of the organism (e.g., Salmonella enterica subsp. enterica). It helps in identifying the specific taxon to which the reads have been assigned.
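
As a rough illustration of how these columns fit together, the sketch below parses a few rows copied from the table above and checks that each percentage equals the clade read count divided by the total. The `TaxonRow` type and `parse_row` helper are hypothetical names introduced here, and the code assumes the pipe-delimited layout of this report and that the root row accounts for every read (no unclassified reads).

```python
from typing import NamedTuple


class TaxonRow(NamedTuple):
    percentage: float   # "Percentage" column
    clade_reads: int    # "Reads" column (taxon plus descendants)
    direct_reads: int   # "Taxon_ID" column (reads assigned directly)
    rank: str           # "Rank" column
    ncbi_id: int        # "NCBI_ID" column
    name: str           # "Taxonomy" column


def parse_row(line: str) -> TaxonRow:
    """Parse one pipe-delimited table row (hypothetical helper)."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    pct, reads, direct, rank, ncbi_id, name = cells
    return TaxonRow(float(pct), int(reads), int(direct), rank, int(ncbi_id), name)


# Rows copied from the table above.
lines = [
    "100 | 2 | 0 | R | 1 | root |",
    "100 | 2 | 0 | G | 2837508 | Rossellomorea |",
    "50 | 1 | 1 | S | 218284 | Rossellomorea vietnamensis |",
    "50 | 1 | 1 | S | 3118174 | Rossellomorea sp. y25 |",
]
table = [parse_row(line) for line in lines]

# Assumption: the root row covers all reads, so its count is the total.
total = next(r.clade_reads for r in table if r.rank == "R")

# Each percentage should equal 100 * clade_reads / total.
consistent = all(abs(r.percentage - 100 * r.clade_reads / total) < 1e-9 for r in table)

# Species-level assignments only.
species = [r for r in table if r.rank == "S"]
for r in species:
    print(f"{r.name}: {r.clade_reads} of {total} reads ({r.percentage:.0f}%)")
```

Filtering on the rank code in this way is a simple means of pulling out the species-level calls while keeping the higher ranks available for context.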
___