Validation of MetaMobilePicker

To validate the results of the pipeline, we constructed a simulated metagenomic dataset. From the PATRIC database (Wattam et al. [2014]) (accessed on 3 March 2021), we selected one representative complete genome for each of seven highly resistant bacterial species: E. coli, S. aureus, E. faecalis, M. tuberculosis, S. enterica, K. pneumoniae and A. baumannii.

In addition to these genomes, five phage genomes associated with the respective species were selected. Table 1 lists the strains and accession IDs of the selected bacterial genomes, the number of plasmids per genome and the accessions of the selected phages.

Using the selected genomes, we simulated a dataset of 20 million 150 bp reads with InSilicoSeq (Gourlé et al. [2019]) and its MiSeq error profile. To simulate different relative abundances of the genomes, we generated abundance profiles for all species using a lognormal distribution. Additionally, a copy number for each plasmid was drawn from a geometric distribution with probability P = min(1, log10(length)/7), so that shorter plasmids are more likely to have a high copy number than longer plasmids. For the phages, a similar approach was used, but with probability P = min(1, log10(length)/5) to compensate for the smaller average genome size of the phages. This copy number was multiplied by the abundance of the corresponding genome. The complete genome assemblies, plasmids and phages were annotated using ABRicate with the ResFinder (Florensa et al. [2022]) database to catalogue the AMR genes present. Additionally, ISEScan (Xie and Tang [2017]) was used to identify the IS elements present.
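The abundance and copy-number scheme described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the script used to build the dataset: the lognormal parameters (mu = 0, sigma = 1), the normalisation to proportions, the function names and the example genome IDs and plasmid lengths are all assumptions made for the example.

```python
import math
import random


def copy_number_probability(length, divisor):
    """Success probability P = min(1, log10(length) / divisor) from the text."""
    if length <= 1:
        raise ValueError("length must be greater than 1 bp")
    return min(1.0, math.log10(length) / divisor)


def sample_geometric(p, rng):
    """Draw k >= 1 with P(k) = (1 - p)**(k - 1) * p by inverse transform."""
    if p >= 1.0:
        return 1
    u = rng.random()  # uniform on [0, 1)
    return max(1, math.ceil(math.log(1.0 - u) / math.log(1.0 - p)))


def simulate_abundances(hosts, seed=0):
    """hosts maps a genome ID to the lengths (bp) of its plasmids."""
    rng = random.Random(seed)
    raw = {}
    for genome_id, plasmid_lengths in hosts.items():
        # Assumed lognormal parameters (mu=0, sigma=1); the text gives none.
        host_abundance = rng.lognormvariate(0.0, 1.0)
        raw[genome_id] = host_abundance
        for i, length in enumerate(plasmid_lengths, start=1):
            p = copy_number_probability(length, divisor=7)
            copies = sample_geometric(p, rng)
            # Copy number is multiplied by the abundance of the host genome.
            raw[f"{genome_id}_plasmid{i}"] = copies * host_abundance
    total = sum(raw.values())
    # Normalising to proportions is an assumption about the downstream format.
    return {name: value / total for name, value in raw.items()}


# Hypothetical genome IDs and plasmid lengths, for illustration only.
example = simulate_abundances({"genome_A": [5_000, 120_000], "genome_B": []})
```

Because the mean of this geometric distribution is 1/P, a smaller P (shorter sequence) yields a higher expected copy number. For phages, the same functions would be called with divisor=5.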

To trace the contigs assembled during the pipeline's run back to the plasmids and phages, the reads used in the assembly were mapped onto the assembled contigs. InSilicoSeq retains the identifier of the original sequence as part of the read header. These identifiers were counted per contig, and plasmid and phage contigs were identified using a majority vote. Contigs of inconclusive molecular origin (i.e. a mix of phage, plasmid and chromosome) were checked manually and labeled ambiguous if no suitable conclusion could be reached.
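The majority vote described above could look like the following sketch. It assumes the read-to-contig mapping has already been reduced to (contig ID, source sequence ID) pairs and that a lookup from source sequence to replicon type is available; the function name, the strict-majority threshold and the "needs_review" label are illustrative choices, not taken from the pipeline.

```python
from collections import Counter, defaultdict


def assign_origins(read_hits, replicon_type, min_fraction=0.5):
    """Label each contig by the replicon type most of its mapped reads come from.

    read_hits: iterable of (contig_id, source_sequence_id) pairs.
    replicon_type: dict mapping source_sequence_id to "chromosome",
        "plasmid" or "phage".
    """
    votes_per_contig = defaultdict(Counter)
    for contig_id, source_id in read_hits:
        votes_per_contig[contig_id][replicon_type[source_id]] += 1

    labels = {}
    for contig_id, votes in votes_per_contig.items():
        label, count = votes.most_common(1)[0]
        total = sum(votes.values())
        # Without a strict majority the contig is flagged for manual checking.
        labels[contig_id] = label if count / total > min_fraction else "needs_review"
    return labels


# Hypothetical identifiers, for illustration only.
replicons = {"chr_1": "chromosome", "pl_1": "plasmid", "ph_1": "phage"}
hits = [
    ("contig_1", "pl_1"), ("contig_1", "pl_1"), ("contig_1", "pl_1"),
    ("contig_1", "chr_1"),
    ("contig_2", "ph_1"), ("contig_2", "pl_1"), ("contig_2", "chr_1"),
]
origins = assign_origins(hits, replicons)
```

Here contig_1 is labeled as plasmid (3 of 4 reads), while contig_2 has no majority and is set aside for review.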

Table 1: Selected species for the simulated validation data
Accession	Species	Strain	No. plasmids	Phage accession
CP034123.1	Klebsiella pneumoniae	BJCFK909	4	NC_025418.1
NC_021733.1	Acinetobacter baumannii	BJAB0715	1	NC_031117.1
NZ_CP017593.1	Mycobacterium tuberculosis	Beijing-like/35049	0	JF937092.1
NZ_CP018642.1	Salmonella enterica	74-1357	1	NA
NZ_CP027788.1	Staphylococcus aureus	CMRSA-6	0	NC_016565.1
NZ_CP041877.1	Enterococcus faecalis	SCAID PHRX1-2018	1	NC_023595.2
NZ_CP044148.1	Escherichia coli O157	AR-0427	2	NA

Using metaSPAdes (Nurk et al. [2017]) as part of MetaMobilePicker, the 20 million reads were assembled into contigs. This resulted in 1,513 contigs, of which 887 were longer than 1 kb, with an N50 of 82,245 bp and an L50 of 97. The total assembly length was 29,747,115 bases.
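For readers unfamiliar with the assembly statistics quoted above, the sketch below shows one common way to compute them: N50 is the length of the contig at which the cumulative length of the contigs, sorted from longest to shortest, first reaches half of the total assembly length, and L50 is the number of contigs needed to get there. The function name and the toy contig lengths are illustrative and are not the lengths from this assembly.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for rank, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            # 'length' is the N50, 'rank' the number of contigs used (L50).
            return length, rank


# Toy contig lengths, for illustration only.
toy_n50, toy_l50 = n50_l50([100, 80, 50, 30, 20, 10])
```

In the toy example the total length is 290, so the two longest contigs (100 + 80 = 180) are enough to pass the halfway point of 145.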

Plasmid identification

The resulting contigs were filtered to a minimum length of 1 kb and run in parallel through the MGE identification modules of MetaMobilePicker. Plasmid sequences were identified with PlasClass (Pellow et al. [2020]) using a classification score cut-off of 0.8, which resulted in 111 contigs predicted as plasmids. These predictions were cross-referenced with the origin of each contig. The resulting confusion matrix is shown in Table 2.

Table 2: Plasmid classification confusion matrix
	Predicted plasmids	Predicted non-plasmid
Plasmids	65	19
Non-plasmid	46	776

Of the contigs, 65 were correctly predicted as plasmids and 776 were correctly predicted as chromosomal, whereas 46 and 19 contigs were wrongly predicted as plasmid and chromosomal, respectively. This gives a precision of 0.59, a recall of 0.77 and an F1 score of 0.67.

Investigating the origin of each prediction class shows an overrepresentation of E. coli chromosomal fragments among the plasmid predictions: of the 46 false positives, 30 originated from the E. coli genome. This overrepresentation might be due to a combination of factors. First, chromosomal contigs predicted as plasmids (FP) are on average shorter than chromosomal contigs not predicted as plasmids (TN) [p < 0.01], and contigs originating from the E. coli genome are on average shorter than the contigs of most other species, with the exception of E. faecalis. This possibly explains part of the E. coli overrepresentation. Second, PlasClass may have an algorithmic bias stemming from the abundance of E. coli in known plasmids. The lengths of the E. faecalis contigs do not differ significantly from those of E. coli, yet E. faecalis accounts for only one FP. E. faecalis is found less often in plasmid reference databases, which may explain this difference.
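The reported metrics follow directly from the counts in Table 2. As a small worked check (the function name is illustrative):

```python
def classification_metrics(tp, fp, fn):
    """Precision, recall and F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from Table 2: 65 true positives, 46 false positives, 19 false negatives.
precision, recall, f1 = classification_metrics(tp=65, fp=46, fn=19)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
# prints "precision=0.59 recall=0.77 F1=0.67"
```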

Insertion Sequences

To detect IS elements in the simulated metagenome validation data, we ran the ISEScan module of MetaMobilePicker on the 887 contigs longer than 1 kb. In total, 241 IS in 144 metagenomic contigs were identified, compared to 417 IS in the reference genomes, resulting in an accuracy of 58.5%. To test whether all IS elements in the reference genomes were represented in the metagenomic contigs, we then mapped the metagenomic contigs to the reference genomes. This showed that a metagenomic contig mapped to 317 of the IS annotated in the reference genomes (76.0%), whereas a metagenomic contig was missing entirely for the other 100 IS (24.0%). Of the 317 IS with a corresponding metagenomic contig, 88 (27.8%) corresponded to 18 metagenomic contigs that each mapped to multiple IS in the reference genomes. These 18 contigs are therefore likely to represent assembly collapses. Furthermore, 8 IS annotated in the reference genomes (2.5%) corresponded only partially to a metagenomic contig. The remaining 221 IS (69.7%) each corresponded to a metagenomic contig that mapped uniquely to that IS.

To contrast the detection accuracy of ISEScan before and after metagenome assembly, we also calculated the maximum number of IS elements that could be detected in the metagenome-assembled contigs. To this end, we discounted all IS elements that were lost or only partially assembled during metagenome assembly by mapping all contigs back to the IS elements identified in the reference genomes, as described above. When taking into account only the 221 uniquely mapping IS, the 18 contigs representing the 88 ambiguously mapping IS and the 8 partial mappings, which together are all the IS-containing contigs ISEScan had access to, ISEScan detected 227 and achieved an accuracy of 90.85%.

Bacteriophages

To measure the performance of DeepVirFinder (Ren et al. [2020]) on our metagenomic assembly, we cross-referenced the contigs predicted by DeepVirFinder with a classification score greater than 0.95 with the contigs that originated from the five phages in the dataset. This analysis yielded 27 predictions, of which 5 originate from the phages added to our community. These 5 phage contigs are the full-length assemblies of these phages. Of the 22 contigs not originating from our added phages, 14 have a direct link to phage DNA, most likely originating from prophages. Another 4 contigs are putatively linked to phages, containing hits not exclusively associated with phages. The remaining 4 contigs show no clear link to phage DNA. Since we are unable to determine the number of prophages or phage-related genes not identified by DeepVirFinder, we did not take the 18 phage-related contigs into account when calculating the classification metrics. This leaves 9 scored predictions (5 true positives and 4 false positives), resulting in a recall of 1.0, a precision of 0.556 and an F1 score of 0.714.
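The exclusion step described above can be expressed with simple set operations. The sketch below is only an illustration: the contig identifiers are placeholders that reproduce the counts in the text (5 added phages, 18 phage-related contigs, 4 unrelated contigs), and the function name is hypothetical.

```python
def score_with_exclusions(predicted, truth, excluded):
    """Score predictions against truth after removing contigs that cannot be judged."""
    scored = predicted - excluded
    tp = len(scored & truth)
    fp = len(scored - truth)
    fn = len(truth - predicted)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder contig IDs matching the counts reported in the text.
added_phages = {f"phage_{i}" for i in range(5)}
phage_related = {f"related_{i}" for i in range(18)}  # 14 direct + 4 putative links
unrelated = {f"unrelated_{i}" for i in range(4)}
predictions = added_phages | phage_related | unrelated  # 27 predictions in total

phage_precision, phage_recall, phage_f1 = score_with_exclusions(
    predictions, added_phages, phage_related
)
```

Excluding the phage-related contigs, rather than counting them as false positives, avoids penalising predictions that may be correct but cannot be verified against the simulated ground truth.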

AMR identification

To test the performance of the AMR identification steps, we annotated the validation genomes using the ResFinder database and cross-referenced the metagenomic hits with the genomic hits. Of the 55 AMR genes predicted in the reference genomes, 41 were identified in the correct genome. For 11 of the remaining 14 hits, the gene was found, but not on the correct genome. These 11 hits comprised 7 unique AMR genes. The remaining 3 hits, comprising 3 unique AMR genes, were not found in the metagenome. Further investigation of the 7 genes that were identified but not in all the correct genomes shows that 6 of them were identified only once in the metagenomic assembly. Of the 6 corresponding contigs, 4 have less than 75% of their mapped reads originating from a single genome. For all 4 of these ambiguous contigs, a sizable share of the mapped reads originates from the genomes in which the AMR gene should have been identified. This suggests that, as with the IS, these genes were collapsed during the metagenomic assembly step.
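The three outcomes distinguished above (found in the correct genome, found but in another genome, not found at all) can be sketched as a comparison of (gene, genome) pairs. The function name, the category labels and the toy gene and genome names are hypothetical and do not come from the validation data.

```python
def cross_reference(reference_hits, metagenome_hits):
    """Classify each reference (gene, genome) hit against the metagenomic hits."""
    genes_in_metagenome = {gene for gene, _ in metagenome_hits}
    outcome = {"correct_genome": [], "other_genome": [], "not_found": []}
    for hit in sorted(reference_hits):
        gene, _ = hit
        if hit in metagenome_hits:
            outcome["correct_genome"].append(hit)
        elif gene in genes_in_metagenome:
            # The gene was recovered, but assigned to a different genome.
            outcome["other_genome"].append(hit)
        else:
            outcome["not_found"].append(hit)
    return outcome


# Toy example with placeholder names, for illustration only.
reference = {
    ("gene_a", "genome_1"),
    ("gene_a", "genome_2"),
    ("gene_b", "genome_1"),
    ("gene_c", "genome_3"),
}
metagenome = {("gene_a", "genome_1"), ("gene_b", "genome_1")}
amr_outcome = cross_reference(reference, metagenome)
```

In the toy data, gene_a is present in two reference genomes but recovered for only one of them, which is the pattern expected when identical copies of a gene collapse into a single contig during assembly.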

References

[DLD+20]

Enrique Doster, Steven M. Lakin, Christopher J. Dean, Cory Wolfe, Jared G. Young, Christina Boucher, Keith E. Belk, Noelle R. Noyes, and Paul S. Morley. MEGARes 2.0: a database for classification of antimicrobial drug, biocide and metal resistance determinants in metagenomic sequence data. Nucleic Acids Research, 48(D1):D561–D569, January 2020. doi:10.1093/nar/gkz1010.

[EKS+21]

A. Murat Eren, Evan Kiefl, Alon Shaiber, Iva Veseli, Samuel E. Miller, Matthew S. Schechter, Isaac Fink, Jessica N. Pan, Mahmoud Yousef, Emily C. Fogarty, Florian Trigodet, Andrea R. Watson, Özcan C. Esen, Ryan M. Moore, Quentin Clayssen, Michael D. Lee, Veronika Kivenson, Elaina D. Graham, Bryan D. Merrill, Antti Karkman, Daniel Blankenberg, John M. Eppley, Andreas Sjödin, Jarrod J. Scott, Xabier Vázquez-Campos, Luke J. McKay, Elizabeth A. McDaniel, Sarah L. R. Stevens, Rika E. Anderson, Jessika Fuessel, Antonio Fernandez-Guerra, Lois Maignien, Tom O. Delmont, and Amy D. Willis. Community-led, integrated, reproducible multi-omics with anvi’o. Nature Microbiology, 6(1):3–6, January 2021. URL: https://www.nature.com/articles/s41564-020-00834-3 (visited on 2022-11-03), doi:10.1038/s41564-020-00834-3.

[FKC+22]

Alfred Ferrer Florensa, Rolf Sommer Kaas, Philip Thomas Lanken Conradsen Clausen, Derya Aytan-Aktug, and Frank M. Aarestrup. ResFinder – an open online resource for identification of antimicrobial resistance genes in next-generation sequencing data and prediction of phenotypes from genotypes. Microbial Genomics, 8(1):000748, January 2022. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8914360/ (visited on 2022-11-15), doi:10.1099/mgen.0.000748.

[GKLHBR19]

Hadrien Gourlé, Oskar Karlsson-Lindsjö, Juliette Hayer, and Erik Bongcam-Rudloff. Simulating Illumina metagenomic data with InSilicoSeq. Bioinformatics, 35(3):521–522, February 2019. URL: https://doi.org/10.1093/bioinformatics/bty630 (visited on 2022-05-03), doi:10.1093/bioinformatics/bty630.

[HCL+10]

Doug Hyatt, Gwo-Liang Chen, Philip F. LoCascio, Miriam L. Land, Frank W. Larimer, and Loren J. Hauser. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11(1):119, March 2010. URL: https://doi.org/10.1186/1471-2105-11-119 (visited on 2022-11-03), doi:10.1186/1471-2105-11-119.

[KBZ+20]

Silas Kieser, Joseph Brown, Evgeny M. Zdobnov, Mirko Trajkovski, and Lee Ann McCue. ATLAS: a Snakemake workflow for assembly, annotation, and genomic binning of metagenome sequence data. BMC Bioinformatics, 21(1):257, June 2020. URL: https://doi.org/10.1186/s12859-020-03585-4 (visited on 2022-05-03), doi:10.1186/s12859-020-03585-4.

[LD09]

Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics (Oxford, England), 25(14):1754–1760, July 2009. doi:10.1093/bioinformatics/btp324.

[MJL+21]

Felix Mölder, Kim Philipp Jablonski, Brice Letcher, Michael B. Hall, Christopher H. Tomkins-Tinch, Vanessa Sochat, Jan Forster, Soohyun Lee, Sven O. Twardziok, Alexander Kanitz, Andreas Wilm, Manuel Holtgrewe, Sven Rahmann, Sven Nahnsen, and Johannes Köster. Sustainable data analysis with Snakemake. F1000Research, 10:33, 2021. doi:10.12688/f1000research.29032.2.

[NMKP17]

Sergey Nurk, Dmitry Meleshko, Anton Korobeynikov, and Pavel A. Pevzner. metaSPAdes: a new versatile metagenomic assembler. Genome Research, 27(5):824–834, May 2017. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5411777/ (visited on 2022-05-03), doi:10.1101/gr.213959.116.

[PMS20]

David Pellow, Itzik Mizrahi, and Ron Shamir. PlasClass improves plasmid sequence classification. PLOS Computational Biology, 16(4):e1007781, April 2020. URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007781 (visited on 2022-05-03), doi:10.1371/journal.pcbi.1007781.

[RSD+20]

Jie Ren, Kai Song, Chao Deng, Nathan A. Ahlgren, Jed A. Fuhrman, Yi Li, Xiaohui Xie, Ryan Poplin, and Fengzhu Sun. Identifying viruses from metagenomic data using deep learning. Quantitative Biology, 8(1):64–77, March 2020. URL: https://doi.org/10.1007/s40484-019-0187-4 (visited on 2022-05-03), doi:10.1007/s40484-019-0187-4.

[WAD+14]

Alice R. Wattam, David Abraham, Oral Dalay, Terry L. Disz, Timothy Driscoll, Joseph L. Gabbard, Joseph J. Gillespie, Roger Gough, Deborah Hix, Ronald Kenyon, Dustin Machi, Chunhong Mao, Eric K. Nordberg, Robert Olson, Ross Overbeek, Gordon D. Pusch, Maulik Shukla, Julie Schulman, Rick L. Stevens, Daniel E. Sullivan, Veronika Vonstein, Andrew Warren, Rebecca Will, Meredith J.C. Wilson, Hyun Seung Yoo, Chengdong Zhang, Yan Zhang, and Bruno W. Sobral. PATRIC, the bacterial bioinformatics database and analysis resource. Nucleic Acids Research, 42(Database issue):D581–D591, January 2014. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965095/ (visited on 2022-11-14), doi:10.1093/nar/gkt1099.

[XT17]

Zhiqun Xie and Haixu Tang. ISEScan: automated identification of insertion sequence elements in prokaryotic genomes. Bioinformatics (Oxford, England), 33(21):3340–3347, November 2017. doi:10.1093/bioinformatics/btx433.