Benchmarking the empirical accuracy of short-read sequencing across the M. tuberculosis genome
SM ISO690:2012
MARIN, Maximillian G., VARGAS, Roger, HARRIS, Michael, JEFFREY, Brendan, EPPERSON, L. Elaine, DURBIN, David, STRONG, Michael, SALFINGER, Max, IQBAL, Zamin, AKHUNDOVA, Irada, VASHAKIDZE, Sergo, KRUDU, V., ROSENTHAL, Alex, FARHAT, Maha Reda. Benchmarking the empirical accuracy of short-read sequencing across the M. tuberculosis genome. In: Bioinformatics, 2022, no. 7(38), pp. 1781-1787. ISSN 1367-4803. DOI: https://doi.org/10.1093/bioinformatics/btac023
Bioinformatics
Issue 7(38) / 2022 / ISSN 1367-4803 / eISSN 1367-4811

DOI: https://doi.org/10.1093/bioinformatics/btac023

Pages 1781-1787

Marin Maximillian G.1, Vargas Roger1, Harris Michael2, Jeffrey Brendan3, Epperson L. Elaine3, Durbin David3, Strong Michael3, Salfinger Max4, Iqbal Zamin5, Akhundova Irada6, Vashakidze Sergo7, Krudu V.8,9, Rosenthal Alex2, Farhat Maha Reda10
 
1 Harvard Medical School, Boston,
2 National Institute of Allergy and Infectious Diseases, Department of Health and Human Services, Bethesda,
3 National Jewish Health, Denver,
4 University of South Florida,
5 European Bioinformatics Institute (EMBL-EBI), Hinxton,
6 Scientific Research Institute of Lung Diseases, Ministry of Health, Baku,
7 National Center for Tuberculosis and Lung Diseases, Tbilisi,
8 Institute of Phthisiopneumology "Chiril Draganiuc",
9 Ministry of Health of the Republic of Moldova,
10 Massachusetts General Hospital, Boston
 
Available in IBN: 23 May 2022


Abstract

Motivation: Short-read whole-genome sequencing (WGS) is a vital tool for clinical applications and basic research. Genetic divergence from the reference genome, repetitive sequences and sequencing bias reduce the performance of variant calling using short-read alignment, but the loss in recall and specificity has not been adequately characterized. To benchmark short-read variant calling, we used 36 diverse clinical Mycobacterium tuberculosis (Mtb) isolates dually sequenced with Illumina short reads and PacBio long reads. We systematically studied the short-read variant calling accuracy and the influence of sequence uniqueness, reference bias and GC content.

Results: Reference-based Illumina variant calling demonstrated a maximum recall of 89.0% and a minimum precision of 98.5% across the parameters evaluated. The approach that maximized variant recall while still maintaining high precision (>99%) was tuning the mapping quality (MQ) filtering threshold, i.e. the confidence of the read mapping (recall = 85.8%, precision = 99.1%, MQ ≥ 40). Additional masking of repetitive sequence content is an alternative, conservative approach to variant calling that increases precision at a cost to recall (recall = 70.2%, precision = 99.6%, MQ ≥ 40). Of the genomic positions typically excluded for Mtb, 68% are accurately called using Illumina WGS, including 52/168 PE/PPE genes (34.5%). From these results, we present a refined list of low-confidence regions across the Mtb genome, which we found to frequently overlap with regions of structural variation, low sequence uniqueness and low sequencing coverage. Our benchmarking results have broad implications for the use of WGS in the study of Mtb biology, for inference of transmission in public health surveillance systems and, more generally, for WGS applications in other organisms.
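
To make the trade-off between recall and precision under an MQ filter concrete, the following is a minimal Python sketch. It is not the authors' pipeline: the `Call` record, the `precision_recall` function and the toy positions are hypothetical and chosen only for illustration. It assumes each short-read call carries a single mapping quality value and that a truth set of variants (in the study, derived from the long-read data) is available as `(pos, ref, alt)` tuples.

```python
from typing import Iterable, NamedTuple, Set, Tuple

Variant = Tuple[int, str, str]  # (position, reference allele, alternate allele)


class Call(NamedTuple):
    """Hypothetical short-read variant call with its mapping quality."""
    pos: int
    ref: str
    alt: str
    mq: float  # assumed: one MQ value summarising the reads at this site


def precision_recall(calls: Iterable[Call], truth: Set[Variant],
                     min_mq: float = 40.0) -> Tuple[float, float]:
    """Score the calls that pass the MQ filter against the truth set."""
    kept = {(c.pos, c.ref, c.alt) for c in calls if c.mq >= min_mq}
    tp = len(kept & truth)   # called and true
    fp = len(kept - truth)   # called but not true
    fn = len(truth - kept)   # true but missed or filtered out
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (invented): four true variants, four short-read calls.
truth = {(100, "A", "G"), (250, "C", "T"), (400, "G", "A"), (900, "T", "C")}
calls = [
    Call(100, "A", "G", 60.0),
    Call(250, "C", "T", 55.0),
    Call(400, "G", "A", 20.0),  # true variant in a poorly mapped region
    Call(700, "C", "A", 10.0),  # false positive in a poorly mapped region
]

for threshold in (0, 40):
    p, r = precision_recall(calls, truth, min_mq=threshold)
    print(f"MQ >= {threshold}: precision = {p:.2f}, recall = {r:.2f}")
```

In this toy example, raising the threshold removes the false positive but also discards a true variant, which mirrors the direction of the trade-off reported in the abstract; the magnitudes here carry no meaning.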

Keywords
article, bacterium isolate, benchmarking, DNA base composition, filtration, masking, Mycobacterium tuberculosis, nonhuman, public health surveillance, recall, whole genome sequencing