OP-18 A comprehensive benchmarking of WGS-based structural variant callers
Presenting Author: Varuni Sarwal, University of California Los Angeles
Co-Author(s): Sebastian Niehus, Berlin Institute of Health (BIH), Anna-Louisa-Karsch-Str. 2, 10178 Berlin, Germany; Eleazar Eskin, University of California Los Angeles; Jonathan Flint, University of California Los Angeles; Serghei Mangul, serghei.mangul@gmail.com
Abstract: Advances in whole-genome sequencing (WGS) promise to enable accurate and comprehensive structural variant (SV) discovery. Dissecting SVs from WGS data presents a substantial number of challenges, and a plethora of SV-detection methods have been developed. Currently, there is a paucity of evidence that investigators can use to select appropriate SV-detection tools. In this project, we evaluated the performance of SV-detection tools on mouse and human WGS data using a comprehensive PCR-confirmed gold standard set of SVs and the Genome in a Bottle (GIAB) variant set, respectively. In contrast to previous benchmarking studies, our mouse gold standard dataset included a complete set of SVs, allowing us to report both the precision and the sensitivity of SV-detection methods. Our study investigates the ability of the methods to detect deletions and thus provides an optimistic estimate of SV-detection performance, as methods that fail to detect deletions are likely to miss more complex SVs. We found that SV-detection tools varied widely in their performance, with several methods providing a good balance between sensitivity and precision. Manta was the top-performing tool for both mouse and human data, with F-score values consistently above 0.6. Additionally, we determined the SV callers best suited for low- and ultra-low-pass sequencing data as well as for different deletion length categories. We hope that the results reported in this benchmarking study can help researchers choose appropriate variant calling tools based on the organism, data coverage, and deletion length.
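For readers unfamiliar with the metrics named above, the following is a minimal sketch of how precision, sensitivity, and the F-score are conventionally derived from counts of true-positive, false-positive, and false-negative deletion calls. The function name `sv_metrics` and the example counts are hypothetical and are not taken from the study; the abstract does not state how calls were matched to the gold standard, so the sketch starts from counts that are assumed to be already available.

```python
def sv_metrics(tp: int, fp: int, fn: int) -> dict:
    """Return precision, sensitivity, and F-score from call counts.

    tp: calls that match a gold-standard deletion
    fp: calls with no matching gold-standard deletion
    fn: gold-standard deletions that the caller missed
    (How a "match" is decided is an assumption left outside this sketch.)
    """
    # Guard against division by zero when a caller reports nothing
    # or the gold standard is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # F-score here is the harmonic mean of precision and sensitivity (F1).
    if precision + sensitivity == 0:
        f_score = 0.0
    else:
        f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {"precision": precision, "sensitivity": sensitivity, "f_score": f_score}


# Hypothetical counts, for illustration only (not results from the study).
example = sv_metrics(tp=60, fp=20, fn=40)
print(example)
```

Because the F-score is a harmonic mean, a caller needs both reasonable precision and reasonable sensitivity to score well; a tool that is strong on only one of the two is penalized.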