OP-38 Identifying Viruses from Host Genomes and Deep Learning for Prediction of Viral Integration Sites
Presenting Author: Zhongming Zhao, University of Texas Health Science Center at Houston
Abstract: Viral infections are common in nature. Detecting viruses in host genomes effectively and efficiently, and tracking how viruses interact with those genomes, remain major challenges. We recently developed an algorithm called VERSE (Virus intEgration sites through iterative Reference SEquence customization), which can effectively detect viruses carrying viral mutations from next-generation sequencing data. VERSE improves detection by customizing reference genomes. Using 19 human tumors and cancer cell lines as test data, we demonstrated that VERSE substantially enhanced the sensitivity of virus integration site detection. VERSE has been used by large network projects such as the International Cancer Genome Consortium (ICGC; 25k whole-genome sequencing data). We next manually collected and curated viral integration sites (VISs; 77,632 sites in total) from published works and made them publicly available through VISDB, the Viral Integration Site DataBase. Furthermore, we developed a deep learning method, DeepVISP, for viral integration site prediction and motif discovery. DeepVISP is based on a deep convolutional neural network (CNN) model with an attention architecture. We demonstrated that DeepVISP can accurately predict oncogenic VISs in the human genome using our curated benchmark integration data for three viruses: hepatitis B virus (HBV), human papillomavirus (HPV), and Epstein-Barr virus (EBV). Compared with six classical machine learning methods, DeepVISP achieves higher accuracy and more robust performance for all three viruses by automatically learning informative features and essential genomic positions from the DNA sequences alone. A user-friendly web server has been developed for predicting putative oncogenic VISs in the human genome using DeepVISP.
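Illustrative note: the abstract states that DeepVISP learns from DNA sequences alone but does not describe how sequences are encoded. A common choice for sequence-based CNNs is one-hot encoding, sketched below as an assumption rather than as the authors' implementation. The function name `one_hot_encode`, the A/C/G/T channel order, and the all-zero treatment of ambiguous bases are hypothetical.

```python
from typing import List

# Assumed channel order; the abstract does not specify an encoding.
ALPHABET = "ACGT"


def one_hot_encode(seq: str) -> List[List[int]]:
    """Return a len(seq) x 4 matrix of 0/1 values.

    Bases outside A/C/G/T (for example N) are encoded as all zeros.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    matrix = []
    for base in seq.upper():
        row = [0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1
        matrix.append(row)
    return matrix
```

A matrix of this shape, built from a fixed-length window around a candidate site, is the kind of input a convolutional layer could scan for sequence motifs.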