Vorpal: A Novel RNA Virus Feature-Extraction Algorithm Demonstrated Through Interpretable Genotype-to-Phenotype Linear Models
Phillip Davis, John Bagnoli, David Yarmosh, Alan Shteyman, Lance Presser, Sharon Altmann, Shelton Bradrick, Joseph A Russell III
doi: https://doi.org/10.1101/2020.02.28.969782
Abstract
In the analysis of genomic sequence data, alignment-free approaches are often chosen over alignment-based approaches for their relative speed, especially for distance comparisons and taxonomic classification. These methods typically rely on excising K-length substrings of the input sequence, called K-mers. In machine learning, K-mer-based feature vectors have been used in applications ranging from amplicon sequencing classification to predictive modeling for antimicrobial resistance genes. This is analogous to the bag-of-words model successfully employed in natural language processing and computer vision for document and image classification.
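The K-mer bag-of-words idea described above can be illustrated with a minimal sketch. This is generic K-mer counting, not the Vorpal algorithm itself; the function names `kmer_counts` and `kmer_feature_vector`, the RNA alphabet `ACGU`, and the toy sequence are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k):
    """Count every overlapping K-length substring (K-mer) of a sequence."""
    seq = sequence.upper()
    # A sequence of length n yields n - k + 1 overlapping K-mers.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_feature_vector(sequence, k, alphabet="ACGU"):
    """Return (vocabulary, counts) as a fixed-order bag-of-words vector.

    The vocabulary is every possible K-mer over the assumed alphabet,
    so vectors from different sequences share the same dimensions.
    """
    counts = kmer_counts(sequence, k)
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    return vocabulary, [counts.get(kmer, 0) for kmer in vocabulary]


# Toy RNA sequence (hypothetical, for illustration only).
vocab, vector = kmer_feature_vector("AUGGCAUG", k=2)
```

As in a bag-of-words document model, the vector records how often each K-mer occurs but discards where it occurs. The vocabulary grows as 4^K, so practical pipelines usually keep only the K-mers actually observed rather than enumerating all of them.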
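*Note: This article is a preprint manuscript, a preliminary report that has not been peer reviewed. Its views are offered only for exchange among researchers and are not conclusive; please use it with caution.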