A distributed framework to generate machine learning features based on protein, DNA and RNA sequences
High-throughput sequencing technologies have generated a huge number of biological sequences over the past decades, including protein, DNA and RNA sequences. Accordingly, many machine learning methods have been developed on top of these sequences, providing powerful toolkits for surveying, classifying and predicting biological data. However, transforming raw sequence information into more meaningful features that incorporate sequence order and capture sequence patterns, before feeding them into computational models, remains a significant challenge. Here, we present DIFFUSER, a distributed framework that efficiently and comprehensively generates a broad spectrum of heterogeneous features from protein, DNA and RNA sequences. DIFFUSER improves on existing feature generators in three ways: 1) a new distributed architecture that speeds up online feature generation by n times using decentralized, parallel computing and distributed storage; 2) the most comprehensive feature coverage, offering the largest number of features across the broadest spectrum as an all-in-one service; and 3) both a user-friendly web server and a cross-platform standalone toolkit of unified design, providing a consistent feature generation service with full support for feature customization.
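To make the idea of turning raw sequences into fixed-length features concrete, the following is a minimal sketch of one common descriptor, normalized k-mer composition, computed for several sequences in parallel. It is not DIFFUSER's code or API: the names `kmer_frequencies` and `generate_features`, the DNA-only alphabet, and the use of a local thread pool in place of a distributed backend are all assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical default alphabet; a protein version would use the 20 amino acids.
DNA_ALPHABET = "ACGT"


def kmer_frequencies(sequence, k=2, alphabet=DNA_ALPHABET):
    """Return normalized k-mer frequencies as a fixed-length list.

    The vector has len(alphabet) ** k entries, ordered lexicographically
    by k-mer. K-mers containing symbols outside the alphabet (e.g. 'N')
    are ignored.
    """
    sequence = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


def generate_features(sequences, k=2, workers=4):
    """Compute k-mer features for many sequences using a local worker pool.

    Illustrative stand-in for a distributed backend: each sequence is an
    independent task, so the same map could be sent to processes or nodes.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: kmer_frequencies(s, k), sequences))


if __name__ == "__main__":
    features = generate_features(["AAAA", "ACGT"], k=2)
    for vector in features:
        print(len(vector), vector)
```

Because each sequence is processed independently, this kind of feature generation parallelizes naturally; a thread pool is used here only to keep the sketch self-contained, and CPU-bound workloads would typically use processes or a cluster scheduler instead.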
The following browsers are supported by this website:
Windows: Chrome, Firefox, Internet Explorer 8+, Opera
Mac: Chrome, Firefox, Opera, Safari
Linux: Chrome, Firefox
- Wang J et al. DIFFUSER: A distributed framework to generate machine learning features based on protein, DNA and RNA sequences. 2018, Submitted for publication.
Lithgow Group
Infection and Immunity Program
Biomedicine Discovery Institute
Faculty of Medicine, Nursing and Health Sciences
Monash University
Melbourne, VIC 3800, Australia