The official website of the HHMI Science Education Alliance-Phage Hunters Advancing Genomics and Evolutionary Science program.

Abstract Summary

This abstract was last modified on March 18, 2021 at 12:15 a.m.

Purdue University
Corresponding Faculty Member: Kari Clase, kclase@purdue.edu
This abstract will NOT be considered for a talk.
A Machine Learning Approach to Bacteriophage Function Call Annotation
Yug Rao, Emily Kerstiens, Kari Clase

Bacteriophages are viruses that infect and replicate in bacterial cells, and they are the most abundant biological entities on Earth. There is growing interest in applying bacteriophages in areas such as genetic engineering for antibacterial purposes. Two major steps in characterizing a novel bacteriophage are start site selection and functional annotation. Whereas many algorithms exist to determine candidate start sites in a draft bacteriophage genome, there are no comparable dedicated function prediction tools. Function is currently assigned with tools such as NCBI BLAST, which compare a target gene sequence against every other gene in every bacteriophage, a process that scales poorly. As more bacteriophages are discovered and annotated, this limitation will only worsen. In recent years, powerful machine learning architectures that can learn nonlinear relationships in high-dimensional data have been introduced and popularized. Models such as the Recurrent Neural Network (RNN) and the Long Short-Term Memory network (LSTM) are often used in natural language processing, where long strings of text are parsed to extract their meaning. This research proposes training a neural network to parse the base pair sequence of a gene and predict that gene's function. Function prediction would then be almost instantaneous, in contrast to the long runtime of a BLAST search. It would also be much easier to find functional relationships between smaller phams.
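
To make the proposed idea concrete, the following is a minimal sketch of how a recurrent network could read a gene one base at a time and output a probability for each candidate function. It is not the authors' model: it uses a plain (Elman-style) RNN cell in place of an LSTM, its weights are random and untrained, and the names `TinyRNNClassifier`, `FUNCTION_LABELS`, and the example sequence are all hypothetical, chosen only for illustration.

```python
import math
import random

BASES = "ACGT"
# Hypothetical label set for illustration only; not taken from the abstract.
FUNCTION_LABELS = ["terminase", "portal protein", "major capsid protein", "unknown"]


def one_hot(base):
    """Encode one nucleotide as a length-4 vector; unrecognized bases become all zeros."""
    return [1.0 if base == b else 0.0 for b in BASES]


def init_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these from annotated genes."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TinyRNNClassifier:
    """Untrained Elman-style RNN: a simplified stand-in for the RNN/LSTM idea."""

    def __init__(self, hidden_size=8, n_classes=len(FUNCTION_LABELS), seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        self.w_in = init_matrix(hidden_size, len(BASES), rng)
        self.w_rec = init_matrix(hidden_size, hidden_size, rng)
        self.w_out = init_matrix(n_classes, hidden_size, rng)

    def predict_proba(self, sequence):
        """Read the sequence base by base, then classify from the final hidden state."""
        hidden = [0.0] * self.hidden_size
        for base in sequence.upper():
            from_input = matvec(self.w_in, one_hot(base))
            from_state = matvec(self.w_rec, hidden)
            hidden = [math.tanh(a + b) for a, b in zip(from_input, from_state)]
        return softmax(matvec(self.w_out, hidden))


model = TinyRNNClassifier()
probs = model.predict_proba("ATGGCGTTAGCC")  # made-up sequence
predicted = FUNCTION_LABELS[probs.index(max(probs))]
```

Because the weights are random, `predicted` carries no biological meaning here. The point of the sketch is the cost profile: a single pass over the gene takes time proportional to its length and does not depend on how many annotated phage genes exist, whereas a database comparison grows with the size of the database.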