Talk by Lila Kari (SMSS Colloquium) - CANCELED
Room: 100
Title: Machine Learning and the Mathematics of Genomes
Abstract: In the same way we use the twenty-six letters of the alphabet to write text, and the two bits 0 and 1 to write computer code, the four basic DNA units (Adenine, Cytosine, Guanine, Thymine) are used by Nature to encode information as DNA strands. Theoretically, a DNA strand can be viewed as a “word” over the four-letter alphabet {A, C, G, T}, and the mathematical structure of such words has implications for their biological structure and function.
This talk describes our research into the mathematical properties of genomic DNA sequences by exploring the connection between word frequencies in a genome and the type of organism that the genome belongs to. In particular, I describe our investigation into the Chaos Game Representation of a DNA sequence as a potential “genomic signature” of its species. I also describe how we combine supervised machine learning techniques with such genomic signatures to obtain ultrafast, accurate, and scalable algorithms for species identification and classification. The potential impact of such alignment-free universal classification algorithms could be significant, given that 86% of existing species on Earth and 91% of species in the oceans still await classification.
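
For readers unfamiliar with the Chaos Game Representation, the following is a minimal Python sketch of the general idea, not the speaker's implementation. It assumes one commonly used corner assignment (A bottom-left, C top-left, G top-right, T bottom-right); the names cgr_points and fcgr are hypothetical. Each nucleotide moves the current point halfway toward its corner, and counting how many k-letter words fall into each cell of a 2^k by 2^k grid gives the word-frequency picture that can serve as a signature.

```python
# Assumed corner convention (one common choice, not confirmed by the abstract).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def cgr_points(seq):
    """Return the Chaos Game Representation points of a DNA string.

    Starts at the centre of the unit square and, for each nucleotide,
    moves halfway toward that nucleotide's corner. Symbols outside
    A, C, G, T (for example N) are skipped in this sketch.
    """
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def fcgr(seq, k):
    """Count k-letter words on a 2**k by 2**k grid (grid[row][col]).

    Each word is mapped to the grid cell that its CGR point would land
    in. The cell index is built with integers, so long sequences do not
    suffer from floating-point drift. Words containing a symbol outside
    A, C, G, T are ignored.
    """
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CORNERS for base in kmer):
            continue
        col = row = 0
        # The last letter of the word picks the quadrant (most significant
        # bit), the letter before it picks the sub-quadrant, and so on.
        for base in reversed(kmer):
            cx, cy = CORNERS[base]
            col = col * 2 + cx
            row = row * 2 + cy
        grid[row][col] += 1
    return grid
```

A grid such as fcgr(genome, 6) can be flattened into a vector of 4096 word counts and passed to any supervised classifier; which classifier and which word length the speaker's group uses is not stated in the abstract.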