Speaker: Dr Ben O’Neill
Institution: Australian National University
Abstract: An interesting problem that arises in genetic analysis and other contexts is determining the exact or approximate probability distribution of the “string-count” giving the number of occurrences of a fixed character string in a random “text” (i.e., a random vector of symbols). The simplest case of interest is when the symbols in the text are IID categorical random variables. The prima facie simplicity of this problem sometimes attracts the attention of novice analysts, buoyed by the fact that it is easy to compute the probability of a match in the simple case where the length of the text is the same as the length of the string. (And of course, if the text is shorter than the string then the problem is trivial!) However, when the length of the text is larger than the length of the string, the problem becomes complicated, owing to the fact that occurrences of the string at different positions in the text are not independent, and the dependency depends on the structure of the string. Obtaining the probability distribution for the “string-count” in a random text of a given length then requires an excursion into the computational analysis of “formal languages”, which is unfamiliar territory for many statisticians. In this seminar we give a gentle introduction to this problem, showing how to compute the probability distribution for a “string count” in a random text using analysis of Markov chains and “deterministic finite automata”. We begin with a simple example of the problem and discuss generalisations to more complicated versions of the problem, leading to the types of applied models used in genetic analysis. We also draw parallels with other problems in the analysis of Markov chains. The seminar aims to provides an introduction to the statistical analysis of string-counts without any assumed knowledge of the theory of formal languages.
Please email RSFAS enquiries for registration details.