Transcriptional regulation of gene expression is vital to the life and development of every living organism. Transcription is the mechanism by which genomic DNA is copied into messenger RNA, which is subsequently translated into functional protein. It is primarily controlled by special proteins called transcription factors (TFs) that bind to specific cis-regulatory elements (CREs), called TF binding sites (TFBSs), near their target genes. Identifying TFBSs in the promoter regions of genes is essential for understanding the mechanisms of action (MOAs) underlying the biology of a particular disease or disorder. In the last two decades, many computational approaches have been proposed to locate TFBSs in genomic regions. Because TFBSs are short and degenerate, occur spuriously throughout the genome, and are sparsely represented in biological databases, several inaccurate assumptions have been adopted to model TF DNA-binding specificity. For instance, the positions within a TF motif are assumed to be statistically independent, and thus position-specific scoring matrices (PSSMs) are used to represent TF motifs. The binding energy is assumed to be additive, and hence the TF DNA-binding score for a given k-mer is estimated as the sum of the binding energies of its individual nucleotide positions. TFBSs are also assumed to be conserved across the promoters of related species, and thus phylogenetic footprinting information is employed to locate them. Predefined cut-off thresholds are then applied to the k-mer binding and conservation scores to classify k-mers as functional TFBSs or background sequences. However, crisp thresholds greatly affect the prediction sensitivity of TFBS search systems. Moreover, TF DNA-binding affinity cannot be modelled using base readout information alone; additional information pertaining to the DNA structure should also be employed.
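The additive PSSM model and the crisp cut-off described above can be illustrated with a minimal Python sketch. It is not the implementation used by MATCH, MatInspector or any tool named in this abstract: the log-odds formulation, the uniform background of 0.25, the pseudocount, the function names and the toy binding sites are all assumptions chosen for illustration.

```python
import math

BASES = "ACGT"

def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PSSM from aligned, equal-length binding sites.

    Each column is estimated independently of the others, which is exactly
    the position-independence assumption (illustrative formulation only).
    """
    width = len(sites[0])
    total = len(sites) + pseudocount * len(BASES)
    pssm = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        scores = {}
        for base in BASES:
            freq = (column.count(base) + pseudocount) / total
            scores[base] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def score_kmer(pssm, kmer):
    """Additive score: the sum of the per-position contributions."""
    return sum(column[base] for column, base in zip(pssm, kmer))

def scan(pssm, sequence, threshold):
    """Slide over the sequence and keep k-mers scoring at or above a crisp cut-off."""
    k = len(pssm)
    hits = []
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        score = score_kmer(pssm, kmer)
        if score >= threshold:
            hits.append((start, kmer, score))
    return hits

# Hypothetical aligned sites, invented purely for this example.
toy_sites = ["TGACTC", "TGACTA", "TGAGTC", "TGACTC"]
toy_pssm = build_pssm(toy_sites)
toy_hits = scan(toy_pssm, "AAATGACTCAAA", threshold=5.0)
```

Raising or lowering the `threshold` argument changes which k-mers are reported, which is the sensitivity to crisp cut-offs that the abstract identifies as a weakness of such systems.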
Several machine learning-based methods have recently been presented in the literature to capture the indirect shape readout information of TF-bound regions. These methods, however, are designed to work with small, deliberately constructed biological datasets and fail to model binding affinity using genome-wide binding data.
This dissertation proposes two frameworks for TFBS recognition using genome-wide ChIP-chip binding data. The first framework, denoted by FSCAN, employs fuzzy inference to build threshold-free fuzzy systems for the prediction of TFBSs. FSCAN aims to replace current PSSM-based and phylogenetic-based methods, such as MATCH, MatInspector and ConTra, whose performance depends mainly on the choice of cut-off thresholds. The second framework, denoted by DNNESCAN, is introduced to improve the understanding of TF-DNA interaction mechanisms by integrating base readout, shape readout and phylogenetic footprinting information to recognize TFBSs. A new ensemble learning algorithm, named DNNE, is also proposed in this research study for use within the DNNESCAN framework. In addition to DNNE, FSCAN and DNNESCAN, this dissertation makes the following technical contributions to TFBS identification: (i) it proposes two new features to capture the position dependency of the TF motif, (ii) it introduces novel algorithms to assist in training TFBS classification models on genome-wide binding data, (iii) it characterizes TFBSs with a variety of features, and (iv) it provides two TFBS search tools for the budding yeast genome. The performance of the proposed methods is comprehensively evaluated using synthetic and real datasets and then compared with other methods. The evaluation results show that DNNE outperforms other ensemble learning techniques such as bagging, boosting and random forests in terms of both effectiveness and efficiency.
The prediction accuracy of FSCAN and DNNESCAN is estimated using the F1-measure on real biological datasets for 22 TF proteins. The results show improved performance for our frameworks in comparison with other methods such as MatInspector and MATCH. Our threshold-free FSCAN outperforms MATCH and MatInspector on 14 datasets when those tools use their best-performing cut-off thresholds. The F1-measure of DNNESCAN, computed under 10-fold cross-validation, exceeds that of MatInspector on 21 of the 22 datasets. Overall, this research study contributes to genome-wide TFBS recognition through fuzzy inference and ensemble learning techniques.
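For reference, the F1-measure used above is the harmonic mean of precision and recall. The following Python sketch shows the standard definition applied to predicted and annotated site positions. Representing sites as integer start positions and requiring exact matches are assumptions made for illustration; the matching criterion actually used in the thesis is not stated in this abstract.

```python
def f1_measure(predicted, actual):
    """Return the F1-measure of predicted site positions against annotated ones.

    Sites are compared by exact position, which is a simplifying assumption.
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0  # also covers empty prediction or annotation sets
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)

# Hypothetical start positions, invented for this example:
# two of three predictions are correct and two of three annotated sites are found.
example_f1 = f1_measure({120, 340, 515}, {340, 515, 902})
```

Because F1 penalizes both missed sites and false predictions, it is a reasonable single summary when comparing a threshold-free predictor against tools whose precision and recall shift with the chosen cut-off.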
Submission note: A thesis submitted in total fulfilment of the requirements for the degree of Doctor of Philosophy [to the] School of Engineering and Mathematical Sciences, College of Science, Health and Engineering, La Trobe University, Bundoora.