Theera Piroonratana. An information-theoretic approach to pattern recognition and its application in life science. Doctoral Degree (Electrical Engineering). King Mongkut's University of Technology North Bangkok. Central Library: King Mongkut's University of Technology North Bangkok, 2010.
An information-theoretic approach to pattern recognition and its application in life science
Abstract:
This thesis addresses two life science problems that can be tackled using
information-theoretic approaches for pattern recognition. The first problem covers the
identification of ancestry informative markers (AIMs) from genome-wide single
nucleotide polymorphisms (SNPs). A protocol for AIM extraction is proposed. The
protocol consists of three main steps: (a) identification of potential positive selection
regions via FST extremity measurement, (b) SNP screening via two-stage attribute
selection and (c) classification model construction using a naïve Bayes classifier. The
two-stage attribute selection is composed of a newly developed round robin
symmetrical uncertainty ranking technique and a wrapper embedded with a naïve
Bayes classifier. The protocol has been applied to the HapMap Phase II data. Two
AIM panels, which consist of 10 and 16 SNPs that lead to complete classification
between CEU, CHB, JPT and YRI populations, are identified. Moreover, the panels
are at least four times smaller than those reported in previous studies. The results
suggest that the protocol could be useful in a scenario involving a larger number of
populations. The second problem involves the application of a neural network and
decision trees in thalassaemia screening. The aim is to classify thirteen classes of
thalassaemia abnormality and one control class by inspecting the distribution of
multiple types of haemoglobin in blood specimens, which are identified via high
performance liquid chromatography (HPLC). C4.5 and random forests are the chosen
architectures for the decision tree implementation. For comparison, multilayer perceptrons
are explored for classification via a neural network. The stratified 10-fold cross-validation
results indicate that the best classification performance is achieved when
C4.5 is used in conjunction with samples which have been pre-processed with input
attribute discretisation and redundant attribute removal. Subsequently, C4.5 is applied
to an additional sample set in a clinical trial, which yields acceptably high
classification accuracy. These results suggest that a combination of C4.5 with
haemoglobin typing analysis via HPLC may give rise to a guideline for further
investigation of thalassaemia classification.
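
Illustrative note: the attribute ranking in the first problem builds on symmetrical uncertainty, which for two discrete variables X and Y is commonly defined as SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), where H is entropy and I is mutual information. The Python sketch below shows only this standard measure applied to a plain ranking of attributes against class labels. It is not the round robin variant developed in the thesis, whose details are not given in the abstract. The function names, the toy SNP identifiers and the toy genotype values are all hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """Standard symmetrical uncertainty: 2 * I(X;Y) / (H(X) + H(Y))."""
    h_x = entropy(x)
    h_y = entropy(y)
    h_xy = entropy(list(zip(x, y)))      # joint entropy H(X, Y)
    mutual_info = h_x + h_y - h_xy       # I(X;Y) = H(X) + H(Y) - H(X,Y)
    if h_x + h_y == 0:
        return 0.0                       # both variables are constant
    return 2.0 * mutual_info / (h_x + h_y)


def rank_by_su(attributes, labels):
    """Rank attributes (name -> list of values) by SU with the class labels.

    This is a plain one-pass ranking, an assumption made for illustration;
    it does not reproduce the thesis's round robin scheme.
    """
    scores = {name: symmetrical_uncertainty(col, labels)
              for name, col in attributes.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: genotype codes per sample and population labels.
toy_labels = ["A", "A", "B", "B"]
toy_snps = {
    "snp_informative": [0, 0, 2, 2],    # perfectly tracks the label
    "snp_uninformative": [0, 2, 0, 2],  # independent of the label
}
ranking = rank_by_su(toy_snps, toy_labels)
```

In this toy case the attribute that perfectly separates the two labels ranks first and the independent attribute ranks last. A wrapper stage of the kind the abstract mentions would then evaluate candidate subsets of the top-ranked attributes with a classifier.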