A multi-class discriminative motif finding algorithm for autosomal genomic data. (c2015)

Wehbe, Gioia Wahib

dc.contributor.author	Wehbe, Gioia Wahib
dc.date.accessioned	2016-04-06T05:32:08Z
dc.date.available	2016-04-06T05:32:08Z
dc.date.copyright	12/21/2015	en_US
dc.date.issued	2016-04-06
dc.identifier.uri	http://hdl.handle.net/10725/3493
dc.description.abstract	Considerable advances have been made in the field of population genetics in the past decade. New technologies, such as Next-Generation Genome Sequencing, can now provide huge amounts of data in little time. Large initiatives such as the International HapMap Project and the 1000 Genomes Project are using these technologies to provide the scientific community with a detailed genetic reference from different populations. The challenge now is to develop fast and accurate computational methods to analyze this huge amount of data. Identifying genetic signatures that can distinguish between populations is currently one of the major concerns. Much work has been done to analyze variations within the human genome, and more specifically at the Y-chromosome level, in order to better understand the evolution of the human species. However, learning about the variability of the complete human genome is essential to fully understand the genetic evolution of Homo sapiens. Unfortunately, the search for conserved regions on autosomal chromosomes is still in its infancy, as it has proven very difficult due to the high rate of recombination on these chromosomes. In addition, implementing feasible computational methods for such enormous data is itself a challenge. To tackle these obstacles, we have derived a new computational method to identify conserved regions of Single Nucleotide Polymorphisms (SNPs) on autosomal chromosomes that are differentiable across populations. Our algorithm first performs a feature selection step to define differentiable SNPs. It then searches for population-discriminative motifs, or differentiable sequences of SNPs, using Probabilistic Suffix Tree data structures.
We initially tested the efficiency and performance of our method on several simulated datasets and then applied it to a real genomic dataset comprising different populations from the Middle East and North Africa (MENA) region. Our method was able to identify the inserted motifs in the simulated data with a precision of 90% and a sensitivity of 80% on average. Additionally, it was able to identify several differentiable regions in the real dataset, on different chromosomes. Notably, chromosomes 1, 3 and 6 had the highest occurrence of differentiable motifs (9, 8 and 6 motifs, respectively). Our feature selection step outperformed SPLSDA, a state-of-the-art feature selection technique known for its speed, in both computation time and precision. Our method is the first to identify multi-class-specific 'regions' rather than random subsets of Single Nucleotide Polymorphisms on unphased genomic SNP data. These discriminative motifs can be further studied to understand their role at both the evolutionary and disease levels.	en_US
dc.language.iso	en	en_US
dc.subject	Population genetics	en_US
dc.subject	Population genetics -- Computer simulation	en_US
dc.subject	Human genome	en_US
dc.subject	Lebanese American University -- Dissertations	en_US
dc.subject	Dissertations, Academic	en_US
dc.title	A multi-class discriminative motif finding algorithm for autosomal genomic data. (c2015)	en_US
dc.type	Thesis	en_US
dc.term.submitted	Fall	en_US
dc.author.degree	MS in Computer Science	en_US
dc.author.school	SAS	en_US
dc.author.idnumber	200801378	en_US
dc.author.commembers	Azar, Danielle
dc.author.commembers	Abu Khzam, Faisal
dc.author.commembers	Zalloua, Pierre
dc.author.woa	OA	en_US
dc.author.department	Computer Science and Mathematics	en_US
dc.description.embargo	N/A	en_US
dc.description.physdesc	1 hard copy: xix, 156 leaves; col. ill.; 31 cm. available at RNL.	en_US
dc.author.advisor	Khazen, Georges
dc.keywords	Population Classification	en_US
dc.keywords	Motif Finding	en_US
dc.keywords	Feature Selection	en_US
dc.keywords	Suffix Trees	en_US
dc.keywords	Genome Autosomal Data	en_US
dc.keywords	Single Nucleotide Polymorphisms	en_US
dc.description.bibliographiccitations	Includes bibliographical references (leaves 121-132).	en_US
dc.identifier.doi	https://doi.org/10.26756/th.2015.49	en_US
dc.publisher.institution	Lebanese American University	en_US
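
The two steps named in the abstract (feature selection, then a motif search built on probabilistic suffix trees) can be illustrated with a toy sketch. This is not the thesis's implementation: the selection criterion (difference in mean genotype between populations), the 0/1/2 genotype coding, the unpruned context-count table standing in for a probabilistic suffix tree, and every function name and parameter below are assumptions made for illustration only.

```python
from collections import defaultdict
import math


def select_snps(genotypes, labels, threshold=0.5):
    """Step 1 (assumed criterion): keep SNP columns whose mean genotype
    differs between at least two populations by `threshold` or more."""
    classes = sorted(set(labels))
    selected = []
    for j in range(len(genotypes[0])):
        means = []
        for c in classes:
            col = [g[j] for g, y in zip(genotypes, labels) if y == c]
            means.append(sum(col) / len(col))
        if max(means) - min(means) >= threshold:
            selected.append(j)
    return selected


def train_context_model(sequences, max_depth=2):
    """Step 2: count next-symbol occurrences after every context of length
    0..max_depth (an unpruned stand-in for a probabilistic suffix tree)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, sym in enumerate(seq):
            for d in range(min(i, max_depth) + 1):
                counts[tuple(seq[i - d:i])][sym] += 1
    return counts


def log_likelihood(model, seq, max_depth=2, alphabet=(0, 1, 2)):
    """Score a sequence using the longest context seen in training,
    with add-one smoothing over the genotype alphabet."""
    total = 0.0
    for i, sym in enumerate(seq):
        ctx = ()
        for d in range(min(i, max_depth), -1, -1):
            ctx = tuple(seq[i - d:i])
            if ctx in model:
                break
        nxt = model.get(ctx, {})
        total += math.log((nxt.get(sym, 0) + 1) / (sum(nxt.values()) + len(alphabet)))
    return total


def classify(models, seq, max_depth=2):
    """Return the population whose model gives the sequence the highest score."""
    return max(models, key=lambda c: log_likelihood(models[c], seq, max_depth))


# Toy unphased genotypes (minor-allele counts 0/1/2); rows are individuals.
genotypes = [
    [2, 2, 0, 1, 0],
    [2, 2, 0, 1, 1],
    [2, 1, 0, 1, 0],
    [0, 0, 2, 1, 0],
    [0, 0, 2, 1, 1],
    [0, 1, 2, 1, 0],
]
labels = ["A", "A", "A", "B", "B", "B"]

selected = select_snps(genotypes, labels)
models = {}
for c in sorted(set(labels)):
    rows = [[g[j] for j in selected] for g, y in zip(genotypes, labels) if y == c]
    models[c] = train_context_model(rows)

new_individual = [2, 2, 0, 1, 0]
predicted = classify(models, [new_individual[j] for j in selected])
print(selected, predicted)
```

A real probabilistic suffix tree would additionally prune contexts that add little predictive information, and a motif search would scan windows of selected SNPs for runs that score much higher under one population's model than the others. The sketch only shows how per-population context models can separate classes on the selected SNPs.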