A multi-class discriminative motif finding algorithm for autosomal genomic data. (c2015)

dc.contributor.author Wehbe, Gioia Wahib
dc.date.accessioned 2016-04-06T05:32:08Z
dc.date.available 2016-04-06T05:32:08Z
dc.date.copyright 2015-12-21
dc.date.issued 2016-04-06
dc.identifier.uri http://hdl.handle.net/10725/3493
dc.description.abstract Population genetics has advanced considerably in the past decade. New technologies, such as next-generation genome sequencing, can now produce huge amounts of data in little time. Large initiatives such as the International HapMap Project and the 1000 Genomes Project use these technologies to provide the scientific community with a detailed genetic reference from different populations. The challenge now is to develop fast and accurate computational methods to analyze this data. Identifying genetic signatures that distinguish between populations is a major concern. Much work has analyzed variation within the human genome, particularly on the Y chromosome, to better understand the evolution of the human species. However, fully understanding the genetic evolution of Homo sapiens requires studying the variability of the complete human genome. Unfortunately, finding conserved regions on autosomal chromosomes is still in its infancy, because the high rate of recombination on these chromosomes makes it very difficult. In addition, implementing feasible computational methods for such enormous data is itself a challenge. To tackle these obstacles, we derived a new computational method that identifies conserved regions of Single Nucleotide Polymorphisms (SNPs) on autosomal chromosomes that differ between populations. Our algorithm first performs a feature selection step to identify differentiable SNPs. It then searches for population-discriminative motifs, that is, differentiable sequences of SNPs, using Probabilistic Suffix Tree data structures.
We first tested the efficiency and performance of our method on several simulated datasets and then applied it to a real genomic dataset covering different populations from the Middle East and North Africa (MENA) region. Our method identified the inserted motifs in the simulated data with a precision of 90% and a sensitivity of 80% on average. It also identified several differentiable regions in the real dataset, on different chromosomes; chromosomes 1, 3 and 6 had the highest numbers of differentiable motifs (9, 8 and 6 motifs, respectively). Our feature selection step outperformed SPLSDA, a state-of-the-art feature selection technique known for its speed, in both computational time and precision. Our method is the first to identify multi-class-specific 'regions' rather than random subsets of Single Nucleotide Polymorphisms on unphased genomic SNP data. These discriminative motifs can be studied further to understand their role in both evolution and disease. en_US
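The two-step pipeline described in the abstract (feature selection, then a suffix-tree-based search over SNP sequences) can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not state the feature selection criterion or the tree construction details, so everything below is an assumption made for illustration. Genotypes are assumed to be coded 0/1/2, the SNP score is a simple spread of per-population mean genotypes, and the hypothetical `SuffixContextModel` is a simplified, unpruned stand-in for a Probabilistic Suffix Tree (it stores next-symbol counts for every context up to a fixed depth and backs off to shorter suffixes).

```python
import math
from collections import defaultdict


def select_snps(genotypes, labels, top_k):
    """Rank SNP columns by the spread of mean genotype across populations.

    Assumption: genotypes are coded 0/1/2. The score is a stand-in chosen
    for illustration, not the criterion used in the thesis.
    """
    populations = sorted(set(labels))
    scores = []
    for j in range(len(genotypes[0])):
        means = []
        for pop in populations:
            column = [row[j] for row, lab in zip(genotypes, labels) if lab == pop]
            means.append(sum(column) / len(column))
        scores.append((max(means) - min(means), j))
    # Keep the top_k highest-scoring SNPs, returned in genomic order.
    best = sorted(scores, reverse=True)[:top_k]
    return sorted(j for _, j in best)


class SuffixContextModel:
    """Simplified variable-order context model (hypothetical helper).

    Stores next-symbol counts for every context of length 0..max_depth.
    A real Probabilistic Suffix Tree would also prune uninformative nodes.
    """

    def __init__(self, max_depth=2, alphabet=(0, 1, 2), pseudocount=1.0):
        self.max_depth = max_depth
        self.alphabet = alphabet
        self.pseudocount = pseudocount
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, sequences):
        for seq in sequences:
            for i, symbol in enumerate(seq):
                for d in range(min(i, self.max_depth) + 1):
                    self.counts[tuple(seq[i - d:i])][symbol] += 1
        return self

    def prob(self, context, symbol):
        # Truncate to max_depth, then back off to the longest seen suffix.
        context = tuple(context[-self.max_depth:]) if self.max_depth else ()
        while context and context not in self.counts:
            context = context[1:]
        seen = self.counts.get(context, {})
        total = sum(seen.values()) + self.pseudocount * len(self.alphabet)
        return (seen.get(symbol, 0) + self.pseudocount) / total

    def log_likelihood(self, seq):
        return sum(math.log(self.prob(seq[:i], s)) for i, s in enumerate(seq))


def classify(models, seq):
    """Return the population whose model gives seq the highest likelihood."""
    return max(models, key=lambda pop: models[pop].log_likelihood(seq))
```

In this sketch, one model would be fitted per population on the selected SNP columns only, and a window of SNPs that one population's model scores much higher than all the others would be a candidate discriminative motif.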
dc.language.iso en en_US
dc.subject Population genetics en_US
dc.subject Population genetics -- Computer simulation en_US
dc.subject Human genome en_US
dc.subject Lebanese American University -- Dissertations en_US
dc.subject Dissertations, Academic en_US
dc.title A multi-class discriminative motif finding algorithm for autosomal genomic data. (c2015) en_US
dc.type Thesis en_US
dc.term.submitted Fall en_US
dc.author.degree MS in Computer Science en_US
dc.author.school SAS en_US
dc.author.idnumber 200801378 en_US
dc.author.commembers Azar, Danielle
dc.author.commembers Abu Khzam, Faisal
dc.author.commembers Zalloua, Pierre
dc.author.woa OA en_US
dc.author.department Computer Science and Mathematics en_US
dc.description.embargo N/A en_US
dc.description.physdesc 1 hard copy: xix, 156 leaves; col. ill.; 31 cm. available at RNL. en_US
dc.author.advisor Khazen, Georges
dc.keywords Population Classification en_US
dc.keywords Motif Finding en_US
dc.keywords Feature Selection en_US
dc.keywords Suffix Trees en_US
dc.keywords Genome Autosomal Data en_US
dc.keywords Single Nucleotide Polymorphisms en_US
dc.description.bibliographiccitations Includes bibliographical references (leaves 121-132). en_US
dc.identifier.doi https://doi.org/10.26756/th.2015.49 en_US
dc.publisher.institution Lebanese American University en_US
