Abstract:
Population genetics has advanced considerably over the past decade. New technologies, such as next-generation genome sequencing, can now generate huge amounts of data in little time. Large initiatives such as the International HapMap Project and the 1000 Genomes Project are using these technologies to provide the scientific community with a detailed genetic reference for different populations. The challenge now is to develop fast and accurate computational methods to analyze this volume of data.
Identifying genetic signatures that distinguish between populations is currently a major concern. Much work has analyzed variation within the human genome, and more specifically on the Y chromosome, in order to better understand the evolution of the human species. However, fully understanding the genetic evolution of Homo sapiens requires studying the variability of the complete human genome. Unfortunately, the search for conserved regions on autosomal chromosomes is still in its infancy, because the high rate of recombination on these chromosomes makes such regions very difficult to find. In addition, implementing feasible computational methods for such enormous datasets is itself a challenge. To tackle these obstacles, we have derived a new computational method that identifies conserved regions of Single Nucleotide Polymorphisms (SNPs) on autosomal chromosomes that differ between populations. Our algorithm first performs a feature selection step to identify differentiable SNPs. It then searches for population-discriminative motifs, that is, differentiable sequences of SNPs, using Probabilistic Suffix Tree data structures.
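The two stages described above can be illustrated with a minimal sketch. It is not the method's actual implementation: the abstract does not specify the selection criterion or the tree construction, so the sketch assumes unphased genotypes coded as minor-allele counts (0, 1, 2), ranks SNPs by the gap in mean genotype between populations, and replaces a true Probabilistic Suffix Tree with a flat table of bounded-length context counts with add-one smoothing. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def select_snps(genotypes, labels, top_k):
    """Return the column indices of the top_k SNPs, in genomic order.

    Assumed score (not taken from the paper): the largest gap in mean
    genotype value between any two populations.
    """
    pops = sorted(set(labels))
    scores = []
    for j in range(len(genotypes[0])):
        means = []
        for pop in pops:
            vals = [g[j] for g, lab in zip(genotypes, labels) if lab == pop]
            means.append(sum(vals) / len(vals))
        scores.append((max(means) - min(means), j))
    top = sorted(scores, reverse=True)[:top_k]
    return sorted(j for _, j in top)


def build_context_counts(sequences, max_depth):
    """Count next-symbol occurrences after every context of length 0..max_depth.

    This flat table stands in for the nodes of a suffix tree: each key is a
    context (a tuple of preceding symbols), each value counts what followed it.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for i, sym in enumerate(seq):
            for depth in range(min(max_depth, i) + 1):
                counts[tuple(seq[i - depth:i])][sym] += 1
    return counts


def next_symbol_prob(counts, context, sym, alphabet_size=3):
    """Smoothed probability of sym after the longest stored suffix of context."""
    context = tuple(context)
    # Back off to shorter suffixes until a stored context is found.
    while context and context not in counts:
        context = context[1:]
    seen = counts.get(context, Counter())
    return (seen[sym] + 1) / (sum(seen.values()) + alphabet_size)


# Toy data: 4 individuals x 4 SNPs, two populations (illustrative only).
genotypes = [[0, 2, 1, 0], [0, 2, 1, 1], [2, 0, 1, 0], [2, 0, 1, 1]]
labels = ["A", "A", "B", "B"]

kept = select_snps(genotypes, labels, top_k=2)
pop_a = [[g[j] for j in kept] for g, lab in zip(genotypes, labels) if lab == "A"]
model_a = build_context_counts(pop_a, max_depth=1)
print(kept, next_symbol_prob(model_a, (0,), 2))
```

Building one such model per population and comparing the probabilities they assign to the same SNP sequence is one way a motif could be scored as discriminative; the scoring rule actually used by the method is not stated in the abstract.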
We first tested the efficiency and performance of our method on several simulated datasets and then applied it to a real genomic dataset containing different populations from the Middle East and North Africa (MENA) region. On the simulated data, our method identified the inserted motifs with an average precision of 90% and an average sensitivity of 80%. On the real dataset, it identified several differentiable regions on different chromosomes. Chromosomes 1, 3 and 6 had the highest numbers of differentiable motifs (9, 8 and 6 motifs, respectively). Our feature selection step outperformed SPLSDA, a state-of-the-art feature selection technique known for its speed, in both computation time and precision. Our method is the first to identify multi-class-specific 'regions', rather than random subsets of Single Nucleotide Polymorphisms, on unphased genomic SNP data. These discriminative motifs can be further studied to understand their role in both evolution and disease.
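For reference, precision and sensitivity on simulated data are conventionally computed by comparing the motifs a method reports against the motifs that were inserted. The sketch below assumes exact matching of motif identifiers, which may differ from the matching rule used in the study; the function name and the toy motif labels are hypothetical and do not reproduce the reported results.

```python
def precision_sensitivity(reported, inserted):
    """Return (precision, sensitivity) for reported vs. inserted motifs.

    precision   = true positives / all reported motifs
    sensitivity = true positives / all inserted motifs
    """
    reported, inserted = set(reported), set(inserted)
    true_pos = len(reported & inserted)
    precision = true_pos / len(reported) if reported else 0.0
    sensitivity = true_pos / len(inserted) if inserted else 0.0
    return precision, sensitivity


# Toy example with made-up motif labels (illustrative only).
print(precision_sensitivity({"m1", "m2", "m3", "x"},
                            {"m1", "m2", "m3", "m4", "m5"}))
```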