Abstract:
This thesis focuses on the NP-Hard problem of finding an optimal tree topology where
leaves represent biological sequences. The problem consists of minimizing the number of
changes between given and/or derived sequences. As the number of sequences to be compared
increases, the size of the search space grows exponentially, requesting the use of optimization
methods, to come up with an acceptable optimal topology. Even though, research is relating a
large number of species and gene families to each others, the computation intensive load of many
popular methods for evaluating trees (ex: parsimony and maximum likelihood) establish the
quasi-inexistence of an exact solution for more than about 20 sequences.
The alignment of two sequences (pairwise alignment), or multiple sequences (Multiple
Sequence Alignment - MSA), and the alignment of short or long sequences such as an entire
genome may require different types of algorithms. The algorithms used in all of these four cases
are dynamic programming, linear programming based or heuristic-based or a combination of
them. For large numbers of sequences to be aligned, heuristic methods may give a result similar
to the exact solution offered by the dynamic programming or linear programming based
algorithms, but in much shorter time. Heuristics used in phylogenetic inference include greedy
algorithms, hill climbing, simulated annealing, and genetic algorithms.
Thus, this research aims to suggest a general way to encode the problem into instances of
different heuristic algorithms. Another focus of the document would be to suggest that the
heuristics used, be implemented in a most optimal way, trying to get a compromise between
speedup, flexibility and detailed tracing/chronology of each run of the algorithms.