Abstract:
As the Web continues to grow and evolve, more and more information is being placed in structurally rich documents, XML documents in particular, so as to improve the efficiency of similarity clustering, information retrieval and data management applications. Various algorithms for comparing hierarchically structured data, e.g., XML documents, have been proposed in the literature. Most of them rely on techniques for computing the edit distance between tree structures, with XML documents modeled as Ordered Labeled Trees. Nevertheless, a thorough investigation of current approaches led us to identify several similarity aspects, i.e., sub-tree related structural and semantic similarities, which are not sufficiently addressed when comparing XML documents. In this paper, we provide an integrated and fine-grained comparison method that deals with both structural and semantic similarities in XML documents (detecting the occurrences and repetitions of structurally and semantically similar sub-trees), and allows the end-user to tune the comparison process according to her requirements. Our approach consists of four main modules for i) discovering the structural commonalities between sub-trees, ii) identifying sub-tree semantic resemblances, iii) computing tree-based edit operation costs, and iv) computing tree edit distance. A prototype has been developed to evaluate the optimality and performance of our method. Results demonstrate higher comparison accuracy with respect to alternative XML comparison methods, while timing experiments reflect the significant impact of semantic similarity assessment on overall system performance.
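To make the notion of tree edit distance over Ordered Labeled Trees more concrete, the following is a minimal sketch only, not the method proposed in the paper. It swaps in a simple top-down (Selkow-style) edit distance with unit costs, in which whole sub-trees are inserted or deleted at a cost equal to their node count and node labels are compared by plain string equality. It does not model the sub-tree structural commonality or semantic similarity modules described above. The function names (`to_tree`, `subtree_size`, `top_down_distance`, `xml_distance`) and the sample documents are hypothetical and chosen for illustration.

```python
import xml.etree.ElementTree as ET


def to_tree(elem):
    """Model an XML element as an ordered labeled tree: (label, [children])."""
    return (elem.tag, [to_tree(child) for child in elem])


def subtree_size(tree):
    """Number of nodes in a tree; used as the cost of inserting/deleting it."""
    return 1 + sum(subtree_size(child) for child in tree[1])


def top_down_distance(a, b):
    """Selkow-style top-down edit distance with unit costs (an assumption).

    Allowed operations: relabel a node, insert a sub-tree, delete a sub-tree.
    """
    relabel_cost = 0 if a[0] == b[0] else 1
    kids_a, kids_b = a[1], b[1]
    m, n = len(kids_a), len(kids_b)

    # d[i][j] = cost of turning the first i children of a
    # into the first j children of b.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + subtree_size(kids_a[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + subtree_size(kids_b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + subtree_size(kids_a[i - 1]),   # delete sub-tree
                d[i][j - 1] + subtree_size(kids_b[j - 1]),   # insert sub-tree
                d[i - 1][j - 1]
                + top_down_distance(kids_a[i - 1], kids_b[j - 1]),  # transform
            )
    return relabel_cost + d[m][n]


def xml_distance(xml_a, xml_b):
    """Parse two XML strings and return their top-down edit distance."""
    return top_down_distance(
        to_tree(ET.fromstring(xml_a)), to_tree(ET.fromstring(xml_b))
    )


if __name__ == "__main__":
    doc_a = "<paper><title/><author/><author/></paper>"
    doc_b = "<paper><title/><author/></paper>"
    print(xml_distance(doc_a, doc_b))  # 1: one <author> sub-tree is deleted
```

In this sketch every operation has a fixed cost. A framework of the kind summarized in the abstract would instead derive the insertion and deletion costs from the structural and semantic similarity of the sub-trees involved, which is why those modules precede the edit distance computation.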
Citation:
Tekli, J., Chbeir, R., & Yetongnon, K. (2001). An XML Document Comparison Framework.