Abstract:
This work introduces a hybrid algorithm for record clustering that groups records
according to threshold values and the query patterns observed for a particular
database. We study the space density of a file and how it affects
retrieval time before and after clustering. The Hamming distance of a file is used as a
measure of space density. The objective of the algorithm is to minimize the Hamming
distance of the file while giving greater weight to the most frequently asked
queries. Simulation experiments show that restructuring a file yields a substantial
reduction in response time. Criteria such as block size, threshold
value, and percentage of records satisfying a given set of queries, which affect
clustering and response time are also studied. Random statistical and graph theory are
used to substantiate the experimental results. As a further means of predicting
performance, regression analysis is employed and later compared with the experimental
figures.
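
As an illustration of the space-density measure mentioned above, the following is a minimal sketch, not the algorithm studied in this work. It assumes that records are encoded as equal-length binary vectors and that the Hamming distance of a file is taken as the sum of the Hamming distances between consecutive records; both the encoding and this file-level definition are assumptions made for the example, and the function names and sample records are hypothetical.

```python
from typing import List, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same length")
    return sum(x != y for x, y in zip(a, b))


def file_hamming_distance(records: List[Sequence[int]]) -> int:
    """Assumed file-level measure: sum of the Hamming distances between
    each pair of consecutive records in storage order."""
    return sum(hamming(r, s) for r, s in zip(records, records[1:]))


# Hypothetical records: the same four records in two storage orders.
unclustered = [(1, 0, 0, 1), (0, 1, 1, 0), (1, 0, 0, 1), (0, 1, 1, 0)]
clustered = [(1, 0, 0, 1), (1, 0, 0, 1), (0, 1, 1, 0), (0, 1, 1, 0)]

print(file_hamming_distance(unclustered))  # 12
print(file_hamming_distance(clustered))    # 4
```

Under these assumptions, placing similar records next to each other lowers the file-level distance, which is the sense in which a clustered file is treated as denser in this sketch.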