Abstract:
Clustering is the process of dividing a set of objects into several classes such that each class is composed of similar objects. Traditional centralized clustering algorithms target objects that are located at the same site and cannot operate on distributed objects. Distributed clustering
algorithms, however, can fill this gap. They extract a classification model from the distributed
objects even when they reside at different sites and locations. Owing to the current trend of
storing data across different locations and sites, distributed data is growing rapidly in
popularity. It seems set to be one of the most prominent fields in the coming decades,
especially given the huge amount of data spreading across the web. Even though a lot of
research has been done on this topic, it is still considered to be in its infancy because of the challenges that continue to arise, such as bandwidth limitations, the cost of transferring data to a single site, and
many others. In this work, we present DG-means, a greedy algorithm that operates on
distributed sets of data. Three datasets (Wholesale, Banknotes, and Iris) are used to compare multiple distributed clustering algorithms on different metrics: execution runtime, stability, and accuracy. DG-means exhibited superior performance when compared to
the other algorithms.