Abstract:
Classification consists of predicting group membership for new data instances by learning from pre-classified data instances. It is crucial because it contributes to solving problems in many fields, such as biochemistry, the social sciences, and bioinformatics. Classification has three main components: the classification algorithm, the pre-classified data (training data), and the unclassified data (testing data). Classification accuracy measures how well a classification algorithm labels the unclassified data. Many algorithms address this problem, including C4.5, neural networks, and Bayesian networks. However, since algorithms do not perform equally well on the same data, a detailed study of the “algorithm-data relationship” is needed to assess the overall performance of these algorithms rather than relying on their accuracy alone. To support this view, we explore and assess eight classification algorithms on eight disease detection datasets, each with different characteristics. A detailed comparative study highlights the advantages and drawbacks of each algorithm.
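As a minimal illustration of the accuracy measure mentioned above, the following sketch computes the fraction of testing instances that are labeled correctly. The function name `accuracy` and the example labels are assumptions made for illustration only; they are not taken from the study or its datasets.

```python
def accuracy(predicted, actual):
    """Return the fraction of test instances whose predicted label matches the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty test set")
    # Count the positions where the predicted label equals the true label.
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical labels for five test instances (illustrative only, not study data).
actual = ["sick", "healthy", "sick", "healthy", "healthy"]
predicted = ["sick", "healthy", "healthy", "healthy", "healthy"]

print(accuracy(predicted, actual))  # 4 of 5 correct -> 0.8
```

A single number of this kind summarizes only how often the predictions agree with the true labels on one testing set, which is why it may not fully describe how an algorithm behaves across datasets with different characteristics.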