Abstract:
Data mining is a relatively new term, introduced in the 1990s. Data mining is the process of extracting useful information from huge amounts of data; it is sometimes called data discovery or knowledge discovery in databases [6]. What counts as useful information depends on the purpose for which the data mining is performed; such information can be used to increase revenue, to cut costs, or to support research. Advances in hardware and software in the late 1990s made it possible to centralize data, a process known as data warehousing, with the centralized store called a data warehouse. Centralization raised an important issue, the quality of the resulting data, since it involves joining multiple data sources. The data provided as input to the data mining process must be of high quality for its results to be accurate and reliable. Before data can be mined to extract useful information, it goes through a process called data cleansing. The practice of data cleansing is as old as data itself; the term, however, was only introduced in the 1990s. Data cleansing involves several steps, each of which may apply one or more algorithms. One step of particular importance is duplicate data detection, which became even more critical as hardware advances allowed data warehouses to hold ever larger volumes of data. In this work, a tool based on the K-way sorting algorithm is implemented and used for duplicate data detection. The tool provides many data cleansing features and supports multiple languages, notably Arabic, which no other tool offers.
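The abstract does not detail the K-way sorting algorithm itself, so the following is only a minimal illustrative sketch of the general idea behind sort-based duplicate detection (a sorted-neighborhood style pass over records ordered by a key), not the tool described in this work; the record fields, key construction, window size, and similarity threshold are assumptions chosen for the example.

# Illustrative sketch of sort-based duplicate detection (sorted-neighborhood style).
# NOT the thesis's K-way sorting tool: the fields, key rule, window size,
# and similarity threshold below are assumptions made for illustration only.
from difflib import SequenceMatcher

def build_key(record):
    # Build a simple sort key from normalized prefixes of each field value.
    return "".join(str(v).strip().lower()[:3] for v in record.values())

def similar(a, b, threshold=0.85):
    # Treat two records as likely duplicates if their concatenated,
    # normalized field values are sufficiently close.
    text_a = " ".join(str(v).strip().lower() for v in a.values())
    text_b = " ".join(str(v).strip().lower() for v in b.values())
    return SequenceMatcher(None, text_a, text_b).ratio() >= threshold

def detect_duplicates(records, window=4):
    # Sort records on the key, then compare each record only with its
    # neighbors inside a small sliding window over the sorted order,
    # avoiding a full pairwise comparison of the whole data set.
    ordered = sorted(records, key=build_key)
    pairs = []
    for i, rec in enumerate(ordered):
        for j in range(i + 1, min(i + window, len(ordered))):
            if similar(rec, ordered[j]):
                pairs.append((rec, ordered[j]))
    return pairs

if __name__ == "__main__":
    sample = [
        {"name": "Ahmed Ali", "city": "Amman"},
        {"name": "Ahmad Ali", "city": "Amman"},
        {"name": "Sara Omar", "city": "Irbid"},
    ]
    for a, b in detect_duplicates(sample):
        print("possible duplicate:", a, "~", b)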