Parallel data mining
Title: Parallel data mining
DNr: SNIC 2015/1-134
Project Type: SNIC Medium Compute
Principal Investigator: Håkan Sundell <Hakan.Sundell@hb.se>
Affiliation: Högskolan i Borås
Duration: 2015-03-31 – 2016-04-01
Classification: 10201
Keywords:

Abstract

We aim to improve the performance of data mining and machine learning algorithms using high-performance tools. To this end, we parallelize popular algorithms in the field, such as Support Vector Machines (SVMs), and study the overheads that parallel execution introduces, in particular communication and synchronization time. We identify the dependencies in the structure of the chosen algorithms and try to design approximation algorithms that relax the dependencies of the original sequential algorithms in order to reduce running time. At this stage of the research, we are working on the SVM. We have studied an existing parallel SVM algorithm based on the Message Passing Interface (MPI) and are trying to improve it with respect to its communication and synchronization overheads, by designing an approximation algorithm that removes some of its dependencies and thereby reduces the communication between machines. Where we replace dependent parts of an algorithm with approximations or estimates, we may sacrifice accuracy in favor of faster computation, to a degree that depends on how good the approximation or estimate is. We will therefore study the trade-off between accuracy and computation speed, and define an error bound on the accuracy of the results for the different approximation algorithms.
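
The trade-off between communication and accuracy described in the abstract can be illustrated with a small sketch. It is not the project's algorithm, which the abstract does not specify. It assumes a linear SVM trained by subgradient descent on the hinge loss, simulates the machines as in-process workers instead of using MPI, and swaps in a simple, well-known approximation: each worker takes several local steps between model-averaging rounds. All function names, parameters, and the synthetic data are hypothetical.

```python
import random


def make_data(n, seed=0):
    """Synthetic, linearly separable 2-D points labelled by the sign of x0 + x1."""
    rng = random.Random(seed)
    data = []
    while len(data) < n:
        x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        s = x[0] + x[1]
        if abs(s) > 0.2:  # drop points close to the separating line
            data.append((x, 1 if s > 0 else -1))
    return data


def local_step(w, shard, lr, lam):
    """One full-batch subgradient step on the L2-regularised hinge loss.

    w = [w0, w1, bias]; the bias is regularised too, to keep the sketch short.
    """
    grad = [lam * wi for wi in w]
    for x, y in shard:
        if y * (w[0] * x[0] + w[1] * x[1] + w[2]) < 1:  # margin violated
            grad[0] -= y * x[0] / len(shard)
            grad[1] -= y * x[1] / len(shard)
            grad[2] -= y / len(shard)
    return [wi - lr * gi for wi, gi in zip(w, grad)]


def train(data, workers=4, steps=200, sync_every=1, lr=0.1, lam=0.01):
    """Train on simulated workers; return (model, number of averaging rounds).

    With equally sized shards, sync_every=1 reproduces the exact full-batch
    step. Larger values are an approximation: workers drift apart between
    rounds, but the models are exchanged less often.
    """
    shards = [data[i::workers] for i in range(workers)]
    models = [[0.0, 0.0, 0.0] for _ in range(workers)]
    rounds = 0
    for t in range(1, steps + 1):
        models = [local_step(w, s, lr, lam) for w, s in zip(models, shards)]
        if t % sync_every == 0:
            # Stand-in for a collective averaging step between machines.
            avg = [sum(m[j] for m in models) / workers for j in range(3)]
            models = [avg[:] for _ in range(workers)]
            rounds += 1
    final = [sum(m[j] for m in models) / workers for j in range(3)]
    return final, rounds


def accuracy(w, data):
    """Fraction of points whose predicted sign matches the label."""
    hits = sum(1 for x, y in data
               if y * (w[0] * x[0] + w[1] * x[1] + w[2]) > 0)
    return hits / len(data)


if __name__ == "__main__":
    train_set, test_set = make_data(400, seed=0), make_data(200, seed=1)
    for k in (1, 5, 20):
        w, rounds = train(train_set, sync_every=k)
        print(f"sync_every={k:2d}  rounds={rounds:3d}  "
              f"test accuracy={accuracy(w, test_set):.3f}")
```

Running the script prints, for each synchronization interval, the number of averaging rounds and the accuracy on held-out points. On a real cluster each round would correspond to one collective exchange, so the round count stands in for the communication cost that the approximation is meant to reduce.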