Data Clustering
Title: Data Clustering
SNIC Project: SNIC 2021/22-751
Project Type: SNIC Small Compute
Principal Investigator: Amir Keramatian <>
Affiliation: Chalmers tekniska högskola
Duration: 2021-10-04 – 2021-12-01
Classification: 10201


Today's datasets are getting larger and larger in size and number of features, and many of today's data processing tasks require flexibility regarding the choice of similarity/dis-similarity measurements. The aforementioned aspects have started challenging many of the established data mining methods. Data clustering, a popular unsupervised data mining tool, is not an exception. Density-based spatial clustering of applications with noise (DBSCAN) is a prominent clustering algorithm capable of finding clusters of desired shapes and sizes while distinguishing noise points without a prior knowledge on the number of clusters. DBSCAN is a versatile algorithm in terms of being applicable with any desired dis-similarity measurement, including Euclidean distance. Consecutively, it can be applied in a large domain of applications. Computational complexity of traditional DBSCAN is in the worst-case quadratic in the input size, making it ineffectual considering attributes of today's datasets. Nonetheless, there have been various attempts to increase usability of DBSCAN such as utilization of indexing structures, parallelization, and approximation. Locality-sensitive hash (LSH) is a relatively recent and established indexing approach for similarity search in large and feature-rich datasets. LSH is based on the simple idea that if two data points are considered close using a particular dis-similarity measurement, using an appropriate locality-sensitive hash function, their hashed values are equal with a high probability. Consecutively, LSH-based indexing methods have been successfully utilized for finding similar items in large and feature-rich datasets. Furthermore, it has been shown that LSH is the best-known method for similarity search for feature-rich datasets. We have created an approximate concurrent LSH-based DBSCAN algorithm that we believe can outperform the state-of-the-art algorithms and prove useful in use-cases that similar algorithms fail to finish in a timely manner. Therefore, we would like to get resources to be able to benchmark our algorithm against the state-of-the-art competition.