本文简单介绍DBSCAN算法的原理及实现。
DBSCAN(Density-Based Spatial Clustering of Applications with Noise)是一种基于密度的空间聚类算法。该算法将具有足够密度的区域划分为簇,并在具有噪声的空间数据库中发现任意形状的簇,它将簇定义为密度相连的点的最大集合。
假设样本集为 D = ( x 1 , x 2 , . . . , x m ) D=(x_1, x_2, ..., x_m) D=(x1,x2,...,xm),DBSCAN算法有如下相关定义:
注意:
总体思路
详细算法
算法1
输入:样本集 D = ( x 1 , x 2 , . . . , x m ) D=(x_1, x_2, ..., x_m) D=(x1,x2,...,xm),参数 ( ϵ , M i n P t s ) (\epsilon, MinPts) (ϵ,MinPts)
输出:核心点集合 Ω \Omega Ω
算法2
输入:样本集 D = ( x 1 , x 2 , . . . , x m ) D=(x_1, x_2, ..., x_m) D=(x1,x2,...,xm),参数 ( ϵ , M i n P t s ) (\epsilon, MinPts) (ϵ,MinPts)
输出:簇划分结果 C = { C 1 , C 2 , . . . , C k } C=\{ C_1, C_2, ..., C_k \} C={C1,C2,...,Ck}
尝试使用sklearn中的DBSCAN算法。
# all import from sklearn import datasets import matplotlib.pyplot as plt from sklearn.cluster import DBSCAN import numpy as np import time
# generate data n_samples = 1500 noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5,noise=.05) X, y = noisy_circles
# plot plt.figure(figsize=(10, 5)) plt.subplot(1, 2, 1) # plt.figure(1) plt.scatter(X[:, 0], X[:, 1]) plt.subplot(1, 2, 2) # plt.figure(2) plt.scatter(X[:, 0], X[:, 1], c=y)
上图是sklearn生成的数据,右图是根据label直接绘制的结果。
下面使用sklearn中的DBSCAN类对数据X进行聚类。
# run sklearn DBSCAN to predict y_pred = DBSCAN(eps=0.1).fit_predict(X) # plot sklearn DBSCAN result plt.figure(figsize=(5, 5)) plt.scatter(X[:, 0], X[:, 1], c=y_pred)
此部分代码为笔者依据上述伪代码实现,未经过优化
# my DBSCAN algorithm from sklearn.metrics.pairwise import pairwise_distances class MyDBSCAN: def __init__(self, X, eps=0.5, minPts=5): """ @param: X: input dataset, (n, m) eps: minPts: """ self.X = X self.n = len(X) self.eps = eps self.minPts = minPts def get_core_points(self): """ @return: core_points: the index of core points """ core_points = [] for i in range(self.n): neighbors = self.get_neighbors(i) if len(neighbors) >= self.minPts: core_points.append(i) return core_points def get_neighbors(self, point_idx): """ @param: point_idx: index of point @return: neighbors: neighbors of point """ neighbors = [] for i in range(self.n): if self.dis_mat[point_idx][i] < self.eps: neighbors.append(i) return neighbors def compute_dis_mat(self): """ @param: X: input dataset, (n, m) @return: self.dis_mat: (n, n) """ self.dis_mat = pairwise_distances(X) return def random_choose(self, points): """ random choose a point """ # return points[0] return points[np.random.randint(0, len(points))] def fit_predict(self): """ @param: X: input dataset, (n, m) @return: y_pred: predicted result, (n,) """ # computer distance matrix self.compute_dis_mat() core_points = set(self.get_core_points()) waiting_points = set([i for i in range(len(self.X))]) cluster = -1 * np.ones(self.n, dtype=np.int32) k = 0 while len(core_points) > 0: o = self.random_choose(list(core_points)) cur_core_points = set() cur_core_points.add(o) cur_cluster = set() cur_cluster.add(o) waiting_points.remove(o) while True: if len(cur_core_points) > 0: point = list(cur_core_points)[0] neighbors = set(self.get_neighbors(point)) delta = neighbors & waiting_points cur_cluster = cur_cluster | delta waiting_points = waiting_points - delta cur_core_points = cur_core_points | (delta & core_points) cur_core_points.remove(point) core_points = core_points - cur_cluster else: cluster[list(cur_cluster)] = k k += 1 break return cluster
将其用到sklearn生成的数据集上测试效果:
n_samples = 500 noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5, noise=.05) X, y = noisy_circles
start = time.time() y_my_pred = MyDBSCAN(X=X, eps=0.15).fit_predict() end1 = time.time() y_pred = DBSCAN(eps=0.15).fit_predict(X) end2 = time.time() print('{:15} {:5} {:3}'.format('Algorithm', 'Time', 'NMI')) print('{:15} {:5.2f} {:3}'.format('My DBSCAN ', end1 - start, metrics.normalized_mutual_info_score(y, y_my_pred))) print('{:15} {:5.2f} {:3}'.format('Sklearn DBSCAN', end2 - end1, metrics.normalized_mutual_info_score(y, y_pred)))
Algorithm Time NMI My DBSCAN 0.11 1.0 Sklearn DBSCAN 0.00 1.0
# plot sklearn DBSCAN result plt.figure(figsize=(15, 5)) plt.subplot(1, 3, 1) plt.scatter(X[:, 0], X[:, 1], c=y) plt.title('origin data') plt.subplot(1, 3, 2) plt.scatter(X[:, 0], X[:, 1], c=y_pred) plt.title('sklearn DBSCAN') plt.subplot(1, 3, 3) plt.scatter(X[:, 0], X[:, 1], c=y_my_pred) plt.title('My DBSCAN')
修改eps后的测试结果(minPts=5):
eps | result |
---|---|
0.1 | |
0.12 | |
0.15 | |
可以观察到在sklearn生成的该类数据集上My DBSCAN与sklearn DBSCAN的聚类结果非常接近,但My DBSCAN的运行时间较长。
A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise (aaai.org)
DBSCAN密度聚类算法 - 刘建平Pinard - 博客园
用scikit-learn学习DBSCAN聚类 - 刘建平Pinard - 博客园
DBSCAN - 维基百科
详解DBSCAN聚类 - 知乎
Visualizing DBSCAN Clustering