sciencemaths的个人博客分享 http://blog.sciencenet.cn/u/sciencemaths

博文

k-means, k-medoids, k-median and k-center

已有 12046 次阅读 2014-5-22 13:45 |个人分类:算法|系统分类:科研笔记

k-means, k-medoids, k-median and k-center, 先不要晕. 这4个都是聚类算法, 个中区别慢慢讲来.


k-means, 这位是最有名的了. 因为简单有效, 通常是聚类的第一选择.

N data items ---- >  k  clusters

in each cluster, there is an averaged center (mean) called  centroids.

Object:  minimize the sum of squared distance from each item to its nearest averaged center.

EM algorithm is the most common and simple way to realize it.


k-medoids,

N data items ---- >  k  clusters

in each cluster, there is a medoid, which is a real data item from the data set (not averaged !!!).

Object:  minimize the sum of squared distance from each item to its nearest medoids.

Main realization :  PAM, CLARA, CLARANS, EM algorithm (like k-means)

PAM:  global optimal,  but very slow

CLARA:  use PAM on samples, efficient, not global optimal

CLARANS:  random search,  better than CLARA

EM:  very fast, but not global optimal


k-median,

N data items ---- >  k  clusters

in each cluster, there is a median (median !! not mean or medoids !!).

Object:  minimize the sum of  distance from each item to its nearest median (sum of distance !! not sum of squared distance !!).


k-center,

N data items ---- >  k  clusters

in each cluster, there is a cluster center.

Object:  minimize the maximum  distance from each item to its nearest cluster centers (maximum distance !! not sum of distance !!)


According to  (Bradley NIPS1997),

k-median is to assign n points in m-dimensional real value space to k clusters so that the sum of distances of each point to the nearest  center is minimized. The center is a vector in m-dimensional real value space, but not the one of n points. A center of one cluster is iteratively computed as  the median vector of all points in this cluster.

k-median algorithm uses the same strategy as k-means to update the centers, but it  uses the 1-norm distance.

In contrast the k-means algorithm uses squares of 2-norm distances to generate cluster centers.

According to (Arya STOC2001),  k-median problem is to minimize the average distance from data points to their closest cluster centers.  k-center problem is to minimize the maximum  distance from data points to their closest cluster centers, which is the min-max analogue of the k-median problem.

In a general metric space, the k-median problem is known to be NP-hard. Its approximation has been widely studied in (Arya STOC2001, Guha JCSS2002).


就这么定了, 以后谁还敢混淆这4个, 先拉出去每个算法的实现程序写100遍.




https://blog.sciencenet.cn/blog-704655-796710.html


收藏 IP: 60.247.77.*| 热度|

0

该博文允许注册用户评论 请点击登录 评论 (1 个评论)

数据加载中...

Archiver|手机版|科学网 ( 京ICP备07017567号-12 )

GMT+8, 2024-5-20 16:50

Powered by ScienceNet.cn

Copyright © 2007- 中国科学报社

返回顶部