Big Data Analytics Algorithms: k-means with Canopy Preprocessing
- 4. Canopy Algorithm: Basic Concepts (2/2)
• Algorithm steps
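1. Choose two distance thresholds, T1 and T2, with T1 > T2.
2. Vectorize the dataset and load the resulting list of vectors into memory.
3. Pick an arbitrary point P from the list and compute the distance from P to every existing Canopy (if no Canopy exists yet, make P the center of a new Canopy). If P lies within T1 of a Canopy, add P to that Canopy.
4. If P lies within T2 of a Canopy, remove P from the list. P is considered close enough to that Canopy that it should no longer serve as the center of any other Canopy.
5. Repeat steps 3 and 4 until the list is empty.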
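The steps above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in these slides. It assumes Euclidean distance, and it assumes that a point which is not within T2 of any existing Canopy becomes the center of a new Canopy and is then removed from the list (the steps leave this case implicit). The names `canopy`, `euclidean`, and `seed` are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy(points, t1, t2, distance=euclidean, seed=0):
    """Group points into (possibly overlapping) canopies.

    Returns a list of (center, members) pairs.
    """
    if t1 <= t2:
        raise ValueError("expected T1 > T2")

    # Step 2: keep the vectorized data in an in-memory list.
    remaining = list(points)
    # Step 3 says "arbitrary"; a seeded shuffle keeps runs repeatable.
    random.Random(seed).shuffle(remaining)

    canopies = []
    for p in remaining:  # each point leaves the list once processed
        strongly_bound = False
        for center, members in canopies:
            d = distance(p, center)
            if d <= t1:
                # Step 3: within T1, so P joins this canopy.
                members.append(p)
            if d <= t2:
                # Step 4: within T2, so P may not become a center.
                strongly_bound = True
        if not strongly_bound:
            # Assumption: P starts a new canopy (also covers the
            # "no canopy yet" case in step 3).
            canopies.append((p, [p]))
    return canopies


# Usage: two well-separated pairs of points form two canopies.
sample = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.5, 10.0)]
print(len(canopy(sample, t1=3.0, t2=1.0)))  # 2
```

The canopy centers found this way can then be used as the initial centroids (and as a suggested value of k) for k-means.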
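- 12. Ways to Link Results Back to the Customer ID or Other Identifier
3. Use SQL to compute the distance from each observation to every centroid yourself, then assign each observation to the nearest cluster.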
select
    cust_id,
    /* Assign each customer to the nearest centroid.
       <= is used so that ties go to the lower-numbered cluster instead of returning NULL. */
    case
        when C1_Dist <= C2_Dist and C1_Dist <= C3_Dist and C1_Dist <= C4_Dist then 'Cluster1'
        when C2_Dist <= C1_Dist and C2_Dist <= C3_Dist and C2_Dist <= C4_Dist then 'Cluster2'
        when C3_Dist <= C1_Dist and C3_Dist <= C2_Dist and C3_Dist <= C4_Dist then 'Cluster3'
        when C4_Dist <= C1_Dist and C4_Dist <= C2_Dist and C4_Dist <= C3_Dist then 'Cluster4'
    end as cluster_name
from
(
    select
        cust_id,
        /* Euclidean distance to each centroid.
           SQUARE() is available in SQL Server (T-SQL); other dialects typically offer POWER(x, 2). */
        SQRT( SQUARE(var1 - 6.249) + SQUARE(var2 - 2.868) + SQUARE(var3 - 4.854) + SQUARE(var4 - 1.651) ) as C1_Dist,
        SQRT( SQUARE(var1 - 6.913) + SQUARE(var2 - 3.1) + SQUARE(var3 - 5.847) + SQUARE(var4 - 2.131) ) as C2_Dist,
        SQRT( SQUARE(var1 - 5.606) + SQUARE(var2 - 2.642) + SQUARE(var3 - 3.997) + SQUARE(var4 - 1.235) ) as C3_Dist,
        SQRT( SQUARE(var1 - 5.006) + SQUARE(var2 - 3.418) + SQUARE(var3 - 1.464) + SQUARE(var4 - 0.244) ) as C4_Dist
    from kmean_DT
) aa