Using k-means from machine learning
  Data used: the wholesale customers dataset published by the University of California, Irvine (UCI)
  Code (Python):
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Read the data
df_read = pd.read_csv('Wholesale_customers_data.csv')
cust_df = df_read.copy()

# Drop the columns that are not used for clustering
del cust_df['Channel']
del cust_df['Region']

# Build an (n_customers, 6) integer array from the six spending columns
cust_array = np.array([cust_df['Fresh'].tolist(),
                       cust_df['Milk'].tolist(),
                       cust_df['Grocery'].tolist(),
                       cust_df['Frozen'].tolist(),
                       cust_df['Detergents_Paper'].tolist(),
                       cust_df['Delicassen'].tolist()], np.int32)
cust_array = cust_array.T

k = 3
# Cluster the customers; fit_predict returns one cluster label per customer
labels = KMeans(n_clusters=k, random_state=0).fit_predict(cust_array)

# Append the labels as the last column and save the result
df_read['cluster'] = labels
df_read.to_excel('c:/kmeans_result.xlsx', index=False)

cust_df['cluster'] = labels
print('---Cluster counts---')
print(cust_df['cluster'].value_counts())
for i in range(k):
    print('---Cluster {} means---'.format(i))
    print(cust_df[cust_df['cluster'] == i].mean())

# Visualization: stacked bar chart of the per-cluster means
clusterinf = pd.DataFrame()
for i in range(k):
    clusterinf['cluster' + str(i)] = cust_df[cust_df['cluster'] == i].mean()
clusterinf = clusterinf.drop('cluster')
clustersInf = "Mean({} Clusters)".format(k)
clus_plot = clusterinf.T.plot(kind='bar', stacked=True, title=clustersInf)
clus_plot.set_xticklabels(clus_plot.xaxis.get_majorticklabels(), rotation=0)
plt.show()

  Output:
---Cluster counts---
1    328
2     59
0     53
Name: cluster, dtype: int64
---Cluster 0 means---
Fresh                7751.981132
Milk                17910.509434
Grocery             27037.905660
Frozen               1970.943396
Detergents_Paper    12104.867925
Delicassen           2185.735849
cluster                 0.000000
dtype: float64
---Cluster 1 means---
Fresh                8341.612805
Milk                 3779.893293
Grocery              5152.173780
Frozen               2577.237805
Detergents_Paper     1720.573171
Delicassen           1136.542683
cluster                 1.000000
dtype: float64
---Cluster 2 means---
Fresh               36156.389831
Milk                 6123.644068
Grocery              6366.779661
Frozen               6811.118644
Detergents_Paper     1050.016949
Delicassen           3090.050847
cluster                 2.000000
dtype: float64
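  The customers are divided into 3 clusters
  Cluster 1 (cluster1) has 328 customers
  Cluster 2 (cluster2) has 59 customers
  Cluster 0 (cluster0) has 53 customers
    The chart shows that the customers in cluster 1 (328, the largest group) have low order volumes overall
    The higher order volumes belong to the customers in cluster 2 (59) and cluster 0 (53)
    The customers in cluster 2 (59) are characterized by large Fresh orders (about 36,156 on average)
    The customers in cluster 0 (53) spread their orders across several categories, with Grocery highest (about 27,038 on average), followed by Milk and Detergents_Paper
    k-means can be used to group unlabeled data into clusters
    (in medicine it could presumably be used for grouping genes and similar tasks)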
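The per-cluster counts and means discussed above can also be computed in one step with pandas `groupby`. The following is only a minimal sketch: the `toy` DataFrame and its values are hypothetical stand-ins for the wholesale table with its `cluster` column, not the real data.

```python
import pandas as pd

# Hypothetical toy data standing in for the wholesale table plus cluster labels.
toy = pd.DataFrame({
    'Fresh':   [100, 200, 900, 1100, 150],
    'Grocery': [500, 700, 100, 300, 600],
    'cluster': [0, 0, 1, 1, 0],
})

# Number of customers per cluster, ordered by cluster label.
counts = toy['cluster'].value_counts().sort_index()

# Mean of every spending column per cluster, one row per cluster.
means = toy.groupby('cluster').mean()

print(counts)
print(means)
```

With the real data, the same two calls on `cust_df` should give the counts and means printed by the loop, with the clusters as rows instead of separate printouts.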
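  If the number of clusters k is known in advance, so much the better. If it is not, the elbow method can be used to choose a suitable number of clusters (although the elbow method is not a perfect method).
The procedure is to let k run from 1 up to a suitable upper limit (19 is used here),
run the clustering for each k and record the corresponding distortion, plot distortion against k, and finally choose the k at the elbow (the bend in the curve) as the number of clusters.
Running the code below produces the following plot
  The plot shows the bend at k=3 (on the x-axis), which is why 3 clusters were used in the example above
  Python code for choosing the number of clusters with the elbow method: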
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import pandas as pd

X = pd.read_csv("Wholesale_customers_data.csv")
del X['Channel']
del X['Region']
X = X.values

# Run k-means for each k and record the distortion
distortions = []  # sum of squared errors for each k
clusters = range(1, 20)
for k in clusters:
    km = KMeans(n_clusters=k).fit(X)
    distortions.append(km.inertia_)

# Plot the elbow
plt.plot(clusters, distortions, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
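
To make the distortion concrete: scikit-learn's `inertia_` is the sum of squared distances from each sample to the centre of the cluster it is assigned to. The sketch below recomputes that sum by hand and compares it with `inertia_`. It uses hypothetical synthetic data (three 2-D blobs) rather than the wholesale file, and the `n_init=10` and `random_state=0` settings are assumptions chosen only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical synthetic data: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Centre assigned to each sample, then the total squared distance to it.
assigned = km.cluster_centers_[km.labels_]
sse = float(((X - assigned) ** 2).sum())

# The two numbers should agree up to floating-point error.
print(sse, km.inertia_)
```

Because this sum can only shrink or stay the same as k grows in the ideal case, the elbow plot looks for the k after which the decrease flattens out rather than for the minimum.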