K-Means Clustering
ⅰ. Importing Modules & Exploring the Data
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
df = pd.read_csv("Data/Mall_Customers.csv", index_col = 0)
df.head()
Out[2]:
CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100)
---|---|---|---|---
1 | Male | 19 | 15 | 39
2 | Male | 21 | 15 | 81
3 | Female | 20 | 16 | 6
4 | Female | 23 | 16 | 77
5 | Female | 31 | 17 | 40
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 200 entries, 1 to 200
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   Gender                  200 non-null    object
 1   Age                     200 non-null    int64
 2   Annual Income (k$)      200 non-null    int64
 3   Spending Score (1-100)  200 non-null    int64
dtypes: int64(3), object(1)
memory usage: 7.8+ KB
In [4]:
df.describe()
Out[4]:
 | Age | Annual Income (k$) | Spending Score (1-100)
---|---|---|---
count | 200.000000 | 200.000000 | 200.000000
mean | 38.850000 | 60.560000 | 50.200000
std | 13.969007 | 26.264721 | 25.823522
min | 18.000000 | 15.000000 | 1.000000
25% | 28.750000 | 41.500000 | 34.750000
50% | 36.000000 | 61.500000 | 50.000000
75% | 49.000000 | 78.000000 | 73.000000
max | 70.000000 | 137.000000 | 99.000000
- Checking for missing values
In [5]:
df.isna().sum()
Out[5]:
Gender                    0
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
dtype: int64
- One-hot encode the Gender variable (drop_first=True keeps a single Gender_Male column, since the second dummy would be redundant)
In [6]:
df = pd.get_dummies(df, columns= ['Gender'], drop_first=True)
ⅱ. K-Means Clustering
In [7]:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(df)
model.labels_
Out[7]:
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
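For reference, the fitted centroids are also available; a quick sketch (cluster_centers_ is a standard KMeans attribute):

# One centroid per cluster, expressed in the original feature space
pd.DataFrame(model.cluster_centers_, columns=df.columns)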
※ Since the DataFrame has more than two features, the clusters cannot be drawn in a single 2-D scatter plot; a pairwise workaround is sketched below, and a cleaner 2-D view follows in the PCA section.
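As a minimal sketch using only the libraries imported above, seaborn's pairplot shows every pairwise scatter plot at once, colored by the fitted labels:

# A sketch: pairwise scatter plots of all features, colored by cluster label
sns.pairplot(df.assign(label=model.labels_), hue='label', palette='Set2')
plt.show()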
In [8]:
res_df = df.copy()
res_df['label'] = model.labels_
res_df
Out[8]:
CustomerID | Age | Annual Income (k$) | Spending Score (1-100) | Gender_Male | label
---|---|---|---|---|---
1 | 19 | 15 | 39 | 1 | 2
2 | 21 | 15 | 81 | 1 | 2
3 | 20 | 16 | 6 | 0 | 2
4 | 23 | 16 | 77 | 0 | 2
5 | 31 | 17 | 40 | 0 | 2
... | ... | ... | ... | ... | ...
196 | 35 | 120 | 79 | 0 | 1
197 | 45 | 126 | 28 | 0 | 0
198 | 32 | 126 | 74 | 1 | 1
199 | 32 | 137 | 18 | 1 | 0
200 | 30 | 137 | 83 | 1 | 1

200 rows × 5 columns
In [9]:
res_df.groupby('label').mean()
Out[9]:
label | Age | Annual Income (k$) | Spending Score (1-100) | Gender_Male
---|---|---|---|---
0 | 40.394737 | 87.000000 | 18.631579 | 0.526316
1 | 32.692308 | 86.538462 | 82.128205 | 0.461538
2 | 40.325203 | 44.154472 | 49.829268 | 0.406504
In [10]:
res_df['label'].value_counts()
Out[10]:
2    123
1     39
0     38
Name: label, dtype: int64
ⅲ-ⅰ. Finding the Optimal K: the Elbow Method
In [11]:
dist_list = []
for n in range(2, 16):
    model = KMeans(n_clusters=n)
    model.fit(df)
    dist_list.append(model.inertia_)
dist_list
Out[11]:
[212889.44245524294, 143391.59236035674, 104414.67534220174, 75399.61541401486, 58348.64136331504, 51575.27793107792, 44357.32664902663, 40649.64102453101, 37132.84983602549, 34877.838013837994, 32140.57637686386, 30214.964718614712, 28060.734110615635, 26841.910772642394]
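For reference, inertia_ is the sum of squared distances from each sample to its assigned centroid. A minimal sanity-check sketch, using the model left over from the loop (K = 15):

# Recompute inertia by hand: squared distance of each sample to its own centroid
X = df.to_numpy(dtype=float)
assigned = model.cluster_centers_[model.labels_]
((X - assigned) ** 2).sum()  # should match model.inertia_ up to float error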
In [12]:
sns.lineplot(x=list(range(2,16)), y=dist_list, marker='o')
Out[12]:
<AxesSubplot:>
ⅲ-ⅱ. Finding the Optimal K: the Silhouette Score
- Unlike inertia, the silhouette score also takes the distance between clusters into account (the farther apart the clusters, the better the clustering): for each sample, s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster.
In [13]:
from sklearn.metrics import silhouette_score
silhouette_score(df, model.labels_)
Out[13]:
0.33719682488062047
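The average above (computed here for the last model fitted in the elbow loop, K = 15) can be broken down per sample with silhouette_samples from the same module; a brief sketch:

from sklearn.metrics import silhouette_samples
sample_scores = silhouette_samples(df, model.labels_)  # one score per customer, in [-1, 1]
sample_scores.mean()  # equals silhouette_score(df, model.labels_)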
In [14]:
sil_list = []
for n in range(2, 16):
    model = KMeans(n_clusters=n)
    model.fit(df)
    sil_list.append(silhouette_score(df, model.labels_))
sil_list
Out[14]:
[0.29307334005502633, 0.383798873822341, 0.4052954330641215, 0.4440669204743008, 0.45205475380756527, 0.4347734443683834, 0.4333505967993175, 0.41541695889588964, 0.373801236698759, 0.36589323758883546, 0.36291744943287396, 0.360112430940247, 0.35380780876759266, 0.3403664220303368]
In [15]:
sns.lineplot(x=list(range(2, 16)), y=sil_list, marker='o')
Out[15]:
<AxesSubplot:>
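Rather than reading the peak off the plot, the best K can also be taken directly from the list (a sketch; numpy was imported above as np):

# Index of the highest silhouette score, shifted back into the K range (2..15)
int(np.argmax(sil_list)) + 2  # 6 for the run above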
Since the silhouette score is highest with 6 clusters, K = 6 can be taken as the optimal value.
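Note that KMeans starts from randomly initialized centroids, so scores and label numbering can shift slightly between runs. Passing the optional random_state parameter (not used in this notebook) makes the result reproducible, e.g.:

model = KMeans(n_clusters=6, random_state=42)  # any fixed seed works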
ⅳ. Re-modeling with the Optimal K
In [16]:
model = KMeans(n_clusters = 6)
model.fit(df)
df['label'] = model.labels_
df.head()
Out[16]:
CustomerID | Age | Annual Income (k$) | Spending Score (1-100) | Gender_Male | label
---|---|---|---|---|---
1 | 19 | 15 | 39 | 1 | 5
2 | 21 | 15 | 81 | 1 | 4
3 | 20 | 16 | 6 | 0 | 5
4 | 23 | 16 | 77 | 0 | 4
5 | 31 | 17 | 40 | 0 | 5
In [17]:
df.groupby('label').mean()
Out[17]:
label | Age | Annual Income (k$) | Spending Score (1-100) | Gender_Male
---|---|---|---|---
0 | 32.692308 | 86.538462 | 82.128205 | 0.461538
1 | 27.000000 | 56.657895 | 49.131579 | 0.342105
2 | 41.685714 | 88.228571 | 17.285714 | 0.571429
3 | 56.155556 | 53.377778 | 49.088889 | 0.444444
4 | 25.272727 | 25.727273 | 79.363636 | 0.409091
5 | 44.142857 | 25.142857 | 19.523810 | 0.380952
- Use visualizations to examine how each feature is distributed per label
In [18]:
sns.boxplot(data=df, x='label', y='Age')
Out[18]:
<AxesSubplot:xlabel='label', ylabel='Age'>
In [19]:
sns.boxplot(data=df, x='label', y='Annual Income (k$)')
Out[19]:
<AxesSubplot:xlabel='label', ylabel='Annual Income (k$)'>
In [20]:
sns.boxplot(data=df, x='label', y='Spending Score (1-100)')
Out[20]:
<AxesSubplot:xlabel='label', ylabel='Spending Score (1-100)'>
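The three boxplot cells above could equally be written as a single loop; a compact sketch:

# One boxplot per feature, grouped by cluster label
for col in ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']:
    sns.boxplot(data=df, x='label', y=col)
    plt.show()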
ⅴ. Principal Component Analysis (PCA)
In [21]:
df.drop('label', axis=1, inplace=True)
df.head()
Out[21]:
CustomerID | Age | Annual Income (k$) | Spending Score (1-100) | Gender_Male
---|---|---|---|---
1 | 19 | 15 | 39 | 1
2 | 21 | 15 | 81 | 1
3 | 20 | 16 | 6 | 0
4 | 23 | 16 | 77 | 0
5 | 31 | 17 | 40 | 0
In [22]:
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
pca.fit(df)
pca_data = pca.transform(df)
pca_df = pd.DataFrame(pca_data, columns=['PC1', 'PC2'])
pca_df
Out[22]:
 | PC1 | PC2
---|---|---
0 | -31.869945 | -33.001252
1 | 0.764494 | -56.842901
2 | -57.408276 | -13.124961
3 | -2.168543 | -53.478590
4 | -32.174085 | -30.388412
... | ... | ...
195 | 58.352515 | 31.017542
196 | 19.908001 | 66.446108
197 | 58.520804 | 38.346039
198 | 20.979130 | 79.376405
199 | 72.447693 | 41.811336

200 rows × 2 columns
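Before reading too much into a 2-D projection, it is worth checking how much of the original variance the two components retain; a sketch using PCA's standard attribute:

print(pca.explained_variance_ratio_)        # share of variance per component
print(pca.explained_variance_ratio_.sum())  # total variance kept by PC1 + PC2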
In [23]:
plt.figure(figsize=(20, 10))
sns.scatterplot(x=pca_df['PC1'], y=pca_df['PC2'], hue=model.labels_, palette='Set2', s=100)
Out[23]:
<AxesSubplot:xlabel='PC1', ylabel='PC2'>
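To see which original features drive each principal axis, the component loadings can be inspected (a sketch; components_ is a standard PCA attribute):

# Rows are the components, columns are the original features
pd.DataFrame(pca.components_, columns=df.columns, index=['PC1', 'PC2'])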