ch03_KNN

K Nearest Neighbour

ⅰ. Importing Modules & Checking the Data

In [1]:

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

In [2]:

df = pd.read_csv("Data/churn.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
 17  PaymentMethod     7043 non-null   object 
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object 
 20  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB

In [3]:

pd.set_option('display.max_columns', 30)
pd.set_option('display.max_rows', 50)

df.head()

Out[3]:

	customerID	gender	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
0	7590-VHVEG	Female	Yes	No	1	No	No phone service	DSL	No	Yes	No	No	No	No	Month-to-month	Yes	Electronic check	29.85	29.85	No
1	5575-GNVDE	Male	No	No	34	Yes	No	DSL	Yes	No	Yes	No	No	No	One year	No	Mailed check	56.95	1889.50	No
2	3668-QPYBK	Male	No	No	2	Yes	No	DSL	Yes	Yes	No	No	No	No	Month-to-month	Yes	Mailed check	53.85	108.15	Yes
3	7795-CFOCW	Male	No	No	45	No	No phone service	DSL	Yes	No	Yes	Yes	No	No	One year	No	Bank transfer (automatic)	42.30	1840.75	No
4	9237-HQITU	Female	No	No	2	Yes	No	Fiber optic	No	No	No	No	No	No	Month-to-month	Yes	Electronic check	70.70	151.65	Yes

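TotalCharges should be numeric (float) but is stored as strings, so it has to be converted. Row 488 shows why: its TotalCharges value is blank.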

In [4]:

df.iloc[488]

Out[4]:

customerID                         4472-LVYGI
gender                                 Female
SeniorCitizen                               0
Partner                                   Yes
Dependents                                Yes
tenure                                      0
PhoneService                               No
MultipleLines                No phone service
InternetService                           DSL
OnlineSecurity                            Yes
OnlineBackup                               No
DeviceProtection                          Yes
TechSupport                               Yes
StreamingTV                               Yes
StreamingMovies                            No
Contract                             Two year
PaperlessBilling                          Yes
PaymentMethod       Bank transfer (automatic)
MonthlyCharges                          52.55
TotalCharges                                 
Churn                                      No
Name: 488, dtype: object

In [5]:

# Blank strings (" ") are replaced with "" and come out of to_numeric as NaN
df["TotalCharges"] = pd.to_numeric(df['TotalCharges'].replace(" ", ""))

df.describe()

Out[5]:

	SeniorCitizen	tenure	MonthlyCharges	TotalCharges
count	7043.000000	7043.000000	7043.000000	7032.000000
mean	0.162147	32.371149	64.761692	2283.300441
std	0.368612	24.559481	30.090047	2266.771362
min	0.000000	0.000000	18.250000	18.800000
25%	0.000000	9.000000	35.500000	401.450000
50%	0.000000	29.000000	70.350000	1397.475000
75%	0.000000	55.000000	89.850000	3794.737500
max	1.000000	72.000000	118.750000	8684.800000
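The TotalCharges count of 7032 (against 7043 rows) shows that the blank strings ended up as NaN. As a minimal sketch of the same idea on a toy column (the `toy` frame and its values are made up for illustration), `errors="coerce"` makes that behaviour explicit: anything that cannot be parsed becomes NaN.

```python
import pandas as pd

# Hypothetical toy column: two valid numbers and one blank string
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5"]})

# Count blank entries before converting, so nothing is lost silently
blank_mask = toy["TotalCharges"].str.strip() == ""

# Unparseable values (here the blank) are turned into NaN
converted = pd.to_numeric(toy["TotalCharges"], errors="coerce")

print(blank_mask.sum())        # number of blank strings
print(converted.isna().sum())  # number of NaN values after conversion
```

Counting the blanks first is a cheap check that the NaN count after conversion comes only from blanks and not from other malformed values.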

In [6]:

sns.displot(df['TotalCharges'])

Out[6]:

<seaborn.axisgrid.FacetGrid at 0x11749bf8340>

ⅱ. Handling Categorical Variables (One-hot Encoding)

Convert the string values in the data to numbers.

In [7]:

df['customerID'].dtype

Out[7]:

dtype('O')

In [8]:

col_list = []

for col in df.columns:
    if df[col].dtype == 'O':
        col_list.append(col)
        
# Exclude customerID (the first entry): every value is unique
col_list = col_list[1:]
print(col_list)

['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn']
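The loop above relies on customerID being the first object column. A possible alternative, sketched here on a made-up toy frame, is to select the non-numeric columns with `select_dtypes` and drop the identifier by name, so the result does not depend on column order.

```python
import pandas as pd

# Hypothetical toy frame with the same kinds of columns as the churn data
toy = pd.DataFrame({
    "customerID": ["a-1", "b-2", "c-3"],
    "tenure": [1, 34, 2],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
    "Churn": ["No", "No", "Yes"],
})

# Keep the non-numeric columns, then remove the identifier by name
cat_cols = (
    toy.select_dtypes(exclude="number")
       .columns
       .drop("customerID")
       .tolist()
)
print(cat_cols)
```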

In [9]:

df = pd.get_dummies(df, columns=col_list, drop_first=True)

In [10]:

df.head()

Out[10]:

	customerID	tenure	MonthlyCharges	TotalCharges	gender_Male	Partner_Yes	PhoneService_Yes	MultipleLines_No phone service	InternetService_Fiber optic	OnlineSecurity_Yes	...	DeviceProtection_Yes	TechSupport_Yes	Contract_One year	PaperlessBilling_Yes	PaymentMethod_Electronic check	PaymentMethod_Mailed check	Churn_Yes
0	7590-VHVEG	1	29.85	29.85	0	1	0	1	0	0	...	0	0	0	1	1	0	0
1	5575-GNVDE	34	56.95	1889.50	1	0	1	0	0	1	...	1	0	1	0	0	1	0
2	3668-QPYBK	2	53.85	108.15	1	0	1	0	0	1	...	0	0	0	1	0	1	1
3	7795-CFOCW	45	42.30	1840.75	1	0	0	1	0	1	...	1	1	1	0	0	0	0
4	9237-HQITU	2	70.70	151.65	0	0	1	0	1	0	...	0	0	0	1	1	0	1

5 rows × 32 columns
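To see what `drop_first=True` does, here is a small sketch on a single made-up column. A column with three categories produces two dummy columns; the first category (in sorted order) becomes the baseline and is encoded as all zeros. `dtype=int` is passed only so the toy output shows 0/1, since newer pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical toy column with three categories
toy = pd.DataFrame(
    {"Contract": ["Month-to-month", "One year", "Two year", "One year"]}
)

# Three categories -> two dummy columns; "Month-to-month" is the dropped baseline
encoded = pd.get_dummies(toy, columns=["Contract"], drop_first=True, dtype=int)
print(encoded)
```

This is why the encoded frame has `Contract_One year` and `Contract_Two year` but no column for month-to-month contracts.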

ⅲ. Handling Missing Values / Outliers

In [11]:

df.isna().sum()

Out[11]:

customerID                                0
SeniorCitizen                             0
tenure                                    0
MonthlyCharges                            0
TotalCharges                             11
gender_Male                               0
Partner_Yes                               0
Dependents_Yes                            0
PhoneService_Yes                          0
MultipleLines_No phone service            0
MultipleLines_Yes                         0
InternetService_Fiber optic               0
InternetService_No                        0
OnlineSecurity_No internet service        0
OnlineSecurity_Yes                        0
OnlineBackup_No internet service          0
OnlineBackup_Yes                          0
DeviceProtection_No internet service      0
DeviceProtection_Yes                      0
TechSupport_No internet service           0
TechSupport_Yes                           0
StreamingTV_No internet service           0
StreamingTV_Yes                           0
StreamingMovies_No internet service       0
StreamingMovies_Yes                       0
Contract_One year                         0
Contract_Two year                         0
PaperlessBilling_Yes                      0
PaymentMethod_Credit card (automatic)     0
PaymentMethod_Electronic check            0
PaymentMethod_Mailed check                0
Churn_Yes                                 0
dtype: int64

In [12]:

print(df['TotalCharges'].mean())
print(df['TotalCharges'].median())

sns.displot(df['TotalCharges'])

2283.3004408418697
1397.475

Out[12]:

<seaborn.axisgrid.FacetGrid at 0x11749f9b250>

In [13]:

# The distribution is skewed to one side, so fill with the median rather than the arithmetic mean
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())

df.isna().sum()

Out[13]:

customerID                               0
SeniorCitizen                            0
tenure                                   0
MonthlyCharges                           0
TotalCharges                             0
gender_Male                              0
Partner_Yes                              0
Dependents_Yes                           0
PhoneService_Yes                         0
MultipleLines_No phone service           0
MultipleLines_Yes                        0
InternetService_Fiber optic              0
InternetService_No                       0
OnlineSecurity_No internet service       0
OnlineSecurity_Yes                       0
OnlineBackup_No internet service         0
OnlineBackup_Yes                         0
DeviceProtection_No internet service     0
DeviceProtection_Yes                     0
TechSupport_No internet service          0
TechSupport_Yes                          0
StreamingTV_No internet service          0
StreamingTV_Yes                          0
StreamingMovies_No internet service      0
StreamingMovies_Yes                      0
Contract_One year                        0
Contract_Two year                        0
PaperlessBilling_Yes                     0
PaymentMethod_Credit card (automatic)    0
PaymentMethod_Electronic check           0
PaymentMethod_Mailed check               0
Churn_Yes                                0
dtype: int64
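The gap between the mean (about 2283) and the median (about 1397) is the sign of skew that motivates the median here. A minimal sketch with made-up numbers shows how a single large value pulls the mean but not the median:

```python
import numpy as np
import pandas as pd

# Hypothetical skewed series with one missing value
charges = pd.Series([10.0, 20.0, 30.0, 40.0, 1000.0, np.nan])

mean_value = charges.mean()      # pulled up by the 1000.0 entry
median_value = charges.median()  # unaffected by the size of the largest value

# Fill the missing entry with the median, as in the cell above
filled = charges.fillna(median_value)
print(mean_value, median_value)
print(filled.isna().sum())
```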

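ⅳ. Scaling

Standard Scaler: rescales each feature to mean 0 and standard deviation 1
Robust Scaler: less affected by outliers
Min-Max Scaler: distorts the distribution of the data the least; rescales values to between 0 and 1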
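As a rough sketch of what each scaler computes, the three formulas can be written out with NumPy on a made-up array that contains one outlier. This mirrors the default behaviour of the scikit-learn classes (population standard deviation for the standard scaler, median and interquartile range for the robust scaler), but it is an illustration rather than the library code.

```python
import numpy as np

# Hypothetical feature with one outlier (100.0)
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-max: (x - min) / (max - min), values end up in [0, 1]
minmax = (x - x.min()) / (x.max() - x.min())

# Standard: (x - mean) / std, result has mean 0 and std 1
standard = (x - x.mean()) / x.std()

# Robust: (x - median) / IQR, so the outlier does not set the scale
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

print(minmax)
print(standard)
print(robust)
```

With min-max scaling the outlier squeezes the other four values close to 0, while the robust version keeps them spread between -1 and 0.5.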

In [14]:

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Drop customerID, which is only an identifier
df.drop('customerID', axis=1, inplace=True)

# Scale every column, including the target: Churn_Yes is already 0/1, so min-max scaling leaves its values unchanged
minmax = MinMaxScaler()
minmax.fit(df)
scaled_data = minmax.transform(df)

# Put the scaled data back into a DataFrame
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)

In [15]:

scaled_df

Out[15]:

	SeniorCitizen	tenure	MonthlyCharges	TotalCharges	gender_Male	Partner_Yes	Dependents_Yes	PhoneService_Yes	MultipleLines_No phone service	MultipleLines_Yes	InternetService_Fiber optic	InternetService_No	OnlineSecurity_No internet service	OnlineSecurity_Yes	OnlineBackup_No internet service	...	DeviceProtection_No internet service	DeviceProtection_Yes	TechSupport_No internet service	TechSupport_Yes	StreamingTV_No internet service	StreamingTV_Yes	StreamingMovies_No internet service	StreamingMovies_Yes	Contract_One year	Contract_Two year	PaperlessBilling_Yes	PaymentMethod_Credit card (automatic)	PaymentMethod_Electronic check	PaymentMethod_Mailed check	Churn_Yes
0	0.0	0.013889	0.115423	0.001275	0.0	1.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	1.0	0.0	0.0
1	0.0	0.472222	0.385075	0.215867	1.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	...	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0
2	0.0	0.027778	0.354229	0.010310	1.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	1.0	1.0
3	0.0	0.625000	0.239303	0.210241	1.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0	...	0.0	1.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0
4	0.0	0.027778	0.521891	0.015330	0.0	0.0	0.0	1.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	1.0	0.0	1.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
7038	0.0	0.333333	0.662189	0.227521	1.0	1.0	1.0	1.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0	...	0.0	1.0	0.0	1.0	0.0	1.0	0.0	1.0	1.0	0.0	1.0	0.0	0.0	1.0	0.0
7039	0.0	1.000000	0.845274	0.847461	0.0	1.0	1.0	1.0	0.0	1.0	1.0	0.0	0.0	0.0	0.0	...	0.0	1.0	0.0	0.0	0.0	1.0	0.0	1.0	1.0	0.0	1.0	1.0	0.0	0.0	0.0
7040	0.0	0.152778	0.112935	0.037809	0.0	1.0	1.0	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	1.0	0.0	0.0
7041	1.0	0.055556	0.558706	0.033210	1.0	1.0	0.0	1.0	0.0	1.0	1.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	1.0	1.0
7042	0.0	0.916667	0.869652	0.787641	1.0	0.0	0.0	1.0	0.0	0.0	1.0	0.0	0.0	1.0	0.0	...	0.0	1.0	0.0	1.0	0.0	1.0	0.0	1.0	0.0	1.0	1.0	0.0	0.0	0.0	0.0

7043 rows × 31 columns

ⅴ. Modeling

In [16]:

from sklearn.model_selection import train_test_split

X = scaled_df.drop('Churn_Yes', axis=1)
y = scaled_df['Churn_Yes']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 1234)
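In this notebook the scaler was fit on the full data set before the split, so the test rows contribute to the min and max used for scaling. One common variation, sketched below on synthetic data (the arrays and `stratify=y` are assumptions for illustration, not part of the original), is to split first and fit the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: 200 rows, 3 numeric features, binary target
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# Split first; stratify keeps the class ratio similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1234, stratify=y
)

# Fit the scaler on the training rows only, then apply it to both parts
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.min(), X_train_scaled.max())
print(X_test_scaled.min(), X_test_scaled.max())
```

Test values may fall slightly outside [0, 1] with this approach, which is expected because the test rows were not used to set the range.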

In [17]:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

Out[17]:

KNeighborsClassifier()
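With default settings, `KNeighborsClassifier` looks at the 5 nearest training points by Euclidean distance and takes a majority vote. The idea can be sketched in a few lines of NumPy; `knn_predict` and the toy points below are hypothetical and only illustrate the vote, not the library's optimized implementation.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    """Predict the label of x_query by majority vote among its k nearest neighbours."""
    # Euclidean distance from the query point to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Count labels among those neighbours and return the most frequent one
    votes = np.bincount(y_train[nearest])
    return int(votes.argmax())

# Hypothetical toy data: three points near the origin (class 0), three near (1, 1) (class 1)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [0.9, 1.0], [1.0, 0.9], [0.8, 0.8]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([0.15, 0.15]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([0.85, 0.90]), k=3)
print(pred_a, pred_b)  # 0 1
```

Because the prediction depends directly on distances, features on larger scales would dominate the vote, which is why the scaling step comes first.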

In [18]:

pred = knn.predict(X_test) 

pd.DataFrame({'actual_val': y_test, 'pred_val': pred}).head(10)

Out[18]:

	actual_val	pred_val
6692	0.0	0.0
2624	0.0	0.0
1076	0.0	0.0
1428	1.0	1.0
7026	1.0	0.0
2886	0.0	0.0
3049	1.0	0.0
1032	1.0	0.0
6661	0.0	0.0
240	0.0	0.0

In [19]:

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

print(accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))

0.7406530998580217
[[1289  274]
 [ 274  276]]
              precision    recall  f1-score   support

         0.0       0.82      0.82      0.82      1563
         1.0       0.50      0.50      0.50       550

    accuracy                           0.74      2113
   macro avg       0.66      0.66      0.66      2113
weighted avg       0.74      0.74      0.74      2113
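The numbers in the report can be reproduced by hand from the confusion matrix, where rows are actual classes and columns are predicted classes. The sketch below also computes the accuracy of always predicting "no churn", which is a useful reference point when reading the 0.74 figure.

```python
import numpy as np

# Confusion matrix printed above: rows = actual, columns = predicted
cm = np.array([[1289, 274],
               [274, 276]])
tn, fp = cm[0]  # actual 0: predicted 0, predicted 1
fn, tp = cm[1]  # actual 1: predicted 0, predicted 1

accuracy = (tn + tp) / cm.sum()
precision = tp / (tp + fp)  # of predicted churners, how many really churned
recall = tp / (tp + fn)     # of real churners, how many were found

# Accuracy of a model that predicts class 0 for every customer
baseline = cm[0].sum() / cm.sum()

print(round(accuracy, 4), round(precision, 2), round(recall, 2))
print(round(baseline, 4))
```

The baseline comes out at roughly 0.74 as well, so with k=5 the overall accuracy is about the same as always predicting the majority class; the class 1 precision and recall are the more informative figures here.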

ⅵ. Optimizing K

In [20]:

acc_list = []

for n in range(1, 101):
    knn = KNeighborsClassifier(n_neighbors = n)
    knn.fit(X_train, y_train)
    pred = knn.predict(X_test)
    acc_list.append(accuracy_score(y_test, pred))
    
acc_list[:10]

Out[20]:

[0.7061050638902036,
 0.7496450544249882,
 0.7269285376242309,
 0.754850922858495,
 0.7406530998580217,
 0.7614765735920492,
 0.7534311405584477,
 0.7600567912920019,
 0.7628963558920966,
 0.7652626597255088]

In [21]:

plt.figure(figsize=(20, 10))
sns.lineplot(x = range(1, 101), y = acc_list, marker = 'o', markersize=5, markerfacecolor='red')

Out[21]:

<AxesSubplot:>

In [22]:

# Extract the highest accuracy and the list index where it occurs
print(max(acc_list))
print(acc_list.index(max(acc_list)))

0.7889256980596309
35
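The printed index is zero-based, while the K values start at 1, so the index has to be mapped back to K. A small sketch using only the first ten accuracies from Out[20] (so the best K here differs from the full run) keeps the K values in their own list to avoid the off-by-one step:

```python
import numpy as np

# K values tried and the first ten accuracies from Out[20]
k_values = list(range(1, 11))
acc_list = [0.7061050638902036, 0.7496450544249882, 0.7269285376242309,
            0.754850922858495, 0.7406530998580217, 0.7614765735920492,
            0.7534311405584477, 0.7600567912920019, 0.7628963558920966,
            0.7652626597255088]

best_index = int(np.argmax(acc_list))  # zero-based position of the best accuracy
best_k = k_values[best_index]          # the K value stored at that position
print(best_index, best_k)  # 9 10
```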

In [23]:

# The list was built from range(1, 101), so index 35 corresponds to k=36
knn = KNeighborsClassifier(n_neighbors = 36)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)

print(accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))

0.7889256980596309
[[1372  191]
 [ 255  295]]
              precision    recall  f1-score   support

         0.0       0.84      0.88      0.86      1563
         1.0       0.61      0.54      0.57       550

    accuracy                           0.79      2113
   macro avg       0.73      0.71      0.71      2113
weighted avg       0.78      0.79      0.78      2113
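Here K was chosen by looking at test accuracy, so the 0.789 figure is likely somewhat optimistic for unseen data. One way to avoid that, sketched below on synthetic data (the data set from `make_classification`, the grid of 1 to 30, and 5 folds are all assumptions for illustration), is to pick K by cross-validation on the training set and use the test set only once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the churn features and target
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1234, stratify=y
)

# Try K = 1..30 with 5-fold cross-validation on the training set only
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 31))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_k = search.best_params_["n_neighbors"]
# The test set is touched once, after K has been fixed
test_accuracy = search.score(X_test, y_test)
print(best_k, round(test_accuracy, 3))
```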



CheeseChaser

E-Commerce Part Ⅲ: Applying the KNN Model
