K Nearest Neighbour
ⅰ. Importing Modules & Checking the Data
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
df = pd.read_csv("Data/churn.csv")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   customerID        7043 non-null   object
 1   gender            7043 non-null   object
 2   SeniorCitizen     7043 non-null   int64
 3   Partner           7043 non-null   object
 4   Dependents        7043 non-null   object
 5   tenure            7043 non-null   int64
 6   PhoneService      7043 non-null   object
 7   MultipleLines     7043 non-null   object
 8   InternetService   7043 non-null   object
 9   OnlineSecurity    7043 non-null   object
 10  OnlineBackup      7043 non-null   object
 11  DeviceProtection  7043 non-null   object
 12  TechSupport       7043 non-null   object
 13  StreamingTV       7043 non-null   object
 14  StreamingMovies   7043 non-null   object
 15  Contract          7043 non-null   object
 16  PaperlessBilling  7043 non-null   object
 17  PaymentMethod     7043 non-null   object
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object
 20  Churn             7043 non-null   object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
In [3]:
pd.set_option('display.max_columns', 30)
pd.set_option('display.max_rows', 50)
df.head()
Out[3]:
| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | No |
2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
- Handle a field that should be numeric (float) but is stored as strings. Row 488, for example, has a blank TotalCharges.
In [4]:
df.iloc[488]
Out[4]:
customerID                         4472-LVYGI
gender                                 Female
SeniorCitizen                               0
Partner                                   Yes
Dependents                                Yes
tenure                                      0
PhoneService                               No
MultipleLines                No phone service
InternetService                           DSL
OnlineSecurity                            Yes
OnlineBackup                               No
DeviceProtection                          Yes
TechSupport                               Yes
StreamingTV                               Yes
StreamingMovies                            No
Contract                             Two year
PaperlessBilling                          Yes
PaymentMethod       Bank transfer (automatic)
MonthlyCharges                          52.55
TotalCharges                                 
Churn                                      No
Name: 488, dtype: object
In [5]:
# Replace values that are exactly " " with "", which pd.to_numeric parses as NaN
df["TotalCharges"] = pd.to_numeric(df['TotalCharges'].replace(" ", ""))
df.describe()
Out[5]:
| | SeniorCitizen | tenure | MonthlyCharges | TotalCharges |
---|---|---|---|---|
count | 7043.000000 | 7043.000000 | 7043.000000 | 7032.000000 |
mean | 0.162147 | 32.371149 | 64.761692 | 2283.300441 |
std | 0.368612 | 24.559481 | 30.090047 | 2266.771362 |
min | 0.000000 | 0.000000 | 18.250000 | 18.800000 |
25% | 0.000000 | 9.000000 | 35.500000 | 401.450000 |
50% | 0.000000 | 29.000000 | 70.350000 | 1397.475000 |
75% | 0.000000 | 55.000000 | 89.850000 | 3794.737500 |
max | 1.000000 | 72.000000 | 118.750000 | 8684.800000 |
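The conversion above works because `pd.to_numeric` treats an empty string as NaN, which is why the TotalCharges count drops to 7032. A more explicit alternative is `errors="coerce"`, which turns any unparseable entry into NaN. The following is a minimal sketch on a made-up three-row Series (the values are illustrative and the variable names are assumptions, not part of the original notebook):

```python
import pandas as pd

# Hypothetical sample mimicking TotalCharges: one entry is a single blank space.
raw = pd.Series(["29.85", " ", "1889.50"], name="TotalCharges")

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)         # float64
print(converted.isna().sum())  # 1
```

Coercion is convenient but silent, so it is worth checking `isna().sum()` afterwards to see how many values were lost.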
In [6]:
sns.displot(df['TotalCharges'])
Out[6]:
<seaborn.axisgrid.FacetGrid at 0x11749bf8340>
ⅱ. Handling Categorical Variables (One-hot Encoding)
- Replace the string values in the data with numeric indicator columns
In [7]:
df['customerID'].dtype
Out[7]:
dtype('O')
In [8]:
col_list = []
for col in df.columns:
    if df[col].dtype == 'O':
        col_list.append(col)
# customerID is a unique identifier, so exclude it from encoding
col_list = col_list[1:]
print(col_list)
['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'Churn']
In [9]:
df = pd.get_dummies(df, columns=col_list, drop_first=True)
In [10]:
df.head()
Out[10]:
| | customerID | SeniorCitizen | tenure | MonthlyCharges | TotalCharges | gender_Male | Partner_Yes | Dependents_Yes | PhoneService_Yes | MultipleLines_No phone service | MultipleLines_Yes | InternetService_Fiber optic | InternetService_No | OnlineSecurity_No internet service | OnlineSecurity_Yes | ... | DeviceProtection_No internet service | DeviceProtection_Yes | TechSupport_No internet service | TechSupport_Yes | StreamingTV_No internet service | StreamingTV_Yes | StreamingMovies_No internet service | StreamingMovies_Yes | Contract_One year | Contract_Two year | PaperlessBilling_Yes | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check | Churn_Yes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7590-VHVEG | 0 | 1 | 29.85 | 29.85 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
1 | 5575-GNVDE | 0 | 34 | 56.95 | 1889.50 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | 3668-QPYBK | 0 | 2 | 53.85 | 108.15 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
3 | 7795-CFOCW | 0 | 45 | 42.30 | 1840.75 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 9237-HQITU | 0 | 2 | 70.70 | 151.65 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
5 rows × 32 columns
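To see what `drop_first=True` does, here is a small sketch on a made-up frame. The column names mirror the churn data, but the rows are hypothetical. `dtype=int` is added as an assumption so the output is 0/1 regardless of pandas version, since newer versions return booleans by default.

```python
import pandas as pd

# Hypothetical toy frame; the column names mirror the churn data.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "Churn": ["Yes", "No", "No", "Yes"],
})

# drop_first=True removes the first (alphabetically sorted) level of each column,
# so k categories become k-1 indicator columns. dtype=int yields 0/1 values.
encoded = pd.get_dummies(toy, columns=["Contract", "Churn"], drop_first=True, dtype=int)

print(list(encoded.columns))
# ['Contract_One year', 'Contract_Two year', 'Churn_Yes']
```

A row with 0 in both Contract columns is a Month-to-month customer, so no information is lost by dropping that level.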
ⅲ. Handling Missing Values and Outliers
In [11]:
df.isna().sum()
Out[11]:
customerID                               0
SeniorCitizen                            0
tenure                                   0
MonthlyCharges                           0
TotalCharges                            11
gender_Male                              0
Partner_Yes                              0
Dependents_Yes                           0
PhoneService_Yes                         0
MultipleLines_No phone service           0
MultipleLines_Yes                        0
InternetService_Fiber optic              0
InternetService_No                       0
OnlineSecurity_No internet service       0
OnlineSecurity_Yes                       0
OnlineBackup_No internet service         0
OnlineBackup_Yes                         0
DeviceProtection_No internet service     0
DeviceProtection_Yes                     0
TechSupport_No internet service          0
TechSupport_Yes                          0
StreamingTV_No internet service          0
StreamingTV_Yes                          0
StreamingMovies_No internet service      0
StreamingMovies_Yes                      0
Contract_One year                        0
Contract_Two year                        0
PaperlessBilling_Yes                     0
PaymentMethod_Credit card (automatic)    0
PaymentMethod_Electronic check           0
PaymentMethod_Mailed check               0
Churn_Yes                                0
dtype: int64
In [12]:
print(df['TotalCharges'].mean())
print(df['TotalCharges'].median())
sns.displot(df['TotalCharges'])
2283.3004408418697
1397.475
Out[12]:
<seaborn.axisgrid.FacetGrid at 0x11749f9b250>
In [13]:
# The distribution is skewed to one side, so fill with the median rather than the arithmetic mean
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())
df.isna().sum()
Out[13]:
customerID                               0
SeniorCitizen                            0
tenure                                   0
MonthlyCharges                           0
TotalCharges                             0
gender_Male                              0
Partner_Yes                              0
Dependents_Yes                           0
PhoneService_Yes                         0
MultipleLines_No phone service           0
MultipleLines_Yes                        0
InternetService_Fiber optic              0
InternetService_No                       0
OnlineSecurity_No internet service       0
OnlineSecurity_Yes                       0
OnlineBackup_No internet service         0
OnlineBackup_Yes                         0
DeviceProtection_No internet service     0
DeviceProtection_Yes                     0
TechSupport_No internet service          0
TechSupport_Yes                          0
StreamingTV_No internet service          0
StreamingTV_Yes                          0
StreamingMovies_No internet service      0
StreamingMovies_Yes                      0
Contract_One year                        0
Contract_Two year                        0
PaperlessBilling_Yes                     0
PaymentMethod_Credit card (automatic)    0
PaymentMethod_Electronic check           0
PaymentMethod_Mailed check               0
Churn_Yes                                0
dtype: int64
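The gap between the mean (about 2283.30) and the median (1397.475) of TotalCharges is what motivates median imputation here. The effect can be sketched on a made-up right-skewed sample. The numbers below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample: one large value pulls the mean upward.
charges = pd.Series([10.0, 20.0, 30.0, 1000.0, np.nan])

mean_val = charges.mean()      # NaN is skipped: (10 + 20 + 30 + 1000) / 4 = 265.0
median_val = charges.median()  # middle of the sorted non-null values: 25.0

# Fill the missing entry with the median, as the notebook does for TotalCharges.
filled = charges.fillna(median_val)
print(mean_val, median_val, filled.isna().sum())
```

Filling with 265.0 would place the missing customer above three of the four observed values, while 25.0 sits in the middle of the typical range.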
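ⅳ. Scaling
- Standard Scaler: standardizes each feature to mean 0 and standard deviation 1
- Robust Scaler: centers on the median and scales by the interquartile range, so it is less affected by outliers
- Min-Max Scaler: rescales each feature linearly to the range 0 to 1, which preserves the shape of the distribution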
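The three scalers listed above can be compared side by side on a single feature. This is a minimal sketch on hypothetical data with one deliberate outlier. The array and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical single feature with one outlier (100).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

minmax_x = MinMaxScaler().fit_transform(x)      # (x - min) / (max - min)
standard_x = StandardScaler().fit_transform(x)  # (x - mean) / std
robust_x = RobustScaler().fit_transform(x)      # (x - median) / IQR

for name, scaled in [("minmax", minmax_x), ("standard", standard_x), ("robust", robust_x)]:
    print(name, np.round(scaled.ravel(), 3))
```

With min-max scaling the four ordinary values end up squeezed close to 0 because the outlier defines the maximum. The robust scaler keeps them spread between -1 and 0.5, at the cost of an unbounded output range.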
In [14]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
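# Drop customerID, which is only an identifier
df.drop('customerID', axis=1, inplace=True)
# Scale every column. The target Churn_Yes is included here, but it is already 0/1,
# so min-max scaling leaves it unchanged.
minmax = MinMaxScaler()
minmax.fit(df)
scaled_data = minmax.transform(df)
# DataFrame holding the scaled data
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)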
In [15]:
scaled_df
Out[15]:
| | SeniorCitizen | tenure | MonthlyCharges | TotalCharges | gender_Male | Partner_Yes | Dependents_Yes | PhoneService_Yes | MultipleLines_No phone service | MultipleLines_Yes | InternetService_Fiber optic | InternetService_No | OnlineSecurity_No internet service | OnlineSecurity_Yes | OnlineBackup_No internet service | ... | DeviceProtection_No internet service | DeviceProtection_Yes | TechSupport_No internet service | TechSupport_Yes | StreamingTV_No internet service | StreamingTV_Yes | StreamingMovies_No internet service | StreamingMovies_Yes | Contract_One year | Contract_Two year | PaperlessBilling_Yes | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check | Churn_Yes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.013889 | 0.115423 | 0.001275 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
1 | 0.0 | 0.472222 | 0.385075 | 0.215867 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
2 | 0.0 | 0.027778 | 0.354229 | 0.010310 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
3 | 0.0 | 0.625000 | 0.239303 | 0.210241 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.027778 | 0.521891 | 0.015330 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7038 | 0.0 | 0.333333 | 0.662189 | 0.227521 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
7039 | 0.0 | 1.000000 | 0.845274 | 0.847461 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
7040 | 0.0 | 0.152778 | 0.112935 | 0.037809 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
7041 | 1.0 | 0.055556 | 0.558706 | 0.033210 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
7042 | 0.0 | 0.916667 | 0.869652 | 0.787641 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
7043 rows × 31 columns
ⅴ. Modeling
In [16]:
from sklearn.model_selection import train_test_split
X = scaled_df.drop('Churn_Yes', axis=1)
y = scaled_df['Churn_Yes']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 1234)
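In this notebook the MinMaxScaler is fitted on the full DataFrame before the split, so the test rows contribute to the min and max used for scaling. A common alternative is to fit the scaler on the training rows only, which a scikit-learn pipeline handles automatically. The sketch below uses synthetic data from `make_classification` as a stand-in for the churn features. The `_demo`, `_tr` and `_te` names are assumptions, not part of the original code.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the churn features (not the real data).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# stratify keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1234, stratify=y_demo
)

# The pipeline fits the scaler on X_tr only, then reuses those min/max values on X_te.
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
model.fit(X_tr, y_tr)

print(model.score(X_te, y_te))
```

For a dataset of this size the difference in accuracy is usually small, but the pipeline form keeps the test split strictly unseen during preprocessing.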
In [17]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
Out[17]:
KNeighborsClassifier()
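With its default arguments, `KNeighborsClassifier` uses 5 neighbours, Euclidean distance, and an unweighted majority vote. The idea can be sketched from scratch in a few lines of NumPy. The function `knn_predict` and the toy arrays below are hypothetical and only illustrate the voting rule. This is not how scikit-learn implements neighbour search internally.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=5):
    """Predict the label of x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training row.
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours.
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.15, 0.15]), k=3))  # 0
print(knn_predict(X_toy, y_toy, np.array([0.95, 1.0]), k=3))   # 1
```

Because the prediction depends directly on distances, features on larger scales would dominate the vote, which is why the scaling step comes before modeling.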
In [18]:
pred = knn.predict(X_test)
pd.DataFrame({'actual_val': y_test, 'pred_val': pred}).head(10)
Out[18]:
| | actual_val | pred_val |
---|---|---|
6692 | 0.0 | 0.0 |
2624 | 0.0 | 0.0 |
1076 | 0.0 | 0.0 |
1428 | 1.0 | 1.0 |
7026 | 1.0 | 0.0 |
2886 | 0.0 | 0.0 |
3049 | 1.0 | 0.0 |
1032 | 1.0 | 0.0 |
6661 | 0.0 | 0.0 |
240 | 0.0 | 0.0 |
In [19]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
print(accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
0.7406530998580217
[[1289  274]
 [ 274  276]]
              precision    recall  f1-score   support
         0.0       0.82      0.82      0.82      1563
         1.0       0.50      0.50      0.50       550
    accuracy                           0.74      2113
   macro avg       0.66      0.66      0.66      2113
weighted avg       0.74      0.74      0.74      2113
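The figures in the report can be recomputed by hand from the confusion matrix, which makes it easier to see what each one means. The sketch below copies the matrix printed above. In scikit-learn's layout, rows are actual classes and columns are predicted classes. The variable names are assumptions for illustration.

```python
import numpy as np

# Confusion matrix printed above: rows are actual classes, columns are predicted classes.
cm = np.array([[1289, 274],
               [274, 276]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()           # 1565 / 2113
precision = tp / (tp + fp)                # churners among predicted churners
recall = tp / (tp + fn)                   # actual churners that the model caught
majority_baseline = (tn + fp) / cm.sum()  # accuracy of always predicting "no churn"

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(majority_baseline, 4))
# 0.7407 0.5018 0.5018 0.7397
```

The default model (k=5) is therefore only about 0.1 percentage points more accurate than always predicting "no churn", and it identifies roughly half of the actual churners. That is a reason to look at recall and precision for class 1.0 rather than accuracy alone.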
ⅵ. Optimizing K
In [20]:
acc_list = []
# Fit one model per k from 1 to 100 and record its test accuracy
for n in range(1, 101):
    knn = KNeighborsClassifier(n_neighbors = n)
    knn.fit(X_train, y_train)
    pred = knn.predict(X_test)
    acc_list.append(accuracy_score(y_test, pred))
acc_list[:10]
Out[20]:
[0.7061050638902036, 0.7496450544249882, 0.7269285376242309, 0.754850922858495, 0.7406530998580217, 0.7614765735920492, 0.7534311405584477, 0.7600567912920019, 0.7628963558920966, 0.7652626597255088]
In [21]:
plt.figure(figsize=(20, 10))
sns.lineplot(x = range(1, 101), y = acc_list, marker = 'o', markersize=5, markerfacecolor='red')
Out[21]:
<AxesSubplot:>
In [22]:
# Extract the highest accuracy and the list index of the corresponding K
print(max(acc_list))
print(acc_list.index(max(acc_list)))
0.7889256980596309
35
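The printed index is zero-based while k starts at 1, so the two differ by one. One way to avoid the manual offset is to keep the candidate k values in a list and look the best one up by position. This is a minimal sketch with made-up accuracies. `ks` and `demo_acc` are hypothetical names and values.

```python
import numpy as np

# Hypothetical accuracy list for k = 1..5 (values made up for illustration).
ks = list(range(1, 6))
demo_acc = [0.70, 0.75, 0.73, 0.78, 0.74]

best_idx = int(np.argmax(demo_acc))  # position of the highest accuracy
best_k = ks[best_idx]                # look up k instead of adding 1 by hand

print(best_idx, best_k)  # 3 4
```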
In [23]:
# The list was built from range(1, 101), so index 35 corresponds to k=36
knn = KNeighborsClassifier(n_neighbors = 36)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
print(accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
0.7889256980596309
[[1372  191]
 [ 255  295]]
              precision    recall  f1-score   support
         0.0       0.84      0.88      0.86      1563
         1.0       0.61      0.54      0.57       550
    accuracy                           0.79      2113
   macro avg       0.73      0.71      0.71      2113
weighted avg       0.78      0.79      0.78      2113
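Because k=36 was chosen by comparing accuracy on the test set itself, the 0.789 figure is likely to be somewhat optimistic. A common alternative is to choose k by cross-validation on the training split and to use the test split only once at the end. The sketch below shows that approach with `GridSearchCV` on synthetic data. The data, the search range of 1 to 40, and the variable names are all assumptions, and the results will not match the churn numbers above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the churn features (not the real data).
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1234)

# 5-fold cross-validation on the training split only; the test split stays untouched.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 41))},
    scoring="accuracy",
    cv=5,
)
search.fit(X_tr, y_tr)

print(search.best_params_["n_neighbors"], round(search.best_score_, 3))
# Evaluate once on the held-out test split with the refitted best model.
print(round(search.score(X_te, y_te), 3))
```

Passing a different `scoring` value, such as `"recall"` or `"f1"`, would select k for the churn class rather than for overall accuracy.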