Decision Tree
ⅰ. Importing Modules & Inspecting the Data
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
df = pd.read_csv("Data/galaxy.csv")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1485 entries, 0 to 1484
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   BuyItNow              1485 non-null   int64
 1   startprice            1485 non-null   float64
 2   carrier               1179 non-null   object
 3   color                 892 non-null    object
 4   productline           1485 non-null   object
 5   noDescription         1485 non-null   object
 6   charCountDescription  1485 non-null   int64
 7   upperCaseDescription  1485 non-null   int64
 8   sold                  1485 non-null   int64
dtypes: float64(1), int64(4), object(4)
memory usage: 104.5+ KB
In [3]:
df.head()
Out[3]:
| | BuyItNow | startprice | carrier | color | productline | noDescription | charCountDescription | upperCaseDescription | sold |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 199.99 | None | White | Galaxy_S9 | contains description | 0 | 0 | 1 |
1 | 0 | 235.00 | None | NaN | Galaxy_Note9 | contains description | 0 | 0 | 0 |
2 | 0 | 199.99 | NaN | NaN | Unknown | no description | 100 | 2 | 0 |
3 | 1 | 175.00 | AT&T | Space Gray | Galaxy_Note9 | contains description | 0 | 0 | 1 |
4 | 1 | 100.00 | None | Space Gray | Galaxy_S8 | contains description | 0 | 0 | 1 |
In [4]:
df.describe()
Out[4]:
| | BuyItNow | startprice | charCountDescription | upperCaseDescription | sold |
---|---|---|---|---|---|
count | 1485.000000 | 1485.000000 | 1485.000000 | 1485.000000 | 1485.000000 |
mean | 0.449158 | 216.844162 | 31.184512 | 2.863300 | 0.461953 |
std | 0.497576 | 172.893308 | 41.744518 | 9.418585 | 0.498718 |
min | 0.000000 | 0.010000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 80.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 0.000000 | 198.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 1.000000 | 310.000000 | 79.000000 | 2.000000 | 1.000000 |
max | 1.000000 | 999.000000 | 111.000000 | 81.000000 | 1.000000 |
In [5]:
sns.displot(df['startprice'])
sns.displot(df['charCountDescription'])
Out[5]:
<seaborn.axisgrid.FacetGrid at 0x2254f2a7460>
In [6]:
plt.figure(figsize=(20, 10))
sns.boxplot(x='productline', y='startprice', data=df)
Out[6]:
<AxesSubplot:xlabel='productline', ylabel='startprice'>
ⅱ. Handling Missing Values and Outliers
In [7]:
df.isna().sum() / len(df)
Out[7]:
BuyItNow                0.000000
startprice              0.000000
carrier                 0.206061
color                   0.399327
productline             0.000000
noDescription           0.000000
charCountDescription    0.000000
upperCaseDescription    0.000000
sold                    0.000000
dtype: float64
In [8]:
df.head(10)
Out[8]:
| | BuyItNow | startprice | carrier | color | productline | noDescription | charCountDescription | upperCaseDescription | sold |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 199.99 | None | White | Galaxy_S9 | contains description | 0 | 0 | 1 |
1 | 0 | 235.00 | None | NaN | Galaxy_Note9 | contains description | 0 | 0 | 0 |
2 | 0 | 199.99 | NaN | NaN | Unknown | no description | 100 | 2 | 0 |
3 | 1 | 175.00 | AT&T | Space Gray | Galaxy_Note9 | contains description | 0 | 0 | 1 |
4 | 1 | 100.00 | None | Space Gray | Galaxy_S8 | contains description | 0 | 0 | 1 |
5 | 1 | 0.99 | NaN | White | Galaxy_S7 | contains description | 0 | 0 | 1 |
6 | 1 | 150.00 | None | White | Galaxy_S9 | contains description | 0 | 0 | 1 |
7 | 0 | 199.99 | None | Midnight Black | Galaxy_S9 | no description | 92 | 0 | 1 |
8 | 0 | 99.99 | None | White | Galaxy_S7 | contains description | 0 | 0 | 0 |
9 | 1 | 20.00 | AT&T | Midnight Black | Galaxy_S7 | no description | 96 | 41 | 1 |
In [9]:
# Replace every missing value (NaN) with "Unknown"
df.fillna("Unknown", inplace=True)
df.head(10)
Out[9]:
| | BuyItNow | startprice | carrier | color | productline | noDescription | charCountDescription | upperCaseDescription | sold |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 199.99 | None | White | Galaxy_S9 | contains description | 0 | 0 | 1 |
1 | 0 | 235.00 | None | Unknown | Galaxy_Note9 | contains description | 0 | 0 | 0 |
2 | 0 | 199.99 | Unknown | Unknown | Unknown | no description | 100 | 2 | 0 |
3 | 1 | 175.00 | AT&T | Space Gray | Galaxy_Note9 | contains description | 0 | 0 | 1 |
4 | 1 | 100.00 | None | Space Gray | Galaxy_S8 | contains description | 0 | 0 | 1 |
5 | 1 | 0.99 | Unknown | White | Galaxy_S7 | contains description | 0 | 0 | 1 |
6 | 1 | 150.00 | None | White | Galaxy_S9 | contains description | 0 | 0 | 1 |
7 | 0 | 199.99 | None | Midnight Black | Galaxy_S9 | no description | 92 | 0 | 1 |
8 | 0 | 99.99 | None | White | Galaxy_S7 | contains description | 0 | 0 | 0 |
9 | 1 | 20.00 | AT&T | Midnight Black | Galaxy_S7 | no description | 96 | 41 | 1 |
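Note that `carrier` already contains the literal string "None" alongside true missing values, so after the fill it keeps "None" and "Unknown" as two separate categories. The following is a minimal sketch on a toy frame (the values and the `text_cols` list are assumptions for illustration, not the galaxy data) that shows this and restricts the fill to the text columns.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values that mimic the carrier/color columns
toy = pd.DataFrame({
    "carrier": ["None", np.nan, "AT&T", np.nan],
    "color": ["White", np.nan, "Space Gray", "White"],
    "startprice": [199.99, 235.00, 175.00, 0.99],
})

# Fill only the text columns so a numeric column can never receive a string
text_cols = ["carrier", "color"]
filled = toy.copy()
filled[text_cols] = filled[text_cols].fillna("Unknown")

# The string "None" and the filled "Unknown" stay distinct categories
carrier_counts = filled["carrier"].value_counts().to_dict()
print(carrier_counts)
```

Whether "None" (no carrier lock) and "Unknown" (not recorded) should stay separate is a modelling choice; the notebook keeps them apart.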
ⅲ. One-hot Encoding
In [10]:
df[['carrier', 'color', 'productline', 'noDescription']].nunique()
Out[10]:
carrier          5
color            8
productline      8
noDescription    2
dtype: int64
In [11]:
print(df['carrier'].value_counts())
print()
print(df['color'].value_counts())
print()
print(df['productline'].value_counts())
print()
print(df['noDescription'].value_counts())
None               863
Unknown            306
AT&T               177
Verizon             87
Sprint/T-Mobile     52
Name: carrier, dtype: int64

Unknown           593
White             328
Midnight Black    274
Space Gray        180
Gold               52
Black              38
Aura Black         19
Prism Black         1
Name: color, dtype: int64

Galaxy_Note10    351
Galaxy_S8        277
Galaxy_S7        227
Unknown          204
Galaxy_S9        158
Galaxy_Note8     153
Galaxy_Note9     107
Galaxy_S10         8
Name: productline, dtype: int64

contains description    856
no description          629
Name: noDescription, dtype: int64
In [12]:
# Collapse every "... Black" variant (Midnight Black, Aura Black, Prism Black) into "Black"
df['color'] = df['color'].apply(lambda x: "Black" if x.endswith("Black") else x)
df['color'].value_counts()
Out[12]:
Unknown       593
Black         332
White         328
Space Gray    180
Gold           52
Name: color, dtype: int64
In [13]:
df = pd.get_dummies(df, columns = ['carrier', 'color', 'productline', 'noDescription'])
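To make the effect of `pd.get_dummies` concrete, here is a small sketch on a toy frame (hypothetical values, not the galaxy data). Each category becomes its own indicator column named `<column>_<value>`, which is where feature names such as `productline_Galaxy_S7` in the fitted tree come from.

```python
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "startprice": [199.99, 175.00, 100.00],
    "color": ["White", "Space Gray", "White"],
})

# One indicator column per category; the original "color" column is dropped
encoded = pd.get_dummies(toy, columns=["color"])
print(list(encoded.columns))
```

Because each indicator is 0 or 1, a tree split on such a column always appears as a `<= 0.5` threshold.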
ⅳ. Decision Tree Model
In [14]:
from sklearn.model_selection import train_test_split
X = df.drop('sold', axis=1)
y = df['sold']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)
In [15]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth = 10)
model.fit(X_train, y_train)
pred = model.predict(X_test)
pd.DataFrame([pred, y_test])
Out[15]:
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 287 | 288 | 289 | 290 | 291 | 292 | 293 | 294 | 295 | 296 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | ... | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |
2 rows × 297 columns
In [16]:
from sklearn.metrics import accuracy_score, confusion_matrix
accuracy_score(y_test, pred)
Out[16]:
0.7373737373737373
Parameter tuning
In [17]:
depth_list = []
for n in range(2, 31):
    model = DecisionTreeClassifier(max_depth = n)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    depth_list.append(accuracy_score(y_test, pred))
depth_list.index(max(depth_list))
# The result is a list index: the range starts at 2, so index 2 means the most accurate depth is 4
Out[17]:
2
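The loop above selects the depth by accuracy on the test set, so the score reported for that depth is likely to be somewhat optimistic. One common alternative, sketched below, is to choose the depth with cross-validation on the training split only. This is not what the notebook does: the data here is a synthetic stand-in from `make_classification`, and the depth range and fold count are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded galaxy features (assumption, not the real data)
X, y = make_classification(n_samples=500, n_features=10, random_state=1234)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234
)

depths = list(range(2, 11))
cv_scores = []
for depth in depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=1234)
    # Mean 5-fold accuracy, computed on the training split only
    cv_scores.append(cross_val_score(model, X_train, y_train, cv=5).mean())

# Map the position of the best score back to the actual depth value
best_depth = depths[cv_scores.index(max(cv_scores))]
print(best_depth)
```

The test split is then used once, after the depth is fixed, to report the final accuracy.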
In [18]:
model = DecisionTreeClassifier(max_depth = 4)
model.fit(X_train, y_train)
pred = model.predict(X_test)
accuracy_score(y_test, pred)
Out[18]:
0.7744107744107744
In [19]:
confusion_matrix(y_test, pred)
Out[19]:
array([[127,  19],
       [ 48, 103]], dtype=int64)
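The confusion matrix can be read directly: scikit-learn puts true labels in rows and predicted labels in columns, so the layout is `[[TN, FP], [FN, TP]]`. The short sketch below recomputes the accuracy from Out[18] and adds precision and recall for the `sold = 1` class, using only the four counts shown above.

```python
# Confusion matrix from Out[19]; rows are true labels, columns are predictions
cm = [[127, 19],
      [48, 103]]
tn, fp = cm[0]
fn, tp = cm[1]

total = tn + fp + fn + tp
accuracy = (tn + tp) / total   # share of all listings classified correctly
precision = tp / (tp + fp)     # of listings predicted as sold, share that sold
recall = tp / (tp + fn)        # of listings that sold, share that was caught

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
# 0.7744 0.8443 0.6821
```

The model misses 48 of the 151 listings that actually sold, so its errors lean toward predicting "not sold".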
ⅴ. Visualization
In [20]:
from sklearn.tree import plot_tree
plt.figure(figsize=(40,20))
# label='none' hides the per-node labels; recent scikit-learn versions reject label=None
plot_tree(model, feature_names = list(X_train.columns), fontsize=15, label = 'none', max_depth=4)
Out[20]:
[Text(1116.0, 978.48, 'BuyItNow <= 0.5\n0.495\n1188\n[653, 535]'), Text(558.0, 761.0400000000001, 'startprice <= 91.5\n0.33\n661\n[523, 138]'), Text(279.0, 543.6, 'productline_Galaxy_S7 <= 0.5\n0.481\n77\n[46, 31]'), Text(139.5, 326.1600000000001, 'charCountDescription <= 68.5\n0.393\n41\n[30, 11]'), Text(69.75, 108.72000000000003, '0.464\n30\n[19, 11]'), Text(209.25, 108.72000000000003, '0.0\n11\n[11, 0]'), Text(418.5, 326.1600000000001, 'startprice <= 54.995\n0.494\n36\n[16, 20]'), Text(348.75, 108.72000000000003, '0.219\n8\n[1, 7]'), Text(488.25, 108.72000000000003, '0.497\n28\n[15, 13]'), Text(837.0, 543.6, 'upperCaseDescription <= 6.5\n0.299\n584\n[477, 107]'), Text(697.5, 326.1600000000001, 'startprice <= 504.5\n0.318\n529\n[424, 105]'), Text(627.75, 108.72000000000003, '0.34\n456\n[357, 99]'), Text(767.25, 108.72000000000003, '0.151\n73\n[67, 6]'), Text(976.5, 326.1600000000001, 'charCountDescription <= 27.5\n0.07\n55\n[53, 2]'), Text(906.75, 108.72000000000003, '0.5\n2\n[1, 1]'), Text(1046.25, 108.72000000000003, '0.037\n53\n[52, 1]'), Text(1674.0, 761.0400000000001, 'startprice <= 111.0\n0.372\n527\n[130, 397]'), Text(1395.0, 543.6, 'startprice <= 63.5\n0.171\n308\n[29, 279]'), Text(1255.5, 326.1600000000001, 'startprice <= 3.0\n0.097\n215\n[11, 204]'), Text(1185.75, 108.72000000000003, '0.019\n107\n[1, 106]'), Text(1325.25, 108.72000000000003, '0.168\n108\n[10, 98]'), Text(1534.5, 326.1600000000001, 'productline_Galaxy_S7 <= 0.5\n0.312\n93\n[18, 75]'), Text(1464.75, 108.72000000000003, '0.202\n70\n[8, 62]'), Text(1604.25, 108.72000000000003, '0.491\n23\n[10, 13]'), Text(1953.0, 543.6, 'startprice <= 205.995\n0.497\n219\n[101, 118]'), Text(1813.5, 326.1600000000001, 'productline_Galaxy_S7 <= 0.5\n0.449\n103\n[35, 68]'), Text(1743.75, 108.72000000000003, '0.427\n97\n[30, 67]'), Text(1883.25, 108.72000000000003, '0.278\n6\n[5, 1]'), Text(2092.5, 326.1600000000001, 'productline_Galaxy_Note10 <= 0.5\n0.49\n116\n[66, 50]'), Text(2022.75, 108.72000000000003, 
'0.361\n55\n[42, 13]'), Text(2162.25, 108.72000000000003, '0.477\n61\n[24, 37]')]
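Each node in the output lists, in order, the split rule, the Gini impurity, the number of samples, and the class counts `[not sold, sold]`. As a check, the sketch below recomputes the impurity of the root node and its two children from the class counts printed above; the helper name `gini` is chosen for illustration.

```python
def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Class counts [not sold, sold] taken from the plot_tree output
root = gini([653, 535])    # root node, split on BuyItNow <= 0.5
left = gini([523, 138])    # BuyItNow == 0 branch
right = gini([130, 397])   # BuyItNow == 1 branch

# Impurity decrease of the first split, weighting children by sample share
n_root, n_left, n_right = 1188, 661, 527
decrease = root - (n_left / n_root * left + n_right / n_root * right)

print(round(root, 3), round(left, 3), round(right, 3))
# 0.495 0.33 0.372
```

The values match the impurities shown in the first three nodes, and the positive `decrease` is why `BuyItNow` is chosen as the first split.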