ch04_Decision Tree

Decision Tree

ⅰ. Importing Modules & Inspecting the Data

In [1]:

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

In [2]:

df = pd.read_csv("Data/galaxy.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1485 entries, 0 to 1484
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   BuyItNow              1485 non-null   int64  
 1   startprice            1485 non-null   float64
 2   carrier               1179 non-null   object 
 3   color                 892 non-null    object 
 4   productline           1485 non-null   object 
 5   noDescription         1485 non-null   object 
 6   charCountDescription  1485 non-null   int64  
 7   upperCaseDescription  1485 non-null   int64  
 8   sold                  1485 non-null   int64  
dtypes: float64(1), int64(4), object(4)
memory usage: 104.5+ KB

In [3]:

df.head()

Out[3]:

	BuyItNow	startprice	carrier	color	productline	noDescription	charCountDescription	upperCaseDescription	sold
0	0	199.99	None	White	Galaxy_S9	contains description	0	0	1
1	0	235.00	None	NaN	Galaxy_Note9	contains description	0	0	0
2	0	199.99	NaN	NaN	Unknown	no description	100	2	0
3	1	175.00	AT&T	Space Gray	Galaxy_Note9	contains description	0	0	1
4	1	100.00	None	Space Gray	Galaxy_S8	contains description	0	0	1

In [4]:

df.describe()

Out[4]:

	BuyItNow	startprice	charCountDescription	upperCaseDescription	sold
count	1485.000000	1485.000000	1485.000000	1485.000000	1485.000000
mean	0.449158	216.844162	31.184512	2.863300	0.461953
std	0.497576	172.893308	41.744518	9.418585	0.498718
min	0.000000	0.010000	0.000000	0.000000	0.000000
25%	0.000000	80.000000	0.000000	0.000000	0.000000
50%	0.000000	198.000000	0.000000	0.000000	0.000000
75%	1.000000	310.000000	79.000000	2.000000	1.000000
max	1.000000	999.000000	111.000000	81.000000	1.000000

In [5]:

sns.displot(df['startprice'])
sns.displot(df['charCountDescription'])

Out[5]:

<seaborn.axisgrid.FacetGrid at 0x2254f2a7460>

In [6]:

plt.figure(figsize=(20, 10))
sns.boxplot(x='productline', y='startprice', data=df)

Out[6]:

<AxesSubplot:xlabel='productline', ylabel='startprice'>

ⅱ. Handling Missing Values and Outliers

In [7]:

df.isna().sum() / len(df)

Out[7]:

BuyItNow                0.000000
startprice              0.000000
carrier                 0.206061
color                   0.399327
productline             0.000000
noDescription           0.000000
charCountDescription    0.000000
upperCaseDescription    0.000000
sold                    0.000000
dtype: float64

In [8]:

df.head(10)

Out[8]:

	BuyItNow	startprice	carrier	color	productline	noDescription	charCountDescription	upperCaseDescription	sold
0	0	199.99	None	White	Galaxy_S9	contains description	0	0	1
1	0	235.00	None	NaN	Galaxy_Note9	contains description	0	0	0
2	0	199.99	NaN	NaN	Unknown	no description	100	2	0
3	1	175.00	AT&T	Space Gray	Galaxy_Note9	contains description	0	0	1
4	1	100.00	None	Space Gray	Galaxy_S8	contains description	0	0	1
5	1	0.99	NaN	White	Galaxy_S7	contains description	0	0	1
6	1	150.00	None	White	Galaxy_S9	contains description	0	0	1
7	0	199.99	None	Midnight Black	Galaxy_S9	no description	92	0	1
8	0	99.99	None	White	Galaxy_S7	contains description	0	0	0
9	1	20.00	AT&T	Midnight Black	Galaxy_S7	no description	96	41	1

In [9]:

# Replace every missing value with "Unknown"
df.fillna("Unknown", inplace=True)
df.head(10)

Out[9]:

	BuyItNow	startprice	carrier	color	productline	noDescription	charCountDescription	upperCaseDescription	sold
0	0	199.99	None	White	Galaxy_S9	contains description	0	0	1
1	0	235.00	None	Unknown	Galaxy_Note9	contains description	0	0	0
2	0	199.99	Unknown	Unknown	Unknown	no description	100	2	0
3	1	175.00	AT&T	Space Gray	Galaxy_Note9	contains description	0	0	1
4	1	100.00	None	Space Gray	Galaxy_S8	contains description	0	0	1
5	1	0.99	Unknown	White	Galaxy_S7	contains description	0	0	1
6	1	150.00	None	White	Galaxy_S9	contains description	0	0	1
7	0	199.99	None	Midnight Black	Galaxy_S9	no description	92	0	1
8	0	99.99	None	White	Galaxy_S7	contains description	0	0	0
9	1	20.00	AT&T	Midnight Black	Galaxy_S7	no description	96	41	1
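Calling `fillna("Unknown")` on the whole frame works here because only `carrier` and `color` contain missing values. If a numeric column ever had gaps, the same call would write strings into it. The sketch below restricts the fill to the text columns; the toy frame and its values are made up for illustration and are not taken from `galaxy.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the listing data (illustrative values only).
toy = pd.DataFrame({
    "startprice": [199.99, np.nan, 100.0],
    "carrier": ["None", np.nan, "AT&T"],
    "color": ["White", np.nan, np.nan],
})

# Fill only the categorical text columns; numeric NaN values are left alone.
text_cols = ["carrier", "color"]
toy[text_cols] = toy[text_cols].fillna("Unknown")

print(toy)
```

With this approach a missing `startprice` stays `NaN`, so it can be handled separately (for example with a median) instead of silently turning the column into mixed types.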

ⅲ. One-hot Encoding

In [10]:

df[['carrier', 'color', 'productline', 'noDescription']].nunique()

Out[10]:

carrier          5
color            8
productline      8
noDescription    2
dtype: int64

In [11]:

print(df['carrier'].value_counts())
print()
print(df['color'].value_counts())
print()
print(df['productline'].value_counts())
print()
print(df['noDescription'].value_counts())

None               863
Unknown            306
AT&T               177
Verizon             87
Sprint/T-Mobile     52
Name: carrier, dtype: int64

Unknown           593
White             328
Midnight Black    274
Space Gray        180
Gold               52
Black              38
Aura Black         19
Prism Black         1
Name: color, dtype: int64

Galaxy_Note10    351
Galaxy_S8        277
Galaxy_S7        227
Unknown          204
Galaxy_S9        158
Galaxy_Note8     153
Galaxy_Note9     107
Galaxy_S10         8
Name: productline, dtype: int64

contains description    856
no description          629
Name: noDescription, dtype: int64

In [12]:

# Merge every color name ending in "Black" (Midnight Black, Aura Black, ...) into "Black"
df['color'] = df['color'].apply(lambda x: "Black" if x.endswith("Black") else x)
df['color'].value_counts()

Out[12]:

Unknown       593
Black         332
White         328
Space Gray    180
Gold           52
Name: color, dtype: int64

In [13]:

df = pd.get_dummies(df, columns = ['carrier', 'color', 'productline', 'noDescription'])
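To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a three-row toy frame. The column names mirror the listing data, but the rows are illustrative assumptions rather than records from the dataset.

```python
import pandas as pd

# Toy rows; column names follow the notebook, values are illustrative.
toy = pd.DataFrame({
    "startprice": [199.99, 175.0, 100.0],
    "carrier": ["None", "AT&T", "Unknown"],
})

# Each category of `carrier` becomes its own indicator column,
# named "<column>_<category>"; untouched columns are kept as they are.
encoded = pd.get_dummies(toy, columns=["carrier"])

print(list(encoded.columns))
```

The indicator columns follow the `<column>_<category>` naming pattern, which is why names such as `productline_Galaxy_S7` appear as split features later in the tree.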

ⅳ. Decision Tree Model

In [14]:

from sklearn.model_selection import train_test_split

X = df.drop('sold', axis=1)
y = df['sold']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

In [15]:

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth = 10)
model.fit(X_train, y_train)

pred = model.predict(X_test)

pd.DataFrame([pred, y_test])

Out[15]:

	0	1	2	3	4	5	6	7	8	9	...	287	288	289	290	291	292	293	294	295	296
0	1	1	0	0	1	1	0	0	1	0	...	1	1	0	0	1	0	1	0	1	0
1	1	0	1	0	1	1	0	0	1	0	...	0	1	1	0	1	1	1	1	1	0

2 rows × 297 columns

In [16]:

from sklearn.metrics import accuracy_score, confusion_matrix

accuracy_score(y_test, pred)

Out[16]:

0.7373737373737373

Parameter tuning

In [17]:

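depth_list = []

# Fit one tree per candidate depth and record its test-set accuracy
for n in range(2, 31):
    model = DecisionTreeClassifier(max_depth=n)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)

    depth_list.append(accuracy_score(y_test, pred))

# depth_list[0] corresponds to max_depth=2, so add 2 to convert the index into a depth
depth_list.index(max(depth_list)) + 2
# The most accurate depth is 4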

Out[17]:

In [18]:

model = DecisionTreeClassifier(max_depth = 4)
model.fit(X_train, y_train)
pred = model.predict(X_test)

accuracy_score(y_test, pred)

Out[18]:

0.7744107744107744
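The depth search above scores every candidate on the test set, so the accuracy it reports for the chosen depth is likely somewhat optimistic. A common alternative is to pick the depth by cross-validation on the training split and evaluate on the test set only once. The sketch below shows that pattern with `GridSearchCV`; it runs on synthetic data from `make_classification` as a stand-in (an assumption for self-containment), and the narrower depth range and `random_state` values are illustrative choices, not the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y; in the notebook, use its own X_train / y_train.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234
)

# Search max_depth with 5-fold cross-validation on the training split only.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": list(range(2, 11))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_depth = search.best_params_["max_depth"]
# The test set is used once, after the depth has been chosen.
test_accuracy = search.score(X_test, y_test)

print(best_depth, test_accuracy)
```

Fixing `random_state` on the classifier also makes repeated runs comparable, since tie-breaking between equally good splits is otherwise random.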

In [19]:

confusion_matrix(y_test, pred)

Out[19]:

array([[127,  19],
       [ 48, 103]], dtype=int64)
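Accuracy alone does not show which kind of error the model makes. Since `confusion_matrix` puts actual classes in rows and predicted classes in columns, the four counts above can be turned into precision and recall for the `sold = 1` class by hand. The variable names below are only labels for those counts.

```python
# Counts copied from the confusion matrix above (rows: actual, columns: predicted).
tn, fp = 127, 19    # actual 0: predicted 0, predicted 1
fn, tp = 48, 103    # actual 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of listings predicted as sold, the share that did sell
recall = tp / (tp + fn)      # of listings that sold, the share the model identified

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
# → 0.7744 0.8443 0.6821
```

The accuracy reproduces the 0.7744 reported above, while the lower recall shows that most of the errors are sold listings predicted as unsold (48 of 151).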

ⅴ. Visualization

In [20]:

from sklearn.tree import plot_tree

plt.figure(figsize=(40,20))
# label='none' hides the per-node field names; recent scikit-learn versions reject label=None
plot_tree(model, feature_names=list(X_train.columns), fontsize=15, label='none', max_depth=4)

Out[20]:

[Text(1116.0, 978.48, 'BuyItNow <= 0.5\n0.495\n1188\n[653, 535]'),
 Text(558.0, 761.0400000000001, 'startprice <= 91.5\n0.33\n661\n[523, 138]'),
 Text(279.0, 543.6, 'productline_Galaxy_S7 <= 0.5\n0.481\n77\n[46, 31]'),
 Text(139.5, 326.1600000000001, 'charCountDescription <= 68.5\n0.393\n41\n[30, 11]'),
 Text(69.75, 108.72000000000003, '0.464\n30\n[19, 11]'),
 Text(209.25, 108.72000000000003, '0.0\n11\n[11, 0]'),
 Text(418.5, 326.1600000000001, 'startprice <= 54.995\n0.494\n36\n[16, 20]'),
 Text(348.75, 108.72000000000003, '0.219\n8\n[1, 7]'),
 Text(488.25, 108.72000000000003, '0.497\n28\n[15, 13]'),
 Text(837.0, 543.6, 'upperCaseDescription <= 6.5\n0.299\n584\n[477, 107]'),
 Text(697.5, 326.1600000000001, 'startprice <= 504.5\n0.318\n529\n[424, 105]'),
 Text(627.75, 108.72000000000003, '0.34\n456\n[357, 99]'),
 Text(767.25, 108.72000000000003, '0.151\n73\n[67, 6]'),
 Text(976.5, 326.1600000000001, 'charCountDescription <= 27.5\n0.07\n55\n[53, 2]'),
 Text(906.75, 108.72000000000003, '0.5\n2\n[1, 1]'),
 Text(1046.25, 108.72000000000003, '0.037\n53\n[52, 1]'),
 Text(1674.0, 761.0400000000001, 'startprice <= 111.0\n0.372\n527\n[130, 397]'),
 Text(1395.0, 543.6, 'startprice <= 63.5\n0.171\n308\n[29, 279]'),
 Text(1255.5, 326.1600000000001, 'startprice <= 3.0\n0.097\n215\n[11, 204]'),
 Text(1185.75, 108.72000000000003, '0.019\n107\n[1, 106]'),
 Text(1325.25, 108.72000000000003, '0.168\n108\n[10, 98]'),
 Text(1534.5, 326.1600000000001, 'productline_Galaxy_S7 <= 0.5\n0.312\n93\n[18, 75]'),
 Text(1464.75, 108.72000000000003, '0.202\n70\n[8, 62]'),
 Text(1604.25, 108.72000000000003, '0.491\n23\n[10, 13]'),
 Text(1953.0, 543.6, 'startprice <= 205.995\n0.497\n219\n[101, 118]'),
 Text(1813.5, 326.1600000000001, 'productline_Galaxy_S7 <= 0.5\n0.449\n103\n[35, 68]'),
 Text(1743.75, 108.72000000000003, '0.427\n97\n[30, 67]'),
 Text(1883.25, 108.72000000000003, '0.278\n6\n[5, 1]'),
 Text(2092.5, 326.1600000000001, 'productline_Galaxy_Note10 <= 0.5\n0.49\n116\n[66, 50]'),
 Text(2022.75, 108.72000000000003, '0.361\n55\n[42, 13]'),
 Text(2162.25, 108.72000000000003, '0.477\n61\n[24, 37]')]
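Because the plot was drawn without labels, each node text lists, in order, the split rule, the impurity, the sample count, and the class counts `[not sold, sold]`. The impurity values are consistent with the Gini criterion, which is the default for `DecisionTreeClassifier`. As a rough check, the sketch below recomputes the impurity of the root and its two children from the class counts printed above; `gini` is a helper written for this illustration, not a scikit-learn function.

```python
def gini(counts):
    """Gini impurity of a node, given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

root = gini([653, 535])    # root node, 1188 samples
left = gini([523, 138])    # BuyItNow <= 0.5, 661 samples
right = gini([130, 397])   # BuyItNow > 0.5, 527 samples

# Impurity decrease from the first split: parent impurity minus the
# sample-weighted average impurity of the two children.
weighted_children = (661 * left + 527 * right) / 1188
gain = root - weighted_children

print(round(root, 3), round(left, 3), round(right, 3))
# → 0.495 0.33 0.372
```

The three values match the numbers shown in the first three nodes of the output, and the positive `gain` is what made `BuyItNow` the first split.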

License: Attribution-NonCommercial-NoDerivatives

CheeseChaser

E-Commerce Part Ⅳ: Applying a Decision Tree Model
