Logistic Regression
ⅰ. Importing Modules & Checking Data Characteristics
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
df = pd.read_csv("Data/advertising.csv")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Daily Time Spent on Site  1000 non-null   float64
 1   Age                       916 non-null    float64
 2   Area Income               1000 non-null   float64
 3   Daily Internet Usage      1000 non-null   float64
 4   Ad Topic Line             1000 non-null   object 
 5   City                      1000 non-null   object 
 6   Male                      1000 non-null   int64  
 7   Country                   1000 non-null   object 
 8   Timestamp                 1000 non-null   object 
 9   Clicked on Ad             1000 non-null   int64  
dtypes: float64(4), int64(2), object(4)
memory usage: 78.2+ KB
In [3]:
df.head()
Out[3]:
|   | Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Ad Topic Line | City | Male | Country | Timestamp | Clicked on Ad |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68.95 | NaN | 61833.90 | 256.09 | Cloned 5thgeneration orchestration | Wrightburgh | 0 | Tunisia | 3/27/2016 0:53 | 0 |
| 1 | 80.23 | 31.0 | 68441.85 | 193.77 | Monitored national standardization | West Jodi | 1 | Nauru | 4/4/2016 1:39 | 0 |
| 2 | 69.47 | 26.0 | 59785.94 | 236.50 | Organic bottom-line service-desk | Davidton | 0 | San Marino | 3/13/2016 20:35 | 0 |
| 3 | 74.15 | 29.0 | 54806.18 | 245.89 | Triple-buffered reciprocal time-frame | West Terrifurt | 1 | Italy | 1/10/2016 2:31 | 0 |
| 4 | 68.37 | 35.0 | 73889.99 | 225.58 | Robust logistical utilization | South Manuel | 0 | Iceland | 6/3/2016 3:36 | 0 |
In [4]:
df.describe()
Out[4]:
|       | Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Male | Clicked on Ad |
|---|---|---|---|---|---|---|
| count | 1000.000000 | 916.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.00000 |
| mean  | 65.000200 | 36.128821 | 55000.000080 | 180.000100 | 0.481000 | 0.50000 |
| std   | 15.853615 | 9.018548 | 13414.634022 | 43.902339 | 0.499889 | 0.50025 |
| min   | 32.600000 | 19.000000 | 13996.500000 | 104.780000 | 0.000000 | 0.00000 |
| 25%   | 51.360000 | 29.000000 | 47031.802500 | 138.830000 | 0.000000 | 0.00000 |
| 50%   | 68.215000 | 35.000000 | 57012.300000 | 183.130000 | 0.000000 | 0.50000 |
| 75%   | 78.547500 | 42.000000 | 65470.635000 | 218.792500 | 1.000000 | 1.00000 |
| max   | 91.430000 | 61.000000 | 79484.800000 | 269.960000 | 1.000000 | 1.00000 |
- Checking the skewness and kurtosis of the key variables (see the numeric check after the plots below)
In [5]:
sns.displot(df['Area Income'])
sns.displot(df['Age'])
Out[5]:
<seaborn.axisgrid.FacetGrid at 0x1788e634940>
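Beyond eyeballing the distribution plots, pandas can report the skewness and kurtosis figures directly. A minimal sketch over the numeric columns (num_cols is a name introduced here for illustration):

# skewness (asymmetry) and excess kurtosis (tail weight) of the key variables
num_cols = ['Daily Time Spent on Site', 'Age', 'Area Income', 'Daily Internet Usage']
print(df[num_cols].skew())
print(df[num_cols].kurtosis())

Values near 0 for both statistics suggest roughly normal-shaped distributions.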
In [6]:
# check the cardinality (number of unique values) of the categorical columns
print(df['Country'].nunique())
print(df['City'].nunique())
print(df['Ad Topic Line'].nunique())
237
969
1000
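Country, City, and Ad Topic Line have 237, 969, and 1000 unique values out of 1000 rows, so they behave like near-unique identifiers and are left out of the model below.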
ⅱ. Handling Missing Values
In [7]:
df.isna().sum() / len(df)
Out[7]:
Daily Time Spent on Site    0.000
Age                         0.084
Area Income                 0.000
Daily Internet Usage        0.000
Ad Topic Line               0.000
City                        0.000
Male                        0.000
Country                     0.000
Timestamp                   0.000
Clicked on Ad               0.000
dtype: float64
In [8]:
print(df['Age'].mean())
print(round(df['Age'].mean()))
print(df['Age'].median())
36.12882096069869
36
35.0
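Age is roughly symmetric (mean 36.13 vs. median 35.0), so filling with the rounded mean is a reasonable choice here.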
In [9]:
# fill the missing Age values with the rounded mean (Age is the only column with NaNs)
df['Age'] = df['Age'].fillna(round(df['Age'].mean()))
df.isna().sum()
Out[9]:
Daily Time Spent on Site    0
Age                         0
Area Income                 0
Daily Internet Usage        0
Ad Topic Line               0
City                        0
Male                        0
Country                     0
Timestamp                   0
Clicked on Ad               0
dtype: int64
ⅲ. Logistic Regression Model
In [10]:
from sklearn.model_selection import train_test_split
# features used in the model
X = df[['Daily Time Spent on Site','Age', 'Area Income', 'Daily Internet Usage', 'Male']]
y = df['Clicked on Ad']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)
In [11]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
Out[11]:
LogisticRegression()
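One caveat: the features sit on very different scales (Area Income near 55,000 vs. the 0/1 Male flag), and unscaled inputs can slow or destabilize the default lbfgs solver. A minimal sketch of a standardized alternative, assuming the same X_train/y_train split as above (scaled_model is a name introduced here for illustration):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# standardize each feature to zero mean / unit variance before the fit
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_train, y_train)

The pipeline applies the same scaling at predict time, so downstream code stays unchanged.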
In [12]:
model.coef_
Out[12]:
array([[-6.54858301e-02, 2.58572233e-01, -1.32794534e-05, -2.34073196e-02, 1.62755623e-03]])
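Since logistic regression is linear in the log-odds, exponentiating a coefficient gives the multiplicative change in the odds of a click per one-unit increase in that feature. A quick sketch pairing the coefficients with their column names (odds_ratios is an illustrative name):

# exp(coef) = odds ratio per one-unit increase in each feature
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X.columns)
print(odds_ratios)

For example, exp(0.2586) ≈ 1.3 for Age: each additional year multiplies the predicted odds of clicking by roughly 1.3, holding the other features fixed.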
ⅳ. Evaluating the Model
In [13]:
from sklearn.metrics import accuracy_score, confusion_matrix
pred = model.predict(X_test)
print('Model accuracy: {} \n'.format(accuracy_score(y_test, pred)))
print(confusion_matrix(y_test, pred))
Model accuracy: 0.92 

[[89  3]
 [13 95]]
- With a prediction accuracy of 92%, this is a reasonably good model.
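Note from the confusion matrix that most errors are clickers predicted as non-clickers (13 false negatives vs. 3 false positives). Per-class precision and recall make this easier to read off; a minimal follow-up using the same test predictions:

from sklearn.metrics import classification_report
print(classification_report(y_test, pred))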
ⅴ. Applying the Fitted Model to Simple Predictions
Q1. Predicting ad clicks for female customers in their 20s as a function of daily time spent on the site (the remaining variables are held at fixed values: Area Income at its mean of 55,000, Daily Internet Usage at 104.78, and Male = 0)
In [14]:
pred_list_by_time_age = []
for time in range(35, 80):                  # daily time spent on site
    pred_list_by_age = []
    for age in range(20, 30):               # ages 20 through 29
        # one-row frame with the same column names the model was trained on
        val = pd.DataFrame([[time, age, 55000.00, 104.78, 0]], columns=X.columns)
        pred_list_by_age.append(int(model.predict(val)[0]))
    pred_list_by_time_age.append(pred_list_by_age)
print(np.array(pred_list_by_time_age).shape)
pred_list_by_time_age
(45, 10)
Out[14]:
[[0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
 [0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
 [0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
 [0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
 [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
 [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
 [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
 [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
 [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
 [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
 [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
 [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
 [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
 [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
 [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
 [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
 [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
 [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
 [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
 [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
 [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
 [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
 [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
 [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
 [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
 [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
 [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
 [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
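The hard 0/1 labels hide how close each profile sits to the decision boundary. predict_proba exposes the underlying click probability instead; a minimal sketch for a hypothetical 25-year-old under the same fixed values (probs is an illustrative name):

# P(Clicked on Ad = 1) for a 25-year-old female across daily site-time values
probs = []
for time in range(35, 80):
    val = pd.DataFrame([[time, 25, 55000.00, 104.78, 0]], columns=X.columns)
    probs.append(model.predict_proba(val)[0, 1])

Plotting probs against the time values would show exactly where the 0.5 decision boundary falls for this customer profile.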