Logistic Regression
ⅰ. Importing Modules & Checking Data Characteristics
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
df = pd.read_csv("Data/advertising.csv")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Daily Time Spent on Site  1000 non-null   float64
 1   Age                       916 non-null    float64
 2   Area Income               1000 non-null   float64
 3   Daily Internet Usage      1000 non-null   float64
 4   Ad Topic Line             1000 non-null   object 
 5   City                      1000 non-null   object 
 6   Male                      1000 non-null   int64  
 7   Country                   1000 non-null   object 
 8   Timestamp                 1000 non-null   object 
 9   Clicked on Ad             1000 non-null   int64  
dtypes: float64(4), int64(2), object(4)
memory usage: 78.2+ KB
In [3]:
df.head()
Out[3]:
|   | Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Ad Topic Line | City | Male | Country | Timestamp | Clicked on Ad |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68.95 | NaN | 61833.90 | 256.09 | Cloned 5thgeneration orchestration | Wrightburgh | 0 | Tunisia | 3/27/2016 0:53 | 0 |
| 1 | 80.23 | 31.0 | 68441.85 | 193.77 | Monitored national standardization | West Jodi | 1 | Nauru | 4/4/2016 1:39 | 0 |
| 2 | 69.47 | 26.0 | 59785.94 | 236.50 | Organic bottom-line service-desk | Davidton | 0 | San Marino | 3/13/2016 20:35 | 0 |
| 3 | 74.15 | 29.0 | 54806.18 | 245.89 | Triple-buffered reciprocal time-frame | West Terrifurt | 1 | Italy | 1/10/2016 2:31 | 0 |
| 4 | 68.37 | 35.0 | 73889.99 | 225.58 | Robust logistical utilization | South Manuel | 0 | Iceland | 6/3/2016 3:36 | 0 |
In [4]:
df.describe()
Out[4]:
|       | Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Male | Clicked on Ad |
|---|---|---|---|---|---|---|
| count | 1000.000000 | 916.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.00000 |
| mean  | 65.000200 | 36.128821 | 55000.000080 | 180.000100 | 0.481000 | 0.50000 |
| std   | 15.853615 | 9.018548 | 13414.634022 | 43.902339 | 0.499889 | 0.50025 |
| min   | 32.600000 | 19.000000 | 13996.500000 | 104.780000 | 0.000000 | 0.00000 |
| 25%   | 51.360000 | 29.000000 | 47031.802500 | 138.830000 | 0.000000 | 0.00000 |
| 50%   | 68.215000 | 35.000000 | 57012.300000 | 183.130000 | 0.000000 | 0.50000 |
| 75%   | 78.547500 | 42.000000 | 65470.635000 | 218.792500 | 1.000000 | 1.00000 |
| max   | 91.430000 | 61.000000 | 79484.800000 | 269.960000 | 1.000000 | 1.00000 |
- Checking the skewness and kurtosis of the key variables (see the numeric check after the plots below)
In [5]:
sns.displot(df['Area Income'])
sns.displot(df['Age'])
Out[5]:
<seaborn.axisgrid.FacetGrid at 0x1788e634940>
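Beyond eyeballing the distribution plots, pandas can report the skewness and kurtosis figures directly. A minimal sketch over the numeric columns (num_cols is a name introduced here for illustration):

# skewness (asymmetry) and excess kurtosis (tail weight) of the key variables
num_cols = ['Daily Time Spent on Site', 'Age', 'Area Income', 'Daily Internet Usage']
print(df[num_cols].skew())
print(df[num_cols].kurtosis())

Values near 0 for both statistics suggest roughly normal-shaped distributions.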
In [6]:
# check the cardinality (number of unique values) of the categorical columns
print(df['Country'].nunique())
print(df['City'].nunique())
print(df['Ad Topic Line'].nunique())
237
969
1000
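Country, City, and Ad Topic Line have 237, 969, and 1000 unique values out of 1000 rows, so they behave like near-unique identifiers and are left out of the model below.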
ⅱ. Handling Missing Values
In [7]:
df.isna().sum() / len(df)
Out[7]:
Daily Time Spent on Site    0.000
Age                         0.084
Area Income                 0.000
Daily Internet Usage        0.000
Ad Topic Line               0.000
City                        0.000
Male                        0.000
Country                     0.000
Timestamp                   0.000
Clicked on Ad               0.000
dtype: float64
In [8]:
print(df['Age'].mean())
print(round(df['Age'].mean()))
print(df['Age'].median())
36.12882096069869
36
35.0
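Age is roughly symmetric (mean 36.13 vs. median 35.0), so filling with the rounded mean is a reasonable choice here.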
In [9]:
# fill the missing Age values with the rounded mean (Age is the only column with NaNs)
df['Age'] = df['Age'].fillna(round(df['Age'].mean()))
df.isna().sum()
Out[9]:
Daily Time Spent on Site    0
Age                         0
Area Income                 0
Daily Internet Usage        0
Ad Topic Line               0
City                        0
Male                        0
Country                     0
Timestamp                   0
Clicked on Ad               0
dtype: int64
ⅲ. Logistic Regression Model
In [10]:
from sklearn.model_selection import train_test_split
# features used in the model
X = df[['Daily Time Spent on Site','Age', 'Area Income', 'Daily Internet Usage', 'Male']]
y = df['Clicked on Ad']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)
In [11]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
Out[11]:
LogisticRegression()
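One caveat: the features sit on very different scales (Area Income near 55,000 vs. the 0/1 Male flag), and unscaled inputs can slow or destabilize the default lbfgs solver. A minimal sketch of a standardized alternative, assuming the same X_train/y_train split as above (scaled_model is a name introduced here for illustration):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# standardize each feature to zero mean / unit variance before the fit
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_train, y_train)

The pipeline applies the same scaling at predict time, so downstream code stays unchanged.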
In [12]:
model.coef_
Out[12]:
array([[-6.54858301e-02, 2.58572233e-01, -1.32794534e-05, -2.34073196e-02, 1.62755623e-03]])
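Since logistic regression is linear in the log-odds, exponentiating a coefficient gives the multiplicative change in the odds of a click per one-unit increase in that feature. A quick sketch pairing the coefficients with their column names (odds_ratios is an illustrative name):

# exp(coef) = odds ratio per one-unit increase in each feature
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X.columns)
print(odds_ratios)

For example, exp(0.2586) ≈ 1.3 for Age: each additional year multiplies the predicted odds of clicking by roughly 1.3, holding the other features fixed.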
ⅳ. Evaluating the Model
In [13]:
from sklearn.metrics import accuracy_score, confusion_matrix
pred = model.predict(X_test)
print('Model accuracy: {} \n'.format(accuracy_score(y_test, pred)))
print(confusion_matrix(y_test, pred))
Model accuracy: 0.92 

[[89  3]
 [13 95]]
- With a prediction accuracy of 92%, this is a reasonably good model.
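Note from the confusion matrix that most errors are clickers predicted as non-clickers (13 false negatives vs. 3 false positives). Per-class precision and recall make this easier to read off; a minimal follow-up using the same test predictions:

from sklearn.metrics import classification_report
print(classification_report(y_test, pred))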
ⅴ. Applying the Fitted Model to Simple Predictions
Q1. Predicting ad clicks for female customers in their 20s as a function of daily time spent on the site (the remaining variables are held at fixed values: Area Income at its mean of 55,000, Daily Internet Usage at 104.78, and Male = 0)
In [14]:
pred_list_by_time_age = []
for time in range(35, 80):                  # daily time spent on site
    pred_list_by_age = []
    for age in range(20, 30):               # ages 20 through 29
        # one-row frame with the same column names the model was trained on
        val = pd.DataFrame([[time, age, 55000.00, 104.78, 0]], columns=X.columns)
        pred_list_by_age.append(int(model.predict(val)[0]))
    pred_list_by_time_age.append(pred_list_by_age)
print(np.array(pred_list_by_time_age).shape)
pred_list_by_time_age
(45, 10)
Out[14]:
[[0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
 [0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
 [0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
 [0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
 [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
 [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
 [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
 [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
 [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
 [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
 [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
 [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
 [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
 [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
 [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
 [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
 [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
 [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
 [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
 [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
 [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
 [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
 [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
 [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
 [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
 [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
 [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
 [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
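The hard 0/1 labels hide how close each profile sits to the decision boundary. predict_proba exposes the underlying click probability instead; a minimal sketch for a hypothetical 25-year-old under the same fixed values (probs is an illustrative name):

# P(Clicked on Ad = 1) for a 25-year-old female across daily site-time values
probs = []
for time in range(35, 80):
    val = pd.DataFrame([[time, 25, 55000.00, 104.78, 0]], columns=X.columns)
    probs.append(model.predict_proba(val)[0, 1])

Plotting probs against the time values would show exactly where the 0.5 decision boundary falls for this customer profile.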