ch01_Linear Regression

Linear Regression

ⅰ. Importing Modules

In [1]:

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

ⅱ. Inspecting the Data

Check the data for missing values and outliers, and identify the variables to use in the analysis.

In [2]:

df = pd.read_csv("Data/ecommerce.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Email                 500 non-null    object 
 1   Address               500 non-null    object 
 2   Avatar                500 non-null    object 
 3   Avg. Session Length   500 non-null    float64
 4   Time on App           500 non-null    float64
 5   Time on Website       500 non-null    float64
 6   Length of Membership  500 non-null    float64
 7   Yearly Amount Spent   500 non-null    float64
dtypes: float64(5), object(3)
memory usage: 31.4+ KB

In [3]:

df.head()

Out[3]:

	Email	Address	Avatar	Avg. Session Length	Time on App	Time on Website	Length of Membership	Yearly Amount Spent
0	mstephenson@fernandez.com	835 Frank Tunnel\nWrightmouth, MI 82180-9605	Violet	34.497268	12.655651	39.577668	4.082621	587.951054
1	hduke@hotmail.com	4547 Archer Common\nDiazchester, CA 06566-8576	DarkGreen	31.926272	11.109461	37.268959	2.664034	392.204933
2	pallen@yahoo.com	24645 Valerie Unions Suite 582\nCobbborough, D...	Bisque	33.000915	11.330278	37.110597	4.104543	487.547505
3	riverarebecca@gmail.com	1414 David Throughway\nPort Jason, OH 22070-1220	SaddleBrown	34.305557	13.717514	36.721283	3.120179	581.852344
4	mstephens@davidson-herman.com	14023 Rodriguez Passage\nPort Jacobville, PR 3...	MediumAquaMarine	33.330673	12.795189	37.536653	4.446308	599.406092

In [4]:

df.describe()

Out[4]:

	Avg. Session Length	Time on App	Time on Website	Length of Membership	Yearly Amount Spent
count	500.000000	500.000000	500.000000	500.000000	500.000000
mean	33.053194	12.052488	37.060445	3.533462	499.314038
std	0.992563	0.994216	1.010489	0.999278	79.314782
min	29.532429	8.508152	33.913847	0.269901	256.670582
25%	32.341822	11.388153	36.349257	2.930450	445.038277
50%	33.082008	11.983231	37.069367	3.533975	498.887875
75%	33.711985	12.753850	37.716432	4.126502	549.313828
max	36.139662	15.126994	40.005182	6.922689	765.518462

The describe function gives a rough picture of each variable's scale and of possible outliers.
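
The missing-value and outlier check can also be made explicit. The following is a minimal sketch on a small made-up DataFrame (the column names mirror the dataset, but the values are invented), and the 1.5×IQR rule it uses is an assumption for illustration, not a method the post applies.

```python
import pandas as pd

# Made-up values for illustration only; not rows from ecommerce.csv.
toy = pd.DataFrame({
    "Time on App": [11.4, 12.0, 12.1, 12.7, 11.9, 25.0],
    "Time on Website": [36.8, 37.1, None, 37.2, 37.0, 36.9],
})

# Count missing values per column.
missing = toy.isna().sum()

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] in each column.
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
is_outlier = (toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)
outlier_counts = is_outlier.sum()

print(missing)
print(outlier_counts)
```

In this toy frame the value 25.0 is the only one flagged, and the single None shows up as one missing value. On the real data, df.info() above already reports 500 non-null entries in every column.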

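ⅲ. Removing Unnecessary Variables

We will build a linear regression model that predicts a customer's 'Yearly Amount Spent' from variables such as 'Length of Membership' and 'Time on App'.
Customer details such as 'Email' and 'Address' carry little meaning for this analysis.
Variables judged unnecessary for the model are removed up front.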

In [5]:

df.drop(['Email', 'Address', 'Avatar'], axis=1, inplace=True)

ⅳ. Splitting into Train/Test Sets

When building a model, the data is usually split into a train set and a test set.
The model is fit on the train set and then evaluated on the test set.
Because this dataset is fairly small, we split it 8:2.

In [6]:

from sklearn.model_selection import train_test_split

X = df.drop('Yearly Amount Spent', axis=1)
y = df['Yearly Amount Spent']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1234)
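
As a quick sanity check on the 8:2 split, the sketch below runs train_test_split on a synthetic stand-in with the same number of rows (500). The names X_demo and y_demo and the random values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows (500) as the dataset.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, 4)), columns=["a", "b", "c", "d"])
y_demo = pd.Series(rng.normal(size=500), name="target")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1234
)

# An 8:2 split of 500 rows leaves 400 for training and 100 for testing.
print(X_tr.shape, X_te.shape)

# The two index sets are disjoint and together cover every row.
assert set(X_tr.index).isdisjoint(X_te.index)
assert len(X_tr) + len(X_te) == len(X_demo)
```

These sizes line up with the rest of the notebook: the regression summary reports 400 observations, and the prediction table has 100 columns.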

ⅴ. Building the Linear Regression Model

In [7]:

import statsmodels.api as sm

lm = sm.OLS(y_train, X_train).fit()
lm.summary()

Out[7]:

OLS Regression Results
Dep. Variable:	Yearly Amount Spent	R-squared (uncentered):	0.998
Model:	OLS	Adj. R-squared (uncentered):	0.998
Method:	Least Squares	F-statistic:	4.997e+04
Date:	Tue, 12 Jan 2021	Prob (F-statistic):	0.00
Time:	17:34:07	Log-Likelihood:	-1813.3
No. Observations:	400	AIC:	3635.
Df Residuals:	396	BIC:	3650.
Df Model:	4
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Avg. Session Length	11.7641	0.829	14.183	0.000	10.133	13.395
Time on App	34.8647	1.103	31.615	0.000	32.697	37.033
Time on Website	-14.1410	0.761	-18.586	0.000	-15.637	-12.645
Length of Membership	60.7673	1.132	53.673	0.000	58.542	62.993

Omnibus:	2.567	Durbin-Watson:	1.835
Prob(Omnibus):	0.277	Jarque-Bera (JB):	2.304
Skew:	-0.100	Prob(JB):	0.316
Kurtosis:	2.686	Cond. No.	54.4

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

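The OLS function makes it easy to fit a linear regression model.
From the summary we briefly check Adj. R-squared and the P-values.
- Adj. R-squared is at most 1, and values closer to 1 mean the model explains more of the variation in the dependent variable. As note [1] of the output says, this model contains no constant term, so the reported value is the uncentered R-squared, which can be much higher than the usual (centered) R-squared and should not be compared with it directly.
- The P-value (P>|t|) is the probability of seeing a t-statistic at least this extreme if the variable's true coefficient were zero. A value of 0.05 or below is conventionally taken to mean the variable is statistically significant.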

ⅵ. Prediction and Model Evaluation

pred holds the y values predicted by feeding X_test into the model (lm) that was fit on X_train and y_train.
We compare the values in pred with y_test, the actual y values for X_test.

In [8]:

pred = lm.predict(X_test)

pd.DataFrame([pred, y_test])

Out[8]:

	67	416	350	358	112	329	299	64	27	373	...	133	160	486	198	194	214	181	386	407	435
Unnamed 0	493.201077	496.822932	538.968439	425.884702	437.004955	448.855223	343.146666	541.239874	480.738508	453.730759	...	560.151034	480.570849	581.841993	545.650834	410.577938	373.508390	507.975587	496.764476	419.372150	577.312430
Yearly Amount Spent	469.310861	511.038786	535.480775	382.416108	424.762636	445.062186	282.471246	540.263400	486.838935	430.588883	...	542.711558	468.913501	576.477607	560.560161	434.021700	357.863719	557.529274	550.813368	409.094526	571.216005

2 rows × 100 columns

In [9]:

plt.figure(figsize=(10, 10))
sns.scatterplot(x = y_test, y = pred)

Out[9]:

<AxesSubplot:xlabel='Yearly Amount Spent'>

The closer the points of the scatter plot lie to the line y=x, the better the model predicts.
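
Drawing the y=x line on the plot makes this comparison easier to read. The following is a minimal sketch with invented actual and predicted values; the variable names and the styling are assumptions, not part of the original notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

# Invented actual/predicted values, for illustration only.
actual = np.array([400.0, 450.0, 500.0, 550.0, 600.0])
predicted = np.array([410.0, 440.0, 505.0, 560.0, 590.0])

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(actual, predicted)

# Draw y = x across the common range of both axes.
lo = min(actual.min(), predicted.min())
hi = max(actual.max(), predicted.max())
ax.plot([lo, hi], [lo, hi], linestyle="--", color="gray")

ax.set_xlabel("Actual")
ax.set_ylabel("Predicted")
```

In the notebook, the same two plotting lines could be applied to y_test and pred; the Agg backend line is only needed when running outside a notebook.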

In [10]:

from sklearn import metrics

print('MSE:', metrics.mean_squared_error(y_test, pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, pred)))

MSE: 546.0009495214866
RMSE: 23.366663208971165

Mean squared error (MSE) and root mean squared error (RMSE) can be used to compare predictive performance across models.
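
For reference, both metrics are simple to compute by hand. The sketch below uses invented numbers to show what metrics.mean_squared_error and the square root applied to it amount to.

```python
import numpy as np

# Invented values, for illustration only.
actual = np.array([400.0, 450.0, 500.0, 550.0])
predicted = np.array([410.0, 440.0, 505.0, 545.0])

errors = actual - predicted   # per-observation prediction errors
mse = np.mean(errors ** 2)    # mean of the squared errors
rmse = np.sqrt(mse)           # back in the units of the target

print(mse)
print(rmse)
```

Because RMSE is in the same units as the target, the value of about 23.4 above can be read directly against 'Yearly Amount Spent', whose standard deviation is about 79.3 according to describe.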

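ⅶ. A Simple Application

We can feed various hypothetical values into the fitted model to obtain predictions.

Ex 1) Among users in their first year of membership, do heavy users of the App or heavy users of the Website have higher predicted yearly spending, and by how much?

For the comparison, we alternately plug in the first-quartile and third-quartile values of the two variables (time on App and time on Website).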

In [11]:

user_a = [33.05, 12.75, 36.34, 1] # App time at about the 75th percentile (top 25%), Website time at about the 25th percentile
user_b = [33.05, 11.38, 37.71, 1] # Website time at about the 75th percentile (top 25%), App time at about the 25th percentile

print(lm.predict(user_a))
print(lm.predict(user_b))

[380.21370215]
[313.07600146]

The model predicts yearly spending about 67 higher for the heavy App user (user_a, 380.21) than for the heavy Website user (user_b, 313.08).
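
Because the fitted model has no intercept, each prediction is just the dot product of the inputs with the coefficients, so the gap can be split by variable. The sketch below uses the coefficients as rounded in the summary table, so its results match the printed predictions only to about two decimal places.

```python
import numpy as np

# Coefficients copied (rounded) from the summary table, in column order:
# Avg. Session Length, Time on App, Time on Website, Length of Membership.
coef = np.array([11.7641, 34.8647, -14.1410, 60.7673])

user_a = np.array([33.05, 12.75, 36.34, 1.0])
user_b = np.array([33.05, 11.38, 37.71, 1.0])

# With no intercept, a prediction is the dot product with the coefficients.
pred_a = user_a @ coef
pred_b = user_b @ coef

# Contribution of each variable to the gap between the two users.
contrib = (user_a - user_b) * coef

print(pred_a, pred_b)
print(contrib)
print(contrib.sum())
```

Roughly 48 of the gap comes from the extra App time and roughly 19 from the lower Website time, since the Website coefficient is negative.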

Ex 2) Among users who use the App more heavily than the Website, how much does predicted yearly spending differ between the first and second year of membership?

In [12]:

user_c = [33.05, 12.75, 36.34, 1]
user_d = [33.05, 12.75, 36.34, 2]
user_e = [33.05, 12.75, 36.34, 3]

print(lm.predict(user_c))
print(lm.predict(user_d))
print(lm.predict(user_e))

[380.21370215]
[440.98104002]
[501.74837789]
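
Each additional year of membership raises the prediction by the same amount, which is what a linear model implies. A short check, using the three predictions printed above and the rounded 'Length of Membership' coefficient from the summary table:

```python
# Predictions printed above for membership lengths of 1, 2 and 3 years.
preds = [380.21370215, 440.98104002, 501.74837789]

# Year-over-year differences in predicted spending.
diffs = [later - earlier for earlier, later in zip(preds, preds[1:])]
print(diffs)

# Coefficient on Length of Membership from the summary table (rounded).
membership_coef = 60.7673
assert all(abs(d - membership_coef) < 1e-3 for d in diffs)
```

So with the other inputs held fixed, the model predicts about 60.77 more in yearly spending per extra year of membership.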

CheeseChaser

E-Commerce Part Ⅰ: Applying Linear Regression
