Natural Language Processing (NLP)¶
ⅰ. Importing Modules & Data Skimming¶
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
In [2]:
df = pd.read_csv("Data/yelp.csv", index_col=0)
df.head()
Out[2]:
  | review_id | user_id | business_id | stars | date | text | useful | funny | cool |
---|---|---|---|---|---|---|---|---|---|
2967245 | aMleVK0lQcOSNCs56_gSbg | miHaLnLanDKfZqZHet0uWw | Xp_cWXY5rxDLkX-wqUg-iQ | 5 | 2015-09-30 | LOVE the cheeses here. They are worth the pri... | 0 | 0 | 1 |
4773684 | Hs1f--t9JnVKW9A1U2uhKA | r_RUQSGZcd5bSgmTcS5IfQ | NuGZD3yBVqzpY1HuzT26mQ | 5 | 2015-06-04 | This has become our go-to sushi place. The sus... | 0 | 0 | 0 |
1139855 | i7aiPgNrNaFoM8J_j2OSyQ | zz7lojg6QdZbKFCJiHsj7w | ii8sAGBexBOJoYRFafF9XQ | 1 | 2016-07-03 | I was very disappointed with the hotel. The re... | 2 | 1 | 1 |
3997153 | uft6iMwNQh4I2UDpmbXggA | p_oXN3L9oi8nmmJigf8c9Q | r0j4IpUbcdC1-HfoMYae4w | 5 | 2016-10-15 | Love this place - super amazing - staff here i... | 0 | 0 | 0 |
4262000 | y9QmJ16mrfBZS6Td6Yqo0g | jovtGPaHAqP6XfG9BFwY7A | j6UwIfXrSkGTdVkRu7K6WA | 5 | 2017-03-14 | Thank you Dana!!!! Having dyed my hair black p... | 0 | 0 | 0 |
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 2967245 to 838267
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   review_id    10000 non-null  object
 1   user_id      10000 non-null  object
 2   business_id  10000 non-null  object
 3   stars        10000 non-null  int64
 4   date         10000 non-null  object
 5   text         10000 non-null  object
 6   useful       10000 non-null  int64
 7   funny        10000 non-null  int64
 8   cool         10000 non-null  int64
dtypes: int64(4), object(5)
memory usage: 781.2+ KB
In [4]:
df.describe()
Out[4]:
  | stars | useful | funny | cool |
---|---|---|---|---|
count | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 |
mean | 4.012800 | 1.498800 | 0.464200 | 0.542500 |
std | 1.724684 | 6.339355 | 1.926523 | 2.010273 |
min | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 5.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 5.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 5.000000 | 2.000000 | 0.000000 | 0.000000 |
max | 5.000000 | 533.000000 | 83.000000 | 97.000000 |
In [5]:
# Drop identifier and date fields that are not needed for the analysis
df.drop(['review_id', 'user_id', 'business_id', 'date'], axis=1, inplace=True)
df.head()
Out[5]:
  | stars | text | useful | funny | cool |
---|---|---|---|---|---|
2967245 | 5 | LOVE the cheeses here. They are worth the pri... | 0 | 0 | 1 |
4773684 | 5 | This has become our go-to sushi place. The sus... | 0 | 0 | 0 |
1139855 | 1 | I was very disappointed with the hotel. The re... | 2 | 1 | 1 |
3997153 | 5 | Love this place - super amazing - staff here i... | 0 | 0 | 0 |
4262000 | 5 | Thank you Dana!!!! Having dyed my hair black p... | 0 | 0 | 0 |
In [6]:
# Record the character length of each review as a new feature
df['text_len'] = df['text'].apply(len)
df.head()
Out[6]:
  | stars | text | useful | funny | cool | text_len |
---|---|---|---|---|---|---|
2967245 | 5 | LOVE the cheeses here. They are worth the pri... | 0 | 0 | 1 | 347 |
4773684 | 5 | This has become our go-to sushi place. The sus... | 0 | 0 | 0 | 377 |
1139855 | 1 | I was very disappointed with the hotel. The re... | 2 | 1 | 1 | 663 |
3997153 | 5 | Love this place - super amazing - staff here i... | 0 | 0 | 0 | 141 |
4262000 | 5 | Thank you Dana!!!! Having dyed my hair black p... | 0 | 0 | 0 | 455 |
In [7]:
# Check how many reviews each star rating has
print(df['stars'].value_counts())
sns.countplot(x='stars', data=df)
5    7532
1    2468
Name: stars, dtype: int64
Out[7]:
<AxesSubplot:xlabel='stars', ylabel='count'>
In [8]:
# Check the distribution of review text lengths
# displot creates its own figure, so size it via height/aspect rather than plt.figure
sns.displot(df['text_len'], height=5, aspect=2)
Out[8]:
<seaborn.axisgrid.FacetGrid at 0x29348af8460>
In [9]:
# Correlation between the numeric fields (non-numeric columns are ignored)
sns.heatmap(df.corr(), cmap='coolwarm')
Out[9]:
<AxesSubplot:>
ⅱ. Removing Special Characters¶
In [10]:
import string
punc = string.punctuation
punc
Out[10]:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
In [11]:
# Keep every character that is not punctuation, then re-join into a string
def remove_punc(x):
    text_list = []
    for s in x:
        if s not in punc:
            text_list.append(s)
    return ''.join(text_list)
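As an aside, the same character-level filtering can be done in one pass with Python's built-in str.translate, which is usually faster on long reviews. A minimal sketch reusing the punc string from above (the helper name is illustrative, not part of the original notebook):
In [ ]:
# Map every punctuation character to None and strip them in one C-level pass
punc_table = str.maketrans('', '', punc)

def remove_punc_fast(x):  # equivalent to remove_punc above
    return x.translate(punc_table)

remove_punc_fast("Hello, world!")  # -> 'Hello world'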
In [12]:
# Apply the punctuation-removal function defined above
df['text'] = df['text'].apply(remove_punc)
df.head()
Out[12]:
  | stars | text | useful | funny | cool | text_len |
---|---|---|---|---|---|---|
2967245 | 5 | LOVE the cheeses here They are worth the pric... | 0 | 0 | 1 | 347 |
4773684 | 5 | This has become our goto sushi place The sushi... | 0 | 0 | 0 | 377 |
1139855 | 1 | I was very disappointed with the hotel The res... | 2 | 1 | 1 | 663 |
3997153 | 5 | Love this place super amazing staff here is ... | 0 | 0 | 0 | 141 |
4262000 | 5 | Thank you Dana Having dyed my hair black previ... | 0 | 0 | 0 | 455 |
ⅲ. Stopwords Processing¶
In [13]:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
stopwords.words('english')
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\fulse\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Out[13]:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
In [15]:
# Build the stopword set once; looking tokens up in a fresh list per call is much slower
stop_set = set(stopwords.words('english'))

def stop_words(x):
    new_t = []
    for t in x.lower().split():
        if t not in stop_set:
            new_t.append(t)
    return new_t
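A quick sanity check of stop_words on a made-up sentence (the input is purely illustrative):
In [ ]:
# 'this', 'is', and 'a' are stopwords and get dropped; content words survive
stop_words("This is a great place")  # -> ['great', 'place']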
In [16]:
# Tokenize, lowercase, and drop stopwords using the function defined above
df['text'] = df['text'].apply(stop_words)
In [17]:
df.head()
Out[17]:
  | stars | text | useful | funny | cool | text_len |
---|---|---|---|---|---|---|
2967245 | 5 | [love, cheeses, worth, price, great, finding, ... | 0 | 0 | 1 | 347 |
4773684 | 5 | [become, goto, sushi, place, sushi, always, fr... | 0 | 0 | 0 | 377 |
1139855 | 1 | [disappointed, hotel, restaurants, good, booke... | 2 | 1 | 1 | 663 |
3997153 | 5 | [love, place, super, amazing, staff, always, f... | 0 | 0 | 0 | 141 |
4262000 | 5 | [thank, dana, dyed, hair, black, previously, k... | 0 | 0 | 0 | 455 |
ⅳ. Visualizing Frequency & Wordcloud¶
In [18]:
# Flatten every tokenized review into one list of words
word_split = []
for n in range(len(df)):
    for i in df.iloc[n]['text']:
        word_split.append(i)
len(word_split)
Out[18]:
542773
In [19]:
from nltk.probability import FreqDist
plt.figure(figsize=(20,10))
FreqDist(word_split).plot(50)
Out[19]:
<AxesSubplot:xlabel='Samples', ylabel='Counts'>
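FreqDist also exposes the counts directly; most_common is handy for reading the top tokens off as numbers rather than from the plot:
In [ ]:
# Print the ten most frequent tokens with their counts
for word, count in FreqDist(word_split).most_common(10):
    print(word, count)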
In [20]:
from wordcloud import WordCloud
# Join all tokens into one string; str(df['text']) would only hand WordCloud
# the truncated Series repr, not the actual reviews
wc = WordCloud().generate(' '.join(word_split))
plt.figure(figsize=(10, 5))
plt.imshow(wc)
plt.axis('off')
Out[20]:
(-0.5, 399.5, 199.5, -0.5)
The WordCloud module applies its own stopword filtering internally, so the result does not exactly match the frequency plot above.
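If the cloud should reflect exactly the tokens counted by FreqDist, WordCloud's built-in filtering can be disabled by passing an empty stopwords set; a sketch:
In [ ]:
# stopwords=set() turns off WordCloud's internal stopword list,
# so it sees the same tokens that FreqDist counted
wc_raw = WordCloud(stopwords=set()).generate(' '.join(word_split))
plt.figure(figsize=(10, 5))
plt.imshow(wc_raw)
plt.axis('off')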
Creating Wordclouds by Star Rating¶
In [21]:
pos_review = df[df['stars']==5]['text']
neg_review = df[df['stars']==1]['text']
In [22]:
# Join the 5-star tokens back into one string for WordCloud
wc_pos = WordCloud().generate(' '.join(w for tokens in pos_review for w in tokens))
plt.figure(figsize=(10, 5))
plt.imshow(wc_pos)
plt.axis('off')
Out[22]:
(-0.5, 399.5, 199.5, -0.5)
In [23]:
# Join the 1-star tokens back into one string for WordCloud
wc_neg = WordCloud().generate(' '.join(w for tokens in neg_review for w in tokens))
plt.figure(figsize=(10, 5))
plt.imshow(wc_neg)
plt.axis('off')
Out[23]:
(-0.5, 399.5, 199.5, -0.5)
Naive Bayes Model¶
- Assumes the features (here, the individual tokens of each review) are independent given the class
- Useful when n (the number of observations) is smaller than p (the number of features/columns)
- Well suited to spam filtering and sentiment analysis; a toy scoring sketch follows this list
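To make the independence assumption concrete: Multinomial Naive Bayes scores a class as log P(y) plus a sum of per-word log-likelihoods, as if each word were drawn independently given the class. A toy sketch with made-up counts (none of these numbers come from the Yelp data):
In [ ]:
import math

# Illustrative per-class word counts and priors (invented for the sketch)
word_counts = {
    5: {'love': 30, 'great': 25, 'bad': 2},
    1: {'love': 3, 'great': 4, 'bad': 20},
}
priors = {5: 0.75, 1: 0.25}

def nb_score(tokens, cls):
    total = sum(word_counts[cls].values())
    vocab = len(word_counts[cls])
    score = math.log(priors[cls])
    for t in tokens:
        # Laplace smoothing: (count + 1) / (total + vocabulary size)
        score += math.log((word_counts[cls].get(t, 0) + 1) / (total + vocab))
    return score

# The class with the higher score wins; 'love great' should favor 5 stars
{c: round(nb_score(['love', 'great'], c), 3) for c in (5, 1)}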
In [24]:
df_nb = pd.read_csv("Data/yelp.csv", index_col=0)
df_nb.head()
Out[24]:
  | review_id | user_id | business_id | stars | date | text | useful | funny | cool |
---|---|---|---|---|---|---|---|---|---|
2967245 | aMleVK0lQcOSNCs56_gSbg | miHaLnLanDKfZqZHet0uWw | Xp_cWXY5rxDLkX-wqUg-iQ | 5 | 2015-09-30 | LOVE the cheeses here. They are worth the pri... | 0 | 0 | 1 |
4773684 | Hs1f--t9JnVKW9A1U2uhKA | r_RUQSGZcd5bSgmTcS5IfQ | NuGZD3yBVqzpY1HuzT26mQ | 5 | 2015-06-04 | This has become our go-to sushi place. The sus... | 0 | 0 | 0 |
1139855 | i7aiPgNrNaFoM8J_j2OSyQ | zz7lojg6QdZbKFCJiHsj7w | ii8sAGBexBOJoYRFafF9XQ | 1 | 2016-07-03 | I was very disappointed with the hotel. The re... | 2 | 1 | 1 |
3997153 | uft6iMwNQh4I2UDpmbXggA | p_oXN3L9oi8nmmJigf8c9Q | r0j4IpUbcdC1-HfoMYae4w | 5 | 2016-10-15 | Love this place - super amazing - staff here i... | 0 | 0 | 0 |
4262000 | y9QmJ16mrfBZS6Td6Yqo0g | jovtGPaHAqP6XfG9BFwY7A | j6UwIfXrSkGTdVkRu7K6WA | 5 | 2017-03-14 | Thank you Dana!!!! Having dyed my hair black p... | 0 | 0 | 0 |
In [25]:
# Separate the feature (review text) and the target (star rating)
X = df_nb['text']
y = df_nb['stars']
- CountVectorizer: counts how often each word occurs and arranges the counts into a document-term matrix (one row per review, one column per word); a toy example follows below
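A toy run on two invented sentences shows what fit_transform returns: a sparse document-term matrix whose columns correspond to the learned vocabulary (get_feature_names_out assumes scikit-learn >= 1.0; older versions use get_feature_names):
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer

toy = ["the food was great", "the service was slow slow"]
toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy)      # sparse matrix of shape (2, n_words)
print(toy_cv.get_feature_names_out())  # learned vocabulary
print(toy_X.toarray())                 # word counts per document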
In [26]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(X)
In [27]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)
In [28]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train, y_train)
pred = model.predict(X_test)
In [29]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
print(accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
0.9195
[[ 444   58]
 [ 103 1395]]
              precision    recall  f1-score   support

           1       0.81      0.88      0.85       502
           5       0.96      0.93      0.95      1498

    accuracy                           0.92      2000
   macro avg       0.89      0.91      0.90      2000
weighted avg       0.92      0.92      0.92      2000
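One way to peek inside the fitted model is MultinomialNB's feature_log_prob_ attribute, which stores the per-class word log-likelihoods. A sketch that lists the words most indicative of each class (model.classes_ here is [1, 5], so the row difference below favors 5-star words; get_feature_names_out assumes scikit-learn >= 1.0):
In [ ]:
words = cv.get_feature_names_out()
# Log-likelihood difference between the 5-star and 1-star classes
log_odds = model.feature_log_prob_[1] - model.feature_log_prob_[0]
print("Most 5-star-like:", words[np.argsort(log_odds)[-10:]])
print("Most 1-star-like:", words[np.argsort(log_odds)[:10]])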
Random Forest Model¶
In [30]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(max_depth=10, n_estimators=1000)
rf.fit(X_train, y_train)
pred2 = rf.predict(X_test)
In [31]:
print(accuracy_score(y_test, pred2))
print(confusion_matrix(y_test, pred2))
print(classification_report(y_test, pred2))
0.7805
[[  68  434]
 [   5 1493]]
              precision    recall  f1-score   support

           1       0.93      0.14      0.24       502
           5       0.77      1.00      0.87      1498

    accuracy                           0.78      2000
   macro avg       0.85      0.57      0.55      2000
weighted avg       0.81      0.78      0.71      2000
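The forest's recall on 1-star reviews collapses to 0.14, which is typical when the majority class outnumbers the minority roughly 3:1 and the trees are kept shallow. One candidate remedy, sketched below without any claim that it is the best fix here, is class_weight='balanced', which reweights samples inversely to class frequency:
In [ ]:
# Reweight classes inversely to frequency to counter the 5-star majority
rf_bal = RandomForestClassifier(max_depth=10, n_estimators=1000,
                                class_weight='balanced')
rf_bal.fit(X_train, y_train)
print(classification_report(y_test, rf_bal.predict(X_test)))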