Natural Language Processing (NLP)¶
ⅰ. Importing Modules & Data Skimming¶
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
In [2]:
df = pd.read_csv("Data/yelp.csv", index_col=0)
df.head()
Out[2]:
  | review_id | user_id | business_id | stars | date | text | useful | funny | cool |
---|---|---|---|---|---|---|---|---|---|
2967245 | aMleVK0lQcOSNCs56_gSbg | miHaLnLanDKfZqZHet0uWw | Xp_cWXY5rxDLkX-wqUg-iQ | 5 | 2015-09-30 | LOVE the cheeses here. They are worth the pri... | 0 | 0 | 1 |
4773684 | Hs1f--t9JnVKW9A1U2uhKA | r_RUQSGZcd5bSgmTcS5IfQ | NuGZD3yBVqzpY1HuzT26mQ | 5 | 2015-06-04 | This has become our go-to sushi place. The sus... | 0 | 0 | 0 |
1139855 | i7aiPgNrNaFoM8J_j2OSyQ | zz7lojg6QdZbKFCJiHsj7w | ii8sAGBexBOJoYRFafF9XQ | 1 | 2016-07-03 | I was very disappointed with the hotel. The re... | 2 | 1 | 1 |
3997153 | uft6iMwNQh4I2UDpmbXggA | p_oXN3L9oi8nmmJigf8c9Q | r0j4IpUbcdC1-HfoMYae4w | 5 | 2016-10-15 | Love this place - super amazing - staff here i... | 0 | 0 | 0 |
4262000 | y9QmJ16mrfBZS6Td6Yqo0g | jovtGPaHAqP6XfG9BFwY7A | j6UwIfXrSkGTdVkRu7K6WA | 5 | 2017-03-14 | Thank you Dana!!!! Having dyed my hair black p... | 0 | 0 | 0 |
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 2967245 to 838267
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   review_id    10000 non-null  object
 1   user_id      10000 non-null  object
 2   business_id  10000 non-null  object
 3   stars        10000 non-null  int64
 4   date         10000 non-null  object
 5   text         10000 non-null  object
 6   useful       10000 non-null  int64
 7   funny        10000 non-null  int64
 8   cool         10000 non-null  int64
dtypes: int64(4), object(5)
memory usage: 781.2+ KB
In [4]:
df.describe()
Out[4]:
  | stars | useful | funny | cool |
---|---|---|---|---|
count | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 |
mean | 4.012800 | 1.498800 | 0.464200 | 0.542500 |
std | 1.724684 | 6.339355 | 1.926523 | 2.010273 |
min | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 5.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 5.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 5.000000 | 2.000000 | 0.000000 | 0.000000 |
max | 5.000000 | 533.000000 | 83.000000 | 97.000000 |
In [5]:
# Drop identifier and date fields that are not needed for the analysis
df.drop(['review_id', 'user_id', 'business_id', 'date'], axis=1, inplace=True)
df.head()
Out[5]:
  | stars | text | useful | funny | cool |
---|---|---|---|---|---|
2967245 | 5 | LOVE the cheeses here. They are worth the pri... | 0 | 0 | 1 |
4773684 | 5 | This has become our go-to sushi place. The sus... | 0 | 0 | 0 |
1139855 | 1 | I was very disappointed with the hotel. The re... | 2 | 1 | 1 |
3997153 | 5 | Love this place - super amazing - staff here i... | 0 | 0 | 0 |
4262000 | 5 | Thank you Dana!!!! Having dyed my hair black p... | 0 | 0 | 0 |
In [6]:
# Record the character length of each review as a new feature
df['text_len'] = df['text'].apply(len)
df.head()
Out[6]:
  | stars | text | useful | funny | cool | text_len |
---|---|---|---|---|---|---|
2967245 | 5 | LOVE the cheeses here. They are worth the pri... | 0 | 0 | 1 | 347 |
4773684 | 5 | This has become our go-to sushi place. The sus... | 0 | 0 | 0 | 377 |
1139855 | 1 | I was very disappointed with the hotel. The re... | 2 | 1 | 1 | 663 |
3997153 | 5 | Love this place - super amazing - staff here i... | 0 | 0 | 0 | 141 |
4262000 | 5 | Thank you Dana!!!! Having dyed my hair black p... | 0 | 0 | 0 | 455 |
In [7]:
# Check how many reviews each star rating has
print(df['stars'].value_counts())
sns.countplot(x='stars', data=df)
5    7532
1    2468
Name: stars, dtype: int64
Out[7]:
<AxesSubplot:xlabel='stars', ylabel='count'>
In [8]:
# Check the distribution of review text lengths
# displot creates its own figure, so size it via height/aspect rather than plt.figure
sns.displot(df['text_len'], height=5, aspect=2)
Out[8]:
<seaborn.axisgrid.FacetGrid at 0x29348af8460>
In [9]:
# Correlation between the numeric fields (non-numeric columns are ignored)
sns.heatmap(df.corr(), cmap='coolwarm')
Out[9]:
<AxesSubplot:>
ⅱ. Removing Special Characters¶
In [10]:
import string
punc = string.punctuation
punc
Out[10]:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
In [11]:
# Keep every character that is not punctuation, then re-join into a string
def remove_punc(x):
    text_list = []
    for s in x:
        if s not in punc:
            text_list.append(s)
    return ''.join(text_list)
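As an aside, the same character-level filtering can be done in one pass with Python's built-in str.translate, which is usually faster on long reviews. A minimal sketch reusing the punc string from above (the helper name is illustrative, not part of the original notebook):
In [ ]:
# Map every punctuation character to None and strip them in one C-level pass
punc_table = str.maketrans('', '', punc)

def remove_punc_fast(x):  # equivalent to remove_punc above
    return x.translate(punc_table)

remove_punc_fast("Hello, world!")  # -> 'Hello world'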
In [12]:
# Apply the punctuation-removal function defined above
df['text'] = df['text'].apply(remove_punc)
df.head()
Out[12]:
  | stars | text | useful | funny | cool | text_len |
---|---|---|---|---|---|---|
2967245 | 5 | LOVE the cheeses here They are worth the pric... | 0 | 0 | 1 | 347 |
4773684 | 5 | This has become our goto sushi place The sushi... | 0 | 0 | 0 | 377 |
1139855 | 1 | I was very disappointed with the hotel The res... | 2 | 1 | 1 | 663 |
3997153 | 5 | Love this place super amazing staff here is ... | 0 | 0 | 0 | 141 |
4262000 | 5 | Thank you Dana Having dyed my hair black previ... | 0 | 0 | 0 | 455 |
ⅲ. Stopwords Processing¶
In [13]:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
stopwords.words('english')
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\fulse\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Out[13]:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
In [15]:
# Build the stopword set once; looking tokens up in a fresh list per call is much slower
stop_set = set(stopwords.words('english'))

def stop_words(x):
    new_t = []
    for t in x.lower().split():
        if t not in stop_set:
            new_t.append(t)
    return new_t
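A quick sanity check of stop_words on a made-up sentence (the input is purely illustrative):
In [ ]:
# 'this', 'is', and 'a' are stopwords and get dropped; content words survive
stop_words("This is a great place")  # -> ['great', 'place']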
In [16]:
# Tokenize, lowercase, and drop stopwords using the function defined above
df['text'] = df['text'].apply(stop_words)
In [17]:
df.head()
Out[17]:
  | stars | text | useful | funny | cool | text_len |
---|---|---|---|---|---|---|
2967245 | 5 | [love, cheeses, worth, price, great, finding, ... | 0 | 0 | 1 | 347 |
4773684 | 5 | [become, goto, sushi, place, sushi, always, fr... | 0 | 0 | 0 | 377 |
1139855 | 1 | [disappointed, hotel, restaurants, good, booke... | 2 | 1 | 1 | 663 |
3997153 | 5 | [love, place, super, amazing, staff, always, f... | 0 | 0 | 0 | 141 |
4262000 | 5 | [thank, dana, dyed, hair, black, previously, k... | 0 | 0 | 0 | 455 |
ⅳ. Visualizing Frequency & Wordcloud¶
In [18]:
# Flatten every tokenized review into one list of words
word_split = []
for n in range(len(df)):
    for i in df.iloc[n]['text']:
        word_split.append(i)
len(word_split)
Out[18]:
542773
In [19]:
from nltk.probability import FreqDist
plt.figure(figsize=(20,10))
FreqDist(word_split).plot(50)
Out[19]:
<AxesSubplot:xlabel='Samples', ylabel='Counts'>
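FreqDist also exposes the counts directly; most_common is handy for reading the top tokens off as numbers rather than from the plot:
In [ ]:
# Print the ten most frequent tokens with their counts
for word, count in FreqDist(word_split).most_common(10):
    print(word, count)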
In [20]:
from wordcloud import WordCloud
# Join all tokens into one string; str(df['text']) would only hand WordCloud
# the truncated Series repr, not the actual reviews
wc = WordCloud().generate(' '.join(word_split))
plt.figure(figsize=(10, 5))
plt.imshow(wc)
plt.axis('off')
Out[20]:
(-0.5, 399.5, 199.5, -0.5)
The WordCloud module applies its own stopword filtering internally, so the result does not exactly match the frequency plot above.
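If the cloud should reflect exactly the tokens counted by FreqDist, WordCloud's built-in filtering can be disabled by passing an empty stopwords set; a sketch:
In [ ]:
# stopwords=set() turns off WordCloud's internal stopword list,
# so it sees the same tokens that FreqDist counted
wc_raw = WordCloud(stopwords=set()).generate(' '.join(word_split))
plt.figure(figsize=(10, 5))
plt.imshow(wc_raw)
plt.axis('off')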
Creating Wordclouds by Star Rating¶
In [21]:
pos_review = df[df['stars']==5]['text']
neg_review = df[df['stars']==1]['text']
In [22]:
# Join the 5-star tokens back into one string for WordCloud
wc_pos = WordCloud().generate(' '.join(w for tokens in pos_review for w in tokens))
plt.figure(figsize=(10, 5))
plt.imshow(wc_pos)
plt.axis('off')
Out[22]:
(-0.5, 399.5, 199.5, -0.5)
In [23]:
# Join the 1-star tokens back into one string for WordCloud
wc_neg = WordCloud().generate(' '.join(w for tokens in neg_review for w in tokens))
plt.figure(figsize=(10, 5))
plt.imshow(wc_neg)
plt.axis('off')
Out[23]:
(-0.5, 399.5, 199.5, -0.5)
Naive Bayes Model¶
- Assumes the features (here, the individual tokens of each review) are independent given the class
- Useful when n (the number of observations) is smaller than p (the number of features/columns)
- Well suited to spam filtering and sentiment analysis; a toy scoring sketch follows this list
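To make the independence assumption concrete: Multinomial Naive Bayes scores a class as log P(y) plus a sum of per-word log-likelihoods, as if each word were drawn independently given the class. A toy sketch with made-up counts (none of these numbers come from the Yelp data):
In [ ]:
import math

# Illustrative per-class word counts and priors (invented for the sketch)
word_counts = {
    5: {'love': 30, 'great': 25, 'bad': 2},
    1: {'love': 3, 'great': 4, 'bad': 20},
}
priors = {5: 0.75, 1: 0.25}

def nb_score(tokens, cls):
    total = sum(word_counts[cls].values())
    vocab = len(word_counts[cls])
    score = math.log(priors[cls])
    for t in tokens:
        # Laplace smoothing: (count + 1) / (total + vocabulary size)
        score += math.log((word_counts[cls].get(t, 0) + 1) / (total + vocab))
    return score

# The class with the higher score wins; 'love great' should favor 5 stars
{c: round(nb_score(['love', 'great'], c), 3) for c in (5, 1)}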
In [24]:
df_nb = pd.read_csv("Data/yelp.csv", index_col=0)
df_nb.head()
Out[24]:
  | review_id | user_id | business_id | stars | date | text | useful | funny | cool |
---|---|---|---|---|---|---|---|---|---|
2967245 | aMleVK0lQcOSNCs56_gSbg | miHaLnLanDKfZqZHet0uWw | Xp_cWXY5rxDLkX-wqUg-iQ | 5 | 2015-09-30 | LOVE the cheeses here. They are worth the pri... | 0 | 0 | 1 |
4773684 | Hs1f--t9JnVKW9A1U2uhKA | r_RUQSGZcd5bSgmTcS5IfQ | NuGZD3yBVqzpY1HuzT26mQ | 5 | 2015-06-04 | This has become our go-to sushi place. The sus... | 0 | 0 | 0 |
1139855 | i7aiPgNrNaFoM8J_j2OSyQ | zz7lojg6QdZbKFCJiHsj7w | ii8sAGBexBOJoYRFafF9XQ | 1 | 2016-07-03 | I was very disappointed with the hotel. The re... | 2 | 1 | 1 |
3997153 | uft6iMwNQh4I2UDpmbXggA | p_oXN3L9oi8nmmJigf8c9Q | r0j4IpUbcdC1-HfoMYae4w | 5 | 2016-10-15 | Love this place - super amazing - staff here i... | 0 | 0 | 0 |
4262000 | y9QmJ16mrfBZS6Td6Yqo0g | jovtGPaHAqP6XfG9BFwY7A | j6UwIfXrSkGTdVkRu7K6WA | 5 | 2017-03-14 | Thank you Dana!!!! Having dyed my hair black p... | 0 | 0 | 0 |
In [25]:
# Separate the feature (review text) and the target (star rating)
X = df_nb['text']
y = df_nb['stars']
- CountVectorizer: counts how often each word occurs and arranges the counts into a document-term matrix (one row per review, one column per word); a toy example follows below
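A toy run on two invented sentences shows what fit_transform returns: a sparse document-term matrix whose columns correspond to the learned vocabulary (get_feature_names_out assumes scikit-learn >= 1.0; older versions use get_feature_names):
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer

toy = ["the food was great", "the service was slow slow"]
toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy)      # sparse matrix of shape (2, n_words)
print(toy_cv.get_feature_names_out())  # learned vocabulary
print(toy_X.toarray())                 # word counts per document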
In [26]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(X)
In [27]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)
In [28]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train, y_train)
pred = model.predict(X_test)
In [29]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
print(accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
0.9195
[[ 444   58]
 [ 103 1395]]
              precision    recall  f1-score   support

           1       0.81      0.88      0.85       502
           5       0.96      0.93      0.95      1498

    accuracy                           0.92      2000
   macro avg       0.89      0.91      0.90      2000
weighted avg       0.92      0.92      0.92      2000
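One way to peek inside the fitted model is MultinomialNB's feature_log_prob_ attribute, which stores the per-class word log-likelihoods. A sketch that lists the words most indicative of each class (model.classes_ here is [1, 5], so the row difference below favors 5-star words; get_feature_names_out assumes scikit-learn >= 1.0):
In [ ]:
words = cv.get_feature_names_out()
# Log-likelihood difference between the 5-star and 1-star classes
log_odds = model.feature_log_prob_[1] - model.feature_log_prob_[0]
print("Most 5-star-like:", words[np.argsort(log_odds)[-10:]])
print("Most 1-star-like:", words[np.argsort(log_odds)[:10]])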
Random Forest Model¶
In [30]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(max_depth=10, n_estimators=1000)
rf.fit(X_train, y_train)
pred2 = rf.predict(X_test)
In [31]:
print(accuracy_score(y_test, pred2))
print(confusion_matrix(y_test, pred2))
print(classification_report(y_test, pred2))
0.7805
[[  68  434]
 [   5 1493]]
              precision    recall  f1-score   support

           1       0.93      0.14      0.24       502
           5       0.77      1.00      0.87      1498

    accuracy                           0.78      2000
   macro avg       0.85      0.57      0.55      2000
weighted avg       0.81      0.78      0.71      2000
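The forest's recall on 1-star reviews collapses to 0.14, which is typical when the majority class outnumbers the minority roughly 3:1 and the trees are kept shallow. One candidate remedy, sketched below without any claim that it is the best fix here, is class_weight='balanced', which reweights samples inversely to class frequency:
In [ ]:
# Reweight classes inversely to frequency to counter the 5-star majority
rf_bal = RandomForestClassifier(max_depth=10, n_estimators=1000,
                                class_weight='balanced')
rf_bal.fit(X_train, y_train)
print(classification_report(y_test, rf_bal.predict(X_test)))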