Natural Language Processing

[ Natural Language Processing ] Embedding Visualization

예진또이(애덤스미스 아님) 2023. 9. 12. 02:03
!sudo apt-get install -y fonts-nanum
!sudo fc-cache -fv
!rm ~/.cache/matplotlib -rf
-----------------------------------------------------------------------------------------------
# Result
Reading package lists... Done
Building dependency tree       
Reading state information... Done
fonts-nanum is already the newest version (20180306-3).
0 upgraded, 0 newly installed, 0 to remove and 15 not upgraded.
/usr/share/fonts: caching, new cache contents: 0 fonts, 1 dirs
/usr/share/fonts/truetype: caching, new cache contents: 0 fonts, 3 dirs
/usr/share/fonts/truetype/humor-sans: caching, new cache contents: 1 fonts, 0 dirs
/usr/share/fonts/truetype/liberation: caching, new cache contents: 16 fonts, 0 dirs
/usr/share/fonts/truetype/nanum: caching, new cache contents: 10 fonts, 0 dirs
/usr/local/share/fonts: caching, new cache contents: 0 fonts, 0 dirs
/root/.local/share/fonts: skipping, no such directory
/root/.fonts: skipping, no such directory
/usr/share/fonts/truetype: skipping, looped directory detected
/usr/share/fonts/truetype/humor-sans: skipping, looped directory detected
/usr/share/fonts/truetype/liberation: skipping, looped directory detected
/usr/share/fonts/truetype/nanum: skipping, looped directory detected
/var/cache/fontconfig: cleaning cache directory
/root/.cache/fontconfig: not cleaning non-existent cache directory
/root/.fontconfig: not cleaning non-existent cache directory
fc-cache: succeeded

1. Naver Movie Review Dataset

  • A dataset of 200,000 movie reviews in total, built for classifying reviews as positive or negative
  • Each review carries a label: 1 if positive, 0 if negative (a quick label-balance check is sketched after the loading code below)
import urllib.request
import pandas as pd

urllib.request.urlretrieve('https://raw.githubusercontent.com/e9t/nsmc/master/ratings_train.txt', filename='ratings_train.txt' )
urllib.request.urlretrieve('https://raw.githubusercontent.com/e9t/nsmc/master/ratings_test.txt', filename='ratings_test.txt' )
-----------------------------------------------------------------------------------------------
# Result
('ratings_test.txt', <http.client.HTTPMessage at 0x7fc84d044f40>)

-----------------------------------------------------------------------------------------------
# NOTE: both splits were downloaded above, but the smaller test split (50,000 reviews) is used as the working dataset here
train_dataset = pd.read_table('ratings_test.txt')
train_dataset

# Result: preview of the loaded DataFrame (screenshot omitted)

len(train_dataset)
--------------------------------------
# Result
50000
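
The dataset description above promises a binary 0/1 label, so a quick sanity check of the label balance is cheap insurance before preprocessing. A minimal sketch using pandas:

# count how many reviews carry each label; NSMC is roughly balanced,
# so the two classes should appear in similar numbers
print(train_dataset['label'].value_counts())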

2. Data Preprocessing

# Convert empty strings to NaN and check whether any missing values exist
train_dataset.replace('', float('NaN'), inplace=True)
train_dataset.isnull().values.any()
-------------------------------------------------------
# Result
True

-------------------------------------------------------
train_dataset = train_dataset.dropna().reset_index(drop=True)
len(train_dataset)
-------------------------------------------------------------
# Result
49997

--------------------------------------------------------------
# Remove duplicate rows based on the document column
train_dataset = train_dataset.drop_duplicates(['document']).reset_index(drop=True)
len(train_dataset)
--------------------------------------------------------------
# Result
49157

---------------------------------------------------------------
# Remove characters that are not Korean (jamo strings like ㅋㅋㅋ are kept); regex=True keeps this working on recent pandas
train_dataset['document'] = train_dataset['document'].str.replace('[^ㄱ-ㅎㅏ-ㅣ가-힣]', ' ', regex=True)
train_dataset

# Result: DataFrame preview (screenshot omitted)

# Remove very short tokens (only tokens longer than two characters are kept)
train_dataset['document'] = train_dataset['document'].apply(lambda x: ' '.join([token for token in x.split() if len(token) > 2]))
train_dataset

# Result: DataFrame preview (screenshot omitted)

# Keep only reviews longer than 60 characters that contain more than 5 tokens
train_dataset = train_dataset[train_dataset.document.apply(lambda x: len(str(x)) > 60 and len(str(x).split()) > 5)].reset_index(drop=True)
train_dataset

# Result: DataFrame preview (screenshot omitted)
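
The cleaning steps above can be bundled into a single helper so the same pipeline can be reapplied later (for example to ratings_train.txt). This is only a recap sketch of the steps already shown, using the same thresholds:

def clean_reviews(df):
    # drop missing values and duplicate documents
    df = df.replace('', float('NaN')).dropna()
    df = df.drop_duplicates(['document']).reset_index(drop=True)
    # keep Korean characters only, then drop tokens of one or two characters
    df['document'] = df['document'].str.replace('[^ㄱ-ㅎㅏ-ㅣ가-힣]', ' ', regex=True)
    df['document'] = df['document'].apply(
        lambda x: ' '.join([token for token in x.split() if len(token) > 2]))
    # keep only reviews longer than 60 characters with more than 5 tokens
    mask = df.document.apply(lambda x: len(str(x)) > 60 and len(str(x).split()) > 5)
    return df[mask].reset_index(drop=True)

# usage: train_dataset = clean_reviews(pd.read_table('ratings_test.txt'))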

!pip install konlpy
--------------------------------------------------
# Result
Requirement already satisfied: konlpy in /usr/local/lib/python3.10/dist-packages (0.6.0)
Requirement already satisfied: JPype1>=0.7.0 in /usr/local/lib/python3.10/dist-packages (from konlpy) (1.4.1)
Requirement already satisfied: lxml>=4.1.0 in /usr/local/lib/python3.10/dist-packages (from konlpy) (4.9.2)
Requirement already satisfied: numpy>=1.6 in /usr/local/lib/python3.10/dist-packages (from konlpy) (1.22.4)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from JPype1>=0.7.0->konlpy) (23.1)

----------------------------------------------------
from konlpy.tag import Okt

# Define stopwords: Korean particles and other function words to filter out
stopwords = ['의', '가', '이', '은', '들', '는', '좀', '잘', '걍', '과', '도', '를', '으로', '자', '에', '와', '한', '하다']

# from here on, only the raw review strings are needed
train_dataset = list(train_dataset['document'])

okt = Okt()

tokenized_data = []

for sentence in train_dataset:
    tokenized_sentence = okt.morphs(sentence, stem=True)  # split into morphemes, reducing each to its dictionary form
    stopwords_removed_sentence = [word for word in tokenized_sentence if word not in stopwords]
    tokenized_data.append(stopwords_removed_sentence)
    
tokenized_data[0]
---------------------------------------------------------------
# Result
['갈수록',
 '개판',
 '되다',
 '중국영화',
 '유치하다',
 '내용',
 '없다',
 '폼',
 '잡다',
 '말',
 '안되다',
 '무기',
 '유치하다',
 '그리다',
 '동사서독',
 '같다',
 '영화',
 '류',
 '아',
 '류작',
 '이다']
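
The dictionary forms in the output above ('되다', '유치하다' rather than their inflected surface forms) are produced by stem=True. A quick way to see the difference, on a made-up sentence (the example sentence is mine, not from the dataset):

# stem=True collapses conjugated forms back to the dictionary form
print(okt.morphs('유치했던 영화였다', stem=True))
# stem=False keeps the inflected surface forms
print(okt.morphs('유치했던 영화였다', stem=False))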
 
-------------------------------------------------------------
import matplotlib.pyplot as plt

print('Max review length: ', max(len(review) for review in tokenized_data))
print('Mean review length: ', sum(map(len, tokenized_data))/len(tokenized_data))
plt.hist([len(review) for review in tokenized_data], bins=50)
plt.xlabel('length of samples')
plt.ylabel('number of samples')
plt.show()

# Result: histogram of review lengths (plot omitted)


3. Building Word Embeddings

from gensim.models import Word2Vec

embedding_dim = 100

# sg: 0(CBOW), 1(Skip-gram)
model = Word2Vec(
    sentences = tokenized_data,
    vector_size = embedding_dim,
    window = 5,
    min_count = 5,
    workers = 4,
    sg=0
)

# Size of the embedding matrix
# The vocabulary contains 3,137 words, each embedded in the embedding_dim=100 dimensional space set above
model.wv.vectors.shape
-------------------------------------------------------------------
# Result
(3137, 100)
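
The 3,137 rows are exactly the tokens that survive the min_count=5 cutoff. A sanity-check sketch that recounts them directly from the tokenized corpus (the printed count should match the first dimension above):

from collections import Counter

# count raw token frequencies and keep those appearing at least min_count=5 times,
# mirroring how Word2Vec builds its vocabulary
freq = Counter(token for sentence in tokenized_data for token in sentence)
print(sum(1 for count in freq.values() if count >= 5))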

--------------------------------------------------------------------
word_vectors = model.wv
vocabs = list(word_vectors.index_to_key)
vocabs[:20]
--------------------------------------------------------------------
# Result
['영화',
 '보다',
 '을',
 '이다',
 '있다',
 '적',
 '로',
 '없다',
 '되다',
 '에서',
 '아니다',
 '같다',
 '생각',
 '만',
 '좋다',
 '사람',
 '인',
 '다',
 '나오다',
 '않다']
 
--------------------------------------------------
for sim_word in model.wv.most_similar('영화'):
    print(sim_word)
---------------------------------------------------
# Result
('걸', 0.9991418123245239)
('재미없다', 0.9990863800048828)
('중', 0.999067485332489)
('속', 0.9990255832672119)
('재미있다', 0.999021053314209)
('생각', 0.9990206956863403)
('수준', 0.9989945292472839)
('친구', 0.9989907741546631)
('작품', 0.9989888072013855)
('명작', 0.9989792108535767)

-----------------------------------------------------
for sim_word in model.wv.most_similar('좋다'):
    print(sim_word)
------------------------------------------------------
# Result
('씨', 0.9995885491371155)
('없이', 0.9995604157447815)
('니', 0.9995359182357788)
('많다', 0.9995313882827759)
('인데', 0.9995303750038147)
('이나', 0.9995232224464417)
('재다', 0.9995187520980835)
('짜증나다', 0.9995148181915283)
('되어다', 0.9995091557502747)
('음악', 0.9995079636573792)

-------------------------------------------------------
model.wv.similarity('좋다', '괜찮다')
-------------------------------------------------------
# Result
0.9994349
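
Note that every similarity printed in this section sits above 0.998, which usually signals undertrained embeddings: the filtered corpus is small and gensim trains for only 5 epochs by default. A hedged sketch of one remedy, assuming skip-gram plus more epochs spreads the vectors out (the parameter values are assumptions, not from the original run):

# skip-gram (sg=1) with extra epochs often yields more differentiated
# neighbors on small corpora; epochs defaults to 5 in gensim 4.x
model_sg = Word2Vec(
    sentences=tokenized_data,
    vector_size=embedding_dim,
    window=5,
    min_count=5,
    workers=4,
    sg=1,
    epochs=30,
)
print(model_sg.wv.most_similar('영화')[:3])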

4. Visualizing Word Embeddings

import matplotlib.font_manager

font_list = matplotlib.font_manager.findSystemFonts(fontpaths=None, fontext='ttf')
[matplotlib.font_manager.FontProperties(fname=font).get_name() for font in font_list if 'Nanum' in font]
-------------------------------------------------------------------
# Result
['NanumBarunGothic',
 'NanumGothic',
 'NanumSquare',
 'NanumSquareRound',
 'NanumBarunGothic',
 'NanumMyeongjo',
 'NanumSquare',
 'NanumSquareRound',
 'NanumGothic',
 'NanumMyeongjo']
 
-------------------------------------------------------------------
plt.rc('font', family='NanumGothic')

word_vector_list = [word_vectors[word] for word in vocabs]
word_vector_list[0]
---------------------------------------------------------------------
# Result
array([-0.21918485,  0.28382018,  0.21966262,  0.12835518,  0.17709662,
       -0.72233176,  0.07260188,  0.98034286, -0.41017163, -0.01281969,
       -0.25179505, -0.6974792 , -0.03544386,  0.3049884 ,  0.18475449,
       -0.4098386 ,  0.09352504, -0.3410262 , -0.12994663, -0.791819  ,
        0.35505405,  0.13900925,  0.33304736, -0.28356975,  0.0256617 ,
       -0.03121308, -0.36421636, -0.14121231, -0.35296145,  0.137801  ,
        0.5315192 ,  0.19671257,  0.05734828, -0.27934414, -0.31424147,
        0.44539452,  0.22070728, -0.3615338 , -0.23568882, -0.8758261 ,
       -0.09035558, -0.3674138 , -0.20646371,  0.16797723,  0.49805075,
       -0.13717142, -0.4402386 , -0.1137519 ,  0.33916476,  0.33037713,
        0.12584367, -0.35951215, -0.20134424, -0.07037662, -0.1785056 ,
        0.12041386,  0.3291886 , -0.04990957, -0.53875107,  0.13610046,
       -0.08868801,  0.21108653, -0.07903782,  0.03985845, -0.62521714,
        0.37372404,  0.04797073,  0.39069578, -0.6340038 ,  0.4984773 ,
       -0.25026834,  0.25867477,  0.57225466, -0.04777239,  0.49445954,
        0.1329986 ,  0.21337293, -0.0900127 , -0.31222627,  0.14912027,
       -0.24639623, -0.00579283, -0.441542  ,  0.731052  , -0.03494806,
        0.04168737, -0.0131084 ,  0.35270578,  0.43479764,  0.1896215 ,
        0.4878867 ,  0.14898925,  0.14942856,  0.17013623,  0.8141121 ,
        0.44618127,  0.10866941, -0.46221238,  0.08273039, -0.07976237],
      dtype=float32)
      
-------------------------------------------------------------------------
# PCA is the most common dimensionality-reduction method, but as a linear projection it can blur the separation between clusters
# t-SNE is a nonlinear alternative that preserves local neighborhoods, so clusters remain visually distinct
from sklearn.manifold import TSNE

import numpy as np

tsne = TSNE(learning_rate=100)
transformed = tsne.fit_transform(np.array(word_vector_list))

x_axis_tsne = transformed[:, 0]
y_axis_tsne = transformed[:, 1]

def plot_tsne_graph(vocabs, x_axis, y_axis):
  plt.figure(figsize=(30, 30))
  plt.scatter(x_axis, y_axis, marker = 'o')
  for i, v in enumerate(vocabs):
    plt.annotate(v, xy=(x_axis[i], y_axis[i]))

plot_tsne_graph(vocabs, x_axis_tsne, y_axis_tsne)

# Result: t-SNE scatter plot annotated with vocabulary words (plot omitted)
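
t-SNE is stochastic, so repeated runs scatter the words differently. If reproducibility matters, a variant worth trying (the init and random_state choices are assumptions):

# fixing random_state makes runs repeatable; init='pca' is a common choice
# that tends to stabilize the global layout
tsne = TSNE(n_components=2, learning_rate=100, init='pca', random_state=42)
transformed = tsne.fit_transform(np.array(word_vector_list))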

5. Enhancing the t-SNE Plot

  • Use bokeh, an interactive visualization library for Python, to make the embedding plot interactive
tsne_df = pd.DataFrame(transformed, columns = ['x_coord', 'y_coord'])
tsne_df['word'] = vocabs  # attach each word so the hover tooltip ('@word') below can display it

tsne_df

# Result: preview of tsne_df (screenshot omitted)

from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource

output_notebook()

# prepare the data in a form suitable for bokeh.
plot_data = ColumnDataSource(tsne_df)
# create the plot and configure it
# (note: bokeh >= 3.0 renamed plot_width/plot_height to width/height)
tsne_plot = figure(title='t-SNE Word Embeddings',
  plot_width = 800,
  plot_height = 800,
  active_scroll='wheel_zoom'
)
# add a hover tool to display words on roll-over
tsne_plot.add_tools( HoverTool(tooltips = '@word') )
tsne_plot.circle(
    'x_coord', 'y_coord', source=plot_data,
    color='red', line_alpha=0.2, fill_alpha=0.1,
    size=10, hover_line_color='orange'
  )
# adjust visual elements of the plot
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None
# show time!
show(tsne_plot);

# Result: interactive bokeh scatter plot (not shown)
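
Inside the notebook the plot renders inline via output_notebook(); to keep an interactive copy, bokeh can also write a standalone HTML file. A minimal sketch (the file name is an arbitrary choice):

from bokeh.plotting import output_file, save

# write the interactive plot to a self-contained HTML file
output_file('tsne_embeddings.html')
save(tsne_plot)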

model.wv.save_word2vec_format('sample_word2vec_embedding')  # save the trained embeddings in word2vec text format

# passing sample_word2vec_embedding as --input makes the converted output files share that name
!python -m gensim.scripts.word2vec2tensor --input sample_word2vec_embedding --output sample_word2vec_embedding

-----------------------------------------------------------------------------------------------
# Result
2023-07-05 04:56:43,649 - word2vec2tensor - INFO - running /usr/local/lib/python3.10/dist-packages/gensim/scripts/word2vec2tensor.py --input sample_word2vec_embedding --output sample_word2vec_embedding
2023-07-05 04:56:43,649 - keyedvectors - INFO - loading projection weights from sample_word2vec_embedding
2023-07-05 04:56:43,873 - utils - INFO - KeyedVectors lifecycle event {'msg': 'loaded (3137, 100) matrix of type float32 from sample_word2vec_embedding', 'binary': False, 'encoding': 'utf8', 'datetime': '2023-07-05T04:56:43.871106', 'gensim': '4.3.1', 'python': '3.10.12 (main, Jun  7 2023, 12:45:35) [GCC 9.4.0]', 'platform': 'Linux-5.15.107+-x86_64-with-glibc2.31', 'event': 'load_word2vec_format'}
2023-07-05 04:56:44,118 - word2vec2tensor - INFO - 2D tensor file saved to sample_word2vec_embedding_tensor.tsv
2023-07-05 04:56:44,118 - word2vec2tensor - INFO - Tensor metadata file saved to sample_word2vec_embedding_metadata.tsv
2023-07-05 04:56:44,119 - word2vec2tensor - INFO - finished running word2vec2tensor.py

Embedding Projector

  • Download the tensor and metadata TSV files and upload them via the Load panel of the TensorFlow Embedding Projector (https://projector.tensorflow.org); a quick consistency check is sketched below
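
Before uploading, it is worth confirming that the two TSV files produced above line up row for row. A minimal sketch:

# the tensor file holds one 100-dim vector per row and the metadata file one word
# per row; both row counts should equal the vocabulary size (3137)
tensors = pd.read_csv('sample_word2vec_embedding_tensor.tsv', sep='\t', header=None)
words = pd.read_csv('sample_word2vec_embedding_metadata.tsv', sep='\t', header=None)
print(len(tensors), len(words))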