[ Natural Language Processing ] Embedding Visualization
예진또이(애덤스미스 아님)
2023. 9. 12. 02:03
# Install the Nanum Korean fonts and rebuild the matplotlib font cache
# (in Colab you may need to restart the runtime afterwards so matplotlib picks up the fonts)
!sudo apt-get install -y fonts-nanum
!sudo fc-cache -fv
!rm ~/.cache/matplotlib -rf
-----------------------------------------------------------------------------------------------
# Result
Reading package lists... Done
Building dependency tree
Reading state information... Done
fonts-nanum is already the newest version (20180306-3).
0 upgraded, 0 newly installed, 0 to remove and 15 not upgraded.
/usr/share/fonts: caching, new cache contents: 0 fonts, 1 dirs
/usr/share/fonts/truetype: caching, new cache contents: 0 fonts, 3 dirs
/usr/share/fonts/truetype/humor-sans: caching, new cache contents: 1 fonts, 0 dirs
/usr/share/fonts/truetype/liberation: caching, new cache contents: 16 fonts, 0 dirs
/usr/share/fonts/truetype/nanum: caching, new cache contents: 10 fonts, 0 dirs
/usr/local/share/fonts: caching, new cache contents: 0 fonts, 0 dirs
/root/.local/share/fonts: skipping, no such directory
/root/.fonts: skipping, no such directory
/usr/share/fonts/truetype: skipping, looped directory detected
/usr/share/fonts/truetype/humor-sans: skipping, looped directory detected
/usr/share/fonts/truetype/liberation: skipping, looped directory detected
/usr/share/fonts/truetype/nanum: skipping, looped directory detected
/var/cache/fontconfig: cleaning cache directory
/root/.cache/fontconfig: not cleaning non-existent cache directory
/root/.fontconfig: not cleaning non-existent cache directory
fc-cache: succeeded
1. The Naver Movie Review Dataset (NSMC)
- A dataset of 200,000 movie reviews in total, built for classifying reviews as positive or negative
- Each review carries a label: 1 if positive, 0 if negative
import urllib.request
import pandas as pd

# Download the NSMC train/test splits
urllib.request.urlretrieve('https://raw.githubusercontent.com/e9t/nsmc/master/ratings_train.txt', filename='ratings_train.txt')
urllib.request.urlretrieve('https://raw.githubusercontent.com/e9t/nsmc/master/ratings_test.txt', filename='ratings_test.txt')
-----------------------------------------------------------------------------------------------
# Result
('ratings_test.txt', <http.client.HTTPMessage at 0x7fc84d044f40>)
-----------------------------------------------------------------------------------------------
# Note: the 50,000-review test split is loaded here; swap in 'ratings_train.txt'
# for the full 150,000-review training split
train_dataset = pd.read_table('ratings_test.txt')
train_dataset
# Result: DataFrame preview (image omitted)
len(train_dataset)
--------------------------------------
# Result
50000
2. Data Preprocessing
# Replace empty strings with NaN, then check for missing values
train_dataset.replace('', float('NaN'), inplace=True)
train_dataset.isnull().values.any()
-------------------------------------------------------
# Result
True
-------------------------------------------------------
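Before dropping rows, it helps to see which column actually holds the missing values. A minimal check (not in the original post):

train_dataset.isnull().sum()  # per-column count of missing values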
train_dataset = train_dataset.dropna().reset_index(drop=True)
len(train_dataset)
-------------------------------------------------------------
# Result
49997
--------------------------------------------------------------
# Drop rows whose 'document' column is duplicated
train_dataset = train_dataset.drop_duplicates(['document']).reset_index(drop=True)
len(train_dataset)
--------------------------------------------------------------
# Result
49157
---------------------------------------------------------------
# Replace every character that is not Korean with a space (jamo such as ㅋㅋㅋ are kept)
train_dataset['document'] = train_dataset['document'].str.replace('[^ㄱ-ㅎㅏ-ㅣ가-힣]', ' ', regex=True)
train_dataset
# Result: DataFrame preview (image omitted)
# Remove short tokens: keep only tokens longer than two characters
train_dataset['document'] = train_dataset['document'].apply(lambda x: ' '.join([token for token in x.split() if len(token) > 2]))
train_dataset
# Result: DataFrame preview (image omitted)
# Keep only reviews longer than 60 characters that contain more than 5 words
train_dataset = train_dataset[train_dataset.document.apply(lambda x: len(str(x)) > 60 and len(str(x).split()) > 5)].reset_index(drop=True)
train_dataset
# Result: DataFrame preview (image omitted)
!pip install konlpy
--------------------------------------------------
# Result
Requirement already satisfied: konlpy in /usr/local/lib/python3.10/dist-packages (0.6.0)
Requirement already satisfied: JPype1>=0.7.0 in /usr/local/lib/python3.10/dist-packages (from konlpy) (1.4.1)
Requirement already satisfied: lxml>=4.1.0 in /usr/local/lib/python3.10/dist-packages (from konlpy) (4.9.2)
Requirement already satisfied: numpy>=1.6 in /usr/local/lib/python3.10/dist-packages (from konlpy) (1.22.4)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from JPype1>=0.7.0->konlpy) (23.1)
----------------------------------------------------
from konlpy.tag import Okt
# Define Korean stopwords (particles and other function words)
stopwords = ['의', '가', '이', '은', '들', '는', '좀', '잘', '걍', '과', '도', '를', '으로', '자', '에', '와', '한', '하다']
train_dataset = list(train_dataset['document'])
# train_dataset
okt = Okt()
tokenized_data = []
# Tokenize each review with Okt morphological analysis (stem=True normalizes
# verb/adjective forms), then drop stopwords
for sentence in train_dataset:
    tokenized_sentence = okt.morphs(sentence, stem=True)
    stopwords_removed_sentence = [word for word in tokenized_sentence if word not in stopwords]
    tokenized_data.append(stopwords_removed_sentence)
tokenized_data[0]
---------------------------------------------------------------
# Result
['갈수록',
'개판',
'되다',
'중국영화',
'유치하다',
'내용',
'없다',
'폼',
'잡다',
'말',
'안되다',
'무기',
'유치하다',
'그리다',
'동사서독',
'같다',
'영화',
'류',
'아',
'류작',
'이다']
-------------------------------------------------------------
import matplotlib.pyplot as plt

print('Maximum review length: ', max(len(review) for review in tokenized_data))
print('Average review length: ', sum(map(len, tokenized_data)) / len(tokenized_data))
plt.hist([len(review) for review in tokenized_data], bins=50)
plt.xlabel('length of samples')
plt.ylabel('number of samples')
plt.show()
# Result: length statistics printout and histogram (image omitted)
3. Building the Word Embeddings
from gensim.models import Word2Vec
embedding_dim = 100
# sg: 0 = CBOW, 1 = Skip-gram
model = Word2Vec(
    sentences=tokenized_data,
    vector_size=embedding_dim,
    window=5,
    min_count=5,
    workers=4,
    sg=0
)
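For comparison, sg=1 trains a Skip-gram model instead of CBOW; Skip-gram often represents infrequent words better at the cost of slower training. A minimal sketch (the sg_model name is illustrative, not from the original post):

sg_model = Word2Vec(
    sentences=tokenized_data,
    vector_size=embedding_dim,
    window=5,
    min_count=5,
    workers=4,
    sg=1  # Skip-gram instead of CBOW
)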
# Size of the embedding matrix:
# the vocabulary holds 3,137 words, each embedded in the preset embedding_dim=100 dimensions
model.wv.vectors.shape
-------------------------------------------------------------------
# Result
(3137, 100)
--------------------------------------------------------------------
word_vectors = model.wv
vocabs = list(word_vectors.index_to_key)  # vocabulary, ordered by descending frequency
vocabs[:20]
--------------------------------------------------------------------
# Result
['영화',
'보다',
'을',
'이다',
'있다',
'적',
'로',
'없다',
'되다',
'에서',
'아니다',
'같다',
'생각',
'만',
'좋다',
'사람',
'인',
'다',
'나오다',
'않다']
--------------------------------------------------
for sim_word in model.wv.most_similar('영화'):
    print(sim_word)
---------------------------------------------------
# Result
('걸', 0.9991418123245239)
('재미없다', 0.9990863800048828)
('중', 0.999067485332489)
('속', 0.9990255832672119)
('재미있다', 0.999021053314209)
('생각', 0.9990206956863403)
('수준', 0.9989945292472839)
('친구', 0.9989907741546631)
('작품', 0.9989888072013855)
('명작', 0.9989792108535767)
-----------------------------------------------------
for sim_word in model.wv.most_similar('좋다'):
    print(sim_word)
------------------------------------------------------
# Result
('씨', 0.9995885491371155)
('없이', 0.9995604157447815)
('니', 0.9995359182357788)
('많다', 0.9995313882827759)
('인데', 0.9995303750038147)
('이나', 0.9995232224464417)
('재다', 0.9995187520980835)
('짜증나다', 0.9995148181915283)
('되어다', 0.9995091557502747)
('음악', 0.9995079636573792)
-------------------------------------------------------
model.wv.similarity('좋다', '괜찮다')
-------------------------------------------------------
# Result
0.9994349
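Note that every similarity score above sits suspiciously close to 1.0, which usually means the embedding space is undertrained (gensim trains for only 5 epochs by default). A hedged sketch that trains longer so the cosine similarities spread out (tuned_model and epochs=30 are illustrative choices, not from the original post):

tuned_model = Word2Vec(
    sentences=tokenized_data,
    vector_size=embedding_dim,
    window=5,
    min_count=5,
    workers=4,
    sg=0,
    epochs=30  # gensim 4.x default is 5
)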
4. Visualizing the Word Embeddings
# Find the Nanum fonts installed earlier so one can be selected for matplotlib
import matplotlib.font_manager
font_list = matplotlib.font_manager.findSystemFonts(fontpaths=None, fontext='ttf')
[matplotlib.font_manager.FontProperties(fname=font).get_name() for font in font_list if 'Nanum' in font]
-------------------------------------------------------------------
# Result
['NanumBarunGothic',
'NanumGothic',
'NanumSquare',
'NanumSquareRound',
'NanumBarunGothic',
'NanumMyeongjo',
'NanumSquare',
'NanumSquareRound',
'NanumGothic',
'NanumMyeongjo']
-------------------------------------------------------------------
plt.rc('font', family='NanumGothic')
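A companion setting worth adding here (an assumption, not in the original post): with a Korean font active, matplotlib can render the unicode minus sign as a missing glyph, so it is common to fall back to the ASCII hyphen:

plt.rcParams['axes.unicode_minus'] = False  # avoid broken minus glyphs under NanumGothic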
word_vector_list = [word_vectors[word] for word in vocabs]
word_vector_list[0]
---------------------------------------------------------------------
# Result
array([-0.21918485, 0.28382018, 0.21966262, 0.12835518, 0.17709662,
-0.72233176, 0.07260188, 0.98034286, -0.41017163, -0.01281969,
-0.25179505, -0.6974792 , -0.03544386, 0.3049884 , 0.18475449,
-0.4098386 , 0.09352504, -0.3410262 , -0.12994663, -0.791819 ,
0.35505405, 0.13900925, 0.33304736, -0.28356975, 0.0256617 ,
-0.03121308, -0.36421636, -0.14121231, -0.35296145, 0.137801 ,
0.5315192 , 0.19671257, 0.05734828, -0.27934414, -0.31424147,
0.44539452, 0.22070728, -0.3615338 , -0.23568882, -0.8758261 ,
-0.09035558, -0.3674138 , -0.20646371, 0.16797723, 0.49805075,
-0.13717142, -0.4402386 , -0.1137519 , 0.33916476, 0.33037713,
0.12584367, -0.35951215, -0.20134424, -0.07037662, -0.1785056 ,
0.12041386, 0.3291886 , -0.04990957, -0.53875107, 0.13610046,
-0.08868801, 0.21108653, -0.07903782, 0.03985845, -0.62521714,
0.37372404, 0.04797073, 0.39069578, -0.6340038 , 0.4984773 ,
-0.25026834, 0.25867477, 0.57225466, -0.04777239, 0.49445954,
0.1329986 , 0.21337293, -0.0900127 , -0.31222627, 0.14912027,
-0.24639623, -0.00579283, -0.441542 , 0.731052 , -0.03494806,
0.04168737, -0.0131084 , 0.35270578, 0.43479764, 0.1896215 ,
0.4878867 , 0.14898925, 0.14942856, 0.17013623, 0.8141121 ,
0.44618127, 0.10866941, -0.46221238, 0.08273039, -0.07976237],
dtype=float32)
-------------------------------------------------------------------------
# PCA is a popular choice for dimensionality reduction, but as a linear projection it can blur the separation between clusters
# t-SNE is a nonlinear alternative that preserves local neighborhood structure, so clusters tend to stay visually distinct
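For contrast with the t-SNE projection below, the linear PCA baseline takes two lines. A minimal sketch using scikit-learn (pca_coords is an illustrative name; this comparison is not in the original post):

import numpy as np
from sklearn.decomposition import PCA

# Linear 2-D projection of the same word vectors, for comparison with t-SNE
pca_coords = PCA(n_components=2).fit_transform(np.array(word_vector_list))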
from sklearn.manifold import TSNE
import numpy as np
# Project the 100-dimensional word vectors down to 2-D (n_components defaults to 2)
tsne = TSNE(learning_rate=100)
transformed = tsne.fit_transform(np.array(word_vector_list))
x_axis_tsne = transformed[:, 0]
y_axis_tsne = transformed[:, 1]
def plot_tsne_graph(vocabs, x_axis, y_axis):
    plt.figure(figsize=(30, 30))
    plt.scatter(x_axis, y_axis, marker='o')
    for i, v in enumerate(vocabs):
        plt.annotate(v, xy=(x_axis[i], y_axis[i]))

plot_tsne_graph(vocabs, x_axis_tsne, y_axis_tsne)
# Result: t-SNE scatter plot with word annotations (image omitted)
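Annotating all ~3,000 vocabulary words makes the figure hard to read. Since index_to_key is ordered by descending frequency, one option is to plot only the most frequent words (a sketch; top_n=300 is an arbitrary cutoff, not from the original post):

top_n = 300
plot_tsne_graph(vocabs[:top_n], x_axis_tsne[:top_n], y_axis_tsne[:top_n])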
5. Enhancing the t-SNE Visualization
- Upgrade the visualization with bokeh, an interactive visualization library for Python
# Put the t-SNE coordinates in a DataFrame, keeping each word for hover tooltips
tsne_df = pd.DataFrame(transformed, columns=['x_coord', 'y_coord'])
tsne_df['word'] = vocabs  # required so the '@word' hover tooltip below can resolve
tsne_df
# Result: DataFrame preview (image omitted)
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource
output_notebook()
# prepare the data in a form suitable for bokeh
plot_data = ColumnDataSource(tsne_df)
# create the plot and configure it
tsne_plot = figure(
    title='t-SNE Word Embeddings',
    width=800,   # named plot_width/plot_height before bokeh 3.0
    height=800,
    active_scroll='wheel_zoom'
)
# add a hover tool to display words on roll-over
tsne_plot.add_tools(HoverTool(tooltips='@word'))
tsne_plot.circle(
    'x_coord', 'y_coord', source=plot_data,
    color='red', line_alpha=0.2, fill_alpha=0.1,
    size=10, hover_line_color='orange'
)
# adjust visual elements of the plot
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None
# show time!
show(tsne_plot)
# Result: interactive bokeh scatter with hover tooltips (embedded plot omitted)
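To keep the interactive plot outside the notebook, bokeh can also write it to a standalone HTML file. A minimal sketch (the filename is an assumption):

from bokeh.plotting import output_file, save

output_file('tsne_word_embeddings.html')
save(tsne_plot)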
# Save the trained vectors in word2vec text format
model.wv.save_word2vec_format('sample_word2vec_embedding')
# Convert to Embedding Projector format; reusing the name makes the script emit
# sample_word2vec_embedding_tensor.tsv and sample_word2vec_embedding_metadata.tsv
!python -m gensim.scripts.word2vec2tensor --input sample_word2vec_embedding --output sample_word2vec_embedding
-----------------------------------------------------------------------------------------------
# Result
2023-07-05 04:56:43,649 - word2vec2tensor - INFO - running /usr/local/lib/python3.10/dist-packages/gensim/scripts/word2vec2tensor.py --input sample_word2vec_embedding --output sample_word2vec_embedding
2023-07-05 04:56:43,649 - keyedvectors - INFO - loading projection weights from sample_word2vec_embedding
2023-07-05 04:56:43,873 - utils - INFO - KeyedVectors lifecycle event {'msg': 'loaded (3137, 100) matrix of type float32 from sample_word2vec_embedding', 'binary': False, 'encoding': 'utf8', 'datetime': '2023-07-05T04:56:43.871106', 'gensim': '4.3.1', 'python': '3.10.12 (main, Jun 7 2023, 12:45:35) [GCC 9.4.0]', 'platform': 'Linux-5.15.107+-x86_64-with-glibc2.31', 'event': 'load_word2vec_format'}
2023-07-05 04:56:44,118 - word2vec2tensor - INFO - 2D tensor file saved to sample_word2vec_embedding_tensor.tsv
2023-07-05 04:56:44,118 - word2vec2tensor - INFO - Tensor metadata file saved to sample_word2vec_embedding_metadata.tsv
2023-07-05 04:56:44,119 - word2vec2tensor - INFO - finished running word2vec2tensor.py
Embedding Projector
- Download the generated tensor and metadata TSV files, then upload them via the "Load" button at https://projector.tensorflow.org
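When running in Colab, the two TSV files written above can be pulled down to the local machine first (a sketch; files.download only works inside Google Colab):

from google.colab import files

files.download('sample_word2vec_embedding_tensor.tsv')
files.download('sample_word2vec_embedding_metadata.tsv')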