[ Natural Language Processing ] Hands-on Word Inference with BERT

2023. 9. 12. 02:30 · Natural Language Processing


Hugging Face

  • A module that implements a wide range of models, training datasets, and training methods on top of the Transformer architecture
  • Question answering, text classification, text summarization, named entity recognition, text generation, translation, and language modeling
!pip install transformers
------------------------------------------------
Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.2/7.2 MB 62.1 MB/s eta 0:00:00
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers) (3.12.2)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 268.8/268.8 kB 31.5 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (1.22.4)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers) (23.1)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers) (6.0)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers) (2022.10.31)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers) (2.27.1)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.8/7.8 MB 98.9 MB/s eta 0:00:00
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 76.2 MB/s eta 0:00:00
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers) (4.65.0)
Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.14.1->transformers) (2023.6.0)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.14.1->transformers) (4.7.1)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (1.26.16)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2023.5.7)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2.0.12)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.4)
Installing collected packages: tokenizers, safetensors, huggingface-hub, transformers
Successfully installed huggingface-hub-0.16.4 safetensors-0.3.1 tokenizers-0.13.3 transformers-4.30.2
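
Before the step-by-step walkthrough below, the same kind of masked-word inference can be tried with the high-level pipeline API, which covers the tasks listed above. A minimal sketch (the example sentence is only for illustration):

from transformers import pipeline

# Fill-mask pipeline backed by the same KLUE BERT checkpoint used below
fill_mask = pipeline('fill-mask', model='klue/bert-base')

# Prints the top candidate tokens for the [MASK] position with their scores
for pred in fill_mask('이순신은 조선의 [MASK]이다.'):
    print(pred['token_str'], pred['score'])
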
import torch
from transformers import BertTokenizer, BertModel, BertForMaskedLM

# KLUE
# A language model used for Korean natural language processing tasks;
# a Korean BERT released as a baseline for the KLUE (Korean Language Understanding Evaluation) benchmark
# Trained through pre-training followed by fine-tuning on various downstream tasks
tokenizer = BertTokenizer.from_pretrained('klue/bert-base')


text = '[CLS] 이순신은 누구입니까? [SEP] 16세기 말 조선의 명장이자 충무공이며 임진왜란 및 정유재란 당시 조선 수군을 지휘했던 제독이다 [SEP]'
tokenized_text = tokenizer.tokenize(text)
print(tokenized_text)

------------------------------------------------------------------------
# Result
['[CLS]', '이순신', '##은', '누구', '##입', '##니까', '?', '[SEP]', '16', '##세기', '말', '조선', '##의', '명장', '##이', '##자', '충무', '##공', '##이', '##며', '임진왜란', '및', '정유', '##재', '##란', '당시', '조선', '수군', '##을', '지휘', '##했', '##던', '제독', '##이다', '[SEP]']
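
For reference, the tokenizer can also build the inputs directly instead of assembling [CLS]/[SEP] markers and segment IDs by hand as in the steps below. A small sketch, assuming the same question/passage pair:

# Passing the two sentences separately lets the tokenizer insert [CLS]/[SEP]
# and compute token_type_ids (segment IDs) automatically
question = '이순신은 누구입니까?'
passage = '16세기 말 조선의 명장이자 충무공이며 임진왜란 및 정유재란 당시 조선 수군을 지휘했던 제독이다'
encoded = tokenizer(question, passage, return_tensors='pt')
print(encoded['input_ids'])        # token IDs including the special tokens
print(encoded['token_type_ids'])   # 0 for the first segment, 1 for the second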

------------------------------------------------------------------------
# Replace the token at index 16 ('충무') with [MASK]; this is the word the model will be asked to recover
masked_index = 16
tokenized_text[masked_index] = '[MASK]'
print(tokenized_text)
------------------------------------------------------------------------
# Result
['[CLS]', '이순신', '##은', '누구', '##입', '##니까', '?', '[SEP]', '16', '##세기', '말', '조선', '##의', '명장', '##이', '##자', '[MASK]', '##공', '##이', '##며', '임진왜란', '및', '정유', '##재', '##란', '당시', '조선', '수군', '##을', '지휘', '##했', '##던', '제독', '##이다', '[SEP]']

----------------------------------------------------------------------------
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
print(indexed_tokens)
----------------------------------------------------------------------------
# Result
[2, 10661, 2073, 4061, 2372, 3707, 35, 3, 3879, 14223, 1041, 3957, 2079, 17449, 2052, 2155, 4, 2086, 2052, 2307, 15294, 1116, 9530, 2070, 2241, 3817, 3957, 15560, 2069, 5872, 2371, 2414, 28348, 28674, 3]
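
In the ID list above, 2, 3, and 4 correspond to the special tokens. This can be confirmed from the tokenizer itself (a quick sketch):

# Special token IDs of the klue/bert-base vocabulary, as seen in the output above
print(tokenizer.cls_token_id)    # 2  -> [CLS]
print(tokenizer.sep_token_id)    # 3  -> [SEP]
print(tokenizer.mask_token_id)   # 4  -> [MASK], the value at masked_index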

-----------------------------------------------------------------------------
# Segment IDs: 0 for the first sentence (through the first [SEP]), 1 for the second sentence
segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

tokens_tensor = torch.tensor([indexed_tokens])
segment_tensor = torch.tensor([segments_ids])

model = BertModel.from_pretrained('klue/bert-base')
model.eval()
----------------------------------------------------------------------------------
# Result
Some weights of the model checkpoint at klue/bert-base were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(32000, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)
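
The segments_ids above were typed out by hand. They can also be derived from the position of the first [SEP] in the token list; a small sketch:

# Segment IDs from the token list: 0 through the first [SEP], 1 for the rest
first_sep = tokenized_text.index('[SEP]')
auto_segments = [0] * (first_sep + 1) + [1] * (len(tokenized_text) - first_sep - 1)
assert auto_segments == segments_ids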

-------------------------------------------------------------------------------
# Move the input tensors and the model to the GPU (requires a CUDA runtime)
tokens_tensor = tokens_tensor.to('cuda')
segment_tensor = segment_tensor.to('cuda')
model.to('cuda')
-------------------------------------------------------------------------------
# Result
BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(32000, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)
-------------------------------------------------------------------------------
with torch.no_grad():
  outputs = model(tokens_tensor, token_type_ids=segment_tensor)
  encoded_layers = outputs[0]  # last hidden state: [batch, sequence length, hidden size]
print(encoded_layers)
print(encoded_layers.shape)

----------------------------------------------------------------------
# Result
tensor([[[ 1.0652, -0.5908, -1.5817,  ...,  0.7246, -0.0901,  1.0030],
         [ 0.8961,  0.3418, -0.7090,  ...,  0.8243,  0.2939,  1.3385],
         [-0.7230, -0.8876, -1.0106,  ...,  2.3141, -0.2984,  0.1895],
         ...,
         [-0.0042, -0.0462, -0.2227,  ...,  1.1374,  0.7977,  1.7137],
         [-0.5353, -0.9521, -1.7591,  ...,  1.3387,  0.5158,  1.6115],
         [-0.0802, -0.9287, -1.7088,  ...,  0.1650, -0.5035,  1.4360]]],
       device='cuda:0')
torch.Size([1, 35, 768])

-----------------------------------------------------------------------
# Note: encoded_layers here is BertModel's hidden state (dimension 768), not vocabulary logits
predicted_index = torch.argmax(encoded_layers[0, masked_index]).item()
print(predicted_index)
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_token)
-------------------------------------------------------------------------
# Result
563
갱

The value 563 is not a real prediction: BertModel's outputs[0] is the last hidden state of shape [1, 35, 768], so the argmax only picks the largest of the 768 hidden dimensions, and mapping that number through the vocabulary returns an unrelated token ('갱'). To recover the masked word, the hidden state has to be projected onto the vocabulary, which is exactly what BertForMaskedLM does below.
-------------------------------------------------------------------------
# BertForMaskedLM adds an MLM head that projects each hidden state onto the vocabulary
model = BertForMaskedLM.from_pretrained('klue/bert-base')
model.eval()

tokens_tensor = tokens_tensor.to('cuda')
segment_tensor = segment_tensor.to('cuda')
model.to('cuda')

with torch.no_grad():
    outputs = model(tokens_tensor, token_type_ids=segment_tensor)
    encoded_layers = outputs[0]  # vocabulary logits: [batch, sequence length, vocab size]

# The argmax is now a vocabulary ID, so the masked word can be recovered
predicted_index = torch.argmax(encoded_layers[0, masked_index]).item()
print(predicted_index)
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(predicted_token)
-----------------------------------------------------------------------------
# Result
Some weights of the model checkpoint at klue/bert-base were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
15100
충무
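
Beyond the single argmax, the MLM logits can also be ranked to see which other tokens the model considers likely for the masked position. A short sketch, reusing encoded_layers from the BertForMaskedLM run above:

# Top-5 candidate tokens for the masked position, ranked by logit
top5 = torch.topk(encoded_layers[0, masked_index], k=5)
for token_id in top5.indices.tolist():
    print(tokenizer.convert_ids_to_tokens([token_id])[0])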