MultiClass predict example

notebook : https://github.com/donarts/sourcecode/blob/main/pytorch/06_bert

from transformers import AutoModelForSequenceClassification
from transformers import Trainer
from transformers import AutoTokenizer

import pandas as pd

import torch

import numpy as np

Prepare the same pretrained tokenizer that was used for training.

pretrained_model_name="beomi/kcbert-base"
tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name
)

Here is some example data. I made up the sentences myself.

test_data = pd.DataFrame({"texts":["신선하다.",
                                   "야이 XX야",
                                   "행복한 일만 생기길 바랍니다.",
                                   "까불지 마라",
                                   "꺼져라",
                                   "함께해요"]})

test_data

	texts
0	신선하다.
1	야이 XX야
2	행복한 일만 생기길 바랍니다.
3	까불지 마라
4	꺼져라
5	함께해요

tokenized_test_sentences = tokenizer(
    list(test_data.texts),
    return_tensors="pt",
    padding=True,
    truncation=True,
)

tokenized_test_sentences

{'input_ids': tensor([[    2, 23645,  8013,    17,     3,     0,     0,     0],
        [    2, 12047, 27778,  4144,     3,     0,     0,     0],
        [    2, 19165, 14620, 10173,  4583,  9306,    17,     3],
        [    2, 14695,  4102,  8879,     3,     0,     0,     0],
        [    2, 10809,     3,     0,     0,     0,     0,     0],
        [    2,  9158,  8929,     3,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 0, 0, 0, 0]])}
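
To make the output above easier to read, here is a minimal plain-Python sketch of what `padding=True` does to a batch: each sequence is padded to the length of the longest one, and `attention_mask` marks real tokens with 1 and padding with 0. The token ids are copied from three rows of the output. `PAD_ID = 0` is an assumption inferred from the zeros there, and the variable names are illustrative rather than part of the tokenizer API.

```python
# Token id lists without padding, copied from three rows of the output above.
sequences = [
    [2, 23645, 8013, 17, 3],
    [2, 19165, 14620, 10173, 4583, 9306, 17, 3],
    [2, 10809, 3],
]

PAD_ID = 0  # assumption: padding id inferred from the zeros in the output above

# Pad every sequence to the length of the longest one in the batch.
max_len = max(len(seq) for seq in sequences)
input_ids = [seq + [PAD_ID] * (max_len - len(seq)) for seq in sequences]

# 1 marks a real token, 0 marks a padding position the model should ignore.
attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]

for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

The real tokenizer does this internally; the sketch only mirrors the shape of its result.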

This part is important: predict does not need labels, so I took the data-loading class from the train step, removed the label handling, and wrote a new one.

class DataloaderDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        return item

    def __len__(self):
        return len(self.encodings['input_ids'])

pred_dataset = DataloaderDataset(tokenized_test_sentences)

len(pred_dataset)
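
Because the tokenizer was called with `return_tensors="pt"`, every value in `encodings` is already a tensor, and wrapping it again with `torch.tensor(...)` makes PyTorch emit a UserWarning about copy construction. A possible variant, sketched here with a hypothetical class name `PredictDataset` and toy encodings shortened from the tokenizer output, uses `clone().detach()` as that warning recommends.

```python
import torch


class PredictDataset(torch.utils.data.Dataset):
    """Label-free dataset for prediction (hypothetical name, illustration only)."""

    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        # The values are already tensors, so copy them instead of re-wrapping.
        return {key: val[idx].clone().detach() for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings["input_ids"])


# Toy encodings (assumption: shortened rows, not the full tokenizer output).
encodings = {
    "input_ids": torch.tensor([[2, 10809, 3, 0], [2, 9158, 8929, 3]]),
    "attention_mask": torch.tensor([[1, 1, 1, 0], [1, 1, 1, 1]]),
}

dataset = PredictDataset(encodings)
print(len(dataset), dataset[0])
```

The behavior is the same as the class above; only the copy is made explicit.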

Load the model that was fine-tuned earlier. I saved it at the end of training so that it could be loaded like this.

model_loaded = AutoModelForSequenceClassification.from_pretrained("trained_model_hate")

trainer = Trainer(model = model_loaded)

The code below runs prediction with the trainer.

pred_results = trainer.predict(pred_dataset)

***** Running Prediction *****
  Num examples = 6
  Batch size = 8
C:\Users\jun\AppData\Local\Temp\ipykernel_18388\3071759006.py:6: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}


print(pred_results)

PredictionOutput(predictions=array([[-3.9400835 ,  2.6692195 , -2.795601  ],
       [ 1.0741174 , -3.2787116 , -1.4047624 ],
       [-3.7625163 ,  2.8469667 , -2.8889055 ],
       [-1.0301749 , -1.3925614 ,  0.11520092],
       [ 1.1580232 , -3.2189953 , -1.537382  ],
       [-3.730987  ,  2.8836164 , -2.843723  ]], dtype=float32), label_ids=None, metrics={'test_runtime': 2.4029, 'test_samples_per_second': 2.497, 'test_steps_per_second': 0.416})

pred_results.predictions

array([[-3.9400835 ,  2.6692195 , -2.795601  ],
       [ 1.0741174 , -3.2787116 , -1.4047624 ],
       [-3.7625163 ,  2.8469667 , -2.8889055 ],
       [-1.0301749 , -1.3925614 ,  0.11520092],
       [ 1.1580232 , -3.2189953 , -1.537382  ],
       [-3.730987  ,  2.8836164 , -2.843723  ]], dtype=float32)
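
The values in `predictions` are raw logits, one column per class, not probabilities. If probabilities are wanted, a softmax over the last axis is one common option. The sketch below uses plain NumPy on the array printed above; the helper name `softmax` is my own, not something the Trainer provides.

```python
import numpy as np

# Logits copied from pred_results.predictions above.
logits = np.array(
    [
        [-3.9400835, 2.6692195, -2.795601],
        [1.0741174, -3.2787116, -1.4047624],
        [-3.7625163, 2.8469667, -2.8889055],
        [-1.0301749, -1.3925614, 0.11520092],
        [1.1580232, -3.2189953, -1.537382],
        [-3.730987, 2.8836164, -2.843723],
    ],
    dtype=np.float32,
)


def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


probs = softmax(logits)
print(probs.round(3))
```

Softmax does not change which class has the largest value, so the predicted classes stay the same.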

Use np.argmax to get, for each row, the index of the largest predicted value.

predictions = np.argmax(pred_results.predictions, axis=-1)

test_data["labels"]=predictions

Here the original data is shown together with the predicted values. The label encoding is hate 0:[1,0,0], none 1:[0,1,0], offensive 2:[0,0,1].

test_data["labels"]=test_data["labels"].replace({1:"none",0:"hate",2:"offensive"})

test_data

	texts	labels
0	신선하다.	none
1	야이 XX야	hate
2	행복한 일만 생기길 바랍니다.	none
3	까불지 마라	offensive
4	꺼져라	hate
5	함께해요	none
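
The label encoding listed above (hate 0:[1,0,0], none 1:[0,1,0], offensive 2:[0,0,1]) is a one-hot scheme: the class index is the position of the 1. A small NumPy sketch of that relationship follows. `id2label` and the other variable names are illustrative, and the predicted indices are the argmax results for the six sentences.

```python
import numpy as np

# Class index to label name, as listed in the text above.
id2label = {0: "hate", 1: "none", 2: "offensive"}

# Argmax results for the six example sentences.
predicted_ids = np.array([1, 0, 1, 2, 0, 1])

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(len(id2label), dtype=int)[predicted_ids]

# argmax turns each one-hot vector back into its class index.
recovered_ids = one_hot.argmax(axis=-1)
labels = [id2label[int(i)] for i in recovered_ids]

print(one_hot)
print(labels)
```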

SW정리

Sunday, March 19, 2023

BERT MultiClass predict example
