import torch
from Korpora import Korpora
import pandas as pd
Load the NSMC data that we will fine-tune on.
NSMC = Korpora.load('nsmc')
Korpora only provides the functionality to easily download and use corpora that others have shared for research purposes.
We thank everyone who shared their corpora, and pass along each corpus's description and license.
If you want to know more about this corpus, refer to the description below;
when using it for research or commercial purposes, please check the license below.
# Description
Author : e9t@github
Repository : https://github.com/e9t/nsmc
References : www.lucypark.kr/docs/2015-pyconkr/#39
Naver sentiment movie corpus v1.0
This is a movie review dataset in the Korean language.
Reviews were scraped from Naver Movies.
The dataset construction is based on the method noted in
[Large movie review dataset][^1] from Maas et al., 2011.
[^1]: http://ai.stanford.edu/~amaas/data/sentiment/
# License
CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
Details in https://creativecommons.org/publicdomain/zero/1.0/
[Korpora] Corpus `nsmc` is already installed at C:\Users\jun\Korpora\nsmc\ratings_train.txt
[Korpora] Corpus `nsmc` is already installed at C:\Users\jun\Korpora\nsmc\ratings_test.txt
Put the corpus into pandas DataFrames.
train_data = pd.DataFrame({"texts":NSMC.train.texts, "labels":NSMC.train.labels})
test_data = pd.DataFrame({"texts":NSMC.test.texts, "labels":NSMC.test.labels})
train_data
|  | texts | labels |
|---|---|---|
| 0 | 아 더빙.. 진짜 짜증나네요 목소리 | 0 |
| 1 | 흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나 | 1 |
| 2 | 너무재밓었다그래서보는것을추천한다 | 0 |
| 3 | 교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정 | 0 |
| 4 | 사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ... | 1 |
| ... | ... | ... |
| 149995 | 인간이 문제지.. 소는 뭔죄인가.. | 0 |
| 149996 | 평점이 너무 낮아서... | 1 |
| 149997 | 이게 뭐요? 한국인은 거들먹거리고 필리핀 혼혈은 착하다? | 0 |
| 149998 | 청춘 영화의 최고봉.방황과 우울했던 날들의 자화상 | 1 |
| 149999 | 한국 영화 최초로 수간하는 내용이 담긴 영화 | 0 |
150000 rows × 2 columns
test_data
|  | texts | labels |
|---|---|---|
| 0 | 굳 ㅋ | 1 |
| 1 | GDNTOPCLASSINTHECLUB | 0 |
| 2 | 뭐야 이 평점들은.... 나쁘진 않지만 10점 짜리는 더더욱 아니잖아 | 0 |
| 3 | 지루하지는 않은데 완전 막장임... 돈주고 보기에는.... | 0 |
| 4 | 3D만 아니었어도 별 다섯 개 줬을텐데.. 왜 3D로 나와서 제 심기를 불편하게 하죠?? | 0 |
| ... | ... | ... |
| 49995 | 오랜만에 평점 로긴했네ㅋㅋ 킹왕짱 쌈뽕한 영화를 만났습니다 강렬하게 육쾌함 | 1 |
| 49996 | 의지 박약들이나 하는거다 탈영은 일단 주인공 김대희 닮았고 이등병 찐따 OOOO | 0 |
| 49997 | 그림도 좋고 완성도도 높았지만... 보는 내내 불안하게 만든다 | 0 |
| 49998 | 절대 봐서는 안 될 영화.. 재미도 없고 기분만 잡치고.. 한 세트장에서 다 해먹네 | 0 |
| 49999 | 마무리는 또 왜이래 | 0 |
50000 rows × 2 columns
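As a quick sanity check, you can confirm that the 0/1 labels are reasonably balanced in both splits. This was not part of the original run; value_counts() simply tallies each label:
# Tally the sentiment labels (0 = negative, 1 = positive) in each split.
print(train_data['labels'].value_counts())
print(test_data['labels'].value_counts())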
max(len(l) for l in train_data['texts'])
158
max(len(l) for l in test_data['texts'])
152
The full train/test sets are large enough that training takes a long time, so since this code is only a sample we shrink both to a tenth of their size before continuing.
train_data = train_data.head(int(len(train_data)/10))
test_data = test_data.head(int(len(test_data)/10))
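Note that head() keeps only the first 10% of rows. If you would rather draw a random 10% subsample, a sketch of a replacement for the two lines above (the random_state value is arbitrary) would be:
# Replacement for the head() calls above: draw a random 10% of the full DataFrames.
train_data = train_data.sample(frac=0.1, random_state=3).reset_index(drop=True)
test_data = test_data.sample(frac=0.1, random_state=3).reset_index(drop=True)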
Fetch the pre-trained BERT model that will be fine-tuned and tokenize the texts with its tokenizer.
pretrained_model_name="beomi/kcbert-base"
from transformers import AutoTokenizer
# If a warning appears, install ipywidgets with: !pip install ipywidgets
tokenizer = AutoTokenizer.from_pretrained(
pretrained_model_name
)
tokenized_train_sentences = tokenizer(
list(train_data.texts),
return_tensors="pt",
padding=True,
truncation=True,
)
tokenized_test_sentences = tokenizer(
list(test_data.texts),
return_tensors="pt",
padding=True,
truncation=True,
)
Inspect the tokenizer output.
print(tokenized_train_sentences.keys())
print(tokenized_train_sentences['input_ids'])
print(tokenized_train_sentences['attention_mask'])
print(tokenized_train_sentences['token_type_ids'])
dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
tensor([[ 2, 2170, 832, ..., 0, 0, 0],
[ 2, 3521, 17, ..., 0, 0, 0],
[ 2, 8069, 4089, ..., 0, 0, 0],
...,
[ 2, 43, 17697, ..., 0, 0, 0],
[ 2, 2477, 4116, ..., 0, 0, 0],
[ 2, 2170, 4565, ..., 0, 0, 0]])
tensor([[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
...,
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0]])
tensor([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]])
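To make sense of these ids, you can map one example back to tokens: id 2 at the front is the [CLS] token and the trailing 0s are [PAD] in the kcbert-base vocabulary. A quick check that was not part of the original run:
# Convert the first training example back to tokens to see [CLS], the subwords, and the [PAD] padding.
print(tokenizer.convert_ids_to_tokens(tokenized_train_sentences['input_ids'][0].tolist()))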
train_label = train_data['labels'].values
test_label = test_data['labels'].values
Prepare a Dataset for the data loader; it is needed so that individual examples can be looked up while batches are assembled.
class DataloaderDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings  # tokenizer output: input_ids, token_type_ids, attention_mask
        self.labels = labels        # 0/1 sentiment labels
    def __getitem__(self, idx):
        # Return one example as a dict of tensors; the Trainer expects the label under 'labels'.
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)
train_dataset = DataloaderDataset(tokenized_train_sentences, train_label)
test_dataset = DataloaderDataset(tokenized_test_sentences, test_label)
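Because the tokenizer was called with return_tensors="pt", self.encodings already contains tensors, so torch.tensor(val[idx]) in __getitem__ re-wraps a tensor; that is what triggers the UserWarning you will see in the training log below. A variant that avoids the warning (the class name DataloaderDatasetNoCopy is only for illustration, and the original class above works fine as well):
class DataloaderDatasetNoCopy(torch.utils.data.Dataset):
    # Same as DataloaderDataset, but indexes the pre-built tensors directly instead of re-wrapping them.
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}  # already torch tensors
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)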
from transformers import BertConfig, AutoModelForSequenceClassification, Trainer, TrainingArguments
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(device)
cuda:0
pretrained_model_config = BertConfig.from_pretrained(
pretrained_model_name,
)
model = AutoModelForSequenceClassification.from_pretrained(
pretrained_model_name,
config=pretrained_model_config,
)
Some weights of the model checkpoint at beomi/kcbert-base were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at beomi/kcbert-base and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
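The warning says the classification head (classifier.weight, classifier.bias) is newly initialized; its output size comes from config.num_labels, which defaults to 2, so binary sentiment classification works here without extra settings. To state that intent explicitly, or to adapt this to a task with more classes, you could pass num_labels yourself (a sketch that reproduces the default, not what this notebook ran):
# Explicitly request a 2-class head (equivalent to the default used above).
pretrained_model_config = BertConfig.from_pretrained(pretrained_model_name, num_labels=2)
model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name, config=pretrained_model_config)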
pretrained_model_config
BertConfig {
"_name_or_path": "beomi/kcbert-base",
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"classifier_dropout": null,
"directionality": "bidi",
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 300,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"pooler_fc_size": 768,
"pooler_num_attention_heads": 12,
"pooler_num_fc_layers": 3,
"pooler_size_per_head": 128,
"pooler_type": "first_token_transform",
"position_embedding_type": "absolute",
"transformers_version": "4.10.0",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 30000
}
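One value worth noticing is "max_position_embeddings": 300; kcbert-base cannot handle longer sequences, which is why truncation=True was passed to the tokenizer. A quick check that the padded batches fit (not part of the original run):
# The padded sequence length must not exceed the model's position embeddings (300 for kcbert-base).
print(tokenized_train_sentences['input_ids'].shape[1], pretrained_model_config.max_position_embeddings)
assert tokenized_train_sentences['input_ids'].shape[1] <= pretrained_model_config.max_position_embeddings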
#!pip install evaluate
#!pip install scikit-learn
import numpy as np
import evaluate
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
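The Trainer calls compute_metrics after each evaluation pass with the stacked logits and the reference labels. A tiny hand-made call just to illustrate the input and output shapes (dummy values, not real model outputs):
# Two dummy examples, two classes each; argmax gives predictions [1, 0], matching the references.
example_logits = np.array([[0.1, 0.9], [2.0, -1.0]])
example_labels = np.array([1, 0])
print(compute_metrics((example_logits, example_labels)))  # {'accuracy': 1.0}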
training_args = TrainingArguments(
output_dir='./results', # output directory
num_train_epochs=1, # total number of training epochs
#per_device_train_batch_size=32, # batch size per device during training
#per_device_eval_batch_size=64, # batch size for evaluation
per_device_train_batch_size=16, # batch size per device during training
per_device_eval_batch_size=16, # batch size for evaluation
warmup_steps=500, # number of warmup steps for learning rate scheduler
weight_decay=0.01, # strength of weight decay
logging_dir='./logs', # directory for storing logs
logging_steps=100,
save_steps=200,
save_total_limit=2,
save_on_each_node=True,
do_train=True, # Perform training
do_eval=True, # Perform evaluation
evaluation_strategy="epoch",
seed=3
)
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
Many more arguments are available; see the Hugging Face documentation for `TrainingArguments` and `Trainer`.
trainer = Trainer(
model=model, # the instantiated 🤗 Transformers model to be trained
args=training_args, # training arguments, defined above
train_dataset=train_dataset, # training dataset
eval_dataset=test_dataset,
compute_metrics=compute_metrics,
)
trainer.train()
***** Running training *****
Num examples = 15000
Num Epochs = 1
Instantaneous batch size per device = 16
Total train batch size (w. parallel, distributed & accumulation) = 16
Gradient Accumulation steps = 1
Total optimization steps = 938
C:\Users\jun\AppData\Local\Temp\ipykernel_27736\1263192275.py:7: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
[938/938 04:42, Epoch 1/1]
| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.344700 | 0.314100 | 0.867800 |
Saving model checkpoint to ./results\checkpoint-200
Configuration saved in ./results\checkpoint-200\config.json
Model weights saved in ./results\checkpoint-200\pytorch_model.bin
C:\Users\jun\AppData\Local\Temp\ipykernel_27736\1263192275.py:7: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
Saving model checkpoint to ./results\checkpoint-400
Configuration saved in ./results\checkpoint-400\config.json
Model weights saved in ./results\checkpoint-400\pytorch_model.bin
Deleting older checkpoint [results\checkpoint-500] due to args.save_total_limit
C:\Users\jun\AppData\Local\Temp\ipykernel_27736\1263192275.py:7: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
Saving model checkpoint to ./results\checkpoint-600
Configuration saved in ./results\checkpoint-600\config.json
Model weights saved in ./results\checkpoint-600\pytorch_model.bin
Deleting older checkpoint [results\checkpoint-200] due to args.save_total_limit
C:\Users\jun\AppData\Local\Temp\ipykernel_27736\1263192275.py:7: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
Saving model checkpoint to ./results\checkpoint-800
Configuration saved in ./results\checkpoint-800\config.json
Model weights saved in ./results\checkpoint-800\pytorch_model.bin
Deleting older checkpoint [results\checkpoint-400] due to args.save_total_limit
C:\Users\jun\AppData\Local\Temp\ipykernel_27736\1263192275.py:7: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
***** Running Evaluation *****
Num examples = 5000
Batch size = 16
Training completed. Do not forget to share your model on huggingface.co/models =)
TrainOutput(global_step=938, training_loss=0.3618158074075988, metrics={'train_runtime': 283.118, 'train_samples_per_second': 52.981, 'train_steps_per_second': 3.313, 'total_flos': 824791491900000.0, 'train_loss': 0.3618158074075988, 'epoch': 1.0})
trainer.save_model("trained_model")
Saving model checkpoint to trained_model
Configuration saved in trained_model\config.json
Model weights saved in trained_model\pytorch_model.bin
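With the weights saved, the fine-tuned model can be reloaded for inference. A minimal sketch, assuming the same session (otherwise replace pretrained_model_name with "beomi/kcbert-base"); the example sentence is made up, and the labels follow the NSMC convention of 0 = negative, 1 = positive:
# Reload the fine-tuned weights and classify a new review.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

loaded_model = AutoModelForSequenceClassification.from_pretrained("trained_model")  # directory saved above
loaded_tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)             # tokenizer was not saved with the model

inputs = loaded_tokenizer("이 영화 정말 재미있었어요", return_tensors="pt", truncation=True)  # "This movie was really fun"
with torch.no_grad():
    logits = loaded_model(**inputs).logits
print(logits.argmax(dim=-1).item())  # 0 = negative, 1 = positive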