Monday, February 3, 2020

Analyzing the doc2vec example (2)


https://swlock.blogspot.com/2020/01/doc2vec.html

In the previous post I left two follow-up tasks.
This article covers the first of them.

In the last post I mentioned that the sample code seemed to return different values on every run, and that it didn't seem to produce the expected results either. To give the conclusion first: the number of training epochs is simply too small.
It's fair to say the author of the example code got this part wrong. So let's raise the count to 1000 and 2000 and test with the same example.

Removing warnings

Before starting, the existing code emits a lot of warnings, so we need to clean those up first.

1. Removing iter

The following warning occurs:
doc2vectut.py:37: DeprecationWarning: Call to deprecated `iter` (Attribute will be removed in 4.0.0, use self.epochs instead).
  epochs=model.iter)

To avoid common mistakes around the model’s ability to do multiple training passes itself, an explicit epochs argument MUST be provided. In the common and recommended case where train() is only called once, you can set epochs=self.iter.

Even though the message suggests epochs=self.iter, the iter attribute itself is deprecated and should not be used. Passing epochs=model.epochs instead makes the warning disappear. (Both fixes are shown together in a short snippet after the next item.)

2. Removing size

C:\Users\USER\AppData\Local\Programs\Python\Python37\lib\site-packages\gensim\models\doc2vec.py:574: UserWarning: The parameter `size` is deprecated, will be removed in 4.0.0, use `vector_size` instead.
  warnings.warn("The parameter `size` is deprecated, will be removed in 4.0.0, use `vector_size` instead.")
iteration 0
This is fixed by changing size=vec_size to vector_size=vec_size in the Doc2Vec constructor.
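
Putting both fixes together, the relevant calls end up looking like this. This is a minimal, self-contained sketch (gensim 3.x and a toy one-document corpus assumed), just to show the two renamed parameters in isolation:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["a", "toy", "document"], tags=["0"])]

# vector_size replaces the deprecated size parameter
model = Doc2Vec(vector_size=20, min_count=1)
model.build_vocab(docs)

# model.epochs replaces the deprecated model.iter attribute
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)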

Modified example

Modified so that it runs with epoch counts of 100, 1000, and 2000:
#Import all the dependencies
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

data = ["I love machine learning. Its awesome.",
        "I love coding in python",
        "I love building chatbots",
        "they chat amagingly well"]

# tag each document with its index so it can be looked up by tag later
tagged_data = []
for i, _d in enumerate(data):
    tagged_data.append(TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]))

print(tagged_data)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

epochs = [100, 1000, 2000]

for max_epochs in epochs:
    vec_size = 20
    alpha = 0.025

    # vector_size replaces the deprecated size parameter
    model = Doc2Vec(vector_size=vec_size,
                    alpha=alpha,
                    min_alpha=0.00025,
                    min_count=1,
                    dm=1)

    model.build_vocab(tagged_data)
    print("epoch count : ", max_epochs)

    for epoch in range(max_epochs):
        #print('iteration {0}'.format(epoch))
        # model.epochs replaces the deprecated model.iter attribute;
        # note that each train() call itself runs model.epochs passes
        model.train(tagged_data,
                    total_examples=model.corpus_count,
                    epochs=model.epochs)
        # decrease the learning rate
        model.alpha -= 0.0002
        # fix the learning rate, no decay
        model.min_alpha = model.alpha

    model.save("d2v.model")
    print("Model Saved")

    #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    model = Doc2Vec.load("d2v.model")

    # to find the vector of a document which is not in the training data
    test_data = word_tokenize("I love chatbots".lower())
    #v1 = model.infer_vector(test_data)
    #print("V1_infer", v1)

    # to find the most similar docs using tags
    similar_doc = model.docvecs.most_similar('1')
    print(similar_doc)
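
One caveat about this tutorial-style loop: each train() call itself performs model.epochs passes (5 by default in gensim 3.x), so an "epoch count" of 1000 here really means several thousand passes over the data, and the manual alpha decay drives the learning rate below zero for the larger counts. As the warning text above notes, the recommended usage is a single train() call. A sketch of that pattern, reusing the tagged_data, vec_size, alpha, and max_epochs variables from the script above:

# pass the desired epoch count once, at construction time
model = Doc2Vec(vector_size=vec_size,
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1,
                epochs=max_epochs)
model.build_vocab(tagged_data)

# a single train() call handles all passes and the alpha decay internally
model.train(tagged_data,
            total_examples=model.corpus_count,
            epochs=model.epochs)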

Results from several runs

C:\Users\USER\Documents\python\doc2vec>python 23.py
[TaggedDocument(words=['i', 'love', 'machine', 'learning', '.', 'its', 'awesome', '.'], tags=['0']), TaggedDocument(words=['i', 'love', 'coding', 'in', 'python'], tags=['1']), TaggedDocument(words=['i', 'love', 'building', 'chatbots'], tags=['2']), TaggedDocument(words=['they', 'chat', 'amagingly', 'well'], tags=['3'])]
epoch count :  100
Model Saved
[('0', 0.9945449829101562), ('2', 0.9924600124359131), ('3', 0.9910708665847778)]
epoch count :  1000
Model Saved
[('0', 0.9999822378158569), ('2', 0.9999727010726929), ('3', 0.9998421669006348)]
epoch count :  2000
Model Saved
[('0', 0.9999822378158569), ('2', 0.9999727010726929), ('3', 0.9998533725738525)]

C:\Users\USER\Documents\python\doc2vec>python 23.py
[TaggedDocument(words=['i', 'love', 'machine', 'learning', '.', 'its', 'awesome', '.'], tags=['0']), TaggedDocument(words=['i', 'love', 'coding', 'in', 'python'], tags=['1']), TaggedDocument(words=['i', 'love', 'building', 'chatbots'], tags=['2']), TaggedDocument(words=['they', 'chat', 'amagingly', 'well'], tags=['3'])]
epoch count :  100
Model Saved
[('0', 0.9962625503540039), ('3', 0.9957269430160522), ('2', 0.9942873120307922)]
epoch count :  1000
Model Saved
[('0', 0.9999853372573853), ('2', 0.9999737739562988), ('3', 0.999885082244873)]
epoch count :  2000
Model Saved
[('0', 0.9999853372573853), ('2', 0.9999737739562988), ('3', 0.9998925924301147)]

C:\Users\USER\Documents\python\doc2vec>python 23.py
[TaggedDocument(words=['i', 'love', 'machine', 'learning', '.', 'its', 'awesome', '.'], tags=['0']), TaggedDocument(words=['i', 'love', 'coding', 'in', 'python'], tags=['1']), TaggedDocument(words=['i', 'love', 'building', 'chatbots'], tags=['2']), TaggedDocument(words=['they', 'chat', 'amagingly', 'well'], tags=['3'])]
epoch count :  100
Model Saved
[('0', 0.9967427849769592), ('2', 0.9945513606071472), ('3', 0.9939110279083252)]
epoch count :  1000
Model Saved
[('0', 0.999987781047821), ('2', 0.9999759793281555), ('3', 0.9998540878295898)]
epoch count :  2000
Model Saved
[('0', 0.999987781047821), ('2', 0.9999759793281555), ('3', 0.9998698234558105)]

C:\Users\USER\Documents\python\doc2vec>python 23.py
[TaggedDocument(words=['i', 'love', 'machine', 'learning', '.', 'its', 'awesome', '.'], tags=['0']), TaggedDocument(words=['i', 'love', 'coding', 'in', 'python'], tags=['1']), TaggedDocument(words=['i', 'love', 'building', 'chatbots'], tags=['2']), TaggedDocument(words=['they', 'chat', 'amagingly', 'well'], tags=['3'])]
epoch count :  100
Model Saved
[('0', 0.9953392744064331), ('2', 0.9942353367805481), ('3', 0.9917100667953491)]
epoch count :  1000
Model Saved
[('0', 0.999980628490448), ('2', 0.9999741315841675), ('3', 0.9998222589492798)]
epoch count :  2000
Model Saved
[('0', 0.9999832510948181), ('2', 0.9999783635139465), ('3', 0.999845027923584)]

C:\Users\USER\Documents\python\doc2vec>python 23.py
[TaggedDocument(words=['i', 'love', 'machine', 'learning', '.', 'its', 'awesome', '.'], tags=['0']), TaggedDocument(words=['i', 'love', 'coding', 'in', 'python'], tags=['1']), TaggedDocument(words=['i', 'love', 'building', 'chatbots'], tags=['2']), TaggedDocument(words=['they', 'chat', 'amagingly', 'well'], tags=['3'])]
epoch count :  100
Model Saved
[('0', 0.9950520992279053), ('3', 0.992672324180603), ('2', 0.9917551875114441)]
epoch count :  1000
Model Saved
[('0', 0.9999843835830688), ('2', 0.9999716877937317), ('3', 0.9998539686203003)]
epoch count :  2000
Model Saved
[('0', 0.9999843835830688), ('2', 0.9999716877937317), ('3', 0.9998610019683838)]

C:\Users\USER\Documents\python\doc2vec>python 23.py
[TaggedDocument(words=['i', 'love', 'machine', 'learning', '.', 'its', 'awesome', '.'], tags=['0']), TaggedDocument(words=['i', 'love', 'coding', 'in', 'python'], tags=['1']), TaggedDocument(words=['i', 'love', 'building', 'chatbots'], tags=['2']), TaggedDocument(words=['they', 'chat', 'amagingly', 'well'], tags=['3'])]
epoch count :  100
Model Saved
[('2', 0.9950048327445984), ('0', 0.9948043823242188), ('3', 0.9913235306739807)]
epoch count :  1000
Model Saved
[('0', 0.9999803304672241), ('2', 0.9999801516532898), ('3', 0.9998198747634888)]
epoch count :  2000
Model Saved
[('0', 0.9999803304672241), ('2', 0.9999801516532898), ('3', 0.9998342990875244)]

C:\Users\USER\Documents\python\doc2vec>python 23.py
[TaggedDocument(words=['i', 'love', 'machine', 'learning', '.', 'its', 'awesome', '.'], tags=['0']), TaggedDocument(words=['i', 'love', 'coding', 'in', 'python'], tags=['1']), TaggedDocument(words=['i', 'love', 'building', 'chatbots'], tags=['2']), TaggedDocument(words=['they', 'chat', 'amagingly', 'well'], tags=['3'])]
epoch count :  100
Model Saved
[('0', 0.9962822198867798), ('3', 0.994450569152832), ('2', 0.9934964179992676)]
epoch count :  1000
Model Saved
[('0', 0.9999848008155823), ('2', 0.999968409538269), ('3', 0.9998449087142944)]
epoch count :  2000
Model Saved
[('0', 0.9999848008155823), ('2', 0.999968409538269), ('3', 0.9998505115509033)]

Explanation of the results

Looking closely at the results above, at an epoch count of 100 the top-ranked tag is sometimes '0' and sometimes '2', and the similarity values also differ considerably from those of the 1000 and 2000 runs, while within each run the 1000 and 2000 results are nearly identical.
The conclusion is that as the epoch count increases, the run-to-run variation shrinks. Strictly speaking, the right way to verify this would be to check an error rate, but the error value is not easy to get at here.

Referring to the error_rate_for_model part of the article below, it should also be possible to implement an error measure for the training:
https://radimrehurek.com/gensim/auto_examples/howtos/run_doc2vec_imdb.html
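
Short of porting that, a lighter sanity check from the gensim tutorials is to re-infer a vector for each training document and verify that the model ranks that document as its own nearest neighbor. A minimal sketch, assuming the d2v.model file saved by the script above and the gensim 3.x API:

from gensim.models.doc2vec import Doc2Vec
from nltk.tokenize import word_tokenize

model = Doc2Vec.load("d2v.model")

data = ["I love machine learning. Its awesome.",
        "I love coding in python",
        "I love building chatbots",
        "they chat amagingly well"]

hits = 0
for i, doc in enumerate(data):
    inferred = model.infer_vector(word_tokenize(doc.lower()))
    # rank every stored docvec against the re-inferred vector
    sims = model.docvecs.most_similar([inferred], topn=len(model.docvecs))
    if sims[0][0] == str(i):  # the top hit should be the document itself
        hits += 1

print("documents that are their own nearest neighbor:", hits, "/", len(data))

With a well-trained model, all four documents should come back as their own nearest neighbor on most runs.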







