Previous post
https://swlock.blogspot.com/2020/01/doc2vec.html

Training method

The topic this time is the training method. If you look through doc2vec examples, the most common one you will find looks like the example below.
The example used before
model = Doc2Vec(size=vec_size, alpha=alpha, min_alpha=0.00025, min_count=1, dm=1)
model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha
So this time, let's change the training code used before into a form that trains in a single call, and compare the efficiency of the two approaches.
The only change needed is to pass tagged_data directly to the Doc2Vec constructor and add epochs=max_epochs, as shown below; the build_vocab/train loop is then no longer necessary.
Changed code
model = Doc2Vec(tagged_data, size=vec_size, alpha=alpha, min_alpha=0.00025, min_count=1, dm=1, epochs=max_epochs)
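For reference, newer gensim versions warn that size is deprecated (that warning also appears in the results further down). A rough gensim 4.x-style equivalent of this one-shot call, shown here only as a sketch with the renamed parameters and not part of the original example, would be:

# Sketch: the same one-shot training written with gensim 4.x parameter names (assumption)
model = Doc2Vec(tagged_data,
                vector_size=vec_size,  # 'size' was renamed to 'vector_size'
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1,
                epochs=max_epochs)     # 'iter' was renamed to 'epochs'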
So how does the efficiency compare?
In my tests, this version gives better results and runs faster than the original code.
Thinking about the reason: in the original code, train() is called with epochs=model.iter, and if you actually print it, model.iter is 5. That means looping 100 times actually trains 100*5 times. I don't know why so many examples are written this way, but unless you need to call train() again later, I recommend setting epochs from the start.
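As a quick check, here is a minimal sketch (my own addition, assuming gensim 3.x as in the output below, where the default internal epoch count is 5) that prints the hidden multiplier:

# Sketch: show the hidden multiplier in the loop-style example (gensim 3.x assumed)
from gensim.models.doc2vec import Doc2Vec

m = Doc2Vec(size=20, min_count=1)  # model only, no corpus yet
print(m.iter)                      # prints 5: the default number of passes per train() call
# So the loop example calls model.train(..., epochs=model.iter) 100 times,
# which is 100 * 5 = 500 passes over the corpus in total.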
The conclusion from testing: with epochs at 100 the values still fluctuate because the count is too small, but with epochs at 1000 the same result comes out every time, and it is still faster than the original example.
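If you want to reproduce the speed comparison yourself, here is a minimal timing sketch (my own addition, not from the original post) using time.perf_counter; it assumes tagged_data, vec_size, alpha, and max_epochs are defined as in the example source below:

import time
from gensim.models.doc2vec import Doc2Vec

# One-shot training: corpus passed to the constructor, epochs set once
start = time.perf_counter()
m1 = Doc2Vec(tagged_data, size=vec_size, alpha=alpha, min_alpha=0.00025,
             min_count=1, dm=1, epochs=max_epochs)
print("one-shot training:", time.perf_counter() - start, "seconds")

# Loop-style training: replicates the original example, including its alpha decay
# (note: with a large max_epochs the "alpha -= 0.0002" decay drives alpha below zero,
#  which is another quirk of the loop version)
start = time.perf_counter()
m2 = Doc2Vec(size=vec_size, alpha=alpha, min_alpha=0.00025, min_count=1, dm=1)
m2.build_vocab(tagged_data)
for epoch in range(max_epochs):
    m2.train(tagged_data, total_examples=m2.corpus_count, epochs=m2.iter)
    m2.alpha -= 0.0002
    m2.min_alpha = m2.alpha
print("loop training:", time.perf_counter() - start, "seconds")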
Example source
#Import all the dependencies
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
data = ["I love machine learning. Its awesome.",
        "I love coding in python",
        "I love building chatbots",
        "they chat amagingly well"]

tagged_data = []
for i, _d in enumerate(data):
    tagged_data.append(TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]))
print(tagged_data)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
max_epochs = 1000
vec_size = 20
alpha = 0.025

model = Doc2Vec(tagged_data, size=vec_size, alpha=alpha, min_alpha=0.00025,
                min_count=1, dm=1, epochs=max_epochs)
"""
model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha
"""
model.save("d2v.model")
print("Model Saved")

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("d2v.model")
#to find the vector of a document which is not in training data
test_data = word_tokenize("I love chatbots".lower())
v1 = model.infer_vector(test_data)
print("V1_infer", v1)

# to find most similar doc using tags
similar_doc = model.docvecs.most_similar('1')
print(similar_doc)

# to find vector of doc in training data using tags or in other words, printing the vector of document at index 1 in training data
print(model.docvecs['1'])

print(model.docvecs.most_similar(positive=[model.infer_vector(test_data)], topn=5))
Result
C:\Users\USER\Documents\python\doc2vec>python 2.py
[TaggedDocument(words=['i', 'love', 'machine', 'learning', '.', 'its', 'awesome', '.'], tags=['0']), TaggedDocument(words=['i', 'love', 'coding', 'in', 'python'], tags=['1']), TaggedDocument(words=['i', 'love', 'building', 'chatbots'], tags=['2']), TaggedDocument(words=['they', 'chat', 'amagingly', 'well'], tags=['3'])]
C:\Users\USER\AppData\Local\Programs\Python\Python37\lib\site-packages\gensim\models\doc2vec.py:574: UserWarning: The parameter `size` is deprecated, will be removed in 4.0.0, use `vector_size` instead.
  warnings.warn("The parameter `size` is deprecated, will be removed in 4.0.0, use `vector_size` instead.")
Model Saved
V1_infer [-0.1478307 -0.12874664 0.37128845 -0.03459457 0.57580817 0.04165906 -0.161333 0.27196127 -0.21677573 0.01032528 0.25623006 0.5556566 0.12414648 -0.54218525 -0.208055 -0.52243584 0.3771771 0.5290485 -0.4843504 -0.17298844]
[('2', 0.9900819063186646), ('3', 0.9708542823791504), ('0', 0.970792293548584)]
[-0.0752231 -0.35064244 0.34328702 0.08152835 0.6884224 -0.03017065 -0.29200274 0.27554145 -0.22331765 0.09528319 0.25715432 0.72438854 0.03624368 -0.6178097 -0.2795767 -0.76473147 0.44413832 0.69012666 -0.66727465 -0.21889076]
[('2', 0.9923702478408813), ('1', 0.9780316352844238), ('3', 0.9494951963424683), ('0', 0.9215951561927795)]

C:\Users\USER\Documents\python\doc2vec>python 2.py
[TaggedDocument(words=['i', 'love', 'machine', 'learning', '.', 'its', 'awesome', '.'], tags=['0']), TaggedDocument(words=['i', 'love', 'coding', 'in', 'python'], tags=['1']), TaggedDocument(words=['i', 'love', 'building', 'chatbots'], tags=['2']), TaggedDocument(words=['they', 'chat', 'amagingly', 'well'], tags=['3'])]
C:\Users\USER\AppData\Local\Programs\Python\Python37\lib\site-packages\gensim\models\doc2vec.py:574: UserWarning: The parameter `size` is deprecated, will be removed in 4.0.0, use `vector_size` instead.
  warnings.warn("The parameter `size` is deprecated, will be removed in 4.0.0, use `vector_size` instead.")
Model Saved
V1_infer [ 0.11019158 0.51370174 -0.26823092 0.65347505 0.4890385 -0.22247577 -0.23286937 0.05573504 0.18244551 0.66818357 0.35685948 -0.4554677 -0.2638483 0.3127647 0.165362 0.10424155 0.04493263 -0.06063128 0.26513922 -0.1957828 ]
[('2', 0.9853891730308533), ('3', 0.9765698909759521), ('0', 0.9650870561599731)]
[ 0.01700659 0.5642358 -0.4336092 0.8316983 0.5487111 -0.33020684 -0.35978654 0.00089785 0.08480686 0.790529 0.3226167 -0.69981 -0.31057844 0.5498898 0.11522991 0.2883605 0.09612332 -0.07747563 0.44214472 -0.16630754]
[('2', 0.9960660338401794), ('1', 0.978064775466919), ('3', 0.961208164691925), ('0', 0.9451479315757751)]

C:\Users\USER\Documents\python\doc2vec>python 2.py
[TaggedDocument(words=['i', 'love', 'machine', 'learning', '.', 'its', 'awesome', '.'], tags=['0']), TaggedDocument(words=['i', 'love', 'coding', 'in', 'python'], tags=['1']), TaggedDocument(words=['i', 'love', 'building', 'chatbots'], tags=['2']), TaggedDocument(words=['they', 'chat', 'amagingly', 'well'], tags=['3'])]
C:\Users\USER\AppData\Local\Programs\Python\Python37\lib\site-packages\gensim\models\doc2vec.py:574: UserWarning: The parameter `size` is deprecated, will be removed in 4.0.0, use `vector_size` instead.
  warnings.warn("The parameter `size` is deprecated, will be removed in 4.0.0, use `vector_size` instead.")
Model Saved
V1_infer [-0.12817842 0.15201263 -0.35161802 0.06387527 0.09849352 0.4477839 0.05868901 -0.7434576 -0.41151583 0.22117768 -0.19870114 0.45456207 0.29246542 0.27406123 -0.4315686 0.37656972 -0.5473998 0.05305056 0.2825684 0.16648887]
[('2', 0.9858678579330444), ('0', 0.9623578786849976), ('3', 0.9542171955108643)]
[-0.20036705 0.13212322 -0.47465435 0.12010227 0.23122686 0.504441 -0.01674547 -1.1015012 -0.4765742 0.24642535 -0.39708486 0.5127476 0.3206394 0.3630215 -0.4660666 0.30893013 -0.6207208 0.03018731 0.28201237 0.3812475 ]
[('2', 0.9939821362495422), ('1', 0.979041337966919), ('3', 0.9386898875236511), ('0', 0.9325041770935059)]
With epochs at 1000, multiple runs all produced the same results.