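Previous post
https://swlock.blogspot.com/2020/01/doc2vec.html
The previous post walked through a basic example.
This time the question is: how do we find the documents most similar to a new sentence?
The earlier example only looked up the documents most similar to a label (tag) that already existed in the training data.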
similar_doc = model.docvecs.most_similar('1')
Before we begin
First, let's look at the official documentation. Two of the attributes it describes are wv and docvecs.
https://radimrehurek.com/gensim/models/doc2vec.html
class gensim.models.doc2vec.Doc2Vec(documents=None, corpus_file=None, dm_mean=None, dm=1, dbow_words=0, dm_concat=0, dm_tag_count=1, docvecs=None, docvecs_mapfile=None, comment=None, trim_rule=None, callbacks=(), **kwargs)
Bases: gensim.models.base_any2vec.BaseWordEmbeddingsModel
Class for training, using and evaluating neural networks described in Distributed Representations of Sentences and Documents.
Some important internal attributes are the following:
wv
- This object essentially contains the mapping between words and embeddings. After training, it can be used directly to query those embeddings in various ways. See the module level docstring for examples.
- Type
Word2VecKeyedVectors
docvecs
- This object contains the paragraph vectors learned from the training data. There will be one such vector for each unique document tag supplied during training. They may be individually accessed using the tag as an indexed-access key. For example, if one of the training documents used a tag of ‘doc003’:
>>> model.docvecs['doc003']
- Type
Doc2VecKeyedVectors
WordVector
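Doc2Vec is a document-level model, but internally it also maintains word vectors, so it can handle individual words as well. Because wv holds the word vectors, you can pass in words and get back the most similar ones, as below ('영화' is "movie", '남자배우' is "actor", '여배우' is "actress"):
print(doc_vectorizer.wv.most_similar(positive=['영화', '남자배우'], negative=['여배우']))
For usage details, see the blog post below.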
http://hero4earth.com/blog/projects/2018/01/21/naver_movie_review/
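To make the positive/negative arguments less opaque, here is a minimal pure-Python sketch of the idea behind that call: the positive vectors are added, the negative ones subtracted, and the remaining words are ranked by cosine similarity to the result. It is an illustration only, not gensim's implementation; the function names and the 2-dimensional toy vectors are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity of two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def most_similar(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to mean(+positive, -negative)."""
    signed = [(unit(vectors[w]), 1.0) for w in positive]
    signed += [(unit(vectors[w]), -1.0) for w in negative]
    dim = len(signed[0][0])
    query = [sum(s * v[i] for v, s in signed) / len(signed) for i in range(dim)]
    skip = set(positive) | set(negative)  # the query words are not returned
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in skip]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy vectors, invented for this sketch (not trained embeddings).
toy_vectors = {
    "movie": [1.0, 0.0],
    "actor": [0.6, 0.8],
    "actress": [0.6, -0.8],
    "director": [0.8, 0.6],
    "heroine": [0.3, -0.9],
}
ranked = most_similar(toy_vectors, positive=["movie", "actor"], negative=["actress"])
print(ranked)
```

With real embeddings the vectors have many more dimensions, but the call shape is the same: a list of words to pull toward and a list to push away from.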
Using docvecs.most_similar
Even once you know about docvecs.most_similar, you may wonder what to pass as its arguments. It is simple: tokenize the new sentence, convert the tokens to a vector, and pass that vector in the positive list. model.infer_vector is the call that converts a token list into a vector.
Document Similarity using doc2vec
# find most similar doc
test_doc = word_tokenize("That is a good device".lower())
model.docvecs.most_similar(positive=[model.infer_vector(test_doc)], topn=5)
The snippet comes from the link below.
https://www.thinkinfi.com/2019/10/doc2vec-implementation-in-python-gensim.html
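Conceptually, passing an inferred vector to most_similar amounts to ranking every stored document vector by its cosine similarity to that vector. The sketch below shows that ranking step in plain Python. The tags, the 3-dimensional vectors, and the function names are hypothetical stand-ins for model.docvecs and the output of model.infer_vector, not real model output.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_vector(doc_vectors, query_vector, topn=5):
    """Return the topn (tag, similarity) pairs, most similar first."""
    scored = [(tag, cosine(query_vector, vec)) for tag, vec in doc_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Hypothetical stand-ins: one vector per training tag, plus a query vector
# playing the role of an inferred vector for an unseen sentence.
doc_vectors = {
    "0": [0.9, 0.1, 0.0],
    "1": [0.1, 0.9, 0.2],
    "2": [0.0, 0.2, 0.9],
}
inferred = [0.1, 0.3, 0.8]

top = rank_by_vector(doc_vectors, inferred, topn=2)
print(top)
```

The result has the same shape as what most_similar returns: a list of (tag, score) tuples sorted from most to least similar.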
Working on the source
Let's modify the source code from the previous post.
Source
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
data = ["I love machine learning. Its awesome.",
        "I love coding in python",
        "I love building chatbots",
        "they chat amagingly well"]

tagged_data = []
for i, _d in enumerate(data):
    tagged_data.append(TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]))
print(tagged_data)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
max_epochs = 1000
vec_size = 20
alpha = 0.025

model = Doc2Vec(size=vec_size,
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1)

model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    #print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

model.save("d2v.model")
print("Model Saved")

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("d2v.model")
#to find the vector of a document which is not in training data
test_data = word_tokenize("I love chatbots".lower())
print(model.docvecs.most_similar(positive=[model.infer_vector(test_data)], topn=5))
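One detail of the training loop in the source is worth checking by hand: it subtracts 0.0002 from model.alpha on each of the 1000 iterations, starting from 0.025. The sketch below replays only that arithmetic, with no gensim involved; the variable names are illustrative.

```python
alpha = 0.025          # starting learning rate from the source
step = 0.0002          # amount subtracted after every train() call
max_epochs = 1000      # number of loop iterations in the source

first_non_positive = None
for epoch in range(max_epochs):
    # model.train(...) would run here using the current alpha
    alpha -= step
    if first_non_positive is None and alpha <= 0:
        first_non_positive = epoch + 1   # count of completed iterations

print("alpha stops being positive after about", first_non_positive, "iterations")
print("final alpha:", alpha)
```

If the schedule behaves as this arithmetic suggests, the rate reaches zero after roughly 125 iterations and stays negative for the rest of the run. That may be one reason the similarity scores in the results are so close to each other, though this is a guess rather than something the source confirms. Lowering max_epochs or shrinking the step so that alpha stays positive would be one way to test it.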
Output
[TaggedDocument(words=['i', 'love', 'machine', 'learning', '.', 'its', 'awesome', '.'], tags=['0']), TaggedDocument(words=['i', 'love', 'coding', 'in', 'python'], tags=['1']), TaggedDocument(words=['i', 'love', 'building', 'chatbots'], tags=['2']), TaggedDocument(words=['they', 'chat', 'amagingly', 'well'], tags=['3'])]
C:\Users\USER\AppData\Local\Programs\Python\Python37\lib\site-packages\gensim\models\doc2vec.py:574: UserWarning: The parameter `size` is deprecated, will be removed in 4.0.0, use `vector_size` instead.
  warnings.warn("The parameter `size` is deprecated, will be removed in 4.0.0, use `vector_size` instead.")
29.py:37: DeprecationWarning: Call to deprecated `iter` (Attribute will be removed in 4.0.0, use self.epochs instead).
  epochs=model.iter)
Model Saved
[('2', 0.999824047088623), ('1', 0.9997918009757996), ('0', 0.9997824430465698), ('3', 0.9994299411773682)]
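The last line of the output is a list of (tag, similarity) tuples, and each tag is the string form of the sentence's index in data. A small sketch of mapping the tags back to the original sentences follows. The results list is copied from the output above; the remaining variable names are illustrative.

```python
data = ["I love machine learning. Its awesome.",
        "I love coding in python",
        "I love building chatbots",
        "they chat amagingly well"]

# Copied from the printed output of most_similar above.
results = [('2', 0.999824047088623), ('1', 0.9997918009757996),
           ('0', 0.9997824430465698), ('3', 0.9994299411773682)]

# Tags were created with str(i), so int(tag) recovers the index into data.
for tag, score in results:
    print(f"{score:.6f}  tag={tag}  {data[int(tag)]}")

best_tag, best_score = results[0]
best_sentence = data[int(best_tag)]

# Gap between the best and worst score, to judge how decisive the ranking is.
spread = results[0][1] - results[-1][1]
print("best match:", best_sentence)
print("score spread:", spread)
```

For the query "I love chatbots", the top tag '2' maps to "I love building chatbots", which is the expected match. The spread between the highest and lowest score is under 0.001, though, so the ranking separates the four documents only by a very small margin.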