Sunday, February 9, 2020

Analyzing a doc2vec example (3)


Previous post


https://swlock.blogspot.com/2020/01/doc2vec.html

In the previous post, we looked at the basic example.

The topic this time is: how do we find similar sentences?
The previous example only found the documents most similar to an already existing label:

similar_doc = model.docvecs.most_similar('1')
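Under the hood, this call just ranks the stored paragraph vectors by cosine similarity against the vector for tag '1'. Here is a minimal sketch of that ranking logic, using made-up 2-d vectors (the tags and values are hypothetical, not real model output):

```python
import math

def cosine(u, v):
    # cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# toy "paragraph vectors" keyed by tag (hypothetical values)
docvecs = {
    '0': [0.9, 0.1],
    '1': [0.8, 0.3],
    '2': [0.1, 0.9],
}

def most_similar(tag, topn=2):
    # rank every other document by cosine similarity to the query tag's vector
    query = docvecs[tag]
    scores = [(t, cosine(query, v)) for t, v in docvecs.items() if t != tag]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:topn]

print(most_similar('1'))  # tag '0' ranks first: its vector points the same way as '1'
```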




Before we start

First, let's look at the official documentation. It describes two attributes, wv and docvecs.
https://radimrehurek.com/gensim/models/doc2vec.html


class gensim.models.doc2vec.Doc2Vec(documents=None, corpus_file=None, dm_mean=None, dm=1, dbow_words=0, dm_concat=0, dm_tag_count=1, docvecs=None, docvecs_mapfile=None, comment=None, trim_rule=None, callbacks=(), **kwargs)
Bases: gensim.models.base_any2vec.BaseWordEmbeddingsModel
Class for training, using and evaluating neural networks described in Distributed Representations of Sentences and Documents.
Some important internal attributes are the following:
wv
This object essentially contains the mapping between words and embeddings. After training, it can be used directly to query those embeddings in various ways. See the module level docstring for examples.
Type
Word2VecKeyedVectors
docvecs
This object contains the paragraph vectors learned from the training data. There will be one such vector for each unique document tag supplied during training. They may be individually accessed using the tag as an indexed-access key. For example, if one of the training documents used a tag of ‘doc003’:
>>> model.docvecs['doc003']
Type
Doc2VecKeyedVectors


WordVector

Although Doc2Vec is a document-oriented model, it also maintains word vectors internally, so it can handle words as well. Since wv holds the word vectors, you can pass in words and get the most similar ones back, like this:

print(doc_vectorizer.wv.most_similar(positive=['영화', '남자배우'], negative=['여배우']))

For usage details, see the blog below.
http://hero4earth.com/blog/projects/2018/01/21/naver_movie_review/
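For reference, the positive/negative arguments work by vector arithmetic: gensim unit-normalizes each word vector, adds the positive ones, subtracts the negative ones, and ranks the rest of the vocabulary by cosine similarity to that combined vector. A rough sketch with made-up 2-d vectors (the words follow the example above: '영화' movie, '남자배우' actor, '여배우' actress, plus a hypothetical '감독' director; the numbers are invented for illustration):

```python
import math

def unit(v):
    # scale a vector to unit length
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(u, v):
    # cosine similarity = dot product of the unit vectors
    return sum(a * b for a, b in zip(unit(u), unit(v)))

# hypothetical 2-d word vectors
wv = {
    '영화':   [0.9, 0.2],
    '남자배우': [0.5, 0.8],
    '여배우':  [0.4, 0.9],
    '감독':   [0.95, 0.15],
}

def most_similar(positive, negative, topn=1):
    # add unit vectors for positive words, subtract them for negative words
    query = [0.0, 0.0]
    for w in positive:
        query = [a + b for a, b in zip(query, unit(wv[w]))]
    for w in negative:
        query = [a - b for a, b in zip(query, unit(wv[w]))]
    exclude = set(positive) | set(negative)
    scores = [(w, cosine(query, v)) for w, v in wv.items() if w not in exclude]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:topn]

print(most_similar(['영화', '남자배우'], ['여배우']))
```

Input words are excluded from the result, just as gensim excludes them; only '감독' remains as a candidate here.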



Using docvecs.most_similar

Even when using docvecs.most_similar, you may wonder what to pass as the trailing arguments. It's simple: pass the tokenized document, converted to a vector, as positive. model.infer_vector is what converts the tokens into a vector.

Document Similarity using doc2vec 

# find most similar doc 
test_doc = word_tokenize("That is a good device".lower())
model.docvecs.most_similar(positive=[model.infer_vector(test_doc)],topn=5)
The source is the link below.
https://www.thinkinfi.com/2019/10/doc2vec-implementation-in-python-gensim.html



Working on the source

Let's modify the source we worked on last time.

Source
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

data = ["I love machine learning. Its awesome.",
        "I love coding in python",
        "I love building chatbots",
        "they chat amagingly well"]

tagged_data = []

for i, _d in enumerate(data):
    tagged_data.append(TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]))

print(tagged_data)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

max_epochs = 1000
vec_size = 20
alpha = 0.025

model = Doc2Vec(size=vec_size,
                alpha=alpha, 
                min_alpha=0.00025,
                min_count=1,
                dm=1)
  
model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    #print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

model.save("d2v.model")
print("Model Saved")

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


from gensim.models.doc2vec import Doc2Vec

model= Doc2Vec.load("d2v.model")
#to find the vector of a document which is not in training data
test_data = word_tokenize("I love chatbots".lower())

print(model.docvecs.most_similar(positive=[model.infer_vector(test_data)],topn=5))
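One caveat worth noting in this training loop: subtracting 0.0002 from alpha on every one of the 1000 epochs drives the learning rate to zero at epoch 124 and negative after that, so most of the loop effectively trains with a nonsensical negative alpha. A quick check of the schedule:

```python
def first_nonpositive_epoch(start=0.025, step=0.0002, max_epochs=1000):
    """Return the first epoch at which the manually decayed alpha hits zero or below."""
    alpha = start
    for epoch in range(max_epochs):
        alpha -= step          # same decay as in the training loop above
        if alpha <= 1e-9:      # tolerance for floating-point rounding
            return epoch
    return None

print(first_nonpositive_epoch())  # 0.025 / 0.0002 = 125 steps, i.e. epoch index 124
```

If plain decay is the goal, calling model.train once with epochs=max_epochs and letting gensim interpolate alpha down to min_alpha internally avoids this problem.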

Results
[TaggedDocument(words=['i', 'love', 'machine', 'learning', '.', 'its', 'awesome', '.'], tags=['0']), TaggedDocument(words=['i', 'love', 'coding', 'in', 'python'], tags=['1']), TaggedDocument(words=['i', 'love', 'building', 'chatbots'], tags=['2']), TaggedDocument(words=['they', 'chat', 'amagingly', 'well'], tags=['3'])]
C:\Users\USER\AppData\Local\Programs\Python\Python37\lib\site-packages\gensim\models\doc2vec.py:574: UserWarning: The parameter `size` is deprecated, will be removed in 4.0.0, use `vector_size` instead.
  warnings.warn("The parameter `size` is deprecated, will be removed in 4.0.0, use `vector_size` instead.")
29.py:37: DeprecationWarning: Call to deprecated `iter` (Attribute will be removed in 4.0.0, use self.epochs instead).
  epochs=model.iter)
Model Saved
[('2', 0.999824047088623), ('1', 0.9997918009757996), ('0', 0.9997824430465698), ('3', 0.9994299411773682)]
Looking at the result, the sentence most similar to 'I love chatbots' is document '2', "I love building chatbots".




