SW정리: word2vec

word2vec is a widely used word embedding method. In my opinion, the page linked below explains it well and is easy to follow along with.
https://wikidocs.net/50739
Here I assume you already know the basic concepts of word2vec. Based on what I read, I have written up the points I was curious about.

input data

The most important thing about the data is its input format: word2vec takes its sentences as input in the form of a list.

https://rare-technologies.com/word2vec-tutorial/#preparing_the_input
I looked at the input data format using the simple example on this page.

A simple example

Starting from the beginning, gensim’s word2vec expects a sequence of sentences as its input. Each sentence is a list of words (utf8 strings):

# import modules & set up logging
import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

sentences = [['first', 'sentence'], ['second', 'sentence']]
# train word2vec on the two sentences
model = gensim.models.Word2Vec(sentences, min_count=1)

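This is a simple example; the part to look at is the sentences line just above. Each inner list is one sentence, and the outer list holds several of those sentences, so the input is a list of lists.
sentences = [
['first', 'sentence'],   # <== one sentence
['second', 'sentence']   # <== one sentence
]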
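
As a minimal sketch of how raw text could be turned into this list-of-lists shape, assuming that lowercasing and whitespace splitting are good enough as a tokenizer. The to_sentences helper and the raw_text string are hypothetical names for illustration, not part of gensim:

```python
def to_sentences(raw_text):
    """Split text into one token list per non-empty line."""
    sentences = []
    for line in raw_text.splitlines():
        tokens = line.lower().split()  # naive whitespace tokenizer
        if tokens:                     # skip blank lines
            sentences.append(tokens)
    return sentences


raw_text = "First sentence\n\nSecond sentence\n"
sentences = to_sentences(raw_text)
print(sentences)  # [['first', 'sentence'], ['second', 'sentence']]
```

The result has the same shape as the sentences variable in the example above, so it could be passed to Word2Vec in the same way.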

train

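Usually, with this kind of model, you create the model first and then train it in a separate step.
However, whenever I look at gensim.models.Word2Vec examples, I cannot find any training code.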

Here is one such example.
https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-download-auto-examples-tutorials-run-word2vec-py

Training Your Own Model

To start, you’ll need some data for training the model. For the following examples, we’ll use the Lee Corpus (which you already have if you’ve installed gensim).

This corpus is small enough to fit entirely in memory, but we’ll implement a memory-friendly iterator that reads it line-by-line to demonstrate how you would handle a larger corpus.

from gensim.test.utils import datapath
from gensim import utils

class MyCorpus(object):
    """An iterator that yields sentences (lists of str)."""

    def __iter__(self):
        corpus_path = datapath('lee_background.cor')
        for line in open(corpus_path):
            # assume there's one document per line, tokens separated by whitespace
            yield utils.simple_preprocess(line)

If we wanted to do any custom preprocessing, e.g. decode a non-standard encoding, lowercase, remove numbers, extract named entities… All of this can be done inside the MyCorpus iterator and word2vec doesn’t need to know. All that is required is that the input yields one sentence (list of utf8 words) after another.

Let’s go ahead and train a model on our corpus. Don’t worry about the training parameters much for now, we’ll revisit them later.

import gensim.models

sentences = MyCorpus()
model = gensim.models.Word2Vec(sentences=sentences)

Once we have our model, we can use it in the same way as in the demo above.

The main part of the model is model.wv, where “wv” stands for “word vectors”.

vec_king = model.wv['king']

Retrieving the vocabulary works the same way:

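# gensim 3.x API: gensim 4.x removed wv.vocab; use wv.index_to_key there (ordered by frequency, so the printed order may differ)
for i, word in enumerate(model.wv.vocab):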
    if i == 10:
        break
    print(word)

Out:

hundreds
of
people
have
been
forced
to
their
homes
in

Even after looking through it, I could not find an explicit training step.

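It turns out that Word2Vec trains by default as part of construction: when sentences are passed to the constructor, the model is trained right away.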
https://rare-technologies.com/word2vec-tutorial/#training

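If you do need a separate training step, for example when the data is not static (not sitting on the local disk), you can run it manually.
See the passage below, although I doubt many people will need to go this far. If you really need it, use the link below as a reference for your own implementation.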

https://rare-technologies.com/word2vec-tutorial/#preparing_the_input

Note to advanced users: calling Word2Vec(sentences, iter=1) will run two passes over the sentences iterator (or, in general iter+1 passes; default iter=5). The first pass collects words and their frequencies to build an internal dictionary tree structure. The second and subsequent passes train the neural model. These two (or, iter+1) passes can also be initiated manually, in case your input stream is non-repeatable (you can only afford one pass), and you’re able to initialize the vocabulary some other way:

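# Updated for gensim 4.x: iter was renamed to epochs, and train() requires
# total_examples (or total_words) and epochs to be passed explicitly.
model = gensim.models.Word2Vec(epochs=1)  # an empty model, no training yet
model.build_vocab(some_sentences)  # can be a non-repeatable, 1-pass generator
# corpus_count is the number of sentences seen by build_vocab
model.train(other_sentences, total_examples=model.corpus_count, epochs=model.epochs)  # can be a non-repeatable, 1-pass generator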

In case you’re confused about iterators, iterables and generators in Python, check out our tutorial on Data Streaming in Python.
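
The distinction matters here because, as the note above says, training makes several passes over the input. The following is a minimal sketch in plain Python (no gensim needed), assuming a small temporary file as the corpus. LineCorpus is a hypothetical name modeled on the MyCorpus class quoted earlier:

```python
import os
import tempfile


class LineCorpus:
    """Re-iterable corpus: every call to __iter__ reopens the file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.lower().split()


# Write a tiny two-line corpus to a temporary file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first sentence\nsecond sentence\n")
    corpus_path = tmp.name

corpus = LineCorpus(corpus_path)
first_pass = list(corpus)
second_pass = list(corpus)   # works again: the file is reopened

one_shot = iter(corpus)      # a plain generator, by contrast
list(one_shot)               # the first pass consumes it
leftover = list(one_shot)    # nothing is left for a second pass

os.remove(corpus_path)
print(first_pass == second_pass, leftover)  # True []
```

An object like corpus can be walked as many times as needed, while a generator like one_shot is only good for a single pass, which is the situation the manual build_vocab and train calls are meant for.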


Sunday, January 19, 2020

word2vec: input data / train
