Sunday, September 9, 2018

Deep Learning with Sequence Data and Text (PyTorch) (4)

Before we begin


The examples in this post are written in PyTorch.
The content follows Chapter 6 of the book Deep Learning with PyTorch (Vishnu Subramanian).


Building vocabulary

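Earlier, in the thor_review example, we built a word2idx dictionary for one-hot encoding. That dictionary holds every unique word (duplicates removed) across the documents, which is why it is called the vocabulary. torchtext makes this easier. Once the data is loaded, we can call build_vocab and pass it the arguments it needs to build the vocabulary from the data.
The following code shows how the vocabulary is built.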

TEXT.build_vocab(train, vectors=GloVe(name='6B',dim=300),max_size=10000,min_freq=10)
LABEL.build_vocab(train)

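In the code above, we pass the train object from which the vocabulary is built. We also ask for the vectors to be initialized with pretrained 300-dimensional embeddings. build_vocab downloads the embeddings and creates the dimensions that are used later, when we train the sentiment classifier with pretrained weights. The max_size argument limits the number of words in the vocabulary, and min_freq drops any word that appears fewer than 10 times. The value 10 is adjustable.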
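To make the effect of max_size and min_freq concrete, here is a minimal pure-Python sketch. It is not torchtext's implementation; build_vocab_sketch is a hypothetical helper, and the ordering rule (most frequent first, ties in alphabetical order) is an assumption based on the stoi output shown in this post.

```python
from collections import Counter

def build_vocab_sketch(tokens, max_size=None, min_freq=1, specials=("<unk>", "<pad>")):
    """Hypothetical helper: returns (itos, stoi, freqs) for a list of tokens."""
    freqs = Counter(tokens)
    itos = list(specials)
    # Alphabetical first, then a stable sort by descending count,
    # so words with equal counts stay in alphabetical order.
    ordered = sorted(freqs.items(), key=lambda kv: kv[0])
    ordered.sort(key=lambda kv: kv[1], reverse=True)
    # max_size counts real words only; the special tokens come on top.
    limit = None if max_size is None else max_size + len(specials)
    for word, count in ordered:
        if count < min_freq or (limit is not None and len(itos) >= limit):
            break
        itos.append(word)
    stoi = {word: idx for idx, word in enumerate(itos)}
    return itos, stoi, freqs

tokens = "the movie was good the movie was bad the end".split()
itos, stoi, freqs = build_vocab_sketch(tokens, max_size=2, min_freq=2)
print(itos)  # ['<unk>', '<pad>', 'the', 'movie']
```

Here 'was' also appears twice, but max_size=2 cuts it off; 'good', 'bad', and 'end' would be dropped by min_freq=2 anyway.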

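from torchtext import data, datasets
from torchtext.vocab import GloVe

# Reviews are lower-cased and padded or truncated to 20 tokens.
TEXT = data.Field(lower=True, batch_first=True, fix_length=20)
# Labels are single values, not token sequences.
LABEL = data.Field(sequential=False)

train, test = datasets.IMDB.splits(TEXT, LABEL)

# min_freq=5000 here (not 10 as above), so only very frequent words are kept.
TEXT.build_vocab(train, vectors=GloVe(name='6B', dim=300), max_size=10000, min_freq=5000)
LABEL.build_vocab(train)

print("TEXT.vocab.freqs")
print(TEXT.vocab.freqs)    # word -> count
print("TEXT.vocab.vectors")
print(TEXT.vocab.vectors)  # one embedding row per vocabulary index
print("TEXT.vocab.stoi")
print(TEXT.vocab.stoi)     # word -> index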

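Once the vocabulary is built, we can get values such as the frequency, the index, and the vector representation of each word. The print calls above access those values; their output follows.

...(omitted)...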
lotte'": 1, '\x91autumn': 1, 'lovers).': 1, 'hammered-to-death': 1, '`whatever': 1, "jane?'<br": 1, '`mayberry': 1, "rfd'": 1, '`buck': 1, "`dynasty's'": 1, 'bellwood.<br': 1, 'extinguishers.': 1, 'decked...:)': 1, 'pentagon.)': 1, 'pressurized': 1, 'post-facto': 1, 'submarine.<br': 1, 'horizon.<br': 1, '/>"we\'re': 1, 'own!",': 1, 'phoenix."': 1, "'descent'": 1, 'pervy,': 1, "'rape/revenge'": 1, 'unsubstantial.': 1, 'raped:': 1, 'intensity!': 1, "'irreversible'-style": 1, "'inferno'": 1, 'american-canadian': 1, "me'...": 1, 'lascivious/decadent': 1, 'chronicles.<br': 1, 'whelk.<br': 1, '18,000': 1, "shite'": 1})
TEXT.vocab.vectors
tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0466,  0.2132, -0.0074,  ...,  0.0091, -0.2099,  0.0539],
        ...,
        [-0.2573,  0.4637, -0.0436,  ..., -0.3611,  0.1480,  0.1106],
        [-0.4756,  0.0733,  0.1778,  ..., -0.3984, -0.1267, -0.0192],
        [-0.0881, -0.0217,  0.2986,  ..., -0.3547, -0.4271, -0.3932]])
TEXT.vocab.stoi
defaultdict(<function _default_unk_index at 0x00000118F9EDB0D0>, {'<unk>': 0, '<pad>': 1, 'the': 2, 'a': 3, 'and': 4, 'of': 5, 'to': 6, 'is': 7, 'in': 8, 'i': 9, 'this': 10, 'that': 11, 'it': 12, '/><br': 13, 'was': 14, 'as': 15, 'for': 16, 'with': 17, 'but': 18, 'on': 19, 'movie': 20, 'his': 21, 'are': 22, 'not': 23, 'film': 24, 'you': 25, 'have': 26, 'he': 27, 'be': 28, 'at': 29, 'one': 30, 'by': 31, 'an': 32, 'they': 33, 'from': 34, 'all': 35, 'who': 36, 'like': 37, 'so': 38, 'just': 39, 'or': 40, 'has': 41, 'her': 42, 'about': 43, "it's": 44, 'some': 45, 'if': 46, 'out': 47, 'what': 48, 'very': 49, 'when': 50, 'more': 51, 'there': 52, 'she': 53, 'would': 54, 'even': 55, 'good': 56, 'my': 57, 'only': 58, 'their': 59, 'no': 60, 'really': 61, 'had': 62, 'which': 63, 'can': 64, 'up': 65, 'were': 66, 'see': 67, 'than': 68, 'we': 69, '-': 70, 'been': 71, 'into': 72, 'get': 73, 'will': 74, 'story': 75, 'much': 76, 'because': 77, 'most': 78, 'how': 79, 'other': 80, 'also': 81, 'first': 82, 'its': 83, 'time': 84, 'do': 85, "don't": 86, 'me': 87, 'great': 88, 'people': 89, 'could': 90, 'make': 91, 'any': 92, '/>the': 93, 'after': 94, 'made': 95, 'then': 96, 'bad': 97, 'think': 98, 'being': 99, 'many': 100, 'him': 101, 'never': 102, 'two': 103, 'too': 104, 'little': 105, 'where': 106, 'well': 107, '<br': 108, 'way': 109, 'watch': 110, 'your': 111, 'it.': 112, 'did': 113, 'does': 114, 'them': 115, 'best': 116, 'movie.': 117, 'know': 118, 'seen': 119, 'love': 120, 'characters': 121, 'character': 122, 'movies': 123, 'these': 124, 'ever': 125, 'still': 126, 'over': 127})

The pretrained vectors are at the link below; the download occasionally fails.
'http://nlp.stanford.edu/data/glove.6B.zip'
You can check the source in the installed package: Anaconda3\Lib\site-packages\torchtext\vocab.py

vocab.stoi maps each word (string) in the vocabulary to its index.
https://torchtext.readthedocs.io/en/latest/vocab.html?highlight=stoi
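The stoi output above is a defaultdict, so a word that is not in the vocabulary falls back to index 0 (<unk>). Here is a small sketch of that lookup, using a few entries copied from the output. numericalize is a hypothetical helper, not a torchtext function, and padding at the end of the sequence is an assumption.

```python
from collections import defaultdict

# A few entries copied from the stoi output above.
known = {"<unk>": 0, "<pad>": 1, "the": 2, "a": 3, "and": 4, "movie": 20, "film": 24}

# Unknown words fall back to index 0 (<unk>), like the defaultdict in the output.
stoi = defaultdict(lambda: 0, known)

def numericalize(sentence, fix_length=None):
    """Hypothetical helper: lower-case, split on whitespace, map to indices, pad or truncate."""
    ids = [stoi[token] for token in sentence.lower().split()]
    if fix_length is not None:
        # Truncate long inputs; pad short ones with the <pad> index.
        ids = ids[:fix_length] + [stoi["<pad>"]] * (fix_length - len(ids))
    return ids

print(numericalize("The movie and the popcorn", fix_length=7))
# [2, 20, 4, 2, 0, 1, 1]
```

'popcorn' is not in the dictionary, so it becomes 0, and the two trailing 1s are padding.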

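A simple example with custom data

The example above fetches prepared data, and it also downloads the vectors, so it is awkward for changing things and checking the resulting values.
I prepared a very simple example instead.

Preparing the train data

We prepare the data as a tsv file. A tsv file separates fields with tabs, so below there is a tab, not a space, between text and label. Likewise, the 1 or 0 at the end of each text is separated from the closing quote by a tab. 0 marks a negative review and 1 a positive one.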
traindata.tsv
text label
"An American in Paris is a wonderful musical about an American painter living in Paris for inspiration. He meets a rich woman who admires his paintings on the street and she believes she can get his work to be even more popular to the public, e.g. in a museum. Golden Globe nominated Gene Kelly as the artist Jerry Mulligan is just perfect at both singing and especially dancing. He also meets the main girl Lise Bouvier (Leslie Caron) who is engaged to his best friend. He can't help his feelings for this girl, even after he finds out who she is engaged to. Filled with nice romance and wonderful song and dance, this is a very good musical film. It may drag slightly with his dancing dream sequence, i.e. The American in Paris ballet, but there is a good happy ending. It won the Oscars for Best Art Direction-Set Decoration, Best Cinematography, Best Costume Design, Best Music, Scoring of a Musical Picture, Best Writing, Story and Screenplay and Best Picture, and it was nominated for Best Director for Vincente Minnelli and Best Film Editing, it was nominated the BAFTA for Best Film from any Source, and it won the Golden Globe for Best Motion Picture - Musical/Comedy, and it was nominated for Best Director for Vincente Millenni (Liza's father). Gene Kelly was number 66 on The 100 Movie Stars, and he was number 15 on 100 Years, 100 Stars - Men, ""I Got Rhythm"" was number 32 on 100 Years, 100 Songs, the film was number 9 on 100 Years of Musicals, it was number 39 on 100 Years, 100 Passions, it was number 68 on 100 Years, 100 Movies, and it was number 58 on The 100 Greatest Musicals. Very good!" 1
"If Todd Sheets were to come out and admit that this movie was intended to spoof the zombie genre, I would change my rating to an eight. Try to imagine a movie where every scene, line, and even every acting nuance was designed to be a parody. I could probably crap out alphabet soup, rearrange what was left of the letters, and still have a better script. Two scenes in particular come to mind when I think of this movie. SPOILER ALERT! One is when Mike's dad and the other dad walk, I repeat walk down a staircase jam packed with zombies. This is a small staircase and even though they brush up against the flailing undead, nothing happens to them. When they reach the end, the ex-marine turns around, says ""God you're a horny bastard"", and shoots only one. The other is in the military complex. The girl stabs a zombie with a machete and is immediately surrounded. The camera moves around her for roughly forty seconds, while she is surrounded by zombies at an arm's length away. She then almost casually runs out from the crowd and joins the other humans. SPOILER ALERT OVER! These scenes must be seen to be believed. Still, I enjoy this movie as much as almost any comedy just because it's so damn funny. Kudos to Todd Sheets for getting so many people in his movie and having the drive to make it but not really for anything else." 0
"This ambitious film suffers most from writer/director Paul Thomas Anderson's delusions of grandeur. Highly derivative of much better material (Altman's ""Nashville,"" Lumet's ""Network""), this lumbering elephant takes far too long to get nowhere. A couple of misguided detours along the way (an embarrassing musical interlude, a biblical plague) don't help matters. Neither does the uneven level of performances. Especially bad: William H. Macy, whose character and storyline could easily have been eliminated altogether; Julianne Moore, for her unconvincing angst. And how many times must we see John C. Reilly's Sad Sack shtick (""Chicago"" and ""The Hours"" will suffice)? Tom Cruise comes off well by comparison his misogynist, foul-mouthed Holy Roller was rather amusing. Speaking of foul mouths, the script was so loaded with ""F"" bombs, they lost their impact in no time. Don't even talk about that awful soundtrack, full of insipid and annoying vocals by Aimee Mann. Her extended rendition of ""One,"" a maudlin number to begin with, drove me to distraction at the start of the film. I should have heeded the handwriting on the wall and saved myself three more hours, by which time I'd been pushed to the brink of hell. One redeeming feature, which I haven't seen mentioned in other reviews, is the best performance in the bunch, by unknown Melora Walters in the role of Claudia, the damaged coke fiend bent on self-destruction. Her credibility exceeded all others by far. This film took itself way too seriously and just didn't know when to end." 0
"The astonishing waste of production money is filmic proof that the rich and famous can be just as stupid and wasteful as politicians. From a (silly) play by Tennessee Williams and directed (with a dead hand) by Joseph Losey and starring Taylor and Burton and Noel Coward - this project filmed in a spectacular cliff-top mountain island mansion in the Mediterranean must have seemed a sure fire winner when presented to Universal in 1967. The result is so absurd and tedious that it almost defies belief. Visually the film is spectacular but that is the force of nature that has allowed the setting and the fact that a real home is used instead of a set. The shrill antics of a screeching Taylor, Burton's half asleep wanderings, the loony dialog, Noel Coward laughing at himself, the ridiculous story and plot devices and the absurd costuming simply irritate the viewer. BOOM is a disgrace, a waste of money and talent and clear proof that lauded famous people can be idiots just like the rest of the planet's plebs. Not even fun. Just terrible and mad shocking waste." 0

Reading the train data

Field was explained earlier. Here LABEL uses sequential=False. In deep learning, sequential data means data in which earlier elements are related to later ones; text is the typical example. LABEL is the answer, here 0 or 1 for negative or positive, so sequential is False. It also doesn't need a vocabulary, hence use_vocab=False.
After that, we use the TabularDataset class to read the TSV file. skip_header=True skips the first line.

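from torchtext.data import Field, TabularDataset

# Review text: tokenized, lower-cased, and mapped through a vocabulary.
TEXT = Field(sequential=True,
             use_vocab=True,
             lower=True,
             batch_first=True)
# Label: a single 0/1 value that is already numeric, so no vocabulary.
LABEL = Field(sequential=False,
              use_vocab=False,
              batch_first=True)

# skip_header=True skips the "text label" header line.
train_data = TabularDataset(path='./traindata.tsv', skip_header=True, format='tsv', fields=[('text', TEXT), ('label', LABEL)])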
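To check what a tab-separated file with quoted text looks like once parsed, here is a sketch using only Python's csv module. It doesn't use torchtext, and the two sample rows are made up for illustration; they follow the same layout as traindata.tsv (quoted text, a tab, then the label, with a doubled "" for a quote inside the text).

```python
import csv
import io

# Two short made-up rows in the same layout as traindata.tsv.
sample = (
    'text\tlabel\n'
    '"A ""very good"" musical."\t1\n'
    '"Just terrible."\t0\n'
)

reader = csv.reader(io.StringIO(sample), delimiter="\t")
next(reader)  # skip the header row, like skip_header=True
rows = [(text, int(label)) for text, label in reader]

for text, label in rows:
    print(label, text)
# 1 A "very good" musical.
# 0 Just terrible.
```

The doubled quotes come back as single quotes and the surrounding quotes are removed, so only the tab decides where the label starts.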

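Building the vocabulary

As mentioned above, building the vocabulary only takes a call to build_vocab.
There are several options, but here I only set min_freq=2. Words that appear fewer than 2 times are left out. In other words, a word must appear at least 2 times to be included.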
TEXT.build_vocab(train_data, min_freq = 2)

Full source

Here is the full source for everything so far.
from torchtext.data import Field, TabularDataset

# Review text: tokenized, lower-cased, and mapped through a vocabulary.
TEXT = Field(sequential=True,
             use_vocab=True,
             lower=True,
             batch_first=True)
# Label: a single 0/1 value that is already numeric, so no vocabulary.
LABEL = Field(sequential=False,
              use_vocab=False,
              batch_first=True)

train_data = TabularDataset(path='./traindata.tsv', skip_header=True, format='tsv', fields=[('text', TEXT), ('label', LABEL)])

# Keep only words that appear at least twice.
TEXT.build_vocab(train_data, min_freq=2)

print('Total vocabulary: {}'.format(len(TEXT.vocab)))
print('Token for "<unk>": {}'.format(TEXT.vocab.stoi['<unk>']))
print('Token for "<pad>": {}'.format(TEXT.vocab.stoi['<pad>']))
print('Token freq : {}'.format(TEXT.vocab.freqs))
print(TEXT.vocab.stoi)

Result
(base) E:\>python stest.py
Total vocabulary: 104
Token for "<unk>": 0
Token for "<pad>": 1
Token freq : Counter({'the': 51, 'and': 40, 'a': 26, 'of': 20, 'to': 19, 'is': 18, 'was': 15, 'in': 14, 'for': 13, 'best': 13, 'it': 11, '100': 11, 'on': 10, 'this': 10, 'number': 8, 'that': 8, 'by': 8, 'his': 7, 'i': 7, 'be': 6, 'even': 6, 'just': 6, 'film': 6, 'he': 5, 'she': 5, 'as': 5, 'with': 5, 'movie': 5, 'when': 5, 'an': 4, 'musical': 4, 'nominated': 4, 'at': 4, 'out': 4, 'from': 4, 'years,': 4, 'have': 4, 'other': 4, 'her': 4, 'so': 4, 'american': 3, 'paris': 3, 'who': 3, 'can': 3, 'but': 3, '-': 3, 'they': 3, 'almost': 3, 'must': 3, 'wonderful': 2, 'about': 2, 'meets': 2, 'rich': 2, 'get': 2, 'more': 2, 'golden': 2, 'globe': 2, 'gene': 2, 'kelly': 2, 'especially': 2, 'girl': 2, 'engaged': 2, 'help': 2, 'very': 2, 'good': 2, 'film.': 2, 'won': 2, 'picture,': 2, 'story': 2, 'director': 2, 'vincente': 2, 'any': 2, 'todd': 2, 'sheets': 2, 'come': 2, 'zombie': 2, 'every': 2, 'could': 2, 'better': 2, 'scenes': 2, 'spoiler': 2, 'one': 2, 'dad': 2, 'staircase': 2, 'seen': 2, 'much': 2, 'many': 2, 'people': 2, 'not': 2, 'too': 2, 'way': 2, "don't": 2, 'been': 2, 'which': 2, 'waste': 2, 'money': 2, 'proof': 2, 'famous': 2, 'noel': 2, 'coward': 2, 'spectacular': 2, 'absurd': 2, 'painter': 1, 'living': 1, 'inspiration.': 1, 'woman': 1, 'admires': 1, 'paintings': 1, 'street': 1, 'believes': 1, 'work': 1, 'popular': 1, 'public,': 1, 'e.g.': 1, 'museum.': 1, 'artist': 1, 'jerry': 1, 'mulligan': 1, 'perfect': 1, 'both': 1, 'singing': 1, 'dancing.': 1, 'also': 1, 'main': 1, 'lise': 1, 'bouvier': 1, '(leslie': 1, 'caron)': 1, 'friend.': 1, "can't": 1, 'feelings': 1, 'girl,': 1, 'after': 1, 'finds': 1, 'to.': 1, 'filled': 1, 'nice': 1, 'romance': 1, 'song': 1, 'dance,': 1, 'may': 1, 'drag': 1, 'slightly': 1, 'dancing': 1, 'dream': 1, 'sequence,': 1, 'i.e.': 1, 'ballet,': 1, 'there': 1, 'happy': 1, 'ending.': 1, 'oscars': 1, 'art': 1, 'direction-set': 1, 'decoration,': 1, 'cinematography,': 1, 'costume': 1, 'design,': 1, 'music,': 1, 'scoring': 1, 'writing,': 1, 
'screenplay': 1, 'minnelli': 1, 'editing,': 1, 'bafta': 1, 'source,': 1, 'motion': 1, 'picture': 1, 'musical/comedy,': 1, 'millenni': 1, "(liza's": 1, 'father).': 1, '66': 1, 'stars,': 1, '15': 1, 'stars': 1, 'men,': 1, '"i': 1, 'got': 1, 'rhythm"': 1, '32': 1, 'songs,': 1, '9': 1, 'years': 1, 'musicals,': 1, '39': 1, 'passions,': 1, '68': 1, 'movies,': 1, '58': 1, 'greatest': 1, 'musicals.': 1, 'good!': 1, 'if': 1, 'were': 1, 'admit': 1, 'intended': 1, 'spoof': 1, 'genre,': 1, 'would': 1, 'change': 1, 'my': 1, 'rating': 1, 'eight.': 1, 'try': 1, 'imagine': 1, 'where': 1, 'scene,': 1, 'line,': 1, 'acting': 1, 'nuance': 1, 'designed': 1, 'parody.': 1, 'probably': 1, 'crap': 1, 'alphabet': 1, 'soup,': 1, 'rearrange': 1, 'what': 1, 'left': 1, 'letters,': 1, 'still': 1, 'script.': 1, 'two': 1, 'particular': 1, 'mind': 1, 'think': 1, 'movie.': 1, 'alert!': 1, "mike's": 1, 'walk,': 1, 'repeat': 1, 'walk': 1, 'down': 1, 'jam': 1, 'packed': 1, 'zombies.': 1, 'small': 1, 'though': 1, 'brush': 1, 'up': 1, 'against': 1, 'flailing': 1, 'undead,': 1, 'nothing': 1, 'happens': 1, 'them.': 1, 'reach': 1, 'end,': 1, 'ex-marine': 1, 'turns': 1, 'around,': 1, 'says': 1, '"god': 1, "you're": 1, 'horny': 1, 'bastard",': 1, 'shoots': 1, 'only': 1, 'one.': 1, 'military': 1, 'complex.': 1, 'stabs': 1, 'machete': 1, 'immediately': 1, 'surrounded.': 1, 'camera': 1, 'moves': 1, 'around': 1, 'roughly': 1, 'forty': 1, 'seconds,': 1, 'while': 1, 'surrounded': 1, 'zombies': 1, "arm's": 1, 'length': 1, 'away.': 1, 'then': 1, 'casually': 1, 'runs': 1, 'crowd': 1, 'joins': 1, 'humans.': 1, 'alert': 1, 'over!': 1, 'these': 1, 'believed.': 1, 'still,': 1, 'enjoy': 1, 'comedy': 1, 'because': 1, "it's": 1, 'damn': 1, 'funny.': 1, 'kudos': 1, 'getting': 1, 'having': 1, 'drive': 1, 'make': 1, 'really': 1, 'anything': 1, 'else.': 1, 'ambitious': 1, 'suffers': 1, 'most': 1, 'writer/director': 1, 'paul': 1, 'thomas': 1, "anderson's": 1, 'delusions': 1, 'grandeur.': 1, 'highly': 1, 'derivative': 1, 
'material': 1, "(altman's": 1, '"nashville,"': 1, "lumet's": 1, '"network"),': 1, 'lumbering': 1, 'elephant': 1, 'takes': 1, 'far': 1, 'long': 1, 'nowhere.': 1, 'couple': 1, 'misguided': 1, 'detours': 1, 'along': 1, '(an': 1, 'embarrassing': 1, 'interlude,': 1, 'biblical': 1, 'plague)': 1, 'matters.': 1, 'neither': 1, 'does': 1, 'uneven': 1, 'level': 1, 'performances.': 1, 'bad:': 1, 'william': 1, 'h.': 1, 'macy,': 1, 'whose': 1, 'character': 1, 'storyline': 1, 'easily': 1, 'eliminated': 1, 'altogether;': 1, 'julianne': 1, 'moore,': 1, 'unconvincing': 1, 'angst.': 1, 'how': 1, 'times': 1, 'we': 1, 'see': 1, 'john': 1, 'c.': 1, "reilly's": 1, 'sad': 1, 'sack': 1, 'shtick': 1, '("chicago"': 1, '"the': 1, 'hours"': 1, 'will': 1, 'suffice)?': 1, 'tom': 1, 'cruise': 1, 'comes': 1, 'off': 1, 'well': 1, 'comparison': 1, 'misogynist,': 1, 'foul-mouthed': 1, 'holy': 1, 'roller': 1, 'rather': 1, 'amusing.': 1, 'speaking': 1, 'foul': 1, 'mouths,': 1, 'script': 1, 'loaded': 1, '"f"': 1, 'bombs,': 1, 'lost': 1, 'their': 1, 'impact': 1, 'no': 1, 'time.': 1, 'talk': 1, 'awful': 1, 'soundtrack,': 1, 'full': 1, 'insipid': 1, 'annoying': 1, 'vocals': 1, 'aimee': 1, 'mann.': 1, 'extended': 1, 'rendition': 1, '"one,"': 1, 'maudlin': 1, 'begin': 1, 'with,': 1, 'drove': 1, 'me': 1, 'distraction': 1, 'start': 1, 'should': 1, 'heeded': 1, 'handwriting': 1, 'wall': 1, 'saved': 1, 'myself': 1, 'three': 1, 'hours,': 1, 'time': 1, "i'd": 1, 'pushed': 1, 'brink': 1, 'hell.': 1, 'redeeming': 1, 'feature,': 1, "haven't": 1, 'mentioned': 1, 'reviews,': 1, 'performance': 1, 'bunch,': 1, 'unknown': 1, 'melora': 1, 'walters': 1, 'role': 1, 'claudia,': 1, 'damaged': 1, 'coke': 1, 'fiend': 1, 'bent': 1, 'self-destruction.': 1, 'credibility': 1, 'exceeded': 1, 'all': 1, 'others': 1, 'far.': 1, 'took': 1, 'itself': 1, 'seriously': 1, "didn't": 1, 'know': 1, 'end.': 1, 'astonishing': 1, 'production': 1, 'filmic': 1, 'stupid': 1, 'wasteful': 1, 'politicians.': 1, '(silly)': 1, 'play': 1, 'tennessee': 1, 
'williams': 1, 'directed': 1, '(with': 1, 'dead': 1, 'hand)': 1, 'joseph': 1, 'losey': 1, 'starring': 1, 'taylor': 1, 'burton': 1, 'project': 1, 'filmed': 1, 'cliff-top': 1, 'mountain': 1, 'island': 1, 'mansion': 1, 'mediterranean': 1, 'seemed': 1, 'sure': 1, 'fire': 1, 'winner': 1, 'presented': 1, 'universal': 1, '1967.': 1, 'result': 1, 'tedious': 1, 'defies': 1, 'belief.': 1, 'visually': 1, 'force': 1, 'nature': 1, 'has': 1, 'allowed': 1, 'setting': 1, 'fact': 1, 'real': 1, 'home': 1, 'used': 1, 'instead': 1, 'set.': 1, 'shrill': 1, 'antics': 1, 'screeching': 1, 'taylor,': 1, "burton's": 1, 'half': 1, 'asleep': 1, 'wanderings,': 1, 'loony': 1, 'dialog,': 1, 'laughing': 1, 'himself,': 1, 'ridiculous': 1, 'plot': 1, 'devices': 1, 'costuming': 1, 'simply': 1, 'irritate': 1, 'viewer.': 1, 'boom': 1, 'disgrace,': 1, 'talent': 1, 'clear': 1, 'lauded': 1, 'idiots': 1, 'like': 1, 'rest': 1, "planet's": 1, 'plebs.': 1, 'fun.': 1, 'terrible': 1, 'mad': 1, 'shocking': 1, 'waste.': 1})
defaultdict(<function _default_unk_index at 0x000002565C74E0D0>, {'<unk>': 0, '<pad>': 1, 'the': 2, 'and': 3, 'a': 4, 'of': 5, 'to': 6, 'is': 7, 'was': 8, 'in': 9, 'best': 10, 'for': 11, '100': 12, 'it': 13, 'on': 14, 'this': 15, 'by': 16, 'number': 17, 'that': 18, 'his': 19, 'i': 20, 'be': 21, 'even': 22, 'film': 23, 'just': 24, 'as': 25, 'he': 26, 'movie': 27, 'she': 28, 'when': 29, 'with': 30, 'an': 31, 'at': 32, 'from': 33, 'have': 34, 'her': 35, 'musical': 36, 'nominated': 37, 'other': 38, 'out': 39, 'so': 40, 'years,': 41, '-': 42, 'almost': 43, 'american': 44, 'but': 45, 'can': 46, 'must': 47, 'paris': 48, 'they': 49, 'who': 50, 'about': 51, 'absurd': 52, 'any': 53, 'been': 54, 'better': 55, 'come': 56, 'could': 57, 'coward': 58, 'dad': 59, 'director': 60, "don't": 61, 'engaged': 62, 'especially': 63, 'every': 64, 'famous': 65, 'film.': 66, 'gene': 67, 'get': 68, 'girl': 69, 'globe': 70, 'golden': 71, 'good': 72, 'help': 73, 'kelly': 74, 'many': 75, 'meets': 76, 'money': 77, 'more': 78, 'much': 79, 'noel': 80, 'not': 81, 'one': 82, 'people': 83, 'picture,': 84, 'proof': 85, 'rich': 86, 'scenes': 87, 'seen': 88, 'sheets': 89, 'spectacular': 90, 'spoiler': 91, 'staircase': 92, 'story': 93, 'todd': 94, 'too': 95, 'very': 96, 'vincente': 97, 'waste': 98, 'way': 99, 'which': 100, 'won': 101, 'wonderful': 102, 'zombie': 103})
TEXT.vocab.stoi converts a string into its index in the vocabulary. TEXT.vocab.freqs holds the frequency of every word; it is not affected by min_freq = 2.
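The difference between freqs and stoi can be reproduced with a few lines of plain Python. This is only a sketch, not torchtext code. It assumes lower-casing plus whitespace splitting, which is what the punctuation-attached tokens in the result above (for example 'film' and 'film.') suggest. The sentence is made up.

```python
from collections import Counter

text = "The film was good. The film was long. A good film."
tokens = text.lower().split()   # lower-case, then split on whitespace
freqs = Counter(tokens)         # every token is counted, whatever min_freq is

min_freq = 2
# Keep words seen at least min_freq times: most frequent first, ties alphabetical.
kept = sorted((w for w, c in freqs.items() if c >= min_freq),
              key=lambda w: (-freqs[w], w))
stoi = {w: i for i, w in enumerate(["<unk>", "<pad>"] + kept)}

print(freqs["film"], freqs["film."])      # 2 1
print("film." in freqs, "film." in stoi)  # True False
print(stoi)
# {'<unk>': 0, '<pad>': 1, 'film': 2, 'the': 3, 'was': 4}
```

'film.' keeps its period, so it is counted separately from 'film'. It still shows up in freqs but doesn't make it into stoi.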

