SW정리: Deep Learning with Sequence Data and Text (PyTorch) (1)

Before we begin

The examples in this post are written for PyTorch.
The material is based on Chapter 6 of the book Deep Learning with PyTorch (Vishnu Subramanian).

Working with text data

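Text is one of the most common kinds of sequential data. It can be viewed either as a sequence of characters or as a sequence of words.
Deep learning sequence models such as RNNs and their variants can be applied to problems such as:
-Natural language understanding
-Document classification
-Sentiment classification
Deep learning models cannot work on raw text directly, so the text has to be converted into a numerical representation. This conversion is called vectorization.
Each small unit of text is called a token, and the process of breaking text into tokens is called tokenization. The text is first converted into tokens, and each token is then mapped to a vector.
One-hot encoding and word embedding are the two most popular ways of mapping tokens to vectors.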
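
To make the token-to-vector mapping concrete, here is a minimal one-hot encoding sketch that uses only the Python standard library. The names `tokens`, `vocab`, and `one_hot` are illustrative assumptions and do not come from the book; a real project would more likely use tensors than plain lists.

```python
# Minimal one-hot encoding sketch (standard library only).
# `vocab` and `one_hot` are hypothetical names chosen for this illustration.

text = "Thor has never been this epic"
tokens = text.split()  # word-level tokenization

# Give every unique token an integer index (sorted for a stable order).
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}

def one_hot(token, vocab):
    """Return a list of zeros with a single 1 at the token's index."""
    vector = [0] * len(vocab)
    vector[vocab[token]] = 1
    return vector

vectors = [one_hot(token, vocab) for token in tokens]
for token, vector in zip(tokens, vectors):
    print(token, vector)
```

Each vector is as long as the vocabulary and contains exactly one 1, which is why one-hot vectors become very large and sparse for realistic vocabularies.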

Tokenization

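Tokenization is the process of splitting a given sentence into individual characters or words. Several libraries can do this, but here we use Python's built-in list and split.
As an example, we will work with a short review of the movie Thor: Ragnarok.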
The action scenes were top notch in this movie. Thor has never been this epic in the MCU. He does some pretty epic sht in this movie and he is definitely not under-powered anymore. Thor in unleashed in this, I love that.

Converting text into characters

Python's list function converts a string into a list of its individual characters.
Consider the following example.

thor_review = "The action scenes were top notch in this movie. Thor has never been this epic in the MCU. He does some pretty epic sht in this movie and he is definitely not under-powered anymore. Thor in unleashed in this, I love that."

print(list(thor_review))

Output

(base) E:\> python test.py
['T', 'h', 'e', ' ', 'a', 'c', 't', 'i', 'o', 'n', ' ', 's', 'c', 'e', 'n', 'e', 's', ' ', 'w', 'e', 'r', 'e', ' ', 't', 'o', 'p', ' ', 'n', 'o', 't', 'c', 'h', ' ', 'i', 'n', ' ', 't', 'h', 'i', 's', ' ', 'm', 'o', 'v', 'i', 'e', '.', ' ', 'T', 'h', 'o', 'r', ' ', 'h', 'a', 's', ' ', 'n', 'e', 'v', 'e', 'r', ' ', 'b', 'e', 'e', 'n', ' ', 't', 'h', 'i', 's', ' ', 'e', 'p', 'i', 'c', ' ', 'i', 'n', ' ', 't', 'h', 'e', ' ', 'M', 'C', 'U', '.', ' ', 'H', 'e', ' ', 'd', 'o', 'e', 's', ' ', 's', 'o', 'm', 'e', ' ', 'p', 'r', 'e', 't', 't', 'y', ' ', 'e', 'p', 'i', 'c', ' ', 's', 'h', 't', ' ', 'i', 'n', ' ', 't', 'h', 'i', 's', ' ', 'm', 'o', 'v', 'i', 'e', ' ', 'a', 'n', 'd', ' ', 'h', 'e', ' ', 'i', 's', ' ', 'd', 'e', 'f', 'i', 'n', 'i', 't', 'e', 'l', 'y', ' ', 'n', 'o', 't', ' ', 'u', 'n', 'd', 'e', 'r', '-', 'p', 'o', 'w', 'e', 'r', 'e', 'd', ' ', 'a', 'n', 'y', 'm', 'o', 'r', 'e', '.', ' ', 'T', 'h', 'o', 'r', ' ', 'i', 'n', ' ', 'u', 'n', 'l', 'e', 'a', 's', 'h', 'e', 'd', ' ', 'i', 'n', ' ', 't', 'h', 'i', 's', ',', ' ', 'I', ' ', 'l', 'o', 'v', 'e', ' ', 't', 'h', 'a', 't', '.']

Converting text into words

We can use Python's split method for this. It takes a separator as its argument and splits on whitespace by default. When whitespace is the word separator, it can be used as follows.

thor_review = "The action scenes were top notch in this movie. Thor has never been this epic in the MCU. He does some pretty epic sht in this movie and he is definitely not under-powered anymore. Thor in unleashed in this, I love that."

print(thor_review.split())

Output

(base) E:\> python test.py
['The', 'action', 'scenes', 'were', 'top', 'notch', 'in', 'this', 'movie.', 'Thor', 'has', 'never', 'been', 'this', 'epic', 'in', 'the', 'MCU.', 'He', 'does', 'some', 'pretty', 'epic', 'sht', 'in', 'this', 'movie', 'and', 'he', 'is', 'definitely', 'not', 'under-powered', 'anymore.', 'Thor', 'in', 'unleashed', 'in', 'this,', 'I', 'love', 'that.']
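
The difference between the default behavior and an explicit separator is easy to miss, so here is a small sketch of both. The `sample` string is made up for illustration; only the built-in `str.split` is used.

```python
sample = "Thor  has never\tbeen this epic"  # note the double space and the tab

# With no argument, split() treats any run of whitespace as a single separator.
default_tokens = sample.split()

# With an explicit separator, every occurrence splits the string,
# so the double space produces an empty string and the tab is left alone.
space_tokens = sample.split(" ")

# Any other delimiter works the same way, for example a comma.
csv_tokens = "epic,movie,Thor".split(",")

print(default_tokens)
print(space_tokens)
print(csv_tokens)
```

Note that split does not strip punctuation, which is why tokens such as 'movie.' and 'movie' appear as distinct words in the output above.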

N-gram representation

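So far we have seen text represented as characters and as words. Sometimes it is useful to look at two, three, or more words together. N-grams are groups of consecutive words extracted from a given text, where N is the number of words in each group. They are easy to generate with the Python nltk package. When N=2 the groups are called bigrams; applying this to the earlier example gives the result below.
N=2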

from nltk import ngrams

thor_review = "The action scenes were top notch in this movie. Thor has never been this epic in the MCU. He does some pretty epic sht in this movie and he is definitely not under-powered anymore. Thor in unleashed in this, I love that."

print(list(ngrams(thor_review.split(),2)))

Output

(base) E:\> python test.py
[('The', 'action'), ('action', 'scenes'), ('scenes', 'were'), ('were', 'top'), ('top', 'notch'), ('notch', 'in'), ('in', 'this'), ('this', 'movie.'), ('movie.', 'Thor'), ('Thor', 'has'), ('has', 'never'), ('never', 'been'), ('been', 'this'), ('this', 'epic'), ('epic', 'in'), ('in', 'the'), ('the', 'MCU.'), ('MCU.', 'He'), ('He', 'does'), ('does', 'some'), ('some', 'pretty'), ('pretty', 'epic'), ('epic', 'sht'), ('sht', 'in'), ('in', 'this'), ('this', 'movie'), ('movie', 'and'), ('and', 'he'), ('he', 'is'), ('is', 'definitely'), ('definitely', 'not'), ('not', 'under-powered'), ('under-powered', 'anymore.'), ('anymore.', 'Thor'), ('Thor', 'in'), ('in', 'unleashed'), ('unleashed', 'in'), ('in', 'this,'), ('this,', 'I'), ('I', 'love'), ('love', 'that.')]

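Below is the result for N=3. Looking closely, the words are grouped N at a time, but consecutive groups overlap because each group shares its middle words with its neighbors.
N=3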

print(list(ngrams(thor_review.split(),3)))

Output

(base) E:\work\ai\pytorch> python test.py
[('The', 'action', 'scenes'), ('action', 'scenes', 'were'), ('scenes', 'were', 'top'), ('were', 'top', 'notch'), ('top', 'notch', 'in'), ('notch', 'in', 'this'), ('in', 'this', 'movie.'), ('this', 'movie.', 'Thor'), ('movie.', 'Thor', 'has'), ('Thor', 'has', 'never'), ('has', 'never', 'been'), ('never', 'been', 'this'), ('been', 'this', 'epic'), ('this', 'epic', 'in'), ('epic', 'in', 'the'), ('in', 'the', 'MCU.'), ('the', 'MCU.', 'He'), ('MCU.', 'He', 'does'), ('He', 'does', 'some'), ('does', 'some', 'pretty'), ('some', 'pretty', 'epic'), ('pretty', 'epic', 'sht'), ('epic', 'sht', 'in'), ('sht', 'in', 'this'), ('in', 'this', 'movie'), ('this', 'movie', 'and'), ('movie', 'and', 'he'), ('and', 'he', 'is'), ('he', 'is', 'definitely'), ('is', 'definitely', 'not'), ('definitely', 'not', 'under-powered'), ('not', 'under-powered', 'anymore.'), ('under-powered', 'anymore.', 'Thor'), ('anymore.', 'Thor', 'in'), ('Thor', 'in', 'unleashed'), ('in', 'unleashed', 'in'), ('unleashed', 'in', 'this,'), ('in', 'this,', 'I'), ('this,', 'I', 'love'), ('I', 'love', 'that.')]
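
The overlap between neighboring groups comes from sliding a window of N words one position at a time. The sketch below reproduces that idea in plain Python; `simple_ngrams` is a hypothetical helper written for this illustration and is not how nltk implements ngrams.

```python
def simple_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (a sliding window)."""
    # Zip together n copies of the list, each shifted one position further.
    return list(zip(*(tokens[i:] for i in range(n))))

words = "Thor has never been this epic".split()
bigrams = simple_ngrams(words, 2)
trigrams = simple_ngrams(words, 3)

print(bigrams)
print(trigrams)
# A sequence of k tokens yields k - n + 1 n-grams.
print(len(words), len(bigrams), len(trigrams))
```

Because the window moves by one word each step, every word except those at the edges appears in N different groups.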

N-grams are also used in tasks such as spelling correction and text summarization.

Sunday, August 19, 2018