Tuesday, November 27, 2018

Topic Modeling (KoNLPy + NLTK + Gensim)

OS - Centos6
Python - 3.4.8
Java - 1.7
KoNLPy - 0.5.1
NLTK - 3.4


1. Installation
 - Written based on the content of the linked references
 - Upgraded from the existing Python 2.7 to Python 3.4, along with compatible versions of the libraries
 - KoNLPy
  1. Install dependency
$ wget https://www.python.org/ftp/python/3.4.*/Python-3.4.*.tar.xz
$ tar xf Python-3.*
$ cd Python-3.*
$ ./configure
$ make
$ sudo make altinstall

  2. Install KoNLPy
$ pip3.4 install konlpy

  3. Install Mecab
$ sudo yum install curl
$ bash <(curl -s https://raw.githubusercontent.com/konlpy/konlpy/master/scripts/mecab.sh)
ref : http://konlpy.org/en/latest/install/#centos

* Error : 14: PYCURL ERROR 6 ~~~ ( couldn't resolve host )
=> Add a `nameserver 8.8.8.8` line (Google's nameserver) to /etc/resolv.conf
ref : https://www.centos.org/forums/viewtopic.php?t=8892

* Installing pip3
$ sudo yum install python34-setuptools
$ sudo easy_install-3.4 pip
ref : https://stackoverflow.com/questions/32618686/how-to-install-pip-in-centos-7

* Error : (during konlpy installation) Cache entry deserialization failed, entry ignored (network error; resolved after reconnecting)
ref : https://stackoverflow.com/questions/49671215/cache-entry-deserialization-failed-entry-ignored

  4. Test Run
$ python3
>>> from konlpy.tag import Kkma
>>> from konlpy.utils import pprint
>>> kkma = Kkma()
>>> pprint(kkma.sentences(u'네, 안녕하세요. 반갑습니다.'))
  [네, 안녕하세요..,
   반갑습니다.]

* Error : ImportError: No module named 'jpype1'
=> Switch to a compatible JDK version
$ alternatives --config java
Select the desired version
ref : http://blog.daum.net/drinker/25

* Error : ImportError: No module named 'konlpy'
=> Add the install path to sys.path (see the sketch after the refs below)
ref : https://stackoverflow.com/questions/23417941/import-error-no-module-named-does-exist/23418662
ref : https://askubuntu.com/questions/470982/how-to-add-a-python-module-to-syspath
ref : https://stackoverflow.com/questions/16114391/adding-directory-to-sys-path-pythonpath
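
A minimal sketch of this workaround; the site-packages path is only an assumed example and should be replaced with the directory konlpy was actually installed into:
$ python3
>>> import sys
>>> sys.path.append('/usr/local/lib/python3.4/site-packages')   # example path only
>>> import konlpy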



 - NLTK
  1. Install dependency
$ sudo pip install -U numpy

  2. Install NLTK
$ sudo pip install -U nltk

  3. Test Run
$ python3
>>> import nltk
ref : https://www.nltk.org/install.html
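
A slightly fuller sanity check beyond the bare import (a sketch; it assumes the 'punkt' tokenizer data can be downloaded on this machine):
>>> import nltk
>>> nltk.download('punkt')                     # tokenizer data used by word_tokenize
>>> nltk.word_tokenize('NLTK is installed and working.')
['NLTK', 'is', 'installed', 'and', 'working', '.']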



 - Gensim
  1. Dependencies
   - Python >= 2.7 (tested with versions 2.7, 3.5 and 3.6)
   - NumPy >= 1.11.3
   - SciPy >= 0.18.1
   - Six >= 1.5.0
   - smart_open >= 1.2.1

  2. Install Gensim
$ pip install --upgrade gensim



 - twython (needed to make the Twitter API easier to use)
$ sudo pip install twython





2. Topic Modeling
 - Design a probabilistic model that discovers the topics of a document from the words that frequently appear in it
 - e.g. LDA, LSI, HDP
 - Test Run
  1. KoNLPy setup
   - Location of the Korean text files: stored in advance in the kobill directory under konlpy's corpus (a quick check follows the path below)
/?????/Python/3.4/site-packages/konlpy/data/corpus/kobill
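
A quick check that the corpus is visible from Python (a sketch using the same kobill loader as the code below):
from konlpy.corpus import kobill
print(kobill.fileids())   # the bundled bill documents (*.txt) under the kobill corpus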

  2. Tokenizer setup
   - Tokenize with KoNLPy's Twitter (Okt) tagger, used in the same way as an NLTK tokenizer
from konlpy.corpus import kobill
docs_ko = [kobill.open(i).read() for i in kobill.fileids()]
from konlpy.tag import Okt; t = Okt()
pos = lambda d: ['/'.join(p) for p in t.pos(d)]
texts_ko = [pos(doc) for doc in docs_ko]

  3. topic modeling
from konlpy.corpus import kobill
docs_ko = [kobill.open(i).read() for i in kobill.fileids()]
from konlpy.tag import Okt; t = Okt()
pos = lambda d: ['/'.join(p) for p in t.pos(d, stem=True, norm=True)]
texts_ko = [pos(doc) for doc in docs_ko]
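
A small illustration of what norm and stem change for a single phrase (a sketch; the example sentence is arbitrary):
from konlpy.tag import Okt
t = Okt()
print(t.pos('이것도 되나욬ㅋㅋ'))                        # raw tags
print(t.pos('이것도 되나욬ㅋㅋ', norm=True, stem=True))  # norm: normalize token variants, stem: reduce words to dictionary form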

   - Encode the tokens as integers
from gensim import corpora
dictionary_ko = corpora.Dictionary(texts_ko)
dictionary_ko.save('ko.dict')
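
To inspect the integer encoding (a sketch; the exact tokens and ids depend on the corpus):
print(list(dictionary_ko.token2id.items())[:5])   # (token, integer id) pairs
print(dictionary_ko.doc2bow(texts_ko[0])[:5])     # (token id, count) pairs for the first bill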

   - Compute TF-IDF
from gensim import models
tf_ko = [dictionary_ko.doc2bow(text) for text in texts_ko]
tfidf_model_ko = models.TfidfModel(tf_ko)
tfidf_ko = tfidf_model_ko[tf_ko]
corpora.MmCorpus.serialize('ko.mm', tfidf_ko)
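
The saved files can be reloaded later without re-tokenizing; a sketch, with the variable names chosen here only for illustration:
from gensim import corpora
loaded_dict = corpora.Dictionary.load('ko.dict')   # the dictionary saved above
loaded_corpus = corpora.MmCorpus('ko.mm')          # the TF-IDF corpus saved above, streamed from disk
print(loaded_dict, len(loaded_corpus))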

   - Train the topic models
   - LSI (Latent Semantic Indexing)
ntopics, nwords = 3, 5
lsi_ko = models.lsimodel.LsiModel(tfidf_ko, id2word=dictionary_ko, num_topics=ntopics)
print(lsi_ko.print_topics(num_topics=ntopics, num_words=nwords))

   - LDA (Latent Dirichlet Allocation)
import numpy as np; np.random.seed(42)
lda_ko = models.ldamodel.LdaModel(tfidf_ko, id2word=dictionary_ko, num_topics=ntopics)
print(lda_ko.print_topics(num_topics=ntopics, num_words=nwords))

   - HDP (Hierarchical Dirichlet Process)
import numpy as np; np.random.seed(42)
hdp_ko = models.hdpmodel.HdpModel(tfidf_ko, id2word=dictionary_ko)
print(hdp_ko.print_topics(num_topics=ntopics, num_words=nwords))

   - Scoring document
bow = tfidf_model_ko[dictionary_ko.doc2bow(texts_ko[0])]
sorted(lsi_ko[bow], key=lambda x: x[1], reverse=True)
sorted(lda_ko[bow], key=lambda x: x[1], reverse=True)
sorted(hdp_ko[bow], key=lambda x: x[1], reverse=True)
bow = tfidf_model_ko[dictionary_ko.doc2bow(texts_ko[1])]
sorted(lsi_ko[bow], key=lambda x: x[1], reverse=True)
sorted(lda_ko[bow], key=lambda x: x[1], reverse=True)
sorted(hdp_ko[bow], key=lambda x: x[1], reverse=True)
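
One way to read these scores: the (topic id, weight) pairs are sorted by weight, so the first entry is the dominant topic of that document. A sketch that prints the dominant LDA topic for every bill:
for i, text in enumerate(texts_ko):
    bow = tfidf_model_ko[dictionary_ko.doc2bow(text)]
    topic_id, weight = sorted(lda_ko[bow], key=lambda x: x[1], reverse=True)[0]
    print(kobill.fileids()[i], topic_id, round(weight, 3))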

   - Results
    1. LSI
    [(0, '0.528*"육아휴직/Noun" + 0.248*"만/Noun" + 0.232*"×/Foreign" + 0.204*"고용/Noun" + 0.183*"자녀/Noun"'),
    (1, '0.423*"파견/Noun" + 0.419*"부대/Noun" + 0.263*"\n\n/Foreign" + 0.248*"UAE/Alpha" + 0.229*"○/Foreign"'),
    (2, '-0.308*"결혼/Noun" + -0.277*"손해/Noun" + -0.263*"예고/Noun" + -0.234*"사업자/Noun" + -0.202*"입법/Noun"')]
    2. LDA
    [(0, '0.001*"육아휴직/Noun" + 0.001*"결혼/Noun" + 0.001*"만/Noun" + 0.001*"×/Foreign" + 0.001*"자녀/Noun"'),
    (1, '0.001*"손해/Noun" + 0.001*"학위/Noun" + 0.001*"사업자/Noun" + 0.001*"육아휴직/Noun" + 0.001*"간호/Noun"'),
    (2, '0.003*"육아휴직/Noun" + 0.001*"×/Foreign" + 0.001*"고용/Noun" + 0.001*"파견/Noun" + 0.001*"부대/Noun"')]
    3. HDP
    [(0, '0.004*2011/Number + 0.003*육아휴직/Noun + 0.003*사단/Noun + 0.003*20/Number + 0.003*자/Suffix'),
    (1, '0.005*취득/Noun + 0.004*내지/Noun + 0.004*수/Modifier + 0.003*높이/Noun + 0.003*30억원/Number'),
    (2, '0.005*억원/Noun + 0.004*저하/Noun + 0.004*1년/Number + 0.004*기록/Noun + 0.004*까지임/Foreign')]

   - Interpreting the results
    Topic 0 is represented as 0.528*"육아휴직" + 0.248*"만" + 0.232*"×" + 0.204*"고용" + 0.183*"자녀"
    The top 3 keywords contributing to this topic are '육아휴직', '만', and '×'
    The weight of '육아휴직' for topic 0 is 0.528
    The weight reflects how important a keyword is to that topic (the weights can also be read off the models directly, as sketched below)
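
A sketch of pulling the same weights out programmatically instead of parsing the printed strings, using gensim's show_topic:
print(lsi_ko.show_topic(0, topn=5))   # [(token, weight), ...] for topic 0
print(lda_ko.show_topic(0, topn=5))   # [(token, probability), ...] for topic 0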

ref : http://www.engear.net/wp/topic-modeling-gensimpython/
ref : https://www.nextobe.com/single-post/2017/06/28/%ED%95%9C%EA%B8%80-%EB%8D%B0%EC%9D%B4%ED%84%B0-%EB%A8%B8%EC%8B%A0%EB%9F%AC%EB%8B%9D-%EB%B0%8F-word2vec%EC%9D%84-%EC%9D%B4%EC%9A%A9%ED%95%9C-%EC%9C%A0%EC%82%AC%EB%8F%84-%EB%B6%84%EC%84%9D