Python - 3.4.8
Java - 1.7
KoNLPy - 0.5.1
NLTK - 3.4
1. Installation
- Written based on the content of the linked pages
- Upgraded from the existing Python 2.7 to Python 3.4, and upgraded the libraries to compatible versions
- KoNLPy
1. Install dependency
$ wget https://www.python.org/ftp/python/3.4.*/Python-3.4.*.tar.xz
$ tar xf Python-3.*
$ cd Python-3.*
$ ./configure
$ make
$ sudo make altinstall
2. Install KoNLPy
$ pip3.4 install konlpy
3. Install Mecab
$ sudo yum install curl
$ bash <(curl -s https://raw.githubusercontent.com/konlpy/konlpy/master/scripts/mecab.sh)
ref : http://konlpy.org/en/latest/install/#centos
* Error : 14: PYCURL ERROR 6 ~~~ ( couldn't resolve host )
=> Add the Google nameserver line `nameserver 8.8.8.8` to /etc/resolv.conf
ref : https://www.centos.org/forums/viewtopic.php?t=8892
* Installing pip3
$ sudo yum install python34-setuptools
$ sudo easy_install-3.4 pip
ref : https://stackoverflow.com/questions/32618686/how-to-install-pip-in-centos-7
* Error : (while installing konlpy) Cache entry deserialization failed, entry ignored (a network error; resolved after reconnecting)
ref : https://stackoverflow.com/questions/49671215/cache-entry-deserialization-failed-entry-ignored
4. Test Run
$ python3
>>> from konlpy.tag import Kkma
>>> from konlpy.utils import pprint
>>> kkma = Kkma()
>>> pprint(kkma.sentences(u'네, 안녕하세요. 반갑습니다.'))
[네, 안녕하세요..,반갑습니다.]
* Error : importerror no module named 'jpype1'
=> Switch to a compatible JDK version
$ alternatives --config java
Select the desired version at the prompt
ref : http://blog.daum.net/drinker/25
* Error : ImportError: No module named 'konlpy'
=> Add the install path to sys.path
ref : https://stackoverflow.com/questions/23417941/import-error-no-module-named-does-exist/23418662
ref : https://askubuntu.com/questions/470982/how-to-add-a-python-module-to-syspath
ref : https://stackoverflow.com/questions/16114391/adding-directory-to-sys-path-pythonpath
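The sys.path fix above can be sketched roughly as follows. The directory name and the helper `add_to_sys_path` are hypothetical, chosen only for illustration; substitute the directory where pip actually installed konlpy on your machine.

```python
import sys


def add_to_sys_path(directory):
    """Append directory to sys.path unless it is already listed.

    Returns True if the path was added, False if it was already present.
    """
    if directory in sys.path:
        return False
    sys.path.append(directory)
    return True


# Hypothetical location; replace with the real site-packages directory.
site_dir = "/opt/example/python3.4/site-packages"
added = add_to_sys_path(site_dir)
print(added, site_dir in sys.path)
```

A change made this way only lasts for the current interpreter session. Setting the PYTHONPATH environment variable is the usual way to make it persistent.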
- NLTK
1. Install dependency
$ sudo pip install -U numpy
2. Install NLTK
$ sudo pip install -U nltk
3. Test Run
$ python3
>>> import nltk
ref : https://www.nltk.org/install.html
- Gensim
1. Dependencies
- Python >= 2.7 (tested with versions 2.7, 3.5 and 3.6)
- NumPy >= 1.11.3
- SciPy >= 0.18.1
- Six >= 1.5.0
- smart_open >= 1.2.1
2. Install Gensim
$ pip install --upgrade gensim
- twython (needed to use the Twitter API easily)
$ sudo pip install twython
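Before moving on, it may help to confirm that every package installed above is visible to the interpreter you plan to use. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is an assumption made for this example.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages installed in this section.
required = ["konlpy", "nltk", "gensim", "twython"]
print("missing:", missing_modules(required))
```

An empty list means all four packages can be found. Any name that is printed was probably installed for a different Python version (for example by `pip` rather than `pip3.4`).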
2. Topic Modeling
- Design a probabilistic model that finds the topics of a document from the words that appear frequently in it
- LDA, LSI, HDP, etc.
- Test Run
1. KoNLPy setup
- Location of the Korean files: stored in advance in the kobill directory under konlpy's corpus directory
/?????/Python/3.4/site-packages/konlpy/data/corpus/kobill
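2. Tokenizer setup
- Tokenize with the Twitter tagger (named Okt since KoNLPy 0.5), in the same style as nltk
from konlpy.corpus import kobill
docs_ko = [kobill.open(i).read() for i in kobill.fileids()]
# Okt is the current name of the former Twitter tagger
from konlpy.tag import Okt; t = Okt()
# Join each (word, tag) pair into a single "word/tag" token
pos = lambda d: ['/'.join(p) for p in t.pos(d)]
texts_ko = [pos(doc) for doc in docs_ko]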
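To make the "word/tag" token format concrete without needing KoNLPy or Java, here is a small sketch of what the `pos` lambda does to the tagger output. The tagged pairs are hypothetical sample input, not actual Okt output, and `join_tags` is a name introduced only for this example.

```python
def join_tags(tagged):
    """Turn [(word, tag), ...] pairs into ["word/tag", ...] tokens."""
    return ['/'.join(pair) for pair in tagged]


# Hypothetical tagger output in the (word, tag) form a POS tagger returns.
sample_tagged = [("육아휴직", "Noun"), ("고용", "Noun")]
tokens = join_tags(sample_tagged)
print(tokens)
```

Keeping the tag inside the token means the same surface form with different tags is treated as two different vocabulary entries later on.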
3. Topic modeling
from konlpy.corpus import kobill
docs_ko = [kobill.open(i).read() for i in kobill.fileids()]
# Okt is the current name of the former Twitter tagger
from konlpy.tag import Okt; t = Okt()
# stem=True and norm=True reduce spelling and inflection variants
pos = lambda d: ['/'.join(p) for p in t.pos(d, stem=True, norm=True)]
texts_ko = [pos(doc) for doc in docs_ko]
- Encode tokens as integers
from gensim import corpora
dictionary_ko = corpora.Dictionary(texts_ko)
dictionary_ko.save('ko.dict')
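As a rough picture of what integer encoding and the later `doc2bow` call produce, the following standard-library sketch builds a token-to-id mapping and converts each document into sorted (id, count) pairs. It is not gensim's implementation: ids are assigned here in first-seen order, which may differ from the ids gensim assigns, and `build_vocab`, `to_bow`, and the toy documents are assumptions for illustration.

```python
from collections import Counter


def build_vocab(texts):
    """Assign each distinct token an integer id in first-seen order."""
    vocab = {}
    for tokens in texts:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab


def to_bow(tokens, vocab):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(vocab[tok] for tok in tokens if tok in vocab)
    return sorted(counts.items())


# Toy documents made up for this example.
toy_texts = [
    ["육아휴직/Noun", "고용/Noun", "육아휴직/Noun"],
    ["파견/Noun", "부대/Noun"],
]
vocab = build_vocab(toy_texts)
bows = [to_bow(tokens, vocab) for tokens in toy_texts]
print(bows)
```

Tokens that are missing from the vocabulary are silently skipped in `to_bow`, which is a design choice of this sketch.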
- Compute TF-IDF
from gensim import models
tf_ko = [dictionary_ko.doc2bow(text) for text in texts_ko]
tfidf_model_ko = models.TfidfModel(tf_ko)
tfidf_ko = tfidf_model_ko[tf_ko]
corpora.MmCorpus.serialize('ko.mm', tfidf_ko)
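The weighting step can be sketched in plain Python as below. This assumes the weighting scheme gensim is understood to use by default (raw term frequency times log base 2 of N/df, followed by L2 normalization, with zero-weight terms dropped); treat that as an assumption and check the gensim documentation before relying on exact numbers. The function name `tfidf` and the toy input are made up for this example.

```python
import math
from collections import Counter


def tfidf(bows):
    """Weight (token_id, count) pairs by tf * log2(N / df), then L2-normalize."""
    n_docs = len(bows)
    # Document frequency: in how many documents each token id occurs.
    df = Counter()
    for bow in bows:
        df.update(token_id for token_id, _ in bow)

    weighted_docs = []
    for bow in bows:
        weights = [(tid, tf * math.log2(n_docs / df[tid])) for tid, tf in bow]
        norm = math.sqrt(sum(w * w for _, w in weights))
        # Drop terms with zero weight (they occur in every document).
        weighted_docs.append(
            [(tid, w / norm) for tid, w in weights if norm > 0 and w > 0]
        )
    return weighted_docs


# Toy bag-of-words corpus: token 0 occurs in both documents.
toy_bows = [[(0, 1), (1, 2)], [(0, 1), (2, 1)]]
toy_tfidf = tfidf(toy_bows)
print(toy_tfidf)
```

Token 0 appears in every toy document, so its weight is zero and it drops out. This is why very common tokens rarely appear among the top topic keywords.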
- Train the topic models
- LSI (Latent Semantic Indexing)
ntopics, nwords = 3, 5
lsi_ko = models.lsimodel.LsiModel(tfidf_ko, id2word=dictionary_ko, num_topics=ntopics)
print(lsi_ko.print_topics(num_topics=ntopics, num_words=nwords))
- LDA (Latent Dirichlet Allocation)
import numpy as np; np.random.seed(42)
lda_ko = models.ldamodel.LdaModel(tfidf_ko, id2word=dictionary_ko, num_topics=ntopics)
print(lda_ko.print_topics(num_topics=ntopics, num_words=nwords))
- HDP (Hierarchical Dirichlet Process)
import numpy as np; np.random.seed(42)
hdp_ko = models.hdpmodel.HdpModel(tfidf_ko, id2word=dictionary_ko)
print(hdp_ko.print_topics(num_topics=ntopics, num_words=nwords))
- Scoring document
# First document
bow = tfidf_model_ko[dictionary_ko.doc2bow(texts_ko[0])]
sorted(lsi_ko[bow], key=lambda x: x[1], reverse=True)
sorted(lda_ko[bow], key=lambda x: x[1], reverse=True)
sorted(hdp_ko[bow], key=lambda x: x[1], reverse=True)
# Second document
bow = tfidf_model_ko[dictionary_ko.doc2bow(texts_ko[1])]
sorted(lsi_ko[bow], key=lambda x: x[1], reverse=True)
sorted(lda_ko[bow], key=lambda x: x[1], reverse=True)
sorted(hdp_ko[bow], key=lambda x: x[1], reverse=True)
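The scoring step sorts (topic_id, weight) pairs so that the strongest topic comes first. The sketch below reproduces that sort on hypothetical weights and adds an optional ranking by magnitude, which may be worth considering for LSI because its weights can be negative. The helper `rank_topics` and the sample values are assumptions, not output from the models above.

```python
def rank_topics(topic_weights, by_magnitude=False):
    """Sort (topic_id, weight) pairs from strongest to weakest.

    With by_magnitude=True the absolute value is used, so a strongly
    negative weight ranks ahead of a weakly positive one.
    """
    if by_magnitude:
        return sorted(topic_weights, key=lambda pair: abs(pair[1]), reverse=True)
    return sorted(topic_weights, key=lambda pair: pair[1], reverse=True)


# Hypothetical document-topic weights, in the form a model lookup returns.
sample = [(0, 0.12), (1, -0.45), (2, 0.30)]
by_value = rank_topics(sample)
by_abs = rank_topics(sample, by_magnitude=True)
print(by_value)
print(by_abs)
```

For LDA and HDP the weights are probabilities, so both orderings give the same result.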
- Results
1. LSI
[(0, '0.528*"육아휴직/Noun" + 0.248*"만/Noun" + 0.232*"×/Foreign" + 0.204*"고용/Noun" + 0.183*"자녀/Noun"'),
(1, '0.423*"파견/Noun" + 0.419*"부대/Noun" + 0.263*"\n\n/Foreign" + 0.248*"UAE/Alpha" + 0.229*"○/Foreign"'),
(2, '-0.308*"결혼/Noun" + -0.277*"손해/Noun" + -0.263*"예고/Noun" + -0.234*"사업자/Noun" + -0.202*"입법/Noun"')]
2. LDA
[(0, '0.001*"육아휴직/Noun" + 0.001*"결혼/Noun" + 0.001*"만/Noun" + 0.001*"×/Foreign" + 0.001*"자녀/Noun"'),
(1, '0.001*"손해/Noun" + 0.001*"학위/Noun" + 0.001*"사업자/Noun" + 0.001*"육아휴직/Noun" + 0.001*"간호/Noun"'),
(2, '0.003*"육아휴직/Noun" + 0.001*"×/Foreign" + 0.001*"고용/Noun" + 0.001*"파견/Noun" + 0.001*"부대/Noun"')]
3. HDP
[(0, '0.004*2011/Number + 0.003*육아휴직/Noun + 0.003*사단/Noun + 0.003*20/Number + 0.003*자/Suffix'),
(1, '0.005*취득/Noun + 0.004*내지/Noun + 0.004*수/Modifier + 0.003*높이/Noun + 0.003*30억원/Number'),
(2, '0.005*억원/Noun + 0.004*저하/Noun + 0.004*1년/Number + 0.004*기록/Noun + 0.004*까지임/Foreign')]
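- Interpreting the results
Topic 0 is expressed as 0.528*"육아휴직" + 0.248*"만" + 0.232*"×" + 0.204*"고용" + 0.183*"자녀".
The top three keywords contributing to this topic are '육아휴직' (parental leave), '만', and '×'.
The weight of '육아휴직' in topic 0 is 0.528.
A weight reflects how important the keyword is to that topic.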
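The reading above can be automated by splitting a printed topic string into (weight, token) pairs. This is a minimal sketch that assumes the quoted `weight*"token"` format shown in the LSI and LDA results; the HDP output shown earlier has no quotes, so this pattern would not match it. `TERM` and `parse_topic` are names introduced only for this example.

```python
import re

# Matches terms such as 0.528*"육아휴직/Noun" or -0.308*"결혼/Noun".
TERM = re.compile(r'(-?\d+(?:\.\d+)?)\*"([^"]+)"')


def parse_topic(topic_str):
    """Split a printed topic string into (weight, token) pairs."""
    return [(float(weight), token) for weight, token in TERM.findall(topic_str)]


# LSI topic 0, copied from the results above.
topic0 = ('0.528*"육아휴직/Noun" + 0.248*"만/Noun" + 0.232*"×/Foreign" '
          '+ 0.204*"고용/Noun" + 0.183*"자녀/Noun"')
terms = parse_topic(topic0)

# Top three keywords by absolute weight.
top3 = [tok for _, tok in sorted(terms, key=lambda t: abs(t[0]), reverse=True)[:3]]
print(top3)
```

Ranking by absolute weight keeps the helper usable for topics like LSI topic 2, where all weights are negative.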
ref : http://www.engear.net/wp/topic-modeling-gensimpython/
ref : https://www.nextobe.com/single-post/2017/06/28/%ED%95%9C%EA%B8%80-%EB%8D%B0%EC%9D%B4%ED%84%B0-%EB%A8%B8%EC%8B%A0%EB%9F%AC%EB%8B%9D-%EB%B0%8F-word2vec%EC%9D%84-%EC%9D%B4%EC%9A%A9%ED%95%9C-%EC%9C%A0%EC%82%AC%EB%8F%84-%EB%B6%84%EC%84%9D