Python - 3.4.8
Java - 1.7
KoNLPy - 0.5.1
NLTK - 3.4
1. Installation
- Written based on the content of the linked pages
- Upgraded from the existing Python 2.7 to Python 3.4, and upgraded the libraries to compatible versions
- KoNLPy
1. Install dependency
$ wget https://www.python.org/ftp/python/3.4.*/Python-3.4.*.tar.xz
$ tar xf Python-3.*
$ cd Python-3.*
$ ./configure
$ make
$ sudo make altinstall
2. Install KoNLPy
$ pip3.4 install konlpy
3. Install Mecab
$ sudo yum install curl
$ bash <(curl -s https://raw.githubusercontent.com/konlpy/konlpy/master/scripts/mecab.sh)
ref : http://konlpy.org/en/latest/install/#centos
* Error : 14: PYCURL ERROR 6 ~~~ ( couldn't resolve host )
=> Add the Google nameserver line `nameserver 8.8.8.8` to /etc/resolv.conf
ref : https://www.centos.org/forums/viewtopic.php?t=8892
* Installing pip3
$ sudo yum install python34-setuptools
$ sudo easy_install-3.4 pip
ref : https://stackoverflow.com/questions/32618686/how-to-install-pip-in-centos-7
* Error : (while installing konlpy) Cache entry deserialization failed, entry ignored (network error; resolved after reconnecting)
ref : https://stackoverflow.com/questions/49671215/cache-entry-deserialization-failed-entry-ignored
4. Test Run
$ python3
>>> from konlpy.tag import Kkma
>>> from konlpy.utils import pprint
>>> kkma = Kkma()
>>> pprint(kkma.sentences(u'네, 안녕하세요. 반갑습니다.'))
[네, 안녕하세요..,반갑습니다.]
* Error : importerror no module named 'jpype1'
=> Switch to a compatible JDK version
$ alternatives --config java
(select the desired version)
ref : http://blog.daum.net/drinker/25
* Error : ImportError: No module named 'konlpy'
=> Add the path to sys.path
ref : https://stackoverflow.com/questions/23417941/import-error-no-module-named-does-exist/23418662
ref : https://askubuntu.com/questions/470982/how-to-add-a-python-module-to-syspath
ref : https://stackoverflow.com/questions/16114391/adding-directory-to-sys-path-pythonpath
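A minimal sketch of the sys.path fix above, assuming the package was installed into a directory that the interpreter does not search. The directory below is a hypothetical placeholder, not a path from the original setup, and the helper names `ensure_on_path` and `is_importable` are made up for this illustration. Substitute the site-packages directory that actually contains konlpy.

```python
import importlib.util
import sys


def ensure_on_path(directory):
    """Append `directory` to sys.path if it is not already listed."""
    if directory not in sys.path:
        sys.path.append(directory)
    return directory in sys.path


def is_importable(module_name):
    """Return True if the interpreter can locate the top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Hypothetical location; replace with the real site-packages directory.
extra_dir = "/path/to/site-packages"
ensure_on_path(extra_dir)
print("konlpy importable:", is_importable("konlpy"))
```

Appending at runtime only affects the current process. For a permanent fix, the PYTHONPATH environment variable or a .pth file in site-packages are the usual options discussed in the links above.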
- NLTK
1. Install dependency
$ sudo pip install -U numpy
2. Install NLTK
$ sudo pip install -U nltk
3. Test Run
$ python3
>>> import nltk
ref : https://www.nltk.org/install.html
- Gensim
1. Dependencies
- Python >= 2.7 (tested with versions 2.7, 3.5 and 3.6)
- NumPy >= 1.11.3
- SciPy >= 0.18.1
- Six >= 1.5.0
- smart_open >= 1.2.1
2. Install Gensim
$ pip install --upgrade gensim
- twython (needed to use the Twitter API easily)
$ sudo pip install twython
2. Topic Modeling
- Design a probabilistic model that finds the topics of a document from the words that appear frequently in it
- LDA, LSI, HDP, etc.
- Test Run
1. KoNLPy setup
- Location of the Korean files: stored in advance in the kobill directory under konlpy's corpus directory
/?????/Python/3.4/site-packages/konlpy/data/corpus/kobill
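2. Tokenizer setup
- Tokenize in the NLTK style, using nltk + the Twitter tagger (called Okt in KoNLPy 0.5 and later)
from konlpy.corpus import kobill
from konlpy.tag import Okt
# Read every document in the kobill corpus.
docs_ko = [kobill.open(i).read() for i in kobill.fileids()]
# Okt is the current name of the tagger formerly called Twitter.
t = Okt()
# Turn each (word, tag) pair into a single 'word/tag' token.
pos = lambda d: ['/'.join(p) for p in t.pos(d)]
texts_ko = [pos(doc) for doc in docs_ko]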
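To make the token format concrete: `t.pos(d)` returns (word, tag) pairs, and the `pos` helper joins each pair into a single `word/tag` string. The sketch below applies the same join to a hand-written list of pairs, using tokens that appear in the results section, so it runs without KoNLPy or a JVM. The list is illustrative, not actual tagger output, and `join_pairs` is a name made up for this example.

```python
# Illustrative (word, tag) pairs in the shape returned by a KoNLPy tagger's pos();
# hand-written here, not real tagger output.
tagged = [("육아휴직", "Noun"), ("고용", "Noun"), ("UAE", "Alpha")]


def join_pairs(pairs):
    """Join each (word, tag) pair into a single 'word/tag' token."""
    return ["/".join(p) for p in pairs]


tokens = join_pairs(tagged)
print(tokens)  # ['육아휴직/Noun', '고용/Noun', 'UAE/Alpha']
```

Keeping the tag inside the token means the same surface form with different parts of speech is treated as two distinct vocabulary entries in the later steps.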
3. Topic modeling
from konlpy.corpus import kobill
from konlpy.tag import Okt
docs_ko = [kobill.open(i).read() for i in kobill.fileids()]
t = Okt()
# stem=True reduces words to their stems; norm=True normalizes spelling variants.
pos = lambda d: ['/'.join(p) for p in t.pos(d, stem=True, norm=True)]
texts_ko = [pos(doc) for doc in docs_ko]
- Encode tokens as integers
from gensim import corpora
# Map every distinct token to an integer id and save the mapping.
dictionary_ko = corpora.Dictionary(texts_ko)
dictionary_ko.save('ko.dict')
- Compute TF-IDF
from gensim import models
# Convert each document to a bag-of-words vector, then weight it with TF-IDF.
tf_ko = [dictionary_ko.doc2bow(text) for text in texts_ko]
tfidf_model_ko = models.TfidfModel(tf_ko)
tfidf_ko = tfidf_model_ko[tf_ko]
corpora.MmCorpus.serialize('ko.mm', tfidf_ko)
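The two steps above can be sketched in plain Python to show what is being computed. This is an illustration under assumptions, not gensim's implementation. It assumes ids are assigned in order of first appearance (gensim's actual id assignment may differ). It also assumes what I understand to be the default TfidfModel weighting: raw term count times log2(N / document frequency), followed by L2 normalization. The toy documents reuse tokens from the results section.

```python
import math
from collections import Counter

# Toy tokenized documents (illustrative only).
docs = [
    ["육아휴직/Noun", "고용/Noun", "육아휴직/Noun"],
    ["파견/Noun", "부대/Noun"],
    ["육아휴직/Noun", "자녀/Noun"],
]

# Step 1: map each distinct token to an integer id (first-seen order).
token2id = {}
for doc in docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

# Step 2: bag-of-words vectors as sorted (token_id, count) pairs.
bows = [sorted(Counter(token2id[t] for t in doc).items()) for doc in docs]

# Step 3: document frequency per id, then TF-IDF with unit-length vectors.
df = Counter(tid for bow in bows for tid, _ in bow)
n_docs = len(bows)


def tfidf(bow):
    """Weight a bag-of-words vector by count * log2(N / df), then L2-normalize."""
    weights = [(tid, cnt * math.log2(n_docs / df[tid])) for tid, cnt in bow]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0.0:
        return []  # every token occurs in every document
    return [(tid, w / norm) for tid, w in weights if w > 0.0]


for bow in bows:
    print(tfidf(bow))
```

A token that occurs in every document gets weight log2(1) = 0 and drops out, which is why TF-IDF suppresses corpus-wide filler tokens before the topic models see the data.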
- Train topic models
- LSI (Latent Semantic Indexing)
ntopics, nwords = 3, 5
lsi_ko = models.lsimodel.LsiModel(tfidf_ko, id2word=dictionary_ko, num_topics=ntopics)
print(lsi_ko.print_topics(num_topics=ntopics, num_words=nwords))
- LDA (Latent Dirichlet Allocation)
import numpy as np
np.random.seed(42)  # fix the seed for reproducible results
lda_ko = models.ldamodel.LdaModel(tfidf_ko, id2word=dictionary_ko, num_topics=ntopics)
print(lda_ko.print_topics(num_topics=ntopics, num_words=nwords))
- HDP (Hierarchical Dirichlet Process)
import numpy as np
np.random.seed(42)  # fix the seed for reproducible results
hdp_ko = models.hdpmodel.HdpModel(tfidf_ko, id2word=dictionary_ko)
print(hdp_ko.print_topics(num_topics=ntopics, num_words=nwords))
- Scoring document
# Score the first document against each model, highest-scoring topic first.
bow = tfidf_model_ko[dictionary_ko.doc2bow(texts_ko[0])]
sorted(lsi_ko[bow], key=lambda x: x[1], reverse=True)
sorted(lda_ko[bow], key=lambda x: x[1], reverse=True)
sorted(hdp_ko[bow], key=lambda x: x[1], reverse=True)
# Score the second document the same way.
bow = tfidf_model_ko[dictionary_ko.doc2bow(texts_ko[1])]
sorted(lsi_ko[bow], key=lambda x: x[1], reverse=True)
sorted(lda_ko[bow], key=lambda x: x[1], reverse=True)
sorted(hdp_ko[bow], key=lambda x: x[1], reverse=True)
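Each `model[bow]` call above returns a list of (topic_id, score) pairs, and the `sorted(...)` calls rank them from highest to lowest score. The sketch below reproduces that ranking on made-up scores (hypothetical values, not output from the models above). Because LSI scores can be negative, it also shows ranking by absolute value as one possible alternative. Which ordering is appropriate depends on the use case.

```python
# Hypothetical (topic_id, score) pairs in the shape returned by model[bow].
scores = [(0, 0.12), (1, -0.75), (2, 0.33)]

# Same ranking as in the post: highest score first.
by_score = sorted(scores, key=lambda x: x[1], reverse=True)

# Alternative: rank by magnitude, so strongly negative scores are not buried.
by_magnitude = sorted(scores, key=lambda x: abs(x[1]), reverse=True)

print(by_score)      # [(2, 0.33), (0, 0.12), (1, -0.75)]
print(by_magnitude)  # [(1, -0.75), (2, 0.33), (0, 0.12)]
```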
- Results
1. LSI
[(0, '0.528*"육아휴직/Noun" + 0.248*"만/Noun" + 0.232*"×/Foreign" + 0.204*"고용/Noun" + 0.183*"자녀/Noun"'),
(1, '0.423*"파견/Noun" + 0.419*"부대/Noun" + 0.263*"\n\n/Foreign" + 0.248*"UAE/Alpha" + 0.229*"○/Foreign"'),
(2, '-0.308*"결혼/Noun" + -0.277*"손해/Noun" + -0.263*"예고/Noun" + -0.234*"사업자/Noun" + -0.202*"입법/Noun"')]
2. LDA
[(0, '0.001*"육아휴직/Noun" + 0.001*"결혼/Noun" + 0.001*"만/Noun" + 0.001*"×/Foreign" + 0.001*"자녀/Noun"'),
(1, '0.001*"손해/Noun" + 0.001*"학위/Noun" + 0.001*"사업자/Noun" + 0.001*"육아휴직/Noun" + 0.001*"간호/Noun"'),
(2, '0.003*"육아휴직/Noun" + 0.001*"×/Foreign" + 0.001*"고용/Noun" + 0.001*"파견/Noun" + 0.001*"부대/Noun"')]
3. HDP
[(0, '0.004*2011/Number + 0.003*육아휴직/Noun + 0.003*사단/Noun + 0.003*20/Number + 0.003*자/Suffix'),
(1, '0.005*취득/Noun + 0.004*내지/Noun + 0.004*수/Modifier + 0.003*높이/Noun + 0.003*30억원/Number'),
(2, '0.005*억원/Noun + 0.004*저하/Noun + 0.004*1년/Number + 0.004*기록/Noun + 0.004*까지임/Foreign')]
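- Interpreting the results
Topic 0 is represented as 0.528*"육아휴직" + 0.248*"만" + 0.232*"×" + 0.204*"고용" + 0.183*"자녀"
The top three keywords contributing to this topic are '육아휴직' (parental leave), '만', and '×'
The weight of '육아휴직' for topic 0 is 0.528
The weight reflects how important a keyword is to the topic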
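The reading above can also be done programmatically. The sketch below parses the topic 0 string from the LSI results into (weight, keyword) pairs with a regular expression and picks the three largest by magnitude. The helper name `parse_topic` and the pattern are assumptions made for this illustration, not part of gensim.

```python
import re

# Topic 0 string copied from the LSI results.
topic0 = ('0.528*"육아휴직/Noun" + 0.248*"만/Noun" + 0.232*"×/Foreign" '
          '+ 0.204*"고용/Noun" + 0.183*"자녀/Noun"')


def parse_topic(topic):
    """Return (weight, keyword) pairs parsed from a 'w*"token" + ...' string."""
    pairs = re.findall(r'(-?\d+\.\d+)\*"([^"]+)"', topic)
    return [(float(w), kw) for w, kw in pairs]


terms = parse_topic(topic0)
# Rank by magnitude so that negative LSI weights are handled as well.
top3 = sorted(terms, key=lambda t: abs(t[0]), reverse=True)[:3]
print(top3)  # [(0.528, '육아휴직/Noun'), (0.248, '만/Noun'), (0.232, '×/Foreign')]
```

The pattern expects quoted tokens, as in the LSI and LDA output. The HDP output shown above prints tokens without quotes and would need a different pattern.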
ref : http://www.engear.net/wp/topic-modeling-gensimpython/
ref : https://www.nextobe.com/single-post/2017/06/28/%ED%95%9C%EA%B8%80-%EB%8D%B0%EC%9D%B4%ED%84%B0-%EB%A8%B8%EC%8B%A0%EB%9F%AC%EB%8B%9D-%EB%B0%8F-word2vec%EC%9D%84-%EC%9D%B4%EC%9A%A9%ED%95%9C-%EC%9C%A0%EC%82%AC%EB%8F%84-%EB%B6%84%EC%84%9D