Tuesday, November 27, 2018

Topic Modeling (KoNLPy + NLTK + Gensim)

OS - CentOS 6
Python - 3.4.8
Java - 1.7
KoNLPy - 0.5.1
NLTK - 3.4

1. Installation
 - Written based on the content of the linked pages
 - Upgraded from the existing Python 2.7 to Python 3.4, and upgraded the libraries to compatible versions
 - KoNLPy
  1. Install dependency
$ wget https://www.python.org/ftp/python/3.4.*/Python-3.4.*.tar.xz
$ tar xf Python-3.*
$ cd Python-3.*
$ ./configure
$ make
$ sudo make altinstall

  2. Install KoNLPy
$ pip3.4 install konlpy

  3. Install Mecab
$ sudo yum install curl
$ bash <(curl -s https://raw.githubusercontent.com/konlpy/konlpy/master/scripts/mecab.sh)
ref : http://konlpy.org/en/latest/install/#centos

* Error : 14: PYCURL ERROR 6 ~~~ ( couldn't resolve host )
=> Add a `nameserver` line for Google's nameserver to /etc/resolv.conf
ref : https://www.centos.org/forums/viewtopic.php?t=8892

* Installing pip3
$ sudo yum install python34-setuptools
$ sudo easy_install-3.4 pip
ref : https://stackoverflow.com/questions/32618686/how-to-install-pip-in-centos-7

* Error : (while installing konlpy) Cache entry deserialization failed, entry ignored (network error; resolved after reconnecting)
ref : https://stackoverflow.com/questions/49671215/cache-entry-deserialization-failed-entry-ignored

  4. Test Run
$ python3
>>> from konlpy.tag import Kkma
>>> from konlpy.utils import pprint
>>> kkma = Kkma()
>>> pprint(kkma.sentences(u'네, 안녕하세요. 반갑습니다.'))
  [네, 안녕하세요..,반갑습니다.]

* Error : importerror no module named 'jpype1'
=> Switch to a compatible JDK version
$ alternatives --config java
Select the desired version
ref : http://blog.daum.net/drinker/25

* Error : ImportError: No module named 'konlpy'
=> Add the install path to sys.path
ref : https://stackoverflow.com/questions/23417941/import-error-no-module-named-does-exist/23418662
ref : https://askubuntu.com/questions/470982/how-to-add-a-python-module-to-syspath
ref : https://stackoverflow.com/questions/16114391/adding-directory-to-sys-path-pythonpath
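
As a minimal sketch of the sys.path fix above, the snippet below appends a directory to sys.path at runtime, and only when it is missing. The directory /opt/example/site-packages and the helper name add_to_sys_path are hypothetical placeholders, not taken from the original setup; substitute the directory where pip actually installed konlpy.

```python
import sys


def add_to_sys_path(directory):
    """Append directory to sys.path if it is not already present."""
    if directory not in sys.path:
        sys.path.append(directory)
    return sys.path


# Hypothetical location; replace with the real site-packages directory.
add_to_sys_path("/opt/example/site-packages")
```

This only affects the running interpreter. Setting the PYTHONPATH environment variable is the usual way to make the change persistent.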

 - NLTK
  1. Install dependency
$ sudo pip install -U numpy

  2. Install NLTK
$ sudo pip install -U nltk

  3. Test Run
$ python3
>>> import nltk
ref : https://www.nltk.org/install.html

 - Gensim
  1. Dependencies
   - Python >= 2.7 (tested with versions 2.7, 3.5 and 3.6)
   - NumPy >= 1.11.3
   - SciPy >= 0.18.1
   - Six >= 1.5.0
   - smart_open >= 1.2.1

  2. Install Gensim
$ pip install --upgrade gensim

 - twython (needed to use the Twitter API easily)
$ sudo pip install twython

2. Topic Modeling
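 - Design a probabilistic model that finds the topics of a document from the words that appear frequently in it
 - Examples: LDA, LSI, HDP
 - Test Run
  1. KoNLPy setup
   - Location of the Korean text files: stored in advance in the kobill directory under konlpy's corpus

  2. Tokenizer setup
   - Tokenize with NLTK plus the Twitter (Okt) tagger, in the NLTK style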
from konlpy.corpus import kobill
docs_ko = [kobill.open(i).read() for i in kobill.fileids()]
from konlpy.tag import Okt; t = Okt()
pos = lambda d: ['/'.join(p) for p in t.pos(d)]
texts_ko = [pos(doc) for doc in docs_ko]
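
The pos lambda above turns each (word, tag) pair returned by the tagger into a single 'word/tag' token. A minimal sketch of just that joining step, runnable without KoNLPy, is shown below. The helper name join_pos and the sample pairs are hand-written for illustration and are not real tagger output.

```python
def join_pos(tagged):
    """Turn (word, tag) pairs into 'word/tag' tokens."""
    return ["/".join(pair) for pair in tagged]


# Hand-written pairs for illustration; a real tagger may split or tag differently.
sample = [("육아휴직", "Noun"), ("고용", "Noun")]
tokens = join_pos(sample)
print(tokens)
```

Keeping the tag inside the token means the same surface form with different parts of speech is counted as a different term later on.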

  3. topic modeling
from konlpy.corpus import kobill
docs_ko = [kobill.open(i).read() for i in kobill.fileids()]
from konlpy.tag import Okt; t = Okt()
pos = lambda d: ['/'.join(p) for p in t.pos(d, stem=True, norm=True)]
texts_ko = [pos(doc) for doc in docs_ko]

   - Encode the tokens as integers
from gensim import corpora
dictionary_ko = corpora.Dictionary(texts_ko)
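
To make the integer encoding concrete, here is a pure-Python sketch of the idea: every distinct token gets an integer id, and a document becomes (token id, count) pairs of the kind doc2bow returns. This is not gensim's implementation; build_vocab and to_bow are hypothetical helper names, and gensim may assign ids in a different order.

```python
from collections import Counter


def build_vocab(texts):
    """Assign each distinct token an integer id in first-seen order."""
    vocab = {}
    for text in texts:
        for token in text:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def to_bow(text, vocab):
    """Return sorted (token_id, count) pairs for the tokens found in vocab."""
    counts = Counter(vocab[t] for t in text if t in vocab)
    return sorted(counts.items())


# Toy documents, already tokenized as 'word/tag' strings.
texts = [["a/Noun", "b/Noun", "a/Noun"], ["b/Noun", "c/Noun"]]
vocab = build_vocab(texts)
bow = to_bow(texts[0], vocab)
```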

   - Compute TF-IDF
from gensim import models
tf_ko = [dictionary_ko.doc2bow(text) for text in texts_ko]
tfidf_model_ko = models.TfidfModel(tf_ko)
tfidf_ko = tfidf_model_ko[tf_ko]
corpora.MmCorpus.serialize('ko.mm', tfidf_ko)
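
For intuition, the sketch below computes one common TF-IDF formulation by hand: term count times log2(N / document frequency), followed by L2 normalization of each document vector. I believe this is close to what TfidfModel does by default, but treat that as an assumption rather than a description of gensim's code. The function name tfidf and the toy input are illustrative.

```python
import math


def tfidf(bows):
    """bows: list of documents, each a list of (token_id, count) pairs."""
    n_docs = len(bows)
    # Document frequency: in how many documents each token id occurs.
    df = {}
    for bow in bows:
        for token_id, _ in bow:
            df[token_id] = df.get(token_id, 0) + 1
    weighted = []
    for bow in bows:
        vec = [(tid, cnt * math.log2(n_docs / df[tid])) for tid, cnt in bow]
        # Tokens that occur in every document get weight 0 and are dropped.
        vec = [(tid, w) for tid, w in vec if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(tid, w / norm) for tid, w in vec] if norm else [])
    return weighted


bows = [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
result = tfidf(bows)
```

In this toy corpus token 1 appears in both documents, so it carries no weight and each document is left with its one distinguishing token.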

   - Train the topic models
   - LSI (Latent Semantic Indexing)
ntopics, nwords = 3, 5
lsi_ko = models.lsimodel.LsiModel(tfidf_ko, id2word=dictionary_ko, num_topics=ntopics)
print(lsi_ko.print_topics(num_topics=ntopics, num_words=nwords))

   - LDA (Latent Dirichlet Allocation)
import numpy as np; np.random.seed(42)
lda_ko = models.ldamodel.LdaModel(tfidf_ko, id2word=dictionary_ko, num_topics=ntopics)
print(lda_ko.print_topics(num_topics=ntopics, num_words=nwords))

   - HDP (Hierarchical Dirichlet Process)
import numpy as np; np.random.seed(42)
hdp_ko = models.hdpmodel.HdpModel(tfidf_ko, id2word=dictionary_ko)
print(hdp_ko.print_topics(num_topics=ntopics, num_words=nwords))

   - Scoring document
bow = tfidf_model_ko[dictionary_ko.doc2bow(texts_ko[0])]
sorted(lsi_ko[bow], key=lambda x: x[1], reverse=True)
sorted(lda_ko[bow], key=lambda x: x[1], reverse=True)
sorted(hdp_ko[bow], key=lambda x: x[1], reverse=True)
bow = tfidf_model_ko[dictionary_ko.doc2bow(texts_ko[1])]
sorted(lsi_ko[bow], key=lambda x: x[1], reverse=True)
sorted(lda_ko[bow], key=lambda x: x[1], reverse=True)
sorted(hdp_ko[bow], key=lambda x: x[1], reverse=True)
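
Each model[bow] lookup above yields (topic id, score) pairs, and the sorted calls rank them from the highest score down. A small stand-alone sketch of that ranking step follows; rank_topics is a hypothetical helper and the scores are made up for illustration.

```python
def rank_topics(doc_topics):
    """Sort (topic_id, score) pairs from highest to lowest score."""
    return sorted(doc_topics, key=lambda pair: pair[1], reverse=True)


# Made-up scores for illustration only.
example = [(0, 0.12), (1, 0.71), (2, -0.05)]
ranked = rank_topics(example)
best_topic = ranked[0][0]
```

LSI scores can be negative, so sorting on the raw value, as the original does, ranks strongly negative topics last. Sorting on the absolute value is an alternative when magnitude is what matters.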

   - Results
    1. LSI
    [(0, '0.528*"육아휴직/Noun" + 0.248*"만/Noun" + 0.232*"×/Foreign" + 0.204*"고용/Noun" + 0.183*"자녀/Noun"'),
    (1, '0.423*"파견/Noun" + 0.419*"부대/Noun" + 0.263*"\n\n/Foreign" + 0.248*"UAE/Alpha" + 0.229*"○/Foreign"'),
    (2, '-0.308*"결혼/Noun" + -0.277*"손해/Noun" + -0.263*"예고/Noun" + -0.234*"사업자/Noun" + -0.202*"입법/Noun"')]
    2. LDA
    [(0, '0.001*"육아휴직/Noun" + 0.001*"결혼/Noun" + 0.001*"만/Noun" + 0.001*"×/Foreign" + 0.001*"자녀/Noun"'),
    (1, '0.001*"손해/Noun" + 0.001*"학위/Noun" + 0.001*"사업자/Noun" + 0.001*"육아휴직/Noun" + 0.001*"간호/Noun"'),
    (2, '0.003*"육아휴직/Noun" + 0.001*"×/Foreign" + 0.001*"고용/Noun" + 0.001*"파견/Noun" + 0.001*"부대/Noun"')]
    3. HDP
    [(0, '0.004*2011/Number + 0.003*육아휴직/Noun + 0.003*사단/Noun + 0.003*20/Number + 0.003*자/Suffix'),
    (1, '0.005*취득/Noun + 0.004*내지/Noun + 0.004*수/Modifier + 0.003*높이/Noun + 0.003*30억원/Number'),
    (2, '0.005*억원/Noun + 0.004*저하/Noun + 0.004*1년/Number + 0.004*기록/Noun + 0.004*까지임/Foreign')]

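   - Interpreting the results
    Topic 0 is expressed as 0.528*"육아휴직" + 0.248*"만" + 0.232*"×" + 0.204*"고용" + 0.183*"자녀"
    The top three keywords contributing to this topic are '육아휴직' (parental leave), '만', and '×'
    The weight of '육아휴직' for topic 0 is 0.528
    A weight reflects how important the keyword is to the topic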
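
The reading of topic 0 above can be reproduced programmatically. The sketch below parses the quoted weight*"token" form printed for LSI and LDA into (token, weight) pairs and picks the three largest by magnitude. The regular expression and the name parse_topic are my own; the HDP output shown above has no quotes and would need a different pattern.

```python
import re

# Matches terms such as 0.528*"육아휴직/Noun" or -0.308*"결혼/Noun".
TERM = re.compile(r'(-?\d+(?:\.\d+)?)\*"([^"]*)"')


def parse_topic(topic):
    """Split a '0.528*"word/Tag" + ...' string into (token, weight) pairs."""
    return [(token, float(weight)) for weight, token in TERM.findall(topic)]


topic0 = ('0.528*"육아휴직/Noun" + 0.248*"만/Noun" + 0.232*"×/Foreign" '
          '+ 0.204*"고용/Noun" + 0.183*"자녀/Noun"')
pairs = parse_topic(topic0)
ordered = sorted(pairs, key=lambda p: abs(p[1]), reverse=True)
top3 = [token for token, _ in ordered[:3]]
```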

ref : http://www.engear.net/wp/topic-modeling-gensimpython/
ref : https://www.nextobe.com/single-post/2017/06/28/%ED%95%9C%EA%B8%80-%EB%8D%B0%EC%9D%B4%ED%84%B0-%EB%A8%B8%EC%8B%A0%EB%9F%AC%EB%8B%9D-%EB%B0%8F-word2vec%EC%9D%84-%EC%9D%B4%EC%9A%A9%ED%95%9C-%EC%9C%A0%EC%82%AC%EB%8F%84-%EB%B6%84%EC%84%9D

Wednesday, June 7, 2017

Install JDK on Windows (+ Environment Variables setting)

JDK - Java Development Kit

1. Download JDK
- Go to www.oracle.com, select [Download] - [Java SE]

2. Check JDK version and click 'download'

3. Select "Accept License Agreement" and download "jdk-****-windows-x**.exe"

4. Run the *.exe file and click "Next"

5. Select the options and the installation path
- select all components unless there is a reason not to
- remember the installation path; it is needed when setting the environment variable
- click Next

6. The JRE is installed automatically; just click "Next"

7. Set the environment variables

** Setting the Environment Variables
1. Open the "Advanced" tab in "System Properties"

2. Click the "Environment Variables" button

3. Create a new system variable named JAVA_HOME
- set the variable name to JAVA_HOME and the value to the JDK install folder, e.g. C:\Program Files\Java\jdk-*.* (without \bin, since step 4 appends it)

4. Modify "Path" in "System Variables"
- add ;%JAVA_HOME%\bin;
- do not omit the semicolons

5. Open cmd and check that the installation completed
- type in "java -version" and "javac -version"
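
The same check can be scripted. The sketch below, which assumes JAVA_HOME holds the JDK install folder, reports the JAVA_HOME value, the bin folder that %JAVA_HOME%\bin expands to, and whether a java executable is found on the current Path. The helper names and the example folder C:\Java\jdk are hypothetical.

```python
import ntpath
import os
import shutil


def expected_java_bin(java_home):
    """Windows-style path that %JAVA_HOME%\\bin expands to."""
    return ntpath.join(java_home, "bin")


def check_java_setup(env=None):
    """Summarize JAVA_HOME and whether 'java' is reachable on the Path."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    return {
        "JAVA_HOME": java_home,
        "expected_bin": expected_java_bin(java_home) if java_home else None,
        "java_on_path": shutil.which("java"),  # None when java is not found
    }


# Hypothetical install folder, used only to show the expansion.
print(expected_java_bin(r"C:\Java\jdk"))
print(check_java_setup())
```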

ref : http://recipes4dev.tistory.com/50#23-java-%EC%8B%A4%ED%96%89-%ED%99%98%EA%B2%BD-%EB%B3%80%EC%88%98-%EC%84%A4%EC%A0%95

Monday, July 25, 2016

3, 6, 9 Game - Java (Korean)

Pick an arbitrary positive integer and play the 3, 6, 9 game up to that number

* For each number, clap once for every 3, 6, or 9 it contains
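
The rule can be stated compactly in a few lines. The Python sketch below is only an illustration of the rule, not a translation of the Java listing; count_claps and play are hypothetical names, and printing the word "clap" is my own choice of output.

```python
def count_claps(n):
    """Number of digits in n that are 3, 6, or 9."""
    return sum(1 for ch in str(n) if ch in "369")


def play(limit):
    """Return what is said for 1..limit: the number, or one 'clap' per 3/6/9 digit."""
    calls = []
    for i in range(1, limit + 1):
        claps = count_claps(i)
        calls.append(" ".join(["clap"] * claps) if claps else str(i))
    return calls


print(play(13))
```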

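import java.util.ArrayList;
import java.util.List;

public class dateTest1 {
    public static void main(String[] args) {
        int a = 372;  // play the game from 1 up to this number
        List<String> tsn = new ArrayList<>();  // what is said for each turn
        boolean clap = false;
        for (int i = 1; i <= a; i++) {
            int temp = i;
            // Assumption: the source listing is cut off, so the digit test
            // and the output below are a reconstruction of the stated rule.
            while (temp > 0) {
                int digit = temp % 10;
                if (digit == 3 || digit == 6 || digit == 9) {
                    tsn.add("clap");  // one clap per 3, 6, or 9
                    clap = true;
                }
                temp = temp / 10;
            }
            if (clap) {
                clap = false;  // reset for the next number
            } else {
                tsn.add(String.valueOf(i));  // no 3, 6, 9: say the number
            }
        }
        System.out.println(tsn);
    }
}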

Sunday, July 24, 2016

Import Project from Bitbucket to Eclipse

How to import a project into Eclipse (by cloning)
when the project repository already exists on Bitbucket

1. File -> Import -> Git -> Projects from Git

2. select URI

3. copy the repository URI from Bitbucket

4. paste it (Ctrl+V)

5. select branches

6. select local directory

7. choose the option you want (the second one in my case)

8. the following steps are the same as 'Create Java Project'

9. restart Eclipse, then check the repository and branches

10. right-click on the project -> Team -> Commit
- select the changed files and drag and drop them into the staged changes
- write a summary or memo as the commit message
- click 'Commit and Push'

11. OK

Tuesday, July 19, 2016

Create Bitbucket Repository and Connect Eclipse

Ubuntu 14.04
Eclipse Mars 4.5.2

1. Sign up for Bitbucket and create a repository
- dashboard -> repositories -> create repository

- create (private repositories are limited to five)

2. Install the Git plug-in in Eclipse
- Help -> Install New Software
- enter 'http://download.eclipse.org/egit/updates' in 'Work with:' or just type 'git'
- check the box 'Eclipse Git Team Provider'

- next -> next -> I accept the terms of the license agreement -> finish (takes a few minutes)
- restart Eclipse; Git now appears in the Import menu

3. Create a project and connect it to Bitbucket
- create a project and right-click on it
- Team -> Share Project

- check 'Use or create repository in parent folder of project'
- choose the project you want (it already exists)

- if you cannot check it, click 'Create Repository' and choose again

4. Commit and push
- right-click on the project -> Team -> Commit
- select the source files and move them from 'unstaged changes' to 'staged changes'
- the commit message is optional

- commit (an error may appear at this point)
- right-click on the project -> Team -> Push Branch 'master'

- enter the clone URL from Bitbucket in 'Location -> URI:'

- enter user name and password -> next

- next -> finish

- refresh the Bitbucket dashboard
- the commits should now be listed

ref : http://embedded.kookmin.ac.kr/lectureMobile/index.php/Eclipse_%2B_Github_%EC%82%AC%EC%9A%A9%EB%B2%95

Monday, July 18, 2016

Install Git on Ubuntu

There are two ways to install Git on Ubuntu

A. Installation
1. Using apt-get - simple, but not the latest version.
$ sudo apt-get install git

2. Download and install manually - lets you use the latest version.
- install the required libraries
$ sudo apt-get install libcurl4-gnutls-dev libexpat1-dev gettext \
    libz-dev libssl-dev
- to build the documentation in various formats, these packages are also needed
$ sudo apt-get install asciidoc xmlto docbook2x
- download the latest release from one of the following
Kernel.org( https://www.kernel.org/pub/software/scm/git/ )
GitHub mirror( https://github.com/git/git/releases )
- compile and install
$ tar -zxf git-1.9.1.tar.gz
$ cd git-1.9.1
$ make configure
$ ./configure --prefix=/usr
$ make all doc info
$ sudo make install install-doc install-html install-info

B. Configuration setting
- user
$ git config --global user.name "user name"
$ git config --global user.email emailaddress@email.com
- editor
$ git config --global core.editor editorname
- checking config settings
$ git config --list
- usage help
$ git help <verb>
$ git <verb> --help
$ man git-<verb>
ex) $ git help config

C. Generate project
- create a folder and move into it
$ mkdir testGitProject
$ cd testGitProject
- create a git repository in that folder
$ git init
- add and commit
$ git add *.c
$ git add LICENSE
$ git commit -m 'message about this project version'

D. Clone remote repository
- the following command creates a directory named "libgit2", initializes .git inside it, and clones the repository into it
$ git clone https://github.com/libgit2/libgit2
- the same, except that the target directory is named "mylibgit"
$ git clone https://github.com/libgit2/libgit2 mylibgit
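
The directory name in the first form comes from the last segment of the URL. The sketch below is a simplified approximation of that rule, not git's actual logic: it takes the last path segment and strips a trailing ".git". The function name default_clone_dir is hypothetical.

```python
def default_clone_dir(url):
    """Approximate the directory name `git clone <url>` picks by default."""
    name = url.rstrip("/").rsplit("/", 1)[-1]
    if name.endswith(".git"):
        name = name[: -len(".git")]
    return name


print(default_clone_dir("https://github.com/libgit2/libgit2"))
```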

E. other commands
- check state of repository
$ git status
- commit
$ git commit
- view log
$ git log
- fetch or pull remote repository (pull = fetch + merge)
$ git fetch [remote-name]
$ git pull [remote-name]
- push
$ git push origin master

ref : https://git-scm.com/book/en/v2/Getting-Started-Installing-Git
ref : http://yokang90.tistory.com/47

Thursday, July 14, 2016

Ubuntu Commands and Tips #2

Memo - commands and tips in ubuntu

1. Network settings on Ubuntu: when you can't connect to the wired (LAN) network
- open the following file
$ sudo gedit /etc/network/interfaces
- default contents of the file
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).
# The loopback network interface

auto lo
iface lo inet loopback
- case 1 (automatic, DHCP): add the following below the existing text
auto eth0
iface eth0 inet dhcp
- case 2 (manual, static): add the following
auto eth0
iface eth0 inet static
        dns-nameservers ~~~~~
- save and restart the network interface
$ sudo ifdown eth0
$ sudo ifup eth0
ref : http://blog.iolate.kr/89
ref : http://promobile.tistory.com/300

2. Error : " ubuntu wired network unmanaged "
- open the following file
$ sudo gedit /etc/NetworkManager/NetworkManager.conf
- in the file, set "managed=true"

- save and restart network manager
$ sudo /etc/init.d/network-manager restart
ref : https://wishkane.wordpress.com/2013/07/11/ubuntu-network-manager-wired-networks-are-unmanaged/
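
The edit can also be scripted. The sketch below flips the managed key using Python's configparser on an in-memory copy of the file. It assumes the key lives in an [ifupdown] section, as in the sample text, which is my own stand-in for a typical NetworkManager.conf and not the file from this machine. The function name set_managed is hypothetical.

```python
import configparser
import io


def set_managed(conf_text):
    """Return conf_text with managed=true set in the [ifupdown] section."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep key names exactly as written
    parser.read_string(conf_text)
    if not parser.has_section("ifupdown"):
        parser.add_section("ifupdown")
    parser.set("ifupdown", "managed", "true")
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


# Assumed sample content, for illustration only.
sample = "[main]\nplugins=ifupdown,keyfile\n\n[ifupdown]\nmanaged=false\n"
updated = set_managed(sample)
print(updated)
```

configparser discards comments when it rewrites a file, so back up the original before writing the result to /etc/NetworkManager/NetworkManager.conf.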

3. Activate workspaces
- System Settings -> Appearance -> Behavior
- check "Enable workspaces"

4. Korean language and Korean keyboard setup on Ubuntu
- System Settings -> Language Support -> install the language support packages
- System Settings -> Text Entry -> use the '+' at the bottom left to add the item labeled 'Korean(Hangul)'
(preferably in the order English, then Korean)
- System Settings -> Keyboard -> Shortcuts -> Typing -> change Compose Key from 'Disabled' to 'Right Alt'
- enter the following in a terminal (needed to use the Korean/English toggle key)
$ sudo apt-get install dconf-editor
- once the installation finishes, set the value of org -> gnome -> desktop -> wm -> keybindings -> switch-input-source to 'Hangul'