Monday, October 26, 2015

CWB (The IMS Open Corpus Workbench) installation

CWB (The IMS Open Corpus Workbench) is a collection of tools for managing and querying large text corpora.
I will use CWB for preprocessing in topic modeling.


(Ubuntu 14.04.3)
1. Download CWB and extract the archive
http://cwb.sourceforge.net/download.php

2. Install into the target directory (run the script from the extracted directory)
$ sudo ./install-cwb.sh

3. The installation is finished

4. To test your CWB installation, download and unpack one of the pre-indexed demo corpora (novels by Charles Dickens)
http://cwb.sourceforge.net/download.php

5. Move to DemoCorpus/ and type this on the command line
$ cwb-describe-corpus -r registry -s DICKENS
Result: it prints some information about the corpus and its attributes.

6. Next, type
$ cqp -eC -r registry -D DICKENS
This starts an interactive CQP session and activates the demo corpus DICKENS for queries.

7. Type a query for a word like this, and check the result.
DICKENS> "the";


8. Try some display options
DICKENS> set Context 1 s;
DICKENS> "the";
DICKENS> show +lemma;
DICKENS> show +pos;


9. Once the settings are done, run the test query again
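
The commands from steps 7 and 8 can also be replayed without an interactive session. This is only a sketch: it assumes you are in DemoCorpus/ (so that ./registry exists) and that your cqp build accepts -f to read commands from a file. The helper name write_cqp_script is made up for this example. I put the display settings before the query so that a single run of "the" already shows them.

```shell
# write_cqp_script FILE: store the CQP commands from steps 7-8 in FILE.
# (hypothetical helper, not part of CWB)
write_cqp_script() {
    cat > "$1" <<'EOF'
set Context 1 s;
show +lemma;
show +pos;
"the";
EOF
}

query_file="$(mktemp)"
write_cqp_script "$query_file"

# Only call cqp when it is installed and the demo registry is present.
if command -v cqp >/dev/null 2>&1 && [ -d registry ]; then
    # Assumption: -f runs the commands in the file in batch mode.
    cqp -r registry -D DICKENS -f "$query_file"
else
    echo "cqp or ./registry not found; the query script is in $query_file"
fi
```

Keeping the commands in a file makes it easy to rerun the same test after changing the installation.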


****
If you want to use CWB from any directory, not just from one specific directory

1. In DemoCorpus/, create a system-wide registry directory
$ sudo mkdir /usr/local/share/cwb
$ sudo mkdir /usr/local/share/cwb/registry

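2. Open the 'dickens' file in DemoCorpus/registry and change the HOME line to the absolute path of the data directory (the path below is the one on my machine)
HOME data => HOME /home/won/Desktop/TopicModeling/CWB/DemoCorpus/data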

3. Copy the registry file to the new registry directory, then test from another directory
$ cd registry/
$ sudo cp dickens /usr/local/share/cwb/registry/
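
Steps 2 and 3 can be scripted instead of editing the file by hand. This is a rough sketch under my own assumptions: set_registry_home is a made-up helper, it only touches the HOME line, and the one-line demo file is a stand-in for the real 'dickens' registry file. The part that needs sudo is left commented out.

```shell
# set_registry_home REGISTRY_FILE DATA_DIR   (hypothetical helper, not part of CWB)
# Rewrites the HOME line of REGISTRY_FILE to the absolute path of DATA_DIR.
set_registry_home() {
    registry_file="$1"
    # Resolve DATA_DIR to an absolute path; stop if it does not exist.
    data_dir="$(cd "$2" && pwd)" || return 1
    # '|' is the sed delimiter, so the path must not contain '|' or '&'.
    sed "s|^HOME .*|HOME ${data_dir}|" "$registry_file" > "${registry_file}.tmp" &&
        mv "${registry_file}.tmp" "$registry_file"
}

# Demonstration on a throwaway copy, so nothing under /usr/local is touched.
# The one-line file below is a stand-in, not the real 'dickens' registry file.
demo_dir="$(mktemp -d)"
mkdir "$demo_dir/data"
printf 'HOME data\n' > "$demo_dir/dickens"
set_registry_home "$demo_dir/dickens" "$demo_dir/data"
cat "$demo_dir/dickens"

# Intended use from DemoCorpus/ (commented out because it needs sudo):
# set_registry_home registry/dickens data
# sudo mkdir -p /usr/local/share/cwb/registry
# sudo cp registry/dickens /usr/local/share/cwb/registry/
```

Resolving the path with cd and pwd avoids typing the absolute path by hand, which is where I would expect typos.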

****
CWB itself is written in C, and its scripting interface is for Perl. If you want to use it from Python, you need to install a Python wrapper (such as yannick's cwb-python)

1. Download source and unpack
https://bitbucket.org/yannick/cwb-python/src

2. Build and install the wrapper
$ python setup.py build
In my case, this step failed with the following error:
running build
running build_py
running build_ext
building 'CWB.CL' extension
x86_64-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -Isrc -I/usr/include/python2.7 -c src/CWB/CL.cpp -o build/temp.linux-x86_64-2.7/src/CWB/CL.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
src/CWB/CL.cpp: In function ‘int __pyx_pf_3CWB_2CL_6IDList___cinit__(PyObject*, PyObject*, PyObject*)’:
src/CWB/CL.cpp:1659:7: warning: variable ‘__pyx_v_is_sorted’ set but not used [-Wunused-but-set-variable]
   int __pyx_v_is_sorted;
       ^
c++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -D_FORTIFY_SOURCE=2 -g -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security build/temp.linux-x86_64-2.7/src/CWB/CL.o -lcl -lpcre -lglib-2.0 -o build/lib.linux-x86_64-2.7/CWB/CL.so
/usr/bin/ld: cannot find -lpcre
/usr/bin/ld: cannot find -lglib-2.0
collect2: error: ld returned 1 exit status
error: command 'c++' failed with exit status 1
The linker cannot find the PCRE and GLib libraries, so install gcc and the missing development packages.
$ sudo apt-get install gcc
$ sudo apt-get install libpcre3 libpcre3-dev
$ sudo apt-get install libglib2.0-dev
Then continue with the wrapper installation
$ sudo python setup.py install

3. Confirm the installation
$ python -m doctest tests/idlist.txt
This should finish with no output if everything is working.
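
If the build in step 2 fails again, the "cannot find -l..." lines in the log tell you which libraries are missing. The sketch below pulls those names out of a build log and maps them to the Ubuntu packages I installed above. Both function names are made up for this example, and the mapping only covers the two libraries from this post.

```shell
# missing_libs: read a build log on stdin and print the library names the
# linker could not find (lines of the form "cannot find -lNAME").
missing_libs() {
    sed -n 's/.*cannot find -l\([^ ]*\).*/\1/p'
}

# suggest_package NAME: map a library name to the package used in this post.
suggest_package() {
    case "$1" in
        pcre)     echo "libpcre3-dev" ;;
        glib-2.0) echo "libglib2.0-dev" ;;
        *)        echo "unknown (try: apt-cache search $1)" ;;
    esac
}

# Example input: the linker errors from the failed build in step 2.
build_log='/usr/bin/ld: cannot find -lpcre
/usr/bin/ld: cannot find -lglib-2.0
collect2: error: ld returned 1 exit status'

printf '%s\n' "$build_log" | missing_libs | while read -r lib; do
    echo "$lib -> $(suggest_package "$lib")"
done
```

With a real build you could pipe the output in directly, for example: python setup.py build 2>&1 | missing_libs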


Ref:
http://cwb.sourceforge.net/install.php
https://bitbucket.org/yannick/cwb-python/src
