
Monday, October 26, 2015

CWB (The IMS Open Corpus Workbench) installation

CWB (the IMS Open Corpus Workbench) is a set of tools for managing and querying large text corpora.
I will use CWB for preprocessing in topic modeling.


(Ubuntu 14.04.3)
1. Download CWB and extract the archive
http://cwb.sourceforge.net/download.php

2. Run the install script from the extracted directory
$ sudo ./install-cwb.sh

3. The installation is finished.

4. If you want to test your CWB installation, download and unpack one of the pre-indexed demo corpora (novels by Charles Dickens)
http://cwb.sourceforge.net/download.php

5. Move to DemoCorpus/ and type this on the command line
$ cwb-describe-corpus -r registry -s DICKENS
This prints some information about the corpus and its attributes.

6. Next, type
$ cqp -eC -r registry -D DICKENS
This starts an interactive CQP session and activates the demo corpus DICKENS for queries.

7. Type in a query such as the following and check the result.
DICKENS> "the";


8. Try applying some display options
DICKENS> set Context 1 s;
DICKENS> "the";
DICKENS> show +lemma;
DICKENS> show +pos;


9. With these settings applied, run the query again and compare the output.
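
The queries from steps 7 and 8 can also be replayed without typing them one by one. This is only a minimal sketch: it assumes your cqp build accepts the -f option for reading commands from a file (check the output of cqp -h), and the helper names write_cqp_script and run_demo_query as well as the file name dickens-the.cqp are made up for this example.

```shell
#!/bin/sh
# Sketch: replay the CQP commands from steps 7-8 from a file.
# write_cqp_script, run_demo_query and the file name are hypothetical.

write_cqp_script() {
    # $1 = path of the command file to create
    cat > "$1" <<'EOF'
set Context 1 s;
show +lemma;
show +pos;
"the";
EOF
}

run_demo_query() {
    # $1 = command file; run from DemoCorpus/ so that "registry" resolves
    if ! command -v cqp >/dev/null 2>&1; then
        echo "cqp not found on PATH; is CWB installed?" >&2
        return 1
    fi
    # Assumption: -f makes cqp execute the commands in the given file
    cqp -r registry -D DICKENS -f "$1"
}

# Usage (from DemoCorpus/):
#   write_cqp_script dickens-the.cqp
#   run_demo_query dickens-the.cqp
```

The display options are written before the query here so that they are already active when "the" is evaluated.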


****
If you want to use the corpus from any directory, not only from DemoCorpus/

1. In DemoCorpus/, create a system-wide registry directory
$ sudo mkdir /usr/local/share/cwb
$ sudo mkdir /usr/local/share/cwb/registry

2. Open the 'dickens' file in DemoCorpus/registry and change the HOME line to the absolute path of the data directory
HOME data => HOME /home/won/Desktop/TopicModeling/CWB/DemoCorpus/data

3. Copy the registry file to the new registry directory, then test from another directory
$ cd registry/
$ sudo cp dickens /usr/local/share/cwb/registry/
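
Steps 2 and 3 can be scripted. The following is a minimal sketch, assuming the registry file contains a line starting with HOME as shown above and that the absolute data path contains no characters that are special to sed (|, & or \). The function name install_registry_entry is hypothetical and not part of CWB.

```shell
#!/bin/sh
# Sketch: rewrite the HOME line of a registry file to an absolute path
# and write the result into a target registry directory.
# install_registry_entry is a hypothetical helper name.

install_registry_entry() {
    src=$1        # registry file, e.g. registry/dickens
    data_dir=$2   # corpus data directory, e.g. data
    dest_dir=$3   # target registry directory

    # Resolve the data directory to an absolute path
    abs_data=$(cd "$data_dir" && pwd) || return 1

    # Replace the whole HOME line; all other lines are copied unchanged
    sed "s|^HOME .*|HOME $abs_data|" "$src" > "$dest_dir/$(basename "$src")"
}

# Usage (from DemoCorpus/, as a user who may write to the target directory):
#   install_registry_entry registry/dickens data /usr/local/share/cwb/registry
```

If the registry file contains other relative paths besides HOME, they would need the same treatment; this sketch only touches the HOME line.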

****
CWB itself is written in C and comes with a Perl interface. If you want to use it from Python, you need to install a Python wrapper (such as yannick's cwb-python)

1. Download source and unpack
https://bitbucket.org/yannick/cwb-python/src

2. Install wrapper
$ python setup.py build
In my case, this step failed with the following error:
running build
running build_py
running build_ext
building 'CWB.CL' extension
x86_64-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -Isrc -I/usr/include/python2.7 -c src/CWB/CL.cpp -o build/temp.linux-x86_64-2.7/src/CWB/CL.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
src/CWB/CL.cpp: In function ‘int __pyx_pf_3CWB_2CL_6IDList___cinit__(PyObject*, PyObject*, PyObject*)’:
src/CWB/CL.cpp:1659:7: warning: variable ‘__pyx_v_is_sorted’ set but not used [-Wunused-but-set-variable]
   int __pyx_v_is_sorted;
       ^
c++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -D_FORTIFY_SOURCE=2 -g -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security build/temp.linux-x86_64-2.7/src/CWB/CL.o -lcl -lpcre -lglib-2.0 -o build/lib.linux-x86_64-2.7/CWB/CL.so
/usr/bin/ld: cannot find -lpcre
/usr/bin/ld: cannot find -lglib-2.0
collect2: error: ld returned 1 exit status
error: command 'c++' failed with exit status 1
The linker cannot find the PCRE and GLib libraries, so install gcc and the missing development packages.
$ sudo apt-get install gcc
$ sudo apt-get install libpcre3 libpcre3-dev
$ sudo apt-get install libglib2.0-dev
Then continue with installing the wrapper
$ sudo python setup.py install

3. Confirm the installation
$ python -m doctest tests/idlist.txt
This should finish with no output if everything is working.
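
The "cannot find -l..." errors in step 2 can be checked for before running the build. The following is a minimal sketch that tries to link an empty C program against each library named in the failing link command (-lcl -lpcre -lglib-2.0). The helper name can_link is hypothetical, and a "missing" result is also printed when gcc itself is not installed.

```shell
#!/bin/sh
# Sketch: test whether the linker can find the libraries that the
# cwb-python build links against. can_link is a hypothetical helper name.

can_link() {
    # $1 = library name as passed to -l (e.g. pcre)
    tmpdir=$(mktemp -d) || return 1
    printf 'int main(void) { return 0; }\n' > "$tmpdir/t.c"
    # Link an empty program against the library; discard compiler output
    gcc "$tmpdir/t.c" -l"$1" -o "$tmpdir/t" >/dev/null 2>&1
    status=$?
    rm -rf "$tmpdir"
    return $status
}

# Library names taken from the link command in the error log above
for lib in cl pcre glib-2.0; do
    if can_link "$lib"; then
        echo "ok:      -l$lib"
    else
        echo "missing: -l$lib"
    fi
done
```

The test program is written to a temporary directory rather than to /dev/null, so the check is safe to run as root.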


Ref:
http://cwb.sourceforge.net/install.php
https://bitbucket.org/yannick/cwb-python/src