@liubin
Forked from tomowarkar/mecab_cabocha.ipynb
Created October 24, 2022 03:18
How to use MeCab and CaboCha in Google Colaboratory. See also: https://tomowarkar.github.io/blog/posts/colab_mecab/

liubin commented Oct 24, 2022

My steps:

Basic packages

apt-get install mecab swig libmecab-dev mecab-ipadic-utf8
apt install python
apt install python-dev

wget https://bootstrap.pypa.io/pip/2.7/get-pip.py
python get-pip.py
pip install mecab-python==0.996

CRF++

curl -sL -o CRF++-0.58.tar.gz "https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7QVR6VXJ5dWExSTQ"
tar -zxf CRF++-0.58.tar.gz
cd CRF++-0.58
./configure && make && make install && ldconfig
cd ..

CaboCha

url="https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7SDd1Q1dUQkZQaUU"
curl -sc /tmp/cookie ${url} >/dev/null
code="$(awk '/_warning_/ {print $NF}' /tmp/cookie)"
curl -sLb /tmp/cookie ${url}"&confirm=${code}" -o cabocha-0.69.tar.bz2
tar -jxf cabocha-0.69.tar.bz2
cd cabocha-0.69
./configure --with-charset=utf-8
make
make check
make install
ldconfig
pip install python/
cd ..
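
Before moving on, it can help to confirm that the three builds actually landed on PATH. A minimal sketch using only the standard library; the tool names (`mecab`, `crf_learn`, `cabocha`) are the binaries these packages normally install, and the helper name `find_tools` is made up for this note:

```python
import os


def find_tools(names, path=None):
    """Map each tool name to its full path, or None when it is not found.

    `path` defaults to the PATH environment variable; it is a parameter
    only so the lookup can be pointed at another directory list.
    """
    search = path if path is not None else os.environ.get("PATH", "")
    dirs = search.split(os.pathsep)
    found = {}
    for name in names:
        found[name] = None
        for d in dirs:
            candidate = os.path.join(d, name)
            # Accept only regular files that are executable.
            if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
                found[name] = candidate
                break
    return found


if __name__ == "__main__":
    for name, location in sorted(find_tools(["mecab", "crf_learn", "cabocha"]).items()):
        print("%-10s %s" % (name, location or "NOT FOUND"))
```

A `NOT FOUND` entry usually means the corresponding `make install` step did not run, or installed somewhere outside PATH.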

Early Middle Japanese UniDic (中古和文UniDic)

wget https://clrd.ninjal.ac.jp/unidic_archive/2203/UniDic-202203_20_chuko.zip
unzip UniDic-202203_20_chuko.zip 
cd 20_chuko/
ls -tl /var/lib/mecab/dic/ipadic-utf8
ln -s /var/lib/mecab/dic/ipadic-utf8/dicrc dicrc
mecab -d ./

Python

import CaboCha
cp = CaboCha.Parser("-d 20_chuko")
print(cp.parseToString("いづれの御時にか、女御、更衣あまたさぶらひたまひけるなかに、いとやむごとなき際にはあらぬが、すぐれて時めきたまふありけり"))

Result:

          いづれの御時にか、-----D        
                        女御、---D        
    更衣あまたさぶらひたまひける-D        
                          なかに、---D    
              いとやむごとなき際には-D    
                            あらぬが、---D
                                すぐれて-D
                      時めきたまふありけり
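
The tree drawing is easy to read but awkward to process. A rough sketch of pulling the same dependencies out as data; it assumes the CaboCha binding exposes `chunk_size()`, `chunk(i)` with `link` / `token_pos` / `token_size`, and `token(j).surface` on the tree returned by `parse()` (check this against your build). The helper name `dependencies` is made up for this note:

```python
def dependencies(tree):
    """Return a list of (chunk_index, chunk_text, head_index) tuples.

    `tree` is expected to be the object returned by CaboCha's
    Parser.parse(); head_index is the chunk this one depends on
    (-1 for the last chunk, which has no head).
    """
    rows = []
    for i in range(tree.chunk_size()):
        chunk = tree.chunk(i)
        start = chunk.token_pos
        # A chunk covers token_size consecutive tokens starting at token_pos.
        surfaces = [tree.token(j).surface
                    for j in range(start, start + chunk.token_size)]
        rows.append((i, "".join(surfaces), chunk.link))
    return rows


# Usage, assuming CaboCha is installed as in the steps above:
#   import CaboCha
#   cp = CaboCha.Parser("-d 20_chuko")
#   for index, text, head in dependencies(cp.parse(sentence)):
#       print("%d -> %d  %s" % (index, head, text))
```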

Configure CaboCha

diff /usr/local/etc/cabocharc /usr/local/etc/cabocharc.bak 
8c8
< # posset = IPA
---
> posset = IPA
11c11
< posset = UNIDIC
---
> # posset = UNIDIC
39c39
< # parser-model  = /usr/local/lib/cabocha/model/dep.ipa.model
---
> parser-model  = /usr/local/lib/cabocha/model/dep.ipa.model
44c44
< parser-model  = /usr/local/lib/cabocha/model/dep.unidic.model
---
> # parser-model  = /usr/local/lib/cabocha/model/dep.unidic.model
48c48
< # chunker-model = /usr/local/lib/cabocha/model/chunk.ipa.model
---
> chunker-model = /usr/local/lib/cabocha/model/chunk.ipa.model
51c51
< chunker-model = /usr/local/lib/cabocha/model/chunk.unidic.model
---
> # chunker-model = /usr/local/lib/cabocha/model/chunk.unidic.model
54c54
< # ne-model = /usr/local/lib/cabocha/model/ne.ipa.model
---
> ne-model = /usr/local/lib/cabocha/model/ne.ipa.model
57c57
< ne-model = /usr/local/lib/cabocha/model/ne.unidic.model
---
> # ne-model = /usr/local/lib/cabocha/model/ne.unidic.model
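
The diff above amounts to: comment out every IPA line and uncomment every UNIDIC line for the `posset`, `parser-model`, `chunker-model` and `ne-model` settings. A sketch of doing that edit in Python instead of by hand; the function name `switch_posset` is made up for this note, and it only knows about the four setting names that appear in the diff:

```python
import re

# The four settings touched in the diff above.
KEYS = ("posset", "parser-model", "chunker-model", "ne-model")
LINE_RE = re.compile(r"^\s*(#\s*)?((?:%s)\s*=\s*(\S+).*)$" % "|".join(KEYS))


def switch_posset(text, target="UNIDIC"):
    """Return cabocharc text with only the `target` lines left active.

    A line belongs to `target` when its value is the name itself
    (posset = UNIDIC) or a model path containing ".unidic.".
    Every other line for the four settings is commented out.
    """
    tag = target.lower()
    out = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            setting, value = m.group(2), m.group(3).lower()
            if value == tag or (".%s." % tag) in value:
                line = setting          # activate
            else:
                line = "# " + setting   # deactivate
        out.append(line)
    return "\n".join(out) + "\n"


# Usage (writing under /usr/local/etc normally needs root):
#   path = "/usr/local/etc/cabocharc"
#   with open(path) as f:
#       new_text = switch_posset(f.read(), "UNIDIC")
#   with open(path, "w") as f:
#       f.write(new_text)
```

Calling it with `target="IPA"` switches the same lines back.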

Python

>>> import CaboCha
>>> cp = CaboCha.Parser("-d 20_chuko -P UNIDIC")
>>> print(cp.parseToString("いづれの御時にか、女御、更衣あまたさぶらひたまひけるなかに、いとやむごとなき際にはあらぬが、すぐれて時めきたまふありけり"))
                    いづれの-D                  
                    御時にか-----------------D
                          女御-D             |
      更衣あまたさぶらひたまひける-D           |
                            なかに-------D   |
                                  いと-D   |   |
                            やむごとなき-D |   |
                                    際には-D   |
                                  あらぬが---D
                                      すぐれて-D
                            時めきたまふありけり
EOS
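
To see how much the segmentation changed between the two runs, the chunks can be recovered from the printed trees. A small sketch that works on the pasted output itself (leading padding trimmed), so it does not need CaboCha; the helper name `chunks_from_tree` is made up for this note:

```python
# -*- coding: utf-8 -*-

def chunks_from_tree(tree_text):
    """Return the chunk strings from CaboCha's tree-format output.

    Strips the drawing characters (spaces, '-', 'D', '|') after each
    chunk and skips the EOS marker. A chunk that itself ends in one of
    those ASCII characters would be cut short.
    """
    chunks = []
    for line in tree_text.splitlines():
        text = line.strip().rstrip("-D| ")
        if text and text != "EOS":
            chunks.append(text)
    return chunks


# First run (cabocharc still at its original settings).
DEFAULT_RESULT = """
いづれの御時にか、-----D
女御、---D
更衣あまたさぶらひたまひける-D
なかに、---D
いとやむごとなき際には-D
あらぬが、---D
すぐれて-D
時めきたまふありけり
"""

# Second run (cabocharc switched to UNIDIC, parser given -P UNIDIC).
UNIDIC_RESULT = """
いづれの-D
御時にか-----------------D
女御-D             |
更衣あまたさぶらひたまひける-D           |
なかに-------D   |
いと-D   |   |
やむごとなき-D |   |
際には-D   |
あらぬが---D
すぐれて-D
時めきたまふありけり
EOS
"""

print(len(chunks_from_tree(DEFAULT_RESULT)))  # 8
print(len(chunks_from_tree(UNIDIC_RESULT)))   # 11
```

With the UNIDIC settings the sentence splits into 11 chunks instead of 8: いづれの御時にか、 and いとやむごとなき際には are each broken into smaller units.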
