Paul O'Leary McCann polm

@polm
polm / check.py
Created October 12, 2021 10:59
Check differences with degree tokenization changes
# script to test degree tokenization related changes.
# https://github.com/explosion/spaCy/pull/9155
import spacy
langs = ("af am ar az bg bn ca cs da de el en es et eu fa fi fr ga grc gu he hi "
"hr hu hy id is it ja kn ko ky lb lij lt lv mk ml mr nb ne nl pl pt ro "
"ru sa si sk sl sq sr sv ta te th ti tl tn tr tt uk ur vi xx yo zh").split()
check = ("°c °f °k °C °F °K °c. °f. °k. °C. °F. °K. 1°c 1°f 1°k 1°C 1°F 1°K 1°c. "
"1°f. 1°k. 1°C. 1°F. 1°K.").split()
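The `check` list above appears to cover every combination of an optional leading digit, a unit letter in either case, and an optional trailing period. As a rough sketch (the names `PREFIXES`, `SUFFIXES`, `UNITS`, and `degree_cases` are illustrative and not part of the original script), the same strings can be generated rather than typed by hand:

```python
from itertools import product

# Assumed structure of the test strings: optional leading digit,
# degree sign, unit letter (lower or upper case), optional trailing period.
PREFIXES = ("", "1")
SUFFIXES = ("", ".")
UNITS = "cfkCFK"


def degree_cases():
    """Return the degree strings in the same order as the hand-written list."""
    return [
        f"{prefix}°{unit}{suffix}"
        # product() varies the last iterable fastest: units, then suffixes,
        # then prefixes, which matches the order of the literal list.
        for prefix, suffix, unit in product(PREFIXES, SUFFIXES, UNITS)
    ]


cases = degree_cases()
print(len(cases))  # 24
```

Generating the list this way makes it easier to see which combinations are covered and to add another unit or suffix later.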
@polm
polm / fugashi-exe.md
Created July 16, 2021 14:04
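How to build an exe that includes MeCab with Python

How to build an exe that includes MeCab with Python

Tested on Linux.

First, install the required libraries with pip inside a virtualenv. This example uses fugashi, but the steps are the same with mecab-python3.

# MeCab-related packages
pip install fugashi unidic-lite
# Tool that builds the exe
pip install pyinstaller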
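
The preview stops before the build step itself. The following is only a sketch of what a PyInstaller invocation could look like, not the author's command: it assumes a hypothetical entry script `main.py` and assumes that the dictionary files shipped in the `unidic_lite` package can be bundled with PyInstaller's `--collect-data` option (available in recent PyInstaller releases). The helper `pyinstaller_command` is illustrative and only builds the argument list; it does not run anything.

```python
import shlex


def pyinstaller_command(script, data_packages=("unidic_lite",), onefile=True):
    """Build the argument list for a PyInstaller run (nothing is executed)."""
    cmd = ["pyinstaller"]
    if onefile:
        # Produce a single executable instead of a directory.
        cmd.append("--onefile")
    for package in data_packages:
        # Assumption: the dictionary lives in the package's data files,
        # so they must be collected explicitly.
        cmd += ["--collect-data", package]
    cmd.append(script)
    return cmd


# "main.py" is a hypothetical script name used for illustration.
print(shlex.join(pyinstaller_command("main.py")))
```

The resulting list can be passed to `subprocess.run` once the options have been checked against the installed PyInstaller version.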

1. If you use Transformers
The latest version of Transformers does not use mecab-python3 at all. Run this instead:
pip install transformers[ja]
2. If you use NEologd
Your mecab-python3 version is out of date. First, update mecab-python3.
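
Before updating, it can help to confirm which version is actually installed in the active environment. A minimal sketch using only the standard library (the helper name `installed_version` is illustrative):

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string, or None when the package is not installed.
print(installed_version("mecab-python3"))
```

If this prints `None`, the package is not installed in the environment that Python is running from, which usually means the wrong virtualenv is active.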
@polm
polm / chasen.py
Created July 30, 2020 10:19
How to reproduce -Ochasen with mecab-python3
# Specify the -Ochasen output format directly
import MeCab
import ipadic
CHASEN_ARGS = r' -F "%m\t%f[7]\t%f[6]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n"'
CHASEN_ARGS += r' -U "%m\t%m\t%m\t%F-[0,1,2,3]\t\t\n"'
tagger = MeCab.Tagger(ipadic.MECAB_ARGS + CHASEN_ARGS)
print(tagger.parse("図書館にいた事がバレた"))
# Output
24731941 年
15955060 日
13733371 月
7032890 大
6115161 本
5634170 学
5352959 人
4568971 中
4437080 国
4403844 一
@polm
polm / spacy-convert.py
Created April 30, 2020 09:18
GSD CoNLL spaCy prep script
#!/usr/bin/env python3
"""
Convert GSD CoNLL data to a format that the spaCy convert script can use as-is.
There are two main changes:
1. The POS tag format is changed slightly.
old: 名詞-普通名詞-一般
[ 2359.097] (WW) Failed to open protocol names file lib/xorg/protocol.txt
[ 2359.098]
X.Org X Server 1.20.7
X Protocol Version 11, Revision 0
[ 2359.100] Build Operating System: Linux Arch Linux
[ 2359.101] Current Operating System: Linux shougeimaru 5.6.5-arch3-1 #1 SMP PREEMPT Sun, 19 Apr 2020 13:14:25 +0000 x86_64
[ 2359.101] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-linux root=UUID=9342fe9f-1bc1-42e5-aa8e-b4f7d26ce115 rw quiet
[ 2359.103] Build Date: 14 January 2020 07:13:52AM
[ 2359.103]
[ 2359.104] Current version of pixman: 0.38.4
[ 2064.502] (WW) Failed to open protocol names file lib/xorg/protocol.txt
[ 2064.503]
X.Org X Server 1.20.7
X Protocol Version 11, Revision 0
[ 2064.505] Build Operating System: Linux Arch Linux
[ 2064.506] Current Operating System: Linux shougeimaru 5.6.5-arch3-1 #1 SMP PREEMPT Sun, 19 Apr 2020 13:14:25 +0000 x86_64
[ 2064.506] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-linux root=UUID=9342fe9f-1bc1-42e5-aa8e-b4f7d26ce115 rw quiet
[ 2064.508] Build Date: 14 January 2020 07:13:52AM
[ 2064.508]
[ 2064.509] Current version of pixman: 0.38.4
# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig: version 304.43 ([email protected]) Sun Aug 19 21:28:54 PDT 2012
# nvidia-settings: X configuration file generated by nvidia-settings
# nvidia-settings: version 260.19.44 ([email protected]) Sun Feb 27 21:50:27 PST 2011
Section "ServerLayout"
Identifier "Layout0"
Screen 0 "Screen0" 0 0
InputDevice "Keyboard0" "CoreKeyboard"