IKEGAMI Yukino (ikegami-yukino)
@ikegami-yukino
ikegami-yukino / file0.txt
Last active April 9, 2016 08:23
Pure Python port of the online morphological analyzer Rakuten MA. ref: http://qiita.com/yukinoi/items/925bc238185aa2fad8a7
from rakutenma import RakutenMA
rma = RakutenMA(phi=1024, c=0.007812)
rma.load("model_ja.json")
rma.hash_func = rma.create_hash_func(15)
print(rma.tokenize("うらにわにはにわにわとりがいる"))
# annotated sentence completed from the Rakuten MA README example
print(rma.train_one(
    [["うらにわ","N-nc"],
     ["に","P-k"],
     ["は","P-rj"],
     ["にわ","N-n"],
     ["にわとり","N-nc"],
     ["が","P-k"],
     ["いる","V-c"]]))
@ikegami-yukino
ikegami-yukino / longest_contiguous_common_subsequence.py
Created October 15, 2014 09:59
Longest Contiguous Common Subsequence
def to_ngrams(s, minimum_n):
    """Generate n-grams (len(s) >= n >= minimum_n) from a string
    Params:
        <str> s
        <int> minimum_n
    Return:
        <set <str>> ngrams
    """
    ngrams = set()
    length = len(s)
    for n in xrange(minimum_n, length + 1):
        for start in xrange(length - n + 1):
            ngrams.add(s[start:start + n])
    return ngrams
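The preview ends before the function that gives the gist its name. Assuming to_ngrams is the helper, the longest contiguous common subsequence (i.e. the longest common substring) can be obtained by intersecting the two n-gram sets — a self-contained Python 3 reconstruction, not the gist's own code:

```python
def to_ngrams(s, minimum_n=1):
    """All substrings of s with length >= minimum_n."""
    return {s[i:i + n]
            for n in range(minimum_n, len(s) + 1)
            for i in range(len(s) - n + 1)}

def longest_contiguous_common_subsequence(a, b):
    """Longest substring shared by a and b ('' if none)."""
    common = to_ngrams(a) & to_ngrams(b)
    return max(common, key=len) if common else ''

print(longest_contiguous_common_subsequence('うらにわ', 'にわとり'))  # にわ
```

The set intersection is quadratic in string length per side, so a DP or suffix-automaton approach is the usual choice for long strings.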
@ikegami-yukino
ikegami-yukino / vim_pig.sh
Last active August 29, 2015 14:07
Pig Latin syntax coloring for Vim
#!/bin/sh
git clone https://github.com/motus/pig.vim.git /tmp/pig.vim
mkdir -p ~/.vim/syntax/
mkdir -p ~/.vim/ftdetect/
cp /tmp/pig.vim/syntax/pig.vim ~/.vim/syntax/
cp /tmp/pig.vim/ftdetect/pig.vim ~/.vim/ftdetect/
rm -r /tmp/pig.vim
@ikegami-yukino
ikegami-yukino / pig_.sh
Last active August 29, 2015 14:04
Apache Pig Installation on Ubuntu
wget http://ftp.kddilabs.jp/infosystems/apache/pig/latest/pig-0.13.0.tar.gz
tar -xvf pig-0.13.0.tar.gz
sudo mv pig-0.13.0 /usr/local/pig
rm pig-0.13.0.tar.gz
echo 'export PIG_HOME=/usr/local/pig' >> ~/.bashrc
echo 'export PATH=$PATH:$PIG_HOME/bin' >> ~/.bashrc
echo 'export PIG_CLASSPATH=$HADOOP_HOME/conf/' >> ~/.bashrc
source ~/.bashrc
pig -h
@ikegami-yukino
ikegami-yukino / madoka_bayes.py
Last active August 29, 2015 13:57
Standard Naive Bayes and Complement Naive Bayes using madoka
# -*- coding: utf-8 -*-
import numpy as np
from collections import Counter, defaultdict
import madoka

NUM_DOCS_INDEX = '[[NUM_DOCS]]'
ALL_WORD_INDEX = '[[ALL]]'

class TFIDF(object):
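The preview cuts off at the TFIDF class. As a self-contained illustration of the complement naive Bayes idea named in the title — a sketch using only collections.Counter, not the gist's madoka-backed counters; the class and method names are mine:

```python
import math
from collections import Counter, defaultdict

class ComplementNB(object):
    """Multinomial complement naive Bayes (Rennie et al., 2003):
    score each class by how poorly the *other* classes explain the
    document, which is more robust under skewed class sizes."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                    # additive smoothing
        self.counts = defaultdict(Counter)    # class -> term frequencies
        self.vocab = set()

    def fit(self, docs, labels):
        for doc, label in zip(docs, labels):
            self.counts[label].update(doc)
            self.vocab.update(doc)

    def predict(self, doc):
        tf = Counter(doc)
        scores = {}
        for c in self.counts:
            comp = Counter()                  # counts of every class but c
            for other, cnts in self.counts.items():
                if other != c:
                    comp.update(cnts)
            denom = sum(comp.values()) + self.alpha * len(self.vocab)
            # high score = the complement classes fit the document badly
            scores[c] = -sum(f * math.log((comp[t] + self.alpha) / denom)
                             for t, f in tf.items())
        return max(scores, key=scores.get)
```

For example, after fitting on [['good','great','fine'], ['bad','awful','poor']] with labels ['pos','neg'], predict(['awful']) returns 'neg'.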
@ikegami-yukino
ikegami-yukino / levenshtein.py
Created February 5, 2014 07:02
Weighted Levenshtein Distance
def weighted_levenshtein(a, b, insert=1, delete=1, substitute=1):
    len_a = len(a)
    len_b = len(b)
    m = [[0] * (len_b + 1) for i in xrange(len_a + 1)]
    for i in xrange(len_a + 1):
        m[i][0] = i * delete
    for j in xrange(len_b + 1):
        m[0][j] = j * insert
    for i in xrange(1, len_a + 1):
        for j in xrange(1, len_b + 1):
            cost = 0 if a[i - 1] == b[j - 1] else substitute
            m[i][j] = min(m[i - 1][j] + delete,
                          m[i][j - 1] + insert,
                          m[i - 1][j - 1] + cost)
    return m[len_a][len_b]
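As a sanity check, a self-contained Python 3 restatement of the same DP (range instead of xrange) reproduces the classic kitten → sitting results:

```python
def weighted_levenshtein(a, b, insert=1, delete=1, substitute=1):
    # m[i][j] = cost to turn a[:i] into b[:j]
    m = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        m[i][0] = i * delete
    for j in range(len(b) + 1):
        m[0][j] = j * insert
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else substitute
            m[i][j] = min(m[i - 1][j] + delete,
                          m[i][j - 1] + insert,
                          m[i - 1][j - 1] + cost)
    return m[len(a)][len(b)]

print(weighted_levenshtein('kitten', 'sitting'))                # 3
print(weighted_levenshtein('kitten', 'sitting', substitute=2))  # 5
```

With substitute=2 a substitution costs the same as a delete plus an insert, so the distance degenerates to the indel (LCS-based) distance.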
@ikegami-yukino
ikegami-yukino / propensity.py
Created January 26, 2014 17:04
Convert the fetish list into JSON for a synonym dictionary https://dl.dropboxusercontent.com/u/49326509/Propensity.txt
# -*- coding: utf-8 -*-
import json
import fileinput
import re
import codecs

'''
Convert the fetish list into JSON for a synonym dictionary
https://dl.dropboxusercontent.com/u/49326509/Propensity.txt
'''
@ikegami-yukino
ikegami-yukino / 2ch_regex.py
Created January 8, 2014 15:44
Regular expression for detecting 2channel and 2channel-aggregator sites (matches post timestamps like 2014/01/08(水) 12:30:37.67 ID:+pyxCrmX0)
import re
re_2ch_post = re.compile(u'[^2]2[0-9]{3}/\d{2}/\d{2}\(.\) \d{2}:\d{2}:\d{2}\.\d{2} ID:[\w\d\+\/]+')
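The pattern can be exercised against a line in the format the description cites (raw string instead of the Python 2 u'' literal; the post line below is an invented example):

```python
import re

re_2ch_post = re.compile(
    r'[^2]2[0-9]{3}/\d{2}/\d{2}\(.\) \d{2}:\d{2}:\d{2}\.\d{2} ID:[\w\d\+\/]+')

post = '1 :名無しさん:2014/01/08(水) 12:30:37.67 ID:+pyxCrmX0'
print(bool(re_2ch_post.search(post)))  # True
```

Note the leading [^2] consumes one preceding character, so the pattern will not match a timestamp at the very start of a string.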
@ikegami-yukino
ikegami-yukino / nfkc_compare.txt
Created December 30, 2013 19:32
List of characters rewritten by Python's unicodedata.normalize('NFKC')
# -*- coding: utf-8 -*-
import unicodedata

for unicode_id in xrange(65536):
    char = unichr(unicode_id)
    normalized_char = unicodedata.normalize('NFKC', char)
    if char != normalized_char:
        if len(normalized_char) == 1:
            print u'[%d] %s -> [%d] %s' % (unicode_id, char, ord(normalized_char), normalized_char)
        else:
            print u'[%d] %s -> %s' % (unicode_id, char, normalized_char)
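Under Python 3, where unichr and xrange are gone, the same scan can be sketched as follows (the function name is mine):

```python
import unicodedata

def nfkc_changed(limit=0x10000):
    """Yield (codepoint, original, normalized) for every character
    below `limit` that NFKC normalization rewrites."""
    for cp in range(limit):
        ch = chr(cp)
        norm = unicodedata.normalize('NFKC', ch)
        if ch != norm:
            yield cp, ch, norm

# e.g. U+00A0 (no-break space) -> ' ', U+00B2 (²) -> '2'
for cp, ch, norm in nfkc_changed(0x100):
    print('[%d] %s -> %s' % (cp, ch, norm))
```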
@ikegami-yukino
ikegami-yukino / wikipedia_anob.py
Created December 19, 2013 16:48
List of suffixes for filtering out unwanted Wikipedia headwords
#! /usr/bin/env python
# -*- coding: utf-8 -*-
from collections import Counter
import re
suffixes = Counter()
re_anob = re.compile(u'(?P<A>.+[^の])の(?P<B>[^の].+)')
re_hiragana = re.compile(u'[ぁ-ゖ]+')
def extract_anob(text):
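The preview stops at the extract_anob definition. Here is one self-contained Python 3 guess at the "AのB" split that the regex encodes — the function body is my reconstruction, not the gist's:

```python
import re

# headwords of the form "A の B", where neither side starts/ends with の
re_anob = re.compile(u'(?P<A>.+[^の])の(?P<B>[^の].+)')

def extract_anob(text):
    """Return (A, B) for a headword shaped like 'AのB', else None."""
    m = re_anob.match(text)
    return (m.group('A'), m.group('B')) if m else None

print(extract_anob('日本の歴史'))  # ('日本', '歴史')
```

Note that both A and B must be at least two characters long under this pattern, so e.g. a single-character head before の will not match.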