@nemupm
Last active August 29, 2015 14:06
Make Wikipedia dictionary for MeCab - txt_regular_formatter
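#!/usr/bin/python
# -*- coding:utf-8 -*-
import io
from unicodedata import normalize


def convert_str_to_regular_format(string):
    # NFKC folds compatibility forms (e.g. full-width Latin letters,
    # half-width katakana) into canonical ones; lower() then folds case.
    return normalize('NFKC', string).lower()


# io.open decodes and encodes UTF-8 itself, so this runs on Python 2 and 3.
with io.open("jawiki-latest-all-titles-in-ns0", "r", encoding="utf-8") as f, \
        io.open("jawiki-latest-all-titles-in-ns0_formatted", "w", encoding="utf-8") as w:
    for row in f:
        w.write(convert_str_to_regular_format(row))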
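
To get a feel for what the normalization step does to individual titles, here is a minimal Python 3 sketch. The helper name `normalize_title` and the sample strings are made up for illustration and are not part of the gist; it only assumes the same NFKC-plus-lowercase rule used in the script.

```python
from unicodedata import normalize


def normalize_title(text):
    """NFKC-normalize and lowercase one title (hypothetical helper)."""
    return normalize('NFKC', text).lower()


# Illustrative inputs: full-width Latin, half-width katakana, circled digit.
samples = ['ＭｅＣａｂ', 'ｶﾞｰﾄﾞ', '①']

for s in samples:
    print(repr(s), '->', repr(normalize_title(s)))
```

Folding these variants into one form before building the dictionary means differently typed spellings of the same title should map to a single entry.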