Andj andjc

Spellout numbers

from icu import RuleBasedNumberFormat, URBNFRuleSetTag, Locale
n = 111000
formatter = RuleBasedNumberFormat(URBNFRuleSetTag.SPELLOUT, Locale("hi"))
formatter.format(n)   # 'एक लाख ग्यारह हज़ार'
formatter2 = RuleBasedNumberFormat(URBNFRuleSetTag.SPELLOUT, Locale("en_IN"))
formatter2.format(n)  # 'one lakh eleven thousand'

Grapheme tokenisation in Python

When working with tokenisation and break iterators, it is sometimes necessary to work at the character, syllable, line, or sentence level. Character-level tokenisation is an interesting case. By character, I mean a user-perceived unit of text, which the Unicode Standard calls a grapheme cluster. The usual way I see developers handle character-level tokenisation of English is with a list comprehension or by converting a string to a list:

>>> t1 = "transformation"
>>> [char for char in t1]
['t', 'r', 'a', 'n', 's', 'f', 'o', 'r', 'm', 'a', 't', 'i', 'o', 'n']
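Iterating over a string yields code points, which match user-perceived characters for plain English but not once combining marks are involved. The following is a minimal sketch, assuming only the standard library: the helper `simple_graphemes` is a hypothetical name, and it only attaches combining marks to the preceding base character. It is a simplification and does not implement the full Unicode grapheme cluster rules (UAX #29); a break iterator is the more complete tool.

```python
import unicodedata

def simple_graphemes(text):
    """Group each base character with the combining marks that follow it.

    Simplification: only characters in the mark categories (Mn, Mc, Me)
    are attached to the previous cluster; full UAX #29 rules are not applied.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            clusters[-1] += ch   # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

word = "cafe\u0301"                 # 'e' followed by COMBINING ACUTE ACCENT
print(len(list(word)))              # 5 code points
print(len(simple_graphemes(word)))  # 4 clusters
```

For text without combining marks, such as "transformation", both approaches give the same list.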

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

brew install yaz

yaz-client will then be available and you can use it like this:

Run the program and you get a Z> prompt.

Installing PyICU on Mac

Install icu4c and get the version

$ brew install icu4c

Do what Homebrew tells you to do: set the necessary environment variables.
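As a rough illustration of what those variables usually look like, here is a small sketch that builds the environment for a build step. The helper name `icu_build_env` is hypothetical, and the choice of `PATH` and `PKG_CONFIG_PATH` is an assumption based on the usual keg-only layout (`bin`, `sbin`, `lib/pkgconfig` under the icu4c prefix); the exact prefix is whatever `brew --prefix icu4c` reports on your machine, and Homebrew's own message takes precedence.

```python
import os
import posixpath

def icu_build_env(prefix, base_env=None):
    """Return a copy of the environment with icu4c's tools and .pc files visible.

    `prefix` is the icu4c install prefix (an assumption: take it from
    `brew --prefix icu4c`). Paths are joined POSIX-style, since this targets macOS.
    """
    env = dict(os.environ if base_env is None else base_env)

    # Put icu4c's bin and sbin first so icu-config and friends are found.
    path_parts = [posixpath.join(prefix, "bin"), posixpath.join(prefix, "sbin")]
    if env.get("PATH"):
        path_parts.append(env["PATH"])
    env["PATH"] = ":".join(path_parts)

    # Let pkg-config find the keg-only icu4c .pc files.
    pc_parts = [env["PKG_CONFIG_PATH"]] if env.get("PKG_CONFIG_PATH") else []
    pc_parts.append(posixpath.join(prefix, "lib", "pkgconfig"))
    env["PKG_CONFIG_PATH"] = ":".join(pc_parts)
    return env
```

The resulting dictionary can be passed as `env=` to `subprocess.run` when invoking pip, which keeps the change local to that one command.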

Installing PyICU, libpostal, pypostal on Mac OS X 10.14+

libicu-dev (PyICU dependency)

brew uninstall --ignore-dependencies icu4c
brew install pkg-config icu4c  # keg-only

PyICU cheat sheet

Because you can't get the docs.

Transliteration

Create a transliterator:

import icu
greek2latin = icu.Transliterator.createInstance('Greek-Latin')

Most Japanese websites use the default font sets provided on Windows, Mac or Ubuntu. The latest ones are Meiryo, Hiragino Kaku Gothic Pro and Noto. For older systems such as Windows XP, it is good to add the former default fonts MS Gothic (or MS Mincho) and Osaka. Older Linux versions may include the Takao fonts.

Some old browsers do not understand these font names in English, and others do not recognize the names in Japanese, so it is safest to write each name in both Japanese and English.

As for the order of Meiryo and Hiragino: Mac users may have Meiryo installed by MS Office, but Hiragino is more familiar and renders better on a Mac, so it is better to start with the Hiragino series.

So the current recommended practice is like this:

font-family: "ヒラギノ角ゴ Pro W3", "Hiragino Kaku Gothic Pro", Osaka, メイリオ, Meiryo, "ＭＳ Ｐゴシック", "MS PGothic", "ＭＳ ゴシック", "MS Gothic", "Noto Sans CJK JP", TakaoPGothic, sans-serif;

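	import unicodedataplus as ud
	import regex as re

	def convert_digits(s, sep=(",", ".")):
	    # Accept an optional minus sign, then decimal digits from any script,
	    # mixed with common thousands/decimal separators.
	    nd = re.compile(r'^-?\p{Nd}[,.\u066B\u066C\u0020\u2009\p{Nd}]*$')
	    tsep, dsep = sep
	    if nd.match(s):
	        s = s.replace(tsep, "")
	        # Map each decimal digit to its ASCII value; other characters pass through.
	        s = ''.join([str(ud.decimal(c, c)) for c in s])
	        if dsep in s:
	            # The source breaks off here; this ending is an assumed completion.
	            return float(s.replace(dsep, "."))
	        return int(s)
	    return None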
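The digit-mapping step in the function above can be tried on its own. This is a minimal sketch, assuming the standard library's `unicodedata` in place of `unicodedataplus` (the two may differ in which Unicode version they support); `to_ascii_digits` is a hypothetical helper name.

```python
import unicodedata

def to_ascii_digits(s):
    """Replace every Unicode decimal digit with its ASCII equivalent.

    unicodedata.decimal(c, c) returns the digit's integer value, or the
    character itself when it is not a decimal digit, so separators and
    signs pass through unchanged.
    """
    return "".join(str(unicodedata.decimal(c, c)) for c in s)

print(to_ascii_digits("\u0661\u0662\u0663"))  # Arabic-Indic digits -> 123
print(to_ascii_digits("\u0967\u0968\u0969"))  # Devanagari digits   -> 123
print(to_ascii_digits("-1,5"))                # non-digits are kept -> -1,5
```

Separator handling and the conversion to `int` or `float` are left out here, since they depend on the locale conventions passed in `sep`.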

	!wget https://github.com/Phonbopit/sarabun-webfont/raw/master/fonts/thsarabunnew-webfont.ttf
	# !pip install -U --pre matplotlib
	import matplotlib as mpl
	import matplotlib.font_manager  # make sure the font_manager submodule is loaded
	mpl.font_manager.fontManager.addfont('thsarabunnew-webfont.ttf')  # addfont requires matplotlib 3.2+
	mpl.rc('font', family='TH Sarabun New')

	import icu  # the PyICU package is imported as `icu`, which the code below uses

	# below from http://stackoverflow.com/questions/13838405/custom-sorting-in-pandas-dataframe

	df = token_count
	locale = 'UR.UTF-8'
	collator = icu.Collator.createInstance(icu.Locale(locale))

	def sort_pd(key=None, reverse=False, cmp=None):
	    def sorter(series):