from icu import RuleBasedNumberFormat, URBNFRuleSetTag, Locale
n = 111000
formatter = RuleBasedNumberFormat(URBNFRuleSetTag.SPELLOUT, Locale("hi"))
formatter.format(n) # 'एक लाख ग्यारह हज़ार'
formatter2 = RuleBasedNumberFormat(URBNFRuleSetTag.SPELLOUT, Locale("en_IN"))
formatter2.format(n) # 'one lakh eleven thousand'
import unicodedataplus as ud
import regex as re
def convert_digits(s, sep = (",", ".")):
    nd = re.compile(r'^-?\p{Nd}[,.\u066B\u066C\u0020\u2009\p{Nd}]*$')
    tsep, dsep = sep
    if nd.match(s):
        s = s.replace(tsep, "")
        s = ''.join([str(ud.decimal(c, c)) for c in s])
        if dsep in s:
When working with tokenisation and break iterators, it is sometimes necessary to work at the character, syllable, line, or sentence level. Character-level tokenisation is an interesting case. By character, I mean a user-perceived unit of text, which the Unicode Standard refers to as a grapheme (or, more precisely, a grapheme cluster). The usual way I see developers handle character-level tokenisation of English is with a list comprehension or by typecasting a string to a list:
>>> t1 = "transformation"
>>> [char for char in t1]
['t', 'r', 'a', 'n', 's', 'f', 'o', 'r', 'm', 'a', 't', 'i', 'o', 'n']
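The approach above splits on code points, which matches user-perceived characters only when each one is a single code point. As a rough illustration of the difference, the sketch below groups combining marks with the preceding character using only the standard library. The helper name `rough_graphemes` is hypothetical, and this is a simplification rather than the full Unicode segmentation rules (UAX #29); it ignores, for example, emoji sequences and Hangul jamo.

```python
import unicodedata

def rough_graphemes(text):
    """Attach each combining mark (general category M*) to the preceding character.

    Illustration only: an assumption-laden shortcut, not full UAX #29 segmentation.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            # Combining mark: extend the current cluster.
            clusters[-1] += ch
        else:
            # Any other character starts a new cluster.
            clusters.append(ch)
    return clusters

t2 = "\u0939\u093f\u0928\u094d\u0926\u0940"  # the Hindi word "हिन्दी"
code_points = list(t2)           # naive split, one item per code point
clusters = rough_graphemes(t2)   # marks grouped with their base characters
print(len(code_points), len(clusters))  # prints: 6 3
```

For real text, a library that implements UAX #29, such as the `\X` pattern in the `regex` module or ICU's character break iterator, is likely a safer choice than this shortcut.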
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install yaz
yaz-client will then be available. You can use it like this:
Run the program and you get a Z> prompt.
!wget https://github.com/Phonbopit/sarabun-webfont/raw/master/fonts/thsarabunnew-webfont.ttf
# !pip install -U --pre matplotlib
import matplotlib as mpl
import matplotlib.font_manager  # makes mpl.font_manager available
mpl.font_manager.fontManager.addfont('thsarabunnew-webfont.ttf') # 3.2+
mpl.rc('font', family='TH Sarabun New')
import icu
# below from http://stackoverflow.com/questions/13838405/custom-sorting-in-pandas-dataframe
df = token_count
locale = 'UR.UTF-8'
collator = icu.Collator.createInstance(icu.Locale(locale))
def sort_pd(key=None,reverse=False,cmp=None):
    def sorter(series):
libicu-dev (PyICU dependency)
brew uninstall --ignore-dependencies icu4c
brew install pkg-config icu4c # keg-only
Most Japanese websites use the default font sets provided on Windows, Mac, or Ubuntu. The latest ones are Meiryo, Hiragino Kaku Gothic Pro, and Noto. For older systems such as Windows XP, it is good to add the former default fonts, MS Gothic (or MS Mincho) and Osaka. Older Linux versions may include the Takao fonts.
Some old browsers do not understand these font names in English, and others do not recognise the names in Japanese, so it is safest to write each name in both Japanese and English.
Hiragino is listed before Meiryo because Mac users may have Meiryo installed through MS Office, yet Hiragino is more familiar and renders better on the Mac, so it is better to start with the Hiragino series.
So the current recommended practice looks like this:
font-family: "ヒラギノ角ゴ Pro W3", "Hiragino Kaku Gothic Pro", Osaka, メイリオ, Meiryo, "MS Pゴシック", "MS PGothic", "MS ゴシック", "MS Gothic", "Noto Sans CJK JP", TakaoPGothic, sans-serif;