Andj andjc

  • Melbourne, Australia
@andjc
andjc / thaisort.py
Created February 20, 2023 10:44 — forked from korakot/thaisort.py
Thai Sort
import icu
thkey = icu.Collator.createInstance(icu.Locale('th_TH')).getSortKey
words = 'ไก่ ไข่ ก ฮา'.split()
print(sorted(words, key=thkey)) # ['ก', 'ไก่', 'ไข่', 'ฮา']
andjc / 1-unicode-js-regex.md
Created December 22, 2022 08:03 — forked from jakub-g/1-unicode-js-regex.md
Unicode-aware JavaScript regex cheat sheet

Unicode-aware JavaScript regex (Unicode property escapes /\p{..}\P{..}/u) cheat sheet

Browser support MDN

andjc / icu_totitle.py
Last active December 22, 2022 00:13
Titlecasing using PyICU, or to_title(), a wrapper around Python's built-in str.title() method.
from icu import Locale, UnicodeString
# loc = Locale.createCanonical("haw_US")
loc = Locale("haw_US")
s1 = "ʻōlelo hawaiʻi"
s2 = "oude ijssel "
print(UnicodeString(s1).toTitle(loc))
print(UnicodeString(s2).toTitle(Locale("nl_NL")).trim())
andjc / docstrings.py
Created August 31, 2022 09:35 — forked from redlotus/docstrings.py
Google Style Python Docstrings
# -*- coding: utf-8 -*-
"""Example Google style docstrings.

This module demonstrates documentation as specified by the `Google Python
Style Guide`_. Docstrings may extend over multiple lines. Sections are created
with a section header and a colon followed by a block of indented text.

Example:
    Examples can be given using either the ``Example`` or ``Examples``
    sections. Sections support any reStructuredText formatting, including
andjc / macos_lc_collate.md
Last active February 26, 2023 11:44
LC_COLLATE on macOS
$ ls -al /usr/share/locale/*/LC_COLLATE
lrwxr-xr-x  1 root  wheel    29 11 Jan 18:03 /usr/share/locale/af_ZA.ISO8859-1/LC_COLLATE -> ../la_LN.ISO8859-1/LC_COLLATE
lrwxr-xr-x  1 root  wheel    30 11 Jan 18:03 /usr/share/locale/af_ZA.ISO8859-15/LC_COLLATE -> ../la_LN.ISO8859-15/LC_COLLATE
lrwxr-xr-x  1 root  wheel    29 11 Jan 18:03 /usr/share/locale/af_ZA.UTF-8/LC_COLLATE -> ../la_LN.ISO8859-1/LC_COLLATE
lrwxr-xr-x  1 root  wheel    29 11 Jan 18:03 /usr/share/locale/af_ZA/LC_COLLATE -> ../la_LN.ISO8859-1/LC_COLLATE
lrwxr-xr-x  1 root  wheel    28 11 Jan 18:03 /usr/share/locale/am_ET.UTF-8/LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE
lrwxr-xr-x  1 root  wheel    28 11 Jan 18:03 /usr/share/locale/am_ET/LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE
-r--r--r--  1 root  wheel  2086 11 Jan 18:03 /usr/share/locale/be_BY.CP1131/LC_COLLATE
-r--r--r--  1 root  wheel  2086 11 Jan 18:03 /usr/share/locale/be_BY.CP1251/LC_COLLATE
andjc / normalisation_sorting.md
Last active July 12, 2022 01:36
Unicode normalisation and default Python sorting

Snippet at https://github.com/enabling-languages/python-i18n/blob/main/snippets/sort_key_normalise.py

Default Python sorting

If we take two strings that differ only in the Unicode normalisation form they use, would Python sort them the same? The strings éa (00E9 0061) and éa (0065 0301 0061) are canonically equivalent, but when we sort lists that differ only in the normalisation form of these two strings, we find that the sort order is different.

>>> lc = ["za", "éa", "eb", "ba"]
>>> sorted(lc)
['ba', 'eb', 'za', 'éa']
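The comparison described above can be sketched with the standard library alone. This is a minimal illustration, not the linked snippet: the names `nfc_list`, `nfd_list` and `nfd_key` are made up for the example, and the strings are written with escapes so the normalisation form is unambiguous.

```python
import unicodedata

# Same four strings; only the normalisation form of "éa" differs.
nfc_list = ["za", "\u00e9a", "eb", "ba"]    # precomposed é (U+00E9)
nfd_list = ["za", "e\u0301a", "eb", "ba"]   # e + combining acute (U+0065 U+0301)

# Default sorting compares code points, so the two lists sort differently.
print(sorted(nfc_list))  # U+00E9 sorts after "z"
print(sorted(nfd_list))  # "e" + U+0301 sorts between "eb" and "za"

def nfd_key(s):
    """Illustrative sort key: compare every string in one normalisation form."""
    return unicodedata.normalize("NFD", s)

# With a normalising key, both lists end up in the same relative order.
order_nfc = [nfd_key(s) for s in sorted(nfc_list, key=nfd_key)]
order_nfd = [nfd_key(s) for s in sorted(nfd_list, key=nfd_key)]
print(order_nfc == order_nfd)
```

Normalising in the key makes the order consistent across normalisation forms, but it is still code point order and not language-aware collation.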
andjc / python_sorting.md
Last active March 19, 2022 04:38
Language-tailored sorting in Python

For more detailed information refer to Introduction to collation.

Python's list.sort() method and sorted() function are language invariant and cannot be tailored. They give the same results regardless of the collation required by the language of the text. Both accept a key parameter, which can be used to transform the strings before they are compared or to select the component of an object to sort on. The sort results will be consistent across platforms.

The following examples use a random selection of Slovak words.

>>> words = ['zem', 'čučoriedka', 'drevo', 'štebot', 'cesta', 'černice', 'ďateľ', 'rum', 'železo', 'prameň', 'sob']
>>> sorted(words)
['cesta', 'drevo', 'prameň', 'rum', 'sob', 'zem', 'černice', 'čučoriedka', 'ďateľ', 'štebot', 'železo']
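The two uses of the key parameter mentioned above might look like the following. This is a small sketch with a shortened word list; the mixed capitalisation and the counts in `pairs` are invented for illustration.

```python
from operator import itemgetter

words = ["zem", "cesta", "Drevo", "\u010dernice"]  # last word is "černice"

# Default order is by code point: uppercase "D" first, "č" (U+010D) after "z".
print(sorted(words))

# key= transforms each string before comparison. Case differences are folded
# away here, but "č" still sorts after "z": no language tailoring is applied.
print(sorted(words, key=str.casefold))

# key= can also select one component of a larger object, such as the word
# in a (word, count) pair. The counts are made-up values.
pairs = [("zem", 2), ("cesta", 7), ("drevo", 1)]
print(sorted(pairs, key=itemgetter(0)))
```

Neither use changes the underlying comparison, which remains code point order on whatever the key returns.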
andjc / letter_frequency.md
Last active March 17, 2022 10:58
Letter frequency of text
andjc / casefolding_matching.md
Last active January 28, 2024 05:13
Unicode casefolding and matching

Casefolding and matching

Default Case Folding

It is common to see the str.lower() method used in Python code when the developer wants to compare or match strings written in bicameral scripts. But lowercasing is not a universally valid approach. For instance, the default case for Cherokee is uppercase rather than lowercase.

>>> s = "Ꮒꭶꮣ ꭰꮒᏼꮻ ꭴꮎꮥꮕꭲ ꭴꮎꮪꮣꮄꮣ ꭰꮄ ꭱꮷꮃꭽꮙ ꮎꭲ ꭰꮲꮙꮩꮧ ꭰꮄ ꭴꮒꮂ ꭲᏻꮎꮫꮧꭲ. Ꮎꮝꭹꮎꮓ ꭴꮅꮝꭺꮈꮤꮕꭹ ꭴꮰꮿꮝꮧ ꮕᏸꮅꮫꭹ ꭰꮄ ꭰꮣꮕꮦꮯꮣꮝꮧ ꭰꮄ ꭱꮅꮝꮧ ꮟᏼꮻꭽ ꮒꮪꮎꮣꮫꮎꮥꭼꭹ ꮎ ꮧꮎꮣꮕꮯ ꭰꮣꮕꮩ ꭼꮧ."
>>> sl = s.lower()
>>> su = s.upper()
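The difference can be seen on a single letter pair. The sketch below assumes a Python build whose Unicode data includes the Cherokee lowercase letters (Unicode 8.0 or later, which should hold for Python 3.5 and up); `caseless_equal` is a hypothetical helper name, not part of the gist.

```python
small_a = "\uab70"    # CHEROKEE SMALL LETTER A
capital_a = "\u13a0"  # CHEROKEE LETTER A (the uppercase form)

# lower() maps the capital to the small letter, as it would for Latin ...
print(capital_a.lower() == small_a)

# ... but casefold() maps the small letter to the capital and leaves the capital alone.
print(small_a.casefold() == capital_a)
print(capital_a.casefold() == capital_a)

def caseless_equal(a, b):
    """Hypothetical helper: compare two strings after default case folding."""
    return a.casefold() == b.casefold()

print(caseless_equal(small_a, capital_a))
print(caseless_equal("STRASSE", "stra\u00dfe"))  # second argument is "straße"
```

Using casefold() on both sides avoids having to know which case a script folds to.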
andjc / isalpha.md
Last active December 22, 2022 00:18
Python's str.isalpha()

The Python string method str.isalpha() is sometimes used as a constraint or validator. But how useful is this in code that needs to support multiple languages?

The Python documentation indicates that isalpha() matches any Unicode character with a general category of Lu, Ll, Lt, Lm, or Lo.

Unicode, by contrast, defines an alphabetic character as any Unicode character with a category of Ll + Other_Lowercase + Lu + Other_Uppercase + Lt + Lm + Lo + Nl + Other_Alphabetic.

So it would be possible for a Unicode regex using \p{Alphabetic} to match characters that isalpha() would not match, although in most practical cases the results would be the same.
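One such character can be checked with the standard library. This sketch uses U+2167 ROMAN NUMERAL EIGHT, whose general category is Nl; the constant `ISALPHA_CATEGORIES` is an illustrative name. The \p{Alphabetic} side is not shown, because the standard `re` module does not support Unicode property escapes.

```python
import unicodedata

# The five general categories that the Python documentation lists for isalpha().
ISALPHA_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}

roman_eight = "\u2167"  # ROMAN NUMERAL EIGHT, general category Nl

print(unicodedata.category(roman_eight))                        # Nl
print(unicodedata.category(roman_eight) in ISALPHA_CATEGORIES)  # False
print(roman_eight.isalpha())                                    # False
```

Since Nl is part of the Unicode definition quoted above, this is a character that is alphabetic for Unicode but not for isalpha().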

It is interesting to note that the general categories Mn and Mc are not part of the Python or Unicode definition of an alphabetic character. What does this mean in practice?
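A quick way to explore the question is to look at strings that contain combining marks. The sketch below uses two sample strings chosen for illustration: a decomposed "éa" and the Devanagari spelling of "Hindi", written with escapes so the code points are explicit.

```python
import unicodedata

decomposed = "e\u0301a"  # e + COMBINING ACUTE ACCENT (Mn) + a
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"  # Devanagari word with Mc and Mn signs

# Print the general category of each code point, then the isalpha() result.
for text in (decomposed, hindi):
    categories = [unicodedata.category(ch) for ch in text]
    print(categories, text.isalpha())

# Composing the accent into a single Ll code point changes the answer.
print(unicodedata.normalize("NFC", decomposed).isalpha())
```

A single Mn or Mc code point is enough to make isalpha() return False for the whole string, so ordinary words in scripts that rely on combining marks would fail such a validator.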