Andj andjc

Unicode-aware JavaScript regex (Unicode property escapes `/\p{..}\P{..}/u`) cheat sheet

Browser support MDN

✅ Chrome 64 & Edge 79
✅ Safari 11.1
✅ Firefox 78
✅ nodejs: 10.0
✅ babel

$ ls -al /usr/share/locale/*/LC_COLLATE
lrwxr-xr-x  1 root  wheel    29 11 Jan 18:03 /usr/share/locale/af_ZA.ISO8859-1/LC_COLLATE -> ../la_LN.ISO8859-1/LC_COLLATE
lrwxr-xr-x  1 root  wheel    30 11 Jan 18:03 /usr/share/locale/af_ZA.ISO8859-15/LC_COLLATE -> ../la_LN.ISO8859-15/LC_COLLATE
lrwxr-xr-x  1 root  wheel    29 11 Jan 18:03 /usr/share/locale/af_ZA.UTF-8/LC_COLLATE -> ../la_LN.ISO8859-1/LC_COLLATE
lrwxr-xr-x  1 root  wheel    29 11 Jan 18:03 /usr/share/locale/af_ZA/LC_COLLATE -> ../la_LN.ISO8859-1/LC_COLLATE
lrwxr-xr-x  1 root  wheel    28 11 Jan 18:03 /usr/share/locale/am_ET.UTF-8/LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE
lrwxr-xr-x  1 root  wheel    28 11 Jan 18:03 /usr/share/locale/am_ET/LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE
-r--r--r--  1 root  wheel  2086 11 Jan 18:03 /usr/share/locale/be_BY.CP1131/LC_COLLATE
-r--r--r--  1 root  wheel  2086 11 Jan 18:03 /usr/share/locale/be_BY.CP1251/LC_COLLATE

Snippet at https://github.com/enabling-languages/python-i18n/blob/main/snippets/sort_key_normalise.py

Default Python sorting

If we take two strings that differ only in the Unicode normalisation form they use, would Python sort them the same? The strings éa (U+00E9 U+0061) and éa (U+0065 U+0301 U+0061) are canonically equivalent, but when we sort lists that differ only in the normalisation form of these two strings, we find the sort order is different.

>>> lc = ["za", "éa", "eb", "ba"]
>>> sorted(lc)
['ba', 'eb', 'za', 'éa']
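As a minimal sketch of that comparison, the block below sorts the same four words once in NFC and once in NFD, then sorts both with a normalising key. The helper names `nfd_key` and `as_nfc` are illustrative assumptions, not taken from the linked snippet.

```python
import unicodedata

# The same four words, once precomposed (NFC) and once decomposed (NFD).
nfc_words = [unicodedata.normalize("NFC", w) for w in ["za", "\u00e9a", "eb", "ba"]]
nfd_words = [unicodedata.normalize("NFD", w) for w in nfc_words]


def nfd_key(text):
    """Hypothetical helper: compare strings by their NFD form."""
    return unicodedata.normalize("NFD", text)


def as_nfc(words):
    """Normalise every word to NFC so two result lists can be compared."""
    return [unicodedata.normalize("NFC", w) for w in words]


# Plain code point sorting: the two lists end up in different orders.
print(sorted(nfc_words))
print(sorted(nfd_words))

# With a normalising key, both lists sort into the same order.
print(as_nfc(sorted(nfc_words, key=nfd_key)) == as_nfc(sorted(nfd_words, key=nfd_key)))
```

A normalising key only makes the order consistent; it is still code point order, not the collation a particular language expects.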

For more detailed information refer to Introduction to collation.

Python's list.sort() and sorted() functions are language-invariant and cannot be tailored: they compare strings by code point and give the same results regardless of the collation required by the language of the text. Both accept a key parameter, which can be used to modify the strings before sorting or to target a particular component of an object to sort by. The sort results will be consistent across platforms.
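The two uses of the key parameter can be sketched with the standard library alone. The word list and the `entries` records below are made up for illustration.

```python
from operator import itemgetter

words = ["Zem", "cesta", "Drevo", "rum"]

# Default: code point order, so uppercase letters sort before lowercase ones.
print(sorted(words))                    # ['Drevo', 'Zem', 'cesta', 'rum']

# key modifies each string before comparison; here, case is ignored.
print(sorted(words, key=str.casefold))  # ['cesta', 'Drevo', 'rum', 'Zem']

# key can also pick one component of an object to sort by.
entries = [{"word": "rum", "length": 3}, {"word": "cesta", "length": 5}]
by_word = sorted(entries, key=itemgetter("word"))
print([e["word"] for e in by_word])     # ['cesta', 'rum']
```

Neither key makes the sort language-aware; they only change what is compared.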

The following examples use a random selection of Slovak words.

>>> words = ['zem', 'čučoriedka', 'drevo', 'štebot', 'cesta', 'černice', 'ďateľ', 'rum', 'železo', 'prameň', 'sob']
>>> sorted(words)
['cesta', 'drevo', 'prameň', 'rum', 'sob', 'zem', 'černice', 'čučoriedka', 'ďateľ', 'štebot', 'železo']

Based on Python code posted on LinkedIn by Alekya D.

Using a tweaked version of Alice in Wonderland and the Dinka Padang translation of the UDHR

Refer to gists on graphemes and isalpha

import collections
import regex as re

Casefolding and matching

Default Case Folding

It is common to see the str.lower() method used in Python code when the developer wants to compare or match strings written in bicameral scripts. But lowercasing is not a universal solution. For Cherokee, for instance, default case folding maps to uppercase rather than lowercase.

>>> s = "Ꮒꭶꮣ ꭰꮒᏼꮻ ꭴꮎꮥꮕꭲ ꭴꮎꮪꮣꮄꮣ ꭰꮄ ꭱꮷꮃꭽꮙ ꮎꭲ ꭰꮲꮙꮩꮧ ꭰꮄ ꭴꮒꮂ ꭲᏻꮎꮫꮧꭲ. Ꮎꮝꭹꮎꮓ ꭴꮅꮝꭺꮈꮤꮕꭹ ꭴꮰꮿꮝꮧ ꮕᏸꮅꮫꭹ ꭰꮄ ꭰꮣꮕꮦꮯꮣꮝꮧ ꭰꮄ ꭱꮅꮝꮧ ꮟᏼꮻꭽ ꮒꮪꮎꮣꮫꮎꮥꭼꭹ ꮎ ꮧꮎꮣꮕꮯ ꭰꮣꮕꮩ ꭼꮧ."
>>> sl = s.lower()
>>> su = s.upper()
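A short sketch of why str.lower() and str.casefold() disagree for Cherokee, using the three letters of the word ᏣᎳᎩ. It assumes a Python build whose Unicode data includes lowercase Cherokee (Unicode 8.0 or later, so Python 3.5 or newer).

```python
# U+13E3 U+13B3 U+13A9: the original Cherokee letters, now treated as uppercase.
upper = "\u13e3\u13b3\u13a9"
# str.lower() maps them to the lowercase letters in the U+AB70..U+ABBF block.
lower = upper.lower()

print(lower != upper)                       # lowercasing changes the string
print(upper.casefold() == upper)            # case folding leaves uppercase untouched
print(lower.casefold() == upper)            # and folds lowercase *up* to uppercase
print(upper.lower() == upper.casefold())    # so lower() and casefold() disagree
```

For caseless matching, comparing `a.casefold() == b.casefold()` gives the same answer for both forms, whichever direction the folding goes.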

The Python string method str.isalpha() is sometimes used as a constraint or validator. But how useful is this in code that needs to support multiple languages?

The Python documentation indicates that isalpha() matches any Unicode character with a general category of Lu, Ll, Lt, Lm, or Lo.

Unicode, by contrast, defines an alphabetic character as any character covered by Ll + Other_Lowercase + Lu + Other_Uppercase + Lt + Lm + Lo + Nl + Other_Alphabetic.

So it would be possible for a Unicode regex using \p{Alphabetic} to match characters that isalpha() would not match, although in most practical cases the results would be the same.
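To make the gap concrete, here is a small sketch using only the standard library. It prints the general category and the isalpha() result for two characters that Unicode lists as Alphabetic but that fall outside the five letter categories. The sample characters are my own choice, and the Alphabetic status comes from the Unicode data files, not from anything this code checks.

```python
import unicodedata

samples = {
    "A": "Latin capital letter (Lu)",
    "\u2167": "Roman numeral eight (Nl)",
    "\u24b6": "circled Latin capital letter A (So, Other_Uppercase)",
}

# Collect (code point, general category, isalpha result) for each sample.
report = [
    (f"U+{ord(ch):04X}", unicodedata.category(ch), ch.isalpha())
    for ch in samples
]

for (code, category, alpha), label in zip(report, samples.values()):
    print(code, category, f"isalpha={alpha}", label)
```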

It is interesting to note that the general categories Mn and Mc are not part of the Python or Unicode definition of an alphabetic character. What does this mean in practice?
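One practical consequence can be sketched as follows: ordinary words in scripts that use combining marks fail isalpha(). The validator `is_letters_and_marks` is a hypothetical helper written for this illustration, and whether Mn and Mc should be accepted depends on the application.

```python
import unicodedata


def is_letters_and_marks(text):
    """Hypothetical validator: letters (L*) plus combining marks (Mn, Mc)."""
    return bool(text) and all(
        unicodedata.category(ch).startswith("L")
        or unicodedata.category(ch) in ("Mn", "Mc")
        for ch in text
    )


thai = "\u0e44\u0e01\u0e48"                      # Thai word, ends in a tone mark (Mn)
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"   # Hindi word with vowel signs (Mc) and virama (Mn)

for word in ["drevo", thai, hindi]:
    cats = [unicodedata.category(ch) for ch in word]
    print(word, cats, word.isalpha(), is_letters_and_marks(word))
```

Both the Thai and the Hindi word are rejected by isalpha() because of their marks, while the looser check accepts them.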

	import icu
	thkey = icu.Collator.createInstance(icu.Locale('th_TH')).getSortKey
	words = 'ไก่ ไข่ ก ฮา'.split()
	print(sorted(words, key=thkey)) # ['ก', 'ไก่', 'ไข่', 'ฮา']

	from icu import Locale, UnicodeString
	# loc = Locale.createCanonical("haw_US")
	loc = Locale("haw_US")
	s1 = "ʻōlelo hawaiʻi"
	s2 = "oude ijssel "
	print(UnicodeString(s1).toTitle(loc))
	print(UnicodeString(s2).toTitle(Locale("nl_NL")).trim())

	# -*- coding: utf-8 -*-
	"""Example Google style docstrings.

	This module demonstrates documentation as specified by the `Google Python
	Style Guide`_. Docstrings may extend over multiple lines. Sections are created
	with a section header and a colon followed by a block of indented text.

	Example:
	Examples can be given using either the ``Example`` or ``Examples``
	sections. Sections support any reStructuredText formatting, including