Andj andjc

Greek case mapping is asymmetric:

Python's case mapping uses language- and locale-insensitive full case mappings. Greek requires language-sensitive mappings, which are defined in CLDR transformations. These build on the data in UnicodeData.txt and SpecialCasing.txt.
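To make the asymmetry concrete, here is a small sketch that uses only Python's built-in `str.upper()` and `str.lower()`. The helper name `case_round_trip` is made up for illustration. It shows that built-in uppercasing keeps the tonos, and that a lone final sigma does not survive an upper/lower round trip.

```python
def case_round_trip(text: str) -> str:
    """Uppercase then lowercase with Python's locale-insensitive mappings."""
    return text.upper().lower()


word = "άσος"

# The built-in mapping keeps the tonos on the capital alpha.
upper_word = word.upper()
print(upper_word)

# Inside a word, the final sigma is restored when lowercasing.
print(case_round_trip(word) == word)

# A final sigma on its own uppercases to Σ, which lowercases to σ, not ς.
lone_final_sigma = "ς"
print(case_round_trip(lone_final_sigma))
```

Greek typographic practice usually drops the tonos in all-caps text, which is the language-sensitive behaviour the built-in methods do not provide.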

The PyICU module supports Greek case mapping:

import icu
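A minimal sketch of what such a call might look like, assuming PyICU is installed and importable as `icu`. The helper name `greek_upper` is illustrative, and the import is guarded so the snippet still loads where PyICU is missing.

```python
try:
    import icu  # PyICU; assumed to be installed
except ImportError:
    icu = None


def greek_upper(text, locale="el"):
    """Uppercase text with ICU's locale-sensitive mapping.

    Returns None when PyICU is not available.
    """
    if icu is None:
        return None
    # UnicodeString.toUpper() takes a Locale and applies the tailored mapping.
    return str(icu.UnicodeString(text).toUpper(icu.Locale(locale)))


result = greek_upper("άσος")
print(result)
```

With the Greek locale, recent ICU releases are expected to drop the tonos when uppercasing, unlike Python's built-in `str.upper()`.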

MARC-8 Code Tables:

codetables.xml – MARC-8 to Unicode XML mapping
eacc2uni.txt – MARC-8 to Unicode comma-delimited mapping file (EACC characters only)

	###
	# Downloads and parses https://lh.2xlibre.net/locales/ into a
	# JSON file split into the following fields:
	# - code: locale code, e.g. 'en_GB'
	# - suffix: locale code suffix, e.g. 'latin' from 'be_BY@latin'
	# - name: locale name, e.g. 'English' from 'en_GB'
	# - country: title-cased locale country, e.g. 'United Kingdom' from 'en_GB'
	# Settings for where to save the HTML file and the locale file can be found below.
	###
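The split into `code` and `suffix` described in the header above could be sketched as follows. This assumes glibc-style locale codes of the form `language_TERRITORY@modifier`. The function name `split_locale_code` and the returned dictionary layout are hypothetical, not taken from the original script.

```python
def split_locale_code(raw: str) -> dict:
    """Split a glibc-style locale code into its base code and optional suffix."""
    # Everything after '@' is treated as the suffix (modifier), if present.
    code, _, suffix = raw.partition("@")
    return {"code": code, "suffix": suffix or None}


examples = [split_locale_code(c) for c in ("en_GB", "be_BY@latin")]
for entry in examples:
    print(entry)
```

The `name` and `country` fields would come from the downloaded page itself, so they are left out of this sketch.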

	# Python functions to improve sorting of text in alphabetic scripts.

	# Copyright 2025 Enabling Languages
	#
	# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and
	# associated documentation files (the “Software”), to deal in the Software without restriction, including
	# without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
	# copies of the Software, and to permit persons to whom the Software is furnished to do so, subject
	# to the following conditions:
	#

	####################################################################################################
	#
	# Bidi isolation: bidiIsolate()
	# Enabling Languages Python port of unicodeBidi.ts: https://github.com/signalapp/Signal-Desktop/blob/ce0fb220411b97722e1e080c14faa65d23165784/ts/util/unicodeBidi.ts
	# Original code by Signal Messenger, LLC
	# Released under AGPL 3.0 license
	#
	####################################################################################################

	import regex
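As a rough illustration of the idea behind bidi isolation, the sketch below wraps a string in FIRST STRONG ISOLATE and POP DIRECTIONAL ISOLATE and balances any isolate initiators found inside it. It is a simplified stand-in, not the actual port of unicodeBidi.ts: the name `bidi_isolate_sketch` is hypothetical, it uses only the standard library, and it ignores embedding and override characters.

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

ISOLATE_INITIATORS = {LRI, RLI, FSI}


def bidi_isolate_sketch(text: str) -> str:
    """Wrap text in FSI...PDI, closing unbalanced isolates inside it."""
    kept = []
    depth = 0
    for ch in text:
        if ch in ISOLATE_INITIATORS:
            depth += 1
        elif ch == PDI:
            if depth == 0:
                # A stray PDI would close our outer isolate early; drop it.
                continue
            depth -= 1
        kept.append(ch)
    # Close any isolates the text left open, then close the outer FSI.
    return FSI + "".join(kept) + PDI * depth + PDI


isolated = bidi_isolate_sketch("abc " + RLI + "\u05d0\u05d1")
print(repr(isolated))
```

Isolating user-supplied strings this way keeps their directionality from leaking into the surrounding text.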

	sudo apt-get update
	sudo apt-get install -y unzip git cmake python3-pip python3.11-venv libfreetype6-dev libharfbuzz-dev libfribidi-dev meson gtk-doc-tools libcairo2-dev libfontconfig-dev libjpeg-dev zlib1g-dev libpng-dev libtiff5-dev liblcms2-dev libwebp-dev libxcb1-dev

	mkdir -p ~/tmp
	cd ~/tmp
	git clone https://github.com/HOST-Oman/libraqm.git
	git clone https://github.com/ninja-build/ninja.git

	cd ninja
	./configure.py --bootstrap

	# We start by loading PyICU. Current releases are imported as "icu".
	import icu
	# Let's create a test text. Notice that it contains some punctuation.
	test = 'This is ("a") test!'


	# We create a word break iterator. All break iterators in ICU are really RuleBasedBreakIterators, and we need to tell ICU which locale to take the word break rules from. Most locales share the same UAX #29 rules, so we will use English.
	wb = icu.BreakIterator.createWordInstance(icu.Locale.getEnglish())

	# A break iterator is stateful: it holds the text we want to break, and we then iterate over the boundaries it finds. So we set the text first.
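One possible continuation of the steps above, assuming PyICU is installed and importable as `icu`. The helper name `word_segments` is illustrative, and the import is guarded so the snippet still loads without PyICU.

```python
try:
    import icu  # PyICU; assumed to be installed
except ImportError:
    icu = None


def word_segments(text, locale="en"):
    """Return the UAX #29 word segments of text, or None without PyICU."""
    if icu is None:
        return None
    wb = icu.BreakIterator.createWordInstance(icu.Locale(locale))
    # Give the iterator its state: the text to segment.
    wb.setText(text)
    segments = []
    start = wb.first()
    # Iterating yields each following boundary offset in turn.
    # Offsets are assumed to line up with Python indices, which holds for BMP text.
    for end in wb:
        segments.append(text[start:end])
        start = end
    return segments


segments = word_segments('This is ("a") test!')
print(segments)
```

Spaces and punctuation come back as their own segments, so callers that only want words need to filter the result.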