@andjc
Last active October 20, 2024 13:25

Grapheme tokenisation in Python

When working with tokenisation and break iterators, it is sometimes necessary to work at the character, syllable, line, or sentence level. Character-level tokenisation is an interesting case. By character, I mean a user-perceivable unit of text, which the Unicode standard would refer to as a grapheme. The usual way I see developers handling character-level tokenisation of English is via a list comprehension or by typecasting a string to a list:

>>> t1 = "transformation"
>>> [char for char in t1]
['t', 'r', 'a', 'n', 's', 'f', 'o', 'r', 'm', 'a', 't', 'i', 'o', 'n']

This will give you discrete characters or codepoints. But this approach doesn't work as well for other languages. Let's take a Dinka string as an example:

>>> t2 = "dɛ̈tëicëkäŋ akɔ̈ɔ̈n"
>>> [char for char in t2]
['d', 'ɛ', '̈', 't', 'ë', 'i', 'c', 'ë', 'k', 'ä', 'ŋ', ' ', 'a', 'k', 'ɔ', '̈', 'ɔ', '̈', 'n']

There is a mixture of precomposed and decomposed character sequences.

>>> import unicodedata as ud
>>> ud.is_normalized('NFC', t2)
True

The text is fully precomposed, using Unicode Normalization Form C, but many user-perceived characters do not exist as single codepoints. How do we treat ä and ɛ̈ consistently? We can use Python packages such as grapheme and PyICU, or we can get the same results from a simple regular expression assertion.

Unicode documentation refers to legacy grapheme clusters, extended grapheme clusters, and tailored grapheme clusters. Generally, when graphemes are referred to, extended grapheme clusters are meant.

Regex

I will use the regex module rather than re, since regex has more comprehensive Unicode support.

import regex as re
re.findall(r'\X',t2)
# ['d', 'ɛ̈', 't', 'ë', 'i', 'c', 'ë', 'k', 'ä', 'ŋ', ' ', 'a', 'k', 'ɔ̈', 'ɔ̈', 'n']

As can be seen, each grapheme is treated as a single unit: rather than splitting ɛ̈ into two codepoints, the regular expression returns it as one element.
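
The \X pattern also covers extended grapheme clusters that span several codepoints, such as emoji ZWJ sequences. A minimal sketch, using a string from the discussion below (the result assumes a reasonably recent regex release with current Unicode data):

import regex as re
re.findall(r'\X', "Hello 👩🏽‍🔬!")
# ['H', 'e', 'l', 'l', 'o', ' ', '👩🏽‍🔬', '!']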

For the Dinka string, grapheme segmentation will be consistent across implementations. If we look at a Devanagari example, differences can be observed between segmentation by different implementations.

t3 = "हिन्दी"
re.findall(r'\X',t3)
# ['हि', 'न्', 'दी']

This generates three grapheme clusters for the string.

Grapheme

Alternatively we could use the grapheme package, which provides a number of functions to manipulate strings using graphemes as the basic unit, rather than individual codepoints, as standard Python string operations do.

N.B. The grapheme package hasn't been updated since 2020, so it is better to avoid this package if you require recent versions of Unicode.

import grapheme
grapheme.UNICODE_VERSION
# '13.0.0'
list(grapheme.graphemes(t2))
# ['d', 'ɛ̈', 't', 'ë', 'i', 'c', 'ë', 'k', 'ä', 'ŋ', ' ', 'a', 'k', 'ɔ̈', 'ɔ̈', 'n']
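
Beyond segmentation, the package also exposes grapheme-aware helpers such as length() and slice(). A small sketch, assuming the same Unicode 13.0 data as above:

grapheme.length(t2)
# 16
grapheme.slice(t2, 0, 2)
# 'dɛ̈'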

Likewise, the grapheme package gives us three grapheme clusters:

list(grapheme.graphemes(t3))
# ['हि', 'न्', 'दी']

PyICU

Alternatively, we could use PyICU, with icu4c, for grapheme tokenisation.

Using PyICU is more complex than using regex or grapheme: many of the functions available in icu4c are low-level functions, and it is necessary to develop your own wrapper around them. The break iterator returns a set of breakpoints in the string, and it is necessary to iterate through these breakpoints. The PyICU cheat sheet has a useful function, iterate_breaks(), for iterating through each breakpoint in the string.

We then need to create a break iterator instance, and then pass the string and the instance to iterate_breaks(). In this instance I will use the root locale for the break iterator.

import icu

def iterate_breaks(text, break_iterator):
    # Walk the boundaries reported by the break iterator and yield
    # the text between each pair of consecutive boundaries.
    break_iterator.setText(text)
    lastpos = 0
    while True:
        next_boundary = break_iterator.nextBoundary()
        if next_boundary == -1:
            return
        yield text[lastpos:next_boundary]
        lastpos = next_boundary

bi = icu.BreakIterator.createCharacterInstance(icu.Locale.getRoot())
list(iterate_breaks(t2, bi))
# ['d', 'ɛ̈', 't', 'ë', 'i', 'c', 'ë', 'k', 'ä', 'ŋ', ' ', 'a', 'k', 'ɔ̈', 'ɔ̈', 'n']

Alternative code we use in our internal projects:

import icu
def generate_tokens(text, brkiter = None):
    if not brkiter:
        brkiter = icu.BreakIterator.createWordInstance(icu.Locale.getRoot())
    brkiter.setText(text)
    i = brkiter.first()
    for j in brkiter:
        yield text[i:j]
        i = j

def get_generated_tokens(text, bi = None):
    return [*generate_tokens(text, brkiter=bi)]

char_bi = icu.BreakIterator.createCharacterInstance(icu.Locale('und'))
get_generated_tokens(t2, char_bi)
# ['d', 'ɛ̈', 't', 'ë', 'i', 'c', 'ë', 'k', 'ä', 'ŋ', ' ', 'a', 'k', 'ɔ̈', 'ɔ̈', 'n']

If we look at the Devanagari string:

get_generated_tokens(t3, char_bi)
# ['हि', 'न्दी']

The icu4c break iterator generates two grapheme clusters. The key difference between the icu4c implementation and the grapheme and regex packages is that PyICU treats U+094D ( ् ) DEVANAGARI SIGN VIRAMA as extending the grapheme cluster, while for the other two packages U+094D does not extend the grapheme cluster boundaries. The icu4c behaviour matches user expectations.
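
To see why, it helps to inspect the individual codepoints of the Devanagari string with the standard library; the virama sits between the two consonants it joins:

import unicodedata as ud
for char in t3:
    print(f"U+{ord(char):04X} {ud.name(char)}")
# U+0939 DEVANAGARI LETTER HA
# U+093F DEVANAGARI VOWEL SIGN I
# U+0928 DEVANAGARI LETTER NA
# U+094D DEVANAGARI SIGN VIRAMA
# U+0926 DEVANAGARI LETTER DA
# U+0940 DEVANAGARI VOWEL SIGN II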

pyuegc

The next approach is pyuegc.

from pyuegc import EGC, UCD_VERSION
UCD_VERSION
# '15.1.0'

The latest pyuegc package uses the current Unicode version.

EGC(t2)
# ['d', 'ɛ̈', 't', 'ë', 'i', 'c', 'ë', 'k', 'ä', 'ŋ', ' ', 'a', 'k', 'ɔ̈', 'ɔ̈', 'n']

EGC(t3)
# ['हि', 'न्दी']

pyuegc splits the string into two grapheme clusters, giving the same result as pyicu.

ugrapheme

The final solution is ugrapheme. Unlike the other solutions, ugrapheme creates a class instance providing a range of methods for working with and manipulating strings on a grapheme level rather than a character or codepoint level. To mimic the other solutions, you can cast the object to a list:

from ugrapheme import graphemes
grs = graphemes(t3)
[*grs]
# ['हि', 'न्दी']

or, alternatively use the grapheme_split() function:

from ugrapheme import grapheme_split
grapheme_split(t3)
# ['हि', 'न्दी']

ugrapheme splits the string into two grapheme clusters, giving the same result as pyicu.
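
Because the graphemes object behaves like a string, len(), indexing, and slicing operate on grapheme boundaries rather than codepoints. A brief sketch, based on the string-like behaviour described in the comments below:

from ugrapheme import graphemes
g = graphemes(t3)
len(g)
# 2
g[0]
# 'हि'
g[::-1]
# 'न्दीहि'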

@ramSeraph

ramSeraph commented Aug 24, 2024

I checked pyuegc against pyicu for Telugu; pyicu is more accurate. Just putting it out there for whoever ends up coming here looking for it in the future. (Update: I cannot reproduce the failures; pyuegc in general matches pyicu in behaviour.)

@andjc

andjc commented Sep 18, 2024

@ramSeraph, that is true. Generally I prefer to use pyicu. It is also important to remember that the underlying version of ICU4C used by PyICU influences results as well. Ideally I update the version of ICU4C I am using with PyICU.

@ramSeraph

@andjc I have looked at the source code of pyuegc and am mostly surprised that it didn't work; maybe I used it wrong. It was a nice option when you don't want to deal with the ICU4C installation, as it only uses the data files from Unicode. I think it was based on Unicode 15.1 when I used it, but I am not sure; I didn't really check. I immediately went with PyICU.

@ramSeraph

Maybe they missed normalizing before attempting clustering. I am not sure if that is a prerequisite step though.

@Z4JC

Z4JC commented Oct 10, 2024

@ramSeraph: I think most grapheme segmenting libraries out there are not going to normalize automatically. I guess you know this already but doing unicodedata.normalize('NFC', x) before doing the splitting should work if there was a normalization issue. However, AFAIK if the implementation is fully UAX #29 compliant, it should spit out correct graphemes even for denormalized strings.
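
A minimal sketch of that normalisation step, using only the standard library:

import unicodedata
s = "ai\u0302ne\u0301e"   # decomposed "aînée"
unicodedata.normalize('NFC', s)
# 'aînée'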

The ugrapheme library I just released hopefully deals with Telugu correctly.
I tried:

In [1]: from ugrapheme import graphemes
In [2]: anil = graphemes("అనిల్")
In [3]: len(anil)
Out[3]: 3
In [4]: list(anil)
Out[4]: ['అ', 'ని', 'ల్']
In [5]: anil[::-1]
Out[5]: 'ల్నిఅ'
In [6]: worship = graphemes("రాజులకు రాజైన యీ మన విభుని పూజ సేయుటకు రండి")
In [7]: len(worship)
Out[7]: 28
In [8]: print(worship.replace("", "|"))
|రా|జు||కు| |రా|జై|| |యీ| ||| |వి|భు|ని| |పూ|| |సే|యు||కు| |రం|డి|

but I am limited to just scouring stackoverflow.com and other places looking for tricky examples. I even tried the evil "\u0c1c\u0c4d\u0c1e\u200c\u0c3e" that crashed iPhones in 2018, and that one comes out fine (as 1 grapheme).

Denormalized strings such as "ai\u0302ne\u0301e" (aînée) come out as 5 graphemes and reversing them works fine too.

In [2]: g = graphemes("ai\u0302ne\u0301e")   # aînée
In [3]: list(g)
Out[3]: ['a', 'î', 'n', 'é', 'e']
In [4]: g[::-1]
Out[4]: 'eénîa'

The denormalized aînée string shows up in my unit tests.

@ramSeraph

@Z4JC When I first tried this and commented, I didn't realise that NFC normalization was required. That explains the failure. Also, thanks for the ugrapheme library; I will check it out. I am currently evaluating things on the sangraha corpus. The graphemes aren't labelled in it, but it is currently the biggest Telugu corpus around.

@Z4JC

Z4JC commented Oct 10, 2024

@andjc: Do you mind taking a look at my ugrapheme and seeing how it fares?

Here's what I tried:

In [2]: t2 = "dɛ̈tëicëkäŋ akɔ̈ɔ̈n"
In [3]: print(' , '.join(t2))
d , ɛ , ̈ , t , ë , i , c , ë , k , ä , ŋ ,   , a , k , ɔ , ̈ , ɔ , ̈ , n
In [4]: g2 = graphemes("dɛ̈tëicëkäŋ akɔ̈ɔ̈n")
In [5]: print(' , '.join(g2))
d , ɛ̈ , t , ë , i , c , ë , k , ä , ŋ ,   , a , k , ɔ̈ , ɔ̈ , n
In [6]: t3 = "हिन्दी"
In [7]: list(t3)
Out[7]: ['ह', 'ि', 'न', '्', 'द', 'ी']
In [8]: g3 = graphemes("हिन्दी")
In [9]: list(g3)
Out[9]: ['हि', 'न्दी']

What you might like is that ugrapheme.graphemes in many cases acts exactly like strings, so you can slice, iterate, add, multiply, hash, pickle, join etc. using the same syntax as you would with regular strings. In fact you can concatenate regular strings to it and vice versa, without worrying about performance.

At the same time ugrapheme is at least 20x faster than pyuegc, and around 5x the speed of PyICU.
In fact, it approaches the performance of ordinary strings, despite having to do the segmenting internally.

Here's what I could not get PyICU to split correctly:
"Hello 👩🏽‍🔬! 👩🏼‍❤️‍💋‍👨🏾 अनुच्छेद"
but using ugrapheme it works just fine.

@ramSeraph

ramSeraph commented Oct 11, 2024

@Z4JC I am looking at the reversing op in the example you gave and I think the reverse of అనిల్ is లిన.

It is the phonetically reversed version, but I suspect that is hard to implement and is not captured in the Unicode data or the standard.

Also this is not a common operation people do and the correctness of the operation might be subject to individual taste.

@ramSeraph

ramSeraph commented Oct 11, 2024

Actually, thinking more on it, what you presented (ల్నిఅ) might also be considered accurate. And ల్‌నిఅ with a ZWNJ between ల్‌ and ని might also be considered correct. As you can see, this is a mess and unlikely to be captured accurately for Indic languages.

@ramSeraph

ramSeraph commented Oct 11, 2024

@Z4JC there is one other discrepancy I found when going through the list of words I have.

word: 'పరీక్ిలకు'
pyicu split: 'ప', 'రీ', 'క్ి', 'ల', 'కు'
pyuegc split: 'ప', 'రీ', 'క్ి', 'ల', 'కు'
ugraphemes split: 'ప', 'రీ', 'క్ిల', 'కు'

I am not sure if it is a bug, because the word itself is not valid (క్ి is not a valid sequence in Telugu), but the pyicu and pyuegc splits are better. These kinds of words can be seen in the wild, though, because of the lack of tools to check well-formedness for Indic languages.

@ramSeraph

ramSeraph commented Oct 11, 2024

I have done an exhaustive check on the Telugu words found in the sangraha corpus (55 million of them) with and without normalisation (NFC and visual). Visual normalisation was done using nisaba.

My conclusion is that for well formed words pyicu, pyuegc and ugrapheme match in the result produced. From the spot checks I have done the results look correct.

I have updated my original comment to reflect this.

@Z4JC

Z4JC commented Oct 12, 2024

@ramSeraph: Great feedback.
ugrapheme, according to my analysis, is the only one that parses 'పరీక్ిలకు' according to the UAX #29 spec.

Specifically, in Table 1c. Regex Definitions, we have:

conjunctCluster := \p{InCB=Consonant} ([\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Linker} [\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Consonant})+

For ease of reading, instead of writing \p{InCB=Consonant} I will write C; instead of \p{InCB=Linker} I will write L; and instead of \p{InCB=Extend} I will write E.

The expression then becomes:

conjunctCluster := C ([E L]* L [E L]* C)+

In the word 'పరీక్ిలకు' we encounter the following Unicode codepoints:
'క': {InCB=Consonant} or, shorthand, C
'్': {InCB=Linker} or, shorthand, L
'ి': {InCB=Extend}, or, shorthand, E
'ల': {InCB=Consonant}, or shorthand, C
the above 4 codepoints thus correspond to CLEC
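
For reference, the four codepoints can be listed with the standard library (unicodedata does not expose InCB, so the classes in the comments are the ones given above):

import unicodedata as ud
for char in "క్ిల":
    print(f"U+{ord(char):04X} {ud.name(char)}")
# U+0C15 TELUGU LETTER KA     -> InCB=Consonant (C)
# U+0C4D TELUGU SIGN VIRAMA   -> InCB=Linker    (L)
# U+0C3F TELUGU VOWEL SIGN I  -> InCB=Extend    (E)
# U+0C32 TELUGU LETTER LA     -> InCB=Consonant (C)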

The sequence CLEC is a match for the conjunctCluster production: the leading C matches the initial C of the production, and LEC matches one iteration of the group ([E L]* L [E L]* C).

The spec has 'క్ిల' in a conjunct cluster, although I understand it is not a naturally occurring Telugu conjunct cluster and probably looks wrong. The spec does not make a difference between, for example, Devanagari and Telugu in terms of conjunct clusters, and I am not familiar enough with the scripts to be able to tell whether there should be one. There has to be a reason for the conjunctCluster production in the spec to be so complicated, so I am thinking that if pyicu and pyuegc broke this one differently, they might actually make a breaking mistake on some valid conjunct clusters.

@ramSeraph

I think the conjunct clusters are similar in Telugu and Hindi. But there are quite a few scripts in the Brahmic script family, maybe some of them do things differently.

If you are curious as to why 'క్ిల' is not a valid sequence, you can check out this FSA for well-formedness that the nisaba folks wrote. I am surprised this hasn't been codified in the Unicode standard.

@ramSeraph

@Z4JC As you are probably already aware, there was prior discussion of including Unicode segmentation in core Python here. I hope your work makes it into the core some day. It is a very nice piece of engineering. You clearly blew all the other Python implementations away in performance. Maybe you can benchmark it against the ICU Rust implementation here.

@andjc

andjc commented Oct 12, 2024

@Z4JC Will look at it; I have been busy with Ethiopic data.
