andjc / graphemes_python.md
Last active October 20, 2024 13:25
Grapheme tokenisation in Python

When working with tokenisation and break iterators, it is sometimes necessary to work at the character, syllable, line, or sentence level. Character-level tokenisation is an interesting case. By character, I mean a user-perceived unit of text, which the Unicode Standard refers to as a grapheme cluster (often shortened to grapheme). The usual way I see developers handle character-level tokenisation of English is with a list comprehension or by passing the string to list():

>>> t1 = "transformation"
>>> [char for char in t1]
['t', 'r', 'a', 'n', 's', 'f', 'o', 'r', 'm', 'a', 't', 'i', 'o', 'n']
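
This works for plain English because every letter in the word is a single code point. The sketch below illustrates where iterating over a string stops matching the definition of a grapheme given above. It uses only the standard library, and `naive_graphemes` is a hypothetical helper written for this illustration. It only attaches combining marks to the preceding base character, so treat it as a rough approximation and not as an implementation of the Unicode segmentation rules.

```python
import unicodedata


def naive_graphemes(text):
    """Group each base code point with the combining marks that follow it.

    Hypothetical helper for illustration only: it does not implement the
    full Unicode grapheme cluster rules.
    """
    clusters = []
    for ch in text:
        # A non-zero combining class means the code point is a combining
        # mark, so it joins the cluster that came before it.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "cafe\u0301"  # "café" written with a combining acute accent
code_points = list(word)          # iterating a str yields code points
clusters = naive_graphemes(word)  # accent stays attached to its base letter
print(len(code_points), len(clusters))  # prints: 5 4
```

A reader sees four characters in this word, but `list()` returns five items because the accent is a separate code point.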