
andjc / graphemes_python.md

Last active October 20, 2024 13:25

Grapheme tokenisation in Python

When working with tokenisation and break iterators, it is sometimes necessary to work at the character, syllable, line, or sentence level. Character-level tokenisation is an interesting case. By character, I mean a user-perceived unit of text, which the Unicode Standard calls a grapheme cluster (often shortened to grapheme). The usual way I see developers handle character-level tokenisation of English is with a list comprehension or by converting the string to a list:

>>> t1 = "transformation"
>>> [char for char in t1]
['t', 'r', 'a', 'n', 's', 'f', 'o', 'r', 'm', 'a', 't', 'i', 'o', 'n']
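
Both approaches iterate over code points rather than graphemes, so they only give the expected result when every user-perceived character is a single code point. The sketch below illustrates the difference with a decomposed string. The helper `naive_graphemes` is a hypothetical name introduced here for illustration; it only attaches combining marks to the preceding base character using the standard library's `unicodedata.combining`, and is not an implementation of the Unicode segmentation rules (it ignores emoji ZWJ sequences, Hangul jamo, and similar cases).

```python
import unicodedata


def naive_graphemes(text):
    """Group each base code point with the combining marks that follow it.

    Hypothetical helper for illustration only: a rough approximation,
    not the full Unicode grapheme cluster algorithm.
    """
    clusters = []
    for ch in text:
        # combining() returns a non-zero class for combining marks
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "résumé" written with a combining acute accent (U+0301) after each "e"
t2 = "re\u0301sume\u0301"

print(len(list(t2)))             # number of code points
print(len(naive_graphemes(t2)))  # number of approximate graphemes
```

For plain ASCII text such as `"transformation"` the two counts agree, which is why the code point approach appears to work for English.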