The Python string method str.isalpha() is sometimes used as a constraint or validator. But how useful is it in code that needs to support multiple languages?
The Python documentation indicates that isalpha() matches any Unicode character with a general category of Lu, Ll, Lt, Lm, or Lo.
Unicode, by contrast, derives its Alphabetic property from a combination of general categories and contributory properties: Ll + Other_Lowercase + Lu + Other_Uppercase + Lt + Lm + Lo + Nl + Other_Alphabetic.
So a Unicode regex using \p{Alphabetic} can match characters that isalpha() does not, although in most practical cases the results will be the same.
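To make the difference concrete, here is a minimal sketch using only the standard library. The two sample characters are my own choice, not taken from the text above: one has general category Nl and the other is listed under Other_Alphabetic. A \p{Alphabetic} pattern should match both, since both are part of the Unicode definition, while isalpha() rejects them.

```python
import unicodedata

# Two characters that have the Unicode Alphabetic property but fall outside
# the letter categories (Lu, Ll, Lt, Lm, Lo) that str.isalpha() checks.
samples = [
    "\u2160",  # ROMAN NUMERAL ONE: general category Nl
    "\u24b6",  # CIRCLED LATIN CAPITAL LETTER A: category So, Other_Alphabetic
]

# (code point, general category, result of str.isalpha()) for each sample
report = [
    (f"U+{ord(ch):04X}", unicodedata.category(ch), ch.isalpha())
    for ch in samples
]
for row in report:
    print(row)
```

Neither character is likely to turn up in ordinary running text, which is why the two definitions agree in most practical cases.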
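It is interesting to note that the general categories Mn and Mc, taken as a whole, are not part of either the Python or the Unicode definition of an alphabetic character, although Other_Alphabetic does bring in some individual combining marks. What does this mean in practice?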
Take two canonically equivalent strings, "café" and "café". The first uses Unicode Normalisation Form C (a precomposed é, U+00E9), while the second uses Unicode Normalisation Form D (e followed by U+0301 COMBINING ACUTE ACCENT).
>>> sl = ["café", "café"]
>>> print([(w, w.isalpha()) for w in sl])
[('café', True), ('café', False)]
The first matches as a sequence of alphabetic characters, while the second fails, because the combining acute accent (U+0301, category Mn) is not considered an alphabetic character. The distinction between NFC and NFD data could, however, be seen as a contrived example.
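The following sketch shows where the NFD form fails, using escapes so that the normalisation form is unambiguous. The helper name describe is mine, introduced only for illustration.

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "caf\u00e9")  # precomposed é (U+00E9)
nfd = unicodedata.normalize("NFD", nfc)          # e + U+0301 COMBINING ACUTE ACCENT

def describe(s):
    """Return (code point, general category) for each character in s."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in s]

print(describe(nfd))               # the last entry is the Mn combining mark
print(nfc.isalpha(), nfd.isalpha())

# Normalising to NFC before validating removes this particular mismatch.
print(unicodedata.normalize("NFC", nfd).isalpha())
```

Normalising to NFC first only helps when a precomposed character exists, which is not always the case.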
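Wikipedia articles are often used within NLP tasks as a convenient data source. The Catalan Wikipedia has an article titled "Gal·lès" and the Dinka Wikipedia has an article titled "Brɛ̈kdhït". In both of these cases, the words are in NFC and will not match as a sequence of alphabetic characters. Gal·lès contains U+00B7 MIDDLE DOT (category Po), and ɛ̈ has no precomposed form, so its combining diaeresis (U+0308, category Mn) remains even in NFC.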
>>> nl = ["Gal·lès", "Brɛ̈kdhït"]
>>> print([(w, w.isalpha()) for w in nl])
[('Gal·lès', False), ('Brɛ̈kdhït', False)]
Gal·lès can also be written as Gaŀles, which will match.
>>> print("Gaŀles", "Gaŀles".isalpha())
Gaŀles True
>>> import regex as re
>>> print("Gaŀles", bool(re.match(r'^\p{Alphabetic}+$', "Gaŀles")))
Gaŀles True
Both Gal·lès and Gaŀles are used in Catalan text, but Python's str.isalpha() and str.title() do not treat them the same and will give different results.
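A small sketch of the difference between the two spellings, again written with escapes. The helper non_letters and the variable names are hypothetical, introduced here for illustration. The middle dot is neither a letter nor a cased character, so str.title() treats the following l as the start of a new word.

```python
import unicodedata

def non_letters(word):
    """List (code point, name, category) for characters str.isalpha() rejects."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))
        for ch in word
        if not ch.isalpha()
    ]

with_middle_dot = "Gal\u00b7l\u00e8s"  # l, U+00B7 MIDDLE DOT, l
with_l_dot = "Ga\u0140les"             # U+0140 LATIN SMALL LETTER L WITH MIDDLE DOT

print(non_letters(with_middle_dot))    # the middle dot is punctuation (Po)
print(non_letters(with_l_dot))         # nothing is rejected

# str.title() capitalises the letter after the middle dot,
# but leaves the single-character spelling untouched.
print(with_middle_dot.title())
print(with_l_dot.title())
```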
Further examples:
>>> ol = ["မြန်မာစာ", "ភាសាខ្មែរ", "हिन्दी", "தமிழ்", "සිංහල", "বাংলা", "اُردُو", "ལྷ་སའི་སྐད་", "ગુજરાતી", "ਪੰਜਾਬੀ"]
>>> print([(w, w.isalpha()) for w in ol])
[('မြန်မာစာ', False), ('ភាសាខ្មែរ', False), ('हिन्दी', False), ('தமிழ்', False), ('සිංහල', False), ('বাংলা', False), ('اُردُو', False), ('ལྷ་སའི་སྐད་', False), ('ગુજરાતી', False), ('ਪੰਜਾਬੀ', False)]
There are valid use cases for str.isalpha() and \p{Alphabetic}, but when working with multilingual data, I often find it useful to expand the definition of Alphabetic.
>>> ol = ["မြန်မာစာ", "ភាសាខ្មែរ", "हिन्दी", "தமிழ்", "සිංහල", "বাংলা", "اُردُو", "ལྷ་སའི་སྐད་", "ગુજરાતી", "ਪੰਜਾਬੀ", "Gal·lès", "Brɛ̈kdhït"]
>>> print([(w, elu.isalpha_(w)) for w in ol])
[('မြန်မာစာ', True), ('ភាសាខ្មែរ', True), ('हिन्दी', True), ('தமிழ்', True), ('සිංහල', True), ('বাংলা', True), ('اُردُو', True), ('ལྷ་སའི་སྐད་', False), ('ગુજરાતી', True), ('ਪੰਜਾਬੀ', True), ('Gal·lès', False), ('Brɛ̈kdhït', True)]
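I have not shown how elu.isalpha_ is implemented, so the following is only a sketch of one possible expanded definition, not the library's code. It assumes that "expanded" means letters plus the Mn and Mc mark categories. The name isalpha_extended and the EXTRA_CATEGORIES set are hypothetical.

```python
import unicodedata

# Assumption: admit nonspacing (Mn) and spacing (Mc) combining marks
# in addition to whatever str.isalpha() already accepts.
EXTRA_CATEGORIES = {"Mn", "Mc"}

def isalpha_extended(text):
    """Hypothetical helper: True if text is non-empty and every character
    is a letter or a combining mark in EXTRA_CATEGORIES."""
    return bool(text) and all(
        ch.isalpha() or unicodedata.category(ch) in EXTRA_CATEGORIES
        for ch in text
    )

words = [
    "cafe\u0301",                            # café in NFD
    "\u0939\u093f\u0928\u094d\u0926\u0940",  # हिन्दी
    "Br\u025b\u0308kdh\u00eft",              # Brɛ̈kdhït
    "Gal\u00b7l\u00e8s",                     # Gal·lès: the middle dot is Po
]
print([(w, isalpha_extended(w)) for w in words])
```

Under this assumption the middle dot in Gal·lès is still rejected, because it is punctuation rather than a mark. Note also that a string made up only of combining marks would pass, which may or may not be acceptable for a given validator.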