class String
  # See https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf#page=355
  COMBINING_DIACRITICS = [*0x1DC0..0x1DFF, *0x0300..0x036F, *0xFE20..0xFE2F].pack('U*')

  def removeaccents
    self
      .unicode_normalize(:nfd) # Decompose characters
      .tr(COMBINING_DIACRITICS, '')
      .unicode_normalize(:nfc) # Recompose characters
  end
end
For me, on ruby 2.6.10p210 (2022-04-12 revision 67958) [universal.arm64e-darwin22], this doesn't convert ł to l. I had to use this:
require 'i18n'
I18n.config.available_locales = :en
I18n.transliterate(str)
I benchmarked both on a string of 144525 characters, and I18n was about 2.5x faster:
                      user     system      total         real
I18n              0.011735   0.000046   0.011781 (  0.011780)
removeaccents     0.029162   0.000093   0.029255 (  0.029264)
On a string 1.46 GB long, the I18n gem was about 2.4x faster:
I18n            127.551134   0.554655 128.105789 (128.158066)
removeaccents   305.104024   1.212222 306.316246 (306.473522)
With a more real-world test (an array of 1275 strings averaging 111 characters each), the I18n gem is about 3x faster:
I18n              0.014128   0.000110   0.014238 (  0.014238)
removeaccents     0.043880   0.000823   0.044703 (  0.044707)
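For anyone who wants to repeat this kind of comparison, here is a minimal sketch using Ruby's standard Benchmark module. The sample sentence and the array it is repeated into are made up for illustration (they are not the data measured above), and the I18n row is skipped if the i18n gem is not installed. Timings will vary by machine and Ruby version.

```ruby
require 'benchmark'

begin
  require 'i18n'
  I18n.config.available_locales = :en
  HAVE_I18N = true
rescue LoadError
  HAVE_I18N = false # i18n gem not installed: only removeaccents is timed
end

class String
  COMBINING_DIACRITICS = [*0x1DC0..0x1DFF, *0x0300..0x036F, *0xFE20..0xFE2F].pack('U*')

  def removeaccents
    unicode_normalize(:nfd).tr(COMBINING_DIACRITICS, '').unicode_normalize(:nfc)
  end
end

# Hypothetical input: 1275 copies of one accented sentence.
strings = Array.new(1275) { "Cr\u00E8me br\u00FBl\u00E9e \u00E0 la fa\u00E7on de Zo\u00EB" }

Benchmark.bm(14) do |x|
  x.report('I18n') { strings.each { |s| I18n.transliterate(s) } } if HAVE_I18N
  x.report('removeaccents') { strings.each(&:removeaccents) }
end
```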
@bkazez ł isn't replaced because it's not an accented character but a standalone letter of the Polish alphabet. The stroke can't be “removed” because ł is a character in its own right, not a composed character. In fact, it doesn't have any valid decomposition in Unicode.
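The difference is easy to check with NFD normalization, which is the first step of removeaccents. A small sketch (the variable names are only for illustration):

```ruby
e_acute  = "\u00E9" # é
l_stroke = "\u0142" # ł

# é decomposes into "e" plus U+0301 COMBINING ACUTE ACCENT.
p e_acute.unicode_normalize(:nfd).codepoints.map { |c| format('U+%04X', c) }
# => ["U+0065", "U+0301"]

# ł has no decomposition, so NFD leaves it as a single code point.
p l_stroke.unicode_normalize(:nfd).codepoints.map { |c| format('U+%04X', c) }
# => ["U+0142"]
```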
You nonetheless make a very interesting point! Even though ł isn't a composed character per se, it can still be useful to replace it with l, e.g. to account for forms that were filled in with an English keyboard (where l was used because ł wasn't available and looks similar).
I'm not enthusiastic about I18n.transliterate(), however, because it converts every character that can't be transliterated to the target locale into ?.
If you only use it to compare strings then I suppose it works (although with false positives, because the non-transliterable characters are all converted to ?). If you plan to store the result in a database or output it somewhere, then I18n.transliterate() is a no-go.
See: https://stackoverflow.com/a/74029319/3634271
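One possible middle ground, sketched here as an assumption rather than taken from the linked answer: keep the normalization approach and add an explicit fallback table for letters such as ł that have no decomposition. The method name removeaccents_with_fallback and the STROKE_FALLBACKS table are hypothetical, and the table is deliberately incomplete. Characters that are not in the table are left untouched instead of being turned into ?.

```ruby
class String
  COMBINING_MARKS = [*0x1DC0..0x1DFF, *0x0300..0x036F, *0xFE20..0xFE2F].pack('U*')

  # Hypothetical, incomplete table: letters without a Unicode decomposition
  # mapped to a conventional ASCII look-alike.
  STROKE_FALLBACKS = {
    "\u0142" => 'l', "\u0141" => 'L', # ł Ł
    "\u00F8" => 'o', "\u00D8" => 'O', # ø Ø
    "\u0111" => 'd', "\u0110" => 'D'  # đ Đ
  }.freeze

  def removeaccents_with_fallback
    unicode_normalize(:nfd)
      .tr(COMBINING_MARKS, '')        # delete combining diacritics
      .unicode_normalize(:nfc)
      .gsub(Regexp.union(STROKE_FALLBACKS.keys), STROKE_FALLBACKS)
  end
end

puts "\u0141\u00F3d\u017A, S\u00F8ren, \u65E5\u672C\u8A9E".removeaccents_with_fallback
# prints "Lodz, Soren, 日本語"
```

Whether a given substitution is acceptable depends on the language and the use case, so the table is best kept small and explicit.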