
@devnoname120
Last active June 29, 2024 05:12
[Ruby] Remove accents from UTF-8 string
class String
  # See https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf#page=355
  COMBINING_DIACRITICS = [*0x1DC0..0x1DFF, *0x0300..0x036F, *0xFE20..0xFE2F].pack('U*')

  def removeaccents
    self
      .unicode_normalize(:nfd) # Decompose characters
      .tr(COMBINING_DIACRITICS, '')
      .unicode_normalize(:nfc) # Recompose characters
  end
end
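
To make the three steps concrete, here is a minimal sketch that runs them one at a time on a single word and prints the intermediate code points. The constant name `MARKS_DEMO` and the sample word are only illustrative; the ranges are copied from the gist.

```ruby
# Same code point ranges as the gist's COMBINING_DIACRITICS constant
# (MARKS_DEMO is an illustrative name, not part of the gist).
MARKS_DEMO = [*0x1DC0..0x1DFF, *0x0300..0x036F, *0xFE20..0xFE2F].pack('U*')

word = "caf\u00E9" # "café" written with the precomposed é (U+00E9)

# Step 1: NFD splits é into "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = word.unicode_normalize(:nfd)
p decomposed.codepoints.map { |c| format('U+%04X', c) }
# => ["U+0063", "U+0061", "U+0066", "U+0065", "U+0301"]

# Step 2: tr with an empty replacement deletes every character listed in MARKS_DEMO.
stripped = decomposed.tr(MARKS_DEMO, '')

# Step 3: NFC recomposes whatever is left (nothing to recompose in this word).
result = stripped.unicode_normalize(:nfc)
puts result # => cafe
```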


bkazez commented Jul 23, 2023

For me on ruby 2.6.10p210 (2022-04-12 revision 67958) [universal.arm64e-darwin22], this doesn't convert ł to l. I had to use this:

require 'i18n'

I18n.config.available_locales = :en
I18n.transliterate(str)

I did a benchmark on a string with 144525 characters and I18n appears faster:

                    user     system      total        real
I18n            0.011735   0.000046   0.011781 (  0.011780)
removeaccents   0.029162   0.000093   0.029255 (  0.029264)

On a string 1.46GB long, the I18n gem had the clear advantage:

I18n          127.551134   0.554655 128.105789 (128.158066)
removeaccents 305.104024   1.212222 306.316246 (306.473522)

With a more real-world test (an array of 1275 strings, averaging 111 characters each), the I18n gem is about 3x faster:

I18n            0.014128   0.000110   0.014238 (  0.014238)
removeaccents   0.043880   0.000823   0.044703 (  0.044707)

@devnoname120
Author

@bkazez ł isn't replaced because it's not an accented character but a standalone letter of the Polish alphabet. The stroke can't be “removed” because ł is a single atomic character, not a base letter plus a combining mark. In fact it has no decomposition in Unicode, so NFD normalization leaves it unchanged.
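
A quick way to check this is to compare the NFD form of é with that of ł. The sketch below uses only `String#unicode_normalize`; the helper name `nfd_codepoints` is made up for illustration.

```ruby
e_acute  = "\u00E9" # é, LATIN SMALL LETTER E WITH ACUTE
l_stroke = "\u0142" # ł, LATIN SMALL LETTER L WITH STROKE

# Illustrative helper: code points of the NFD form, formatted as U+XXXX.
def nfd_codepoints(str)
  str.unicode_normalize(:nfd).codepoints.map { |c| format('U+%04X', c) }
end

p nfd_codepoints(e_acute)  # => ["U+0065", "U+0301"]  (base letter + combining mark)
p nfd_codepoints(l_stroke) # => ["U+0142"]            (unchanged, nothing to strip)
```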

You nonetheless make a very interesting point! Even though ł isn't a composed character per se, it can still be useful to replace it with l to account for e.g. forms that were filled in on an English keyboard (where l was used because ł wasn't available and looks similar to it).
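
One possible way to get that behavior without I18n is to add an explicit fallback table for atomic letters on top of the normalize-and-strip approach. This is only a sketch: the table `ATOMIC_FALLBACKS`, the constant `STRIP_MARKS`, and the method `remove_accents_with_fallbacks` are hypothetical names, and the table is deliberately incomplete.

```ruby
# Hypothetical, deliberately incomplete table of atomic letters (no Unicode
# decomposition) mapped to ASCII look-alikes. Extend it to fit your data.
ATOMIC_FALLBACKS = {
  "\u0142" => 'l', # ł
  "\u0141" => 'L', # Ł
  "\u00F8" => 'o', # ø
  "\u00D8" => 'O', # Ø
  "\u0111" => 'd', # đ
  "\u0110" => 'D'  # Đ
}.freeze

# Same code point ranges as the gist's COMBINING_DIACRITICS constant.
STRIP_MARKS = [*0x1DC0..0x1DFF, *0x0300..0x036F, *0xFE20..0xFE2F].pack('U*')

def remove_accents_with_fallbacks(str)
  str
    .unicode_normalize(:nfd)                                      # decompose
    .tr(STRIP_MARKS, '')                                          # delete combining marks
    .gsub(Regexp.union(ATOMIC_FALLBACKS.keys), ATOMIC_FALLBACKS)  # map atomic letters
    .unicode_normalize(:nfc)                                      # recompose the rest
end

puts remove_accents_with_fallbacks("\u0141\u00F3d\u017A")  # Łódź   -> Lodz
puts remove_accents_with_fallbacks("\u65E5\u672C\u8A9E")   # 日本語 -> unchanged
```

Characters that are neither decomposable nor listed in the table pass through untouched rather than being replaced.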

I'm not enthusiastic about I18n.transliterate(), however, because it replaces every character that can't be transliterated to the target locale with ?.

If you only use it to compare strings then I suppose it works (although with false positives because the non-transliterable characters are all converted to ?). If you plan to store the result in a database or output it somewhere then I18n.transliterate() is a no-go.
