# this scrubs emoji sequences from a string - i think it covers all of them
def strip_emoji(str)
  str = str.force_encoding('utf-8').encode
  clean_text = ""

  # emoticons 1F600 - 1F64F
  regex = /[\u{1f600}-\u{1f64f}]/
  clean_text = str.gsub regex, ''

  # dingbats 2702 - 27B0
  regex = /[\u{2702}-\u{27b0}]/
  clean_text = clean_text.gsub regex, ''

  # transport/map symbols 1F680 - 1F6FF
  regex = /[\u{1f680}-\u{1f6ff}]/
  clean_text = clean_text.gsub regex, ''

  # enclosed chars 24C2 - 1F251
  regex = /[\u{24C2}-\u{1F251}]/
  clean_text = clean_text.gsub regex, ''

  # symbols & pics 1F300 - 1F5FF
  regex = /[\u{1f300}-\u{1f5ff}]/
  clean_text = clean_text.gsub regex, ''
end

def test_strip_emoji
  f = File.open("emoji.txt", "r")
  f.each_line do |line|
    # NOTE: strip_emoji_full is not defined in this gist; strip_emoji above is the only method
    puts strip_emoji_full(line)
  end
  f.close
end
This caught my attention because a colleague of mine used it as a reference.
If the objective is to remove 4-byte characters from a UTF-8 string (a widespread problem for MySQL installations that have been using the default utf8 character set), then this is a more standard solution:
scrubbed_utf8_mb3_string = utf8_mb4_string.each_char.select { |char| char.bytesize < 4 }.join
Note that his code is taken from https://github.com/maximeg/activecleaner.
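To make the effect of the byte-size filter above concrete, here is a minimal sketch. The helper name `strip_four_byte_chars` and the sample strings are made up for illustration; the body is the one-liner quoted above. It shows that characters encoded in 4 bytes are dropped, while 3-byte characters (Chinese text, but also 3-byte symbols such as U+2702 from the dingbats range) are kept.

```ruby
# Hypothetical helper name; the body is the one-liner from the comment above.
def strip_four_byte_chars(text)
  text.each_char.select { |char| char.bytesize < 4 }.join
end

samples = {
  "grinning face (U+1F600, 4 bytes)" => "hi \u{1F600}",
  "Chinese text (3 bytes per char)"  => "\u{4E2D}\u{6587}",
  "scissors (U+2702, 3 bytes)"       => "cut \u{2702}"
}

samples.each do |label, text|
  # inspect makes empty or shortened results easy to see
  puts "#{label}: #{strip_four_byte_chars(text).inspect}"
end
```

So this filter is a good fit when the goal is MySQL `utf8` compatibility, but it is not a complete emoji filter on its own, since some emoji live in the 3-byte range.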
Thanks for this method!
BTW:
The comment above worked like a charm too :)
https://gist.github.com/adamlwatson/9623703#gistcomment-1785300
This does not work for all emoji.
See the complete list here: http://unicode.org/emoji/charts/full-emoji-list.html
Examples of unfiltered emoji:
U+1F195
U+1F1F2
U+1F6A7
...
The comment from @saveriomiroddi is better.
scrubbed_utf8_mb3_string = utf8_mb4_string.each_char.select { |char| char.bytesize < 4 }.join
It removes Chinese as well...
Try this:
https://github.com/guanting112/remove_emoji
(It does not remove any Chinese characters; it only strips emoji, according to the standard.)
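One possible way to reproduce the "removes Chinese" remark, assuming it refers to the gist's "enclosed chars" range: a single character class running from U+24C2 to U+1F251 also spans the CJK Unified Ideographs block (U+4E00 to U+9FFF). The helper names below are hypothetical; the regex is copied from the gist and the byte-size filter from the comment quoted above.

```ruby
# Range copied from the gist's "enclosed chars" step.
ENCLOSED_RANGE = /[\u{24C2}-\u{1F251}]/

# Hypothetical helper: applies only the gist's "enclosed chars" substitution.
def strip_enclosed_range(text)
  text.gsub(ENCLOSED_RANGE, '')
end

# Hypothetical helper: the byte-size filter quoted in the thread.
def strip_four_byte_chars(text)
  text.each_char.select { |char| char.bytesize < 4 }.join
end

chinese = "\u{4E2D}\u{6587}" # two CJK ideographs, each inside U+4E00..U+9FFF

# The wide range swallows both ideographs.
puts strip_enclosed_range(chinese).inspect

# The byte-size filter leaves them alone, since each is 3 bytes in UTF-8.
puts strip_four_byte_chars(chinese).inspect
```

Under that assumption, the Chinese text is lost in the wide code point range rather than in the byte-size approach.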
The strip_emoji_full method is missing!