Created September 2, 2012 14:57
# encoding: binary

# Removes any bytes from a string that are not valid UTF-8
class Cleaner
  attr_reader :bytes, :buffer, :outstanding

  def self.clean(str)
    new.tap { |c| c << str }.to_s
  end

  def initialize(str = nil)
    @bytes = []
    clear_buffer
    self << str if str
  end

  # Accepts a String, an enumerable of bytes, or a single byte (Integer).
  def <<(input)
    return self << input.bytes if input.respond_to? :bytes
    return input.each { |b| self << b } if input.respond_to? :each
    case input
    when 1..127   then add(input)              # ASCII; NUL (0) is dropped
    when 128..191 then fill_buffer(input)      # continuation byte
    when 194..223 then start_buffer(input, 2)  # lead byte, 2-byte sequence
    when 224..239 then start_buffer(input, 3)  # lead byte, 3-byte sequence
    when 240..244 then start_buffer(input, 4)  # lead byte, 4-byte sequence
    else clear_buffer                          # 0, 192..193, 245..255: never valid
    end
  end

  def to_s
    bytes.pack('C*').force_encoding('utf-8')
  end

  private

  # Drops any partially collected sequence.
  def clear_buffer
    start_buffer(nil, 0)
  end

  # Starts a multi-byte sequence that expects `size` bytes in total.
  def start_buffer(byte, size)
    @buffer, @outstanding = Array(byte), size
  end

  # Appends a continuation byte. A complete sequence is emitted; a stray
  # continuation byte is discarded. Overlong forms and surrogates are not
  # checked, because only the byte count is validated.
  def fill_buffer(byte)
    buffer << byte
    add(buffer) if buffer.size == outstanding
    clear_buffer if buffer.size > outstanding
  end

  # Emits a byte (or a completed sequence) and resets the buffer.
  def add(input)
    clear_buffer
    bytes.concat Array(input)
  end
end

str = "yummy\xE2 \xF0\x9F\x8D\x94 \x9F\x8D\x94"
puts str
puts Cleaner.clean(str)
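For comparison, here is a minimal sketch of the same cleanup using the built-in `String#scrub`. It assumes Ruby 2.1 or later, which is newer than this gist, and it reuses the sample bytes from the script above. It is not part of the original `Cleaner` and may behave differently on edge cases, for example NUL bytes, which `Cleaner` drops and `scrub` keeps.

```ruby
# Assumption: Ruby 2.1+, where String#scrub is available.
# Same bytes as the sample string above, tagged as binary via String#b.
raw = "yummy\xE2 \xF0\x9F\x8D\x94 \x9F\x8D\x94".b

# Reinterpret the bytes as UTF-8, then replace every invalid
# sequence with the empty string.
cleaned = raw.dup.force_encoding(Encoding::UTF_8).scrub('')

puts cleaned.valid_encoding?
puts cleaned
```

The lone `\xE2` and the three stray continuation bytes are removed, while the complete four-byte sequence `\xF0\x9F\x8D\x94` is kept.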
A third option is to use the Iconv bindings on Ruby 1.8, since @Burgerstrand's solution works only on 1.9.
No edit button for gist comments?
Questions on how to remove invalid UTF-8 characters from strings come up from time to time in #ruby and #ruby-lang. Up until now the only solution I’ve been able to give is this:
Now I have yet another option. You should gemify it. :)