Created
March 29, 2012 09:14
# Fix the encoding of a string to ensure it's valid UTF-8 by assuming
# the string is valid UTF-8 but incorrectly marked. Changes the marker
# to UTF-8, then replaces any invalid bytes with "?".
#
# It'd be nice if force_encoding had an option to strip invalid bytes in a
# single step. Until that's possible, the ugly round trip through UTF-16 is
# required.
#
# Non-strings are left untouched.
#
def clean_utf8(value)
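  # Symbols and Regexps also respond to :encoding but cannot be re-tagged
  # with force_encoding, so check for String explicitly.
  return value unless value.is_a?(String)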
  return value if value.encoding == Encoding::UTF_8 && value.valid_encoding?
  value = value.dup
  value.force_encoding("utf-8")
  if value.valid_encoding?
    value
  else
    value.encode("utf-16be", :invalid => :replace, :replace => "?").encode("utf-8")
  end
end
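
The header comment wishes for a way to drop invalid bytes in a single step. Ruby 2.1 and later provide `String#scrub`, which does this, so on those versions the UTF-16 round trip may no longer be needed. The following is a minimal sketch under that assumption, not a replacement endorsed by the original author; the name `clean_utf8_scrub` is hypothetical, and the `is_a?(String)` guard is a choice made for this sketch.

```ruby
# Hypothetical variant of clean_utf8 for Ruby 2.1+, where String#scrub
# replaces invalid byte sequences without re-encoding through UTF-16.
def clean_utf8_scrub(value)
  return value unless value.is_a?(String)
  return value if value.encoding == Encoding::UTF_8 && value.valid_encoding?

  # Work on a copy so the caller's string keeps its original encoding tag.
  value.dup.force_encoding(Encoding::UTF_8).scrub("?")
end

# Valid UTF-8 bytes that were mislabelled as binary are re-tagged, not altered.
mislabelled = "caf\xC3\xA9".b
relabelled = clean_utf8_scrub(mislabelled)

# 0xFF can never occur in UTF-8, so it is replaced with "?".
broken = "abc\xFFdef".b
repaired = clean_utf8_scrub(broken)
```

Here `relabelled` carries the UTF-8 encoding tag with its bytes unchanged, while `repaired` has the stray byte swapped for a question mark. Non-string arguments are returned as they were passed in.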