@benhoskings
Created March 29, 2012 09:14
# Fix the encoding of a string to ensure it's valid UTF-8, by assuming
# the string is valid UTF-8 but incorrectly marked. Changes the marker
# to UTF-8, then replaces any invalid bytes with "?".
#
# It'd be nice if force_encoding had an option to replace invalid bytes in a
# single step. Until that's possible, the ugly round trip through UTF-16 is
# required.
#
# Non-strings are left untouched.
#
def clean_utf8(value)
  return value unless value.respond_to?(:encoding)
  return value if value.encoding == Encoding::UTF_8 && value.valid_encoding?
  value = value.dup
  value.force_encoding("utf-8")
  if value.valid_encoding?
    value
  else
    value.encode("utf-16be", :invalid => :replace, :replace => "?").encode("utf-8")
  end
end
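
The header comment wishes for a single-step way to drop invalid bytes. On Ruby 2.1 and later, `String#scrub` appears to cover that case, so the UTF-16 round trip may no longer be needed. The following is a minimal sketch under that assumption; the method name `clean_utf8_scrub` is hypothetical and not part of the original gist, and it keeps the same "?" replacement character the original uses.

```ruby
# Hypothetical variant of clean_utf8 (not from the original gist).
# Assumes Ruby >= 2.1, where String#scrub is available.
def clean_utf8_scrub(value)
  # Non-strings are left untouched, as in the original.
  return value unless value.respond_to?(:encoding)
  # Already valid UTF-8: nothing to do.
  return value if value.encoding == Encoding::UTF_8 && value.valid_encoding?

  # Re-tag a copy as UTF-8, then replace any invalid byte sequences with "?".
  value.dup.force_encoding(Encoding::UTF_8).scrub("?")
end

# Usage: a valid UTF-8 byte sequence that was mis-tagged as binary.
mislabelled = "caf\xC3\xA9".b
cleaned = clean_utf8_scrub(mislabelled)
puts cleaned.encoding        # UTF-8
puts cleaned.valid_encoding? # true
```

Whether `scrub` and the UTF-16 round trip agree on every malformed input is not something this sketch verifies, so it is worth comparing the two on real data before swapping one for the other.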