@fabrizioc1
Created October 6, 2011 05:53
Sanitizing UTF8 or ISO8859-1 strings in Ruby 1.8.7
require 'cgi'
require 'iconv' # Iconv.conv is used in html_sanitize
require 'active_support/all' # provides String#is_utf8?
WHITELIST_CHAR_FRENCH_UTF8 =
["\xC3\x80", "\xC3\x84", "\xC3\x88", "\xC3\x89", "\xC3\x8A", "\xC3\x8B", "\xC3\x8E", "\xC3\x8F", "\xC3\x94", "\xC3\x99", "\xC3\x9B",
"\xC3\x9C", "\xC3\x87", "\xC3\xA0", "\xC3\xA2", "\xC3\xA4", "\xC3\xA8", "\xC3\xA9", "\xC3\xAA", "\xC3\xAB", "\xC3\xAE", "\xC3\xAF",
"\xC3\xB4", "\xC3\xB9", "\xC3\xBB", "\xC3\xBC", "\xC3\xBF", "\xC3\xA7"].join
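# The same French characters as single ISO-8859-1 bytes (octal escapes).
# Kept for reference; the whitelist below uses the UTF-8 forms.
WHITELIST_CHAR_FRENCH_ISO8859_1 =
  "\300\304\310\311\312\313\316\317\324\331\333\334\307\340\342\344\350\351\352\353\356\357\364\371\373\374\377\347"
WHITELIST_CHAR = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ()*/_.-,|!\$?:" + WHITELIST_CHAR_FRENCH_UTF8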
# Decode two-digit hexadecimal character references (e.g. "&#xE9;") into the byte they name.
def html_unescape(s)
  s.gsub(/&#x([0-9a-f]{2});/i) { sprintf("%c", $1.hex) }
end
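def html_sanitize(s)
  s = html_unescape(s)
  puts "HTML UNESCAPED: #{s}"
  # Input that is not valid UTF-8 is assumed to be ISO-8859-1 and converted.
  s = s.is_utf8? ? s : Iconv.conv("UTF-8", "ISO-8859-1", s)
  # Keep only whitelisted characters. On 1.8.7, whether String#chars splits on
  # characters or on bytes depends on $KCODE, which the script prints below.
  raw = s.chars.select { |c| WHITELIST_CHAR.include?(c) }
  puts "RAW: #{raw.join(',')}"
  raw.join
end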
input = ARGV[0]
puts "KCODE: #{$KCODE}"
puts "INPUT: #{input}"
puts "UTF-8: #{input.is_utf8?}" if input.respond_to?(:is_utf8?)
puts "BYTES: #{input.bytes.to_a.join(',')}"
puts "CHARS: #{input.chars.to_a.join(',')}"
puts "SANITIZED: #{html_sanitize(input)}"
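
The script above depends on `Iconv`, `$KCODE`, and ActiveSupport's `is_utf8?`, which ties it to Ruby 1.8.7. As a point of comparison, here is a minimal sketch of the same whitelist approach for Ruby 1.9 and later, using only the built-in `String#force_encoding`, `String#valid_encoding?`, and `String#encode`. It is not part of the original gist: the names `FRENCH_CODEPOINTS`, `ALLOWED_CHARS`, `unescape_hex_entities`, `to_utf8`, and `sanitize` are hypothetical, and it keeps the original's assumption that anything that is not valid UTF-8 is ISO-8859-1.

```ruby
# Sketch for Ruby 1.9+ (an assumption; the original script targets 1.8.7).
# All constant and method names here are illustrative.

# Code points of the French letters whitelisted in the original script.
FRENCH_CODEPOINTS = [0xC0, 0xC4, 0xC8, 0xC9, 0xCA, 0xCB, 0xCE, 0xCF, 0xD4, 0xD9,
                     0xDB, 0xDC, 0xC7, 0xE0, 0xE2, 0xE4, 0xE8, 0xE9, 0xEA, 0xEB,
                     0xEE, 0xEF, 0xF4, 0xF9, 0xFB, 0xFC, 0xFF, 0xE7]

# One entry per allowed character; pack('U*') yields a UTF-8 string.
ALLOWED_CHARS = (('a'..'z').to_a + ('A'..'Z').to_a + ('0'..'9').to_a +
                 ' ()*/_.-,|!$?:'.chars +
                 FRENCH_CODEPOINTS.pack('U*').chars).freeze

# Replace two-digit hex references such as "&#xE9;" with the raw byte.
# Works on a binary copy so a high byte cannot clash with the string's encoding.
def unescape_hex_entities(s)
  s.dup.force_encoding(Encoding::BINARY).gsub(/&#x([0-9a-f]{2});/i) { $1.hex.chr }
end

# Interpret the bytes as UTF-8 if they are valid UTF-8; otherwise assume
# ISO-8859-1 (the same fallback the original script uses) and transcode.
def to_utf8(bytes)
  utf8 = bytes.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Drop every character that is not on the whitelist.
def sanitize(input)
  text = to_utf8(unescape_hex_entities(input))
  text.each_char.select { |c| ALLOWED_CHARS.include?(c) }.join
end

puts sanitize("caf\xC3\xA9 <b>")  # UTF-8 input, prints "café b"
puts sanitize("caf\xE9")          # ISO-8859-1 input, prints "café"
puts sanitize("caf&#xE9;")        # hex reference, prints "café"
```

As in the original, a string that mixes valid UTF-8 with a stray high byte fails the UTF-8 check and is treated entirely as ISO-8859-1, so its multi-byte characters are split into Latin-1 characters and mostly filtered out.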