Last active
November 14, 2019 18:08
Ruby sample code to highlight the differences among Japanese-character conversion methods
#!/usr/bin/env ruby
# -*- coding: utf-8 -*-
# Sample code for my answer:
# https://stackoverflow.com/questions/58829740/how-do-i-filter-out-invisible-characters-without-affecting-japanese-character-se/58863231#58863231

require 'nkf'
begin
  require 'iconv' # https://rubygems.org/gems/iconv/
rescue LoadError
  warn "WARNING: Gem 'iconv' is not installed; skipping the iconv conversion"
end

orig = "b2◇〒α()あ相〜\u3000_8D━●★】$£€Ç♡㌔③\u2028ハンカク"
print "Orig:"; p orig

# (1) String#encode: round trip through ISO-2022-JP, dropping unconvertible characters.
print "Enc: "; p orig.encode('ISO-2022-JP', undef: :replace, replace: '').encode('UTF-8')

# (2) NKF: convert to EUC-JP (-e), then read it as EUC-JP (-E) and write UTF-8 (-w).
print "NKF: "; p NKF.nkf('-w -E', NKF.nkf('-e', orig))

# (3) iconv Gem: run only if the Gem was loaded successfully above.
if defined?(Iconv)
  output = ''
  Iconv.open('iso-2022-jp-2', 'utf-8') do |cd|
    cd.discard_ilseq = true
    # Note: cd.transliterate='?' raises Iconv::IllegalSequence
    # The parentheses matter: without them, cd.iconv(nil) is appended to orig
    # (mutating it) before the conversion, instead of to the converted result.
    output = cd.iconv(orig) << cd.iconv(nil)
    ## The original way described in the iconv Gem website
    # orig.each_char { |s| output << cd.iconv(s) }
    # output << cd.iconv(nil)
  end
  s2 = Iconv.conv('utf-8', 'iso-2022-jp-2', output)
  print "Icon:"; p s2
end

## Results (as of Ruby 2.6.5):
# Orig:"b2◇〒α()あ相〜 _8D━●★】$£€Ç♡㌔③\u2028ハンカク"
# Enc: "b2◇〒α()あ相〜 _8D━●★】$£"
# NKF: "b2◇〒α()あ相〜 _8D━●★】$£㌔③ハンカク"
# Icon:"b2◇〒α()あ相〜 _8D━●★】$£€Ç♡③ハンカク"

### For developers:
# p output             # Sanity check.
# p s2.encoding        # Sanity check.
# p s2.valid_encoding? # Sanity check.
# p Iconv.ctlmethods   # List of the available instance methods of iconv Gem

## Some say the following works, but it did not seem to as of Ruby 2.6.5.
# p Iconv.iconv('iso-2022-jp-2', 'utf-8//IGNORE', orig)
# p Iconv.conv('iso-2022-jp-2', 'utf-8//TRANSLIT', orig)
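
The results above show that each method drops a different set of characters. As a rough way to see which individual characters are responsible, the sketch below tests characters one at a time with Ruby's built-in `String#encode`. The helper names `encodable?` and `partition_by_encodability` are hypothetical and not part of the sample script. It covers only the built-in transcoder used on the `Enc:` line; NKF and the iconv Gem use their own tables and may decide differently.

```ruby
# Hypothetical helper (not part of the sample script above):
# returns true if +char+ can be converted to +encoding+ by Ruby's
# built-in transcoder, false if the target has no such character.
def encodable?(char, encoding = 'ISO-2022-JP')
  char.encode(encoding)
  true
rescue Encoding::UndefinedConversionError
  false
end

# Hypothetical helper: splits +str+ into the characters that survive
# the conversion and those that would be dropped, each joined into a String.
def partition_by_encodability(str, encoding = 'ISO-2022-JP')
  str.each_char.partition { |c| encodable?(c, encoding) }.map(&:join)
end

kept, dropped = partition_by_encodability("aあ€\u2028")
print "Kept:    "; p kept
print "Dropped: "; p dropped
```

Checking one character at a time is slow for long strings, but it makes explicit what `undef: :replace, replace: ''` removes silently.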