Created
July 4, 2021 00:05
A Ruby helper method to convert CJK (Japanese) Zenkaku to ASCII
| # coding: utf-8 | |
| # NKF is a built-in gem/library, at least up to Ruby 3.0. | |
| # | |
| # The following method is a wrapper for NKF. | |
| # | |
| # NKF (Network Kanji Filter) is an old library dating back to the late 1980s, but I find it is | |
| # still the most convenient library in 2021 for handling text in | |
| # some of the nastier Japanese character encodings (n.b., the "proper" Japanese character | |
| # code sets are fine, but unfortunately the variants popularized | |
| # by Microsoft and Apple, although based on encodings that existed and were in use earlier, | |
| # are pretty irregular...). | |
| require 'nkf' | |
| # Method to convert (CJK) Zenkaku alphabet/number/symbol to Hankaku. | |
| # | |
| # A JIS X 0208 (full-width) space is converted to 2 ASCII spaces by default (zopt == 2), | |
| # corresponding to the NKF option '-Z2'. | |
| # Any other NKF options should be given as a single string: opts. | |
| # Option '-w' is added by default, which means the output is in UTF-8, | |
| # unless one of [-j, -e, -s, -w] (JIS, EUC, SJIS, UTF-8) is already included in opts. | |
| # Option '-m0' (no MIME decoding) is also set by default. | |
| # | |
| # @example | |
| # zenkaku_to_ascii('（あ）', zopt: 1) # => '(あ)' | |
| # | |
| # @param instr [String] Input to NKF | |
| # @param opts [String] NKF option string | |
| # @param zopt [Integer, nil] Digit appended to the NKF option -Z (1: one ASCII space per full-width space; 2, the default: two) | |
| # @return [String] | |
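| def zenkaku_to_ascii(instr, opts='', zopt: nil) | |
|   z_spaces = (zopt || 2) | |
|   # Prepend -w (UTF-8 output) unless opts already specifies an output encoding; | |
|   # otherwise pass opts through unchanged (previously opts was silently dropped in that case). | |
|   optsnkf = (/(^| )-[jesw]/ !~ opts) ? ("-w "+opts).strip : opts | |
|   NKF.nkf("-m0 -Z#{z_spaces} #{optsnkf}", instr) # [-Z2] Convert a JIS X0208 space to 2 ASCII spaces, as well as Zenkaku alphabet/number/symbol to Hankaku. | |
| end | |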
| ############ Tests ############ | |
| if $0 == __FILE__ | |
|   require 'minitest/autorun' | |
|   class TestZenkakuToAsciiClass < Minitest::Test | |
|     def setup; end | |
|     def teardown; end | |
|     def test_zenkaku_to_ascii | |
|       # Full-width parentheses become ASCII; Hiragana is left untouched. | |
|       assert_equal '(あ)', zenkaku_to_ascii('（あ）', zopt: 1) | |
|       # zopt: 1 turns a full-width space (U+3000) into one ASCII space, zopt: 2 into two. | |
|       assert_equal '1 A', zenkaku_to_ascii("\uff11\u3000\uff21", zopt: 1) | |
|       assert_equal '1  A', zenkaku_to_ascii("\uff11\u3000\uff21", zopt: 2) | |
|     end | |
|   end | |
| end | |
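To make the option handling in the wrapper concrete, here is a minimal sketch of how the NKF option string might be assembled, separated from the actual NKF call so it can be inspected without the nkf library. The helper name `nkf_option_string` is hypothetical and not part of the gist, and the sketch assumes that an explicitly given encoding option (-j, -e, -s or -w) is meant to be passed through unchanged.

```ruby
# Hypothetical helper (not part of the gist): returns the option string
# that the wrapper would hand to NKF.nkf for the given arguments.
def nkf_option_string(opts = '', zopt: nil)
  z_spaces = zopt || 2
  # Assumption: if the caller already requests an output encoding,
  # the option string is used as-is; otherwise -w (UTF-8) is prepended.
  encoding_given = opts.match?(/(^| )-[jesw]/)
  optsnkf = encoding_given ? opts : "-w #{opts}".strip
  "-m0 -Z#{z_spaces} #{optsnkf}"
end

puts nkf_option_string                 # -m0 -Z2 -w
puts nkf_option_string('-s', zopt: 1)  # -m0 -Z1 -s
puts nkf_option_string('-x')           # -m0 -Z2 -w -x
```

Keeping the string construction apart from the conversion itself is only one possible design; it makes the defaults (-m0, -Z2, -w) easy to check in isolation.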