@masasakano
Created July 4, 2021 00:05
A Ruby helper method to convert CJK (Japanese) Zenkaku to ASCII
# coding: utf-8
# NKF is a standard library bundled with Ruby, at least up to Ruby 3.0.
#
# The following method is a wrapper for NKF.
#
# NKF (Network Kanji Filter) is an old library that has been around since the
# 1990s. Even in 2021, I still find it the most convenient library for handling
# text in some of the nastier Japanese character encodings (n.b., the "proper"
# Japanese character code sets are fine; unfortunately, the variants
# popularized by Microsoft and Apple, although their base had been created
# and used earlier, are pretty irregular...).
require 'nkf'
# Converts (CJK) Zenkaku (fullwidth) alphabet/number/symbol characters to Hankaku (ASCII).
#
# By default (zopt == 2), a JIS X 0208 space is converted to 2 ASCII spaces,
# corresponding to the NKF option '-Z2'; with zopt == 1 ('-Z1') it becomes 1 ASCII space.
# Any other NKF options should be given as a single string: opts.
# Option '-w' (UTF-8 output) is added by default,
# unless one of [-j, -e, -s, -w] (JIS, EUC, SJIS, UTF-8) is already included in opts.
# Option '-m0' (no MIME decoding) is also always passed.
# @example
# zenkaku_to_ascii('（あ）', zopt: 1) # => '(あ)'
#
# @param instr [String] Input to NKF
# @param opts [String] Extra NKF option string, e.g. '-s'
# @param zopt [Integer, nil] Argument to the NKF option '-Z' (2 is used if nil)
# @return [String]
def zenkaku_to_ascii(instr, opts='', zopt: nil)
  z_spaces = (zopt || 2)
  # Prepend '-w' (UTF-8 output) unless an output encoding is already specified;
  # otherwise pass the caller's options through unchanged.
  optsnkf = (/(^| )-[jesw]/ !~ opts) ? "-w #{opts}".strip : opts
  # [-Z2] Convert a JIS X0208 space to 2 ASCII spaces, as well as Zenkaku alphabet/number/symbol to Hankaku.
  NKF.nkf("-m0 -Z#{z_spaces} #{optsnkf}", instr)
end
############ Tests ############
if $0 == __FILE__
  require 'minitest/autorun'

  class TestZenkakuToAsciiClass < Minitest::Test
    def setup; end
    def teardown; end

    def test_zenkaku_to_ascii
      assert_equal '(あ)', zenkaku_to_ascii('（あ）', zopt: 1)
      assert_equal '1 A',  zenkaku_to_ascii("\uff11\u3000\uff21", zopt: 1)
      assert_equal '1  A', zenkaku_to_ascii("\uff11\u3000\uff21", zopt: 2)
    end
  end
end
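
For readers without NKF at hand, the core of the Zenkaku-to-Hankaku mapping can be roughly illustrated in plain Ruby. This is only a sketch under stated assumptions, not what NKF does internally: the helper name `fullwidth_to_ascii_sketch` and its `spaces:` keyword are hypothetical, and it covers only the Unicode fullwidth ASCII block (U+FF01 to U+FF5E) plus the ideographic space (U+3000), whereas NKF works from the JIS X 0208 tables and handles more cases.

```ruby
# Hypothetical helper (not part of the gist and not NKF's algorithm):
# a pure-Ruby approximation covering only U+FF01..U+FF5E and U+3000.
def fullwidth_to_ascii_sketch(str, spaces: 2)
  # Map each fullwidth form to the ASCII character at the same offset ('!'..'~').
  ascii = str.tr("\uFF01-\uFF5E", "\u0021-\u007E")
  # Replace each ideographic space with the requested number of ASCII spaces,
  # mirroring the 1-space / 2-space choice of NKF's -Z1 / -Z2.
  ascii.gsub("\u3000", ' ' * spaces)
end

puts fullwidth_to_ascii_sketch("\uFF11\u3000\uFF21", spaces: 1)  # prints "1 A"
```

Characters outside those two ranges, such as kana and kanji, pass through unchanged, so the sketch is best read as an explanation of the mapping and not as a replacement for the NKF wrapper above.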