@edouard-lopez
Last active March 29, 2017 18:03
Filter codepoints from the raw Unihan database
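#!/usr/bin/gawk -f
# Print "codepoint<TAB>reading" for every kMandarin row whose codepoint falls
# inside one of the ranges passed as -v toggleValues="start,end,start,end,...".
# Requires gawk (strtonum). Ranges must be in ascending order, like the input file.
BEGIN {
    # split() fills boundaries[] starting at index 1, not 0
    count = split(toggleValues, boundaries, ",")
    start = 1
    end = start + 1
}
/^U\+/ && $2 == "kMandarin" {
    row = toHex($1)
    # skip ranges that end before this row, then test the current range
    while (end <= count && row > toHex(boundaries[end])) { start += 2; end = start + 1 }
    if (end <= count && toHex(boundaries[start]) <= row) print $1 "\t" $3
}
# Despite its name, toHex() parses the hex digits after "U+" into a number.
function toHex(field) { return strtonum("0x" substr(field, 3)) }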
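# start,end pairs in ascending order: CJK Extension A, CJK Unified Ideographs,
# CJK Compatibility Ideographs, then Extensions B to E
toggleValues="U+3400,U+4DBF,U+4E00,U+9FFF,U+F900,U+FAFF,U+20000,U+2A6DF,U+2A700,U+2B73F,U+2B740,U+2B81F,U+2B820,U+2CEAF"
# strtonum() is a gawk extension, so call gawk explicitly
gawk -f src/chinese.awk \
    -v toggleValues="$toggleValues" src/Unihan_Readings.txt \
    > src/codepoint-ruby.tsv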
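
The awk script above relies on strtonum(), which is a gawk extension. As a rough sketch, assuming only a POSIX awk is available, the same range filter could look like the block below. The names in_range and hex2dec are hypothetical helpers, not part of the gist, and the sample rows are made up to mimic the tab-separated layout of Unihan_Readings.txt. Unlike the script above, this sketch tests every range for each row, so the order of the ranges does not matter.

```shell
#!/bin/sh
# Hypothetical helper: filter Unihan-style rows read from stdin.
# $1 is a comma-separated list of start,end codepoint pairs.
in_range() {
  awk -v toggleValues="$1" '
    # hex2dec (assumed name) stands in for strtonum("0x" ...) in plain POSIX awk
    function hex2dec(field,    digits, i, n) {
      digits = toupper(substr(field, 3))   # drop the leading "U+"
      n = 0
      for (i = 1; i <= length(digits); i++)
        n = n * 16 + index("0123456789ABCDEF", substr(digits, i, 1)) - 1
      return n
    }
    BEGIN { count = split(toggleValues, boundaries, ",") }
    $2 == "kMandarin" {
      row = hex2dec($1)
      # test each start,end pair; stop at the first range that contains the row
      for (k = 1; k < count; k += 2)
        if (hex2dec(boundaries[k]) <= row && row <= hex2dec(boundaries[k + 1])) {
          print $1 "\t" $3
          break
        }
    }'
}

# Made-up sample rows: only the two kMandarin rows inside the ranges are printed.
printf 'U+3400\tkMandarin\tqiu\nU+4E00\tkCantonese\tjat1\nU+4E00\tkMandarin\tyi\nU+A000\tkMandarin\tx\n' |
  in_range "U+4E00,U+9FFF,U+3400,U+4DBF"
```

Scanning every range per row costs a few extra comparisons but removes the requirement that the boundary list be sorted the same way as the input file.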