@edouard-lopez
Last active December 22, 2015 05:18
Converting Unicode point to character with awk
awk '/U/ && /kMandarin/{split($1,uc,"+"); printf "%s;%s;%s\n",$1,system("/usr/bin/printf \"%b\n\" \"\\u" uc[2] "\""),$3 }' ./unihan/Unihan/Unihan_Readings.txt | head
U+3400;0;qiū
U+3401;0;tiàn
U+3404;0;kuà
U+3405;0;wǔ
U+3406;0;yǐn
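Note on the `0` in the second column above: awk's `system()` returns the command's exit status, not its output, so the converted character never reaches `printf`. A minimal sketch of one way around this, reading the command's output with `cmd | getline`, is shown below. The function name `unihan_to_csv` is made up for illustration, and the sketch assumes `/usr/bin/printf` is the GNU coreutils version (which understands `\UHHHHHHHH`) running under a UTF-8 locale. It pads the code point to eight hex digits and uses `\U` rather than `\u`, so that code points above U+FFFF are not cut off after four digits.

```shell
# Sketch only: unihan_to_csv is a hypothetical helper name.
# Assumes GNU /usr/bin/printf (supports \UHHHHHHHH) and a UTF-8 locale.
unihan_to_csv() {
    awk '$1 ~ /^U\+/ && $2 == "kMandarin" {
        split($1, uc, "+")                                # uc[2] is the hex code point, e.g. 3400
        hex = substr("00000000" uc[2], length(uc[2]) + 1) # left-pad to 8 hex digits
        cmd = "/usr/bin/printf \"\\U" hex "\""            # e.g. /usr/bin/printf "\U00003400"
        ch = ""
        cmd | getline ch                                  # capture the output; system() would only give the exit status
        close(cmd)                                        # close the pipe before the next record
        printf "%s;%s;%s\n", $1, ch, $3
    }' "$@"
}
```

It can be called with the data file as its argument, for example `unihan_to_csv ./unihan/Unihan/Unihan_Readings.txt | head`. This spawns one `printf` process per matching line, so it is likely to be slow on the full file.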
awk '/U/ && /kMandarin/ {print $0}' ./resources/unihan/Unihan/Unihan_Readings.txt | head
U+3400 kMandarin qiū
U+3401 kMandarin tiàn
U+3404 kMandarin kuà
U+3405 kMandarin wǔ
U+3406 kMandarin yǐn
U+340C kMandarin yí
U+3416 kMandarin xié
U+341C kMandarin chóu
U+3421 kMandarin nuò
U+3424 kMandarin dān
@edouard-lopez

I ended up using bash, as it's simpler, as recommended by geirha on #awk:

while read -r a b c; do if [[ $a = U+* && $b = kMandarin ]]; then printf '%s;%b;%s\n' "$a" "\\U${a#U+}" "$c"; fi; done < ./unihan/Unihan/Unihan_Readings.txt
