@rise-worlds
Last active October 13, 2016 08:58
Converting full-width characters to half-width characters in UTF-8 encoded text
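// Full-width characters are defined as Unicode code points 0xFF01 to 0xFF5E; the corresponding half-width characters are code points 0x21 to 0x7E. The space is a special case: full-width is 0x3000, half-width is 0x20. Apart from the space, full-width and half-width characters are in the same order by code point, so they differ by the constant offset 0xFEE0.
// In UTF-8, a full-width character is encoded in three bytes of the form 1110xxxx 10xxxxxx 10xxxxxx.
// So we first decode the UTF-8 data, and convert the code point if it falls in the range 0xFF01 to 0xFF5E.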
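// Worked example (an illustration added for clarity): full-width 'A' is U+FF21, encoded in UTF-8 as the bytes EF BC A1.
// Decoding: (0xEF & 0x0F) << 12 = 0xF000, (0xBC & 0x3F) << 6 = 0x0F00, 0xA1 & 0x3F = 0x21; OR-ed together these give 0xFF21.
// Converting: 0xFF21 - 0xFEE0 = 0x41, the ASCII code of 'A'.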
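void PreProcessor::half(std::string &input) {
    std::string temp;
    temp.reserve(input.size());
    for (size_t i = 0; i < input.size(); i++) {
        // A lead byte of the form 1110xxxx starts a three-byte sequence.
        // Also require both trailing bytes to exist, so a truncated
        // sequence at the end of the string is never read out of bounds.
        if (((input[i] & 0xF0) ^ 0xE0) == 0 && i + 2 < input.size()) {
            // Rebuild the code point from the payload bits of the three bytes.
            int old_char = (input[i] & 0xF) << 12 | ((input[i + 1] & 0x3F) << 6 | (input[i + 2] & 0x3F));
            if (old_char == 0x3000) { // full-width space
                char new_char = 0x20;
                temp += new_char;
            } else if (old_char >= 0xFF01 && old_char <= 0xFF5E) { // full-width character
                char new_char = static_cast<char>(old_char - 0xFEE0);
                temp += new_char;
            } else { // any other three-byte character: copy unchanged
                temp += input[i];
                temp += input[i + 1];
                temp += input[i + 2];
            }
            i = i + 2; // skip the two continuation bytes just consumed
        } else {
            temp += input[i];
        }
    }
    input = temp;
}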
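The function above is a member of `PreProcessor`, whose definition is not shown. For trying the conversion in isolation, here is a minimal standalone sketch of the same idea. The name `to_half_width` and the return-by-value signature are assumptions made for illustration, not part of the original gist. Unlike the original, this sketch also checks that the two trailing bytes are real continuation bytes (`10xxxxxx`) and copies malformed or truncated sequences through unchanged.

```cpp
#include <cstddef>
#include <string>

// Hypothetical free-function variant of PreProcessor::half.
// Returns a copy of `input` with full-width characters (U+FF01..U+FF5E)
// and the full-width space (U+3000) replaced by their ASCII equivalents.
std::string to_half_width(const std::string &input) {
    std::string out;
    out.reserve(input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        const unsigned char b0 = static_cast<unsigned char>(input[i]);
        // 1110xxxx marks the lead byte of a three-byte sequence.
        if ((b0 & 0xF0) == 0xE0 && i + 2 < input.size()) {
            const unsigned char b1 = static_cast<unsigned char>(input[i + 1]);
            const unsigned char b2 = static_cast<unsigned char>(input[i + 2]);
            // Both trailing bytes must look like 10xxxxxx.
            if ((b1 & 0xC0) == 0x80 && (b2 & 0xC0) == 0x80) {
                const unsigned int cp =
                    ((b0 & 0x0Fu) << 12) | ((b1 & 0x3Fu) << 6) | (b2 & 0x3Fu);
                if (cp == 0x3000) {
                    out += ' ';                             // full-width space
                } else if (cp >= 0xFF01 && cp <= 0xFF5E) {
                    out += static_cast<char>(cp - 0xFEE0);  // constant offset
                } else {
                    out.append(input, i, 3);                // other 3-byte char
                }
                i += 2;  // skip the two continuation bytes
                continue;
            }
        }
        out += input[i];  // ASCII, other lengths, or malformed input
    }
    return out;
}
```

With this sketch, `to_half_width("\xEF\xBC\xA1\xEF\xBC\xA2\xEF\xBC\xA3")` (full-width "ABC") returns `"ABC"`, while a character such as U+4E2D (`"\xE4\xB8\xAD"`) is returned unchanged.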