Last active September 29, 2017 19:00
my utf-8 coding test for software engineering job applicants
/*************************************************************************************************

UTF-8 validation test

UTF-8 is used for encoding text, which means mapping the bytes in a file to meaningful characters.
ASCII is the classic, once ubiquitous encoding in the western world. It uses a single byte per
character, and uses only the lower 7 bits of that byte. It can only represent the characters on a
US keyboard (as well as tab, carriage return, etc.). Other encodings exist which also use a single
byte per character, but use all 8 bits, like ISO-8859-1, or use multiple bytes per character, like
UTF-32 (also known as UCS-4), which always uses 4 bytes per character. These encodings can
represent many more characters, such as accented ones.

UTF-8 uses a variable number of bytes per character: one, or two, or even more. Even better, all
the single-byte characters are ASCII, so all ASCII encoded text is also valid UTF-8 encoded text.

In this test, you'll write a function which takes a string that is supposed to be UTF-8 encoded,
and decides whether it's valid UTF-8. This is simpler than it sounds: all you have to do is make
sure it consists of valid groups of bytes, and doesn't end in the middle of a group. (Each group
encodes one character, called a code point.)

The rules for these groups of bytes are as follows:

Single-byte groups match the pattern <most significant bit> 0xxxxxxx <least significant bit>, where
x means "either 0 or 1". For example, in both ASCII and UTF-8, the character 'K' is represented by
a single byte with the value 75 in decimal, or 0x4B in hexadecimal, or 01001011 in binary, which
fits the constraint above for a single-byte group.

Multi-byte groups have a start byte which indicates the length of the group, followed by the
correct number of continuation bytes. For example, the start byte for a 2-byte group is 110xxxxx,
and the continuation byte is of the form 10xxxxxx. To get the actual value of the group and figure
out which character it represents, you would take all the bits that go where the "x" are, and put
them together. Don't worry about that: for this problem, you don't have to find the values encoded,
just whether the groups of bytes are valid.

NUL bytes (value 0x00) have the same meaning in ASCII and UTF-8, so in C, you can NUL-terminate
UTF-8 strings just like you can ASCII strings.

Overview of byte groups (one group per code point):

    1 byte group: 0xxxxxxx
    2 byte group: 110xxxxx 10xxxxxx
    3 byte group: 1110xxxx 10xxxxxx 10xxxxxx
    4 byte group: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    5 byte group: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    6 byte group: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

As you can see, the start byte of a 3-byte group is 1110xxxx, and the continuation bytes have the
same form for all sizes of byte groups.

YOUR TASK:

Fill in the function test_utf8() below to make it return INVALID if called with an invalid
NUL-terminated byte sequence, and VALID otherwise.

SUGGESTIONS:

Comment your code a bit. Compile and run it. Suggested compile command:

    gcc -std=c99 -Wall -g -o utf8_test utf8_test.c

Feel free to make the test harness code more elegant, but focus on the correctness and style of
the core UTF-8 validation logic.

**************************************************************************************************/

#include <stdio.h>

#define VALID 0
#define INVALID 1

int test_utf8(const unsigned char *str)
{
    /* your code goes here, replace this faulty implementation */
    if ( str[0] & 0x80 ) {
        return INVALID;
    }
    return VALID;
}

/* "K", should be valid */
const unsigned char test1[] = { 0x4B, 0x00 };

/* "hey" with accented e - "héy" should be valid */
const unsigned char test2[] = { 0x68, 0xC3, 0xA9, 0x79, 0x00 };

/* junk, should fail */
const unsigned char test3[] = { 0x5A, 0xC3, 0xC3, 0xE9, 0x5A, 0x00 };

/* a random-ish sequence, valid under the rules above (a 4-byte group, then a 3-byte group) */
const unsigned char test4[] = { 0xF4, 0xAF, 0xA7, 0xB2, 0xE6, 0xA1, 0xB3, 0x00 };

/* junk, should fail (ends in the middle of a 4-byte group) */
const unsigned char test5[] = { 0x5A, 0x79, 0xF4, 0xAF, 0xA7, 0x00 };

const unsigned char *tests[] = {test1, test2, test3, test4, test5};
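
int main(void) {
    /* derive the number of tests from the array so the loop bound cannot go stale */
    const int num_tests = (int)(sizeof(tests) / sizeof(tests[0]));
    int i;
    for (i = 0; i < num_tests; i++) {
        printf("test%d: %s\n", i+1, test_utf8(tests[i]) == VALID ? "VALID" : "INVALID");
    }
    return 0;
}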
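
The byte-group rules in the header comment can be turned into code fairly directly. The block below is only a minimal sketch of one possible approach, not a reference solution from the gist: the names `continuation_count` and `check_utf8_structure` are hypothetical, and it assumes exactly the structural rules listed above (1 to 6 byte groups), without decoding values or rejecting overlong or out-of-range encodings.

```c
#include <assert.h>
#include <stddef.h>

#define VALID 0
#define INVALID 1

/* Hypothetical helper: number of continuation bytes that must follow the
 * given start byte, or -1 if the byte cannot start a group. */
static int continuation_count(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 0;  /* 0xxxxxxx */
    if ((b & 0xE0) == 0xC0) return 1;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 2;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 3;  /* 11110xxx */
    if ((b & 0xFC) == 0xF8) return 4;  /* 111110xx */
    if ((b & 0xFE) == 0xFC) return 5;  /* 1111110x */
    return -1;                         /* stray 10xxxxxx, 0xFE or 0xFF */
}

/* Hypothetical validator: checks only the grouping structure of a
 * NUL-terminated byte sequence. */
int check_utf8_structure(const unsigned char *str)
{
    assert(str != NULL);
    while (*str != 0x00) {
        int need = continuation_count(*str++);
        if (need < 0) {
            return INVALID;
        }
        while (need-- > 0) {
            /* A NUL terminator fails this check too, so a string that ends
             * in the middle of a group is rejected without reading past it. */
            if ((*str & 0xC0) != 0x80) {
                return INVALID;
            }
            str++;
        }
    }
    return VALID;
}
```

Under these assumptions, calling `check_utf8_structure` on the five test arrays above should give VALID, VALID, INVALID, VALID, INVALID. Splitting the start-byte classification into its own helper keeps the bit masks in one place, which makes them easier to compare against the overview table.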