Created
October 30, 2011 12:08
-
-
Save meijeru/1325840 to your computer and use it in GitHub Desktop.
UTF-8 string to block of Unicode codepoints conversion: Red/System
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Red/System [] | |
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; | |
; function utf8-to-cps (cps = codepoints) | |
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; | |
; This function decodes UTF-8 information supplied as bytes in argument u | |
; and uses the argument res to store the array of codepoints | |
; (integers >= 0 and < 10FFFFh); the space for this array should have been | |
; allocated by the caller; an upper limit for the size of the array in bytes | |
; is 4 times the length of the UTF-8 string. The actual size used is returned. | |
; For the algorithm, see http://en.wikipedia.org/wiki/UTF-8. | |
; Remarks about the contents: | |
; (1) Coding errors are skipped until the first correct UTF-8 combination | |
; is encountered. The first byte of the offending combination is | |
; replaced by a standard byte according to one of the several options | |
; described in the article cited above and/or taken from elsewhere. | |
; (2) Codepoint U+0000 cannot be represented as UTF-8 single byte 00h | |
; since that signifies end-of-c-string for Red/System! | |
; However, this code point can be represented by C0h 80h. | |
; (3) As it stands, the function accepts "overlong" combinations, such as | |
; C0h 80h or E0h 80h 80h or even F0h 80h 80h 80h for U+0000. | |
; Uncomment the appropriate lines to test for this. | |
; (4) In any case, it rejects 4-byte sequences coding for 110000h and higher | |
; (in fact, up to 1FFFFFh), because codepoints beyond U+10FFFF | |
; are not defined in the Unicode standard. | |
; (5) Also, it rejects sequences resulting in the invalid Unicode codepoints | |
; U+DC00..U+DFFF, which are used in UTF-16 for high and low surrogate halves. | |
; It may, however, use some of these as replacement in the output. | |
; Remarks about the coding: | |
; (1) Another choice of type for the input parameter could have been | |
; pointer! [byte!], but this would have meant that the length of the | |
; input must be paassed as an extra parameter. It would have done | |
; away with the problem of the null byte, though. | |
; (2) The choice of type for the output parameter is almost forced upon us, | |
; since Red/System does not have native arrays. | |
; (3) The shift and bitwise or operators are assumed to be more efficient | |
; than division and addition. They are also more closely tied to the | |
; UTF-8 specs, reducing the possibility of errors. | |
; (4) The cases of invalid byte sequences include: | |
; - an invalid starting byte | |
; - an unexpected continuation byte | |
; - a start byte not followed by enough continuation bytes (incomplete sequence) | |
; - a sequence that decodes to a value that should use a shorter sequence | |
; (an "overlong form"). | |
; The first case is caught by the test: unless b1 < C0h | |
; The second case is caught by the tests: b2 >= 80h b2 < C0h etc. | |
; The third case is effectively caught by the same tests, since | |
; b2 >= 80h etc. will fail if b2 = 0 etc. | |
; Thus there is no need for the equivalent of the protective REBOL clause | |
; if not tail? next u | |
; etc. | |
; The last case is also explicitly caught if the appropriate lines are | |
; uncommented. | |
; (5) With a restriction to the BMP (codepoints up to U+FFFF) the resulting | |
; array of codepoints could have elements that are uint16!, which would | |
; save half of the space for this array. | |
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; | |
#define code-array! [pointer! [integer!]] | |
#define replacement | |
; choose one of the following options | |
FFFDh ; U+FFFD = replacement character | |
; 1Ah ; U+001A = control SUB (substitute) | |
; 241Ah ; U+241A = symbol for substitute | |
; 2426h ; U+2426 = symbol for substitute form two | |
; 3Fh ; U+003F = question mark | |
; BFh ; U+00BF = inverted question mark | |
; DC00h + b1 ; U+DCxx where xx = b1 (never a Unicode codepoint) | |
utf8-to-cps: func [ | |
u [c-string!] | |
res [code-array!] | |
return: [integer!] | |
/local | |
b1 b2 b3 b4 ; up to four bytes in a UTF-8 sequence | |
; for computing purposes they are of integer! type | |
cp ; computed codepoint | |
res0 ; start of result | |
][ | |
res0: res | |
while [b1: as-integer u/1 b1 <> 0][ | |
; cycling through res is done at the end; likewise for u | |
; to account for this, as soon as a multiple byte sequence is consumed | |
; the pointer in u is moved one less than the number of bytes consumed | |
either b1 < 80h ; single byte (ASCII) | |
[ | |
res/value: b1 ; and we are done | |
][ | |
res/value: replacement | |
; assume error by default - this simplifies code greatly | |
; res/value is now only set if a correct sequence has been decoded | |
unless b1 < C0h [ ; 80h - BFh may not start a sequence | |
case [ | |
b1 < E0h [ ; start of two-byte sequence | |
b2: as-integer u/2 | |
if all [ | |
b2 >= 80h b2 < C0h | |
][ | |
cp: (b1 - C0h << 6) or | |
(b2 - 80h) | |
; if any [ | |
; cp > 7Fh ; optional test for overlong | |
; cp = 0 ; even so, must allow U+0000 | |
; ][ | |
res/value: cp | |
u: u + 1 | |
; ] | |
] | |
] | |
b1 < F0h [ ; start of three-byte sequence | |
b2: as-integer u/2 | |
b3: as-integer u/3 | |
if all [ | |
b2 >= 80h b2 < C0h | |
b3 >= 80h b3 < C0h | |
][ | |
cp: (b1 - E0h << 12) or | |
(b2 - 80h << 6) or | |
(b3 - 80h) | |
if all [ | |
any [cp < DC00h cp > DCFFh] | |
; cp > 7FFh ; optional test for overlong | |
][ | |
res/value: cp | |
u: u + 2 | |
] | |
] | |
] | |
b1 < F8h [ ; start of four-byte sequence | |
b2: as-integer u/2 | |
b3: as-integer u/3 | |
b4: as-integer u/4 | |
if all [ | |
b2 >= 80h b2 < C0h | |
b3 >= 80h b3 < C0h | |
b4 >= 80h b4 < C0h | |
][ | |
cp: (b1 - F0h << 18) or | |
(b2 - 80h << 12) or | |
(b3 - 80h << 6) or | |
(b4 - 80h) | |
if all [ | |
cp <= 10FFFFh | |
; cp > FFFFh ; optional test for overlong | |
][ | |
res/value: cp | |
u: u + 3 | |
] | |
] | |
] | |
; true [ | |
; error case | |
; ] | |
] | |
] | |
] | |
res: res + 1 | |
u: u + 1 | |
] | |
res - res0 | |
] |
Looks good to me!
Hello Rudolf, I'm having this random problem with Red Console crashing:
length? ecall/o {stat -n -f "%HT" '/i-am-a-file.txt'}
;call-in>> ecall/o {stat -n -f "%HT" '/i-am-a-file.txt'}
;call-out>> "Regular File"
; 12
length? ecall/o {stat -n -f "%HT" '/i-am-a-file.txt'}
;call-in>> ecall/o {stat -n -f "%HT" '/i-am-a-file.txt'}
;call-out>> "Regular File"
; 12
length? ecall/o {stat -n -f "%HT" '/i-am-a-file.txt'}
;call-in>> ecall/o {stat -n -f "%HT" '/i-am-a-file.txt'}
;call-out>> "Regular File"
; 12
length? ecall/o {stat -n -f "%HT" '/i-am-a-file.txt'}
;call-in>> ecall/o {stat -n -f "%HT" '/i-am-a-file.txt'}
*** Runtime Error 1: access violation
*** in file: unicode.reds
*** at line: 345
ecall/o
is just a wrapper to call/output
. This example without wrapper return the same error:
a: does[call/output {stat -n -f "%HT" '/i-am-a-file.txt'} o: ""]
; func [][call/output {stat -n -f "%HT" '/i-am-a-file.txt...
loop 100 [a]
; 0
loop 100 [a]
*** Runtime Error 1: access violation
*** in file: unicode.reds
*** at line: 345
I had one occurrence of this error also (but mostly it's the above error that comes out most):
;call-in>> ecall/o {stat -n -f "%HT" '/i-am-a-file.txt'}
*** Runtime Error 100: no value matched in CASE
*** in file: unicode.reds
*** at line: 276
load-utf8-buffer: func [
src [c-string!] ;-- UTF-8 input buffer (zero-terminated)
size [integer!] ;-- size of src in bytes (including terminal NUL)
dst [series!] ;-- optional output string! series
remain [int-ptr!] ;-- number of undecoded bytes at end of buffer
return: [node!]
/local
node [node!]
s [series!]
buf1 [byte-ptr!]
buf4 [int-ptr!]
end [byte-ptr!]
unit [integer!]
cp [integer!] ;-- computed codepoint
count [integer!]
used [integer!]
][
#if debug? = yes [if verbose > 0 [print-line "unicode/load-utf8-buffer"]]
assert positive? size
either null? dst [ ;-- test if output buffer is provided
node: alloc-series size 1 0
s: as series! node/value
unit: Latin1 ;-- start with 1 byte/codepoint
][
node: dst/node
s: dst
unit: GET_UNIT(s)
if s/size / unit < size [
s: expand-series s size * unit
]
]
buf1: as byte-ptr! s/offset
buf4: null
end: buf1 + s/size
count: size
if size = 1 [return node] ;-- terminal NUL accounted
;assert not zero? as-integer src/1 ;@@ ensure input string not empty
;-- the first part of loop is Rudolf's code with very minor modifications
;-- (res/value replaced by cp, 'u renamed to 'src)
;-- original source code: https://gist.github.com/1325840
until [
; cycling through res is done at the end; likewise for src
; to account for this, as soon as a multiple byte sequence is consumed
; the pointer in src is moved one less than the number of bytes consumed
used: count ;-- pass number of remaining bytes in input stream
cp: decode-utf8-char src :used
if cp = -1 [ ;-- premature exit if buffer incomplete
s/tail: as cell! either unit = UCS-4 [buf4][buf1] ;-- position s/tail at end of loaded characters (no NUL terminator)
remain/value: count ;-- return the number of unprocessed bytes
return node
]
line 345 is the return none
command of function load-utf8-buffer
(above) in %red/runtime/unicode.reds
I'm on mac osx btw. 8-) Using red-console built form latest nightly red + call.red
Do you have an idea what could be going wrong?
Thank you!!
Will Arp
Just tried on Archlinux and can't reproduce the error, so it may be os x specific, either in unicode handling or in the call function.
On Archlinux 64:
a: does [call/output {stat --printf="%F" '/i-am-a-file.txt'} o: ""]
; func [][call/output {stat --printf="%F" '/i-am-a-file.txt'} o: ""]
loop 100 [a]
; 0
call/output {stat --printf="%F" '/i-am-a-file.txt'} o: ""
; 0
o
; "regular file"
loop 100 [a]
; 0
loop 100 [a]
; 0
loop 100 [a]
; 0
loop 100 [a]
; 0
loop 100 [a]
; 0
loop 10000 [a]
; 0
loop 100 [a]
; 0
loop 100 [a]
; 0
I hope you are on Mac! ;-)
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
I have updated this routine to use the new CASE statement.