In writing a string manipulation library I ran into what I think is a bug in how substr
handles malformed UTF8 strings. I was testing corner cases, in particular the following case of a string that ends with the first byte of a UTF8 sequence:
string <- "abc\xEE" # \xEE indicates the start of a 3 byte UTF-8 sequence
Encoding(string) <- "UTF-8"
substr(string, 1, 10)
When run under valgrind with level 2 instrumentation, we get:
> string <- "abc\xEE" # \xEE indicates the start of a 3 byte UTF-8 sequence
> Encoding(string) <- "UTF-8"
> substr(string, 1, 10)
==613== Invalid read of size 1
==613== at 0x4EE519C: substr (character.c:286)
==613== by 0x4EE563D: do_substr (character.c:342)
==613== by 0x4F83444: bcEval (eval.c:6771)
==613== by 0x4F70257: Rf_eval (eval.c:624)
==613== by 0x4F72A7C: R_execClosure (eval.c:1764)
==613== by 0x4F72770: Rf_applyClosure (eval.c:1692)
==613== by 0x4F70A7F: Rf_eval (eval.c:747)
==613== by 0x4FB939A: Rf_ReplIteration (main.c:258)
==613== by 0x4FB955C: R_ReplConsole (main.c:308)
==613== by 0x4FBB00A: run_Rmainloop (main.c:1082)
==613== by 0x4FBB020: Rf_mainloop (main.c:1089)
==613== by 0x10896B: main (Rmain.c:29)
==613== Address 0xf5d04fd is 4,989 bytes inside a block of size 7,960 alloc'd
==613== at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==613== by 0x4FBF2B9: GetNewPage (memory.c:888)
==613== by 0x4FCD13C: Rf_allocVector3 (memory.c:2691)
==613== by 0x4FB2D9C: Rf_allocVector (Rinlinedfuns.h:577)
==613== by 0x4FA4CE6: do_strsplit (grep.c:462)
==613== by 0x4F83444: bcEval (eval.c:6771)
==613== by 0x4F70257: Rf_eval (eval.c:624)
==613== by 0x4F72A7C: R_execClosure (eval.c:1764)
==613== by 0x4F72770: Rf_applyClosure (eval.c:1692)
==613== by 0x4F8309A: bcEval (eval.c:6739)
==613== by 0x4F70257: Rf_eval (eval.c:624)
==613== by 0x4F72A7C: R_execClosure (eval.c:1764)
==613==
[1] "abc<ee>"
I am far from an expert in these things, but I think the read error is in the following block:
// main/character.c@276
static void substr(char *buf, const char *str, int ienc, int sa, int so)
{
/* Store the substring str [sa:so] into buf[] */
int i, j, used;
if (ienc == CE_UTF8) {
const char *end = str + strlen(str);
for (i = 0; i < so && str < end; i++) {
int used = utf8clen(*str);
if (i < sa - 1) { str += used; continue; }
for (j = 0; j < used; j++) *buf++ = *str++; /* PROBLEM HERE */
}
Because utf8clen
[VERIFY THIS] cares only about the leading byte, it happily will report a value of 3 when it hits \xEE
, which leads the potential for both an illegal read out of the end of str
and also an illegal write to buff
.
A potential patch is:
Index: src/main/character.c
===================================================================
--- src/main/character.c (revision 74482)
+++ src/main/character.c (working copy)
@@ -283,7 +283,7 @@
for (i = 0; i < so && str < end; i++) {
int used = utf8clen(*str);
if (i < sa - 1) { str += used; continue; }
- for (j = 0; j < used; j++) *buf++ = *str++;
+ for (j = 0; j < used && str < end; j++) *buf++ = *str++;
}
} else if (ienc == CE_LATIN1 || ienc == CE_BYTES) {
for (str += (sa - 1), i = sa; i <= so; i++) *buf++ = *str++;
Obviously the input string here is not a reasonable input, but it seems like this is a simple fix that can avoid potential bad behavior with the bad input.