Description of possible R substr bug

In writing a string manipulation library I ran into what I think is a bug in how substr handles malformed UTF8 strings. I was testing corner cases, in particular the following case of a string that ends with the first byte of a UTF8 sequence:

string <- "abc\xEE"    # \xEE indicates the start of a 3 byte UTF-8 sequence
Encoding(string) <- "UTF-8"
substr(string, 1, 10)

When run under valgrind with level 2 instrumentation, we get:

> string <- "abc\xEE"    # \xEE indicates the start of a 3 byte UTF-8 sequence
> Encoding(string) <- "UTF-8"
> substr(string, 1, 10)
==613== Invalid read of size 1
==613==    at 0x4EE519C: substr (character.c:286)
==613==    by 0x4EE563D: do_substr (character.c:342)
==613==    by 0x4F83444: bcEval (eval.c:6771)
==613==    by 0x4F70257: Rf_eval (eval.c:624)
==613==    by 0x4F72A7C: R_execClosure (eval.c:1764)
==613==    by 0x4F72770: Rf_applyClosure (eval.c:1692)
==613==    by 0x4F70A7F: Rf_eval (eval.c:747)
==613==    by 0x4FB939A: Rf_ReplIteration (main.c:258)
==613==    by 0x4FB955C: R_ReplConsole (main.c:308)
==613==    by 0x4FBB00A: run_Rmainloop (main.c:1082)
==613==    by 0x4FBB020: Rf_mainloop (main.c:1089)
==613==    by 0x10896B: main (Rmain.c:29)
==613==  Address 0xf5d04fd is 4,989 bytes inside a block of size 7,960 alloc'd
==613==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==613==    by 0x4FBF2B9: GetNewPage (memory.c:888)
==613==    by 0x4FCD13C: Rf_allocVector3 (memory.c:2691)
==613==    by 0x4FB2D9C: Rf_allocVector (Rinlinedfuns.h:577)
==613==    by 0x4FA4CE6: do_strsplit (grep.c:462)
==613==    by 0x4F83444: bcEval (eval.c:6771)
==613==    by 0x4F70257: Rf_eval (eval.c:624)
==613==    by 0x4F72A7C: R_execClosure (eval.c:1764)
==613==    by 0x4F72770: Rf_applyClosure (eval.c:1692)
==613==    by 0x4F8309A: bcEval (eval.c:6739)
==613==    by 0x4F70257: Rf_eval (eval.c:624)
==613==    by 0x4F72A7C: R_execClosure (eval.c:1764)
==613== 
[1] "abc<ee>"

I am far from an expert in these things, but I think the read error is in the following block:

// main/character.c@276

static void substr(char *buf, const char *str, int ienc, int sa, int so)
{
/* Store the substring	str [sa:so]  into buf[] */
    int i, j, used;

    if (ienc == CE_UTF8) {
	const char *end = str + strlen(str);
	for (i = 0; i < so && str < end; i++) {
	    int used = utf8clen(*str);
	    if (i < sa - 1) { str += used; continue; }      
	    for (j = 0; j < used; j++) *buf++ = *str++;    /* PROBLEM HERE */
	}

Because utf8clen [VERIFY THIS] cares only about the leading byte, it happily will report a value of 3 when it hits \xEE, which leads the potential for both an illegal read out of the end of str and also an illegal write to buff.

A potential patch is:

Index: src/main/character.c
===================================================================
--- src/main/character.c	(revision 74482)
+++ src/main/character.c	(working copy)
@@ -283,7 +283,7 @@
 	for (i = 0; i < so && str < end; i++) {
 	    int used = utf8clen(*str);
 	    if (i < sa - 1) { str += used; continue; }
-	    for (j = 0; j < used; j++) *buf++ = *str++;
+	    for (j = 0; j < used && str < end; j++) *buf++ = *str++;
 	}
     } else if (ienc == CE_LATIN1 || ienc == CE_BYTES) {
 	for (str += (sa - 1), i = sa; i <= so; i++) *buf++ = *str++;

Obviously the input string here is not a reasonable input, but it seems like this is a simple fix that can avoid potential bad behavior with the bad input.

brodieG/substr-bug.md