Note: The code links are to CPython 3.8.5, the most recent release when this was written.
I was recently asked about a performance optimization in CPython around using += and + for string objects.
As some people may already know, concatenating strings with += or + can sometimes be just as fast as ''.join.
The question was to explain when that optimization couldn't be performed.
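For reference, the traditional advice is to collect the pieces in a list and join them once at the end. A minimal sketch of that ''.join idiom (not code from the original question) looks like:
parts = []
for _ in range(500_000):
    parts.append("x")
s = "".join(parts)  # a single O(n) pass builds the final string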
We will be going through the following example scenarios:
def fast_1():
"""String append is fast (O(1)) if there is only one reference to the
string:
"""
s = ""
for _ in range(500_000):
s += "x"
def slow_1():
"""A second reference makes append O(N):
"""
s = ""
for _ in range(500_000):
s += "x"
r = s # store a second reference to s
def fast_2():
"""Somehow "s = s + c" is still fast?
"""
s = ""
for _ in range(500_000):
s = s + "x"
def slow_2():
"""But this is also slow: why? Where is the second reference?
"""
s = [""]
for _ in range(500_000):
s[0] += "x"
As their names imply, two of these are fast (O(1) append) and two are slow (O(n) append).
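If you want to see the difference yourself, a small timing harness like the following works (a hypothetical sketch: it assumes the four functions above are defined, and the exact numbers vary by machine, but the slow variants should take dramatically longer):
import timeit

for fn in (fast_1, slow_1, fast_2, slow_2):
    # number=1 is enough: each call already performs 500,000 appends
    print(fn.__name__, timeit.timeit(fn, number=1))
With that established, let's look at the bytecode, starting with fast_1: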
>>> dis.dis(fast_1)
5 0 LOAD_CONST 1 ('')
2 STORE_FAST 0 (s)
6 4 LOAD_GLOBAL 0 (range)
6 LOAD_CONST 2 (500000)
8 CALL_FUNCTION 1
10 GET_ITER
>> 12 FOR_ITER 12 (to 26)
14 STORE_FAST 1 (_)
7 16 LOAD_FAST 0 (s)
18 LOAD_CONST 3 ('x')
20 INPLACE_ADD
22 STORE_FAST 0 (s)
24 JUMP_ABSOLUTE 12
>> 26 LOAD_CONST 4 (None)
28 RETURN_VALUE
The body of the loop is:
7 16 LOAD_FAST 0 (s)
18 LOAD_CONST 3 ('x')
20 INPLACE_ADD
22 STORE_FAST 0 (s)
To understand the optimization, we need to look into how INPLACE_ADD
instructions are handled in the interpreter loop.
case TARGET(INPLACE_ADD): {
PyObject *right = POP();
PyObject *left = TOP();
PyObject *sum;
if (PyUnicode_CheckExact(left) && PyUnicode_CheckExact(right)) {
sum = unicode_concatenate(tstate, left, right, f, next_instr);
/* unicode_concatenate consumed the ref to left */
}
else {
sum = PyNumber_InPlaceAdd(left, right);
Py_DECREF(left);
}
Py_DECREF(right);
SET_TOP(sum);
if (sum == NULL)
goto error;
DISPATCH();
}
NOTE: This code uses a lot of macros.
f
refers to the current stack frame.
This checks the top two values on the stack; if they are both exactly PyUnicodeObject (a Python str object, not including subclasses), it delegates to unicode_concatenate.
If the objects are not both exactly PyUnicodeObjects, then we use the generic PyNumber_InPlaceAdd function.
An important note is that INPLACE_ADD is not responsible for reassigning the variable.
Instead, that is handled by one of the many STORE_* instructions.
In fast_1 this appears as a STORE_FAST, which is used for local variables that are not closed over or captured from another stack frame.
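For comparison, if the variable were closed over instead of being a plain local, the same += would be followed by a STORE_DEREF. A small hypothetical example (not one of our four scenarios) shows the difference:
import dis

def outer():
    s = ""
    def inner():
        nonlocal s
        s += "x"  # compiles to LOAD_DEREF / INPLACE_ADD / STORE_DEREF
    inner()
    return s

dis.dis(outer)  # dis recursively disassembles the nested code object for inner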
Given that s
and "x"
are both unicode objects, we should fall into the unicode_concatenate
case:
static PyObject *
unicode_concatenate(PyThreadState *tstate, PyObject *v, PyObject *w,
PyFrameObject *f, const _Py_CODEUNIT *next_instr)
{
PyObject *res;
if (Py_REFCNT(v) == 2) {
/* In the common case, there are 2 references to the value
* stored in 'variable' when the += is performed: one on the
* value stack (in 'v') and one still stored in the
* 'variable'. We try to delete the variable now to reduce
* the refcnt to 1.
*/
int opcode, oparg;
NEXTOPARG();
switch (opcode) {
case STORE_FAST:
{
PyObject **fastlocals = f->f_localsplus;
if (GETLOCAL(oparg) == v)
SETLOCAL(oparg, NULL);
break;
}
case STORE_DEREF:
{
PyObject **freevars = (f->f_localsplus +
f->f_code->co_nlocals);
PyObject *c = freevars[oparg];
if (PyCell_GET(c) == v) {
PyCell_SET(c, NULL);
Py_DECREF(v);
}
break;
}
case STORE_NAME:
{
PyObject *names = f->f_code->co_names;
PyObject *name = GETITEM(names, oparg);
PyObject *locals = f->f_locals;
if (locals && PyDict_CheckExact(locals)) {
PyObject *w = PyDict_GetItemWithError(locals, name);
if ((w == v && PyDict_DelItem(locals, name) != 0) ||
(w == NULL && _PyErr_Occurred(tstate)))
{
Py_DECREF(v);
return NULL;
}
}
break;
}
}
}
res = v;
PyUnicode_Append(&res, w);
return res;
}
The first check is to see if Py_REFCNT(v) == 2
.
This means that there are exactly 2 references to the left hand side argument.
As the comment below the check explains, we are looking for cases where there is 1 reference owned by a named variable and a second on the stack.
In the first iteration of the loop, v = s = ""
.
Because strings are immutable, CPython caches the empty string as a single shared object, so "" is "" evaluates to True.
Given that many other references to "" exist, Py_REFCNT(v) will be much greater than 2, and we go directly to the PyUnicode_Append case.
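Both facts are easy to check from the REPL (the exact reference count varies with interpreter state, so we only check that it exceeds 2):
>>> "" is ""
True
>>> import sys
>>> sys.getrefcount("") > 2
True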
When res
is ""
, this just sets res = w
: https://github.com/python/cpython/blob/580fbb018fd0844806119614d752b41fc69660f9/Objects/unicodeobject.c#L11351-L11356, which is very fast.
In the second iteration of the loop, s = "x"
.
Importantly, because of the optimization in PyUnicode_Append
, s
points to the same object as the string literal, which is also owned by fast_1
's code object in the co_consts
field.
This means that in the second iteration of the loop, when we check if Py_REFCNT(v) == 2, v will have at least one extra reference from co_consts, which means we cannot enter the fast path.
Because of this extra reference, we will also not be able to optimize much inside PyUnicode_Append.
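We can confirm that the 'x' literal is owned by the code object; its index in co_consts matches the LOAD_CONST 3 ('x') in the disassembly above:
>>> 'x' in fast_1.__code__.co_consts
True
>>> fast_1.__code__.co_consts.index('x')
3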
The second iteration of the loop is the only one that performs a naive (copying) string concatenation.
This naive concatenation will create a new unicode object which has a reference count of 1.
On the third iteration and all later iterations, v
will hold the result from the second iteration's naive concatenation.
The naive concatenation result will have two references at this point: the named variable s
and the reference on the stack.
Therefore, we will enter the if (Py_REFCNT(v) == 2)
branch.
Inside this branch, the next instruction is checked.
There is special handling for a subset of the store instructions here:
switch (opcode) {
case STORE_FAST:
{
/* ... */
}
case STORE_DEREF:
{
/* ... */
}
case STORE_NAME:
{
/* ... */
}
}
In fast_1 the INPLACE_ADD is followed by a STORE_FAST, so we jump to that case.
Here we check to see if the assignment target (left hand side) is the same object as the left hand side of the concatenation:
PyObject **fastlocals = f->f_localsplus;
if (GETLOCAL(oparg) == v)
SETLOCAL(oparg, NULL);
break;
Here f
is the stack frame, and GETLOCAL(oparg)
returns the local variable referenced by the STORE_FAST
instruction.
If the assignment target refers to the same object as the left hand side of the concatenation, which I believe must always be the case with bytecode that comes from the CPython compiler, then we temporarily delete the local variable by setting the local to NULL using SETLOCAL(oparg, NULL).
This SETLOCAL macro does proper reference count management.
This is the equivalent of del s in Python code.
By deleting the local variable version of v
, Py_REFCNT(v) == 1
, which will enable the inplace concatenation optimization in PyUnicode_Append
.
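In rough Python terms, the effect is something like the sketch below (a loose analogy for illustration only; the real work happens in C inside the eval loop, and the comments map each line to the corresponding C operation):
def inplace_add_analogy(s):
    v = s        # the reference that is already sitting on the value stack
    del s        # SETLOCAL(oparg, NULL): the name no longer owns a reference,
                 # leaving v as the only remaining reference (refcount == 1)
    v = v + "x"  # with only one reference, PyUnicode_Append may grow the buffer in place
    s = v        # the following STORE_FAST re-binds the name
    return s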
Inside PyUnicode_Append
there is the following check:
if (unicode_modifiable(left)
&& PyUnicode_CheckExact(right)
&& PyUnicode_KIND(right) <= PyUnicode_KIND(left)
/* Don't resize for ascii += latin1. Convert ascii to latin1 requires
to change the structure size, but characters are stored just after
the structure, and so it requires to move all characters which is
not so different than duplicating the string. */
&& !(PyUnicode_IS_ASCII(left) && !PyUnicode_IS_ASCII(right)))
{
/* append inplace */
if (unicode_resize(p_left, new_len) != 0)
goto error;
/* copy 'right' into the newly allocated area of 'left' */
_PyUnicode_FastCopyCharacters(*p_left, left_len, right, 0, right_len);
}
This branch first checks to see if left
is "modifiable" and that right
is also a compatible exact unicode object.
Only exact unicode objects are checked because subclasses may overload __radd__
.
If this branch is true, the object is resized in place if possible, and then the new data is just copied into the newly allocated space.
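This is the same trick that makes appending to a bytearray cheap. As a rough analogy only (this is not the mechanism str uses, but it has the same amortized-growth behaviour):
buf = bytearray()
for _ in range(500_000):
    buf += b"x"          # bytearray is genuinely mutable: amortized O(1) append
s = buf.decode("ascii")  # one final copy produces the str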
unicode_modifiable
checks a few things, but the first thing it checks is the reference count:
static int
unicode_modifiable(PyObject *unicode)
{
assert(_PyUnicode_CHECK(unicode));
if (Py_REFCNT(unicode) != 1)
return 0;
if (_PyUnicode_HASH(unicode) != -1)
return 0;
if (PyUnicode_CHECK_INTERNED(unicode))
return 0;
if (!PyUnicode_CheckExact(unicode))
return 0;
#ifdef Py_DEBUG
/* singleton refcount is greater than 1 */
assert(!unicode_is_singleton(unicode));
#endif
return 1;
}
The other checks that it does are:
- Has the hash already been computed and cached?
- Is the object interned (eternal)?
- Is the object a subclass of str?
If any of these conditions are true, the object is not considered modifiable.
The hash check is important.
Because Python uses string-keyed dictionaries so often, str objects cache the result of __hash__.
This makes it much faster to look up strings in dictionaries at the expense of one extra integer per string.
If the hash has already been computed, mutating the contents in place would invalidate the cached hash value, so the append must not happen in place.
We will explore this concept a bit later.
slow_1 looks almost exactly the same as fast_1, with one extra "dead" assignment.
By capturing a second reference to s and saving it in r, we will always have an extra reference to s, so the Py_REFCNT(v) == 2 check in unicode_concatenate never passes.
Therefore, we will end up in this alternative branch inside of PyUnicode_Append
:
else {
maxchar = PyUnicode_MAX_CHAR_VALUE(left);
maxchar2 = PyUnicode_MAX_CHAR_VALUE(right);
maxchar = Py_MAX(maxchar, maxchar2);
/* Concat the two Unicode strings */
res = PyUnicode_New(new_len, maxchar);
if (res == NULL)
goto error;
_PyUnicode_FastCopyCharacters(res, 0, left, 0, left_len);
_PyUnicode_FastCopyCharacters(res, left_len, right, 0, right_len);
Py_DECREF(left);
*p_left = res;
}
Here an entire new unicode object is created and the characters from both the left and right are copied into it.
Because every iteration copies the whole string built so far, each append is O(n) and the loop as a whole is O(n²).
fast_2
is like fast_1
but instead of using s += "x"
, it is manually expanded into s = s + "x"
.
The body of the loop is:
24 16 LOAD_FAST 0 (s)
18 LOAD_CONST 3 ('x')
20 BINARY_ADD
22 STORE_FAST 0 (s)
To compare, the body of fast_1
was:
7 16 LOAD_FAST 0 (s)
18 LOAD_CONST 3 ('x')
20 INPLACE_ADD
22 STORE_FAST 0 (s)
The only difference is BINARY_ADD instead of INPLACE_ADD.
Like INPLACE_ADD, BINARY_ADD has the same unicode check which delegates to unicode_concatenate.
case TARGET(BINARY_ADD): {
PyObject *right = POP();
PyObject *left = TOP();
PyObject *sum;
/* NOTE(haypo): Please don't try to micro-optimize int+int on
CPython using bytecode, it is simply worthless.
See http://bugs.python.org/issue21955 and
http://bugs.python.org/issue10044 for the discussion. In short,
no patch shown any impact on a realistic benchmark, only a minor
speedup on microbenchmarks. */
if (PyUnicode_CheckExact(left) &&
PyUnicode_CheckExact(right)) {
sum = unicode_concatenate(tstate, left, right, f, next_instr);
/* unicode_concatenate consumed the ref to left */
}
else {
sum = PyNumber_Add(left, right);
Py_DECREF(left);
}
Py_DECREF(right);
SET_TOP(sum);
if (sum == NULL)
goto error;
DISPATCH();
}
The only difference between the handlers for BINARY_ADD and INPLACE_ADD is the non-unicode case: BINARY_ADD calls PyNumber_Add whereas INPLACE_ADD calls PyNumber_InPlaceAdd.
Remember, the store is a separate step even for INPLACE_ADD, so when a BINARY_ADD is followed directly by a STORE_FAST, we get the same optimization as +=.
slow_2 uses a string inside of a list instead of a local variable.
This shouldn't alter the reference count; it just moves the owner from the stack frame to the list.
The loop body of slow_2 looks like:
22 DUP_TOP_TWO
24 BINARY_SUBSCR
26 LOAD_CONST 4 ('x')
28 INPLACE_ADD
30 ROT_THREE
32 STORE_SUBSCR
34 JUMP_ABSOLUTE 14
This is much more complicated than fast_1
and fast_2
which didn't need to worry about duplicating values or rotating.
The first thing to notice is that INPLACE_ADD
is followed by a ROT_THREE
, not a STORE_*
instruction. This means that even if the reference counts are correct, we won't enter the fast path on the next instruction.
Remember, the fast path instructions are:
switch (opcode) {
case STORE_FAST:
{
/* ... */
}
case STORE_DEREF:
{
/* ... */
}
case STORE_NAME:
{
/* ... */
}
}
The other thing to note is that even if the instructions and stack were reordered to avoid the ROT_THREE, there is no fast path for STORE_SUBSCR.
This is because Python wants to inform the container of the new assignment.
If the string were concatenated in place, then reading the key back out of the container inside __setitem__ would already show the mutated value, which would be confusing.
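A small hypothetical container makes this observable: its __setitem__ reads the old value before storing the new one, and that old value must not already contain the appended text:
class LoggingList(list):
    # Hypothetical container used only to illustrate the point above.
    def __setitem__(self, key, value):
        print("old:", repr(self[key]), "new:", repr(value))
        super().__setitem__(key, value)

s = LoggingList([""])
s[0] += "x"  # prints: old: '' new: 'x'
s[0] += "x"  # prints: old: 'x' new: 'xx'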
Based on the various optimizations we have looked at, let's construct some new slow cases to test our understanding.
We know that one of the requirements for this inplace concatenation is a BINARY_ADD or INPLACE_ADD followed immediately by a STORE_* instruction.
What if we tried to get an extra instruction between them?
def slow_assignment_expr():
s = ""
for _ in range(500_000):
(s := s + "x")
If we look at the bytecode of the loop here, we see:
4 16 LOAD_FAST 0 (s)
18 LOAD_CONST 3 ('x')
20 BINARY_ADD
22 DUP_TOP
24 STORE_FAST 0 (s)
26 POP_TOP
By using an assignment expression, we have added an extra DUP_TOP
in between our BINARY_ADD
and the STORE_FAST
, so we no longer get the optimization and degrade to linear time appends.
We know that PyUnicode_Append
will only do the inplace append if the object is "modifiable".
For an object to be modifiable, its hash must not have been computed yet.
def slow_hash_computed():
s = ""
for _ in range(500_000):
hash(s)
s += "x"
By computing the hash up front, we lose the inplace concatenation even though the bytecode has the correct pattern and the reference counts are correct.
NOTE: I believe this could still be optimized in CPython by recomputing the hash after this occurs. I am not sure if that breaks other cases though.