Note: The code links are to CPython 3.8.5, the most recent release when this was written.
I was recently asked about a performance optimization in CPython around using += and + for string objects.
As some people may already know, concatenating strings with += or + can sometimes be just as fast as ''.join.
The question was to explain when that optimization couldn't be performed.
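For reference, the traditional advice is to collect the pieces in a list and join them once at the end. A minimal sketch of that ''.join idiom (not code from the original question) looks like:
parts = []
for _ in range(500_000):
    parts.append("x")
s = "".join(parts)  # a single O(n) pass builds the final string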
We will be going through the following example scenarios:
def fast_1():
"""String append is fast (O(1)) if there is only one reference to the
string:
"""
s = ""
for _ in range(500_000):
s += "x"
def slow_1():
"""A second reference makes append O(N):
"""
s = ""
for _ in range(500_000):
s += "x"
r = s # store a second reference to s
def fast_2():
"""Somehow "s = s + c" is still fast?
"""
s = ""
for _ in range(500_000):
s = s + "x"
def slow_2():
"""But this is also slow: why? Where is the second reference?
"""
s = [""]
for _ in range(500_000):
s[0] += "x"
As their names imply, two of these are fast (O(1) append) and two are slow (O(n) append).
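If you want to see the difference yourself, a small timing harness like the following works (a hypothetical sketch: it assumes the four functions above are defined, and the exact numbers vary by machine, but the slow variants should take dramatically longer):
import timeit

for fn in (fast_1, slow_1, fast_2, slow_2):
    # number=1 is enough: each call already performs 500,000 appends
    print(fn.__name__, timeit.timeit(fn, number=1))
With that established, let's look at the bytecode, starting with fast_1: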
>>> dis.dis(fast_1)
5 0 LOAD_CONST 1 ('')
2 STORE_FAST 0 (s)
6 4 LOAD_GLOBAL 0 (range)
6 LOAD_CONST 2 (500000)
8 CALL_FUNCTION 1
10 GET_ITER
>> 12 FOR_ITER 12 (to 26)
14 STORE_FAST 1 (_)
7 16 LOAD_FAST 0 (s)
18 LOAD_CONST 3 ('x')
20 INPLACE_ADD
22 STORE_FAST 0 (s)
24 JUMP_ABSOLUTE 12
>> 26 LOAD_CONST 4 (None)
28 RETURN_VALUE
The body of the loop is:
7 16 LOAD_FAST 0 (s)
18 LOAD_CONST 3 ('x')
20 INPLACE_ADD
22 STORE_FAST 0 (s)
To understand the optimization, we need to look into how INPLACE_ADD
instructions are handled in the interpreter loop.
case TARGET(INPLACE_ADD): {
PyObject *right = POP();
PyObject *left = TOP();
PyObject *sum;
if (PyUnicode_CheckExact(left) && PyUnicode_CheckExact(right)) {
sum = unicode_concatenate(tstate, left, right, f, next_instr);
/* unicode_concatenate consumed the ref to left */
}
else {
sum = PyNumber_InPlaceAdd(left, right);
Py_DECREF(left);
}
Py_DECREF(right);
SET_TOP(sum);
if (sum == NULL)
goto error;
DISPATCH();
}
NOTE: This code uses a lot of macros.
f
refers to the current stack frame.
This checks the top two values on the stack; if they are both exactly PyUnicodeObject (a Python str object, not including subclasses), it delegates to unicode_concatenate.
If the objects are not both exactly PyUnicodeObjects, then we use the generic PyNumber_InPlaceAdd function.
An important note is that INPLACE_ADD is not responsible for reassigning the variable.
Instead, that is handled by one of the many STORE_* instructions.
In fast_1 this appears as a STORE_FAST, which is used for local variables that are not closed over or captured from another stack frame.
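For comparison, if the variable were closed over instead of being a plain local, the same += would be followed by a STORE_DEREF. A small hypothetical example (not one of our four scenarios) shows the difference:
import dis

def outer():
    s = ""
    def inner():
        nonlocal s
        s += "x"  # compiles to LOAD_DEREF / INPLACE_ADD / STORE_DEREF
    inner()
    return s

dis.dis(outer)  # dis recursively disassembles the nested code object for inner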
Given that s
and "x"
are both unicode objects, we should fall into the unicode_concatenate
case:
static PyObject *
unicode_concatenate(PyThreadState *tstate, PyObject *v, PyObject *w,
PyFrameObject *f, const _Py_CODEUNIT *next_instr)
{
PyObject *res;
if (Py_REFCNT(v) == 2) {
/* In the common case, there are 2 references to the value
* stored in 'variable' when the += is performed: one on the
* value stack (in 'v') and one still stored in the
* 'variable'. We try to delete the variable now to reduce
* the refcnt to 1.
*/
int opcode, oparg;
NEXTOPARG();
switch (opcode) {
case STORE_FAST:
{
PyObject **fastlocals = f->f_localsplus;
if (GETLOCAL(oparg) == v)
SETLOCAL(oparg, NULL);
break;
}
case STORE_DEREF:
{
PyObject **freevars = (f->f_localsplus +
f->f_code->co_nlocals);
PyObject *c = freevars[oparg];
if (PyCell_GET(c) == v) {
PyCell_SET(c, NULL);
Py_DECREF(v);
}
break;
}
case STORE_NAME:
{
PyObject *names = f->f_code->co_names;
PyObject *name = GETITEM(names, oparg);
PyObject *locals = f->f_locals;
if (locals && PyDict_CheckExact(locals)) {
PyObject *w = PyDict_GetItemWithError(locals, name);
if ((w == v && PyDict_DelItem(locals, name) != 0) ||
(w == NULL && _PyErr_Occurred(tstate)))
{
Py_DECREF(v);
return NULL;
}
}
break;
}
}
}
res = v;
PyUnicode_Append(&res, w);
return res;
}
The first check is to see if Py_REFCNT(v) == 2
.
This means that there are exactly 2 references to the left hand side argument.
As the comment below the check explains, we are looking for cases where there is 1 reference owned by a named variable and a second on the stack.
In the first iteration of the loop, v = s = ""
.
Because strings are immutable, CPython caches the empty string as a single shared object, so "" is "" evaluates to True.
Given that many other references to "" exist, Py_REFCNT(v) will be much greater than 2, and we go directly to the PyUnicode_Append case.
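Both facts are easy to check from the REPL (the exact reference count varies with interpreter state, so we only check that it exceeds 2):
>>> "" is ""
True
>>> import sys
>>> sys.getrefcount("") > 2
True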
When res
is ""
, this just sets res = w
: https://github.com/python/cpython/blob/580fbb018fd0844806119614d752b41fc69660f9/Objects/unicodeobject.c#L11351-L11356, which is very fast.
In the second iteration of the loop, s = "x"
.
Importantly, because of the optimization in PyUnicode_Append
, s
points to the same object as the string literal, which is also owned by fast_1
's code object in the co_consts
field.
This means that in the second iteration of the loop, when we check if Py_REFCNT(v) == 2, v will have at least one extra reference from co_consts, which means we cannot enter the fast path.
Because of this extra reference, we will also not be able to optimize much inside PyUnicode_Append.
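We can confirm that the 'x' literal is owned by the code object; its index in co_consts matches the LOAD_CONST 3 ('x') in the disassembly above:
>>> 'x' in fast_1.__code__.co_consts
True
>>> fast_1.__code__.co_consts.index('x')
3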
The second iteration of the loop is the only one that performs a naive (copying) string concatenation.
This naive concatenation will create a new unicode object which has a reference count of 1.
On the third iteration and all later iterations, v
will hold the result from the second iteration's naive concatenation.
The naive concatenation result will have two references at this point: the named variable s
and the reference on the stack.
Therefore, we will enter the if (Py_REFCNT(v) == 2)
branch.
Inside this branch, the next instruction is checked.
There is special handling for a subset of the store instructions here:
switch (opcode) {
case STORE_FAST:
{
/* ... */
}
case STORE_DEREF:
{
/* ... */
}
case STORE_NAME:
{
/* ... */
}
}
In fast_1 the INPLACE_ADD is followed by a STORE_FAST, so we jump to that case.
Here we check to see if the assignment target (left hand side) is the same object as the left hand side of the concatenation:
PyObject **fastlocals = f->f_localsplus;
if (GETLOCAL(oparg) == v)
SETLOCAL(oparg, NULL);
break;
Here f
is the stack frame, and GETLOCAL(oparg)
returns the local variable referenced by the STORE_FAST
instruction.
If the assignment target refers to the same object as the left hand side of the concatenation, which I believe must always be the case with bytecode that comes from the CPython compiler, then we temporarily delete the local variable by setting the local to NULL using SETLOCAL(oparg, NULL).
This SETLOCAL macro does proper reference count management.
This is the equivalent of del s in Python code.
By deleting the local variable version of v
, Py_REFCNT(v) == 1
, which will enable the inplace concatenation optimization in PyUnicode_Append
.
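In rough Python terms, the effect is something like the sketch below (a loose analogy for illustration only; the real work happens in C inside the eval loop, and the comments map each line to the corresponding C operation):
def inplace_add_analogy(s):
    v = s        # the reference that is already sitting on the value stack
    del s        # SETLOCAL(oparg, NULL): the name no longer owns a reference,
                 # leaving v as the only remaining reference (refcount == 1)
    v = v + "x"  # with only one reference, PyUnicode_Append may grow the buffer in place
    s = v        # the following STORE_FAST re-binds the name
    return s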
Inside PyUnicode_Append
there is the following check:
if (unicode_modifiable(left)
&& PyUnicode_CheckExact(right)
&& PyUnicode_KIND(right) <= PyUnicode_KIND(left)
/* Don't resize for ascii += latin1. Convert ascii to latin1 requires
to change the structure size, but characters are stored just after
the structure, and so it requires to move all characters which is
not so different than duplicating the string. */
&& !(PyUnicode_IS_ASCII(left) && !PyUnicode_IS_ASCII(right)))
{
/* append inplace */
if (unicode_resize(p_left, new_len) != 0)
goto error;
/* copy 'right' into the newly allocated area of 'left' */
_PyUnicode_FastCopyCharacters(*p_left, left_len, right, 0, right_len);
}
This branch first checks to see if left
is "modifiable" and that right
is also a compatible exact unicode object.
Only exact unicode objects are checked because subclasses may overload __radd__
.
If this branch is true, the object is resized in place if possible, and then the new data is just copied into the newly allocated space.
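This is the same trick that makes appending to a bytearray cheap. As a rough analogy only (this is not the mechanism str uses, but it has the same amortized-growth behaviour):
buf = bytearray()
for _ in range(500_000):
    buf += b"x"          # bytearray is genuinely mutable: amortized O(1) append
s = buf.decode("ascii")  # one final copy produces the str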
unicode_modifiable
checks a few things, but the first thing it checks is the reference count:
static int
unicode_modifiable(PyObject *unicode)
{
assert(_PyUnicode_CHECK(unicode));
if (Py_REFCNT(unicode) != 1)
return 0;
if (_PyUnicode_HASH(unicode) != -1)
return 0;
if (PyUnicode_CHECK_INTERNED(unicode))
return 0;
if (!PyUnicode_CheckExact(unicode))
return 0;
#ifdef Py_DEBUG
/* singleton refcount is greater than 1 */
assert(!unicode_is_singleton(unicode));
#endif
return 1;
}
The other checks that it does are:
- Has the hash already been computed and cached?
- Is the object interned (eternal)?
- Is the object a subclass of str?
If any of these conditions are true, the object is not considered modifiable.
The hash check is important.
Because Python uses string-keyed dictionaries so often, str objects cache the result of __hash__.
This makes it much faster to look up strings in dictionaries at the expense of one extra integer per string.
If the hash has already been computed, mutating the contents in place would invalidate the cached hash value, so the append must not happen in place.
We will explore this concept a bit later.
slow_1 looks almost exactly the same as fast_1, with one extra "dead" assignment.
By capturing a second reference to s and saving it in r, we will always have an extra reference to s, so the Py_REFCNT(v) == 2 check in unicode_concatenate never passes.
Therefore, we will end up in this alternative branch inside of PyUnicode_Append
:
else {
maxchar = PyUnicode_MAX_CHAR_VALUE(left);
maxchar2 = PyUnicode_MAX_CHAR_VALUE(right);
maxchar = Py_MAX(maxchar, maxchar2);
/* Concat the two Unicode strings */
res = PyUnicode_New(new_len, maxchar);
if (res == NULL)
goto error;
_PyUnicode_FastCopyCharacters(res, 0, left, 0, left_len);
_PyUnicode_FastCopyCharacters(res, left_len, right, 0, right_len);
Py_DECREF(left);
*p_left = res;
}
Here an entire new unicode object is created and the characters from both the left and right are copied into it.
Because every iteration copies the whole string built so far, each append is O(n) and the loop as a whole is O(n²).
fast_2
is like fast_1
but instead of using s += "x"
, it is manually expanded into s = s + "x"
.
The body of the loop is:
24 16 LOAD_FAST 0 (s)
18 LOAD_CONST 3 ('x')
20 BINARY_ADD
22 STORE_FAST 0 (s)
To compare, the body of fast_1
was:
7 16 LOAD_FAST 0 (s)
18 LOAD_CONST 3 ('x')
20 INPLACE_ADD
22 STORE_FAST 0 (s)
The only difference is BINARY_ADD instead of INPLACE_ADD.
Like INPLACE_ADD, BINARY_ADD has the same unicode check which delegates to unicode_concatenate.
case TARGET(BINARY_ADD): {
PyObject *right = POP();
PyObject *left = TOP();
PyObject *sum;
/* NOTE(haypo): Please don't try to micro-optimize int+int on
CPython using bytecode, it is simply worthless.
See http://bugs.python.org/issue21955 and
http://bugs.python.org/issue10044 for the discussion. In short,
no patch shown any impact on a realistic benchmark, only a minor
speedup on microbenchmarks. */
if (PyUnicode_CheckExact(left) &&
PyUnicode_CheckExact(right)) {
sum = unicode_concatenate(tstate, left, right, f, next_instr);
/* unicode_concatenate consumed the ref to left */
}
else {
sum = PyNumber_Add(left, right);
Py_DECREF(left);
}
Py_DECREF(right);
SET_TOP(sum);
if (sum == NULL)
goto error;
DISPATCH();
}
The only difference between the handlers for BINARY_ADD and INPLACE_ADD is the non-unicode case: BINARY_ADD calls PyNumber_Add whereas INPLACE_ADD calls PyNumber_InPlaceAdd.
Remember, the store is a separate step even for INPLACE_ADD, so when a BINARY_ADD is followed directly by a STORE_FAST, we get the same optimization as +=.
slow_2 uses a string inside of a list instead of a local variable.
This shouldn't alter the reference count; it just moves the owner from the stack frame to the list.
The loop body of slow_2 looks like:
22 DUP_TOP_TWO
24 BINARY_SUBSCR
26 LOAD_CONST 4 ('x')
28 INPLACE_ADD
30 ROT_THREE
32 STORE_SUBSCR
34 JUMP_ABSOLUTE 14
This is much more complicated than fast_1
and fast_2
which didn't need to worry about duplicating values or rotating.
The first thing to notice is that INPLACE_ADD
is followed by a ROT_THREE
, not a STORE_*
instruction. This means that even if the reference counts are correct, we won't enter the fast path on the next instruction.
Remember, the fast path instructions are:
switch (opcode) {
case STORE_FAST:
{
/* ... */
}
case STORE_DEREF:
{
/* ... */
}
case STORE_NAME:
{
/* ... */
}
}
The other thing to note is that even if the instructions and stack were reordered to avoid the ROT_THREE, there is no fast path for STORE_SUBSCR.
This is because Python wants to inform the container of the new assignment.
If the string were concatenated in place, then reading the key back out of the container inside __setitem__ would already show the mutated value, which would be confusing.
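A small hypothetical container makes this observable: its __setitem__ reads the old value before storing the new one, and that old value must not already contain the appended text:
class LoggingList(list):
    # Hypothetical container used only to illustrate the point above.
    def __setitem__(self, key, value):
        print("old:", repr(self[key]), "new:", repr(value))
        super().__setitem__(key, value)

s = LoggingList([""])
s[0] += "x"  # prints: old: '' new: 'x'
s[0] += "x"  # prints: old: 'x' new: 'xx'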
Based on the various optimizations we have looked at, let's construct some new slow cases to test our understanding.
We know that one of the requirements for this inplace concatenation is a BINARY_ADD or INPLACE_ADD followed immediately by a STORE_* instruction.
What if we tried to get an extra instruction between them?
def slow_assignment_expr():
s = ""
for _ in range(500_000):
(s := s + "x")
If we look at the bytecode of the loop here, we see:
4 16 LOAD_FAST 0 (s)
18 LOAD_CONST 3 ('x')
20 BINARY_ADD
22 DUP_TOP
24 STORE_FAST 0 (s)
26 POP_TOP
By using an assignment expression, we have added an extra DUP_TOP
in between our BINARY_ADD
and the STORE_FAST
, so we no longer get the optimization and degrade to linear time appends.
We know that PyUnicode_Append
will only do the inplace append if the object is "modifiable".
For an object to be modifiable, its hash must not have been computed yet.
def slow_hash_computed():
s = ""
for _ in range(500_000):
hash(s)
s += "x"
By computing the hash up front, we lose the inplace concatenation even though the bytecode has the correct pattern and the reference counts are correct.
NOTE: I believe this could still be optimized in CPython by recomputing the hash after this occurs. I am not sure if that breaks other cases though.