For this benchmark, pypy --jit off is 3x slower than CPython 3.11:
from docutils.core import publish_doctree
from docutils import nodes
RST = """
Title One
=========

Paragraph with *emphasis* and **strong** text.
Here is ``inline code`` and more content.

- Item one with some text
- Item two with *emphasis*
- Item three

  - Nested item A
  - Nested item B

Title Two
---------

Another paragraph with multiple sentences and content.
""" * 20

class CountVisitor(nodes.NodeVisitor):
    def __init__(self, document):
        super().__init__(document)
        self.count = 0

    # Every node type except Text falls through to the unknown_* hooks.
    def unknown_visit(self, node): self.count += 1
    def unknown_departure(self, node): self.count += 1
    def visit_Text(self, node): self.count += 1
    def depart_Text(self, node): pass

doc = publish_doctree(RST)
for _ in range(300):
    v = CountVisitor(doc)
    doc.walkabout(v)
print(v.count)

The 3x gap decomposes as 2.24x instructions × 1.28x IPC, and there's no tractable path to significantly reduce either:
- Instructions: requires fewer C function calls — a major structural change to the 7-level dispatch chain
- IPC: driven by i-cache pressure (29.24% frontend stalls vs CPython's 14.44%) — not addressable by data layout tweaks
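
As a quick sanity check on that decomposition, here is a minimal sketch of the arithmetic. It assumes both interpreters run at the same clock frequency, and that "1.28x IPC" means CPython retires 1.28x more instructions per cycle than PyPy; the variable names are illustrative.

```python
# Wall-clock time ~ instructions / (IPC * clock frequency), so at equal
# clock speed the slowdown is the product of the two ratios quoted above.
instruction_ratio = 2.24  # PyPy instructions / CPython instructions
ipc_ratio = 1.28          # CPython IPC / PyPy IPC (assumed direction)

slowdown = instruction_ratio * ipc_ratio
print(f"predicted slowdown: {slowdown:.2f}x")  # prints "predicted slowdown: 2.87x"
```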
Digging into the instruction counts: the 2.44x instruction gap breaks down into a few big buckets, not one dominant one:
| Category | Excess instructions | Ratio |
|---|---|---|
| Type/attr lookup | ~1.26B | 4.8x |
| String/UTF-8 | ~0.77B | 2.3x |
| Call overhead | ~0.58B | 3.5x |
| Stack check+TLS | ~0.47B | ∞ |
| Frame alloc | ~0.35B | 3.6x |
There's no single silver bullet. But there are three directions that could give 5-10%+ each and compound:
- `Arguments__match_signature` fast path: it's 1.94% of ALL instructions (second biggest single hotspot) and fires on every Python call, including trivial `def visit_X(self, node)` methods. An "all positional, no defaults, exact argcount" flag computed at function definition time would let the common case skip all the matching logic. Contained change in argument.py.
- Method cache investigation: `W_TypeObject_lookup_where_with_method_cache` is 4.16% of all instructions, 4.8x CPython. That's the single biggest function. The docutils benchmark uses 20+ node types with `getattr(visitor, 'visit_' + classname)`. The question is whether the cache is missing (size/hash issue) or whether the cache hit path itself is slow; the answer determines which fix applies.
- UTF-8 `codepoints_in_utf8` at 4.75%: `'visit_' + classname` creates a new string object each call. If PyPy's string concatenation doesn't propagate the `_length` field, every use as a dict key (for `__dict__` lookup) recomputes the codepoint count via an O(n) scan. There may be a fast path for ASCII-only strings.
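
To make the first direction concrete, here is a rough sketch of the "simple signature" flag, written against CPython's public function and code-object attributes rather than PyPy's internals. `has_simple_signature` and `call_positional` are hypothetical names, and the "fast"/"slow" return values only mark which path an interpreter would take.

```python
import inspect

def has_simple_signature(func):
    """True if func takes only plain positional parameters with no defaults."""
    code = func.__code__
    return (
        not (code.co_flags & (inspect.CO_VARARGS | inspect.CO_VARKEYWORDS))
        and code.co_kwonlyargcount == 0
        and not func.__defaults__
        and not func.__kwdefaults__
    )

def call_positional(func, simple, args):
    """Hypothetical dispatcher: report which path a positional call would take."""
    if simple and len(args) == func.__code__.co_argcount:
        return "fast"  # would copy args straight into the frame's locals
    return "slow"      # would fall back to full signature matching

def visit_paragraph(self, node): ...
def with_default(self, node, depth=0): ...
def with_varargs(self, *nodes): ...

# The flag would be computed once, at definition time, not per call.
flags = {f.__name__: has_simple_signature(f)
         for f in (visit_paragraph, with_default, with_varargs)}
print(flags)
print(call_positional(visit_paragraph, flags["visit_paragraph"], ("self", "node")))
```

The per-call cost on the common path is then one flag test and one integer comparison.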
The _match_signature hotspot (1.94%) is partially a red herring — the FLATPYCALL path exists and DOES skip it for direct method calls. The 1.94% is from calls that arrive via a bound Method object (method = getattr(visitor, name); method(node)). Method.__call__ goes call_args → _match_signature and never hits the FLATPYCALL fast path. Fix: add a fast path in Method.__call__ for all-positional args — delegate to w_function.funccall(self.w_instance, *args_w).
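
A toy model of that fix, in plain Python, may help. `ToyFunction` and `ToyMethod` are illustrative stand-ins and not PyPy's classes; only the names `call_args`, `funccall`, `w_function` and `w_instance` are borrowed from the description above, and the `slow_calls` counter exists purely to show which path ran.

```python
class ToyFunction:
    """Stand-in for an interpreter-level function object (not PyPy's class)."""

    def __init__(self, pyfunc):
        self.pyfunc = pyfunc
        self.slow_calls = 0

    def call_args(self, args, kwargs):
        # General path: stands in for the signature-matching work.
        self.slow_calls += 1
        return self.pyfunc(*args, **kwargs)

    def funccall(self, *args_w):
        # Positional-only path: no signature matching needed.
        return self.pyfunc(*args_w)


class ToyMethod:
    """Stand-in for a bound method wrapping a function and an instance."""

    def __init__(self, w_function, w_instance):
        self.w_function = w_function
        self.w_instance = w_instance

    def __call__(self, *args_w, **kwargs):
        if not kwargs:
            # Fast path: all-positional call, prepend the instance and go direct.
            return self.w_function.funccall(self.w_instance, *args_w)
        return self.w_function.call_args((self.w_instance,) + args_w, kwargs)


def visit_paragraph(self, node):
    return (self, node)

w_func = ToyFunction(visit_paragraph)
method = ToyMethod(w_func, "visitor")

positional_result = method("node")
slow_after_positional = w_func.slow_calls
keyword_result = method(node="node")
slow_after_keyword = w_func.slow_calls
```

Only the keyword call reaches `call_args`; the positional call, which is the shape `method(node)` takes in the visitor loop, bypasses it.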
The method cache hotspot (4.16%) is mostly a string equality cost problem. The cache is 2048 entries and is being hit (no class mutations during traversal), but the hit path does a full == comparison on the name because 'visit_' + classname creates a new (non-interned) string object every call. The hash is already computed to find the cache slot — we could store it in the cache and compare it before doing the full string equality.
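
The hash-before-equality idea can be sketched with a toy direct-mapped cache. The entry layout, the class name `ToyMethodCache` and the `full_compares` counter are assumptions for illustration and do not mirror typeobject.py. The demo uses a one-slot cache so that every name collides.

```python
class ToyMethodCache:
    """Toy direct-mapped cache; layout and names are illustrative, not PyPy's."""

    def __init__(self, size=2048):
        self.mask = size - 1  # size must be a power of two
        self.slots = [None] * size
        self.full_compares = 0

    def store(self, version_tag, name, value):
        h = hash(name)
        self.slots[h & self.mask] = (version_tag, h, name, value)

    def lookup(self, version_tag, name):
        h = hash(name)
        entry = self.slots[h & self.mask]
        if entry is None or entry[0] is not version_tag:
            return None
        _, cached_hash, cached_name, value = entry
        if cached_name is name:  # identical object: cheapest possible hit
            return value
        if cached_hash != h:     # integer compare rejects mismatches early
            return None
        self.full_compares += 1  # only now pay for full string equality
        return value if cached_name == name else None


tag = object()
cache = ToyMethodCache(size=1)  # one slot, so every name maps to it
cache.store(tag, "visit_paragraph", "handler")

classname, other = "paragraph", "section"
hit = cache.lookup(tag, "visit_" + classname)  # fresh string object, same hash
miss = cache.lookup(tag, "visit_" + other)     # different hash, no string compare
```

In this sketch a hit on a freshly concatenated name still costs one full comparison; the stored hash only removes the comparison for entries that cannot match.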
These two together are tractable, localized changes. The bound-method fast call is maybe 30 lines in function.py. The hash-in-cache change is maybe 20 lines in typeobject.py. Together they'd probably save 3-5% of total instructions.