Possible small optimization in PyPy's interpreter

For this benchmark, pypy --jit off is about 3x slower than CPython 3.11:

from docutils.core import publish_doctree
from docutils import nodes

# Small reStructuredText document, repeated to get a reasonably large tree.
RST = """
Title One
=========

Paragraph with *emphasis* and **strong** text.
Here is ``inline code`` and more content.

- Item one with some text
- Item two with *emphasis*
- Item three

  - Nested item A
  - Nested item B

Title Two
---------

Another paragraph with multiple sentences and content.
""" * 20

class CountVisitor(nodes.NodeVisitor):
    """Counts visit/depart events; most node types hit the unknown_* fallbacks."""
    def __init__(self, document):
        super().__init__(document)
        self.count = 0
    def unknown_visit(self, node): self.count += 1
    def unknown_departure(self, node): self.count += 1
    def visit_Text(self, node): self.count += 1
    def depart_Text(self, node): pass

# Parse once; only the tree traversal below is being measured.
doc = publish_doctree(RST)

for _ in range(300):
    v = CountVisitor(doc)
    doc.walkabout(v)
print(v.count)

The 3x gap decomposes as 2.24x more instructions × 1.28x more cycles per instruction (lower IPC), and there's no tractable path to significantly reduce either:

Instructions: requires fewer C function calls — a major structural change to the 7-level dispatch chain
IPC: driven by i-cache pressure (29.24% frontend stalls vs CPython's 14.44%) — not addressable by data layout tweaks
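
As a quick sanity check on that decomposition, here is a minimal sketch of the arithmetic. It assumes the "1.28x IPC" figure means PyPy needs 1.28x as many cycles per instruction, and it evaluates both instruction ratios quoted in this write-up (2.24x and 2.44x). The function name is only for illustration.

```python
# Slowdown = (instruction-count ratio) x (cycles-per-instruction ratio),
# assuming both runs execute at the same clock frequency.
CPI_RATIO = 1.28  # quoted above as "1.28x IPC"


def slowdown(instruction_ratio, cpi_ratio=CPI_RATIO):
    """Combined wall-clock ratio implied by the two factors."""
    return instruction_ratio * cpi_ratio


# Both instruction ratios appear in this analysis, so evaluate each.
for instruction_ratio in (2.24, 2.44):
    combined = slowdown(instruction_ratio)
    print(f"{instruction_ratio:.2f} x {CPI_RATIO:.2f} = {combined:.2f}")
```

Either figure lands close to the observed 3x, so the decomposition is consistent with the measurement whichever instruction ratio is the accurate one.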

Digging into the instruction counts: the 2.44x instruction gap breaks down into a few big buckets, not one dominant one:

Category	Excess instructions (PyPy over CPython)	Ratio (PyPy/CPython)
Type/attr lookup	~1.26B	4.8x
String/UTF-8	~0.77B	2.3x
Call overhead	~0.58B	3.5x
Stack check+TLS	~0.47B	∞
Frame alloc	~0.35B	3.6x
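
To make "no dominant bucket" concrete, the sketch below totals the five rows and prints each one's share. It only covers the buckets listed in the table, so the shares are relative to those rows, not to the whole instruction gap.

```python
# Excess instructions per category, in billions, copied from the table above.
excess_billion = {
    "Type/attr lookup": 1.26,
    "String/UTF-8": 0.77,
    "Call overhead": 0.58,
    "Stack check+TLS": 0.47,
    "Frame alloc": 0.35,
}

total = sum(excess_billion.values())
shares = {name: value / total for name, value in excess_billion.items()}

print(f"listed excess: {total:.2f}B instructions")
for name, share in shares.items():
    print(f"{name:<18} {share:6.1%}")
```

Even the largest row accounts for well under half of the listed excess, which is why no single fix can close the gap.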

There's no single silver bullet. But there are three directions that could give 5-10%+ each and compound:

1. Arguments__match_signature fast path: it's 1.94% of ALL instructions (second biggest single hotspot) and fires on every Python call, including trivial def visit_X(self, node) methods. An "all positional, no defaults, exact argcount" flag computed at function definition time would let the common case skip all the matching logic. Contained change in argument.py.
2. Method cache investigation: W_TypeObject_lookup_where_with_method_cache is 4.16% of all instructions, 4.8x CPython. That's the single biggest function. The docutils benchmark uses 20+ node types with getattr(visitor, 'visit_' + classname). The question is whether the cache is missing (size/hash issue) or whether the cache hit path itself is slow. Which one it is determines the fix.
3. UTF-8 codepoints_in_utf8 at 4.75%: 'visit_' + classname creates a new string object each call. If PyPy's string concatenation doesn't propagate the _length field, every use as a dict key (for __dict__ lookup) recomputes the codepoint count via an O(n) scan. There may be a fast path for ASCII-only strings.
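
The codepoint-count idea in the last direction can be illustrated with a toy model. This is a sketch in plain Python, not PyPy's actual string implementation: the Utf8Str class, its concat method and the scans counter are hypothetical names used only to show how propagating a cached length through concatenation avoids the O(n) scan, with an ASCII shortcut for the case where a scan is still needed.

```python
class Utf8Str:
    """Toy UTF-8 string that caches its codepoint count (hypothetical model)."""

    scans = 0  # number of times the length had to be computed from the bytes

    def __init__(self, utf8, length=None):
        self.utf8 = utf8          # encoded bytes
        self._length = length     # cached codepoint count, or None if unknown

    def length(self):
        if self._length is None:
            self._length = count_codepoints(self.utf8)
        return self._length

    def concat(self, other):
        # Propagate the cached length when both operands already know theirs.
        if self._length is not None and other._length is not None:
            return Utf8Str(self.utf8 + other.utf8, self._length + other._length)
        return Utf8Str(self.utf8 + other.utf8)


def count_codepoints(data):
    """Count codepoints in UTF-8 bytes, with a shortcut for pure ASCII."""
    Utf8Str.scans += 1
    if data.isascii():
        return len(data)  # one byte per codepoint
    # Every byte that is not a continuation byte (0b10xxxxxx) starts a codepoint.
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


prefix = Utf8Str(b"visit_", 6)
classname = Utf8Str(b"paragraph", 9)
key = prefix.concat(classname)
print(key.length(), Utf8Str.scans)  # length known without any scan
```

Whether PyPy already propagates the length in this way is exactly the open question above; the sketch only shows what the saving would look like if it does not.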

The _match_signature hotspot (1.94%) is partially a red herring — the FLATPYCALL path exists and DOES skip it for direct method calls. The 1.94% is from calls that arrive via a bound Method object (method = getattr(visitor, name); method(node)). Method.__call__ goes call_args → _match_signature and never hits the FLATPYCALL fast path. Fix: add a fast path in Method.__call__ for all-positional args — delegate to w_function.funccall(self.w_instance, *args_w).
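
A rough model of the proposed fast path, written as ordinary Python rather than RPython. The class and method names mirror the ones mentioned above (Method, funccall, call_args), but the bodies are illustrative assumptions, not PyPy's code. The simple_positional flag and slow_calls counter are hypothetical, and inspect.signature stands in for the real signature-matching logic.

```python
import inspect


class Function:
    """Toy function wrapper with a flag precomputed at definition time."""

    def __init__(self, pyfunc):
        self.pyfunc = pyfunc
        params = list(inspect.signature(pyfunc).parameters.values())
        self.argcount = len(params)
        # True when every parameter is plain positional with no default.
        self.simple_positional = all(
            p.kind is inspect.Parameter.POSITIONAL_OR_KEYWORD
            and p.default is inspect.Parameter.empty
            for p in params
        )
        self.slow_calls = 0  # how often the general matching path ran

    def call_args(self, args, kwargs):
        # General path: full signature matching (stand-in for _match_signature).
        self.slow_calls += 1
        bound = inspect.signature(self.pyfunc).bind(*args, **kwargs)
        bound.apply_defaults()
        return self.pyfunc(*bound.args, **bound.kwargs)

    def funccall(self, *args):
        # Fast path: exact positional count, nothing to match.
        if self.simple_positional and len(args) == self.argcount:
            return self.pyfunc(*args)
        return self.call_args(args, {})


class Method:
    """Toy bound method that tries the fast path before the general one."""

    def __init__(self, function, instance):
        self.function = function
        self.instance = instance

    def __call__(self, *args, **kwargs):
        if not kwargs:
            return self.function.funccall(self.instance, *args)
        return self.function.call_args((self.instance,) + args, kwargs)


class Visitor:
    count = 0


def visit_Text(self, node):
    self.count += 1
    return self.count


function = Function(visit_Text)
method = Method(function, Visitor())
method("node")           # positional call: fast path
method(node="node")      # keyword call: general path
print(function.slow_calls)
```

In this model only the keyword call reaches the matching code, which is the behaviour the proposed change would give to getattr-then-call dispatch.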

The method cache hotspot (4.16%) is mostly a string equality cost problem. The cache is 2048 entries and is being hit (no class mutations during traversal), but the hit path does a full == comparison on the name because 'visit_' + classname creates a new (non-interned) string object every call. The hash is already computed to find the cache slot — we could store it in the cache and compare it before doing the full string equality.
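
Here is a small sketch of what "store the hash and compare it first" could look like. It is a toy cache in plain Python under assumed names (MethodCache, full_compares, the slot function), not the structure in typeobject.py, and the version argument stands in for a type's version tag.

```python
class MethodCache:
    """Toy direct-mapped method cache keyed by (version tag, attribute name)."""

    def __init__(self, size=2048):
        assert size & (size - 1) == 0, "size must be a power of two"
        self.mask = size - 1
        self.versions = [None] * size
        self.names = [None] * size
        self.hashes = [0] * size      # full name hash stored next to the name
        self.values = [None] * size
        self.full_compares = 0        # how many full string comparisons ran

    def _slot(self, version, name_hash):
        return (hash(version) ^ name_hash) & self.mask

    def store(self, version, name, value):
        name_hash = hash(name)
        i = self._slot(version, name_hash)
        self.versions[i] = version
        self.names[i] = name
        self.hashes[i] = name_hash
        self.values[i] = value

    def lookup(self, version, name):
        name_hash = hash(name)  # already needed to find the slot
        i = self._slot(version, name_hash)
        if self.versions[i] is not version:
            return None
        cached_name = self.names[i]
        if cached_name is name:          # same object: no comparison needed
            return self.values[i]
        if self.hashes[i] != name_hash:  # cheap rejection of a colliding name
            return None
        self.full_compares += 1          # hashes agree: confirm with ==
        return self.values[i] if cached_name == name else None


version = object()
cache = MethodCache()
cache.store(version, "visit_paragraph", "handler")

classname = "paragraph"
print(cache.lookup(version, "visit_" + classname))  # freshly built key still hits
```

Note the limit of the technique in this sketch: the stored hash rejects a different name in the same slot without touching the string data, but a genuine hit with a freshly built string still pays one full comparison to confirm equality. Skipping that comparison would need interned names or a cache keyed on string identity.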

These two together are tractable, localized changes. The bound-method fast call is maybe 30 lines in function.py. The hash-in-cache change is maybe 20 lines in typeobject.py. Together they'd probably save 3-5% of total instructions.

mattip/analysis.md
