Conversation
- set_find_slot: equality result in eax was clobbered by pop rax,
causing false matches (e.g. {-2} contained -1). Save result in edi
before restoring stack.
- BINARY_OP inplace float fallback: TAG_FLOAT lacks TAG_RC_BIT so
inplace ops (x -= 0.5) fell through to dunder lookup. Add float
coercion check in the inplace→non-inplace fallback path.
- min/max: rewrite as shared DRY minmax_impl supporting both
multi-arg (min(a,b,c)) and single-iterable (min(list)) forms.
Previous code only handled multi-arg, returning the list itself
for min([1,2,3]).
- sum: dispatch to float_add when either operand is TAG_FLOAT,
fixing crash on sum([1.0, 2.0, 3.0]).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
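The tag-based dispatch behind the sum() fix can be pictured with a small C sketch. The struct-based tag layout here is illustrative only; the interpreter's real TAG_* encoding packs tags into the 64-bit word itself.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative tagged-value scheme; the interpreter's actual TAG_* encoding
   stores tags inside the word, which this struct only approximates. */
typedef enum { TAG_INT, TAG_FLOAT } Tag;
typedef struct { Tag tag; int64_t i; double f; } Val;

static Val make_int(int64_t i)  { Val v = {TAG_INT, i, 0.0};  return v; }
static Val make_float(double f) { Val v = {TAG_FLOAT, 0, f};  return v; }
static double as_double(Val v)  { return v.tag == TAG_FLOAT ? v.f : (double)v.i; }

/* sum()'s accumulation step: if EITHER operand is a float, dispatch to
   float addition (float_add in the asm); otherwise stay on the SmallInt path. */
static Val add_dispatch(Val a, Val b) {
    if (a.tag == TAG_FLOAT || b.tag == TAG_FLOAT)
        return make_float(as_double(a) + as_double(b));
    return make_int(a.i + b.i);
}
```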
- COMPARE_OP_INT now pushes TAG_BOOL instead of heap bool (no INCREF/DECREF)
- POP_JUMP_IF_FALSE/TRUE inline TAG_BOOL fast path (skip obj_is_true call)
- Fused COMPARE_OP_INT_JUMP_FALSE/TRUE superinstructions (opcodes 215/216)
- Specialized float binops: ADD/SUB/MUL/TRUEDIV (opcodes 217-220)
- Specialized SmallInt multiply (opcode 221) with imul+overflow check
- Float truediv includes zero-divisor guard that deopts to generic path
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
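The payoff of fusing a compare with the following conditional jump can be sketched as a tiny switch-based stack machine in C. Opcode names and the program encoding are hypothetical; the point is that the fused case does the comparison and the branch in one dispatch, without materializing a bool on the stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes; the real interpreter assigns 215/216 to the fused forms. */
enum { OP_PUSH, OP_CMP_LT, OP_JUMP_IF_FALSE, OP_CMP_LT_JUMP_FALSE, OP_HALT };

/* Tiny stack machine. The fused superinstruction replaces the two-opcode
   sequence CMP_LT; JUMP_IF_FALSE: one dispatch, no bool pushed and re-popped. */
static int64_t run(const int64_t *code) {
    int64_t stack[16]; int sp = 0; size_t pc = 0;
    for (;;) {
        switch (code[pc]) {
        case OP_PUSH: stack[sp++] = code[pc + 1]; pc += 2; break;
        case OP_CMP_LT: { int64_t b = stack[--sp], a = stack[--sp];
                          stack[sp++] = (a < b); pc += 1; break; }
        case OP_JUMP_IF_FALSE:
            pc = stack[--sp] ? pc + 2 : (size_t)code[pc + 1]; break;
        case OP_CMP_LT_JUMP_FALSE: {          /* fused: compare AND branch */
            int64_t b = stack[--sp], a = stack[--sp];
            pc = (a < b) ? pc + 2 : (size_t)code[pc + 1]; break; }
        case OP_HALT: return stack[sp - 1];
        }
    }
}
```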
Add ap_memmove (forward rep movsq, backward manual qword loop) and replace 3 identical element-by-element shift loops with single calls. Eliminates redundant ob_item reload per iteration and leverages hardware-optimized string ops for the forward (pop/remove) case. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
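The direction choice in ap_memmove comes down to overlap: a low-to-high copy is safe when the destination is below the source (the pop/remove shift, where the asm can stream with rep movsq), and a high-to-low copy is required when it is above (the insert shift). A minimal C sketch of that rule, with an ordinary loop standing in for the string ops:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Direction-aware qword move, mirroring the ap_memmove strategy: forward
   when dst < src (overlap-safe, and where "rep movsq"-style streaming wins),
   backward when dst > src, so overlapping regions are never corrupted. */
static void qword_move(uint64_t *dst, const uint64_t *src, size_t n) {
    if (dst == src || n == 0) return;
    if (dst < src) {
        for (size_t i = 0; i < n; i++) dst[i] = src[i];   /* forward  */
    } else {
        for (size_t i = n; i-- > 0; ) dst[i] = src[i];    /* backward */
    }
}
```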
…e zeroing optimization
Phase 1: Slice object freelist pool (16 entries) eliminates malloc/free churn for ephemeral slices in tight loops. list_ass_subscript shift loops replaced with ap_memmove. list_getslice step=1 uses ap_memcpy for contiguous copies instead of per-element imul loop.
Phase 2: SmallInt floor division fast path in op_binary_op with specialization to opcode 222 (op_binary_op_floordiv_int), bypassing the generic 8-comparison dispatch chain.
Phase 3: Unrolled zeroing for frames with 1-4 locals avoids rep stosq startup penalty for small functions (common case).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
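The freelist idea behind the 16-entry slice pool is simple enough to show in C: dealloc pushes onto a small stack instead of freeing, alloc pops before falling back to the allocator. The names below are illustrative, not the interpreter's actual symbols.

```c
#include <assert.h>
#include <stdlib.h>

/* Fixed-capacity freelist pool, in the spirit of the slice pool above:
   hot loops that create and destroy ephemeral objects hit the pool and
   never touch malloc/free after warm-up. */
#define POOL_CAP 16

typedef struct Slice { long start, stop, step; } Slice;

static Slice *pool[POOL_CAP];
static int    pool_len = 0;

static Slice *slice_alloc(void) {
    if (pool_len > 0) return pool[--pool_len];   /* reuse: no malloc */
    return malloc(sizeof(Slice));                /* cold path */
}

static void slice_dealloc(Slice *s) {
    if (pool_len < POOL_CAP) { pool[pool_len++] = s; return; }  /* pool it */
    free(s);                                     /* pool full: real free */
}
```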
…mallInt
- Tuple pool: freelist for sizes 1-3 (max 16 each), avoiding gc_alloc/gc_dealloc
- List header pool: freelist for 40-byte PyListObject headers (max 16)
- float_pow: sqrtsd fast path for **0.5 (~12 vs ~100+ cycles), mulsd for **2.0
- enumerate_iternext: inline SmallInt construction, removing int_from_i64 call
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
This pull request implements a comprehensive set of performance optimizations for a Python interpreter written in x86-64 assembly. The changes focus on reducing memory allocation overhead through object pooling, optimizing memory operations with specialized routines, and adding fast paths for common arithmetic and comparison operations.
Changes:
- Object pooling for frequently allocated types (tuples size 1-3, slices, and list headers) to reduce GC pressure
- Memory operation optimizations: new ap_memmove implementation and replacement of manual shift loops with bulk memory operations
- Arithmetic fast paths: specialized opcodes for int/float operations (add, sub, mul, div) and float power optimization for common exponents (0.5, 2.0)
- Control flow optimizations: superinstructions fusing COMPARE_OP_INT with POP_JUMP_IF_FALSE/TRUE, and fast path for TAG_BOOL in conditional jumps
- Iterator optimizations: inline SmallInt creation in enumerate, float handling in sum(), and unified min/max implementation with iterator support
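The unified min/max change in the last bullet boils down to one shared reducer whose comparison is flipped by a flag, with an argument-form check in front: a single argument is unpacked as an iterable, several arguments are reduced directly (the old code skipped the unpacking, so min([1,2,3]) returned the list itself). A C sketch of the shared reducer, with plain longs standing in for the interpreter's tagged values:

```c
#include <assert.h>
#include <stddef.h>

/* Shared body behind both builtins, in the spirit of minmax_impl: `is_max`
   flips the comparison, so min and max need no duplicated loop. In the
   interpreter, the single-argument form first drains the iterable into this
   same reduction; here the caller passes the elements directly. */
static long reduce_minmax(const long *items, size_t n, int is_max) {
    long best = items[0];
    for (size_t i = 1; i < n; i++)
        if (is_max ? items[i] > best : items[i] < best)
            best = items[i];
    return best;
}
```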
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/test_list_insert_pop.py | New test file for list insert/pop/remove operations |
| src/pyo/tuple.asm | Implements object pooling for 1-3 element tuples |
| src/pyo/slice.asm | Implements object pooling for slice objects |
| src/pyo/list.asm | Implements list header pooling and memcpy/memmove optimizations |
| src/pyo/set.asm | Fixes register preservation in set_find_slot |
| src/pyo/float.asm | Adds fast paths for pow(x, 0.5) and pow(x, 2.0) |
| src/opcodes_misc.asm | Adds specialized opcodes for arithmetic ops and fused compare-jump |
| src/eval.asm | Updates opcode dispatch table with new specialized opcodes |
| src/methods.asm | Replaces manual shift loops with ap_memmove in list methods |
| src/lib/memops.asm | Implements ap_memmove for overlapping memory regions |
| src/frame.asm | Unrolls zero-fill loops for small frames (1-4 slots) |
| src/itertools.asm | Inlines SmallInt creation in enumerate iterator |
| src/builtins_extra.asm | Adds float support to sum(), refactors min/max with shared implementation |
    ; Only gc_track if freshly allocated (pooled tuples are already tracked)
    ; Check: if tuple came from pool, ob_type is already set from previous use
    ; For fresh alloc, gc_alloc sets ob_type. We can skip gc_track for pooled.
    ; Pooled tuples were gc_untracked in dealloc, so we must gc_track them again.
The comment at lines 98-101 is contradictory. It states "Only gc_track if freshly allocated (pooled tuples are already tracked)" but then says "Pooled tuples were gc_untracked in dealloc, so we must gc_track them again." The code calls gc_track unconditionally at line 103, which is correct since pooled tuples ARE untracked in dealloc (line 224). However, the comment should be clarified to avoid confusion.
Suggested change:
-    ; Only gc_track if freshly allocated (pooled tuples are already tracked)
-    ; Check: if tuple came from pool, ob_type is already set from previous use
-    ; For fresh alloc, gc_alloc sets ob_type. We can skip gc_track for pooled.
-    ; Pooled tuples were gc_untracked in dealloc, so we must gc_track them again.
+    ; Ensure the tuple is GC-tracked regardless of origin:
+    ; - Fresh tuples from gc_alloc start out untracked.
+    ; - Pooled tuples were gc_untracked in dealloc before being put back.
+    ; Therefore every tuple returned from tuple_new must be gc_track'ed here.
    lea rdi, [rbx - GC_HEAD_SIZE]
    call ap_free
When freeing a slice that was allocated with gc_alloc (at line 69), the code should use gc_dealloc instead of manually adjusting the pointer and calling ap_free. The current approach at lines 137-138 bypasses the GC deallocation logic and directly frees memory, which may cause issues with GC bookkeeping. Change to call gc_dealloc(rbx) instead.
Suggested change:
-    lea rdi, [rbx - GC_HEAD_SIZE]
-    call ap_free
+    mov rdi, rbx
+    call gc_dealloc
    INCREF_VAL r14, rdx
    DECREF_VAL rax, rdx    ; DECREF iternext result
At lines 2376-2377, the code INCREFs the first element and then immediately DECREFs it. Since iternext returns a new reference that we're taking ownership of, these operations cancel out and are redundant. Consider removing both the INCREF_VAL and DECREF_VAL lines to avoid unnecessary reference count manipulation.
Suggested change (delete both lines):
-    INCREF_VAL r14, rdx
-    DECREF_VAL rax, rdx    ; DECREF iternext result