Updates #2

Merged: jgarzik merged 5 commits into main from updates on Feb 22, 2026
Conversation

jgarzik (Owner) commented Feb 22, 2026

No description provided.

jgarzik and others added 5 commits February 22, 2026 01:49
- set_find_slot: equality result in eax was clobbered by pop rax,
  causing false matches (e.g. {-2} contained -1). Save result in edi
  before restoring stack.
- BINARY_OP inplace float fallback: TAG_FLOAT lacks TAG_RC_BIT so
  inplace ops (x -= 0.5) fell through to dunder lookup. Add float
  coercion check in the inplace→non-inplace fallback path.
- min/max: rewrite as shared DRY minmax_impl supporting both
  multi-arg (min(a,b,c)) and single-iterable (min(list)) forms.
  Previous code only handled multi-arg, returning the list itself
  for min([1,2,3]).
- sum: dispatch to float_add when either operand is TAG_FLOAT,
  fixing crash on sum([1.0, 2.0, 3.0]).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
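The four fixes above correspond to observable Python-level behaviors; they can be sanity-checked against CPython's reference semantics with a quick script (a hypothetical check, not part of this PR's test suite):

```python
# Semantic checks mirroring the four fixes (CPython reference behavior).

# set_find_slot: distinct negative ints must not produce false matches
assert -1 not in {-2}

# inplace float fallback: x -= 0.5 must coerce to float subtraction,
# not fall through to a failing dunder lookup
x = 1.0
x -= 0.5
assert x == 0.5

# min/max: the single-iterable form must reduce the iterable,
# not return the list object itself
assert min([1, 2, 3]) == 1
assert max([1, 2, 3]) == 3
assert min(4, 2, 7) == 2

# sum: float operands must dispatch to float addition
assert sum([1.0, 2.0, 3.0]) == 6.0
```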
- COMPARE_OP_INT now pushes TAG_BOOL instead of heap bool (no INCREF/DECREF)
- POP_JUMP_IF_FALSE/TRUE inline TAG_BOOL fast path (skip obj_is_true call)
- Fused COMPARE_OP_INT_JUMP_FALSE/TRUE superinstructions (opcodes 215/216)
- Specialized float binops: ADD/SUB/MUL/TRUEDIV (opcodes 217-220)
- Specialized SmallInt multiply (opcode 221) with imul+overflow check
- Float truediv includes zero-divisor guard that deopts to generic path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
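The compare-jump fusion can be sketched with a toy stack-machine loop in Python (opcode names and numbering here are illustrative only, not the interpreter's actual 215/216 encoding): the fused superinstruction branches on the comparison directly instead of materializing a boolean, pushing it, and popping it again.

```python
# Toy dispatch loop contrasting separate compare+branch with a fused
# superinstruction. Opcode values and names are illustrative.
CMP_LT, JUMP_IF_FALSE, CMP_LT_JUMP_FALSE, PUSH, HALT = range(5)

def run(code, stack):
    pc = 0
    while pc < len(code):
        op, arg = code[pc]
        if op == PUSH:
            stack.append(arg); pc += 1
        elif op == CMP_LT:                     # generic: materialize a bool...
            b, a = stack.pop(), stack.pop()
            stack.append(a < b); pc += 1
        elif op == JUMP_IF_FALSE:              # ...then pop and test it
            pc = arg if not stack.pop() else pc + 1
        elif op == CMP_LT_JUMP_FALSE:          # fused: compare and branch directly,
            b, a = stack.pop(), stack.pop()    # no intermediate bool on the stack
            pc = arg if not (a < b) else pc + 1
        elif op == HALT:
            return stack

# Both encodings take the same branches:
slow = [(PUSH, 2), (PUSH, 1), (CMP_LT, None), (JUMP_IF_FALSE, 5),
        (PUSH, 99), (HALT, None)]
fast = [(PUSH, 2), (PUSH, 1), (CMP_LT_JUMP_FALSE, 4), (PUSH, 99), (HALT, None)]
assert run(slow, []) == run(fast, []) == []
```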
Add ap_memmove (forward rep movsq, backward manual qword loop) and
replace 3 identical element-by-element shift loops with single calls.
Eliminates redundant ob_item reload per iteration and leverages
hardware-optimized string ops for the forward (pop/remove) case.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
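Why the copy direction matters for overlapping shifts can be shown with a small Python model of ap_memmove over a shared buffer (a sketch of the overlap rule, not the assembly routine):

```python
def memmove(buf, dst, src, n):
    """Overlap-safe element copy within one list, modeling ap_memmove:
    copy forward when shifting left, backward when shifting right."""
    if dst < src:                      # forward copy (pop/remove shifts left)
        for i in range(n):
            buf[dst + i] = buf[src + i]
    else:                              # backward copy (insert shifts right)
        for i in range(n - 1, -1, -1):
            buf[dst + i] = buf[src + i]

buf = [0, 1, 2, 3, 4]
memmove(buf, 1, 0, 4)                  # shift right by one, as list.insert does
assert buf == [0, 0, 1, 2, 3]

buf = [0, 1, 2, 3, 4]
memmove(buf, 0, 1, 4)                  # shift left by one, as list.pop(0) does
assert buf == [1, 2, 3, 4, 4]
```

Copying in the wrong direction over an overlap would read elements already overwritten, which is why the forward case can use rep movsq while the backward case needs the manual descending loop.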
…e zeroing optimization

Phase 1: Slice object freelist pool (16 entries) eliminates malloc/free
churn for ephemeral slices in tight loops. list_ass_subscript shift
loops replaced with ap_memmove. list_getslice step=1 uses ap_memcpy
for contiguous copies instead of per-element imul loop.

Phase 2: SmallInt floor division fast path in op_binary_op with
specialization to opcode 222 (op_binary_op_floordiv_int), bypassing
the generic 8-comparison dispatch chain.

Phase 3: Unrolled zeroing for frames with 1-4 locals avoids rep stosq
startup penalty for small functions (common case).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
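The Phase 1 freelist idea can be sketched in Python (pool size, class, and names here are illustrative; the real implementation lives in the assembly sources): allocation pops a recycled object from a small pool when one is available, and deallocation pushes back into the pool instead of freeing.

```python
# Sketch of a fixed-size freelist pool, as used for slices and small tuples.
# POOL_MAX and the Slice class are illustrative, not the interpreter's layout.
POOL_MAX = 16
_pool = []

class Slice:
    __slots__ = ("start", "stop", "step")

def slice_new(start, stop, step):
    obj = _pool.pop() if _pool else Slice()    # reuse a pooled object if any
    obj.start, obj.stop, obj.step = start, stop, step
    return obj

def slice_free(obj):
    if len(_pool) < POOL_MAX:                  # return to pool instead of freeing
        _pool.append(obj)

s = slice_new(0, 10, 1)
slice_free(s)
t = slice_new(5, 9, 2)
assert t is s                                  # the allocation was recycled
```

Ephemeral slices in a tight loop thus cycle through the same few objects rather than hitting the allocator on every subscript.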
…mallInt

- Tuple pool: freelist for sizes 1-3 (max 16 each), avoiding gc_alloc/gc_dealloc
- List header pool: freelist for 40-byte PyListObject headers (max 16)
- float_pow: sqrtsd fast path for **0.5 (~12 vs ~100+ cycles), mulsd for **2.0
- enumerate_iternext: inline SmallInt construction, removing int_from_i64 call

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
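The two special-cased exponents are exactly the ones where a single instruction can replace the generic pow call; a quick Python check of the intended equivalences (illustrative, and assuming a correctly rounded libm pow on the host):

```python
import math

for x in (2.0, 10.0, 123.456):
    # sqrtsd computes the IEEE-754 correctly rounded square root,
    # matching pow(x, 0.5) for ordinary positive floats
    assert x ** 0.5 == math.sqrt(x)
    # a single mulsd matches pow(x, 2.0): both round the exact square once
    assert x ** 2.0 == x * x
```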
@jgarzik requested a review from Copilot February 22, 2026 04:11
@jgarzik self-assigned this Feb 22, 2026
@jgarzik added the bug (Something isn't working) and enhancement (New feature or request) labels Feb 22, 2026
@jgarzik merged commit 56786f4 into main Feb 22, 2026
3 checks passed
@jgarzik deleted the updates branch February 22, 2026 04:14
Copilot AI left a comment

Pull request overview

This pull request implements a comprehensive set of performance optimizations for a Python interpreter written in x86-64 assembly. The changes focus on reducing memory allocation overhead through object pooling, optimizing memory operations with specialized routines, and adding fast paths for common arithmetic and comparison operations.

Changes:

  • Object pooling for frequently allocated types (tuples size 1-3, slices, and list headers) to reduce GC pressure
  • Memory operation optimizations: new ap_memmove implementation and replacement of manual shift loops with bulk memory operations
  • Arithmetic fast paths: specialized opcodes for int/float operations (add, sub, mul, div) and float power optimization for common exponents (0.5, 2.0)
  • Control flow optimizations: superinstructions fusing COMPARE_OP_INT with POP_JUMP_IF_FALSE/TRUE, and fast path for TAG_BOOL in conditional jumps
  • Iterator optimizations: inline SmallInt creation in enumerate, float handling in sum(), and unified min/max implementation with iterator support

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/test_list_insert_pop.py | New test file for list insert/pop/remove operations |
| src/pyo/tuple.asm | Implements object pooling for 1-3 element tuples |
| src/pyo/slice.asm | Implements object pooling for slice objects |
| src/pyo/list.asm | Implements list header pooling and memcpy/memmove optimizations |
| src/pyo/set.asm | Fixes register preservation in set_find_slot |
| src/pyo/float.asm | Adds fast paths for pow(x, 0.5) and pow(x, 2.0) |
| src/opcodes_misc.asm | Adds specialized opcodes for arithmetic ops and fused compare-jump |
| src/eval.asm | Updates opcode dispatch table with new specialized opcodes |
| src/methods.asm | Replaces manual shift loops with ap_memmove in list methods |
| src/lib/memops.asm | Implements ap_memmove for overlapping memory regions |
| src/frame.asm | Unrolls zero-fill loops for small frames (1-4 slots) |
| src/itertools.asm | Inlines SmallInt creation in enumerate iterator |
| src/builtins_extra.asm | Adds float support to sum(), refactors min/max with shared implementation |


Comment on lines +98 to +101
```asm
; Only gc_track if freshly allocated (pooled tuples are already tracked)
; Check: if tuple came from pool, ob_type is already set from previous use
; For fresh alloc, gc_alloc sets ob_type. We can skip gc_track for pooled.
; Pooled tuples were gc_untracked in dealloc, so we must gc_track them again.
```
Copilot AI Feb 22, 2026

The comment at lines 98-101 is contradictory. It states "Only gc_track if freshly allocated (pooled tuples are already tracked)" but then says "Pooled tuples were gc_untracked in dealloc, so we must gc_track them again." The code calls gc_track unconditionally at line 103, which is correct since pooled tuples ARE untracked in dealloc (line 224). However, the comment should be clarified to avoid confusion.

Suggested change

```diff
- ; Only gc_track if freshly allocated (pooled tuples are already tracked)
- ; Check: if tuple came from pool, ob_type is already set from previous use
- ; For fresh alloc, gc_alloc sets ob_type. We can skip gc_track for pooled.
- ; Pooled tuples were gc_untracked in dealloc, so we must gc_track them again.
+ ; Ensure the tuple is GC-tracked regardless of origin:
+ ; - Fresh tuples from gc_alloc start out untracked.
+ ; - Pooled tuples were gc_untracked in dealloc before being put back.
+ ; Therefore every tuple returned from tuple_new must be gc_track'ed here.
```

Comment on lines +137 to +138
```asm
lea rdi, [rbx - GC_HEAD_SIZE]
call ap_free
```
Copilot AI Feb 22, 2026

When freeing a slice that was allocated with gc_alloc (at line 69), the code should use gc_dealloc instead of manually adjusting the pointer and calling ap_free. The current approach at lines 137-138 bypasses the GC deallocation logic and directly frees memory, which may cause issues with GC bookkeeping. Change to call gc_dealloc(rbx) instead.

Suggested change

```diff
- lea rdi, [rbx - GC_HEAD_SIZE]
- call ap_free
+ mov rdi, rbx
+ call gc_dealloc
```

Comment on lines +2376 to +2377
```asm
INCREF_VAL r14, rdx
DECREF_VAL rax, rdx ; DECREF iternext result
```
Copilot AI Feb 22, 2026

At lines 2376-2377, the code INCREFs the first element and then immediately DECREFs it. Since iternext returns a new reference that we're taking ownership of, these operations cancel out and are redundant. Consider removing both the INCREF_VAL and DECREF_VAL lines to avoid unnecessary reference count manipulation.

Suggested change

```diff
- INCREF_VAL r14, rdx
- DECREF_VAL rax, rdx ; DECREF iternext result
```
