Skip to content

Conversation

@Winter-Soren
Copy link
Contributor

What was wrong?

Demonstrating an example of NAT traversal using Circuit Relay v2, DCUtR hole punching, and AutoNAT was missing from the codebase.

Issue #870

How was it fixed?

Implemented a comprehensive NAT traversal example that demonstrates the complete flow:

  1. A publicly reachable relay node running Circuit Relay v2
  2. A listener node behind NAT that advertises via relay and supports DCUtR
  3. A dialer node that connects via relay and upgrades to direct connection via hole punching

The implementation includes three components:

  • relay.py: Publicly accessible relay node with resource management
  • listener.py: NAT'd peer that accepts connections via relay and supports direct upgrades
  • dialer.py: NAT'd peer that connects through relay and attempts hole punching
graph TD
    subgraph "Dialer (Behind NAT)"
        A[Dialer Node]
        A1[AutoNAT]
        A2[Circuit Relay CLIENT Role]
        A3[DCUtR Protocol]
    end
    
    subgraph "Public Relay"
        B[Relay Node]
        B1[Circuit Relay HOP Role]
        B2[Resource Management]
    end
    
    subgraph "Listener (Behind NAT)"
        C[Listener Node]
        C1[AutoNAT]
        C2[Circuit Relay STOP Role]
        C3[DCUtR Protocol]
    end
    
    A -->|"1. Connect via relay"| B
    B -->|"2. Relay connection"| C
    A -->|"3. DCUtR hole punching"| C
    
    %% Component connections
    A --- A1
    A --- A2
    A --- A3
    B --- B1
    B --- B2
    C --- C1
    C --- C2
    C --- C3
    
    class A,C natNode
    class B publicNode
Loading

To-Do

  • Clean up commit history
  • Add or update documentation related to these changes
  • Add entry to the release notes

@seetadev
Copy link
Contributor

seetadev commented Oct 8, 2025

@Winter-Soren : Appreciate your great efforts, Soham.

Re-ran the CI/CD pipeline. There are a couple of issues, which need resolution.

We will need a proper screencast demonstrating the example along with proper documentation to enable hole punching interop testing as the next step.

CCing @acul71, who could help you on the hole punching interop testing front using the example as a starting point.

@seetadev
Copy link
Contributor

seetadev commented Dec 8, 2025

@Winter-Soren : Great progress, Soham. Appreciate your support and efforts.

Wish if you could resolve the CI/CD issues.

Looking forward to seeing this PR landed before the upcoming release.

@seetadev seetadev marked this pull request as ready for review December 8, 2025 21:10
Winter-Soren and others added 4 commits December 10, 2025 01:14
- Fix AutoNATService type mismatch by casting IHost to BasicHost
- Add type: ignore comments for raw_conn attribute access (implementation-specific)
- Fix stream handler type mismatch in listener.py with proper casting
- Fix all line length violations (E501) by breaking long lines
- Remove unused variable listener_peer_info
- Fix docstring format (D301) by using raw string for backslashes
- Fix DCUtR test failures by restoring cache trust in _have_direct_connection
- Add examples.nat to documentation toctree to fix build warning

All linting, type checking, tests, and documentation builds now pass.
@acul71
Copy link
Contributor

acul71 commented Dec 14, 2025

Hello @Winter-Soren and @seetadev

I've fixed all the linting, type checking, test failures, and documentation issues identified in the review. Here's a summary of the changes:

Fixed Issues

1. Type Checking Errors

AutoNATService type mismatch:

# examples/nat/dialer.py and listener.py
from typing import cast
from libp2p.host.basic_host import BasicHost

# new_host() returns IHost but always returns BasicHost or RoutedHost
autonat_service = AutoNATService(cast(BasicHost, host))

Missing raw_conn attribute:

# Added type: ignore comments for implementation-specific attributes
conn = stream.muxed_conn.raw_conn  # type: ignore[attr-defined]

Stream handler type mismatch:

# examples/nat/listener.py
from libp2p.network.stream.net_stream import NetStream

# Wrap bound method to match expected function signature
host.set_stream_handler(
    AUTONAT_PROTOCOL_ID,
    lambda stream: autonat_service.handle_stream(cast(NetStream, stream)),
)

2. Linting Errors

  • Fixed 30+ line length violations (E501) by breaking long lines appropriately
  • Removed unused variable listener_peer_info
  • Fixed docstring format (D301) by using raw string for backslashes

3. Test Failures

DCUtR test failures fixed:
The issue was that _have_direct_connection was trying to verify connections even when the peer was in the cache. In test scenarios with mocks, there are no real connections to verify, causing the cache entry to be removed and the method to return False.

Fix in libp2p/relay/circuit_v2/dcutr.py:

async def _have_direct_connection(self, peer_id: ID) -> bool:
    """
    Check if we already have a direct connection to a peer.
    """
    # Check our direct connections cache first
    # Trust the cache for fast path - this allows early return when we know
    # we have a direct connection. Verification happens when connections are
    # established (in _dial_peer) to ensure cache accuracy.
    if peer_id in self._direct_connections:
        return True

    # Check if the peer is connected and verify it's direct
    return await self._verify_direct_connection(peer_id)

This restores the original behavior where the cache is trusted for the fast path, while verification still happens when connections are established to ensure cache accuracy.

4. Documentation

Added examples.nat to the toctree in docs/examples.rst to fix the build warning.

Test Results

✅ All tests passing: 1806 passed, 13 skipped
✅ Linting: All checks passing
✅ Type checking: All errors resolved
✅ Documentation: Build successful with no warnings

Question

The two previously failing DCUtR tests (test_dcutr_max_attempts_and_already_connected and test_dcutr_relay_attempt_limiting) are now passing. The fix restores the original behavior where _have_direct_connection trusts the cache for the fast path, while verification still happens when connections are established (in _dial_peer).

Question: Is this the expected behavior? The verification logic in _verify_direct_connection is still used when connections are established, so the cache should remain accurate. However, I want to confirm that trusting the cache for the fast path (without verification) is the intended design, especially for the early return case in initiate_hole_punch.

@seetadev
Copy link
Contributor

@Winter-Soren , @acul71 : Great, this PR is indeed ready for review + merge.

Will do a final review before merging.

Copy link
Member

@pacrob pacrob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just please add a newsfragment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants